More on Cloudera Impala
What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
*With no “fat head”.
Impala is of course a young system, and very much a work in progress. It has a variety of limitations in functionality, performance, and so on, many (all?) of which are slated to be addressed down the road. While different individuals may espouse different views at different times, I think it’s not too misleading to summarize Cloudera’s strategic positioning for Impala as:
- A core use case for Hadoop is to process or transform data. SQL can help with that, and hence so can Impala.
- A core use case for Hadoop is machine learning. SQL can help with that, and hence so can Impala.
- Both due to its Hadoop integration and other features, HBase is getting significant usage. You might want to do SQL against your HBase data. Impala can help with that.
- Some enterprises choose to have much larger clusters for Hadoop than they do for their relational DBMS. For them, Impala may give pretty good analytic SQL performance, by throwing hardware at the problem.
Thinking about Impala performance is confusing, on any level of detail beyond:
- Impala is going to be (much) faster than Hive …
- … but slower than a serious and more mature analytic RDBMS.
But let’s try anyway.
As of the initial Impala release(s):
- Impala will run against a variety of storage managers, choices among which will have different performance implications. HDFS (Hadoop Distributed File System) and HBase will both be supported. Multiple HDFS formats will be supported, both row-based and columnar. (See the Trevni comments in my first Impala post.)
- In the simplest of scanning scenarios, Impala can read row-based data at near the theoretically optimum speed, while Hive runs at 1/3 of that.
- Initially, all Impala joins will be (distributed) hash joins. These seem to start at 10X Hive’s performance and go up from there.
- The fastest Impala queries take > 1 second.
- One test showed Impala surviving a load of 100 concurrent queries. Another test showed Impala running 10 cloned copies of a query with 25%ish performance degradation.
- Impala will have MicroStrategy support on Day 1, so it obviously can handle fairly complex SQL. (Also Pentaho, Tableau, and QlikView.)
- Column statistics and the like are under active development, which will help in query optimization. A true cost-based optimizer is, of course, further off.
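To make the "(distributed) hash joins" point above concrete, here is a minimal single-process sketch of the general technique. It is not Impala's actual code; the function names, the `num_nodes` parameter, and the use of Python lists to stand in for nodes are all assumptions made for illustration. Both inputs are partitioned on the join key so that matching rows land on the same node, and each node then builds and probes its own in-memory hash table.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Route each row to a (simulated) node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key]) % num_nodes].append(row)
    return parts


def local_hash_join(build_rows, probe_rows, key):
    """Classic in-memory hash join on one node: build, then probe."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            # Merge the two rows; the join key is the same on both sides.
            joined.append({**match, **row})
    return joined


def distributed_hash_join(left, right, key, num_nodes=4):
    """Hypothetical driver: partition both sides, join each partition.

    A real system would run the per-node joins in parallel on separate
    machines; here they run one after another in a loop.
    """
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_hash_join(left_parts[node], right_parts[node], key)
        )
    return result
```

Because rows with equal keys always hash to the same partition, no node needs to see another node's data once the partitioning step is done, which is what lets the work spread across the cluster.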
Cloudera’s marketing name for Impala will be “Real Time Query”, but that seems a dubious match to early-release Impala reality.
In many cases, the best Impala performance — and indeed the best Hadoop performance overall — will probably come over Trevni, which Cloudera believes will be 30% or so faster than the current columnar option RCFile. This led me to inquire how data would get into Trevni, presuming that it’s initially loaded into some other format. Cloudera is hoping to have a background process for that available Day 1, but I have no details about it. (The other alternative would be to do a batch MapReduce job.) Cloudera also points out that both Flume and HBase can get data into Hadoop with very low latency.
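As a rough illustration of why a columnar format can beat a row-based one for analytic scans, the sketch below counts how many stored values must be read to sum a single column under each layout. It is a toy model, not Trevni or RCFile; the function names and the "values read" counter are assumptions for illustration, and the model says nothing about the 30% figure above.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into one list per column."""
    if not rows:
        return {}
    return {col: [row[col] for row in rows] for col in rows[0]}


def sum_from_rows(rows, col):
    """Row layout: every field of every row is read to reach one column."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row is fetched
        total += row[col]
    return total, values_read


def sum_from_columns(columns, col):
    """Columnar layout: only the requested column is read."""
    values = columns[col]
    return sum(values), len(values)
```

With three columns per row, the row-layout scan reads three times as many values as the columnar scan to produce the same sum; real formats add compression and block layout effects that this toy model ignores.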
Given the obvious potential synergy between Impala — a specialized alternative to MapReduce — and YARN, Cloudera has redoubled its efforts to (help) get YARN up to production quality.
Finally, there’s the question of what Impala actually does. In its initial release, it will support a large, strict subset of Hive functionality. That helps with reusing a lot of Hive infrastructure and connectivity, of course. But it also means that you don’t have real updates; rather, you load in bulk. Similarly, there’s a lot of analytic SQL functionality that’s not directly supported. Down the road, it’s reasonable to expect Impala functionality to extend in (at least) two directions:
- More SQL capability.
- Dremel-like capability to handle nested data structures.
Comments
12 Responses to “More on Cloudera Impala”
Thanks again to you and your audience for very helpful posts and comments, Curt. I also found this historical post and discussion thread useful when I was searching for additional reference material:
http://www.dbms2.com/2010/07/29/how-should-somebody-teach-themselves-programming-skills/
Regards,
Al D.
Do you have any sense of how this will stack up against Apache Drill? It’s clear that Impala is way down the development path in comparison, but I’m wondering whether they will end up in different places.
Patrick,
I don’t know as much about Drill/Dremel as I should. More later.