October 24, 2012
Quick notes on Impala
Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
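To put a rough shape on the restart-risk point above, here is a minimal sketch. It assumes node failures are independent and exponentially distributed, that any single node failure forces the query to restart from scratch, and that restarting costs nothing beyond the lost work. The cluster size and per-node MTBF are invented for illustration and do not come from Cloudera.

```python
import math


def p_query_hits_failure(nodes: int, query_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one node fails while the query runs.

    Assumes independent, exponentially distributed node failures
    (a modeling assumption, not a measured property of any cluster).
    """
    cluster_failure_rate = nodes / node_mtbf_hours  # failures per hour, cluster-wide
    return 1.0 - math.exp(-cluster_failure_rate * query_hours)


def expected_runtime_with_restarts(nodes: int, query_hours: float, node_mtbf_hours: float) -> float:
    """Expected wall-clock hours to finish when every failure restarts the query.

    Uses the standard result for a task of length T under Poisson failures
    at rate r with no checkpoints: E = (exp(r * T) - 1) / r.
    """
    rate = nodes / node_mtbf_hours
    return (math.exp(rate * query_hours) - 1.0) / rate


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes, each with a 10,000-hour MTBF.
    for hours in (1 / 60, 1.0, 10.0):
        p = p_query_hits_failure(100, hours, 10_000)
        e = expected_runtime_with_restarts(100, hours, 10_000)
        print(f"{hours:6.2f} h query: P(failure) = {p:.4%}, expected runtime = {e:.3f} h")
```

Under these assumptions the risk is negligible for interactive queries and only starts to matter for queries that run for hours, which is consistent with the "extremely long-running" qualifier.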
Beyond that:
- Impala is free open-source code.
- A column-oriented storage option called Trevni is planned for Impala’s general availability, although it won’t be ready in time for the Impala beta. Impala’s best performance will generally come over Trevni.
- Trevni has a variety of block-level compression options, with a 64 KB block size. Columnar compression, especially dictionary compression, is a roadmap item.
- Support for nested data structures is a roadmap item, via both Trevni and Avro, although some limited support may be available via Trevni at GA.
- It obviously will be quite a while before either Impala or Hadapt has a cost-based optimizer (as opposed to a rule-based/heuristic one). My unsubstantiated guess is that this is more of a problem for complex queries than simple ones.
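The difference between block-level compression and columnar dictionary compression, as mentioned in the Trevni notes above, can be illustrated with a toy sketch. This is not Trevni's file format or API. Here `zlib` stands in for a generic block compressor, the column values are invented, and the 64 KB block size assumes the figure above means kilobytes.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumption: "64 Kb" in the notes means 64 kilobytes


def dictionary_encode(column):
    """Columnar dictionary encoding: map each distinct value to a small integer code."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in column:
        code = index.get(value)
        if code is None:
            code = len(dictionary)
            index[value] = code
            dictionary.append(value)
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Inverse of dictionary_encode."""
    return [dictionary[code] for code in codes]


def block_compress(column, block_size=BLOCK_SIZE):
    """Block-level compression: serialize the column, then compress fixed-size chunks.

    The compressor knows nothing about values or columns; it only sees bytes.
    Values must not contain newlines, since newline is the toy separator.
    """
    if not column:
        return []
    raw = "\n".join(column).encode("utf-8")
    return [zlib.compress(raw[i:i + block_size]) for i in range(0, len(raw), block_size)]


def block_decompress(blocks):
    """Inverse of block_compress."""
    if not blocks:
        return []
    raw = b"".join(zlib.decompress(block) for block in blocks)
    return raw.decode("utf-8").split("\n")


if __name__ == "__main__":
    countries = ["US", "DE", "US", "FR", "US"]  # hypothetical low-cardinality column
    dictionary, codes = dictionary_encode(countries)
    print(dictionary, codes)
    print(block_decompress(block_compress(countries)) == countries)
```

The two techniques are complementary rather than exclusive: the integer codes produced by dictionary encoding can themselves be fed through a block compressor.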
On the whole, Impala seems less mature or capable than Hadapt. But Impala does have a few countervailing advantages:
- It’s one less thing to pay for.
- It’s one less thing to administer. (Assuming its integration into Hadoop is tight enough to make that true.)
- It could be faster in some use cases (because it will have columnar storage sooner).
Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration
Comments
6 Responses to “Quick notes on Impala”
[…] talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested […]
[…] Stay tuned for more on Cloudera Impala. For one thing, I didn’t realize it would run over HBase as well as HDFS right out of the […]
[…] What I wrote before about Cloudera Impala was woefully incomplete. After a followup call, I now feel I have a better handle on the whole thing. […]
[…] analytics. There are many examples in the Hadoop world — including the recent wave of SQL add-ons to Hadoop — and some in the graph area as well. But those choices will rarely suffice for the whole […]
Hive is not good for interactive (< 20 second) queries, but it has had columnar storage for quite some time. Also, people who are doing large joins in Hive should learn about the various flavors of map-side joins, which are quite fast because they eliminate the slow reduce phase.
I don’t think Hive has true columnar storage; I thought rcfile was more PAX-like.