October 24, 2012

Quick notes on Impala

Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.

In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:

Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
Unlike Hive, Impala does not work through Hadoop MapReduce.
Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
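
To put a rough number on the restart risk in the third point above, here is a minimal back-of-envelope sketch. It assumes node failures are independent and exponentially distributed. The function name, the 100-node cluster and the 3-year mean time between failures are illustrative assumptions, not figures from Cloudera.

```python
import math


def query_failure_probability(nodes: int, query_hours: float, node_mtbf_hours: float) -> float:
    """Chance that at least one of `nodes` fails while a query is running.

    Assumes independent, exponentially distributed node failures, which is
    a simplification rather than a measured property of any real cluster.
    """
    return 1.0 - math.exp(-nodes * query_hours / node_mtbf_hours)


# Hypothetical cluster: 100 nodes, each with a 3-year mean time between failures.
MTBF_HOURS = 3 * 365 * 24

for hours in (0.01, 1.0, 10.0):
    p = query_failure_probability(100, hours, MTBF_HOURS)
    print(f"{hours:>5} h query: {p:.4%} chance of losing a node mid-query")
```

Under these assumptions the risk grows roughly linearly with query duration while it is small, which is why the trade-off mostly matters for extremely long-running queries.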

Beyond that:

Impala is free open-source code.
A column-oriented storage option called Trevni is planned for Impala’s general availability, although it will not be ready in time for the Impala beta. Impala’s best performance will generally come over Trevni.
Trevni has a variety of block-level compression options, with a 64 KB block size. Columnar compression, especially dictionary compression, is a roadmap item.
Support for nested data structures is a roadmap item, both via Trevni and Avro, except that some limited support may be available via Trevni at GA.
It obviously will be quite a while before Impala or Hadapt have cost-based optimizers (as opposed to rule-based/heuristic). My unsubstantiated guess is that this is more of a problem for complex queries than simple ones.
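
For readers unfamiliar with the dictionary compression listed above as a roadmap item, a minimal sketch of the general technique follows. This is generic dictionary encoding of a single column, not Trevni's file format or API. All names and the sample data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def dictionary_encode(column: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Replace each value with an integer code into a list of distinct values."""
    dictionary: List[Hashable] = []
    code_of: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in column:
        if value not in code_of:
            # First time this value is seen: assign it the next free code.
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary: Sequence[Hashable], codes: Sequence[int]) -> List[Hashable]:
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


# Hypothetical low-cardinality column, stored contiguously as in a columnar layout.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, codes = dictionary_encode(states)
print(dictionary, codes)
```

The approach pays off mainly in a column-oriented layout, because the values of one column sit together and low-cardinality columns repeat the same few values many times.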

On the whole, Impala seems less mature or capable than Hadapt. But Impala does have a few countervailing advantages:

It’s one less thing to pay for.
It’s one less thing to administer. (Assuming its integration into Hadoop is tight enough to make that true.)
It could be faster in some use cases (because it will have columnar storage sooner).

Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration


Comments

6 Responses to “Quick notes on Impala”

Notes on Hadoop hardware | DBMS 2 : DataBase Management System Services on October 31st, 2012 3:27 am

[…] talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested […]

Notes and comments — October 31, 2012 | DBMS 2 : DataBase Management System Services on October 31st, 2012 11:37 am

[…] Stay tuned for more on Cloudera Impala. For one thing, I didn’t realize it would run over HBase as well as HDFS right out of the […]

More on Cloudera Impala | DBMS 2 : DataBase Management System Services on November 1st, 2012 7:12 am

[…] What I wrote before about Cloudera Impala was woefully incomplete. After a followup call, I now feel I have a better handle on the whole thing. […]

Do you need an analytic RDBMS? | DBMS 2 : DataBase Management System Services on November 14th, 2012 9:21 pm

[…] analytics. There are many examples in the Hadoop world — including the recent wave of SQL add-ons to Hadoop — and some in the graph area as well. But those choices will rarely suffice for the whole […]

Phillip W Young on May 28th, 2013 11:47 am

Hive is not good for interactive ( < 20 second ) queries, but it has had columnar storage for quite some time. Also, people who are doing large joins in Hive should learn about the various flavors of map-side joins since they are quite fast as they eliminate the reduction (slow) phase

Curt Monash on May 28th, 2013 4:11 pm

I don’t think Hive has true columnar storage; I thought rcfile was more PAX-like.
