April 30, 2014

Cloudera, Impala, data warehousing and Hive

There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.

- Hive is good at some tasks and terrible at others.
  - Hive is good at batch data transformation.
  - Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
- Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
- Impala is also meant to be good at what Hive is good at, and will someday from Cloudera’s standpoint completely supersede Hive, but Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
- Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
- Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
- There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.

And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.

Related links

A survey of SQL/Hadoop integration (February, 2014)
The cardinal rules of DBMS development (March, 2013)

Categories: Cloudera, Data warehousing, Facebook, Hadoop, SQL/Hadoop integration, Workload management

Comments

4 Responses to “Cloudera, Impala, data warehousing and Hive”

Mark Callaghan on April 30th, 2014 10:55 pm

Does Impala do index nested loops joins with HBase?

Kris Peeters on May 1st, 2014 2:54 am

About Impala performance, we ran our own comparison against Vertica: http://baboonit.be/blog/measuring-vertica-performance-with-tpc-ds

For the results they published, Impala is actually quite fast. They didn’t publish concurrency results.

Marcel Kornacker on May 4th, 2014 9:57 pm

Mark: Impala does not yet support index nested loop joins against HBase, but it’s something we’re looking at. No release date yet, though.

John on May 19th, 2014 10:28 am

There is another SQL implementation on HBase, by the name of Splice Machine. They provide ACID compliance. After downloading the product on a single node and trying some simple tests, I was completely disappointed by what Splice Machine has to offer. First, it is SQL on HBase (not sure why that makes it an RDBMS) or SQL on Hadoop, i.e. there is lots of marketing in their material.
Secondly, the product uses Derby for the front-end SQL layer, and most of the execution is extremely slow, as lots of operations (like joins) happen in the JVMs. SQL isn’t complete; you need an Oracle expert from the early 1980s to make Splice Machine work. Query performance, even on a single node with a small dataset, was below par. Transaction throughput doesn’t keep up with the likes of VoltDB or NoSQL.
I will be downloading Impala and giving it a shot next.
