April 30, 2014
Cloudera, Impala, data warehousing and Hive
There’s much confusion about Cloudera’s SQL plans and beliefs, and the company has mainly itself to blame. That said, here’s what I think is going on.
- Hive is good at some tasks and terrible at others.
- Hive is good at batch data transformation.
- Hive is bad at ad-hoc query, unless you really, really need Hive’s scale and low license cost. One example, per Eli Collins: Facebook has a 500 petabyte Hive warehouse, but jokes that on a good day an analyst can run 6 queries against it.
- Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.)
- Impala is also meant to be good at what Hive is good at and, from Cloudera’s standpoint, will someday completely supersede Hive. But Cloudera is in no hurry for that day to arrive. Hive is more mature. Hive still has more SQL coverage than Impala. There’s a lot of legacy investment in Hive. Cloudera gets little business advantage if a customer sunsets Hive.
- Impala is already decent at some tasks analytic RDBMS are commonly used for. Cloudera insists that some queries run very quickly on Impala. I believe them.
- Impala is terrible at others, including some of the ones most closely associated with the concept of “data warehousing”. Data modeling is a big zero right now. Impala’s workload management, concurrency and all that are very immature.
- There are some use cases for which SQL-on-Hadoop blows away analytic RDBMS, for example ones involving data transformations – perhaps on multi-structured data – that are impractical in RDBMS.
And of course, as vendors so often do, Cloudera generally overrates both the relative maturity of Impala and the relative importance of the use cases in which its offerings – Impala or otherwise – shine.
Related links
- A survey of SQL/Hadoop integration (February, 2014)
- The cardinal rules of DBMS development (March, 2013)
Categories: Cloudera, Data warehousing, Facebook, Hadoop, SQL/Hadoop integration, Workload management
Comments
4 Responses to “Cloudera, Impala, data warehousing and Hive”
Does Impala do index nested loops joins with HBase?
About Impala performance, we ran our own comparison against Vertica: http://baboonit.be/blog/measuring-vertica-performance-with-tpc-ds
For the results they published, Impala is actually quite fast. They didn’t publish concurrency results.
Mark: Impala does not yet support index nested loop joins against HBase, but it’s something we’re looking at. No release date yet, though.
There is another SQL implementation on HBase by the name of Splice Machine. They provide ACID compliance. After downloading the product on a single node and trying some simple testing, I was completely disappointed by what Splice Machine has to offer. First, it is SQL on HBase (not sure why that makes it an RDBMS) or SQL on Hadoop, i.e. there is a lot of marketing in their material.
Secondly, the product uses Derby for the front-end SQL layer, and most of the execution is extremely slow, as lots of operations (like joins) happen in the JVMs. SQL isn’t complete; you need an Oracle expert from the early 1980s to make Splice Machine work. Query performance, even on a single node with a small dataset, was below par. Transaction throughput doesn’t keep up with the likes of VoltDB or NoSQL.
Will be downloading Impala and giving it a shot next.