Impala and Parquet
I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:
- Impala is meant to someday be a competitive MPP (Massively Parallel Processing) analytic RDBMS.
- At the moment, it is not one. For example, Impala lacks any meaningful form of workload management or query optimization.
- While Impala will run against any HDFS (Hadoop Distributed File System) file format, claims of strong performance assume that the data is in Parquet …
- … which is the replacement for the short-lived Trevni …
- … and which for most practical purposes is true columnar.
- Impala is also meant to be more than an RDBMS; Parquet and presumably in the future Impala can accommodate nested data structures.
- Just as Impala runs against most or all HDFS file formats, Parquet files can be used by most Hadoop execution engines, and of course by Pig and Hive.
- The Impala roadmap includes workload management, query optimization, data skipping, user-defined functions, hash distribution, two turtledoves, and a partridge in a pear tree.
Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest in Parquet.
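For concreteness, here’s a minimal sketch of that batch-conversion step, using Python’s pyarrow library (my choice of tooling; the file names are hypothetical, and this is not necessarily how Cloudera does it):

```python
# Minimal sketch: batch-convert a CSV file into Parquet.
# File names ("events.csv", "events.parquet") are hypothetical.
import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

table = pa_csv.read_csv("events.csv")    # read the whole batch into memory
pq.write_table(table, "events.parquet")  # write it back out in columnar Parquet
```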
To understand Parquet, it helps to recall that HDFS has big blocks as well as ordinary ones. The big blocks are the 1-gigabyte units that HDFS manages; at this time, they are also the closest thing HDFS has to specific storage locations that systems (e.g., a DBMS execution engine such as Impala) can refer to. Within these big blocks, Parquet is PAX-like; i.e., it stores entire rows in the same big block, but does so a column at a time. However, the ordinary-sized blocks that serve as units of I/O should each contain data from a single column; hence, in most cases it should be possible to retrieve only the specific columns you want. Parquet’s compression scheme is:
- One big block at a time.
- By default, dictionary up to a certain cardinality (I forgot to get the figure) …
- … and RLE (Run-Length Encoding) after that.
- Bit-packing as another option.
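To make those encoding choices concrete, here’s a hedged sketch with pyarrow: it writes a low-cardinality column with dictionary encoding enabled, reads back a single column (the column-at-a-time retrieval described above), and inspects which encodings the writer actually chose. The knobs shown are pyarrow’s, not necessarily Impala’s.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy table: "city" is low-cardinality, so dictionary encoding pays off.
table = pa.table({
    "city":  ["Paris", "Paris", "Lyon", "Paris", "Lyon"],
    "sales": [100, 250, 75, 300, 125],
})

# use_dictionary=True is pyarrow's dictionary-encoding knob; the writer
# falls back to other encodings once a dictionary grows past its size limit.
pq.write_table(table, "sales.parquet", use_dictionary=True)

# The columnar payoff: retrieve only the column you want.
sales_only = pq.read_table("sales.parquet", columns=["sales"])

# Inspect the encodings actually used for the first column chunk.
meta = pq.ParquetFile("sales.parquet").metadata
print(meta.row_group(0).column(0).encodings)
```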
I forgot to ask whether Impala can operate on compressed data, but based on its compression scheme I’m guessing the answer is no.
In addition to ordinary tables, Parquet can handle nested data structures, à la Dremel. That is, a field can be array-valued, a cell in the array can itself be array-valued, and so on, with arrays all the way down. (Cloudera told me that Twitter’s data is nested 9 levels deep.) If I understood correctly, none of this interferes with single-valued cells being stored in a columnar way; not coincidentally, I got the impression that, at least within each big block, there’s a consistent schema.
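As an illustration of that nesting (again with pyarrow; the schema is invented for this example), a field can be a list whose elements are themselves lists:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A Dremel-style nested schema: each row holds a list of lists of ints.
schema = pa.schema([
    ("user_id", pa.int64()),
    ("sessions", pa.list_(pa.list_(pa.int32()))),  # arrays of arrays
])

table = pa.table({
    "user_id": [1, 2],
    "sessions": [[[10, 20], [30]], [[40]]],
}, schema=schema)

pq.write_table(table, "nested.parquet")

# Flat columns such as user_id remain retrievable column-at-a-time,
# independent of the nested column stored alongside them.
print(pq.read_table("nested.parquet", columns=["user_id"]))
```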
As for Impala joins and so on:
- Impala does in-memory hash joins (sketched in miniature after this list) …
- … on either a broadcast or partition (sort/merge) basis.
- Choosing between join algorithms is the one thing Impala’s optimizer can do. (The rest is waiting on better stats/metadata in the base files.)
- Joins are fully parallelized when it makes sense. Any Impala daemon/node can be in charge of client communication for a particular query; otherwise, there’s no special “head” node.
- Different tables can have different replication factors. So in particular, you can replicate a small table to every node.
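For readers who want the hash-join mechanics spelled out, here’s a toy Python version of what each node does locally; in the broadcast case, the small build table has simply been copied to every node first. (This illustrates the general algorithm, not Impala’s actual implementation, which is in C++.)

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the smaller input on its join key.
    ht = defaultdict(list)
    for row in build_rows:
        ht[row[build_key]].append(row)
    # Probe phase: stream the larger input and emit merged matches.
    for row in probe_rows:
        for match in ht.get(row[probe_key], []):
            yield {**match, **row}

# A small dimension table (broadcast) joined to a larger fact table.
dim  = [{"id": 1, "name": "Paris"}, {"id": 2, "name": "Lyon"}]
fact = [{"city_id": 1, "sales": 100}, {"city_id": 2, "sales": 75}]
print(list(hash_join(dim, fact, "id", "city_id")))
```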
Other notes on Impala and Parquet include:
- Cloudera said that a total of ~1300 organizations have downloaded Impala, and at least ~50 of them are showing strong evidence of some kind of use (e.g., filing tickets about it).
- The main contributors to Parquet to date are Cloudera, Twitter and a French firm called Criteo.
- Impala does most of SQL 92, but not correlated sub-queries, for which there isn’t much user demand anyway. (I think of correlated sub-queries as something you need to run SAP or PeopleSoft enterprise apps.)
- Impala compiles queries into “assembly language” (i.e., native code generated via LLVM) at run time; a toy sketch of the general idea follows this list.
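As a toy illustration of run-time query compilation in general (emphatically not Impala’s actual LLVM mechanism), one can generate a specialized predicate function per query instead of interpreting the predicate row by row:

```python
# Toy sketch of run-time code generation: build and compile a
# query-specific predicate, rather than interpreting it per row.
def compile_predicate(column, op, literal):
    src = f"def pred(row): return row[{column!r}] {op} {literal!r}"
    namespace = {}
    exec(src, namespace)  # compile the generated source at run time
    return namespace["pred"]

pred = compile_predicate("sales", ">", 100)
rows = [{"sales": 50}, {"sales": 250}]
print([r for r in rows if pred(r)])  # -> [{'sales': 250}]
```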
And finally: When I wrote about how hard it is to develop a new DBMS, Impala was the top example I had in mind. I continue to think that Cloudera has a good understanding of the generalities of what it needs to add to Impala, as demonstrated by its willingness to let me list some of the many Impala roadmap items above. But I also think Cloudera has an incomplete appreciation of just how hard some of those development efforts will turn out to be.
Related links
- Dan Abadi and Dave DeWitt recently contributed observations about SQL-Hadoop architectures.
- DBMS concurrency is a classic case of Bottleneck Whack-A-Mole.
Comments
[…] a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less […]
I see more than a few correlated subqueries run on MySQL; I assume these are more popular elsewhere, and those queries will benefit from a good optimizer. How many of the HDFS-based query processing frameworks need a lot of optimizer work?
Hi Mark,
I agree that to some extent the absence of a lot of correlated subqueries is caused by the lack of support for them. However, I also want to point out that this is not high on the list of features that are frequently requested by our customers.
Regarding your (rhetorical?) question about optimizers: cost-based optimization techniques would no doubt also be useful in the context of Hadoop query engines; more of that is certainly on the roadmap for Impala.
But the absence of this feature also has something to do with the absence of traditional table statistics. Gathering those in a Hadoop environment is a bit more challenging than in your average RDBMS, because there is no single gatekeeper through which all incoming data needs to be funneled. This is not an insurmountable problem, but it illustrates that we need to do more than just “reinvent the wheel”.
Curt, you start out by saying that “Impala is meant to someday be a competitive MPP analytic RDBMS”, but that is not why Impala was created. The goal of Impala is to provide general SQL querying capability for the Hadoop ecosystem in an efficient manner. To that end, it already exceeds the capabilities of your average analytic DBMS in some aspects: you can combine data stored in multiple physical formats into a single logical table and query it efficiently with Impala. This matters, because it doesn’t force users to arrange everything around a single, central DBMS; instead users can create data using whatever framework and storage format is most appropriate for the task at hand. Having Parquet available means that your data can eventually be transformed into the most efficient physical format, but you can start querying it as soon as it shows up as, say, a csv file written by your web app.
Whether Impala in its current shape is “immature” is a matter of opinion, since it depends on a frame of reference. But the fact that we have more than just a handful of customers who are actively submitting support tickets for it shows that it already provides useful functionality.
Marcel,
I get and support the Innovator’s Dilemma suggestion that you’re doing something different from the MPP RDBMS it will take you a long time to catch up with. And I thank you for spelling it out so emphatically.
Even so, I stand by what I wrote. 🙂
[…] Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet. […]
[…] — Hive, Impala, Stinger, Shark and so on (including […]
[…] Impala-loving Cloudera doesn’t plan to support Shark. Duh. […]
[…] Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.) […]
Does Impala support the ORC file format? Why not? What is the difference between ORC and Parquet?
The latest versions of Hive (0.14 and 1.0.0) support updates/deletes if the Hive table is in ORC format.