June 23, 2013

Impala and Parquet

I visited Cloudera Friday for, among other things, a chat about Impala with Marcel Kornacker and colleagues. Highlights included:

Data gets into Parquet via batch jobs only — one reason it’s important that Impala run against multiple file formats — but background format conversion is another roadmap item. A single table can be split across multiple formats — e.g., the freshest data could be in HBase, with the rest in Parquet.
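
For illustration only (this is my sketch, not anything Cloudera showed me): a batch conversion job of the kind described above might look like the following in Python, using the pyarrow library, with invented file names.

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Hypothetical batch job: read freshly landed CSV data and rewrite it
# as Parquet. File names are invented for illustration.
table = pacsv.read_csv("events_2013_06_22.csv")
pq.write_table(table, "events_2013_06_22.parquet")
```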

To understand Parquet, it helps to recall that HDFS has both big blocks and ordinary blocks. The big blocks are the 1-gigabyte units that HDFS manages; at this time, they are also the closest thing HDFS has to specific storage locations that systems — e.g., a DBMS execution engine such as Impala — can refer to. Within these big blocks, Parquet is PAX-like; i.e., it stores entire rows in the same big block, but does so a column at a time. The ordinary-sized blocks that serve as units of I/O, however, should each contain data from only a single column; hence, in most cases it should be possible to retrieve only the specific columns you want. Parquet’s compression scheme is:

I forgot to ask whether Impala can operate on compressed data, but based on its compression scheme I’m guessing the answer is no.
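
To make the layout point concrete, here is a minimal sketch of column-restricted reading, again in Python with pyarrow (my choice of tooling, not Impala’s internals; the file and column names are invented):

```python
import pyarrow.parquet as pq

# Inspect a file's row groups, the large units within which Parquet
# lays data out a column at a time.
meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.num_columns, "columns")

# Read just two columns; since each column's data is stored contiguously
# within a row group, the other columns need never be read from disk.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
```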

In addition to ordinary tables, Parquet can handle nested data structures, à la Dremel. That is, a field can be array-valued, a cell in the array can itself be array-valued, and so on, with arrays all the way down. (Cloudera told me that Twitter’s data is nested 9 levels deep.) If I understood correctly, none of this interferes with single-valued cells being stored in a columnar way; not coincidentally, I got the impression that, at least within each big block, there’s a consistent schema.
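
As a hedged illustration of such nesting (again with pyarrow; the schema and field names are invented, not Twitter’s), a doubly nested structure can be written to Parquet like this:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A field can be array-valued, and cells of that array can themselves
# hold arrays: here each user has sessions, and each session has a
# list of page views. All names are hypothetical.
schema = pa.schema([
    ("user_id", pa.int64()),
    ("sessions", pa.list_(pa.struct([
        ("page_views", pa.list_(pa.string())),
    ]))),
])
table = pa.table({
    "user_id": [1, 2],
    "sessions": [
        [{"page_views": ["home", "search"]}],
        [{"page_views": ["home"]}, {"page_views": ["cart", "checkout"]}],
    ],
}, schema=schema)
pq.write_table(table, "nested.parquet")
```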

As for Impala joins and so on:

Other notes on Impala and Parquet include:

And finally: When I wrote about how hard it is to develop a new DBMS, Impala was the top example I had in mind. I continue to think that Cloudera has a good understanding of the generalities of what it needs to add to Impala, as demonstrated by its willingness to let me list some of the many Impala roadmap items above. But I also think Cloudera has an incomplete appreciation of just how hard some of those development efforts will turn out to be.

Comments

11 Responses to “Impala and Parquet”

  1. Hadoop news and rumors, June 23, 2013 | DBMS 2 : DataBase Management System Services on June 23rd, 2013 11:51 pm

    […] a variety of HDFS file formats. While some formats may be optimized for specific engines – e.g. Parquet for Impala – anything can work with more or less […]

  2. Mark Callaghan on June 24th, 2013 11:59 am

    I see more than a few correlated subqueries run on MySQL; I assume they are more popular elsewhere, and those queries will benefit from a good optimizer. How many of the HDFS-based query processing frameworks need a lot of optimizer work?

  3. Marcel Kornacker on June 24th, 2013 11:58 pm

    Hi Mark,

    I agree that to some extent the absence of a lot of correlated subqueries is caused by the lack of support for them. However, I also want to point out that this is not high on the list of features that are frequently requested by our customers.

    Regarding your (rhetorical?) question about optimizers: cost-based optimization techniques would no doubt also be useful in the context of Hadoop query engines; more of that is certainly on the roadmap for Impala.

    But the absence of this feature also has something to do with the absence of traditional table statistics. Gathering those in a Hadoop environment is a bit more challenging than in your average RDBMS, because there is no single gatekeeper through which all incoming data needs to be funneled. This is not an insurmountable problem, but it illustrates that we need to do more than just “reinvent the wheel”.

  4. Marcel Kornacker on June 25th, 2013 2:27 am

    Curt, you start out by saying that “Impala is meant to someday be a competitive MPP analytic RDBMS”, but that is not why Impala was created. The goal of Impala is to provide general SQL querying capability for the Hadoop ecosystem in an efficient manner. To that end, it already exceeds the capabilities of your average analytic DBMS in some aspects: you can combine data stored in multiple physical formats into a single logical table and query it efficiently with Impala. This matters, because it doesn’t force users to arrange everything around a single, central DBMS; instead users can create data using whatever framework and storage format is most appropriate for the task at hand. Having Parquet available means that your data can eventually be transformed into the most efficient physical format, but you can start querying it as soon as it shows up as, say, a csv file written by your web app.

    Whether Impala in its current shape is “immature” is a matter of opinion, since it depends on a frame of reference. But the fact that we have more than just a handful of customers who are actively submitting support tickets for it shows that it already provides useful functionality.

  5. Curt Monash on June 25th, 2013 2:44 am

    Marcel,

    I get and support the Innovator’s Dilemma suggestion that you’re doing something different from the MPP RDBMSs that it will take you a long time to catch up with. And I thank you for spelling it out so emphatically.

    Even so, I stand by what I wrote. 🙂

  6. Hortonworks, Hadoop, Stinger and Hive | DBMS 2 : DataBase Management System Services on August 7th, 2013 2:52 am

    […] Providing a Hive-friendly HDFS file format, called ORC. To a first approximation, ORC sounds a lot like Cloudera Impala’s preferred format Parquet. […]

  7. Layering of database technology & DBMS with multiple DMLs | DBMS 2 : DataBase Management System Services on January 31st, 2014 9:04 am

    […] — Hive, Impala, Stinger, Shark and so on (including […]

  8. Spark and Databricks | DBMS 2 : DataBase Management System Services on February 2nd, 2014 4:09 pm

    […] Impala-loving Cloudera doesn’t plan to support Shark. Duh. […]

  9. Cloudera, Impala, data warehousing and Hive | DBMS 2 : DataBase Management System Services on April 30th, 2014 10:03 pm

    […] Impala is meant to be good at what Hive is bad at – i.e., fast-response query. (Cloudera mentioned reliable 100 millisecond response times for at least one user.) […]

  10. marees on June 4th, 2015 6:35 am

    Does Impala support the ORC file format? If not, why not? What is the difference between ORC and Parquet?

    The latest versions of Hive (0.14 and 1.0.0) support updates/deletes if the Hive table is in ORC format.

  11. software development on July 1st, 2024 7:17 pm

    It’s hard to find educated people for this topic, but you sound like you
    know what you’re talking about! Thanks
