SQL-Hadoop architectures compared
The genesis of this post is:
- Dave DeWitt sent me a paper about Microsoft Polybase.
- I argued with Dave about the differences between Polybase and Hadapt.
- I asked Daniel Abadi for his opinion.
- Dan agreed with Dave, in a long email …
- … that he graciously permitted me to lightly-edit and post.
I love my life.
Per Daniel (emphasis mine):
I basically agree with what MSFT says in the paper about how Polybase differs from Hadapt. Obviously at a high level they are similar — both systems can access data stored in Hadoop and split query processing between the DBMS and Hadoop. But once you start to get into the details, the systems diverge dramatically. Polybase is much more similar to Greenplum and Aster’s integration with Hadoop (in the way it uses external tables to connect to Hadoop) than it is to Hadapt. In fact, I would summarize Polybase as a more sophisticated version of SQL-H (but without the HCatalog integration), with more query processing pushdown to Hadoop, and a cost-based optimizer that rigorously decides when and how much query processing to push to Hadoop.
The basic difference between Polybase and Hadapt is the following. With Polybase, the basic interface to the user is the MPP database software (and DBMS storage) that Microsoft is selling. Hadoop is viewed as a secondary source of data — if you have a dataset stored inside Hadoop instead of the database system for whatever reason, then the database system can access that Hadoop data on the fly and include that data in query processing alongside data that is already stored inside the database system. However, the user must know in advance that she might want to query the data in Hadoop: she must register this Hadoop data with the MPP database through an external table definition (and ideally statistics should be generated in advance to help the optimizer). Furthermore, the Hadoop data must be structured, since the external table definition requires this (so you can’t really access arbitrary unstructured data in Hadoop). The same is true for SQL-H and Hawq — they can all access data in Hadoop (in particular data stored in HDFS), but there needs to be some sort of structured schema defined in order for the database to understand how to access it via SQL. So, bottom line, Polybase/SQL-H/Hawq let you dynamically get at data in Hadoop/HDFS that could theoretically have been stored in the DBMS all along, but for some reason is being stored in Hadoop instead of the DBMS.
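To make the registration step concrete, an external table definition looks roughly like the sketch below. This is illustrative only: the column names are made up, and the exact Polybase syntax differs across product versions.

```sql
-- Illustrative sketch of a PDW/Polybase-style external table over a
-- delimited file in HDFS. Column names are invented for this example,
-- and the exact syntax varies by product and version.
CREATE EXTERNAL TABLE hdfsCustomer
(
    c_custkey   BIGINT        NOT NULL,
    c_name      VARCHAR(25)   NOT NULL,
    c_acctbal   DECIMAL(15,2) NOT NULL
)
WITH
(
    LOCATION = '/tpch/customer/',                 -- files in HDFS
    FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' )     -- how to parse the rows
);
```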
HOWEVER, take a look at page 10 of the paper. There are 6 graphs on this page (the same trend is shown on all graphs of the paper, but it is most obvious for the 6 graphs on page 10). The graphs break down query time by where it is spent — the green is time spent in the DBMS, the blue is time spent in Hadoop, and the red is time spent importing the data into the DBMS from Hadoop. It is immediately obvious that the vast majority of the time in all these graphs is spent in the (red) import stage. The obvious conclusion is that while Polybase lets you access data in Hadoop, the user would have been far better off if the data had been in the DBMS all along (so that this data importing would not be necessary). The same is true for SQL-H and Hawq — although they do not refer to the process of getting data out of HDFS and into their execution engines as a ‘data import’, there is still a data movement process (with its associated overhead), and it is well known that the original Aster DBMS and the original Greenplum DBMS are faster than the versions of Aster and Greenplum that access data stored in HDFS.
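For illustration, a hypothetical query of the kind measured in those graphs is sketched below; it assumes the hdfsCustomer external table above plus a DBMS-resident orders table (also invented for this example). Scanning the external table is what triggers the HDFS-to-DBMS data movement charged as import time, unless the optimizer decides to push work down to Hadoop instead.

```sql
-- Hypothetical query: orders is assumed to live in DBMS storage, while
-- hdfsCustomer (defined above) lives in HDFS. Reading hdfsCustomer at query
-- time is what incurs the "import" cost shown in the paper's graphs.
SELECT   c.c_name,
         SUM(o.o_totalprice) AS total_spend
FROM     orders o
JOIN     hdfsCustomer c ON c.c_custkey = o.o_custkey
GROUP BY c.c_name;
```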
With Hadapt, the picture is completely different. If the data is structured enough to fit in the DBMS, it is loaded into the DBMS storage on each Hadoop/HDFS node. We pay this load cost in advance (obviously, with invisible loading, the story gets more interesting, but let’s ignore that for now to make the comparison easier). Therefore Hadapt has a one-time load cost, but then every subsequent query does not have to worry about paying the high data import cost that you see in the Microsoft paper.
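A generic sketch of that difference, not Hadapt’s actual loading mechanism or DDL: pay the movement cost once up front, after which queries read local relational storage with no per-query import step.

```sql
-- Generic illustration (not Hadapt-specific syntax): materialize the
-- Hadoop-resident data into DBMS storage once...
CREATE TABLE customer AS
SELECT * FROM hdfsCustomer;          -- one-time load cost

-- ...so that subsequent queries hit the local DBMS copy directly,
-- with no HDFS import step at query time.
SELECT c_name, c_acctbal
FROM   customer
WHERE  c_acctbal > 1000;
```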
Hence, you will end up seeing Hadapt and Polybase/SQL-H/Hawq used for very different use cases. Polybase/SQL-H/Hawq are appropriate for accessing data in Hadoop in an exploratory/one-off fashion. If the data is not accessed frequently from DBMS queries, then it is appropriate to leave it in Hadoop and access it via Polybase/SQL-H/Hawq those few times that it is needed. However, if the data is repeatedly accessed from the DBMS, it is clearly far better to get it out of Hadoop and into the MPP database system.
With Hadapt, the distinction of whether data should be stored in raw HDFS or the DBMS storage on each HDFS node is not based on how often it is accessed, but rather whether the data is structured/semi-structured or “truly unstructured” (I have to use the modifier “truly” for unstructured data now that DeWitt redefined the meaning of “unstructured” in his paper to mean any data in Hadoop :D). Structured/semi-structured data goes in the DBMS storage and (truly) unstructured data goes in raw HDFS. Through Hadapt’s SQL extensions for unstructured data (including full text search), queries that span HDFS and the DBMS are not simply about importing data from HDFS into the DBMS like in the Polybase paper, but rather about true structured/unstructured query integration.
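A hypothetical example of that kind of integration is sketched below, written with PostgreSQL-style full-text operators purely for illustration (Hadapt’s actual SQL extensions for search are not spelled out here, and the table names are invented): the relational predicates run against structured data in DBMS storage, while the text predicate applies to documents kept in raw HDFS.

```sql
-- Hypothetical query: customer stands in for structured data in DBMS storage;
-- support_docs stands in for truly unstructured documents kept in raw HDFS.
-- The full-text predicate uses PostgreSQL syntax for illustration only.
SELECT   c.c_name,
         COUNT(*) AS matching_docs
FROM     customer c
JOIN     support_docs d ON d.customer_id = c.c_custkey
WHERE    to_tsvector('english', d.doc_text) @@ to_tsquery('cancel & contract')
GROUP BY c.c_name;
```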
Hadapt does not use MapReduce for the first part of query processing in order to reduce data import cost, the way Polybase/SQL-H do. Rather, it uses MapReduce to manage long-running queries, which gives it dynamic query scheduling and runtime fault tolerance. Even if all the data accessed by a query is already in DBMS storage, Hadapt might still use the MapReduce engine if the query is predicted to be long-running. (For short queries, as you already know, Hadapt doesn’t use MapReduce, but instead uses the new IQ engine.) This is fundamentally different from Polybase/SQL-H/Hawq — none of which use Hadoop’s MapReduce engine if all the input data is already in the DBMS.
Related links
- Dave DeWitt’s response to this post (June, 2013)
- SQL-H and Hadapt (October, 2012)
- Cloudera Impala (November, 2012)
- Dan Abadi regarding Hawq (February, 2013)
Comments
Hi Curt,
The paper looks at the Greenplum Database external table feature. It does not cover HAWQ, which _does not_ move data out of Hadoop for computation.
HAWQ accesses data directly within HDFS rather than through external tables. An external-table-like feature is supported for interacting with other Hadoop services, such as HBase.
Thanks, Gavin.
Actually, the distinction shouldn’t be binary: RDBMS vs. HDFS. Rather, it should be a spectrum, with multiple points on it.
Unless I’m mistaken, Greenplum Hawq uses HDFS to store files that aren’t good for much except being accessed by Greenplum. If I’m right, then that part of the Hawq story isn’t too dissimilar from Hadapt’s.
Curt,
Thanks for bringing up the point about RDBMS vs. HDFS not being mutually exclusive. A slightly different categorization might be:
1. Ordinary HDFS files.
2. Open-format special HDFS files suitable for relational handling (such as the Trevni format for Cloudera Impala).
3. Closed-format special HDFS files suitable for relational handling (for example, JethroData’s Hadoop analytical RDBMS stores its column and index files as native HDFS files; I suspect that Hawq might work similarly).
4. Non-HDFS files, such as the PostgreSQL tables in Hadapt.
Eli,
What do you mean by open vs. closed in this context?
Hi Curt,
By “closed” I simply meant proprietary, but let me expand a bit more on this topic.
HDFS, being a file system, does not mandate any specific data format. While a common use of Hadoop is to store raw data, typically in a “simple” row format (what I think you referred to as “Ordinary HDFS files”), various other formats (“Special HDFS files”) are used with equal success. Of these, some Hadoop file formats, such as Parquet or HFile, use an open design, which means different systems can share them. Systems like JethroData use a proprietary file format in HDFS that is not directly shareable.
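To make the open-format point concrete, here is a hypothetical Hive-style table definition over Parquet files in HDFS (table and column names invented; whether you can write STORED AS PARQUET directly depends on the Hive version, with older releases needing explicit SerDe/input-format clauses). The table definition just points at the files; other engines can read the same files because the format itself is open.

```sql
-- Hypothetical Hive-style DDL: the table definition points at open-format
-- Parquet files in HDFS, which other systems can also read directly.
CREATE EXTERNAL TABLE clicks (
    user_id  BIGINT,
    url      STRING,
    ts       STRING
)
STORED AS PARQUET
LOCATION '/data/clicks_parquet';
```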
The question of implementing a database over HDFS really becomes a question of HDFS’ suitability for the task. There are several limitations that make standard HDFS non-trivial for a database implementation:
– Append only operations, no update or delete
– Relatively large block sizes
– No control over block placement / location
Yet we have a NoSQL database (HBase), and soon an analytic RDBMS (JethroData), that are able to run natively over HDFS.
Hi Curt – I’m responding slowly, and sorry for that. To disclose who I am: I’m responsible for technical marketing at IBM for our Hadoop offering, InfoSphere BigInsights. I wanted to make your readers aware of a recent result. We completed an independently audited benchmark comparing IBM Big SQL, Cloudera Impala, and Hive 0.13. The test was conducted over native Hadoop data formats (Parquet in our case). I’m happy to supply letters of attestation around the result from our TPC-authorized auditor. Basically, I think our competitors are engaged in some overly aggressive marketing, and I’m happy to talk specifics. Since your readers are technically oriented, here is a pointer to a short whitepaper: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=WH&infotype=SA&appname=SWGE_IM_EZ_USEN&htmlfid=IMW14799USEN&attachment=IMW14799USEN.PDF
Hi Gord,
I’m not sure why you’re posting a comment about a vendor-sponsored benchmark in response to this particular post. What am I missing?
Great post! Hadoop is often thought of as the one-size-fits-all solution for big data processing problems, but the project is limited in its ability to manage large-scale graph processing, stream processing, and scalable processing of structured data. Learn about Big SQL, a massively parallel processing SQL engine that is optimized for processing large-scale structured data. More at http://www.youtube.com/watch?v=1jMR4cHBwZE