Stonebraker, DeWitt, et al. compare MapReduce to DBMS
Along with five other coauthors (the lead author appears to be Andy Pavlo), famous MapReduce non-fans Mike Stonebraker and David DeWitt have posted a SIGMOD 2009 paper called “A Comparison of Approaches to Large-Scale Data Analysis.” The heart of the paper is a set of benchmarks of Hadoop, Vertica, and “DBMS-X” on identical clusters of 100 low-end nodes, across a series of tests including (if I understood correctly):
- A couple of different flavors of a Grep task originally proposed in Google’s MapReduce paper.
- A database query on simulated clickstream data.
- A join on the same clickstream data.
- Two aggregations on the clickstream data.
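For readers who haven’t seen the Grep task before, here is a minimal sketch of its shape as a map/reduce job. This is purely illustrative — it is not the paper’s benchmark code, and the in-memory “shuffle” stands in for Hadoop’s distributed one. The pattern `"xyz"` and the helper names are my own invention:

```python
import re
from collections import defaultdict

def map_grep(record, pattern=re.compile("xyz")):
    """Map: emit each record that contains the pattern (the Grep task
    scans fixed-size records for a rare substring)."""
    if pattern.search(record):
        yield (record, 1)

def reduce_identity(key, values):
    """Reduce: identity pass-through -- Grep needs no aggregation."""
    yield key

def run_mapreduce(records, mapper, reducer):
    # Toy "shuffle": group intermediate pairs by key, then reduce each group.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reducer(k, vs))
    return out

matches = run_mapreduce(["abcxyzdef", "no match", "xyz only"],
                        map_grep, reduce_identity)
# matches contains the two records holding "xyz"
```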
Both DBMSs outshone Hadoop, and Vertica outperformed DBMS-X. This was true on the Grep task, and also on all the other DBMS-like tasks the authors specified. Reasons the DBMSs outdid Hadoop included compression and query optimization. Reasons Vertica outdid DBMS-X included the usual benefits of column stores.
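One of those “usual benefits” is that a sorted, low-cardinality column compresses extremely well, cutting disk I/O. As a toy illustration (not Vertica’s actual encoding machinery), run-length encoding collapses a column of repeated values to a handful of (value, count) pairs:

```python
from itertools import groupby

def rle(column):
    """Run-length encode a column: each run of equal adjacent values
    becomes a single (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# A sorted 'country' column of 7 rows collapses to 2 pairs.
col = ["DE"] * 4 + ["US"] * 3
encoded = rle(col)   # [("DE", 4), ("US", 3)]
```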
More precisely, both DBMSs clobbered Hadoop on query throughput; Hadoop, however, had some advantages in load speed and the like.
The paper also argues strenuously that for complex and/or team-oriented database programming, one is much better off using a DBMS rather than reinventing the software wheel. However, it concedes that for simple programming tasks, Hadoop may be easier and lighter-weight. For example, some of the benchmark tasks required user-defined functions (UDFs) or the equivalent, and those weren’t as easy to write in the DBMS as one might think.
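To make the UDF point concrete, here is a small sketch of what per-row custom logic inside a SQL engine looks like. It uses SQLite’s `create_function` API rather than DBMS-X or Vertica (whose UDF facilities are what the authors found harder to use than expected); the table and `domain` helper are invented for illustration:

```python
import sqlite3

# In-memory database with a toy clickstream table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (url TEXT, visits INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("a.com/x", 3), ("b.org/y", 5)])

def domain(url):
    """UDF: pull the domain out of a URL -- the kind of per-row logic
    MapReduce expresses naturally as a map function."""
    return url.split("/")[0]

# Register the Python function so SQL statements can call it.
conn.create_function("domain", 1, domain)
rows = conn.execute(
    "SELECT domain(url), SUM(visits) FROM clicks GROUP BY domain(url)"
).fetchall()
```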
Frankly, the paper is less extremely anti-MapReduce than I expected based on the authorship, or on how Mike Stonebraker framed it to me when he told me about it Monday afternoon. That said, it is absolutely in line with the DeWitt/Stonebraker meme “MapReduce isn’t nearly as good for DBMS-style processing as a DBMS is.”
Comments
6 Responses to “Stonebraker, DeWitt, et al. compare MapReduce to DBMS”
[…] the benchmark particulars, and eventually posted a link to the paper to. And I rushed out several related blog […]
I’m neither anti nor pro MapReduce, probably because I have only read about it. Is it purely that it indirectly casts aside a DBMS? MapReduce strikes me as something to be used to categorize (key) large blobs of data where the only three things you know at categorization time are that you have a blob of data, you’ll get some random key, and you’ll get more blobs of data later. Are the anti-MapReduce (pro-DBMS?) people saying that you should go analyze all the blobs and key them every possible way into a schema? Or are they saying that there is a solution in between the two?
While we agree with many of the points in the study, it misses the big picture: why wouldn’t you use both SQL AND MapReduce? Asking whether you should use SQL OR MapReduce is like asking whether you should tie your left or right hand behind your back. SQL is very good at some things, and MapReduce is very good at others. Why not leverage the best of both worlds: use SQL for traditional database operations and MapReduce for richer analysis that SQL cannot express, all in a single system?
While the study notes that MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, we have eliminated that hassle by providing both SQL and MapReduce capabilities. So essentially, our customers can maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis.
At the end of the day, MapReduce is a technology that some vendors are and should be quite afraid of (isn’t that usually why they sponsor studies? ;-), since it provides some amazing capabilities. As a developer or DBA, why on earth wouldn’t you leverage the power of both?
Thanks,
Steve
P.S. We recently blogged about our Enterprise-class MapReduce capabilities and noted the key advantages that a system like ours provides over a pure-play MapReduce implementation – http://www.asterdata.com/blog/index.php/2009/04/02/enterprise-class-mapreduce/
Here are even more examples of why you would want to use both SQL and MapReduce: http://www.asterdata.com/blog/index.php/2009/03/13/sqlmapreduce-faster-answers-to-your-toughest-queries/
Something is very odd about the Hadoop/Java tests: the JVM arguments say “-client”. Anybody who has worked with Java can tell you that server-side workloads are supposed to use the “-server” option. The server and client JVM optimizations are worlds apart.
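For context, one hypothetical way a Hadoop deployment of that era would pass the flag to its task JVMs is via the `mapred.child.java.opts` property in `hadoop-site.xml`. This is a sketch of the idea, not the benchmark’s actual configuration (the heap size shown is an arbitrary example):

```xml
<!-- hadoop-site.xml: request the server JVM (plus an example heap size)
     for map/reduce task children instead of the default client JVM -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-server -Xmx512m</value>
</property>
```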
[…] up for his usual provocative comments – see the discussion of Mapreduce on Curt Monash’s blog here for a great example, if you like to dig deep. (Curt also has several useful posts about Vertica well […]
[…] In general, seeing Abadi be so favorable toward Vertica competitors adds credibility to the recent Hadoop vs. DBMS paper. […]