September 4, 2008
Mike Stonebraker’s counterarguments to MapReduce’s popularity
In response to recent posts I’ve done about MapReduce, Mike Stonebraker just got on the phone to give me his views. His core claim, more or less, is that anything you can do in MapReduce you could already do in a parallel database that complies with SQL-92 and/or has PostgreSQL underpinnings. In particular, Mike says:
- Map functions can’t do anything you can’t also do in PostgreSQL user-defined functions (assuming, of course, PostgreSQL UDFs can be written in the language you want to use).
- Reduce functions can’t do anything you can’t also do in PostgreSQL user-defined aggregates (with the same caveat).
- Map and Reduce functions always write their result sets to disk. This can create a large performance loss.
- Map and Reduce functions require new instances to be fired up to run them. This can also create a large performance loss. (Without checking, I’m guessing that one is very implementation-specific. I.e., even if it’s true of Hadoop, it may not be true of Greenplum’s or Aster Data’s MapReduce implementations.)
- Mike and his associates are working on benchmarks that he believes will show that MapReduce performance is 10X worse than parallel row-based SQL DBMS, and 100X worse than columnar SQL DBMS.
- MapReduce doesn’t play nicely with the SQL Analytics part of the SQL standard.
- The one advantage Mike concedes to MapReduce — more graceful degradation when nodes fail — isn’t that important in the hardware configurations on which parallel analytic DBMS actually run today. I.e., a Greenplum or Vertica installation is going to have nodes fail much more rarely than a Google data center will.
Bottom line: Mike Stonebraker more than disagrees with the claim that MapReduce is a valuable addition to SQL data warehousing, on somewhat different grounds than he emphasized in the Great MapReduce Debate last January.
Categories: Data warehousing, MapReduce, Michael Stonebraker, PostgreSQL
Comments
5 Responses to “Mike Stonebraker’s counterarguments to MapReduce’s popularity”
Overall comments:
1) Dr. Stonebraker should do his homework. Every claim he presents as fact is incorrect, as shown below.
2) He misses the point of MapReduce: it’s about a choice of language and paradigm.
Point by point:
> Map functions can’t do anything you can’t also do in PostgreSQL user-defined functions (assuming, of course, PostgreSQL UDFs can be written in the language you want to use).
Wrong: Postgres UDFs materialize their entire result set before returning; they cannot behave like executor nodes, which can return an effectively infinite stream. See the comment here:
http://www.postgresql.org/docs/8.3/interactive/plpgsql-control-structures.html
> Reduce functions can’t do anything you can’t also do in PostgreSQL user-defined aggregates (with the same caveat).
Wrong: see above; in addition, Postgres UDAs output a single row, not a stream.
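To make that distinction concrete, here is a minimal pure-Python sketch, tied to neither Postgres nor any MapReduce framework (the function names are made up for illustration): an aggregate collapses a group to one row, while a Reduce function can emit a stream of rows per key.

    # Minimal sketch of the distinction drawn above: an aggregate collapses
    # a group to a single value, while a Reduce function may emit any
    # number of output rows per key, without materializing them first.

    def count_aggregate(values):
        # Aggregate-style: the whole group collapses to one value.
        return sum(1 for _ in values)

    def explode_reduce(key, values):
        # Reduce-style: a generator that yields a stream of output rows.
        for v in values:
            yield (key, v.upper())  # arbitrary per-row transformation

    print(count_aggregate(["a", "b", "c"]))       # 3 -- one row per group
    print(list(explode_reduce("k", ["a", "b"])))  # [('k', 'A'), ('k', 'B')]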
> Map and Reduce functions always write their result sets to disk. This can create a large performance loss.
Wrong: this is implementation-dependent. Hadoop does this, but other implementations can flow data from one reduce function to the next map function without materializing results.
> Map and Reduce functions require new instances to be fired up to run them. This can also create a large performance loss. (Without checking, I’m guessing that one is very implementation-specific. I.e., even if it’s true of Hadoop, it may not be true of Greenplum’s or Aster Data’s MapReduce implementations.)
Wrong: this is implementation-dependent. Hadoop does this, but others don’t.
> Mike and his associates are working on benchmarks that he believes will show that MapReduce performance is 10X worse than parallel row-based SQL DBMS, and 100X worse than columnar SQL DBMS.
Let me guess, who is participating?
> MapReduce doesn’t play nicely with the SQL Analytics part of the SQL standard.
Wrong: this, too, is implementation-dependent.
Arguing about the expressiveness of SQL vs MapReduce misses the point. While SQL with UDFs can clearly emulate MapReduce, one can argue just as easily that MapReduce can emulate anything SQL can do. In fact, SQL-like query languages built on MapReduce, such as Hive, are already available.
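As a toy illustration of that emulation argument, here is a single-process Python sketch of how a SQL aggregation such as SELECT word, COUNT(*) FROM docs GROUP BY word decomposes into map, shuffle, and reduce phases. Real frameworks distribute these phases across machines; the code below only shows the data flow.

    from itertools import groupby
    from operator import itemgetter

    # Single-process sketch of the MapReduce equivalent of:
    #   SELECT word, COUNT(*) FROM docs GROUP BY word;

    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield (word, 1)                 # emit (key, value) pairs

    def shuffle(pairs):
        # Group all values by key, as the framework's sort/shuffle would.
        return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

    def reduce_phase(grouped):
        for word, group in grouped:
            yield (word, sum(count for _, count in group))

    docs = ["a b a", "b c"]
    print(dict(reduce_phase(shuffle(map_phase(docs)))))
    # {'a': 2, 'b': 2, 'c': 1}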
Nor is MapReduce as a programming model inherently slower: while the MapReduce described in the Google paper was designed for very long-running computations on very cheap hardware, where writing intermediate results to disk made sense, a MapReduce implementation can be optimized in whatever way is desired while keeping the same programming model. For example, when Yahoo won the TeraSort benchmark using Hadoop, they set a configuration parameter to have Hadoop store intermediate data in an in-memory filesystem rather than on disk.
Instead, MapReduce differentiates itself from parallel databases through the API it provides and the scaling properties of its implementations. MapReduce has significant advantages over traditional parallel databases in three areas:
1) Scalability. With MapReduce, it’s trivial to scale to petabytes of data on thousands of machines. While parallel SQL DBMSs may perform better on smaller data sets and smaller clusters, when you have a problem that requires petabytes of data, you simply can’t do it with them.
2) Cost. With the reliability features in MapReduce, you can afford to buy cheaper hardware and spend less money on operations. Again, as your data size increases, you care more about this.
3) Usability. Writing a UDF in Postgres is painful compared to writing a 20-line Python script to run a MapReduce job (a sketch follows this list). Registering and deploying the UDF is probably also not easy, while MapReduce is designed as a service for running arbitrary code. Finally, administering a Hadoop cluster is very easy compared to administering and tweaking a full-blown commercial database system.
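Here is a rough sketch of what such a 20-line script might look like, assuming Hadoop Streaming, which pipes records through the mapper and reducer over stdin/stdout. The file name wc_mr.py and the launch command are illustrative assumptions.

    #!/usr/bin/env python
    # wc_mr.py -- a word-count job in the Hadoop Streaming style: the
    # mapper and reducer read records on stdin and write tab-separated
    # (key, value) lines to stdout. Hadoop sorts the mapper's output by
    # key before the reducer sees it, so each key arrives contiguously.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)               # emit (word, 1)

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current and current is not None:
                print("%s\t%d" % (current, count))  # flush previous key
                count = 0
            current = word
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Launching it would look roughly like: hadoop jar hadoop-streaming.jar -input logs -output counts -mapper "wc_mr.py map" -reducer "wc_mr.py reduce" -file wc_mr.py (the jar location and flags vary by Hadoop version).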
Most interestingly, there is a philosophical difference between SQL databases and MapReduce. SQL databases are built for business analysts (who prefer to write SQL) and for the various SQL-speaking tools, whereas MapReduce has traditionally been designed for programmers (projects like Hive are only now making it friendlier for analysts and SQL tools). This makes sense when you realize that MapReduce comes out of Google and Yahoo, which have huge pools of engineers who need to do parallel computation, while parallel SQL DBMSs come from the database community, whose users want JDBC compatibility and SQL.
As a result, the type of service provided is very different. In the database world, the DBMS is an oracle: you specify your query, and the smarts of the team that built the database determine how well it is executed. You depend entirely on your DB vendor to figure out how to run your query, but the good news is that they’re often pretty good at this. In the MapReduce world, the MR system is a library that takes care of the common work required for parallel computation, and you as the programmer have much more control over how your query executes.
This means that you can start out really simple, trading performance for rapid development, with a Python script that answers a random question you just had about some 10 TB of log data in five minutes, or you can build highly optimized, thousand-line MapReduce jobs for your throughput-critical queries using the full control that MapReduce provides. Ultimately, it may be very foolish to dismiss MapReduce just because an automated DBMS might never be as good as a programmer at optimizing queries.
M,
Interesting arguments. But I’d like to dispute one part. Parallel SQL DBMS do in fact store petabytes of data. Indeed, the most visible already-in-production multi-petabyte SQL DBMS is at Yahoo, a company that also has a considerable fondness for MapReduce.
Thanks for your wonderfully detailed comment,
CAM
Check out CloudBase:
http://cloudbase.sourceforge.net
It is a data warehouse system built on top of Hadoop’s MapReduce architecture that allows one to query terabytes and petabytes of data using ANSI SQL. It comes with a JDBC driver, so one can use third-party BI tools and reporting frameworks to connect directly to CloudBase.
CloudBase creates a database system directly on flat files and converts input ANSI SQL expressions into map-reduce programs for processing those files. It has an optimized algorithm for handling joins and plans to support table indexing in the next release.