The Great MapReduce Debate
Google’s highly parallel file manipulator MapReduce has drawn a great deal of attention recently, after a research paper revealed:
- MapReduce is running the core Google search engine, plus much of Google Analytics and other applications.
- MapReduce is processing 400+ petabytes of data per month.
(Niall Kennedy popularized the paper and surveyed its results.)
David DeWitt and Mike Stonebraker then launched a blistering attack on MapReduce, accusing it of disregarding almost all the lessons of database management system theory and practice. A vigorous comment thread has ensued, pointing out that MapReduce is not a DBMS and asserting it therefore shouldn’t be judged as one.
While correct, that defense raises the question: what is MapReduce good for? Proponents of MapReduce highlight two advantages:
- MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
- MapReduce runs in massively parallel mode “for free,” without extra programming.
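To make those two claimed advantages concrete, here is a minimal single-machine sketch in Python. It is not Google's implementation or API; `map_reduce`, `count_words_mapper`, and `sum_reducer` are hypothetical names chosen for illustration, and a thread pool stands in for a cluster. The point is only that the user writes a mapper and a reducer, while the toy framework decides how the work is spread out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(records, mapper, reducer, workers=4):
    """Toy framework (an assumption, not Google's API): map, group by key, reduce."""
    # Map phase: the framework, not the user, fans the records out to workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda record: list(mapper(record)), records))

    # Shuffle phase: collect every emitted value under its key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce phase: collapse each key's values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words_mapper(line):
    """User code: emit (word, 1) for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """User code: add up the counts emitted for one word."""
    return sum(values)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = map_reduce(lines, count_words_mapper, sum_reducer)
```

Neither user function mentions threads, machines, or scheduling. Changing `workers` alters how the work is distributed without touching the mapper or reducer, which is the sense in which the parallelism comes "for free."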
Based on those advantages, MapReduce would indeed seem to have significant uses, including:
- Specialized indexing of large quantities of data. Obviously, MapReduce was built for text indexing of the Web. But it would likely also be useful for, say, preprocessing satellite telemetry or intelligence intercepts, or for doing early steps in large-scale network traffic analysis. MapReduce may not be good for data management, but it looks good for banging stuff into specialized data management systems.
- Computer-scientific research. If you’re trying to figure out better ways to, say, digest and analyze huge amounts of astronomical data, MapReduce seems like a great platform. Today’s researchers – even the students – aren’t nearly as adept at parallel algorithms as one would hope. Perhaps we should take those complications away to let them focus on the unique parts of their work. Breakthrough programming is hard enough anyway, especially if you’re trying to do all the work yourself.
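As an illustration of the specialized-indexing use case, the sketch below builds a tiny inverted index in the map/reduce style. It is a simplified assumption of how such a job might be structured, not a description of Google's indexing pipeline; the function names and sample documents are invented for the example.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit (term, doc_id) once per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(doc_ids):
    """Reduce step: turn the document ids seen for one term into a sorted posting list."""
    return sorted(doc_ids)


def build_inverted_index(documents):
    """Run map, group by term, then reduce. Sequential here; a real job would be distributed."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            grouped[term].append(emitted_id)
    return {term: reduce_postings(ids) for term, ids in grouped.items()}


# Hypothetical sample data, loosely echoing the examples in the text.
documents = {
    1: "satellite telemetry feed",
    2: "network traffic feed",
    3: "satellite imagery",
}
index = build_inverted_index(documents)
```

The resulting index is exactly the kind of output one would then load into a specialized data management system for querying; the map/reduce job only prepares it.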
I agree that MapReduce will have limited applicability to problems that relational database management systems handle well. But there are plenty of things that relational database management systems don’t handle well, and MapReduce could be very useful for some of them.
Some August 2008 links about MapReduce
- Three major applications of MapReduce
- Another application of MapReduce
- Sound bites about MapReduce
- Other links about MapReduce
Comments
10 Responses to “The Great MapReduce Debate”
[…] Ditto. (Recent discussion of Google MapReduce quantifies this processing effort a […]
Yesterday, at the New England Database Day conference, Prof. DeWitt gave an invited talk. It became far clearer to me in what manner he was comparing Map/Reduce with parallel database systems.
The blog entry that we’ve all been reading is confusing. Like everyone else, my reaction was “Well, Map/Reduce never said that it was a database system, so why are you criticizing it as if it were?”
What he’s mainly comparing is the overall job scheduling strategy of Map/Reduce versus the kind of job scheduling used by parallel RDBMSs. The Map/Reduce pattern is obviously a useful tool for certain problems. His point is that a more general parallel database system can choose among many patterns, of which Map/Reduce is only one example, and can therefore be a good solution for a wider range of problems. Furthermore, you can issue a declarative query (i.e., in a query language such as SQL or relational algebra), and an automatic optimizer can choose which of those patterns to use for your particular problem, provided that you have an actual DBMS with a schema and so forth.
In this context, what the blog entry says makes a lot more sense.
I hope Prof. DeWitt writes up his talk as a paper, which would make this all a lot clearer.
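The contrast drawn in this comment, between stating a query declaratively and hand-coding one fixed execution pattern, can be sketched in Python. This is only an illustration under assumptions: the `visits` table and its rows are invented, SQLite stands in for a parallel DBMS (it is neither parallel nor the system Prof. DeWitt discussed), and the second function hard-codes the group-then-reduce pattern by hand.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, hits) pairs.
rows = [("home", 3), ("about", 1), ("home", 2), ("docs", 5), ("about", 4)]


def declarative_totals(rows):
    """State WHAT is wanted; the DBMS decides how to execute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page, SUM(hits) FROM visits GROUP BY page ORDER BY page"
    ).fetchall()
    conn.close()
    return result


def map_reduce_totals(rows):
    """Spell out HOW: emit (key, value), group by key, then reduce each group."""
    groups = defaultdict(list)
    for page, hits in rows:          # map: each row already is a (key, value) pair
        groups[page].append(hits)    # shuffle: group values by key
    return sorted((page, sum(values)) for page, values in groups.items())  # reduce


sql_totals = declarative_totals(rows)
mr_totals = map_reduce_totals(rows)
```

Both functions produce the same totals. The difference is where the execution strategy lives: in the first it is left to the database engine, which needs a schema to work with; in the second the programmer has fixed it in code.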
Which nobody thought to tell me about. ::sigh::
Actually, I’m glad they didn’t. I’ve been coughing a lot, and wound up sleeping 14 hours yesterday to good effect. So missing the conference was probably a good thing.
CAM
[…] to think of it, that sounds very consistent with the idea that MapReduce solves a large fraction of Google’s data management issues. […]
What is MapReduce good for? Take a look at the Mahout project at Apache: http://lucene.apache.org/mahout
Mahout implements a variety of machine learning algorithms, many of which are useful in text mining. Mahout builds on Apache Hadoop, which is an implementation of MapReduce. Both are sub-projects of the Lucene search engine. If all this Lucene code comes together at some point in the future, then Lucene will be much more than a search engine.
I’m sure Google is working on using MapReduce for some of the same algorithms; in other words, text mining is in Google’s future.
Check out MRNet:
http://www.paradyn.org/mrnet/
It is becoming the de facto utility for implementing scalable Multicasts and Reductions on high-performance technical computing systems, especially for performance analysis tools. Unlike Hadoop MapReduce, which is file-based, MRNet uses one or more tree-based overlay networks (TBON) over the physical network topology of current (IBM BG/*, Cray XT*) high-end systems (and clusters, too, of course). Each TBON uses the same filter function when used for reduction, so for different filtering functions you instantiate a TBON for each. Unlike MapReduce, MRNet also implements multicast communication over the same TBONs.
It’s not a database system, either.
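For readers unfamiliar with tree-based reduction and multicast, the idea described in this comment can be sketched as follows. This is a conceptual Python illustration only: MRNet is a separate library with its own API, which is not shown here, and `Node`, `reduce_up`, and `multicast_down` are hypothetical names. The sketch also reuses one in-memory tree with different filters for brevity, whereas the comment says MRNet instantiates a TBON per filter function.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    value: Optional[int] = None  # only leaves hold data in this sketch
    children: List["Node"] = field(default_factory=list)


def reduce_up(node, filter_fn):
    """Apply the same filter at every internal node, combining the children's results."""
    if not node.children:
        return node.value
    return filter_fn([reduce_up(child, filter_fn) for child in node.children])


def multicast_down(node, message, delivered):
    """Send one message from the root to every leaf over the same tree."""
    if not node.children:
        delivered[node.name] = message
        return
    for child in node.children:
        multicast_down(child, message, delivered)


# A two-level tree: root -> two intermediate nodes -> six leaves holding 1..6.
leaves = [Node(f"leaf{i}", value=i) for i in range(1, 7)]
tree = Node("root", children=[
    Node("mid0", children=leaves[:3]),
    Node("mid1", children=leaves[3:]),
])

total = reduce_up(tree, sum)    # partial sums are combined at each level
largest = reduce_up(tree, max)  # same tree shape, different filter

delivered = {}
multicast_down(tree, "start-sampling", delivered)
```

Because each internal node combines only its own children's results, no single node has to receive data from every leaf, which is what makes the pattern attractive at large scale.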
[…] data, it is worth listening to what Michael Stonebraker has to say. Especially when he sets out to defend database management systems against various encroachments, e.g. against a return to pre-relational […]
[…] like Aster and Greenplum, followed fairly quickly by others such as Netezza and (somewhat surprisingly) […]
[…] it will make suboptimal (considering my model context not in wider sense!) great SV system. The great MapReduce debate is not for […]
[…] The Great MapReduce Debate | DBMS2 : DataBase Management System Services […]