January 18, 2008

The Great MapReduce Debate

Google’s highly parallel file manipulator MapReduce has drawn a great deal of attention recently, after a research paper revealed:

MapReduce is running the core Google search engine, plus much of Google Analytics and other applications.
MapReduce is processing 400+ petabytes of data per month.

(Niall Kennedy popularized the paper and surveyed its results.)

David DeWitt and Mike Stonebraker then launched a blistering attack on MapReduce, accusing it of disregarding almost all the lessons of database management system theory and practice. A vigorous comment thread has ensued, pointing out that MapReduce is not a DBMS and asserting it therefore shouldn’t be judged as one.

While correct, that defense raises the question: what is MapReduce good for? Proponents of MapReduce highlight two advantages:

MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
MapReduce runs in massively parallel mode “for free,” without extra programming.
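
As a rough illustration of those two claims, here is a minimal single-machine sketch in Python. It is not Google's implementation: `map_reduce`, `count_words_mapper`, and `sum_reducer` are hypothetical names, and a thread pool merely stands in for the cluster. The point is only that the programmer writes the mapper and the reducer, while the fan-out and grouping live in the "framework."

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_reduce(records, mapper, reducer, workers=4):
    """Toy stand-in for a MapReduce runtime (hypothetical helper, one machine)."""
    # Map phase: the "framework" fans the mapper out over the input records.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(mapper, records))

    # Shuffle phase: group every emitted value under its key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words_mapper(line):
    # Emit (word, 1) for every word in one line of input.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(word, counts):
    # Add up the 1s emitted for this word.
    return sum(counts)


lines = ["the quick brown fox", "the lazy dog", "The fox"]
word_counts = map_reduce(lines, count_words_mapper, sum_reducer)
```

Nothing in the mapper or reducer mentions threads, machines, or tables, which is roughly what the "for free" claim amounts to.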

Based on those advantages, MapReduce would indeed seem to have significant uses, including:

Specialized indexing of large quantities of data. Obviously, MapReduce was built for text indexing of the Web. But it would likely also be useful for, say, preprocessing satellite telemetry or intelligence intercepts, or for doing early steps in large-scale network traffic analysis. MapReduce may not be good for data management, but it looks good for banging stuff into specialized data management systems.
Computer-scientific research. If you’re trying to figure out better ways to, say, digest and analyze huge amounts of astronomical data, MapReduce seems like a great platform. Today’s researchers – even the students – aren’t nearly as adept at parallel algorithms as one would hope. Perhaps we should take those complications away to let them focus on the unique parts of their work. Breakthrough programming is hard enough anyway, especially if you’re trying to do all the work yourself.
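
To make the indexing use case above concrete, here is a small Python sketch of an inverted index built in map/reduce style. It assumes whitespace tokenization and runs sequentially on one machine; `index_mapper`, `postings_reducer`, and `build_index` are hypothetical names for illustration, not part of any real MapReduce API.

```python
from collections import defaultdict


def index_mapper(doc):
    # Emit (term, doc_id) once per distinct term in the document.
    doc_id, text = doc
    return [(term, doc_id) for term in set(text.lower().split())]


def postings_reducer(term, doc_ids):
    # Produce a sorted, de-duplicated postings list for one term.
    return sorted(set(doc_ids))


def build_index(docs):
    # Group mapper output by term, then reduce each group.
    groups = defaultdict(list)
    for doc in docs:
        for term, doc_id in index_mapper(doc):
            groups[term].append(doc_id)
    return {term: postings_reducer(term, ids) for term, ids in groups.items()}


docs = [
    (1, "MapReduce builds indexes"),
    (2, "Indexes feed search"),
    (3, "search at scale"),
]
index = build_index(docs)
```

The resulting term-to-documents mapping is the kind of output one would then load into a specialized search or data management system.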

I agree that MapReduce will have limited applicability to problems that relational database management systems handle well. But there are plenty of things that relational database management systems don’t handle well, and MapReduce could be very useful for some of them.


Categories: Cloud computing, MapReduce, Michael Stonebraker


Comments

10 Responses to “The Great MapReduce Debate”

Text Technologies»Blog Archive » 19 Microsoft/Yahoo synergies that could revolutionize the Internet on February 3rd, 2008 6:04 pm

[…] Ditto. (Recent discussion of Google MapReduce quantifies this processing effort a […]
Daniel Weinreb on February 5th, 2008 8:14 am

Yesterday, at the New England Database Day conference, Prof. DeWitt gave an invited talk. It became far clearer to me in what manner he was comparing Map/Reduce with parallel database systems.

The blog entry that we’ve all been reading is confusing. Like everyone else, my reaction was “Well, Map/Reduce never said that it was a database system, so why are you criticizing it as if it were?”

What he’s mainly comparing is the overall job scheduling strategy of Map/Reduce versus the kind of job scheduling used by parallel RDBMSs. The Map/Reduce pattern is obviously a useful tool for certain problems. His point is that a more general parallel database system can choose among many patterns, of which Map/Reduce is only one example, and therefore can be a good solution for a wider range of problems. Furthermore, you can issue a declarative query (i.e., in a query language such as SQL or relational algebra), and an automatic optimizer can choose which of those patterns to use for your particular problem, provided that you have an actual DBMS with a schema and so forth.

In this context, what the blog entry says makes a lot more sense.

I hope Prof. DeWitt writes up his talk as a paper, which would make this all a lot clearer.
Curt Monash on February 5th, 2008 1:19 pm

Which nobody thought to tell me about. ::sigh::

Actually, I’m glad they didn’t. I’ve been coughing a lot, and wound up sleeping 14 hours yesterday to good effect. So missing the conference was probably a good thing.

CAM
Google has thousands of internal data formats, mostly simple ones | DBMS2 -- DataBase Management System Services on July 8th, 2008 2:27 pm

[…] to think of it, that sounds very consistent with the idea that MapReduce solves a large fraction of Google’s data management issues. Share: These icons link to […]
Greg Holmberg on July 9th, 2008 7:39 pm

What is MapReduce good for? Take a look at the Mahout project at Apache: http://lucene.apache.org/mahout

Mahout implements a variety of machine learning algorithms, many of which are useful in text mining. Mahout builds on Apache Hadoop, which is an implementation of MapReduce. Both are sub-projects of the Lucene search engine. If all this Lucene code comes together at some point in the future, then Lucene will be much more than a search engine.

I’m sure Google is working on using MapReduce for some of the same algorithms, i.e., text mining is in Google’s future.
Rod Oldehoeft on August 26th, 2008 8:46 pm

Check out MRNet:
http://www.paradyn.org/mrnet/
It is becoming the de facto utility for implementing scalable Multicasts and Reductions on high-performance technical computing systems, especially for performance analysis tools. Unlike Hadoop MapReduce, which is file-based, MRNet uses one or more tree-based overlay networks (TBON) over the physical network topology of current (IBM BG/*, Cray XT*) high-end systems (and clusters, too, of course). Each TBON uses the same filter function when used for reduction, so for different filtering functions you instantiate a TBON for each. Unlike MapReduce, MRNet also implements multicast communication over the same TBONs.

It’s not a database system, either.
Bazy danych bez SQL « data mining à la polonaise on November 24th, 2009 9:00 pm

[…] data, it is worth listening to what Michael Stonebraker has to say. Especially when he sets out to defend database management systems against various encroachments, e.g. against a return to the pre-relational […]
Search Facets » MapReduce just semi-good for semi-structured data on January 18th, 2010 5:27 pm

[…] like Aster and Greenplum, followed fairly quickly by others such as Netezza and (somewhat surprisingly) […]
CAP equivalent for analytics? « Big Data Craft on October 10th, 2010 12:56 pm

[…] it will make suboptimal (considering my model context not in wider sense!) great SV system. The great MapReduce debate is not for […]
NoSQL Daily – Sat Nov 13 › PHP App Engine on November 12th, 2010 9:16 pm

[…] The Great MapReduce Debate | DBMS2 : DataBase Management System Services […]
