December 30, 2009

Clearing up MapReduce confusion, yet again

I’m frustrated by a constant need — or at least urge 🙂 — to correct myths and errors about MapReduce. Let’s try one more time:

MapReduce was named and popularized — but not invented — by Google.
“MapReduce” variously refers to:
- A programming paradigm
- Execution engines that implement the programming paradigm
- Distributed file systems that work with the execution engines
In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).
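
To make the first of those senses concrete, here is a minimal single-process sketch of the programming paradigm in Python, using word counting as the example. It is not Google's or Hadoop's API. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for illustration, and a real execution engine would spread the map, shuffle, and reduce phases across many machines rather than run them in one loop.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy driver (hypothetical): stands in for a distributed execution engine."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase plus "shuffle": group the intermediate values by key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    docs = [("doc1", "map reduce map"), ("doc2", "reduce shuffle")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

Everything that passes between the mapper and the reducer is a key/value pair, so in principle an engine can partition both phases by key without knowing anything else about the computation.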
MapReduce and analytic DBMS can interact in a number of different ways, including:
- Tight integration between a DBMS and exposed MapReduce functionality, e.g. Aster Data’s SQL/MapReduce or Greenplum.
- Integrated MapReduce “under the covers”, e.g. SenSage or Oracle. This may or may not follow all the rules Google laid out for MapReduce, but it’s at least similar in spirit.
- Looser coupling between DBMS and a MapReduce system, e.g. Vertica/Hadoop, in which MapReduce may or may not run on a different cluster than the DBMS.
- Not at all, except perhaps insofar as a quasi-DBMS such as Hive is implemented over a MapReduce system such as Hadoop/HDFS.
As predicted by Monash’s First Law of Commercial Semantics, different vendors have individual variants on those themes. For example, as per a registration-required white paper, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.
MapReduce implementations such as Hadoop are sometimes regarded as part of the NoSQL “movement”. When they are, many generalities about NoSQL — such as that it doesn’t deal with analytics — are falsified.
So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven’t done much to adopt it yet. Probably that’s because the outfits with the greatest need are the same ones with the largest sunk investments in more traditional ways of doing data mining.
Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.
Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.
MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.

Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Hadoop, MapReduce, SenSage, Splunk


Comments

8 Responses to “Clearing up MapReduce confusion, yet again”

Manuel Simoni on December 30th, 2009 10:05 am

Re “MapReduce was named and popularized — but not invented — by Google.”

Can you point to any projects that used MapReduce before it was popularized by Google?

I can imagine that there were projects that used something similar to (subsets of) MapReduce before the MR paper was published, but I am not aware of any that are as general and well-specified as Google MR.
Curt Monash on December 30th, 2009 11:47 am

I don’t know that anybody abstracted it exactly the way Google did before Google. But it also wasn’t a conceptual breakthrough on par with, say, Codd’s idea for a relational DBMS. The predecessor ideas were floating around pretty thickly.
UnHolyGuy on December 30th, 2009 5:38 pm

I think this is the cleanest definition I’ve seen yet

http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php

“MapReduce is a library that lets you adopt a particular, stylized way of programming that’s easy to split among a bunch of machines”

There was a ton of scatter/gather and distributed/grid computing work prior to MapReduce. The conceptual breakthrough, I think, was that enforcing that everything be a key/value pair makes the code easy to write, and distributable without an optimizer.

You build the enforcement of the distributability into the syntax, so to speak. That was pretty smart.

The name itself comes from the Map and Reduce primitives that you find in a lot of languages I think?
Omer Trajman on December 31st, 2009 12:31 am

Curt – it’s great to see these posts and I’d expect you’ll find the need to write more as MR gains adoption.

@UnHolyGuy cites a good source for defining MR. Note that MapReduce is distinctly different from distributed file systems. As Jeff and Sanjay write in a recent CACM article: “MapReduce is storage-system independent”[1]. Tom White has an entire chapter in his Hadoop book where he discusses how “Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.”[2]

As an example, you can run Hadoop/MapReduce on top of Amazon S3 (http://wiki.apache.org/hadoop/AmazonS3) or using Vertica without HDFS (http://www.vertica.com/Hadoop)

[1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
[2] http://books.google.com/books?id=bKPEwR-Pt6EC&lpg=PP1&ots=kOcy-DcbHh&pg=PA49#v=onepage&q=&f=false

@Manuel: as UnHolyGuy also points out, prior to Google’s paper, distributed MapReduce-type operations might have been called Vectored or Scatter/Gather and were generally run on very large shared subsystem (SMP) machines such as Cray X-MP. Google’s breakthrough was running on shared-nothing (MPP) clusters and simplifying the model to use key/value records.
UnHolyGuy on December 31st, 2009 6:36 pm

@Omer, actually I’m pretty sure the US DoE labs (Sandia, LLNL and Los Alamos) were running distributed computing jobs on shared-nothing Linux MPP clusters before Google, mostly simulating nuclear explosions and solving big physics problems.
Neil Raden on January 2nd, 2010 12:35 am

UnHolyGuy,

I can confirm what you said about the national labs. We also used these techniques for the nuclear waste repository program, doing massive stochastic processes on 3-D geology, etc.
Six ScalOut-related articles that caught my attention « Agile Cat — Azure & Hadoop — Talking Book on January 4th, 2010 7:43 pm

[…] up MapReduce confusion, yet again http://www.dbms2.com/2009/12/30/clearing-up-mapreduce-confusion-yet-again/ •MapReduce was named and popularized — but not invented — by Google. •“MapReduce” […]
Invention – Overloaded… « Dudefrommangalore's Weblog on November 6th, 2011 11:36 pm

[…] Good article on DBMS2 clear this confusion. MapReduce was named and popularized — but not invented — by Google. […]
