December 30, 2009
Clearing up MapReduce confusion, yet again
I’m frustrated by a constant need — or at least urge 🙂 — to correct myths and errors about MapReduce. Let’s try one more time:
- MapReduce was named and popularized — but not invented — by Google.
- “MapReduce” variously refers to:
  - A programming paradigm
  - Execution engines that implement the programming paradigm
  - Distributed file systems that work with the execution engines
- In particular, Hadoop is a MapReduce execution engine that includes or is closely associated with HDFS (Hadoop Distributed File System).
- MapReduce and analytic DBMS can interact in a number of different ways, including:
  - Tight integration between a DBMS and exposed MapReduce functionality, e.g. Aster Data’s SQL/MapReduce or Greenplum.
  - Integrated MapReduce “under the covers”, e.g. SenSage or Oracle. This may or may not follow all the rules Google laid out for MapReduce, but it’s at least similar in spirit.
  - Looser coupling between DBMS and a MapReduce system, e.g. Vertica/Hadoop, in which MapReduce may or may not run on a different cluster than the DBMS.
  - Not at all, except perhaps insofar as a quasi-DBMS such as Hive is implemented over a MapReduce system such as Hadoop/HDFS.
- As predicted by Monash’s First Law of Commercial Semantics, different vendors have individual variants on those themes. For example, as per a registration-required white paper, Splunk is moving to publicly expose a not-quite-complete form of MapReduce.
- MapReduce implementations such as Hadoop are sometimes regarded as part of the NoSQL “movement”. When they are, many generalities about NoSQL — such as that it doesn’t deal with analytics — are falsified.
- So far as I can tell, mainstream enterprise (as opposed to web, scientific, investment, etc.) data mining folks may be looking at MapReduce for data mining, but they haven’t done much to adopt it yet. Probably that’s because the outfits who have the greatest need are the same ones that have the largest sunk investments in more traditional ways of doing data mining.
- Cloudera != Hadoop. On the other hand, if you want to use Hadoop, it makes a lot of sense to do business with Cloudera.
- Non-DBMS MapReduce != Hadoop. On the other hand, Hadoop is the default choice for non-DBMS MapReduce.
- MapReduce != Hadoop, period. DBMS-based MapReduce is also a legitimate technical strategy.
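To make the “programming paradigm” sense of the term concrete, here is a minimal single-process sketch in Python. It is not how Hadoop or any other execution engine is implemented; the names map_phase, shuffle, reduce_phase, and run_job are illustrative assumptions rather than any product's API, and word counting is just the customary toy example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step an engine performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Chain the three steps; a real engine would spread them over many machines."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical user-supplied functions for a word count.
def word_count_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word, counts):
    return sum(counts)


lines = ["MapReduce is not Hadoop", "Hadoop is not Cloudera"]
counts = run_job(lines, word_count_mapper, word_count_reducer)
```

The user writes only the mapper and the reducer. Everything in run_job is what an execution engine, whether Hadoop or a DBMS, would supply.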
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Google, Hadoop, MapReduce, SenSage, Splunk
Comments
8 Responses to “Clearing up MapReduce confusion, yet again”
Re “MapReduce was named and popularized — but not invented — by Google.”
Can you point to any projects that used MapReduce before it was popularized by Google?
I can imagine that there were projects that used something similar to (subsets of) MapReduce before the MR paper was published, but I am not aware of any that are as general and well-specified as Google MR.
I don’t know that anybody abstracted it exactly the way Google did before Google. But it also wasn’t a conceptual breakthrough on par with, say, Codd’s idea for a relational DBMS. The predecessor ideas were floating around pretty thickly.
I think this is the cleanest definition I’ve seen yet
http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php
“MapReduce is a library that lets you adopt a particular, stylized way of programming that’s easy to split among a bunch of machines”
There was a ton of scatter/gather distributed/grid computing work prior to MapReduce. The conceptual breakthrough, I think, was that enforcing that everything be a key/value pair makes the code easy to write, and distributable without an optimizer.
You build the enforcement of the distributability into the syntax, so to speak. That was pretty smart.
I think the name itself comes from the map and reduce primitives that you find in a lot of languages?
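A rough sketch of that point, in Python: because every record is a (key, value) pair, a simple hash of the key decides which worker gets it, and each worker can reduce its share without knowing about the others. The function names and the use of CRC32 as the partitioning hash are assumptions for illustration, not taken from Google's or Hadoop's implementation, and the "workers" here are just lists in one process.

```python
import zlib
from collections import defaultdict
from functools import reduce


def partition_for(key, num_workers):
    """Pick a worker from the key alone (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def reduce_partition(pairs):
    """Sum the values for each key within one worker's share of the data."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # functools.reduce is the classic "reduce" primitive the name alludes to.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def distributed_sum(pairs, num_workers):
    """Route pairs to workers by key, reduce each share, then merge."""
    partitions = [[] for _ in range(num_workers)]
    for key, value in pairs:
        partitions[partition_for(key, num_workers)].append((key, value))
    # The built-in map is the other primitive; each call is independent,
    # so in a real system each partition could go to a separate machine.
    merged = {}
    for partial in map(reduce_partition, partitions):
        merged.update(partial)  # safe: a key lives in exactly one partition
    return merged


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = distributed_sum(pairs, num_workers=3)
```

Since all values for a given key land on the same worker, merging the partial results needs no coordination, and the answer does not depend on how many workers there are.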
Curt – it’s great to see these posts and I’d expect you’ll find the need to write more as MR gains adoption.
@UnHolyGuy cites a good source for defining MR. Note that MapReduce is distinctly different from distributed file systems. As Jeff and Sanjay write in a recent CACM article: “MapReduce is storage-system independent”[1]. Tom White has an entire chapter in his Hadoop book where he discusses how “Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.”[2]
As an example, you can run Hadoop/MapReduce on top of Amazon S3 (http://wiki.apache.org/hadoop/AmazonS3) or using Vertica without HDFS (http://www.vertica.com/Hadoop)
[1] http://cacm.acm.org/magazines/2010/1/55744-mapreduce-a-flexible-data-processing-tool/fulltext
[2] http://books.google.com/books?id=bKPEwR-Pt6EC&lpg=PP1&ots=kOcy-DcbHh&pg=PA49#v=onepage&q=&f=false
@Manuel: as UnHolyGuy also points out, prior to Google’s paper distributed MapReduce-type operations might have been called vectored or scatter/gather, and were generally run on very large shared-memory (SMP) machines such as the Cray X-MP. Google’s breakthrough was running on shared-nothing (MPP) clusters and simplifying the model to use key/value records.
@Omer, actually I’m pretty sure the US DoE labs (Sandia, LLNL and Los Alamos) were running distributed computing jobs on shared-nothing Linux MPP clusters before Google, mostly simulating nuclear explosions and solving big physics problems.
UnHolyGuy,
I can confirm what you said about the national labs. We also used these techniques for the nuclear waste repository program, doing massive stochastic processes on 3-D geology, etc.