Three big myths about MapReduce
Once again, I find myself writing and talking a lot about MapReduce. But I suspect that MapReduce-related conversations would go better if we overcame three fairly common MapReduce myths:
- MapReduce is something very new
- MapReduce involves strict adherence to the Map-Reduce programming paradigm
- MapReduce is a single technology
So let’s give it a try.
When Dave DeWitt and Mike Stonebraker leveled their famous blast at MapReduce, many people thought they overstated their case. But one part of their story – one that both Mike and Dave say was most central to their case – was never effectively refuted, namely the claim that these ideas aren’t particularly new. I haven’t actually read enough computer science literature to have an independent opinion on that issue. But I’ll say this – claims from companies such as SenSage, Oracle, or Splunk that “We’ve been doing MapReduce all along” seem pretty credible to me.
True, what those companies were doing may not have looked exactly like the instant-classic MapReduce programming paradigm. But the same is true of many things almost everybody would agree count as MapReduce. In particular, it is often not the case that you alternate Map and Reduce steps, each of whose outputs is a set of simple <Key, Value> pairs, with data redistributed based on Key at every step.
Here are some examples of what I mean, drawn from my recent MapReduce webinar.
- If you do text indexing in MapReduce, your goal is to wind up with a text index. So at some point you Reduce to a pair <WordName, {all the (DocumentID, offset) pairs for the whole corpus, suitably ordered}>. That’s a heckuva compound “Value”. (See the first sketch after this list.)
- The goal of data mining is usually to estimate a rather small number of parameters based on a large overall data set, often – depending on the algorithm – in the form of a single vector. When you do that in MapReduce, you partition data among nodes and calculate on each node something structured more or less like your final vector. So when it comes time to Reduce, you just ship all of your vectors – one per node – to a single Reduce node and do the appropriate math. Redistribution based on Key would be quite pointless. (See the second sketch after this list.)
- When you sessionize clickstream logs in MapReduce, you may have just as many output records as input records. However, they now are reformatted, and might have a SessionID appended. In such cases, Reduce isn’t doing much in the way of reduction. (See the third sketch after this list.)
- And, as happens in some Vertica-Hadoop use cases around mortgage trading, sometimes MapReduce can even make data sets vastly larger.
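To make the compound-value point concrete, here is a minimal, framework-agnostic Python sketch of the indexing example. The function names and the in-memory “shuffle” driver are illustrative, not any particular MapReduce API:

```python
# Sketch of the text-indexing case: Map emits (word, (doc_id, offset)),
# and Reduce assembles the compound "Value" for each word.
from collections import defaultdict

def map_document(doc_id, text):
    """Emit a (word, (doc_id, offset)) pair for every word occurrence."""
    for offset, word in enumerate(text.split()):
        yield word, (doc_id, offset)

def reduce_word(word, postings):
    """Collapse all postings for one word into a single, ordered list --
    the compound "Value" described above."""
    return word, sorted(postings)

# Tiny in-memory driver standing in for the framework's shuffle phase.
docs = {"d1": "to be or not to be", "d2": "to do is to be"}
shuffled = defaultdict(list)
for doc_id, text in docs.items():
    for word, posting in map_document(doc_id, text):
        shuffled[word].append(posting)

index = dict(reduce_word(w, p) for w, p in shuffled.items())
# index["to"] == [("d1", 0), ("d1", 4), ("d2", 0), ("d2", 3)]
```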
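Similarly, a sketch of the data-mining pattern, in which every node emits its partial result under a single constant key so that one Reduce node can combine them. The constant key "ALL" is just an illustrative convention, and a scalar mean stands in for whatever vector the algorithm estimates:

```python
# Each node computes a partial statistic; a single Reduce combines them.
# Routing everything to one reducer makes key-based redistribution moot.

def map_partition(records):
    """Compute per-partition sufficient statistics, e.g. sums for a mean."""
    yield "ALL", (len(records), sum(records))  # one result per node

def reduce_all(partials):
    """Combine the per-node results into the final estimate."""
    n = sum(count for count, _ in partials)
    s = sum(total for _, total in partials)
    return s / n  # e.g. the global mean

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [p for part in partitions for _, p in map_partition(part)]
print(reduce_all(partials))  # 3.5
```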
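And a sketch of sessionization, assuming a conventional 30-minute inactivity cutoff; the cutoff and the record layout are assumptions for illustration only:

```python
# Reduce here reformats rather than reduces: it orders one user's clicks
# by timestamp and appends a SessionID, emitting as many records as it read.

SESSION_GAP = 30 * 60  # assumed inactivity threshold, in seconds

def reduce_user(user_id, clicks):
    """clicks: list of (timestamp, url). Re-emit each click with a
    per-user SessionID appended."""
    session, last_ts = 0, None
    for ts, url in sorted(clicks):
        if last_ts is not None and ts - last_ts > SESSION_GAP:
            session += 1
        last_ts = ts
        yield user_id, ts, url, f"{user_id}-{session}"

clicks = [(0, "/a"), (100, "/b"), (4000, "/c")]
for record in reduce_user("u1", clicks):
    print(record)  # the third click starts session "u1-1" (gap > 30 min)
```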
By no means do I think this is a weakness of the MapReduce programming paradigm. Rather, I think it’s a MapReduce strength. But it’s not quite the way MapReduce has been promoted and explained to the IT public.
Finally: MapReduce, as commonly conceived, spans two different – albeit closely related – technology domains:
- Parallel programming
- Distributed data management
For example, I imagine Greenplum’s and Vertica’s MapReduce/SQL combined syntaxes are very similar to each other’s. But Vertica’s data management implementation of MapReduce, which relies on Hadoop, is very different from Greenplum’s, which is tied into the Greenplum DBMS. Similarly, non-DBMS MapReduce implementations are commonly associated with distributed file systems – notably HDFS (Hadoop Distributed File System) or Google’s internal GFS (Google File System). In those systems, the parallel language execution part should be aware of how the distributed file management part works – but perhaps that awareness can be pretty lightweight.
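As an illustration of how lightweight that awareness could be, here is a hypothetical sketch – not any real scheduler’s API – in which the execution layer knows only which nodes hold replicas of each input split:

```python
# Hypothetical locality-aware task assignment: the only thing the parallel
# execution layer learns from the storage layer is the replica placement.

def assign_tasks(splits, free_nodes):
    """splits: {split_id: set of nodes holding a replica}.
    Returns {split_id: node}, preferring replica-local nodes."""
    assignments = {}
    available = set(free_nodes)
    for split_id, replicas in splits.items():
        local = replicas & available
        node = local.pop() if local else available.pop()  # fall back to remote
        assignments[split_id] = node
        available.discard(node)
    return assignments

# "s0" lands on n1 or n2 (both hold replicas); "s1" lands on n3.
print(assign_tasks({"s0": {"n1", "n2"}, "s1": {"n3"}}, ["n1", "n2", "n3"]))
```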
Right now, this is a distinction pretty much without a difference. If you choose an implementation of MapReduce – like pure Hadoop (say in the Cloudera distribution) or Hadoop-Vertica or Aster Data’s SQL/MapReduce – you’re basically picking an entire technology stack. But those stacks are going to do a whole lot of changing and maturing in the near future – and as they do, it’s likely that projects will interact or even combine in all sorts of interesting ways.
Bottom line: There are a lot of different ways to exploit MapReduce-related technology.
Comments
“you’re basically picking an entire technology stack” – And herein lies the question I’ve had from the beginning, but which no one at Aster ever answered (that I know of) – If I’ve invested time/effort/resources into writing canned SQL/MR functions on the Aster stack, is that stuff portable to another platform? And do I get the source code that comes with the canned SQL/MR functions which (as best I can tell) get shipped with AsterData? If the answer is “no” then it seems to me you’re making quite a commitment to Aster (or GP I suppose) when buying into their SQL/MR sauce if it’s not portable. Do you have any insight into that?
Thanks
J.
Jerome,
The Aster syntax is Aster-specific, just as if you used any other vendor’s proprietary SQL extensions.
CAM
Jerome’s point is dead on; SQL/MR is analogous to vendors’ custom SQL extensions. That is one of my concerns about these MR extensions; they introduce vendor lock-in.
And as for the recent claim by many that they’ve been “doing MapReduce all along”, the simple answer is this: they have been doing something VERY MUCH LIKE MapReduce all along.
MPP databases horizontally partition the data and process partitions on distinct nodes. MapReduce does not perform the partitioning a priori; it does it at runtime. The MPP implementations that I am familiar with always perform the partitioning of persistent data (tables) a priori, with provisions to redistribute the data as part of the query processing mechanism.
Dean & Ghemawat write, “The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits. The input splits can be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R). The number of partitions (R) and the partitioning function are specified by the user.”
Each MPP implementation has a different name for the mechanism that performs this splitting. In effect, therefore, MapReduce is another mechanism for MPP’izing the solution to a problem, and there is some merit to the claim being made by MPP database vendors that they’ve been doing MapReduce all along.
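For concreteness, here is a minimal Python sketch of the runtime partitioning Dean & Ghemawat describe; zlib.crc32 stands in for whatever hash function the user actually specifies:

```python
# Route each intermediate key to one of R reduce partitions at runtime,
# via hash(key) mod R. An MPP database applies the same idea a priori
# when it distributes a table's rows by a declared distribution key.
import zlib

def partition(key, R):
    """Return the reduce partition (0..R-1) for an intermediate key."""
    return zlib.crc32(key.encode()) % R

R = 4
for key in ["apple", "banana", "cherry"]:
    print(key, "-> reduce partition", partition(key, R))
```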
A good point, but while MapReduce is not new, I feel it emphasized clarity and simplicity (at least for the problem of sorting), which is probably why it is easier to market than MPI or a database. I wrote a bit on this point some time ago: http://www.win-vector.com/blog/2009/01/map-reduce-a-good-idea/
@Jerome: Our customers write SQL-MR functions to do computations on data that would have been extremely complicated, error-prone or slow-performing if done using only SQL. Therein lies a key motivator – customers consciously choose SQL-MR for convenience as opposed to being transparently locked-in.
Our SQL-MR syntax goes a long way in ensuring that the relational model is preserved. For example, MR functions consume and produce relations; MR invocations are modeled as stored-procedure invocations. This means that a customer can migrate to non-Aster nCluster installations with an effort similar to migrating their user-defined functions from one platform to another.
The best part of our SQL-MR framework is that the implementations of the MR functions are in open languages chosen by the customer (e.g., Java, Perl, Python, C++, C#, Ruby, etc.). This means that the actual code is not proprietary to Aster nCluster. The code snippets/sub-functions can be re-used in other platforms as well. In addition, the Map Reduce programming model has widespread popularity, which aids portability: one first designs the Map-Reduce structure of the code, and only secondarily expresses it in SQL-MR, to the extent that one uses features unique to our platform.
The libraries of Aster SQL-MR functions that we provide are, of course, proprietary. They have innovations in data structure and processing that ensure high performance of the compute function. We’ve published the source code of some of these functions; for others, we’ve published the algorithms but not the source code; for the rest, we may publish neither the code nor the algorithm. In fact, we give our partners who write SQL-MR functions complete discretion to publish their functions or to provide only binaries to protect their IP.
The important point to note here is that we are committed to providing an open platform in which no single function is forced upon the end-user.
===
@Amrith: Whenever an innovative system becomes mainstream, there are always claims that the innovation is nothing new! We went through this in the 1990s when Java appeared on the scene as well.
We cannot equate the Map-Reduce programming framework to the internal re-distribution mechanism of tuples in MPP databases. The Map-Reduce programming framework is innovative because it provides a way of attaining parallelism for arbitrary computations. The internal MPP DB tuple re-distribution mechanisms operated on one tuple at a time with a static hash function that had the number of partitions statically pre-defined. The mechanism could not be re-used by users or database applications – in fact, it could not be re-used even by stored procedures that were part of the MPP DB framework.
If you are interested, please look at the Related Work section of our VLDB 2009 conference paper. http://www.asterdata.com/resources/downloads/whitepapers/sqlmr.pdf
Steve Wooledge,
I am flattered that you confused me with David DeWitt and Stonebraker 🙂
They are the ones who are quoted as saying MapReduce wasn’t something new. MapReduce is a creation of Ghemawat and Dean.
All I’m saying is that the recent claims by many that they’ve been “doing MapReduce all along” are not entirely true (and not entirely false either).
I’m not equating MR with the MPP redistribution framework, hence my comment that reads “… the simple answer is this: they have been doing something VERY MUCH LIKE MapReduce all along”.
Thanks,
-amrith
This conversation makes me wonder if anyone has plans to extend MDX to include MR functions or context. After all, this was the language designed to handle multidimensional data as a standard.