February 7, 2008
Why the Great MapReduce Debate broke out
While chatting with Mike Stonebraker today, I finally understood why he and Dave DeWitt launched the Great MapReduce Debate:
It was all about academia.
DeWitt noticed cases where study of MapReduce replaced study of real database management in the computer science curriculum. And he thought some MapReduce-related research papers were at best misleading. So DeWitt and Stonebraker decided to set the record straight.
Fireworks ensued.
5 Responses to “Why the Great MapReduce Debate broke out”
I think another reason is that the people publishing papers about MapReduce have not been citing earlier research, particularly DeWitt’s own earlier work. In the academic world such citations are very important, and omitting them, especially when you have been told about the prior work and are thus clearly aware of it (DeWitt has told them), is considered unethical and improper by the rules of academia. This makes a lot of sense, considering how academic careers are advanced, and so on.
Possibly the writers of the MapReduce papers don’t see it that way, but that’s just a guess on my part.
It’s about time. 😉 A few months back we had an interesting discussion about XLDB at Stanford and the different technologies and approaches used to manage extra-large data volumes. MapReduce is a very flexible algorithm for certain types of processing; it has been around for many years (or should I say decades) and is pretty much embedded in a couple of database products. The interesting part many forget to mention: it’s rather expensive, and in many cases there are alternatives that can be executed much more quickly and cheaply, e.g. hash joins.
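(To make that alternative concrete, below is a minimal sketch of the classic in-memory hash join: build a hash table on the smaller input, then probe it in a single pass over the larger one. The table contents, names, and sizes are invented purely for illustration.)

    /* Minimal sketch of a classic in-memory hash join (build + probe).
     * Table contents, names, and sizes are invented for illustration. */
    #include <stdio.h>

    #define NBUCKETS 1024

    typedef struct Row { int key; const char *payload; struct Row *next; } Row;

    static Row *buckets[NBUCKETS];   /* hash table over the smaller input */

    static unsigned hash_key(int key) { return (unsigned)key % NBUCKETS; }

    /* Build phase: hash the smaller ("dimension") table into memory once. */
    static void build(Row *rows, int n) {
        for (int i = 0; i < n; i++) {
            unsigned h = hash_key(rows[i].key);
            rows[i].next = buckets[h];
            buckets[h] = &rows[i];
        }
    }

    /* Probe phase: stream the larger ("fact") table through the hash table
     * in a single pass, emitting matches as they are found. */
    static void probe(const Row *rows, int n) {
        for (int i = 0; i < n; i++)
            for (const Row *r = buckets[hash_key(rows[i].key)]; r; r = r->next)
                if (r->key == rows[i].key)
                    printf("%d: %s joins %s\n", r->key, r->payload, rows[i].payload);
    }

    int main(void) {
        Row dim[]  = { {1, "red", NULL}, {2, "blue", NULL} };
        Row fact[] = { {2, "order-17", NULL}, {1, "order-42", NULL}, {3, "order-99", NULL} };
        build(dim, 2);
        probe(fact, 3);
        return 0;
    }

A parallel DBMS applies the same build-and-probe idea on each node, typically after hash-partitioning both inputs on the join key (or skipping the shuffle entirely when the tables are already co-partitioned).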
MapReduce is pretty much the fourth or fifth choice for a modern cost-based optimizer. MapReduce has gotten ‘famous’ again because a few companies have built custom solutions based on it. There are public white papers on the internet about solutions that throw together 1,800+ pizza boxes to perform MapReduce functions on large data volumes. While the processing rates at peak time are pretty impressive, the overhead of redistribution and non-scalable steps is rather large, leaving a lot to be desired in the parallel efficiency of such solutions.
If efficiency is the concern, shouldn’t that be measured in terms of work done per dollar, not work done per box? How much does a 1,000-node proprietary RDBMS cluster cost, and what is its uptime?
Major factors in price/performance include:
Performance
Cost of boxes
Cost of software
Maintenance of boxes
Maintenance of software
Administrative cost
Energy cost (sometimes the biggest cost factor of all)
That said, if you have a task for which an RDBMS works well, you probably should use an RDBMS. Writing your own RDBMS substitute is probably not the way to go, although opinions on that split among the web biggies: Google loves its MapReduce; Yahoo and eBay seem to favor commercial products; and Amazon seems to like lots of different database products, including its own SimpleDB.
Actually, I imagine the story is more mixed at all of them. E.g., I’m sure one could justify just about any label one wants for Yahoo’s use of MySQL.
CAM
With all the experience in massively parallel systems out there, I don’t understand how people still overlook the most important factor: parallel efficiency (PE), i.e. how many of the systems/CPUs fully participate in the work from the moment you throw a workload at an MPP system. A system with 1,000+ CPUs is very vulnerable to huge variations in throughput whenever it runs below 100% PE. You can actually see that in the charts for the grep test: the system reaches its processing peak for only a few brief seconds, which leads to massive throughput problems.
To have 1,800 servers (granted, older-generation ones) spend 200 seconds on a simple pattern match over 10^10 records, or about 1 TB of data, should be really disappointing. Any of the common MPP systems can match that throughput with 4-12 servers or nodes; those nodes would therefore have to be about 250x more costly apiece before the MPP setup merely matched the MapReduce configuration in total cost. The same goes for the sort.
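(A quick back-of-the-envelope check of those figures, written out as a sketch; the derived per-node rate and the cost ratio below are computed from the numbers in the comment, not quoted from any paper.)

    /* Back-of-the-envelope check of the numbers above. */
    #include <stdio.h>

    int main(void) {
        double data_bytes  = 1e12;    /* ~1 TB scanned, per the comment        */
        double elapsed_sec = 200.0;   /* reported elapsed time for the grep    */
        double nodes       = 1800.0;  /* pizza boxes in the MapReduce cluster  */

        /* Average scan rate each node actually sustained. */
        double per_node_mb_s = data_bytes / elapsed_sec / nodes / 1e6;
        printf("per-node scan rate: %.1f MB/s\n", per_node_mb_s);   /* ~2.8 MB/s */

        /* If 4-12 MPP nodes deliver the same elapsed time, each one could cost
         * roughly 150x-450x as much as a pizza box before total cost evened out. */
        printf("cost-parity ratio: %.0fx to %.0fx per node\n", nodes / 12.0, nodes / 4.0);
        return 0;
    }

Roughly 2.8 MB/s of sustained scan per node is far below what even a single commodity disk of that era could stream sequentially, which is the commenter's point; and 1,800 divided by 4-12 is where the "about 250x" figure comes from.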
It’s all about TCO. A good balance between CPU, I/O, and interconnect is required to make algorithms perform well.
Let’s not forget how such systems are used. Today’s advanced SQL implementations put all of that processing power at the fingertips of data analysts, many of them without a programming background. And with user-defined functions you can add any map/reduce logic you desire in plain C/C++ code and make it available to all users of the RDBMS.
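(The comment names no particular RDBMS, so as one concrete illustration, here is a minimal sketch of a C user-defined function using PostgreSQL's version-1 calling convention; the function name and its one-line logic are invented.)

    /* Minimal sketch of a C user-defined function for PostgreSQL.
     * "map_score" and its logic are placeholders for illustration only. */
    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(map_score);

    /* A trivial "map" step: transform one value per input row. */
    Datum
    map_score(PG_FUNCTION_ARGS)
    {
        int32 raw = PG_GETARG_INT32(0);
        PG_RETURN_INT32(raw * raw);     /* placeholder per-row logic */
    }

Compiled into a shared library and registered via CREATE FUNCTION ... LANGUAGE C, something like this becomes callable from plain SQL by any analyst, and the "reduce" side comes from ordinary GROUP BY aggregation, e.g. SELECT key, sum(map_score(val)) FROM t GROUP BY key (table and column names hypothetical).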