MapReduce
Analysis of implementations of and issues associated with the parallel programming framework MapReduce. Related subjects include:
Why the Great MapReduce Debate broke out
While chatting with Mike Stonebraker today, I finally understood why he and Dave DeWitt launched the Great MapReduce Debate:
It was all about academia.
DeWitt noticed cases where study of MapReduce replaced study of real database management in the computer science curriculum. And he thought some MapReduce-related research papers were at best misleading. So DeWitt and Stonebraker decided to set the record straight.
Fireworks ensued.
Categories: MapReduce, Michael Stonebraker | 5 Comments |
Is MapReduce a good underpinning for next-gen scientific DBMS?
Back in November, Mike Stonebraker suggested that there’s a need for database management advances to serve “big science”. He said:
Obviously, the best solution to these … problems would be to put everything in a next-generation DBMS — one capable of keeping track of data, metadata, and lineage. Supporting the latter would require all operations on the data to be done inside the DBMS with user-defined functions — Postgres-style.
Categories: Data types, MapReduce, Scientific research | Leave a Comment |
A passionate defense of MapReduce
Mark Chu-Carroll has weighed in with a passionate defense of MapReduce. I only see one thing he got wrong, which was to overlook the great shared-nothing parallelism of today’s data warehouse appliances and specialty data warehouse DBMS. But that doesn’t detract from his overall point, which is that MapReduce is designed to help with parallel computing in general, not database querying in particular.
He also has the best version I know of an old observation, namely:
… [relational database] people have found the most beautiful, wonderful, perfect hammer in the whole world. It’s perfectly balanced – not too heavy, not too light, and swings just right to pound in a nail just right every time. The grip is custom-made, fitted to the shape of the owners hand, so that they can use it all day without getting any blisters. It’s also beautifully decorated – encrusted with gemstones and gold filigree – but only in places that won’t detract from how well it works as a hammer. It really is the greatest hammer ever. Relational database guys love their hammer. It’s just such a wonderful tool! And when they make something with it, it really comes out great. In fact, they like it so much that they think it’s the only tool they need. If you give them a screw, they’ll just pound it in like it’s a nail. And when you point out to them that dammit, it’s a screw, not a nail, they’ll say “I know that. But you can’t expect me to use a crappy little screwdriver when I have a magnificent hammer like this!”
Categories: Data models and architecture, Database diversity, MapReduce, Parallelization | Leave a Comment |
MapReduce for data mining? Maybe for variable-schema analytics.
Rich Skrenta is quite a successful entrepreneur, so it’s likely that he doesn’t really mean the more ridiculous parts of this rant on the MapReduce debate. E.g., he cheerfully disregards the fact that the data warehouse appliance vendors have ALREADY disrupted the market he’s focusing on. Index-light row-based and columnar systems are both super fast at data mining extracts.
But let’s go straight to the one interesting thing he said, Read more
Categories: Analytic technologies, MapReduce, Parallelization, SAS Institute | 2 Comments |
The Great MapReduce Debate
Google’s highly parallel file manipulator MapReduce has gotten great attention recently, after a research paper revealed:
- MapReduce is running the core Google search engine, plus much of Google Analytics and other applications.
- MapReduce is processing 400+ petabytes of data per month.
(Niall Kennedy popularized the paper and surveyed its results.)
David DeWitt and Mike Stonebraker then launched a blistering attack on MapReduce, accusing it of disregarding almost all the lessons of database management system theory and practice. A vigorous comment thread has ensued, pointing out that MapReduce is not a DBMS and asserting it therefore shouldn’t be judged as one.
While correct, that defense begs the question – what is MapReduce good for? Proponents of MapReduce highlight two advantages:
- MapReduce makes it very easy to program data transformations, including ones to which relational structures are of little relevance.
- MapReduce runs in massively parallel mode “for free,” without extra programming.
Based on those advantages, MapReduce would indeed seem to have significant uses, including: Read more
Categories: Cloud computing, MapReduce, Michael Stonebraker | 10 Comments |