Incremental MapReduce
My clients at Cloudant, Couchbase, and 10gen/MongoDB (Edit: See Alex Popescu’s comment below) all boast incremental MapReduce as a feature. (And they’re not the only ones.) So I feel like making a quick post about it. For starters, I’ll quote myself about Cloudant:
The essence of Cloudant’s incremental MapReduce seems to be that data is selected only if it’s been updated since the last run. Obviously, this only works for MapReduce algorithms whose eventual output can be run on different subsets of the target data set, then aggregated in a simple way.
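To make that concrete, here is a minimal sketch of the idea in Python. It is my own illustration rather than any vendor’s actual API; the document fields, checkpoint logic, and counting job are all invented for the example.

```python
from collections import defaultdict

# Hypothetical illustration of incremental MapReduce's core idea, not any
# vendor's actual API: each run maps only documents touched since the last
# checkpoint, then folds the partial results into the persisted totals.

persisted = defaultdict(int)   # results that survive across runs
checkpoint = 0                 # last-seen update timestamp

def map_fn(doc):
    """Emit (key, value) pairs; here, one count per user."""
    yield doc["user"], 1

def incremental_run(docs, now):
    global checkpoint
    delta = defaultdict(int)
    for doc in docs:
        if doc["updated_at"] > checkpoint:       # select only new/changed docs
            for key, value in map_fn(doc):
                delta[key] += value              # reduce within the delta
    for key, value in delta.items():
        persisted[key] += value                  # aggregate delta into totals
    checkpoint = now

docs = [{"user": "a", "updated_at": 1}, {"user": "b", "updated_at": 2}]
incremental_run(docs, now=2)                             # processes both docs
incremental_run(docs + [{"user": "a", "updated_at": 3}], now=3)  # only the new doc
```

The “aggregated in a simple way” requirement is visible in the second loop: merging a delta into prior totals only works because addition happily combines results computed on different subsets of the data.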
These implementations of incremental MapReduce are hacked together by teams vastly smaller than those working on Hadoop, and surely fall short of Hadoop in many areas such as performance, fault-tolerance, and language support. That’s a given. Still, if the jobs are short and simple, those deficiencies may be tolerable.
A StackOverflow thread about MongoDB’s version of incremental MapReduce highlights some of the implementation challenges.
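For flavor, here is roughly the pattern that thread describes, sketched with PyMongo’s map_reduce wrapper (later PyMongo releases dropped this helper, so treat it as a period sketch rather than current best practice). The collection, field names, and checkpoint value are invented; the two load-bearing pieces are the query filter on a timestamp and out: {reduce: …}, which merges new results into prior ones via the reduce function.

```python
from datetime import datetime, timezone

from bson.code import Code
from pymongo import MongoClient

db = MongoClient().analytics          # assumed database name

mapper = Code("function () { emit(this.user, 1); }")
reducer = Code("""
function (key, values) {
    return Array.sum(values);   // must be safe to re-apply to prior output
}
""")

# Stand-in for a checkpoint you would persist between runs.
last_run = datetime(2012, 1, 1, tzinfo=timezone.utc)

db.events.map_reduce(
    mapper,
    reducer,
    out={"reduce": "event_counts"},   # merge into prior results via reduce
    query={"ts": {"$gt": last_run}},  # only documents since the last run
)
```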
But all practicality aside, let’s return to the point that incremental MapReduce only works for some kinds of MapReduce-based algorithms, and consider how much of a limitation that really is. Looking at the Map steps sheds a little light:
- Map is a choice of how to parallelize the work.
- Incremental MapReduce forces you to parallelize the total work first of all by time, which may or may not be suitable. (A toy contrast follows this list.)
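The contrast, with made-up records, looks something like this:

```python
# Classic MapReduce is free to split its input however it likes, e.g. by
# hashing keys across workers; incremental MapReduce fixes the first split
# along the time axis, into already-processed vs. newly-arrived data.

records = [
    {"user": "a", "updated_at": 5},
    {"user": "b", "updated_at": 12},
    {"user": "a", "updated_at": 17},
]

def hash_partition(recs, n_workers):
    """Classic choice: any worker may see any time range."""
    shards = [[] for _ in range(n_workers)]
    for r in recs:
        shards[hash(r["user"]) % n_workers].append(r)
    return shards

def time_partition(recs, checkpoint):
    """Incremental choice: the first cut is old vs. new."""
    old = [r for r in recs if r["updated_at"] <= checkpoint]
    new = [r for r in recs if r["updated_at"] > checkpoint]
    return old, new
```

Whether that first cut is acceptable depends entirely on whether your algorithm cares which time slice a record landed in.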
For a sharper focus, let’s look at Reduce. Incremental MapReduce works if and only if Reduce steps can operate successfully on time-sliced data.
- If Reduce steps are done row-at-a-time (as for example in many kinds of data transformation), incremental MapReduce surely could work.
- If Reduce steps increment counters of some kind, incremental MapReduce surely could work. (The sketch after this list spells out the property counters rely on.)
- If Reduce steps look for patterns across a whole data set — well, that depends.
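The counter case rests on a property worth spelling out: the reduce function has to be able to digest its own prior output. Here is a sketch loosely modeled on CouchDB’s rereduce convention; the function and data are invented for illustration.

```python
# Loosely modeled on CouchDB's reduce signature, where a rereduce flag
# tells the function whether `values` holds raw mapped values or the
# outputs of earlier reduce calls.

def reduce_counts(keys, values, rereduce=False):
    # For counting, both cases collapse to summation, which is exactly
    # why counters are the easy case for incremental MapReduce.
    return sum(values)

total_jan = reduce_counts(None, [1, 1, 1])                 # raw values -> 3
total_feb = reduce_counts(None, [1, 1])                    # raw values -> 2
grand = reduce_counts(None, [total_jan, total_feb], True)  # re-reduce -> 5
```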
So what would be a situation in which Reduce doesn’t work if you split it up via time slices, but does work if you split it up in some other way? Three candidate categories come quickly to mind — graphs, time series, and joins. Let’s consider them in turn.
- Parallelizing graph analytics is hard no matter what you do. Making it harder yet doesn’t seem wise. I’ve never heard of an incremental MapReduce use case for graph analytics, and I’d be surprised if I ever heard of many.
- If you’re looking for patterns over time, then slicing your data in a simple-minded way by time makes it impossible to do that precisely (because there always will be patterns that could cross the time boundary). On the other hand, working in a sliding time window should be fine. So it all depends on whether the incremental MapReduce framework has sufficient flexibility to allow for reanalyzing some old data. (A sliding-window sketch follows this list.)
- As for joins — well, a DBMS vendor that implements joins via a general MapReduce framework is on the wrong track anyway.
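Here is a sketch of the sliding-window idea from the time-series bullet above; the pattern, window span, and field names are all assumptions made for the example.

```python
# To catch patterns that straddle a slice boundary, each incremental run
# re-reads enough trailing history to cover the longest pattern sought.

PATTERN_SPAN = 60  # longest pattern we look for, in seconds (assumed)

def slice_with_overlap(events, checkpoint):
    """Select the new slice plus a trailing window of already-seen data."""
    window_start = checkpoint - PATTERN_SPAN
    return [e for e in events if e["ts"] > window_start]

def find_bursts(events, span=PATTERN_SPAN, threshold=3):
    """Toy pattern: at least `threshold` events within `span` seconds."""
    ts = sorted(e["ts"] for e in events)
    return [
        ts[i]
        for i in range(len(ts) - threshold + 1)
        if ts[i + threshold - 1] - ts[i] <= span
    ]
```

Patterns found inside the overlap can be reported by two successive runs, so the merge step also has to deduplicate; that is precisely the kind of flexibility an incremental MapReduce framework may or may not offer.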
Bottom line: Incremental MapReduce could serve a variety of purposes — but grafting it onto a DBMS is hardly a full replacement for Hadoop.
Comments
Curt,
As far as I know, MongoDB’s MapReduce is not incremental. The thread on SO shows how to define an incremental MR *yourself*. In CouchDB’s case (CouchDB being at the origin of both Cloudant and Couchbase), views (the way CouchDB exposes MR) are inherently incremental, in the sense that all results of an MR run at a specific time are persisted, and only subsequent updates and additions to the docs are processed when the view is re-requested. Note also that this still takes place after a request to a view (basically a pull model) and not automatically after an update/add (push model).
Alex Popescu