Cloudera presents the MapReduce bull case
Monday was fire-drill day regarding MapReduce vs. MPP relational DBMS. The upshot was that I was quoted in Computerworld and paraphrased in GigaOm as being a little more negative on MapReduce than I really am, in line with my comment:
Frankly, my views on MapReduce are more balanced than [my] weary negativity would seem to imply.
Tuesday afternoon the dial turned a couple notches more positive yet, when I talked with Michael Olson and Jeff Hammerbacher of Cloudera. Cloudera is a new company, built around the open source MapReduce implementation Hadoop. So far Cloudera gives away its Hadoop distribution, without charging for any sort of maintenance or subscription, and just gets revenue from professional services. Presumably, Cloudera plans for this business model to change down the road.
Much of our discussion revolved around Facebook, where Jeff directed a huge and diverse Hadoop effort. Apparently, Hadoop played much of the role of an enterprise data warehouse at Facebook — at least for clickstream/network data — including:
- 2 1/2 petabytes of data managed via Hadoop
- 10 terabytes/day of data ingested via Hadoop (Edit: Some of these metrics have been updated in a subsequent post about Facebook.)
- Ad targeting queries run every 15 minutes in Hadoop
- Dashboard roll-up queries run every hour in Hadoop
- Ad-hoc research/analytic Hadoop queries run whenever
- Anti-fraud analysis done in Hadoop
- Text mining (e.g., of things written on people’s “walls”) done in Hadoop
- 100s or 1000s of simultaneous Hadoop queries
- JSON-based social network analysis in Hadoop
Some Facebook data, however, was put into an Oracle RAC cluster for business intelligence. And Jeff does concede that query execution is slower in Hadoop than in a relational DBMS. Hadoop was also used to build the index for Facebook’s custom text search engine.
Jeff’s reasons for liking Hadoop over relational DBMS at Facebook included:
- Price. Hadoop is free. MPP relational DBMS generally aren’t.
- Re-purposed data transformation logic. Facebook has lots of code sitting around in, e.g., Python to massage various specific kinds of data on its site. This code is re-used in Hadoop for ETL/ELT/ELTL/whatever. (A minimal sketch of the pattern follows this list.)
- Resource management. Amazingly, Jeff found it easier to build a custom Hadoop resource manager to deal with the 100s or 1000s of concurrent queries than to rely on the native capabilities of a DBMS.
- Schema flexibility. This is a subject I’ve been preaching about for years. When people interact with web sites, the best schema to store data from their interactions changes just as quickly as the nature of their possible interactions does. Of course, when you add new features to a website, you can capture anything you like on a glorified entity-attribute-value basis. (Actually, I guess it would be more like EventDescriptor-SessionIdentifierClue-Timestamp.) But evolving a relational schema rapidly enough to keep up is hard. Facebook found it easier to evolve its Hadoop-based data massagers instead. (I’ve usually suggested running with XML or an XML-like approach, but notwithstanding the case of MarkLogic/OpenConnect, that’s not usually the way network analytics implementers choose to go.)
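To give a flavor of the re-purposed-Python and schema-flexibility points, here is a minimal Hadoop Streaming-style mapper. It is a sketch of mine with a made-up field layout, not Facebook's actual code: an existing Python "massage" function is dropped into a Streaming job, and the values it emits are JSON rather than fixed columns.

```python
#!/usr/bin/env python
# A sketch (made-up field layout, not Facebook's code): a Hadoop Streaming
# mapper that re-uses an existing Python "massage" function and emits
# schema-flexible key/JSON-value pairs instead of fixed columns.
import json
import sys


def massage(raw_line):
    """Stand-in for transformation code the site already has."""
    session_id, timestamp, event, rest = raw_line.rstrip("\n").split("\t", 3)
    attrs = dict(p.split("=", 1) for p in rest.split("&") if "=" in p)
    return session_id, {"ts": timestamp, "event": event, "attrs": attrs}


if __name__ == "__main__":
    for line in sys.stdin:
        try:
            key, record = massage(line)
        except ValueError:
            continue  # skip malformed lines rather than failing the whole job
        print("%s\t%s" % (key, json.dumps(record)))
```

Because the value side is just JSON, a new site feature can start emitting new attributes without any downstream schema change.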
More generally, Jeff argues there are tasks better programmed in Hadoop than SQL. He generally leans that way when data is complex, or when the programmers are high-performance computing types who aren’t experienced DBMS users anyway. One specific example is graph construction and traversal; there seems to be considerable adoption of Hadoop for graph analysis in the national intelligence sector.
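As an illustration of the graph point, here is what one round of breadth-first search looks like when forced into map and reduce functions. It is a toy, single-process sketch, not Hadoop API code; a real job would run one such round per iteration until no distances change.

```python
# Illustrative only: one round of breadth-first search expressed as map and
# reduce functions, with a trivial in-memory "shuffle". Records are
# (node, (distance, adjacency_list)).
from collections import defaultdict

INF = float("inf")


def map_fn(node, value):
    dist, adj = value
    yield node, (dist, adj)                 # pass the graph structure through
    if dist < INF:
        for neighbor in adj:
            yield neighbor, (dist + 1, [])  # tentative distance to a neighbor


def reduce_fn(node, values):
    best, adj = INF, []
    for dist, edges in values:
        best = min(best, dist)
        if edges:
            adj = edges
    return node, (best, adj)


if __name__ == "__main__":
    graph = {"a": (0, ["b", "c"]), "b": (INF, ["d"]), "c": (INF, []), "d": (INF, [])}
    shuffled = defaultdict(list)
    for node, value in graph.items():
        for k, v in map_fn(node, value):
            shuffled[k].append(v)
    print(dict(reduce_fn(k, vs) for k, vs in shuffled.items()))  # b and c now at distance 1
```

The output records have the same shape as the input records, which is what lets the same job be re-run round after round.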
Comments
Hi Curt,
Thanks for the interesting discussion on this topic. I would like to reiterate a few points I made in my comments yesterday. Why wouldn’t you use both SQL AND MapReduce? Asking if you should use SQL OR MapReduce is like asking if you should tie your left or right hand behind your back. SQL is very good at some things, and MapReduce is very good at others. Why not leverage the best of both worlds in a single system – SQL for traditional database operations, and MapReduce for richer analysis that cannot be expressed in SQL?
While the DBMS-MapReduce comparison study notes that MapReduce requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, we have eliminated that hassle by providing both SQL and MapReduce capabilities. So essentially, our customers can maximize developer productivity, using SQL for regular data processing and MapReduce for richer analysis.
P.S. We recently blogged about our Enterprise-class MapReduce capabilities and noted the key advantages that a system like ours provides over a pureplay MapReduce implementation – http://www.asterdata.com/blog/index.php/2009/04/02/enterprise-class-mapreduce/
Here are even more examples of why you would want to use both SQL and MapReduce: http://www.asterdata.com/blog/index.php/2009/03/13/sqlmapreduce-faster-answers-to-your-toughest-queries/
Thanks,
Steve
Steve, how many times are you gonna post these links already?!? 🙂
I set up the Cloudera distribution on Amazon EC2 just to play with. It was dirt easy to install and get running. Also, the Cloudera online training is superb. If you want to get your feet wet, it’s a great learning tool. They let you download a nice VM image to practice on too, if you don’t want to jump through EC2 hoops.
The other nice thing about it is that it comes with Streaming, Hive, and Pig preinstalled as well, so you can try the whole stack. It’s pretty kickass actually.
Performance on EC2 seems pretty godawful, though I am sure that has a lot to do with EC2 itself; also I didn’t want to shell out the couple hundred bucks it would have taken to try a TeraSort on 1000 nodes. (-:
I could see a use for the Cloudera/EC2 thing if you had some intermittent needs that did not justify buying a bunch of iron. I believe the NY Times used it for some one-time image conversions, things like that.
Jerome,
I trained Steve to do that. 🙂 Think about how people find things on the web, through Google or otherwise. It makes sense.
And in this particular case, I’m devoting so much attention to the extreme views — “MapReduce is an inferior attempt to reinvent DBMS by whippersnappers who don’t know their history” vs. “MapReduce rules and anybody who doesn’t bow to its awesomeness is a hidebound old fogey” — that I’m not giving as many pixels on my own to Aster’s middle course as it deserves.
I was just teasin’ 🙂
Aster’s a client of yours, right?
Hey Curt,
Thanks for the time yesterday, I enjoyed our chat. As I mentioned on the phone, we tried to take a middle ground at Facebook between MapReduce and databases as well. For the users who were interested in database-like storage and access, we built a system on top of Hadoop called Hive. Joydeep Sen Sarma, one of the main creators of Hive, recently gave a presentation about the software at IIT Delhi. Check it out at http://www.slideshare.net/jsensarma/hadoop-hive-talk-at-iitdelhi.
More generally, people are solving petabyte-scale problems with an open source data management solution, which is fundamentally exciting to me. Hadoop excels on the “problems solved per dollar spent” benchmark that mattered most to me as the manager of a business intelligence/data warehousing group.
Later,
Jeff
Indeed, Jerome. Aster and a double-digit number of their competitors are all clients. 🙂
Best,
CAM
In terms of a middle ground, has there been a convincing presentation on why in-database MR is better than MR where the file system has been partially replaced by a database connector?
The latter is currently available in simple form, although it could be tuned to take more advantage of an environment with partitioned/distributed data; getting a partition-enabled version up and running is relatively simple. This would seem to be a very simple way to combine the performance characteristics of multiple approaches.
Hans,
If you have an MPP DBMS, you generally want it to be in control of the nodes. So if you want to run MR on a cluster w/ locally attached disks that is managed by a DBMS, having the DBMS vendor’s cooperation is a big help.
CAM
Hmm, well… what kind of a big help? Like specifically what is it helping with? In cases like this, it seems common for a technology vendor to *imply* that some feature is better than it really is. For example, that embedded MR has a special relationship with the rest of the DBMS. But the *easiest* thing would be to embed a totally separate MR engine, and *call* it embedded MR. So I honestly have to ask… specifically, what is demonstrably better about embedded MR?
Assuming that one wants MR functionality to begin with, is embedded MR a major part of the value proposition, or just a nice to have?
Or should I look for the DBMS that best fits my other needs, knowing that I can add on MR to any DBMS with about equal results to embedded MR?
OK, I will take a stab…
The big help I could see is that if the map/reduce functionality were aware of how the data was already physically distributed across the distributed db nodes, it could be clever about picking a mapping function that minimized shuffling data around (aka it could perform collocated joins and things like that). Possibly it could tap into the distributed db query optimizer to help make choices.
All the data shuffling and I/O is one of the weaknesses of the map/reduce paradigm. The programmer is forced to be clever, and the programmer does not have access to the kind of heuristic data and finely tuned ruleset that a distributed db optimizer has.
However, I think there is a subtle difference between “aware of” and “embedded in” the distributed db. I think a map/reduce paradigm that was aware of physical db file location and layout, could access those files directly, and could query the optimizer for statistics and access paths, but still ran completely outside the distributed db’s ACID-constrained paradigm might also be good. Let you use whatever language to hit the distributed db files directly, not limit you to SQL to access them.
There are also likely some opportunities around caching and reusing data across multiple similar job streams.
Just conjecturing; I think you would have to be a query optimization engineer to really answer this question, and I am certainly not one.
@UnHolyGuy I am not at all clear that the benefits that you mention are available only to an embedded MR and not to an MR that runs on and queries each node.
I would not assume that something magical is happening behind the scenes with embedded MR. If there is some tangible benefit, it should be stated explicitly.
Hans,
To the extent possible, I want my database management system to have full control of the CPUs on which it runs and, most especially, the RAM they control.
That’s certainly true when I’m dealing with large data volumes.
The idea of running both a DBMS and another whole memory-consuming operation on the same node, contending for resources, doesn’t make much sense to me.
CAM
I don’t see this statement capturing the meat of the story. If you run MR on a DBMS node, it will compete with the DBMS. That’s the case whether you run a version of MR that’s bundled into the DBMS or a version that is stand-alone. The MR will be most efficient when it uses all the CPU power on the machine and as much RAM as it needs – thus potentially choking off queries whether it is embedded in the DBMS or not.
Just because a DBMS embeds an MR implementation does not mean that the DBMS works in some special way with MR to properly manage memory on the nodes. The memory used by the DBMS has no bearing on the memory allocated by MR. A vendor could easily slap in a totally independent MR implementation, where the DBMS and MR are barely aware of each other, and the MR needs to be tuned quite separately from the DBMS in order to cooperate.
So *maybe* a bundled implementation of MR will do something special with memory – but I would want to hear the vendor explain that before I go and assume it.
Again, I can write the code to embed an MR library and threading into any process – a web server, a database, a word processor. But that does not make it magically work better with the process in which I embed it. Vendors that want to sell MR as part of their value proposition should explain exactly what benefits we get from an embedded MR versus a stand-alone MR. Otherwise, you might assume some benefits, only to buy the product and find that you do not get the benefits that you assumed.
@Hans Let’s say the data lives inside the database and is hashed across a hundred nodes using a hash key.
In theory, the map phase of a map/reduce job should not be necessary if you were mapping by the same hash key.
You should be able to go straight to the combine phase without any data movement at all.
A lot of the work that a distributed db query optimizer does is trying to avoid redistributing data.
However, I believe in current map/reduce implementations the map/reduce engine will select out the data and redistribute it across all the nodes, even if that redistribution is logically equivalent to where the data started.
Maybe I am misunderstanding how the map/reduce framework interacts with a relational db?
@Curt I think the important thing is to make sure that both the map/reduce and distributed db engines have access to sufficient resources on each node. This can be done in naive fashion by throttling them both back to half a node’s worth of resources, but that’s a bit wasteful…
@UnHolyGuy You are talking about eliminating the data distribution; the map phase is still needed. It’s a misconception that MR requires moving data about; this comes from the implementation which uses a distributed file system. In fact, when data is already partitioned across many nodes, it is common to have the controller instruct each mapper to read only data on the node on which it runs. Many MPP DBMS’s have an API to allow one to query just one node (the node on which the mapper runs).
This leads to a major point about MR: the smarts in the controller, and a proper way for the mapper to acquire data, make all the difference in performance. Good choices for querying and storing data that take maximum advantage of the existing storage topology can make night/day differences in MR performance. In fact, I think that over time people will come to see that some of Stonebraker’s results come from his very simplistic use of MR.
And maybe a DBMS vendor could put some of those smarts into their MR implementation, since the DBMS is very aware of the storage topology. But to date, I have not seen evidence of this being done. Which is why I say: there may be reasons why it’s better to embed MR functionality into an MPP DBMS. But simply because it *may* be, does not mean that it *is*. I would hate to buy a product because of its embedded MR, assuming that it has all kinds of optimization, only to find that it really does not do anything special or better than a stand-alone (and free) MR deployment.
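To make that concrete, here is a minimal sketch of the node-local-read idea. Every name in it is hypothetical, and psycopg2 is just an assumed stand-in for whatever per-node SQL endpoint a given MPP DBMS exposes:

```python
# All names here are hypothetical: the table, the database, and psycopg2 as a
# stand-in for whatever per-node SQL endpoint the MPP DBMS exposes. The point
# is simply that each mapper scans only the partition stored on its own node,
# so no data crosses the network before the map phase.
import json
import socket

import psycopg2  # assumed DB-API driver for the local segment/partition


def emit(key, value):
    # Stand-in for handing intermediate pairs to the shuffle/partitioner.
    print("%s\t%s\t%s" % (socket.gethostname(), key, json.dumps(value)))


def run_local_map(map_fn, local_port=5432, dbname="analytics", table="clicks"):
    # Connect to the DBMS instance listening on *this* node only.
    conn = psycopg2.connect(host="localhost", port=local_port, dbname=dbname)
    cur = conn.cursor(name="node_scan")  # server-side cursor: stream, don't buffer
    cur.execute("SELECT user_id, payload FROM " + table)
    for user_id, payload in cur:
        for key, value in map_fn(user_id, json.loads(payload)):
            emit(key, value)
    conn.close()
```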
OK, we are saying the same thing then.
The only M/R I have any familiarity with is Hadoop.
I also agree with you when you say
“And maybe a DBMS vendor could put some of those smarts into their MR implementation, since the DBMS is very aware of the storage topology. But to date, I have not seen evidence of this being done.”
I think the really smart thing to do with DDBMS Map/Reduce integration would be enhanced communication between the database and the M/R framework. I think you are building smarts into both the M/R framework and the DBMS at that point. I don’t think anyone has done this to date, though perhaps Greenplum is moving in that direction?
The other win would be for the M/R framework and the DBMS to jointly consider the entire multiphase execution plan of a set of map/reduce jobs and try to optimize across the whole job stream rather than one job at a time. Greedy algorithms will only get you so far. I do not know if anyone is doing that either.
I think it is important to remember, though, that the place where distributed DBMS vendors have invested heavily, and where those systems are the most “smart,” is the query plan optimization side of things. A system smart enough to do “querying and storing data that take maximum advantage of the existing storage topology” is probably an order of magnitude more complex than the base M/R framework itself…
Wait a moment! There’s a screwy assumption here (and I’ve been just as guilty of overlooking it as you guys).
There’s no way you can run Hadoop on the same machines as an MPP DBMS and minimize network usage the way you can by integrating MR into the DBMS. In the stand-alone case, the DBMS gets its results on various nodes, sends them to a head node, and ships them on to the requesting program from there. The MPP DBMS — including one with MR extensions — ships data from peer to peer when it makes sense, in most cases never touching (or overburdening!) the head/master/queen node.
Probably depends on the DBMS, but in at least some cases each node has all the querying features of a regular, individual DBMS. That’s what I was thinking of. If you can’t query each node then that’s a different situation.
Hans,
If memory serves, Kognitio, Vertica, and Exasol don’t have master nodes. But most of the rest do.
CAM
Hans,
A key question is whether you need to push the data to the application, or push the application to the data. With huge volumes of data, you obviously want to avoid pushing these around as much as possible.
In the case of Greenplum, users can use SQL, MapReduce or a combination of both, and have this pushed to the data. i.e. In MPP database terms, the map step will run locally on each node (with direct access to the data), and the result will be ‘redistributed’ across the interconnect to the reduce steps. There’s no up-front movement of data, and the network movement that does occur is over the high-speed interconnect (i.e. multiple GigE or 10GigE connects per node).
That’s the simplest case. It gets really interesting when you start chaining Map-Reduce steps, or incorporating SQL. For example, you could do something like:
1. SELECT from a table in the database (or any arbitrary query)
-> Use this as the input to a MAP
-> Reduce the result
-> Use this as the input to another MAP
-> Reduce the result
2. Read from a large set of files across the filesystem
-> Use this as the input to a MAP
-> Reduce the result
-> Join the result against a table in the database (arbitrary query)
3. Join the results of 1 and 2 together
-> Use this as the input to a MAP
-> Reduce the result and output to a table or the filesystem
What’s really cool is that this is all planned and executed as one pipelined parallel operation that makes full use of the parallelism of the system. No unnecessary materializing or moving of data, and you can make full use of the parallel hash join and aggregation mechanisms within our parallel dataflow engine.
The net of this: There are a lot of cases where Hadoop does the trick. However, if you want to be able to do highly optimized parallel SQL (with full BI tools support for reporting) and MapReduce (for programmatic analytics) against the same data, you definitely want an engine that can do both. You get the ability to blend SQL and MapReduce. But more importantly you aren’t pulling terabytes of data from one system to another before processing can even begin.
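To illustrate just the shape of that chained dataflow, here is a tiny single-process sketch in Python. It is illustrative only: the tables, files, and functions are made up, and this is not how the MapReduce jobs are actually written or executed in parallel; it simply shows stages 1, 2, and 3 feeding one another.

```python
# Purely illustrative, single-process sketch of the chained dataflow above.
# Tables, files, and functions are made up; this just shows the staging.
from itertools import groupby


def map_reduce(records, map_fn, reduce_fn):
    """Tiny local stand-in for one map -> shuffle -> reduce stage."""
    pairs = sorted(kv for rec in records for kv in map_fn(rec))
    return [reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]


# 1. Start from a query result and chain two map/reduce stages.
query_rows = [("alice", 3), ("bob", 1), ("alice", 2)]          # stand-in for a SELECT
per_user = map_reduce(query_rows, lambda r: [(r[0], r[1])], lambda k, vs: (k, sum(vs)))
grand_total = map_reduce(per_user, lambda r: [("total", r[1])], lambda k, vs: (k, sum(vs)))

# 2. Start from files, map/reduce, then join against a (made-up) table.
file_lines = ["alice premium", "bob basic"]
plans = dict(map_reduce(file_lines, lambda l: [tuple(l.split())], lambda k, vs: (k, vs[0])))

# 3. Join the outputs of 1 and 2 and feed them into a final map/reduce.
joined = [(user, total, plans.get(user)) for user, total in per_user]
by_plan = map_reduce(joined, lambda r: [(r[2], r[1])], lambda k, vs: (k, sum(vs)))
print(grand_total, by_plan)
```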
-Ben
I think the entire discussion boils down to some simple points. M/R is simple to install and fast for implementing single functions, but it has near-zero enterprise functionality. MPP DBs are hard to set up, and really start to shine with appropriate re-use of structures and enterprise methodology. Add to that the structural foundation provided by declarative SQL, which enables much more independent third-party BI tools (add 25 years of development time too).
Keep in mind that Map/Reduce is really only a marketing moniker now, and that these systems are really collections of parallel tools. The few implementation plans I’ve seen usually include join operators in the future, for example. It’s growing and getting better, but it is still years away from being as efficient as an MPP DB implementation – at least for my use cases.
It is my assertion that the MPP database vendors entirely missed the impact of their licensing schemes on web applications/companies. Do you think Google or Facebook, as startups, could have afforded Oracle? It would have cost more than the companies made in revenue. Not all applications fit the model of $$ value per transaction – necessity is what has driven the uptake in M/R.
We need open source MPP data management platforms – I am deliberately not saying “DBMS,” because there are large classes of analytics which do not lend themselves to relational technology. The closest thing we have to an efficient MPP programming API is OpenMPI and its brethren, which makes an MPP database look like child’s play by comparison, let alone something as simple as M/R.
I think these technologies are going to merge, they both bring needed things to the table. Look for practical companies bringing these things together.
I think some folk have gotten the idea …
MR is a platform for large-scale, parallelized computation. We’ve always drawn a distinction between ETL tools and DBMSs, and MR/Hadoop is probably better conceived of as a next-generation ETL platform. Why? Because it has no query capabilities, no indexing, poor tools support, etc.
Which is a fine thing, to be honest. Programming models for parallel computation have always been tough. Looks like MR has hit a nice balance between power and simplicity.