Alternatives for Hadoop/MapReduce data storage and management
There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is run by a master node, and specifically a namenode, that’s a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
Different entities have different ideas about how such deficiencies should be addressed.
For most practical purposes, Yahoo’s and IBM’s views about Hadoop have converged. Yahoo and IBM both believe that Hadoop data storage should be advanced solely through the Apache Hadoop open source process. In particular:
- IBM and Yahoo both talk of the great undesirability of Hadoop “forking” like Unix did.
- Yahoo appeared on stage at IBM’s analyst event this week to reinforce the meeting-of-the-minds, even though there’s no IBM/Yahoo customer relationship involved.
- IBM has disclaimed any intention of providing its own Hadoop distribution, but even so is committed to selling lots of IBM InfoSphere BigInsights, which incorporates Apache Hadoop.*
- Yahoo has stopped offering its own Hadoop distribution, period.
*IBM is emphatic about ruling out marketing terms whose connotation it doesn’t like. IBM’s Hadoop distribution isn’t a “distribution,” because that might make it sound too proprietary; IBM’s Oracle emulation offering isn’t an “emulation” offering, because that might make it sound too slow; and IBM’s CEP product InfoSphere Streams isn’t a “CEP” product, because that might make it sound too non-functional.
Cloudera can probably be regarded as part of the Yahoo/IBM camp, some stern looks from IBM in Cloudera’s direction notwithstanding. Cloudera Enterprise — also an embrace-and-extend offering — remains the obvious choice for enterprise Hadoop users; meanwhile, nobody has convinced me of any bogosity in the “no forking” claim Cloudera makes for its free/open source Hadoop distribution. Indeed, when I visited Cloudera a couple of weeks ago, Mike Olson showed me a slide suggesting that Cloudera might be supplanting Yahoo as the biggest ongoing contributor to Apache Hadoop.
EMC’s Data Computing Division, née Greenplum, made a lot of Hadoop noise this week. Unlike Yahoo, IBM, and Cloudera, EMC really is forking Hadoop. I’m not talking with the EMC/Greenplum folks these days, but the whole thing was covered from various angles by Lucas Mearian, Doug Henschen, Derrick Harris, and Dave Menninger.
Another option is to entirely replace HDFS with a DBMS, whether distributed or just instanced at each node. DataStax is doing that with Cassandra-based Brisk; Hadapt plans to do that with PostgreSQL and VectorWise (edit: As per the comment below, Hadapt only plans a partial replacement of HDFS); and Netezza’s analytic platform has a Hadoop-over-Netezza option as well. Mike Olson objects to such implementations being called “Hadoop”; but trademark issues aside, those vendors plan to support a broad variety of Hadoop-compatible tools. Aster Data has long taken that approach one step further, by offering an enhanced version of MapReduce — aka SQL/MapReduce — over its nCluster DBMS. And 10gen offers a more primitive form of MapReduce with MongoDB, but probably wouldn’t position it as addressing a “MapReduce market” at all.
Comments
Hi,
I’ve just unveiled my new large data set processing method called ‘distributed set processing’:
http://www.mysqlperformanceblog.com/2011/05/14/distributed-set-processing-with-shard-query/
10gen seems to be in the process of figuring out their MapReduce story – if the Hadoop integration plug-in work in progress is any indication:
http://www.mongodb.org/display/DOCS/Java+Language+Center#JavaLanguageCenter-HadoopSupport
“First they ignore you, then they laugh at you, then they fight you, then you win.”
Imitation is the sincerest form of flattery.
EMC/Greenplum make me laugh. Talk about desperation.
Everyone is still missing a big part of the point, I think. Hadoop is much more about an open architecture and commodity pricing than it is about “map/reduce”.
Map/Reduce is a means to an end: “let me process and store big data on the cheap”. Proprietary systems implementing map/reduce do not get you that unless they slash and burn their pricing models. This is happening, but nowhere near fast enough.
Imitating it with proprietary stuff is like rolling your own Linux clone and trying to sell it. (Sun, anyone?)
Some other observations:
1: Found HDFS compression to be comparable to most of the big MPP DBs, which are wildly optimistic in their claims. Gzip is gzip, lzo is lzo; there is a lot less magic there than the vendors claim. (See the sketch after this list.)
2: HDFS is not really that young; first released in 2006, which makes it quite a bit older than most of its competitors. Realistically it’s been under heavy load since what, 2007 or so? It’s also quite stable, as is the Java layer.
3: The layers on top of HDFS (Pig, Hive, HBase) I agree are much younger and less stable.
4: HDFS RAID is in prototype, under heavy use at Facebook I think, and will be out soonish from what I heard. Of course bottleneck whack-a-mole probably applies there.
5: MapR supposedly will address the namenode single point of failure (proprietary though, blah); I think Cloudera is working on this as well.
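To make the compression point concrete, here’s a minimal sketch (not from this thread) of wiring the stock gzip codec into a job with the classic mapred API; the class name and paths are made up, and LZO slots in the same way if the codec is installed on the cluster:

```java
// Minimal sketch: turning on output (and intermediate) compression for a
// Hadoop job using the stock codecs. Job name, class, and paths are hypothetical.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CompressedOutputJob.class);
        conf.setJobName("compressed-output-example");

        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output"));

        // Compress the job's final output with plain gzip; swapping in an
        // LZO codec (where installed) is a one-line change.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Compressing intermediate map output often matters more for job
        // runtime than final-output compression does.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);

        JobClient.runJob(conf);
    }
}
```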
fatal.error,
Yes, Cloudera is working on namenode single point of failure.
As for compression — columnar compression seems to just work better than the more generic kinds.
Yes, in a lot of cases, but you can build columnar data files in HDFS if you want.
RCFiles:
http://developer.yahoo.com/blogs/ydn/posts/2010/07/hbase_and_rcfile/
Also, HBase is basically a columnar store.
If HBase has columnar compression yet, I’ve missed the news.
I don’t know about HBase, but in Hive and in HDFS in general they have it.
For the record, Hadapt does not replace HDFS, but instead supplements it. HDFS still stores the unstructured data, and the structured data goes into the relational storage. Both storage systems sit underneath the same Hadoop codebase. What we’re saying is that Hadoop is already great for unstructured data, but in order for Hadoop to become the central system for all types of analysis (instead of today’s model where you need Hadoop and a traditional DBMS with a connector between them, thereby making it necessary to maintain both systems separately), it needs better support for relational data, which is where Hadapt comes in.
Thanks, Daniel! Fixed.
Curt,
“IBM’s Oracle Emulation Offering” isn’t called “emulation” because it isn’t an emulation any more than my English is an emulation.
If you want to slap a name on it call it “colloquial bi-lingualism” perhaps.
How about we hook up and I give you a deep dive one of these days to sort this out?
Cheers
Serge
SQL Architect, DB2 for LUW
Thanks Curt.
About the three vs. two copies, I think you can tell HDFS how many copies you want. That said, three is a good number for many reasons.
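For what it’s worth, a minimal sketch of that knob, with hypothetical paths; the same thing is also exposed on the command line as hadoop fs -setrep:

```java
// Minimal sketch: HDFS replication is configurable cluster-wide and per file.
// The path used here is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Default for files created through this client; the cluster-wide
        // default is the dfs.replication property in hdfs-site.xml (3 out of the box).
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed after the fact, file by file.
        fs.setReplication(new Path("/user/demo/part-00000"), (short) 2);

        fs.close();
    }
}
```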
Speaking for myself, I think the whole HDFS/Hadoop/etc. suite is a wonderful thing and I sure hope there isn’t a major fork. Major forks are so counterproductive. The Emacs/XEmacs fork didn’t just lead to the need for duplication of effort. I think that people lose enthusiasm for working on an open-source project that has forked. They don’t want to enter someone else’s dysfunctional family.
“If HBase has columnar compression yet, I’ve missed the news”
I am not 100% sure, but HBase stores data in HFiles (one per column family). These HFiles are native HDFS sequence files and can be compressed. So, technically speaking, it supports columnar compression by using native HDFS compression and splitting data into separate column families (you can create one column family per column, for example).
Vlad,
Either you’re forgetting http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/, or I misunderstood you. What you wrote doesn’t sound like columnar compression to me.
Serge,
Sure; I’d be up for a deep dive.
Curt, it is columnar compression because:
1. Data is stored in column families (not in rows), and you can specify a single column family per column.
2. The standard HDFS compression is used to store the column family files.
The only difference between “real” columnar compression and HBase compression is that HBase (HDFS) uses generic compression algorithms, which, of course, are not as efficient as PFOR-delta or dictionary compression.
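To be concrete, here is a minimal sketch of that layout with the HBase client API of roughly this vintage (table and column names are made up, and later HBase releases renamed some of these classes); whether this deserves to be called columnar compression is exactly what is being debated above:

```java
// Illustrative sketch: one HBase column family per logical column, each
// family stored in its own HFiles and compressed with a generic codec.
// Table and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class ColumnPerFamilyTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor table = new HTableDescriptor("web_events");

        // One column family per column; each family's files are
        // compressed independently with gzip (LZO is the other stock option).
        for (String column : new String[] {"url", "referrer", "user_agent"}) {
            HColumnDescriptor family = new HColumnDescriptor(column);
            family.setCompressionType(Compression.Algorithm.GZ);
            table.addFamily(family);
        }

        admin.createTable(table);
    }
}
```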
In retrospect, though, I think Curt is correct in saying that MPP DBMS compression is more advanced than HDFS’s.
If you think about how Hadoop development is trending, it’s really only the uber-big installations in the tens or hundreds of petabytes that seem extremely invested in better compression algorithms, or HDFS RAID for that matter (Yahoo and Facebook come to mind).
For most of the rest of us Hadoop users, the cost of storage is still low enough that it’s easy to just throw more disk at the problem. When you are paying $300/terabyte for disk you are not out there killing yourself over x3 vs. x4 compression. On a 1-petabyte cluster the difference between x3 and x4 is $50K.
What is more hidden, of course, is the performance impact.
Vlad,
I’ll stand by my definition of “columnar compression”. 🙂
RCFiles are not true columnar compression — they are PAX-style block-columnar (inside a block, everything is sorted out so that data is stored column by column instead of row by row, but blocks are still stored consecutively).
There is a “truer” columnar format for hadoop called Zebra, part of the Pig project, but for some reason it hasn’t found wide adoption.
HBase’s HFiles are neither sequence files (sequence files being a specific implementation format), nor are they columnar storage. The HFile format is described briefly here: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html . As you can see, this is very much not a columnar format. One could, in theory, force HBase to be columnar by creating a separate column family for each column, but people don’t tend to run HBase with more than a handful of CFs for various practical reasons that I am sure someone on the hbase-users list would be happy to delve into :).
[…] or they’ll integrate their own tweaked versions of Apache Hadoop into their products. Other approaches will surely exist, and some probably will thrive, but it doesn’t look like Hadoop’s going […]
[…] Today there are hardly any important alternatives to Hadoop, although some forums have reported finding bottlenecks in Hadoop’s performance, as well as in the MapReduce parallelization approach. You can dig deeper into this at these links: Hadoop fatigue: emerging alternatives and Alternatives to MapReduce and Hadoop. […]