Hadoop futures and enhancements
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1” plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2”.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
*Hortonworks, a new Hadoop company spun out of Yahoo, graciously permitted me to post a slide deck outlining an Apache Hadoop roadmap. Phase 1 refers to stuff that is underway more or less now. Phase 2 is scheduled for alpha in October 2011, with production availability not too late in 2012.
You’ve probably heard some single point of failure fuss. Hadoop NameNodes can crash, which wouldn’t cause data loss, but would shut down the cluster for a little while. It’s hard to come up with real-life stories in which this has been a problem; still, it’s something that should be fixed, and everybody (including the Apache Hadoop folks, as part of Phase 2) has a favored solution. A more serious problem is that Hadoop is currently bad for small updates, because:
- Hadoop’s fundamental paradigm assumes batch processing.
- Both major workarounds to allow small updates are broken:
- HBase is seriously buggy, to the point that it sometimes loses data.
- Storing each update in a separate file runs afoul of a practical limit of 70-100 million files.
File-count limits also get blamed for a second problem, in that there may not be enough intermediate files allowed for your Reduce steps, necessitating awkward and perhaps poorly-performing MapReduce workarounds. Anyhow, the Phase 2 Apache Hadoop roadmap features a serious HBase rewrite. I’m less clear as to where things stand with respect to file-count limits.
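To make the small-update problem concrete, here is a minimal Java sketch of the two workarounds, using the HBase 0.90-era client API and the HDFS FileSystem API. The table, column family, and path names are hypothetical, and this is an illustration rather than anybody's production code: the HBase route depends on proper sync/flush support in the HDFS layer underneath, while the file-per-update route adds a NameNode metadata entry for every update, which is what runs into the 70-100 million file ceiling.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Workaround 1: route the small update through HBase.
        // (HBase 0.90-era client API; "events", "d", and "row-42" are made-up names.)
        Configuration hbaseConf = HBaseConfiguration.create();
        HTable table = new HTable(hbaseConf, "events");
        Put put = new Put(Bytes.toBytes("row-42"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("one small update"));
        table.put(put);   // durability here ultimately rests on HDFS sync/flush support
        table.close();

        // Workaround 2: write each update as its own tiny HDFS file.
        // Every file is another entry in the NameNode's in-memory metadata,
        // which is why this approach hits a practical ceiling in the tens of
        // millions of files.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/updates/update-42"));
        out.writeUTF("one small update");
        out.close();
    }
}
```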
Edits: As per the comments below, I should perhaps have referred to HBase’s HDFS underpinnings rather than HBase itself. Anyhow, some details are in the slides. Please also see my follow-up post on how well HBase is indeed doing.
The other big area for Hadoop improvement is modularity, pluggability, and coexistence, on both the storage and application execution tiers. For example:
- Greenplum/MapR and Hadapt both think you should have HDFS file management and relational DBMS coexisting on the same storage nodes. (I agree.)
- Part of what Hortonworks calls “Phase 2” sets out to ensure that Hadoop can properly manage temp space and so on next to HDFS.
- Perhaps HBase won’t always assume HDFS.
- DataStax thinks you should blend HDFS and Cassandra.
Meanwhile, Pig and Hive need to come closer together. Often you want to stream data into Hadoop. The argument that MPI trumps MapReduce does, in certain use cases, make sense. Apache Hadoop “Phase 2” and beyond are charted to accommodate some of those possibilities too.
Comments
Do you get the following from the Hortonworks folks, Curt?
“HBase is seriously buggy, to the point that it sometimes loses data.”
This seems like a strange statement given the scale of HBase deployments at the likes of Facebook, or even at Yahoo! itself with its 1,000-node cluster. I’m not saying HBase is without bugs, nor that it might not lose data in the extreme, but the implication in your article is that HBase is ‘broken’. This seems ‘off’.
And this statement is complete news to me (and, I believe, to the other members of the Apache HBase project management committee): “…the Phase 2 Apache Hadoop roadmap features a serious HBase rewrite.”
Where’d that come from?
Let us know if you’d like the Apache HBase committee’s point of view next time you write about HBase, Curt.
If this is what the Hortonworks guys are saying about HBase, then it is quite ironic, because some of their founders are the same people who have spent 18+ months resisting commits of the HDFS support HBase needs to avoid such problems, support that is running in production at places like Facebook and anywhere there is a Cloudera Distribution installation. And the thought that they are just going to take over HBase with a “serious rewrite” is laughable. We do welcome all contributions, except FUD and marketing or political games.
Andrew Purtell, HBase PMC
In fact, I went so far as to do the legwork of integrating the 0.20-append branch changes with their 0.20.203-security branch, aka “Yahoo Hadoop”, while these guys were still at Yahoo. Free labor and a contribution, and no meaningful response in return; at least, I haven’t heard anything back for months now. The fault lies not with HBase.
Having run a production HBase cluster, I can say that every single data-loss scenario was due to either bugs or fundamental flaws in HDFS: lack of sync, NameNode crashes, secondary NameNode bugs, and so on. So I’m not sure where you are getting your intel that HBase has flaws… obviously not from real users.
I would like to mention that Facebook is running a proprietary version of HBase.
@Vlad: The changes of consequence are all upstream to my knowledge.
As an engineer at Facebook working with HBase, I can say that we are definitely not running a “proprietary” version of HBase. We have internal branches that are based off of Apache HBase 0.89 and 0.90 releases, but that contain some patches that are currently slated for 0.92/0.94 (I suspect most large-scale installations are running with a model similar to this). As Andrew says, all changes of consequence are contributed to and available from Apache.
Would love more context around the notes that “HBase is seriously buggy and loses data” and that someone is planning (without the knowledge of the HBase community) to do an HBase rewrite.
Hi,
Very nice article.
Cloudera has good funding, so they can survive for some more time. But I heard about some layoffs at Cloudera, so I’m not sure about that.
I’ve also heard that Oracle is looking to buy some Big Data companies, such as Cloudera or DataStax.
Thanks, Nag
@Jonathan Gray
By “proprietary” I meant that nobody except Facebook engineers themselves is currently able to reproduce Facebook’s environment (HBase + HDFS): internal branches plus some number of patches (can we get the list of all these patches?). Can I get your Hadoop+HBase version as a downloadable tarball as well?
@HBase guys,
Yeah, it’s fairer to say the HBase/HDFS combination is buggy, and HDFS is being corrected accordingly. To me, that’s a distinction without a difference, but I can see where you might feel differently. Sorry.
By all means reach out to me directly if you want to talk about HBase!
Thanks,
CAM
Good article. A question about the coexistence of HDFS and relational storage on the same nodes: how do you coordinate resource and workload management in this scenario?
Dave
Dave,
That’s a big part of the engineering challenge. Greenplum, Hadapt, et al. each have to have an answer that amounts to “our software is in charge”.
@Curt, your comeback provides insufficient redress. Your article has at least two assertions that are wrong and that we are trying to help you fix.
1. You say HBase is ‘broken’, yet HBase is deployed in a number of locations, with at least two companies operating it at large scale (let me know if you’d like citations).
2. You talk of an HBase rewrite, yet no one in the HBase community knows what you are talking about.
Regarding ‘reaching out to me directly’ to talk about HBase, isn’t that what we are doing here? Is there another channel you’d have us reach you on?
Hi Folks,
The Hortonworks team is not planning an HBase rewrite or anything of the sort. We are talking about the fact that we believe Apache Hadoop 0.23 will be a great release for the HBase community.
When I talked to Curt, I talked about Apache Hadoop 0.20, the improved sync/flush support that will be available in the next release of Apache Hadoop, and other performance improvements. We did not discuss any failings of HBase or any plans to rewrite it. The Yahoo HDFS team has done a complete rewrite of the HDFS write pipeline since 0.20. That is the rewrite we discussed.
Hortonworks is committed to making Hadoop a great platform for HBase. HBase is a huge and growing part of the Apache Hadoop community, and the Hortonworks team is committed to working with the HBase community. I’ve never said anything to the contrary.
Thanks,
E14
@Curt: When you say “HDFS is being corrected accordingly”, you are speaking in the wrong tense. Based on the evidence of production installs that have the fix in place, you should be speaking in the past tense. This is a distinction with a difference.
Email is a good channel for communication.
I think the Hadoop 0.23 release will be the best release of Hadoop yet for running HBase.
I still believe a substantially better release to run HBase on is MapR.