Hadoop distributions: CDH 4, HDP 1, Hadoop 2.0, Hadoop 1.0 and all that
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production (this post).
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92.
The posts depend on each other in various ways.
My clients at Cloudera and Hortonworks have somewhat different views as to the maturity of various pieces of Hadoop technology. In particular:
- Cloudera introduced CDH 4* and Hortonworks introduced HDP 1*, both timed for the recent Hadoop Summit.
- CDH 4 is based mainly on Hadoop 2.0, which Cloudera says it has tested extensively.
- HDP 1 is based on Hadoop 1.0, on the theory that nobody has properly tested Hadoop 2.0, which is still characterized as “alpha”.
- CDH 4 boasts sub-second NameNode failover.
- Hortonworks is partnering with third parties such as VMware to address the high-availability problems caused by failover potentially taking several minutes.
- Hadoop 2.0 and CDH 4 also incorporate improvements to NameNode scalability, HDFS (Hadoop Distributed File System) performance, HBase performance, and HBase functionality.
- As does CDH 4, HDP 1 includes HCatalog, an extension of Hive technology that serves as a more general metadata store. (Edit: Actually, see the comment thread below.)
- Hortonworks thinks HCatalog is a big deal in improving Hadoop data management and connectivity, and already has a Talend partnership based on HCatalog. Cloudera is less sure, especially in HCatalog’s current form.
- HDP 1 includes Ambari, an Apache open source competitor to Cloudera Manager (the closed-source part of Cloudera Enterprise). Hortonworks concedes a functionality gap between Ambari and Cloudera Manager, but perhaps a smaller one than Cloudera sees.
- Hortonworks thinks Ambari being open source means better integration with other management platforms. Cloudera touts Cloudera Manager 4's integration features and the integrations already built on them.
- Nobody seems confident that MapReduce 2 is ready for prime time. While it’s in CDH 4, so is MapReduce 1.
*“CDH” stands, due to some trademarking weirdness, for “Cloudera’s Distribution including Apache Hadoop”. “HDP” stands for “Hortonworks Data Platform”.
The whole thing seems like a big example of Miles’ Law: Where you stand depends upon where you sit. Cloudera’s embrace of more advanced Apache Hadoop technology is accompanied by claims such as “We built a lot of this ourselves” and “We’ve already tested this stuff at length.” I find Cloudera’s claims credible, and look forward to Hortonworks’ near-future declarations that those Hadoop 2.0 features are “now” enterprise-ready.
For HCatalog, however, the positions are reversed: Hortonworks is the enthusiast and Cloudera the skeptic.
For now, my views on selecting Hadoop distributions start:
- For most enterprises, the Hadoop distribution you should go with is still CDH.
- I think Cloudera and Hortonworks are headed for a duopoly in general-purpose Hadoop distributions, and Hortonworks may achieve rough parity sooner than Cloudera likes. But at the moment Cloudera still seems well ahead.
- The same partners who root for Hortonworks to beat Cloudera also point out that they have worked with Cloudera for longer than Hortonworks has even existed. So while those partners are a plausibility argument for Hortonworks catching up with Cloudera in the future, they don’t show a Hortonworks advantage at this time.
- I think it’s already too late in the history of Hadoop to commit to other variants, such as MapR. But there can be credible and useful claims of Hadoop functionality in products like, for example, the DataStax/Cassandra stack.
- The wild card here is Amazon, which in some ways can be said to have majority Hadoop market share all by itself. One of the week’s announcements was some kind of optional integration between MapR and Elastic MapReduce.
Comments
14 Responses to “Hadoop distributions: CDH 4, HDP 1, Hadoop 2.0, Hadoop 1.0 and all that”
Thank you for the post. I believe this is one of, if not THE first comparison studies of Apache Hadoop commercial distributions.
We must note, however, that while HCatalog is a big part of HDP, it is not shipped as part of either CDH4 or CDH3. (For reference, here is a link to the CDH package: http://tinyurl.com/7tvs3qz.)
Thanks again, Curt…
Curt –
You write “As does CDH 4, HDW” – did you mean HDP there, or is HDW a feature in something?
HDW is a typo for HDP. I thought I’d fixed all the instances of that. Let me go back and search for more. 🙂
[…] made the marketing decision to include YARN in its June Hadoop distribution release anyway, but advised that it was for experimentation rather than […]
[…] ETL tools such as Talend. […]
[…] straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production […]
[…] interesting idea, and a good hook for my first shot at writing about HCatalog. Indeed, other than the Talend integration bundled into Hortonworks’ HDP 1, Teradata SQL-H is the first real use of HCatalog I’m aware […]
[…] I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of […]