Hadoop distributions
Elephants! Elephants!
One elephant went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Elephants! Elephants!
Two elephants went out to play
Sat on a spider’s web one day.
They had such enormous fun
Called for another elephant to come.
Elephants! Elephants!
Three elephants went out to play
Etc.
— Popular children’s song
It’s Strata week, with much Hadoop news, some of which I’ve been briefed on and some of which I haven’t. Rather than delve into fine competitive details, let’s step back and consider some generalities. First, about Hadoop distributions and distro providers:
- Conceptually, the starting point for a “Hadoop distribution” is some version of Apache Hadoop.
  - Hortonworks is still focused on Hadoop 1 (without YARN and so on), because that’s what’s regarded as production-ready. But Hortonworks does like HCatalog.
  - Cloudera straddles Hadoop 1 and Hadoop 2, shipping aspects of Hadoop 2 but not recommending them for production use.
  - Some of the newer distros seem to be based on Hadoop 2, if the markitecture slides are to be believed.
- Optionally, the version numbers of different parts of Hadoop in a distribution could be a little mismatched, if the distro provider takes responsibility for testing them together.
  - Cloudera seems more willing to do that than Hortonworks.
- Different distro providers may choose different sets of Apache Hadoop subprojects to include.
  - Cloudera seems particularly expansive in what it is apt to include. Perhaps not coincidentally, Cloudera folks started various Hadoop subprojects.
- Optionally, distro providers’ additional proprietary code can be included, to be used either in addition to or instead of Apache Hadoop code. (In the latter case, marketing can then ensue about whether this is REALLY a Hadoop distribution.)
  - Hortonworks markets from a “more open source than thou” stance, even though:
    - It is not a purist in that regard.
    - That marketing message is often communicated by Hortonworks’ very closed-source partners.
  - Several distro providers, notably Cloudera, offer management suites as a big part of their proprietary value-add. Hortonworks, however, is focused on making open-source Ambari into a competitive management tool.
  - Performance is another big area for proprietary code, especially from vendors who look at HDFS (Hadoop Distributed File System) and believe they can improve on it.
  - I conjecture packaging/installation code is often proprietary, but that’s a minor issue that doesn’t get mentioned much.
- Optionally, third parties’ code can be provided, open or closed source as the case may be.
Most of the same observations could apply to Hadoop appliance vendors.
Besides code, Hadoop distribution providers commonly offer support. The Hadoop support situation is confused, largely because:
- Marketing around Hadoop support capabilities and experience is sparse …
  - … except for the Hortonworks vs. Cloudera General Hadoop Expertise Urinary Olympics.
- I don’t hear a lot of complaints about anybody’s Hadoop support.
That said:
- One should distinguish between, say, Tier 1 and Tier 3 support.
- Since most serious Hadoop development is done by Cloudera and Hortonworks, those two vendors are by far the best qualified to do Tier 3+ support.
- Since Cloudera has the most Hadoop market share to date, it also has the most Hadoop support experience (any and all tiers).
- Some of the other contenders are huge companies that presumably know how to support enterprise customers. This includes both distro providers and others (e.g. Oracle, which sells a Cloudera-based appliance and handles Tier 1 support for that itself).
And finally, reasons that come to mind for choosing particular distributions include:
- Cloudera
  - Cloudera Manager is (relatively speaking) mature.
  - Cloudera Navigator seems promising.
  - Cloudera has the most experienced Hadoop services operation.
  - Cloudera has the development “axe” in some parts of Hadoop and is second only to Hortonworks in the others.
  - Cloudera has lots of partner support.
  - Cloudera is the best-funded company whose main business is Hadoop.
- Hortonworks
  - With the arguable exception of Cloudera, Hortonworks has much more Hadoop expertise than any other outfit, including the development “axe” in a variety of areas.
  - Hortonworks has lots of partner support.
  - Hortonworks is the second-best-funded company whose main business is Hadoop.
  - Because of its low reliance on proprietary code, Hortonworks has great “escapability”, and correspondingly weak pricing power vs. its customers.
- Intel
  - Intel’s Hadoop performance hacks may be legit.
  - Intel was evidently early in supporting Chinese Hadoop users.
- EMC/Pivotal/Greenplum
  - If you want to use the Greenplum DBMS, using the Pivotal/Greenplum Hadoop distribution too would seem to be thematic.
- MapR
  - At one point MapR seemed to have a performance advantage. I don’t know whether that’s still the case.
- IBM
  - Some believe that IBM removes obstacles, and grants blessings of prosperity and wisdom.
Comments
I don’t have a link, but the Hortonworks Windows code for Hadoop was just added to the Apache Hadoop trunk, making it true Hadoop again.
“Five elephants went out one day,
Upon a spider’s web to play,
They had such tremendous fun,
But the web it broke and they all fell down”
or
“Five little elephants went out to play,
Upon a spider’s web one day.
The web went creak, the web went crack,
And five little elephants came running back”
Lots of entrants into the Hadoop space, having tremendous fun. But the web could not support all of the players and they all fell down. Well, not exactly all: Cloudera, MapR and the Consortium are still having fun.