“Enterprise-ready Hadoop”
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop (this post)?
- HBase 0.92.
The posts depend on each other in various ways.
Cloudera, Hortonworks, and MapR all claim, in effect, “Our version of Hadoop is enterprise-ready, unlike those other guys’.” I’m dubious.
- Hortonworks has considerably fewer features than Cloudera, along with less of a production or support track record. (Edit: HCatalog may be a significant exception.)
- I doubt Cloudera really believes or can support the apparent claim in its CDH 4 press release that Hadoop is now suitable for every enterprise, whereas last month it wasn’t.
- While MapR was early with some nice enterprise features, such as high availability or certain management UI elements — quickly imitated in Cloudera Enterprise — I don’t think it has any special status as “enterprise-ready” either.
That said, “enterprise-ready Hadoop” really is an important topic.
So what does it mean for something to be “enterprise-ready”, in whole or in part? Common themes in distinguishing between “enterprise-class” and other software include:
- Usable by our existing staff.
- Sufficiently feature-rich.
- Integrates well with the rest of our environment.
- Fits well into our purchasing and vendor relations model.
- Well-supported.
- Sufficiently reliable, proven, and secure — which is to say, “safe”.
For Hadoop, as for most things, these concepts overlap in many ways.
There are two major kinds of usability issues in Hadoop:
- Programming. Since the whole point of MapReduce is to make parallel programming only slightly harder than ordinary programming, I’d say Hadoop has been enterprise-ready in this respect since Day 1. Hadoop demands good programmers, but it doesn’t demand great ones.
- Administration. It would be nice if Hadoop administration tools combined the best features of tools used to manage scientific clusters, clustered relational databases, clustered storage systems, and networks. They have a ways to go. But I think we’re already at the point that general cluster management challenges shouldn’t be a barrier to adopting Hadoop.
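The programming point above can be sketched concretely. Below is a minimal word count in the Hadoop Streaming style — the mapper and reducer are ordinary line-in/line-out functions, and the framework supplies the parallelism, shuffle, and fault tolerance around them. The function names and sample input are illustrative, not from any particular Hadoop codebase:

```python
# Word count in the Hadoop Streaming style: a mapper emits key/value
# pairs, the framework sorts them by key ("shuffle"), and a reducer
# aggregates each key's values. Names and sample data are illustrative.
from itertools import groupby

def mapper(lines):
    # Emit one tab-separated "word<TAB>1" pair per word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key (Hadoop's shuffle guarantees this);
    # sum the counts for each distinct word.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in group)}"

sample = ["the quick brown fox", "the lazy dog"]
shuffled = sorted(mapper(sample))  # stands in for the framework's shuffle step
print(list(reducer(shuffled)))
# → ['brown\t1', 'dog\t1', 'fox\t1', 'lazy\t1', 'quick\t1', 'the\t2']
```

That’s the whole programming model: two small sequential functions, with the hard distributed-systems work pushed into the framework — which is why “good, not great, programmers” suffices.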
As for data management features — Hadoop isn’t across-the-board competitive with analytic relational DBMS. (And the same goes for HBase vs. short-request alternatives.) But the real question is whether its features are good enough for a variety of important tasks. And to that, the answer at many enterprises is an emphatic Yes.
When it comes to integration:
- Hadoop generally runs on its own cluster, or in the public (generally Amazon) cloud, or in some cases on a cluster shared with another data management system. (E.g. DataStax/Cassandra, Hadapt/PostgreSQL, or IBM Netezza.) Anyhow, requiring a dedicated cluster isn’t a deal-breaker.
- Hadoop’s data integration/ETL story is already decent, and it’s getting better fast.
- Hadoop management tools are in the early days of being integrated into more general management tool environments. But I don’t see why the need for standalone management tools should be an enterprise deal-breaker.
- As for software running on top of Hadoop — pending future posts, I’ll just say that the ability to run anything analytic on Hadoop is being assembled fast, but performance is something that needs to be assessed on a case-by-case basis.
Hadoop is already a good match for most enterprises’ buying practices. A thankfully large fraction of them are already content with open source (or open core) subscription models. For the rest, there are always options like the Oracle appliance. In connection with that, Cloudera has been providing enterprise Hadoop support for a while, and now Hortonworks is getting into the game as well.
And so we circle to the final point, which intersects with most of the others — “Is this new-fangled Hadoop stuff safe?”
The story on unplanned downtime goes something like this:
- Hadoop has never crashed all that much.
- As of this month, pretty much anything that passes for a Hadoop distribution has an answer for Hadoop’s most famous single point of failure, the one at the NameNode.
- HBase has added some capabilities in inter-data-center replication. (I’m not clear on the details.)
- Otherwise, formal disaster recovery for Hadoop seems more theoretical than practical.
For the most part, Hadoop use cases are either HBase or batch. For enterprise batch use, Hadoop’s reliability should already be fine. As for HBase — well, I’m not sure most enterprises would bet all that much on a 0.92 open source project with so little vendor sponsorship.
As for planned Hadoop downtime — theoretically, there should be very little; if you have a lot, it’s because your management tools and processes aren’t ideal. Temporary performance surprises may be harder to avoid, however, since Hadoop concurrency and workload management are still rudimentary, pending the maturity of MapReduce 2.
Hadoop security still seems pretty basic. Kerberos got in about a year ago, but I’ve only heard about role-based security and so on in the context of HBase, and that only in the latest release.
And finally, for the gut-feel question of proven — I think Hadoop is proven indeed, whether in technology, vendor support, or user success. But some particularly conservative enterprises may for a while disagree.
Comments
Curt, fantastic post. Some comments:
“Anyhow, requiring a dedicated cluster isn’t a deal-breaker.” Can you elaborate here? From how I’m interpreting it, I’m not so sure I agree. You call out the DataStax and Hadapt models (which are distinctly different from one another, but, is anyone actually using either?) – and I’d lump HBase region servers in there as well – but even so, I haven’t seen anyone running a TaskTracker on their Tomcat or WebSphere servers. Have you? Would it not follow that Hadoop-ish clusters are thus ‘dedicated’? Even if they are, why is that a barrier to Enterprise adoption? Enterprises provision all sorts of stuff all the time (database appliances, for a contemporary example…).
I agree that Hadoop doesn’t crash ‘all that much’, but on each “distribution has an answer for Hadoop’s most famous single point of failure, the one at NameNode.” – at Hadoop Summit last week, Facebook attributed roughly 10% of their HDFS failures to NameNode HA issues (they have a solution too – if their solution didn’t exist, they’d go down 10% more of the time.) Go figure.
On HBase intra-DC “replication”, I find this post particularly useful in explaining how it works today and what the design assumptions are: http://www.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/
“Hadoop use cases are either HBase or batch” – I assume you mean what’s in the Apache Hadoop project, strictly speaking. Hadoop (HDFS+MR/YARN) is used in conjunction with ‘real-time’ data ingestion and analysis techniques left and right. If you consider Hadoop itself aside from these complementary building blocks, then yes, Hadoop is batch unless it’s HBase.
[…] In general, how “enterprise-ready” is Hadoop? […]
Joe,
I’m not disputing that Hadoop (usually) needs a dedicated cluster. I’m just saying that that need isn’t some kind of deficiency in enterprise-readiness.
I am not sure we (Hortonworks) agree that HDP1 has “considerably” fewer features. First, thank you for adding the note about HCatalog, but it is also important to note that we provide WebHDFS and data integration (via Talend). Yes, you could download TOS4BD directly from Talend and use it with any distribution, but the level of integration with HDP is deeper than with others. The technical relationship allowed us to share development and harden their support of HCatalog and Oozie with our engineering and test teams. The same is true with other partners who have chosen HDP because it allows for deeper integration with their offerings.
If we compare the two distributions, they have very similar components. However, if you extend this to compare what is available for free open source download, I would say we are ahead. The Cloudera management tool requires a license. The Hortonworks Management Center is part of the core HDP download and is 100% open source.
[…] Monash, writing on The DBMS2 blog, addressed the enterprise readiness of Hadoop […]
The core premise of Hadoop is to enable complex analytics (i.e. not ‘just’ SQL queries) to happen at scale without breaking the bank, through the use of open source software and clustering potentially lots of commodity tin.
There is no doubt that this is a paradigm shift in the world of analytics, possibly the biggest since MPP databases came on the scene.
For those enterprises whose requirements aren’t of sufficient size or complexity for Hadoop to be the answer, ‘enterprise readiness’ is a moot point.
In many enterprises, Hadoop is likely to be a solution looking for a problem that doesn’t exist.
Inappropriate Hadoop adoption is a bigger issue than concerns over enterprise-readiness, mainly due to folks wanting to jump on the ‘big data’ bandwagon and the relatively low barriers to entry.
Paul,
Besides being (arguably) cheap, Hadoop is a highly flexible ETL tool. Dynamic schemas can make sense even in relatively low-volume use cases.
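A minimal sketch of what that “dynamic schema” ETL flexibility means in practice — the input records, field names, and helper functions here are hypothetical, but the pattern is the schema-on-read one Hadoop pipelines rely on: the schema is discovered from the data, rather than declared before loading, so records whose fields vary from row to row don’t get rejected.

```python
import json

# Hypothetical input: newline-delimited JSON event records whose
# fields vary from record to record -- the "dynamic schema" case
# a rigid relational load would reject.
raw = """\
{"user": "alice", "event": "click", "page": "/home"}
{"user": "bob", "event": "purchase", "amount": 19.99, "items": 2}
{"user": "alice", "event": "click"}
"""

def discover_schema(records):
    # The schema is the union of all fields seen, in first-seen order --
    # it emerges from the data instead of being declared up front.
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    return fields

def to_rows(records, fields):
    # Flatten the variable records into uniform rows, padding missing
    # fields with None, ready to hand off to a downstream store.
    return [[rec.get(f) for f in fields] for rec in records]

records = [json.loads(line) for line in raw.splitlines()]
schema = discover_schema(records)
rows = to_rows(records, schema)
print(schema)   # → ['user', 'event', 'page', 'amount', 'items']
print(rows[1])  # → ['bob', 'purchase', None, 19.99, 2]
```

In a real pipeline the per-record work would run as MapReduce tasks over files in HDFS; the point is only that no table definition has to exist before the data lands.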