Cloudera Enterprise and Hadoop evolution
I talked with Cloudera a couple of weeks ago in connection with the impending release of Cloudera Enterprise. I’d say:
- If you are or want to be a serious MapReduce user – and you’re past the “play around over the weekend” stage — you probably should have either:
- A serious non-DBMS MapReduce distribution.
- MapReduce integrated into your analytic DBMS.
- Both.
- The obvious choice for non-DBMS MapReduce is Hadoop.
- The obvious choice for a Hadoop distribution is Cloudera Enterprise.
- Cloudera Enterprise has three main aspects, in an inseparable bundle:
- Distributions for a double-digit number of open source projects. It’s nice having all that in one package – unless, of course, you like playing with Tinkertoys.
- Proprietary Cloudera code.
- Cloudera support.
- Cloudera says its proprietary code is – and is planned to remain – concentrated, at least in large part, on integrating open source technology with closed source products. This has the virtue of targeting the segment of the market that has proven it's actually willing to pay money for software.
- Cloudera Enterprise areas of focus, now and in the presumed future, include:
- Core Hadoop engine, which Cloudera says is quite predictably and appropriately evolving more slowly than the tools around it.
- Development, management and administrative tools, including:
- Pig and Hive. Cloudera says >70% of Facebook's Hadoop jobs are initiated through Hive, and the same is true of Yahoo and Pig.
- Connectivity to commercial tools.
- The product formerly known as “Cloudera Desktop.”
- Workflow, which in this context refers to letting you create a Hadoop application as a sequence of small steps, rather than forcing you to kluge it into being one unwieldy thing. At the moment, this is much less widely adopted than Pig and Hive, but Cloudera has high hopes for it, because of its obvious benefits in modularity and manageability.
- Quasi-DBMS technology. Besides Hive and Pig, this includes HBase. Cloudera says there has been considerable demand for HBase, and it is pleased that the project is now mature enough to ship. Cloudera stresses that it intends HBase not for OLTP, but as an adjunct to analytic processing. E.g., Cloudera suggests HBase would be a fine vehicle for replicating dimension tables across each node of a cluster.
- Data connectivity, e.g. to MySQL or to sensor log files.
- Cloudera Enterprise pricing is well below DBMS prices – not by a full order of magnitude, if I’m right about everybody’s quantity discount policies, but even so by a lot. Details are NDA.
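The dimension-table suggestion above amounts to a map-side join: if every node holds a full copy of a small dimension table (in HBase, per Cloudera), a map task can enrich fact records locally without any shuffle or join phase. A minimal pure-Python sketch of that pattern – table names and fields are invented for illustration:

```python
# Replicated dimension table -- in practice this would live in HBase
# on each node; a dict stands in for it here.
DIM_PRODUCTS = {
    "p1": {"name": "widget", "category": "hardware"},
    "p2": {"name": "gizmo",  "category": "electronics"},
}

def map_enrich(fact_row):
    """Mapper: look up the dimension row locally, emit an enriched record."""
    product_id, quantity = fact_row
    dim = DIM_PRODUCTS.get(product_id, {})
    return {
        "product_id": product_id,
        "quantity": quantity,
        "category": dim.get("category", "unknown"),
    }

facts = [("p1", 3), ("p2", 7), ("p9", 1)]
enriched = [map_enrich(f) for f in facts]
```

Because the lookup table is replicated rather than shuffled, the join costs nothing beyond a local read per fact row.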
Cloudera sometimes sends confusing signals about its beliefs and strategies. For example, one can get different stories depending on whether one talks to:
- Somebody at Cloudera who comes primarily from the user and open source communities.
- Somebody at Cloudera who has actually worked at a software company before.
But I predict that Cloudera will now stick for a while with more or less the strategy outlined above.
Naturally, we also talked about Hadoop adoption. Highlights of that part – no doubt somewhat biased towards Cloudera’s own customer base — included:
- Notwithstanding eBay’s prior skepticism about MapReduce, it is quoted saying nice things in a Cloudera press release, and has apparently become quite a large Hadoop user, starting out with a search-quality use case.
- Typical Hadoop deployment sizes are 10 nodes or so when experimenting, 80-500+ in production.
- 10 terabytes/node – I’m pretty sure Cloudera meant of user data — is not inconceivable, so a cost-conscious 500-node user could have 5 petabytes of data managed by Hadoop.
- Cloudera has half a dozen customers at the 75+ node production level.
- Web and financial services are the two vertical markets moving most aggressively into Hadoop production. The government is also in significant Hadoop production, but the details of that are classified.
- Web uses for Hadoop include:
- Clickstream – sessionization, etc. – that’s a super-mainstream use.
- Search – analyzing search attempts in conjunction with structured data.
- Machine learning (for ad serving, etc.).
- Financial services uses for Hadoop include:
- Internal trading rule enforcement/fraud detection.
- Complex ETL.
- Portfolio risk assessment (typically overnight).
None of this is inconsistent with previous surveys of Hadoop use cases.
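For the clickstream case, sessionization – the "super-mainstream" use above – means splitting each user's click history into sessions wherever the gap between consecutive clicks exceeds some inactivity threshold. A minimal sketch, assuming a 30-minute threshold (a common but here assumed choice); in Hadoop this logic would run in the reduce phase, after clicks are shuffled by user id:

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(events):
    """Split each user's (user, timestamp) clicks into sessions
    wherever the gap between consecutive clicks exceeds SESSION_GAP.
    sorted() stands in for Hadoop's shuffle/sort on user id."""
    sessions = []
    for user, clicks in groupby(sorted(events), key=lambda e: e[0]):
        current, last = [], None
        for _, ts in clicks:
            if last is not None and ts - last > SESSION_GAP:
                sessions.append((user, current))
                current = []
            current.append(ts)
            last = ts
        sessions.append((user, current))
    return sessions

t = lambda h, m: datetime(2010, 7, 1, h, m)
clicks = [("alice", t(10, 0)), ("alice", t(10, 10)),
          ("alice", t(11, 0)), ("bob", t(9, 0))]
result = sessionize(clicks)
```

Here alice's 50-minute pause before 11:00 splits her clicks into two sessions, while bob's single click is its own session.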
Various users talked at the Hadoop Summit this week. I wasn’t there, and won’t write about their stories for now. That said, Twitter’s slide deck from same has some interesting stuff, including:
- 7 TB/day ETLed from MySQL.
- Accordingly, petabytes of stored data coming soon.
- Open sourcing their ETL tool Crane.
- 3-4X LZO compression at little CPU cost.
- HBase is more usable for them than HDFS, which isn't mutable enough.
- Pig requires ~5% of the code and coding effort of vanilla Hadoop, at a performance hit of 30% or less.
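Some context for that last comparison: even a trivial job in "vanilla" Hadoop means writing an explicit mapper and reducer, where a Pig script expresses the same thing in a few lines. A Hadoop-streaming-style word count sketched in Python – the framework would normally feed these functions via stdin/stdout; here they run over in-memory data for illustration:

```python
from itertools import groupby

def mapper(line):
    """Map step: emit a (word, 1) pair for each word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    """Reduce step: sum counts per word. Hadoop sorts map output by
    key before reducing; sorted() stands in for that shuffle here."""
    counts = {}
    for word, group in groupby(sorted(pairs), key=lambda p: p[0]):
        counts[word] = sum(n for _, n in group)
    return counts

lines = ["to be or not to be", "be quick"]
pairs = [p for line in lines for p in mapper(line)]
counts = reducer(pairs)
```

The equivalent Pig Latin is roughly a LOAD, a GROUP BY, and a COUNT – which is the kind of code-volume gap Twitter is describing.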
Comments
“Cloudera has half a dozen customers at the 75+ node production level.”
Using Hadoop (and M/R) makes sense only at something like Facebook or LinkedIn scale (1,000+ servers, 5-10+ petabytes of data). Why do you think this technology was invented by Google and not by Walmart's IT department? Walmart is huge as well, but its data volume is nothing in comparison with Google's. If you do not have petabytes of data (read: if you are not Google/Yahoo/Microsoft), you'd do better to look for something more traditional: custom-built software or an existing commercial product.
One high-end server with problem-optimized software can easily beat a 50-node Hadoop cluster on virtually any task.