Cloudera Hadoop strategy and usage notes
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search.
- “Math,” which seems to be mainly through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this greatly outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”:
- From general-purpose data stores, mainly RDBMS, analytic or otherwise. This sounds similar to Hortonworks’ views about efficiency-oriented offloading; batch work can be moved to Hadoop, saving costs and/or getting more mileage from costs that are already sunk into expensive legacy installations. The top targets here are large, centralized systems, with Teradata being a clear #1 and IBM mainframes a probable #2, but anything from Oracle to newer parallel analytic RDBMS is fair game.
- From the specialized data stores associated with fuller technology stacks. The example I had in mind was Splunk; Charles added Palantir, HP Arcsight and, in the past, Endeca. The idea here is that Hadoop is used to organize and/or index data the way those products’ native data stores would, but in higher volumes than they are (cost-)effective for.
On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC at this time only works under Hive, while Parquet can also be used for Hive, MapReduce, etc. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a file format or wind up each supporting both of them. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better — or at least run faster — if it can make reliable assumptions about how data is actually stored.
Related links
- In connection with its 0.1 version, Jakob Homan of LinkedIn contrasted Giraph to MapReduce-based graph processing.
- I wrote a series about graph processing in May, 2012.
- MPI used to be a higher Hadoop priority (August, 2011). That’s why I’ve kept bringing it up.
Comments
I would definitely love to see the MapR side of things as well, just to complete the picture. I think that, with the exception of Spark, nobody has really succeeded in making batch and real-time live together, and this is really surprising.
Do you think MapR has enough customers to be in a position to generalize about anything?
With all respect, yes I do. Perhaps they don’t have as many strategic alliances as Cloudera or Hortonworks, and perhaps they can be blamed for trying to distance themselves from the normal Hadoop way, but they still have very interesting peculiarities that others don’t. I would go even further: there are certain features others are now copying from MapR’s strategic lines, so I think it is good to have a company thinking ahead, or at least differently.
Those are about 3 different subjects.
MapR has had some interesting ideas. But that doesn’t necessarily mean they’re a trustworthy source of insight into the Hadoop market or any other subject.
They’ve only screwed me over once, so they’re not on my blacklist of companies I don’t want to hear from. But it’s remarkable how many ordinarily pleasant/easy-going folks I know complain about the mismatch between MapR’s words and reality.
Thanks Curt. One small tweak: while I’ve seen folks use Hadoop as the backing engine for Endeca, Arcsight, Splunk and Palantir systems, I don’t know whether there are customer savings going on there. It’s been more of a scaling strategy for large deployments.
And yes, you’re correct that both SAS and Revolution bypass MapReduce for their newer products in favor of their own execution frameworks, which are more efficient for running stats and machine learning workloads. Just as with Impala, by building a purpose-built framework they are able to get 1-2 orders of magnitude gains in performance over an MR-library-type approach.
Charles,
When one starts drawing distinctions as to whether scaling & performance serve to make things cheaper, better, or just possible at all, the answer often is no more precise than a simple “Yes”. 🙂
That was the point of http://www.strategicmessaging.com/the-marketing-of-performance/2012/04/25/
So Curt, about those three “Elephants”… To be honest, being from Europe, I’ve never heard of a Hortonworks implementation, and given the way they are expanding, I don’t see them entering this market any time soon. With them excluded, that leaves only Cloudera or MapR. What I don’t like about Cloudera (and any standard HDFS with its 64 MB blocks) is that it cannot handle well (or at all) projects consisting of many (millions or billions of) small files. The second thing I don’t like about Cloudera is their support: they really delay answers, or don’t give any at all. Of course, there are still a lot of comparisons to be made as to whether HDFS or MapRFS is better, and one can have a lot of arguments here. I hope you don’t mind this little talk; I am, as you said, just an ordinarily well-intentioned guy who wants to learn what is best for his company.
Norbert,
There’s little doubt that back in 2011 or so, the MapR butt-kick was useful for getting people more serious about accelerating HDFS improvements. I blogged about that at the time. But at this point I’m more concerned with who’s going to do a better job on Hadoop 2 than with who has workarounds/alternatives for various deficiencies of Hadoop 1.
Your dissatisfaction with Cloudera support is interesting & noted. Thanks.
Curt, quick note.
Pig, MapReduce and others can easily access ORC via HCatalog.
Thanks.
Thanks, Arun.
Curt, I appreciate your update to the post. Thanks.
Norbert,
Hortonworks landed in EMEA at the beginning of January 2013. We’re a team of around 20 people across Europe, and we’re happy to come and talk with you whenever it suits.
If you look at our website over the next few weeks, you should see quite a few of our European customers’ logos appearing.
Thanks,
Olivier
Norbert,
Hortonworks reached Europe earlier this year. We are expanding quickly and already have technical field and support teams in both the UK and Germany, with customers all across Europe.
I lead the technical team in the region so feel free to reach out to me directly.
Chris
Regarding ORC vs. Parquet:
While those two formats might look superficially similar, there is at least one major difference: Parquet implements ColumnIO’s (Dremel’s) repetition and definition levels to represent nested structure, whereas ORCfile contains element counts at every level of a nested path. The latter means that ORCfile isn’t a true columnar format: to reconstruct a column a.b.c.d, one also needs to read the element counts for a.b and a.b.c in addition to the data of a.b.c.d. With ColumnIO and Parquet, it’s only necessary to read column a.b.c.d, which has the information needed to reassemble the full structure embedded in it.
While this might sound like a minor difference, in practice it can have a big impact on performance: analytic models told us that for a structure only three levels deep you might end up doing something like 30% more random I/O. This was the reason why Cloudera and Twitter decided to create Parquet instead of sticking with Trevni (which also used ORCfile’s representation of nesting).
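To make the contrast above concrete, here is a toy sketch of the two encodings for a two-level nested column a.b (a repeated group “a” containing a repeated int “b”). This is my own illustration with made-up record shapes and helper names, not code from Parquet or ORC, and it ignores everything the real formats add on top (optional fields, encodings, compression, indexes):

```python
# Toy records of shape { "a": [ { "b": [ints...] }, ... ] }:
# "a" is a repeated group, "b" is a repeated int inside it.
records = [
    {"a": [{"b": [1, 2]}, {"b": [3]}]},
    {"a": []},
    {"a": [{"b": []}]},
]

# Dremel/Parquet-style: a single leaf stream in which every value carries a
# repetition level (where in the path the repetition happened) and a
# definition level (how many levels of the path are actually present).
def dremel_encode(records):
    out = []  # (value, repetition_level, definition_level)
    for rec in records:
        a_list = rec["a"]
        if not a_list:                      # empty record: one NULL entry
            out.append((None, 0, 0))
            continue
        first_a = True
        for a in a_list:
            r_a = 0 if first_a else 1       # later "a" entries repeat at level 1
            first_a = False
            b_list = a["b"]
            if not b_list:
                out.append((None, r_a, 1))  # "a" present, "b" empty
                continue
            first_b = True
            for b in b_list:
                r = r_a if first_b else 2   # later "b" values repeat at level 2
                first_b = False
                out.append((b, r, 2))
    return out

# Count-based style (as described for ORCfile above): the leaf stream holds
# only values, and each intermediate level keeps its own element counts.
def count_encode(records):
    counts_a = [len(rec["a"]) for rec in records]
    counts_b = [len(a["b"]) for rec in records for a in rec["a"]]
    values = [b for rec in records for a in rec["a"] for b in a["b"]]
    return counts_a, counts_b, values

print(dremel_encode(records))
# [(1, 0, 2), (2, 2, 2), (3, 1, 2), (None, 0, 0), (None, 0, 1)]
print(count_encode(records))
# ([2, 0, 1], [2, 1, 0], [1, 2, 3])
```

The trade-off being argued here is visible even in the toy version: the Dremel-style leaf stream is self-contained but carries two small integers per value, while the count-based layout keeps the leaf stream minimal but needs the parent streams to rebuild the nesting.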
Norbert, I run the Cloudera Customer Support team and I’m disappointed to hear that you’ve not had a good experience with us. I’d like to invite you to contact me directly so that we can address your concerns. Please email me at aklein@cloudera.com.
Regards,
Angus Klein
Cloudera Snr. Support Director
It is a fair point that Parquet and ORC differ in their representations of deeply nested data structures (struct, map, list, and union), but in our experience most customers use simple column types. Furthermore, I’d like to emphasize that ORC is a true columnar format, because it stores the data in column-major order. Hive has also had RCFile, which is likewise columnar, for several years.
Parquet chose to push the metadata about the records’ structure down to the leaves, while ORC puts it in the intermediate columns. For example, if a column is a list of structures that contain maps of string to string, Parquet will replicate the structural information down to each of the leaf columns, whereas ORC will store a single copy of it in the parent column.
There are two consequences to Parquet’s format:
* Parquet’s record assembly is significantly more complex than ORC’s.
* Because the data from the intermediate columns is repeated in each leaf, Parquet files are significantly larger than ORC files and require more IO (upwards of 30% in many cases). The difference is even larger if you turn off indexes in ORC to match the lack of indexes in Parquet.
To be fair, if the leaf column is inside of 3 levels of nested lists, then Parquet will have 2 integers (repetition and definition levels) and ORC will have 3. However, that is a fairly unusual table definition in enterprises.
Addressing Marcel’s comment about random IO: ORC makes a single pass through each stripe, reading precisely the required bytes and merging adjacent byte ranges into a single large read. As a result, there is no random IO. In IO benchmarks performed by a customer on Amazon S3, ORC did 10x better than RCFile because it does far fewer seeks.
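As a rough sketch of the byte-range coalescing Owen describes (just the general idea, assuming a read plan expressed as (offset, length) pairs; this is not ORC’s actual reader code):

```python
# Merge byte ranges that touch (or are within max_gap bytes of each other)
# so that a stripe's required streams can be fetched with a few large
# sequential reads instead of many small seeks.
def merge_ranges(ranges, max_gap=0):
    merged = []  # list of (offset, length)
    for off, length in sorted(ranges):
        prev_end = merged[-1][0] + merged[-1][1] if merged else None
        if merged and off <= prev_end + max_gap:
            # extend the previous read to cover this range
            end = max(prev_end, off + length)
            merged[-1] = (merged[-1][0], end - merged[-1][0])
        else:
            merged.append((off, length))
    return merged

# Three column streams whose ranges sit next to each other in the stripe
# collapse into one large read; the distant range stays separate.
print(merge_ranges([(0, 100), (100, 50), (400, 10)]))  # [(0, 150), (400, 10)]
```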
Concerning the partnership between Revolution Analytics and Cloudera, our next release (Revolution R Release 7) will run natively in Hadoop, with analytic processing distributed across the Hadoop cluster. Programs written in Revolution R will be fully portable to or from Hadoop, with minor changes.
Outperforming Mahout is not the point. Mahout is fairly strong in recommendation engines and clustering, but weak in predictive modeling; Mahout functionality complements R.
Curt, is Cloudera paying you? If they are your client, you should stipulate this on every article you blog, to keep full disclosure.
Your posts do seem anti MapR, Hortonworks, Pivotal, Intel and IBM, and it would be useful to know whether we can separate opinion from fact.
My organization is a consumer of Hadoop; we prefer technical facts, and thus we enjoy some of your non-Hadoop posts.
Chris
Chris,
You should assume that just about anybody is a past, present or potential future client. I’ve gotten tired of repeating it in every post. But yes, Cloudera has been a better/more consistent client than other Hadoop distro vendors.
That said, if I were to identify vendors who have shown visible anger and disappointment at what I’ve written about them, Cloudera might rank fairly high.