Cloudera Hadoop strategy and usage notes
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- Search.
- “Math,” which seems to be mainly through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this greatly outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system.
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”:
- From general-purpose data stores, mainly RDBMS, analytic or otherwise. This sounds similar to Hortonworks’ views about efficiency-oriented offloading; batch work can be moved to Hadoop, saving costs and/or getting more mileage from costs that are already sunk into expensive legacy installations. The top targets here are large, centralized systems, with Teradata being a clear #1 and IBM mainframes a probable #2, but anything from Oracle to newer parallel analytic RDBMS is fair game.
- From the specialized data stores associated with fuller technology stacks. The example I had in mind was Splunk; Charles added Palantir, HP Arcsight and, in the past, Endeca. The idea here is that Hadoop is used to organize and/or index data the way those products’ native data stores would, but in higher volumes than they are (cost-)effective for.
On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC at this time only works under Hive, while Parquet can also be used for Hive, MapReduce, etc. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a file format or wind up each supporting both of them. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better — or at least run faster — if it can make reliable assumptions about how data is actually stored.
Related links
- In connection with its 0.1 version, Jakob Homan of LinkedIn contrasted Giraph to MapReduce-based graph processing.
- I wrote a series about graph processing in May, 2012.
- MPI used to be a higher Hadoop priority (August, 2011). That’s why I’ve kept bringing it up.
Comments
I would definitely love to see the MapR side of things as well, just to complete the picture. I think that, with the exception of Spark, nobody has really succeeded in making batch and real-time live together, and this is really surprising.
Do you think MapR has enough customers to be in a position to generalize about anything?
With all respect, yes I do. Perhaps they don’t have as many strategic alliances as Cloudera or Hortonworks, and perhaps they can be blamed for trying to distance themselves from the normal Hadoop way, but they still have very interesting peculiarities that others don’t. I would go even further: there are certain features others are now copying from MapR’s strategic lines, so I think it is good to have a company thinking ahead, or at least differently.
Those are about 3 different subjects.
MapR has had some interesting ideas. But that doesn’t necessarily mean they’re a trustworthy source of insight into the Hadoop market or any other subject.
They’ve only screwed me over once, so they’re not on my blacklist of companies I don’t want to hear from. But it’s remarkable how many ordinarily pleasant/easy-going folks I know complain about the mismatch between MapR’s words and reality.
Thanks Curt. One small tweak: while I’ve seen folks use Hadoop as the backing engine for Endeca, Arcsight, Splunk and Palantir systems, I don’t know whether there are customer savings going on there. It’s been more of a scaling strategy for large deployments.
And yes, you’re correct that both SAS and Revolution bypass MapReduce for their newer products in favor of their own execution frameworks, which are more efficient for running stats and machine learning workloads. Just as with Impala, by building a purpose-built framework they are able to get 1-2 orders of magnitude gains in performance over an MR-library-type approach.
Charles,
When one starts drawing distinctions as to whether scaling & performance serve to make things cheaper, better, or just possible at all, the answer often is no more precise than a simple “Yes”. 🙂
That was the point of http://www.strategicmessaging.com/the-marketing-of-performance/2012/04/25/
So Curt, about those three “Elephants”… To be honest, being from Europe, I’ve never heard of a Hortonworks implementation, and given the way they are expanding, I don’t see them entering this market any time soon. With them excluded, that leaves only Cloudera or MapR. What I don’t like about Cloudera (and any standard HDFS with its 64 MB blocks) is that it cannot handle well (or at all) projects consisting of many (millions or billions of) small files. The second thing I don’t like about Cloudera is their support: they really delay answers, or don’t give any at all. Of course, there are still a lot of comparisons to be made as to whether HDFS or MapRFS is better, and one can have a lot of arguments here. I hope you don’t mind this little talk; I am, as you said, just an ordinarily well-intentioned guy who wants to learn what is best for his company.
Norbert,
There’s little doubt that back in 2011 or so, the MapR butt-kick was useful for getting people more serious about accelerating HDFS improvements. I blogged about that at the time. But at this point I’m more concerned with who’s going to do a better job on Hadoop 2 than with who has workarounds/alternatives for various deficiencies of Hadoop 1.
Your dissatisfaction with Cloudera support is interesting & noted. Thanks.
Curt, quick note.
Pig, MapReduce and others can easily access ORC via HCatalog.
Thanks.
Thanks, Arun.
Curt, I appreciate your update to the post. Thanks.
Norbert,
Hortonworks landed in EMEA at the beginning of January 2013. We’re a team of around 20 people across Europe, and we’re happy to come and talk with you whenever it suits.
If you look at our website over the next few weeks, you should see quite a few of our European customers’ logos appearing.
Thanks,
Olivier
Norbert,
Hortonworks reached Europe earlier this year. We are expanding quickly and already have technical field and support teams in both the UK and Germany, with customers all across Europe.
I lead the technical team in the region so feel free to reach out to me directly.
Chris
Regarding ORC vs. Parquet:
While those two formats might look superficially similar, there is at least one major difference: Parquet implements ColumnIO’s (Dremel’s) repetition and definition levels to represent nested structure, whereas ORCfile contains element counts at every level of a nested path. The latter means that ORCfile isn’t a true columnar format: to reconstruct a column a.b.c.d, one also needs to read the element counts for a.b and a.b.c in addition to the data of a.b.c.d. With ColumnIO and Parquet, it’s only necessary to read column a.b.c.d, which has the information needed to reassemble the full structure embedded in it.
While this might sound like a minor difference, in practice it can have a big impact on performance: analytic models told us that for a structure only three levels deep you might end up doing something like 30% more random I/O. This was the reason why Cloudera and Twitter decided to create Parquet instead of sticking with Trevni (which also used ORCfile’s representation of nesting).
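To make the contrast above concrete, here is a toy sketch of the two encodings for a two-level nested column a.b (a repeated group “a” containing a repeated int “b”). This is my own illustration with made-up record shapes and helper names, not code from Parquet or ORC, and it ignores everything the real formats add on top (optional fields, encodings, compression, indexes):

```python
# Toy records of shape { "a": [ { "b": [ints...] }, ... ] }:
# "a" is a repeated group, "b" is a repeated int inside it.
records = [
    {"a": [{"b": [1, 2]}, {"b": [3]}]},
    {"a": []},
    {"a": [{"b": []}]},
]

# Dremel/Parquet-style: a single leaf stream in which every value carries a
# repetition level (where in the path the repetition happened) and a
# definition level (how many levels of the path are actually present).
def dremel_encode(records):
    out = []  # (value, repetition_level, definition_level)
    for rec in records:
        a_list = rec["a"]
        if not a_list:                      # empty record: one NULL entry
            out.append((None, 0, 0))
            continue
        first_a = True
        for a in a_list:
            r_a = 0 if first_a else 1       # later "a" entries repeat at level 1
            first_a = False
            b_list = a["b"]
            if not b_list:
                out.append((None, r_a, 1))  # "a" present, "b" empty
                continue
            first_b = True
            for b in b_list:
                r = r_a if first_b else 2   # later "b" values repeat at level 2
                first_b = False
                out.append((b, r, 2))
    return out

# Count-based style (as described for ORCfile above): the leaf stream holds
# only values, and each intermediate level keeps its own element counts.
def count_encode(records):
    counts_a = [len(rec["a"]) for rec in records]
    counts_b = [len(a["b"]) for rec in records for a in rec["a"]]
    values = [b for rec in records for a in rec["a"] for b in a["b"]]
    return counts_a, counts_b, values

print(dremel_encode(records))
# [(1, 0, 2), (2, 2, 2), (3, 1, 2), (None, 0, 0), (None, 0, 1)]
print(count_encode(records))
# ([2, 0, 1], [2, 1, 0], [1, 2, 3])
```

The trade-off being argued here is visible even in the toy version: the Dremel-style leaf stream is self-contained but carries two small integers per value, while the count-based layout keeps the leaf stream minimal but needs the parent streams to rebuild the nesting.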
Norbert, I run the Cloudera Customer Support team and I’m disappointed to hear that you’ve not had a good experience with us. I’d like to invite you to contact me directly so that we can address your concerns. Please email me at aklein@cloudera.com.
Regards,
Angus Klein
Cloudera Snr. Support Director
It is a fair point that Parquet and ORC differ in their representations of deeply nested data structures (struct, map, list, and union), but in our experience most customers use simple column types. Furthermore, I’d like to emphasize that ORC is a true columnar format, because it stores the data in column-major order. Hive has also had RCFile, which is likewise columnar, for several years.
Parquet chose to push the metadata about the records’ structure down to the leaves, while ORC puts it in the intermediate columns. For example, if a column is a list of structures that contain maps of string to string, Parquet will replicate the structural information down to each of the leaf columns, whereas ORC will store a single copy of it in the parent column.
There are two consequences to Parquet’s format:
* Parquet’s record assembly is significantly more complex than ORC’s.
* Because the data from the intermediate columns is repeated in each leaf, Parquet files are significantly larger than ORC files and require more IO (upwards of 30% in many cases). The difference is even larger if you turn off indexes in ORC to match the lack of indexes in Parquet.
To be fair, if the leaf column is inside of 3 levels of nested lists, then Parquet will have 2 integers (repetition and definition levels) and ORC will have 3. However, that is a fairly unusual table definition in enterprises.
Addressing Marcel’s comment about random IO: ORC makes a single pass through each stripe, reading precisely the required bytes and merging adjacent byte ranges into a single large read. As a result, there is no random IO. In IO benchmarks performed by a customer on Amazon S3, ORC did 10x better than RCFile because it does far fewer seeks.
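As a rough sketch of the byte-range coalescing Owen describes (just the general idea, assuming a read plan expressed as (offset, length) pairs; this is not ORC’s actual reader code):

```python
# Merge byte ranges that touch (or are within max_gap bytes of each other)
# so that a stripe's required streams can be fetched with a few large
# sequential reads instead of many small seeks.
def merge_ranges(ranges, max_gap=0):
    merged = []  # list of (offset, length)
    for off, length in sorted(ranges):
        prev_end = merged[-1][0] + merged[-1][1] if merged else None
        if merged and off <= prev_end + max_gap:
            # extend the previous read to cover this range
            end = max(prev_end, off + length)
            merged[-1] = (merged[-1][0], end - merged[-1][0])
        else:
            merged.append((off, length))
    return merged

# Three column streams whose ranges sit next to each other in the stripe
# collapse into one large read; the distant range stays separate.
print(merge_ranges([(0, 100), (100, 50), (400, 10)]))  # [(0, 150), (400, 10)]
```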
Concerning the partnership between Revolution Analytics and Cloudera, our next release (Revolution R Release 7) will run natively in Hadoop, with analytic processing distributed across the Hadoop cluster. Programs written in Revolution R will be fully portable to or from Hadoop, with minor changes.
Outperforming Mahout is not the point. Mahout is fairly strong in recommendation engines and clustering, but weak in predictive modeling; Mahout functionality complements R.
Curt, is Cloudera paying you? If they are your client, you should stipulate this on every article you blog, to keep full disclosure.
Your posts do seem anti MapR, Hortonworks, Pivotal, Intel and IBM, and it would be useful to know whether we can separate opinion from fact.
My organization is a consumer of Hadoop; we prefer technical facts, and thus we enjoy some of your non-Hadoop posts.
Chris
Chris,
You should assume that just about anybody is a past, present or potential future client. I’ve gotten tired of repeating it in every post. But yes, Cloudera has been a better/more consistent client than other Hadoop distro vendors.
That said, if I were to identify vendors who have shown visible anger and disappointment at what I’ve written about them, Cloudera might rank fairly high.