August 25, 2013

Cloudera Hadoop strategy and usage notes

When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.

HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.

Another good subject was offloading work to Hadoop, in a couple different senses of “offload”:

On a pickier note, I encouraged Charles to push back against Hortonworks’ arguments for ORC vs. Parquet. His first claim was that ORC at this time only works under Hive, while Parquet can also be used for Hive, MapReduce, etc. (Edit: But see Arun Murthy’s comment below.) I suspect this is a case where Hortonworks and Cloudera should just get over themselves, and either agree on a file format or wind up each supporting both of them. There’s a lot of DBMS-like tooling in Hadoop’s future, and I have to think it will work better — or at least run faster — if it can make reliable assumptions about how data is actually stored.

Comments

22 Responses to “Cloudera Hadoop strategy and usage notes”

  1. Cloudera Sentry and other security subjects | DBMS 2 : DataBase Management System Services on August 25th, 2013 11:40 am

    […] I chatted with Charles Zedlewski of Cloudera on Thursday about security — especially Cloudera’s new offering Sentry — and other Hadoop subjects. […]

  2. Hortonworks business notes | DBMS 2 : DataBase Management System Services on August 25th, 2013 11:43 am

    […] Edit: I followed up on these efficiency-oriented use cases in a conversation with Cloudera. […]

  3. Norbert GC on August 25th, 2013 3:03 pm

    I would also definitely love to see the MapR side of things, just to come full circle. I think that, with the exception of Spark, nobody has really succeeded in making batch and real-time live together, which is really surprising.

  4. Curt Monash on August 25th, 2013 3:26 pm

    Do you think MapR has enough customers to be in a position to generalize about anything?

  5. Norbert GC on August 26th, 2013 4:27 am

    With all respect, yes I do. Perhaps they don’t have as many strategic alliances as Cloudera or Hortonworks, and perhaps they could be blamed for trying to distance themselves from the normal Hadoop way, but they still have very interesting peculiarities that others don’t. I would go even further: there are certain features that others are copying today from MapR’s strategic lines, so I think it is good to have a company that thinks ahead or thinks differently.

  6. Curt Monash on August 26th, 2013 10:54 am

    Those are about 3 different subjects.

    MapR has had some interesting ideas. But that doesn’t necessarily mean they’re a trustworthy source of insight into the Hadoop market or any other subject.

    They’ve only screwed me over once, so they’re not on my blacklist of companies I don’t want to hear from. But it’s remarkable how many ordinarily pleasant/easy-going folks I know complain about the mismatch between MapR’s words and reality.

  7. Charles Zedlewski on August 26th, 2013 4:58 pm

    Thanks Curt. One small tweak is that while I’ve seen folks use Hadoop as the backing engine for Endeca, Arcsight, Splunk and Palantir systems, I don’t know if there are customer savings going on there. It’s been more of a scaling strategy for large deployments.

    And yes, you’re correct that both SAS and Revolution bypass MapReduce for their newer products in favor of their own execution frameworks, which are more efficient for running stats and machine learning workloads. Just like with Impala, by building a purpose-built framework they are able to get 1-2 orders of magnitude gains in performance over an MR library type of approach.

  8. Curt Monash on August 27th, 2013 2:22 am

    Charles,

    When one starts drawing distinctions as to whether scaling & performance serve to make things cheaper, better, or just possible at all, the answer often is no more precise than a simple “Yes”. 🙂

    That was the point of http://www.strategicmessaging.com/the-marketing-of-performance/2012/04/25/

  9. Norbert GC on August 27th, 2013 3:12 am

    So Curt, about those three “Elephants”… To be honest, being from Europe, I’ve never heard of a Hortonworks implementation, and given the way they are expanding, I don’t see them entering this market any time soon. With that excluded, only Cloudera and MapR remain. What I don’t like about Cloudera (and any normal HDFS with 64 MB blocks) is that it cannot handle well (or at all) projects consisting of millions or billions of small files. The second thing I don’t like about Cloudera is their support. They really delay answers, or don’t give any. Of course, there is still a lot of comparison to be done as to which of HDFS and MapRFS is better, and one can make a lot of arguments here. I hope you don’t mind this little talk; I am, as you said, just an ordinarily well-intentioned guy who wants to learn what is best for his company.

  10. Curt Monash on August 27th, 2013 3:53 am

    Norbert,

    There’s little doubt that back in 2011 or so, the MapR butt-kick was useful for getting people more serious about accelerating HDFS improvements. I blogged about that at the time. But at this point I’m more concerned with who’s going to do a better job on Hadoop 2 than with who has workarounds/alternatives for various deficiencies of Hadoop 1.

    Your dissatisfaction with Cloudera support is interesting & noted. Thanks.

  11. Arun C. Murthy on August 27th, 2013 5:07 pm

    Curt, quick note.

    Pig, MapReduce and others can easily access ORC via HCatalog.

    Thanks.

  12. Curt Monash on August 27th, 2013 5:14 pm

    Thanks, Arun.

  13. Arun C. Murthy on August 27th, 2013 10:44 pm

    Curt – I appreciate your update to the post. Thanks.

  14. Olivier Renault on August 28th, 2013 4:40 am

    Norbert,

    Hortonworks landed in EMEA at the beginning of January 2013. We’re a team of around 20 people across Europe and happy to come and talk with you whenever it is suitable.

    If you look at our website over the next few weeks, you should see quite a few of our European customers’ logos appearing.

    Thanks,
    Olivier

  15. Chris Harris on August 28th, 2013 6:02 am

    Norbert,

    Hortonworks reached Europe earlier this year. We are quickly expanding and already have a technical field and support team in both the UK and Germany with customers all across Europe.

    I lead the technical team in the region so feel free to reach out to me directly.

    Chris

  16. Marcel Kornacker on August 28th, 2013 7:59 am

    Regarding ORC vs. Parquet:

    While those two formats might superficially look similar, there is at least one major difference: Parquet implements ColumnIO’s (Dremel’s) repetition and definition levels to represent nested structure. ORCfile, in contrast, contains element counts at every level of a nested path; the latter means that ORCfile isn’t a true columnar format: in order to reconstruct a column a.b.c.d one also needs to read the element counts for a.b and a.b.c in addition to the data of a.b.c.d. With ColumnIO and Parquet it’s only necessary to read column a.b.c.d, which has the necessary information to reassemble the full structure embedded in it.

    While this might sound like a minor difference, in practice it can have a big impact on performance: analytic models told us that for a structure only three levels deep you might end up doing something like 30% more random I/O. This was the reason why Cloudera and Twitter decided to create Parquet instead of sticking with Trevni (which also used ORCfile’s representation of nesting).

  17. Angus Klein on August 28th, 2013 1:56 pm

    Norbert, I run the Cloudera Customer Support team and I’m disappointed to hear that you’ve not had a good experience with us. I’d like to invite you to contact me directly so that we can address your concerns. Please email me at aklein@cloudera.com.

    Regards,
    Angus Klein
    Cloudera Snr. Support Director

  18. Owen O'Malley on August 29th, 2013 11:30 pm

    It is a fair point that Parquet and ORC differ in their representations of deeply nested data structures (struct, map, list, and union), but in our experience most customers use simple column types. Furthermore, I’d like to emphasize that ORC is a true columnar format, because it stores the data in column-major format. Hive has had RCFile, which is also columnar, for several years.

    Parquet chose to push the meta data about the records’ structure down to the leaves while ORC puts the data in the intermediate columns. For example, if a column is a list of structures that contain maps of string to strings, Parquet will replicate the information down to each of the leaf columns. ORC will store a single copy of the information in the parent column.

    There are two consequences to Parquet’s format:
    * Parquet’s record assembly is significantly more complex than ORC’s.
    * Because the data from the intermediate columns are repeated in each leaf, Parquet files are significantly larger than ORC and require more IO (upwards of 30% in many cases). The difference is even larger if you turn off indexes in ORC to match the lack of indexes in Parquet.

    To be fair, if the leaf column is inside of 3 levels of nested lists, then Parquet will have 2 integers (repetition and definition levels) and ORC will have 3. However, that is a fairly unusual table definition in enterprises.

    Addressing Marcel’s comment about random IO, ORC makes a single pass through each stripe reading precisely the required bytes and merging adjacent byte ranges into a single large read. As a result there is no random IO. In IO benchmarks performed by a customer on Amazon S3, ORC does 10x better than RCFile because it does far fewer seeks.

  19. Thomas W. Dinsmore on September 12th, 2013 5:45 pm

    Concerning the partnership between Revolution Analytics and Cloudera, our next release (Revolution R Release 7) will run natively in Hadoop, with analytic processing distributed across the Hadoop cluster. Programs written in Revolution R will be fully portable to or from Hadoop, with minor changes.

    Outperforming Mahout is not the point. Mahout is fairly strong in recommendation engines and clustering, but weak in predictive modeling; Mahout functionality complements R.

  20. ClearStory, Spark, and Storm | DBMS 2 : DataBase Management System Services on September 30th, 2013 9:25 am

    […] so other data structures are obviously being flattened out. Naturally, ClearStory is also eyeing Parquet and ORCfile, and has particularly warm thoughts about the […]

  21. Chris sanders on October 7th, 2013 12:45 pm

    Curt, is Cloudera paying you? If they are your client, you should state this in every article you blog, to maintain full disclosure.

    Your posts do seem anti-MapR, Hortonworks, Pivotal, Intel and IBM, and it would be useful to know this so that we can separate opinion from fact.

    My organization is a consumer of Hadoop; we prefer technical facts, and thus we enjoy some of your non-Hadoop posts.

    Chris

  22. Curt Monash on October 7th, 2013 4:00 pm

    Chris,

    You should assume that just about anybody is a past, present or potential future client. I’ve gotten tired of repeating it in every post. But yes, Cloudera has been a better/more consistent client than other Hadoop distro vendors.

    That said, if I were to identify vendors who have shown visible anger and disappointment at what I’ve written about them, Cloudera might rank fairly high.
