November 19, 2015
CDH 5.5
I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:
- Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
- Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
- When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
- Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
- Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
  - More “platform” stuff from the Hadoop stack (e.g. for data ingest).
  - Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.
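To make the “dot” notation idea for nested data a bit more concrete, here is a minimal sketch in plain Python. It is not Impala's syntax or implementation; the function names (`get_path`, `project`) and the sample records are hypothetical, made up purely for illustration. It only shows the general notion of naming a nested field by a dotted path and pulling out just the columns you need.

```python
from typing import Any


def get_path(record: dict, path: str) -> Any:
    """Follow a dotted path such as 'address.zip' through nested dicts."""
    value: Any = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            # Assumption of this sketch: a missing field acts like a SQL NULL.
            return None
        value = value[key]
    return value


def project(records: list, paths: list) -> list:
    """Return one flat row per record, holding only the requested paths."""
    return [{p: get_path(r, p) for p in paths} for r in records]


# Hypothetical nested records, e.g. as they might be decoded from a nested file format.
customers = [
    {"name": "Ann", "address": {"city": "Boston", "zip": "02101"}},
    {"name": "Bob", "address": {"city": "Austin"}},
]

rows = project(customers, ["name", "address.zip"])
for row in rows:
    print(row)
```

A real engine would of course resolve such paths against a schema and read only the needed columns from disk; the sketch just illustrates the addressing idea.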
While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:
- Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
- Hundreds of nodes.
- 10s of simultaneous queries in dashboard use cases.
- 1 – 3 million queries/month as a common figure.
Cloudera also expressed the opinions that:
- An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
- It is common for Impala customers to use Hive for “data preparation”.
- SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
- SparkSQL’s main use cases are (and these overlap heavily):
  - As part of an analytic process (as opposed to straightforwardly DBMS-like use).
  - To persist data outside the confines of a single Spark job.
Categories: Benchmarks and POCs, Cloudera, Data warehousing, Databricks, Spark and BDAS, Market share and customer counts, Petabyte-scale data management, Predictive modeling and advanced analytics, SQL/Hadoop integration
Comments
4 Responses to “CDH 5.5”
[…] introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I […]
Does “donated to Apache” mean that Impala and Kudu will have an open development process?
Mark,
When I talk to vendors, there’s one common difference between discussions of closed- and open-source — open source vendors are more forthcoming about planned developments, because it’s public information anyway.
By that standard, I’d say Kudu from the get-go has felt as open as anything else in Hadoop, and Impala has recently started to feel more open to me as well.