October 16, 2014
Cloudera’s announcements this week
This week being Hadoop World, Cloudera naturally put out a flurry of press releases. In anticipation, I put out a context-setting post last weekend. That said, the gist of the news seems to be:
- Cloudera continued to improve various aspects of its product line, especially Impala with a Version 2.0. Good for them. One should always be making one’s products better.
- Cloudera announced a variety of partnerships with companies one would think are opposed to it. Not all are Barney. I’m now hard-pressed to think of any sustainable-looking relationship advantage Hortonworks has left in the Unix/Linux world. (However, I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.)
- Cloudera is getting more cloud-friendly, via a new product — Cloudera Director. Probably there are or will be some cloud-services partnerships as well.
Notes on Cloudera Director start:
- It’s closed-source.
- Code and support are included in any version of Cloudera Enterprise.
- It’s a management tool. Indeed, Cloudera characterized it to me as a sort of manager of Cloudera Managers.
What I have not heard is any answer for the traditional performance challenge of Hadoop-in-the-cloud, which is:
- Hadoop, like most analytic RDBMS, tightly couples processing and storage in a shared-nothing way.
- Standard cloud architectures, however, decouple them, thus mooting a considerable fraction of Hadoop performance engineering.
Maybe that problem isn’t — or is no longer — as big a deal as I’ve been told.
Comments
15 Responses to “Cloudera’s announcements this week”
Seems as if a number of storage vendors are also trying to decouple Hadoop storage from processing. It’s an interesting collision of processing models.
I think the importance of data locality depends on engine performance. Hive will process, let’s say, 5-10 MB per second per core in a simple query (like a WHERE clause with high selectivity). I assume that Amazon S3 can provide a few dozen megabytes per second per server, so we haven’t lost too much.
For the same query Impala will do at least 100 MB per second per core. I doubt that S3 will feed a 12-core server with 1.2 GB per second. More than that, even a 10 gigabit network will not be enough.
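A back-of-envelope sketch of the arithmetic above, using only the figures quoted in this comment (5-10 MB/s per core for Hive, 100 MB/s per core for Impala, a 12-core server, a 10 gigabit link). The function names are hypothetical, and the link figure is the theoretical maximum with protocol overhead ignored.

```python
# Back-of-envelope check of the per-core scan rates quoted above.
# All names are illustrative; nothing here is a Hadoop, Impala, or AWS API.

MB = 10**6  # decimal megabyte, in bytes
GB = 10**9  # decimal gigabyte, in bytes


def required_scan_rate(mb_per_sec_per_core, cores):
    """Bytes per second a server must be fed to keep every core busy."""
    return mb_per_sec_per_core * MB * cores


def link_capacity(gigabits_per_sec):
    """Theoretical bytes per second of a network link (overhead ignored)."""
    return gigabits_per_sec * 10**9 / 8


CORES = 12  # server size assumed in the comment

hive_low = required_scan_rate(5, CORES)
hive_high = required_scan_rate(10, CORES)
impala = required_scan_rate(100, CORES)
ten_gbe = link_capacity(10)

print(f"Hive:   {hive_low / MB:.0f}-{hive_high / MB:.0f} MB/s per server")
print(f"Impala: {impala / GB:.1f} GB/s per server")
print(f"10 GbE: {ten_gbe / GB:.2f} GB/s theoretical maximum")
print(f"Impala demand as share of link: {impala / ten_gbe:.0%}")
```

On these numbers Impala’s demand already sits just under the link’s theoretical ceiling, so any protocol overhead, or a scan rate above the “at least 100 MB” floor, pushes it past what the network can deliver.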
@David Gruzman Agree. It is also hard to see how the decoupling really scales over time as analytic tools improve in speed. The S3 case you describe can lead to pathological performance bottlenecks on large data sets depending on how S3 distributes data.
David,
On Amazon, you would be surprised at the scan throughput of Amazon Redshift. It can easily go to 2.5 GB/sec/node, and no, it doesn’t use S3 storage for scanning the data. Hadoop is a different animal compared to an analytical DBMS. If you have a Hive-like or RDBMS-like workload, you should stick to an analytical RDBMS and you will be happy with the performance.
Hope this helps.
John,
It is exactly my point – Redshift gains its amazing speed because of local storage usage. You could not dream about 2.5 GB/sec from S3…
Agreed. I think even with local storage, Hadoop falls short in performance because of several other factors. The benchmarks run by the likes of Impala are anything but real-life scenarios. Looking deeper, HDFS characteristics like block size, data co-location etc. make it much harder for Hadoop to solve interactive analytics. There is a clear use case for Hadoop and the ecosystem, but lots of folks are being led to believe that Hadoop is here to replace storage, processing, as well as analytical DBMS.
Re: “I haven’t heard a peep about any kind of Cloudera/Microsoft/Windows collaboration.”
Good news! http://j.mp/1uuw2jD
We’re pleased to announce that Microsoft Azure is now a preferred and certified Cloudera cloud platform. Mike Olson just appeared on stage with Satya Nadella to demonstrate Cloudera in the Azure marketplace and discuss the benefits of the partnership.
@John: I agree that ‘traditional’ Hadoop is not suitable for interactive analytics. But technologies like HDFS caching, Parquet and heavy reliance on large amounts of memory make Impala an actual viable competitor in that area.
According to our benchmarks, it was at least in the same ballpark, performance-wise, as Vertica.
We’re doing some deeper tests on Impala now, and what bothers us the most is its over-reliance on RAM. Some heavy in-memory queries can crash the entire Impala server.
Note that Hive 0.14 also promises serious performance improvements, and even CRUD. We will be looking at that next.
What concerns me more and more about Cloudera is their tendency to move to a closed-source walled garden.
Several clients I talked to are looking for alternatives. With Hive going through a lot of development lately, and technologies like Spark around the corner, the value proposition of Cloudera is decreasing.
Furthermore, they have a per-node license. This hampers the one thing that Hadoop is known for best: scalability. Your 30K license of last year becomes a 300K license today.
I’m still a big fan. I like CDH and I like Impala. But I feel more and more that Cloudera is best suited for big corporations with deep pockets.
Kris, there are workloads where one can use Impala or something similar to solve a few problems. In the end, once you look at all the features one may want in Impala (caching, Parquet etc., and in future WLM, full update capabilities etc.) to be an alternative to the likes of Vertica or ParAccel or Redshift, it does start to look like a DBMS running on a distributed file system, aka HDFS. The only difference is what FS you are running it on. The cost play is even more interesting: nothing beats Amazon offerings for “Big data”, which includes Redshift.
Agreed that Impala is simply a DBMS on HDFS. It does integrate with YARN, so it plays nicely in the Hadoop ecosystem.
About pricing: how can Redshift be cheaper than free? Surely installing Impala on AWS is cheaper than those same AWS nodes + Redshift costs?
I think Amazon also offers Impala as a service. Not sure of the pricing there.
@Kris, good comments, a couple things to highlight:
* Impala 2.0 removes dependencies on RAM; would be curious to hear about your experiences with the latest release.
* Cloudera also offers capacity (i.e. /TB) pricing.
* Amazon EMR does indeed offer Impala support:
http://aws.amazon.com/about-aws/whats-new/2013/12/12/announcing-support-for-impala-with-amazon-elastic-mapreduce/
@matt thanks for your reply.
My experience so far is indeed that Impala 2 spills to disk when needed. But some queries won’t execute anymore. I’m testing all TPC-DS queries, and about 54 of them succeed on in-memory data. If the data won’t fit in memory any more, far fewer queries succeed. Will publish first results and code soon. Would be happy to talk them through with you guys to validate way of working and setup.
Crashing the entire server is not an issue anymore since I correctly configured the memory limits. My bad.
Kris, Redshift doesn’t price HW and SW separately. Some of my customers using Redshift with 1 to 2 year commitments are finding that the price performance of Redshift, especially against Impala, is one fourth. There is nothing FREE. Impala doesn’t really do expansive analytics and is missing many integrations and features for a complete analytical solution. Full disclosure: I am an independent consultant in big data technologies and have helped some enterprise customers evaluate Impala, Redshift and some other big data technologies, hence I shared my experience. I am sure Impala will improve over time, but I am not convinced YET that it can match the price performance of systems like Redshift for workloads which are beyond TPC-DS or TPC-H.
YARN is definitely promising, but does it really help in resource allocation within an Impala workload? I think the answer to that is no as of today!!!
Thanks.
[…] Cloudera Director, which for example launches cloud instances. […]