Cask and CDAP
For starters:
- Continuuity toured in 2012 and touted its “app server for Hadoop” technology.
- Continuuity recently changed its name to Cask and went open source.
- Cask’s product is now called CDAP (Cask Data Application Platform). It’s still basically an app server for Hadoop and other “big data” — ouch do I hate that phrase — data stores.
- Cask and Cloudera partnered.
- I got a more technical Cask briefing this week.
Also:
- App servers are a notoriously amorphous technology. The focus of how they’re used can change greatly every couple of years.
- Partly for that reason, I was unimpressed by Continuuity’s original hype-filled positioning.
So far as I can tell:
- Cask’s current focus is to orchestrate job flows, with lots of data mappings.
- This is supposed to provide lots of developer benefits, for fairly obvious reasons. Those are pitched in terms of an integration story, more in a “free you from the mess of a many-part stack” sense than strictly in terms of data integration.
- CDAP already has a GUI to monitor what’s going on. A GUI to specify workflows is coming very soon.
- CDAP doesn’t consume a lot of cycles itself, and hence isn’t a real risk for unpleasant overhead, if “overhead” is narrowly defined. Rather, performance drags could come from …
- … sub-optimal choices in data mapping, database design or workflow composition.
I didn’t push the competition point hard (the call was generally a bit rushed due to a hard stop on my side), but:
- Cask thinks it doesn’t have much in the way of exact or head-to-head competitors, but cites Spring and WibiData/Kiji as coming closest.
- I’d think that data integration vendors who use Hadoop as an execution engine (Informatica, Syncsort and many more) would be in the mix as well.
- Cask disclaimed competition with Teradata Revelytix, on the theory that Cask is focused on operational/“real-time” use cases, while Revelytix Loom is focused on data science/investigative analytics.
To reiterate part of that last bullet — like much else we’re hearing about these days, CDAP is focused on operational apps, perhaps with a streaming aspect.
To some extent CDAP can be viewed as restoring the programmer/DBA distinction to the non-SQL and streaming worlds. That is:
- Somebody creates a data mapping “pattern”.
- Programmers (including perhaps the creator) write to that pattern.
- Somebody (perhaps the creator) tweaks the mapping to optimize performance, or to reflect changes in the underlying data management.
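To make that division of labor concrete, here is a minimal Python sketch. It is not CDAP's API; `KeyValueStore`, `InMemoryStore`, `PrefixedStore`, `UserProfilePattern` and `register` are all hypothetical names invented for illustration. The point is only that application code is written against the pattern, so the mapping underneath can be swapped or tuned without touching it.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical storage interface; stands in for any underlying data store."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """Toy backing store that keeps everything in a plain dict."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class PrefixedStore:
    """Second toy store with a different physical key layout (a namespace prefix)."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[self._prefix + key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(self._prefix + key)


class UserProfilePattern:
    """The 'pattern': callers see users and emails, never storage details."""

    def __init__(self, store: KeyValueStore) -> None:
        self._store = store

    def set_email(self, user_id: str, email: str) -> None:
        self._store.put(f"user:{user_id}:email", email)

    def email_of(self, user_id: str) -> Optional[str]:
        return self._store.get(f"user:{user_id}:email")


def register(profiles: UserProfilePattern) -> Optional[str]:
    """Programmer-side code, written only against the pattern."""
    profiles.set_email("42", "ann@example.com")
    return profiles.email_of("42")


# The "DBA" role remaps the pattern to a different store; register() is unchanged.
on_first_store = register(UserProfilePattern(InMemoryStore()))
on_second_store = register(UserProfilePattern(PrefixedStore("v2/")))
```

Both calls return the same email, which is the whole appeal: the person tuning the mapping and the person writing the application need not be the same, and need not coordinate on every change.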
Further notes on CDAP data access include:
- Cask is proud that a pattern can literally be remapped from one data store to another, although I wonder how often that is likely to happen in practice.
- Also, a single “row” can reference multiple data stores.
- Cask’s demo focused on imposing a schema on a log file, something you might do incrementally as you decide to extract another field of information. This is similar to major use cases for schema-on-need and for Splunk.
- For most SQL-like access and operations, CDAP relies on Hive, even when the target is an external data store or non-tabular data. Cask is working with Cloudera on Impala access.
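The log-file demo mentioned above is easy to picture with a small sketch. This is plain Python rather than anything CDAP-specific, and the log format, the `apply_schema` helper and the field names are assumptions made up for the example. The raw lines are never rewritten; the schema is just a set of extractors applied at read time, and it grows when you decide you need another field.

```python
import re
from typing import Dict, List, Optional, Pattern

# Raw log lines stay untouched; structure is imposed only when reading them.
RAW_LOG: List[str] = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 POST /login 403",
]


def apply_schema(
    lines: List[str], schema: Dict[str, Pattern[str]]
) -> List[Dict[str, Optional[str]]]:
    """Extract one value per schema field from each line (None if no match)."""
    rows = []
    for line in lines:
        row: Dict[str, Optional[str]] = {}
        for field, pattern in schema.items():
            match = pattern.search(line)
            row[field] = match.group(1) if match else None
        rows.append(row)
    return rows


# First pass: only the client IP is of interest.
schema: Dict[str, Pattern[str]] = {"ip": re.compile(r"^(\S+)")}
first_pass = apply_schema(RAW_LOG, schema)

# Later: the HTTP status turns out to matter too, so add one more extractor.
schema["status"] = re.compile(r"(\d{3})$")
second_pass = apply_schema(RAW_LOG, schema)
```

After the second extractor is added, each row carries both `ip` and `status`, with no reload or rewrite of the underlying log.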
Examples of things that Cask supposedly makes easy include:
- Chunking streaming data by time (e.g. 1 minute buckets).
- Encryption.
- Generating database stats (histograms and so on).
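As an illustration of the first item, here is roughly what chunking a stream into 1-minute buckets amounts to. It is a hedged sketch in plain Python, not how CDAP does it; `bucket_start`, `chunk_by_time` and the `(timestamp, payload)` event shape are assumptions for the example, with timestamps taken to be whole seconds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BUCKET_SECONDS = 60  # 1-minute buckets, as in the example above


def bucket_start(timestamp: int, width: int = BUCKET_SECONDS) -> int:
    """Round a timestamp (in seconds) down to the start of its bucket."""
    return timestamp - (timestamp % width)


def chunk_by_time(
    events: Iterable[Tuple[int, str]], width: int = BUCKET_SECONDS
) -> Dict[int, List[str]]:
    """Group (timestamp, payload) events by the bucket their timestamp falls in."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        buckets[bucket_start(timestamp, width)].append(payload)
    return dict(buckets)


events = [(0, "a"), (59, "b"), (60, "c"), (125, "d")]
chunks = chunk_by_time(events)
```

The logic is trivial in isolation; the claimed benefit of a platform is that you don't have to rewrite and re-plumb it for every job that consumes the stream.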
Tidbits as to how Cask perceives or CDAP plays with other technologies include:
- Kafka is hot.
- Spark Streaming is hot enough to be on the CDAP roadmap.
- Cask believes that its administrative tools don’t conflict with Cloudera Manager or Ambari, because they’re more specific to an application, job or dataset.
- CDAP is built on Twill, which is a thread-like abstraction over YARN that Cask contributed to Apache. Mesos is in the picture as well, as a YARN alternative.
- Cask is seeing some interest in Flink. (Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded.)
Cask has ~40 people, multiple millions of dollars in trailing revenue, and — naturally — high expectations for future growth. I neglected, however, to ask how that revenue was split between subscription, professional services and miscellaneous. Cask expects to finish 2015 with a healthy two-digit number of customers.
Cask’s customers seem concentrated in usual-suspect internet-related sectors, although Cask gave it a bit of an enterprise-y spin by specifically citing SaaS (Software as a Service) and telecom. When I asked who else seems to be a user or interested based on mailing list activity, Cask mentioned a lot of financial services and some health care as well.
Related link
- Cask doesn’t have the obvious .com URL.
Comments
5 Responses to “Cask and CDAP”
After reading a bit of the sources, I got the feeling that it is something like POSIX: giving standard access to various capabilities that have different implementations. Is that right?
Also interesting: is it possible to somehow wrap an existing zoo of MR jobs, Spark jobs, Hive scripts etc. into CDAP in an automatic or semi-automatic manner?
Very good insights on Cask and CDAP, but one small comment: Apache Flink is not simply an alternative to Spark; it is more like an alternative to MapReduce, doing distributed data processing outside the MapReduce paradigm the right way. Both Flink and Spark try to solve similar problems but tackle them in very different ways. A simple Google search about Flink should give more insight into the differences.
Both projects originated at almost the same time; it just happened that Spark went to the ASF first, and most of its initial contributors reside in the US, so the project got more exposure and usage.
Henry,
Your description of Flink is also a description of Spark. So Flink is indeed an alternative to Spark. 🙂
As for why Spark is winning — that has a lot to do with Cloudera’s embrace, and also with the happy good fortune of stumbling into the streaming use case, which Mike Franklin of course was in an excellent position to recognize when it arose.
Hi Curt, thx for the response.
I did not mean that the analogy is wrong. What I was saying is that Flink was never meant to be an “alternative” to Spark, because both ideas came up at a similar time in different places. I believe any new concept or solution for processing large data via distributed systems that skips the MapReduce limitations should be embraced, not dismissed as “unneeded”.
It is actually similar to what happened when Spark first came out: lots of people said it was not needed because we already had Hadoop, and look at it now =)
Why is Spark winning? I would not say it is winning; it is more like popular, and as we all know from high school, a popularity contest is never a good indicator of success later in life ^_*
I am not even sure what winning means here. Having more than one alternative for solving problems is always good for consumers/developers.
Enough about Flink from me, the blog was about Cask and CDAP and I would love to see more great stuff coming from them ^_^