March 21, 2012

DataStax Enterprise 2.0

Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.

My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:

Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.

DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:

You can manage the interactive data for a website.
You can store the logs for that website.
You can analyze all of the above in Hadoop.

No matter what is going on at a node, I gather that data is stored in the same Cassandra file format, which DataStax calls CFS (Cassandra File System). Edit: Not true. See the follow-on post. DataStax stresses that a node can have a choice of at least two “personalities”, namely:

Cassandra, which DataStax sometimes calls “real-time”, and which among other things seems to entail talking CQL (Cassandra Query Language).
Hadoop, which DataStax sometimes calls “batch analytics”.
(I’m not sure whether Solr is a third such choice. On the one hand, that would seem to be thematic; on the other hand, DataStax hasn’t actually said so to me.) Edit: It is. But the elasticity point below doesn’t include Solr.

New in DataStax Enterprise 2.0, there’s elasticity between these “personalities”; you can fire up a different kind of processing on a node, while leaving the data untouched. DataStax wasn’t able to say what typical replication factors are for the data. E.g., is it 3 on Cassandra nodes plus 3 more on Hadoop nodes, or might the total be less than 6? I’m guessing it’s really 3 on Cassandra nodes, so as to get failure-tolerant read-your-writes (RYW) consistency, but Hadoop nodes might not necessarily bring the total up to 6.
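
The arithmetic behind that guess about 3 replicas can be sketched roughly as follows. This is only an illustration of the standard quorum rule (a read is guaranteed to see the latest write when read replicas plus write replicas exceed the replication factor), not anything DataStax supplied; the function names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Smallest number of replicas that forms a majority of rf replicas."""
    return rf // 2 + 1


def is_read_your_writes(rf: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (r + w > rf)."""
    return r + w > rf


def tolerated_failures(rf: int, w: int, r: int) -> int:
    """Replicas that can be down while reads and writes can both still succeed."""
    return rf - max(w, r)


if __name__ == "__main__":
    # Compare small replication factors with quorum reads and quorum writes.
    for rf in (1, 2, 3):
        q = quorum(rf)
        print(
            f"RF={rf} quorum={q} "
            f"RYW={is_read_your_writes(rf, q, q)} "
            f"failures_tolerated={tolerated_failures(rf, q, q)}"
        )
```

Under these assumptions, 3 is the smallest replication factor at which quorum reads and writes give read-your-writes consistency while still surviving the loss of one replica.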

Other NoSQL vendors portray Cassandra as likely to win when a cluster needs to be spread around multiple data centers, but not a major contender otherwise. DataStax disputes this, but does cite a need for “continuous availability” as a key driver of adoption.

As you’ve probably gathered by now, I like the core DataStax story — and indeed had some influence on it — but roll my eyes somewhat at the work-in-progress as to how it is phrased and told. The other regrettable fuzziness in DataStax messaging is around customer count. DataStax cites >140 “customers”, but that includes every last outfit that bought a single day of training. On the plus side, DataStax cites a firm figure of 45 employees, and has lots of production use cases it can talk about and extrapolate from.

In particular, DataStax cites customers in areas that include:

In-game messaging, at a number of gaming companies, which sounds a lot like the application Facebook originally invented Cassandra for, before moving to HBase.
Various kinds of e-commerce — retail, travel, hospitality. Specific uses include product catalogs (a classic dynamic schema use case), shopping carts (arguably ditto), and user-generated data (reviews, comments, whatever).
Streaming media — 5-6 “mission-critical production users”, most famously Netflix. I gather this is yet another twist on e-commerce.
Online ads and campaigns — e.g. Constant Contact.
Sensor data — mainly one example of auto fleet management DataStax keeps mentioning.

Indeed, Netflix should probably be regarded as the single flagship Cassandra user, even ahead of Twitter (not a DataStax customer). Netflix recently wrote:

We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our Datacenter to these Cassandra clusters.

which compares pretty favorably to an earlier estimate of

7 clusters in production by end of 2011

Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics

Comments

5 Responses to “DataStax Enterprise 2.0”

Jeremy on March 21st, 2012 4:35 am

Thanks for the information!

Stu Hood on March 21st, 2012 5:22 am

> DataStax wasn’t able to say what typical replication factors are for the data — e.g., is it 3 on Cassandra nodes plus 3 more on Hadoop nodes, or might the total be less than 6?
CFS is an HDFS implementation, so usually HDFS would not be in use at all. Technically you could have exactly 3 replicas and gain all the benefits of both systems, but you might want to have 1 analytics-only replica for isolation purposes.

Curt Monash on March 21st, 2012 7:41 am

Hi Stu,

So was I wrong to say that Cassandra runs over CFS?

Jeremy Hanna on March 21st, 2012 3:27 pm

Nice overview, though CFS is a distributed filesystem, equivalent to the HDFS API, that is built on top of Cassandra and is only available in DSE. Cassandra itself runs on XFS or whatever on each node.

Re: Twitter vs. Netflix as a Cassandra user. Surely Netflix does a lot with Cassandra, but I think Twitter is an interesting case, as they 1) run on real metal and 2) run more nodes than anyone else I’m aware of (over 1000, according to https://dev.twitter.com/blog/cassie-scala-client-for-cassandra ). Twitter also has a fraction of the staff maintaining those clusters. So both are interesting.

Stu Hood on March 21st, 2012 3:53 pm

> So was I wrong to say that Cassandra runs over CFS?
Right, it is the other way around. CFS is an HDFS replacement that is hosted on Cassandra. Where HDFS has a Namenode and Datanodes to store the blocks and inodes of a distributed filesystem, CFS uses only a Cassandra cluster to do the same thing (albeit without transactional move/rename semantics).
