DataStax Enterprise 2.0
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair. 🙂
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j — an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
No matter what is going on at a node, I gather that data is stored in the same Cassandra file format, which DataStax calls CFS (Cassandra File System). Edit: Not true. See the follow-on post. DataStax stresses that a node can have a choice of at least two “personalities”, namely:
- Cassandra, which DataStax sometimes calls “real-time”, and which among other things seems to entail talking CQL (Cassandra Query Language).
- Hadoop, which DataStax sometimes calls “batch analytics”.
- (I’m not sure whether Solr is a third such choice. On the one hand, that would seem to be thematic; on the other hand, DataStax hasn’t actually said so to me.) Edit: It is. But the elasticity point below doesn’t include Solr.
New in DataStax Enterprise 2.0, there’s elasticity between these “personalities”; you can fire up a different kind of processing on a node, while leaving the data untouched. DataStax wasn’t able to say what typical replication factors are for the data. For example, is it 3 on Cassandra nodes plus 3 more on Hadoop nodes, or might the total be less than 6? I’m guessing it’s really 3 on Cassandra nodes, so as to get failure-tolerant read-your-writes (RYW) consistency, but Hadoop nodes might not necessarily bring the total up to 6.
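The arithmetic behind that guess of 3 can be sketched roughly as follows. This is only an illustration, assuming quorum reads and writes (the post doesn’t say which consistency levels DataStax customers actually use); the function names are hypothetical and not part of any Cassandra or DataStax API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_your_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """A read must overlap the latest acknowledged write when W + R > RF."""
    return write_acks + read_acks > rf


def tolerated_failures(rf: int, acks_needed: int) -> int:
    """Replicas that can be down while the required acks can still be collected."""
    return rf - acks_needed


# Assumption: both reads and writes are done at quorum.
for rf in (1, 2, 3):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, "
          f"RYW={read_your_writes(rf, q, q)}, "
          f"node failures tolerated={tolerated_failures(rf, q)}")
```

Under these assumptions, a replication factor of 3 is the smallest one that keeps read-your-writes consistency while surviving the loss of a node, which is why 3 replicas on the Cassandra side is a plausible default.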
Other NoSQL vendors portray Cassandra as likely to win when a cluster needs to be spread around multiple data centers, but not a major contender otherwise. DataStax disputes this, but does cite a need for “continuous availability” as a key driver of adoption.
As you’ve probably gathered by now, I like the core DataStax story — and indeed had some influence on it — but roll my eyes somewhat at the work-in-progress as to how it is phrased and told. The other regrettable fuzziness in DataStax messaging is around customer count. DataStax cites >140 “customers”, but that includes every last outfit that bought a single day of training. On the plus side, DataStax cites a firm figure of 45 employees, and has lots of production use cases it can talk about and extrapolate from.
In particular, DataStax cites customers in areas that include:
- In-game messaging, at a number of gaming companies, which sounds a lot like the application Facebook originally invented Cassandra for, before moving to HBase.
- Various kinds of e-commerce — retail, travel, hospitality. Specific uses include product catalogs (a classic dynamic schema use case), shopping carts (arguably ditto), and user-generated data (reviews, comments, whatever).
- Streaming media — 5-6 “mission-critical production users”, most famously Netflix. I gather this is yet another twist on e-commerce.
- Online ads and campaigns — e.g. Constant Contact.
- Sensor data — mainly one example of auto fleet management DataStax keeps mentioning.
Indeed, Netflix should probably be regarded as the single flagship Cassandra user, even ahead of Twitter (not a DataStax customer). Netflix recently wrote:
> We now have over 55 Cassandra clusters in the cloud and are moving our source of truth from our Datacenter to these Cassandra clusters.
which compares pretty favorably to an earlier estimate of
> 7 clusters in production by end of 2011
Comments
5 Responses to “DataStax Enterprise 2.0”
Thanks for the information!
> DataStax wasn’t able to say what typical replication factors are for the data — e.g., is it 3 on Cassandra nodes plus 3 more on Hadoop nodes, or might the total be less than 6?
CFS is an HDFS implementation, so usually HDFS would not be in use at all. Technically you could have exactly 3 replicas and gain all the benefits of both systems, but you might want to have 1 analytics-only replica for isolation purposes.
Hi Stu,
So was I wrong to say that Cassandra runs over CFS?
Nice overview, though CFS is a distributed filesystem that is equivalent to HDFS at the API level, is built on top of Cassandra, and is only available in DSE. Cassandra itself runs on XFS or whatever local filesystem each node uses.
Re: Twitter vs. Netflix as a Cassandra user. Surely Netflix does a lot with Cassandra, but I think Twitter is an interesting case, as they 1) run on real metal and 2) run the largest number of nodes of anyone I’m aware of (over 1000 according to https://dev.twitter.com/blog/cassie-scala-client-for-cassandra). Twitter also has a fraction of the staff maintaining those clusters. So both are interesting.
> So was I wrong to say that Cassandra runs over CFS?
Right, it is the other way around. CFS is an HDFS replacement that is hosted on Cassandra. Where HDFS has a NameNode and DataNodes to store the blocks and inodes of a distributed filesystem, CFS uses only a Cassandra cluster to do the same thing (albeit without transactional move/rename semantics).