DataStax Enterprise and Cassandra revisited
My last post about DataStax Enterprise and Cassandra didn’t go so well. As a follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.
For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:
- Vanilla Cassandra.
- Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
  - Solr adds a lot more secondary indexing to Cassandra.
  - In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
- Cassandra plus Hadoop.
  - You can use Hadoop MapReduce to manipulate generic Cassandra data.
  - In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
  - Hadoop jobs can interweave access to the two kinds of data structure.
Cassandra, Solr, Lucene, and Hadoop are all Apache projects.
If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:
- Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
- There’s something called CQL (Cassandra Query Language), said to be SQL-like.
- There’s a JDBC driver for CQL.
- With Hadoop MapReduce also come Hive, Pig, and Sqoop.
- With Solr and Lucene comes full-text search.
In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
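To make "in-entity caching" concrete, here is a minimal sketch in plain Python. The dictionary `row` is only a stand-in for a Cassandra row, and the names `cache_entity`, `load_entity`, and `profile_json` are made up for illustration; nothing here talks to a real cluster.

```python
import json

# Hypothetical in-memory stand-in for one Cassandra row: column name -> value.
row = {}

def cache_entity(row, column_name, entity):
    """Serialize the whole entity to JSON and store it in a single column."""
    row[column_name] = json.dumps(entity, sort_keys=True)

def load_entity(row, column_name):
    """Read the single column back and rebuild the entity."""
    return json.loads(row[column_name])

# Illustrative data only.
profile = {"name": "Ada", "emails": ["ada@example.com"], "prefs": {"theme": "dark"}}
cache_entity(row, "profile_json", profile)
restored = load_entity(row, "profile_json")
```

The trade-off is that the database sees one opaque string, so the application reads and writes the entity as a whole.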
The two main ways to get direct SQL* access to data in DataStax Enterprise are:
- JDBC/SQL.
- Hive/Hadoop.
*or very SQL-like, depending on how you view things
Before going further, let’s recall some Cassandra basics:
- In a Cassandra column group, you have rowIDs and, associated with each rowID, a collection of (name, value) pairs.
- This is a lot like a relational table — albeit a denormalized and sparse one — where each name is like a column header.
- In addition, every name-value pair has a time-stamp.
- Cassandra has data typing — for example, there’s a concept of integer values.
- The Cassandra replication/consistency model starts with a generic quorum/consistent hashing approach. In addition:
  - You can force replicas to go to different data centers, or to different racks in the same data center.
  - You can update remote data centers asynchronously, with only local quorums being required for reads or writes to go through.
  - Alternatively, you can keep remote data centers in sync by having writes require a quorum from all data centers (whereas reads would only require a local quorum).
- Cassandra has no head node.
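The two multi-data-center options above come down to simple arithmetic. What follows is a rough sketch, assuming a quorum means a simple majority of replicas and that "a quorum from all data centers" means a majority within each data center. The function names, the `"local"`/`"each"` mode labels, and the data center names are invented for this example.

```python
def quorum(replica_count: int) -> int:
    """Smallest majority of replica_count replicas."""
    return replica_count // 2 + 1

def required_acks(replicas_per_dc: dict, local_dc: str, mode: str) -> dict:
    """Return how many replica acknowledgements each data center must supply.

    mode "local": only the coordinator's own data center must reach quorum.
    mode "each":  every data center must reach its own quorum.
    """
    if mode == "local":
        return {local_dc: quorum(replicas_per_dc[local_dc])}
    if mode == "each":
        return {dc: quorum(n) for dc, n in replicas_per_dc.items()}
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical layout: three replicas in each of two data centers.
replicas = {"dc_east": 3, "dc_west": 3}
local_only = required_acks(replicas, "dc_east", "local")  # remote DC catches up asynchronously
every_dc = required_acks(replicas, "dc_east", "each")     # remote DC must acknowledge too
```

With three replicas per data center, the local option waits for two acknowledgements nearby, while the all-data-centers option also waits for two from the remote site.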
The story for Solr/Lucene indexing, beyond text search and so on, goes like this:
- Cassandra has a secondary indexing capability, but it insists on examining all nodes, and hence has acceptable performance only in certain use cases.
- But Lucene indexes:
  - Don’t have that limitation.
  - Are bitmapped.
  - Let you do range queries (e.g. on integer data types or time stamps).
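For readers unfamiliar with bitmap indexes, here is a generic toy illustration of how one can answer a range query. It is not a description of Lucene's internals; the function names and the sample data are assumptions made for this sketch, and Python integers stand in for bitmaps.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i holds it."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return index

def range_query(index, low, high):
    """Return the row numbers whose value lies in the closed range [low, high]."""
    combined = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            combined |= bitmap  # OR together the bitmaps of all qualifying values
    rows = []
    row_number = 0
    while combined:
        if combined & 1:
            rows.append(row_number)
        combined >>= 1
        row_number += 1
    return rows

# Illustrative integer column; row i holds ages[i].
ages = [31, 17, 45, 31, 22]
index = build_bitmap_index(ages)
matches = range_query(index, 20, 35)
```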
Notes on Hadoop-on-Cassandra include:
- CFS (Cassandra File System) takes a 64 MB HDFS block and turns it into a 32-wide Cassandra row of 2 MB blocks.
- CFS doesn’t need special Hadoop NameNode data structures. Rather, metadata is stored in Cassandra column families, just as files are.
- This is not like the HBase/HDFS relationship. HBase runs on top of HDFS, while for Cassandra and CFS it’s the other way around.
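The block arithmetic in the first note can be checked with a short sketch. It models a Cassandra row as a dictionary from column number to sub-block. The name `split_block` and the zero-filled placeholder block are assumptions for illustration, not CFS's actual code.

```python
MB = 1024 * 1024
HDFS_BLOCK_SIZE = 64 * MB
SUB_BLOCK_SIZE = 2 * MB

def split_block(block: bytes, sub_block_size: int = SUB_BLOCK_SIZE) -> dict:
    """Split one block into numbered sub-blocks, modeled as columns of one row."""
    return {
        offset // sub_block_size: block[offset:offset + sub_block_size]
        for offset in range(0, len(block), sub_block_size)
    }

block = bytes(HDFS_BLOCK_SIZE)  # 64 MB of zero bytes as placeholder data
row = split_block(block)
columns_per_row = len(row)      # 64 MB / 2 MB
```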
DataStax emphasizes the point that DSE (DataStax Enterprise) lets you do multiple things on “the same cluster”, thus gaining operational simplicity. The essence of this claim is:
- You can have multiple “logical data centers” in one physical data center, each doing one of the things that DSE is capable of.
- You can run vanilla Cassandra and Hadoop-on-Cassandra nodes in the same logical data center, with a certain degree of interoperability or even elasticity.
- An imminent release of DataStax OpsCenter will let you manage multiple clusters together.
Vanilla Cassandra and Hadoop-on-Cassandra nodes can be combined in a single logical data center because they manage the same data structures. The two big gotchas in that are:
- Any CFS data can only reside on Hadoop-on-Cassandra nodes.
- Hadoop workloads can consume some of the resources of Hadoop-on-Cassandra nodes.
So in particular:
- Cassandra read and write quorums can include both vanilla Cassandra and Hadoop-on-Cassandra nodes.
  - The Hadoop-on-Cassandra nodes will probably be slower to respond.
  - Depending on the precise numbers involved, the slowness of Hadoop-on-Cassandra nodes in responding may or may not slow down general Cassandra response. (If we assume 2 out of 3 nodes are needed to respond, then having 1/3 of the nodes running Hadoop might not slow down overall Cassandra performance.)
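The 2-out-of-3 reasoning can be sketched as follows, under the simplifying assumption that a request goes to every replica and completes as soon as the needed number have answered. The function name and the millisecond figures are made up for illustration.

```python
def quorum_latency(response_times_ms, needed):
    """Time until `needed` replicas have answered: the needed-th fastest response."""
    if not 1 <= needed <= len(response_times_ms):
        raise ValueError("needed must be between 1 and the number of replicas")
    return sorted(response_times_ms)[needed - 1]

# Hypothetical response times in milliseconds for three replicas.
all_vanilla = quorum_latency([5, 6, 7], needed=2)   # no Hadoop nodes involved
one_busy = quorum_latency([5, 6, 80], needed=2)     # one replica is a busy Hadoop node
two_busy = quorum_latency([5, 80, 90], needed=2)    # two replicas are busy Hadoop nodes
```

In this toy model one slow replica out of three is masked entirely, because the two fast ones already form the quorum; once two of the three are slow, the quorum has to wait for one of them.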
By way of contrast, Solr-on-Cassandra nodes have additional data structures, specifically indexes, which is probably why they don’t have the same degree of interoperability with other kinds of nodes at this time. Solandra, not to be confused with Solyndra, is a different kind of Solr/Cassandra combination, without this problem. But in not using the Lucene indexes it has other issues, such as performance, and is no longer part of the DataStax offering.
On the business side, DataStax declines to follow up on its figure of >50 subscription customers over a year ago, and merely cites a figure of 140ish total customers, which apparently includes every outfit that’s bought at least one day of training.
Comments
6 Responses to “DataStax Enterprise and Cassandra revisited”
Curt: Can you further clarify your third bullet under ‘Notes on Hadoop-on-Cassandra…’??
Not sure I follow your Cassandra/CFS analogy to HBase/HDFS.
Lastly, would you consider DataStax DSE to be a Hadoop Distribution since it utilizes MapReduce (but not HDFS); Therefore, similar to MapR replacing HDFS with NFS??
Michelle,
HBase is implemented as a layer on HDFS.
But CFS is implemented as a layer on Cassandra.
I try to stay out of the definitional jockeying as to which pieces of Hadoop are required before you can claim that something is a Hadoop distribution.
FWIW, Gartner considers DataStax Enterprise a Hadoop distribution.
I would, too, given that it’s the same API. Just one commenter’s opinion…