March 27, 2012

DataStax Enterprise and Cassandra revisited

My last post about DataStax Enterprise and Cassandra didn’t go so well. As a follow-up, I chatted for two hours with Rick Branson and Billy Bosworth of DataStax. Hopefully I can do better this time around.

For starters, let me say there are three kinds of data management nodes in DataStax Enterprise:

Vanilla Cassandra.
Cassandra plus Solr. Solr is a superset of the text-indexing system Lucene.
- Solr adds a lot more secondary indexing to Cassandra.
- In addition, these nodes serve as Solr emulation; you can run generic Solr apps on them.
Cassandra plus Hadoop.
- You can use Hadoop MapReduce to manipulate generic Cassandra data.
- In addition, these nodes serve as Hadoop/HDFS (Hadoop Distributed File System) emulation; you can run generic Hadoop apps on them.
- Hadoop jobs can interweave access to the two kinds of data structure.

Cassandra, Solr, Lucene, and Hadoop are all Apache projects.

If we look at this from the standpoint of DML (Data Manipulation Language) and data access APIs:

Cassandra is a column-group kind of NoSQL DBMS. You can get at its data programmatically.
There’s something called CQL (Cassandra Query Language), said to be SQL-like.
There’s a JDBC driver for CQL.
With Hadoop MapReduce also come Hive, Pig, and Sqoop.
With Solr and Lucene come full-text search.

In addition, it is sometimes recommended that you use “in-entity caching”, where an entire data structure (e.g. in JSON) winds up in a single Cassandra column.
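
To make in-entity caching a little more concrete, here is a minimal sketch. It models a Cassandra row as a plain Python dict of column name to (value, timestamp) and stores a whole JSON-serialized entity in one column. The helper names (`put_entity`, `get_entity`), the `"entity"` column, and the sample data are all hypothetical; this is not the Cassandra client API.

```python
import json
import time

# Toy model: row key -> {column name -> (value, timestamp)}.
# This only mimics the shape of a Cassandra row; it is not a client API.
column_family = {}

def put_entity(row_key, entity, column="entity"):
    """Serialize the whole entity to JSON and store it in a single column."""
    row = column_family.setdefault(row_key, {})
    row[column] = (json.dumps(entity), time.time())

def get_entity(row_key, column="entity"):
    """Read the single column back and deserialize it."""
    value, _timestamp = column_family[row_key][column]
    return json.loads(value)

put_entity("user:42", {"name": "Ada", "emails": ["ada@example.com"], "plan": "trial"})
user = get_entity("user:42")
```

The trade-off is that the entity comes back in one read, but the database sees only an opaque string, so individual fields inside the JSON can't be updated or indexed on their own.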

The two main ways to get direct SQL* access to data in DataStax Enterprise are:

JDBC/SQL.
Hive/Hadoop.

*or very SQL-like, depending on how you view things

Before going further, let’s recall some Cassandra basics:

In a Cassandra column group, you have rowIDs and, associated with each rowID, a collection of (name, value) pairs.
- This is a lot like a relational table — albeit a denormalized and sparse one — where each name is like a column header.
- In addition, every name-value pair has a time-stamp.
- Cassandra has data typing — for example, there’s a concept of integer values.
The Cassandra replication/consistency model starts with a generic quorum/consistent hashing approach. In addition:
- You can force replicas to go to different data centers, or to different racks in the same data center.
- You can update remote data centers asynchronously, with only local quorums being required for reads or writes to go through.
- Alternatively, you can keep remote data centers in sync by having writes require a quorum from all data centers (whereas reads would only require a local quorum).
Cassandra has no head node.
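
The two multi-data-center options above come down to simple arithmetic over acknowledgement counts. The sketch below is my own illustration, not DataStax code; the function names, data center names, and replica counts are made up. I am reading "a quorum from all data centers" as a quorum within each data center, which I take to correspond roughly to Cassandra's EACH_QUORUM consistency level, with LOCAL_QUORUM being the local-only case.

```python
def quorum(replicas):
    """Smallest majority of a replica count, e.g. 2 of 3."""
    return replicas // 2 + 1

def write_accepted(acks_by_dc, replicas_by_dc, local_dc, require_all_dcs):
    """Decide whether a write has enough acknowledgements.

    require_all_dcs=False: only the coordinator's local data center needs a quorum.
    require_all_dcs=True: every data center needs its own quorum.
    """
    dcs = replicas_by_dc if require_all_dcs else [local_dc]
    return all(acks_by_dc.get(dc, 0) >= quorum(replicas_by_dc[dc]) for dc in dcs)

# Hypothetical cluster: three replicas in each of two data centers.
replicas = {"us-east": 3, "eu-west": 3}
acks = {"us-east": 2, "eu-west": 1}   # the remote data center is lagging

local_only = write_accepted(acks, replicas, "us-east", require_all_dcs=False)
all_dcs = write_accepted(acks, replicas, "us-east", require_all_dcs=True)
```

With these numbers the local-quorum write goes through while the all-data-centers write does not, which is the latency-versus-synchronization trade-off in a nutshell.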

The story for Solr/Lucene indexing, beyond text search and so on, goes like this:

Cassandra has a secondary indexing capability, but it insists on examining all nodes, and hence has acceptable performance only in certain use cases.
But Lucene indexes:
- Don’t have that limitation.
- Are bitmapped.
- Let you do range queries (e.g. on integer data types or time stamps).
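
For readers who haven't met bitmap indexes, here is a toy version of the general idea, using Python integers as bitmaps. It is emphatically not how Lucene lays out its indexes internally; the function names and sample data are my own assumptions, and the point is only to show how a range query can be answered by OR-ing per-value bitmaps.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return index

def range_query(index, low, high):
    """Return sorted row ids whose value lies in [low, high]."""
    combined = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            combined |= bitmap          # union of the matching bitmaps
    return [i for i in range(combined.bit_length()) if (combined >> i) & 1]

ages = [31, 17, 45, 31, 22]             # hypothetical integer column, one entry per row
index = build_bitmap_index(ages)
matches = range_query(index, 20, 35)
```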

Notes on Hadoop-on-Cassandra include:

CFS takes a 64 MB HDFS block and turns it into a 32-wide Cassandra row of 2 MB blocks.
CFS doesn’t need special Hadoop NameNode data structures. Rather, metadata is stored in Cassandra column families, just as files are.
This is not like the HBase/HDFS relationship. HBase runs on top of HDFS, while for Cassandra and CFS it’s the other way around.
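
The 64 MB to 32 x 2 MB arithmetic can be sketched as follows. This is only an illustration of the chunking; keying the sub-blocks by an integer column index is my assumption, not the actual CFS schema.

```python
MB = 1024 * 1024
HDFS_BLOCK_SIZE = 64 * MB
SUB_BLOCK_SIZE = 2 * MB

def split_block(block, sub_block_size=SUB_BLOCK_SIZE):
    """Split one HDFS-sized block into fixed-size sub-blocks, keyed by column index."""
    return {
        offset // sub_block_size: block[offset:offset + sub_block_size]
        for offset in range(0, len(block), sub_block_size)
    }

block = bytes(HDFS_BLOCK_SIZE)   # 64 MB of zero bytes standing in for file data
row = split_block(block)         # one "row" whose columns are the sub-blocks
columns_per_row = len(row)
```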

DataStax emphasizes the point that DSE (DataStax Enterprise) lets you do multiple things on “the same cluster”, thus gaining operational simplicity. The essence of this claim is:

You can have multiple “logical data centers” in one physical data center, each doing one of the things that DSE is capable of.
You can run vanilla Cassandra or Hadoop-on-Cassandra nodes in the same logical data center, with a certain degree of interoperability or even elasticity.
An imminent release of DataStax OpsCenter will let you manage multiple clusters together.

Vanilla Cassandra and Hadoop-on-Cassandra nodes can be combined in a single logical data center because they manage the same data structures. The two big gotchas in that are:

Any CFS data can only reside on Hadoop-on-Cassandra nodes.
Hadoop workloads can consume some of the resources of Hadoop-on-Cassandra nodes.

So in particular:

Cassandra read and write quorums can include both vanilla Cassandra and Hadoop-on-Cassandra nodes.
The Hadoop-on-Cassandra nodes will probably be slower to respond.
Depending on the precise numbers involved, the slowness of Hadoop-on-Cassandra nodes in responding may or may not slow down general Cassandra response. (If we assume 2 out of 3 nodes are needed to respond, then having 1/3 of the nodes running Hadoop might not slow down overall Cassandra performance.)
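
The parenthetical above amounts to saying that a quorum request finishes when the k-th fastest replica answers. A minimal sketch, with latency figures invented purely for illustration:

```python
def quorum_latency(replica_latencies_ms, required):
    """Latency of a request that completes once `required` replicas have answered."""
    return sorted(replica_latencies_ms)[required - 1]

# Hypothetical latencies: two vanilla Cassandra replicas and one busy Hadoop replica.
one_slow = quorum_latency([5, 6, 80], required=2)

# Same replica set, but now two of the three replicas are busy with Hadoop work.
two_slow = quorum_latency([5, 80, 90], required=2)
```

With one slow replica out of three, the two fast ones satisfy the quorum and the slow node never matters. Once two of the three are slow, the request has to wait for one of them.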

By way of contrast, Solr-on-Cassandra nodes have additional data structures, specifically indexes, which is probably why they don’t have the same degree of interoperability with other kinds of nodes at this time. Solandra, not to be confused with Solyndra, is a different kind of Solr/Cassandra combination, without this problem. But in not using the Lucene indexes it has other issues, such as performance, and is no longer part of the DataStax offering.

On the business side, DataStax declines to follow up on its figure of >50 subscription customers from over a year ago, and merely cites a figure of 140ish total customers, which apparently includes every outfit that’s bought at least one day of training.

Categories: Cassandra, DataStax, Hadoop, MapReduce, Market share and customer counts, NoSQL, Open source, Text


Comments

6 Responses to “DataStax Enterprise and Cassandra revisited”

DataStax Enterprise 2.0 : DBMS 2 : DataBase Management System Services on March 27th, 2012 3:47 pm

[…] Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra. […]
Michelle Agul on March 27th, 2012 5:25 pm

Curt: Can you further clarify your third bullet under ‘Notes on Hadoop-on-Cassandra…’??
Not sure I follow your Cassandra/CFS analogy to HBase/HDFS.

Lastly, would you consider DataStax DSE to be a Hadoop Distribution since it utilizes MapReduce (but not HDFS); Therefore, similar to MapR replacing HDFS with NFS??
Curt Monash on March 27th, 2012 5:35 pm

Michelle,

HBase is implemented as a layer on HDFS.

But CFS is implemented as a layer on Cassandra.
Curt Monash on March 27th, 2012 5:36 pm

I try to stay out of the definitional jockeying as to which pieces of Hadoop are required before you can claim that something is a Hadoop distribution.
Joe on March 28th, 2012 10:00 am

FWIW, Gartner considers DataStax Enterprise a Hadoop distribution.

I would, too, given that it’s the same API. Just one commenter’s opinion…
“Enterprise-ready Hadoop” | DBMS 2 : DataBase Management System Services on June 19th, 2012 8:42 pm

[…] Amazon) cloud, or in some cases on a cluster shared with another data management systems. (E.g. DataStax/Cassandra, Hadapt/PostgreSQL, or IBM Netezza.) Anyhow, requiring a dedicated cluster isn’t a […]
