DataStax and Cassandra update
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, with whom I’ve had strong consulting relationships both at a user and at a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas’ plans to collect a lot of sensor data and immediately offer it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. 🙂
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:
- You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for. (A minimal ingest sketch follows this list.)
- In the general case, getting it back out for low-latency analytics is problematic …
- … but there’s an increasing list of exceptions.
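To make the “in very fast” point concrete before getting to the exceptions, here’s a minimal ingest sketch using the DataStax Python driver (cassandra-driver). The keyspace, table, and schema are invented for illustration; nobody told me their application looks exactly like this. It’s just the generic asynchronous-write pattern.

```python
# Minimal Cassandra ingest sketch (pip install cassandra-driver).
# Assumed, purely illustrative schema:
#   CREATE TABLE telemetry.sensor_readings (
#       sensor_id uuid, ts bigint, reading double,
#       PRIMARY KEY (sensor_id, ts));
import random
import time
import uuid

from cassandra.cluster import Cluster


def incoming_batch():
    """Stand-in for whatever actually produces events (app servers, a Kafka consumer, etc.)."""
    now = int(time.time() * 1000)
    return [(uuid.uuid4(), now + i, random.random()) for i in range(1000)]


cluster = Cluster(['10.0.0.1', '10.0.0.2'])  # a couple of contact points; the driver finds the rest
session = cluster.connect('telemetry')

insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, ts, reading) VALUES (?, ?, ?)")

# Asynchronous writes against Cassandra's log-structured write path -- this is
# the part NoSQL stores were designed to make cheap.
futures = [session.execute_async(insert, row) for row in incoming_batch()]
for f in futures:
    f.result()  # block until acknowledged, surfacing any write errors

cluster.shutdown()
```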
For DataStax Enterprise, the exceptions start:
- Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
- Between Spark, the new user-defined functions (more on those below), and general scripting, there are several ways to do low-latency aggregations; a SparkSQL sketch follows this list. This can lead to “twinkling dashboards”.*
- DataStax is alert to the need to stream data into Cassandra.
- That’s central to the NoSQL expectation of ingesting internet data very quickly.
- Kafka, Storm and Spark Streaming all seem to be in the mix.
- Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.
*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.
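As for the Spark/SparkSQL route to those dashboards, the usual pattern is to point SparkSQL at Cassandra tables through the DataStax Spark Cassandra Connector and run the rollups there. Here’s a hedged pyspark sketch, with made-up keyspace/table/column names and a 1.4-era connector assumed to be on the classpath; exact package coordinates vary by Spark and connector version.

```python
# Dashboard-style aggregation over Cassandra data from Spark/SparkSQL.
# Assumes a spark-cassandra-connector package is available, e.g. launched with
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 ...
# Keyspace, table, and column names are illustrative.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("twinkling-dashboard")
        .set("spark.cassandra.connection.host", "10.0.0.1"))
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

# Expose a Cassandra table as a Spark DataFrame.
readings = (sql.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="telemetry", table="sensor_readings")
            .load())

# If the working set fits in memory, cache it -- this is the "in-memory speeds" point.
readings.cache()
readings.registerTempTable("readings")

# The sort of per-key rollup a dashboard might poll every few seconds.
per_sensor = sql.sql("""
    SELECT sensor_id, avg(reading) AS avg_reading, count(*) AS n
    FROM readings
    GROUP BY sensor_id
""")
per_sensor.show()
```

In practice the underlying tables are often being loaded continuously on the way in (Kafka, Storm, or Spark Streaming, per the list above), which is what makes the dashboard twinkle rather than sit still.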
DataStax Enterprise:
- Is based on Cassandra 2.1.
- Will probably never include Cassandra 2.2, waiting instead for …
- … Cassandra 3.0, which will feature a storage engine rewrite …
- … and will surely include Cassandra 2.2 features of note.
This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:
- These are functions — not libraries, a programming language, or anything like that.
- The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on. (A sketch follows this list.)
- You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
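To make that concrete, here’s a sketch of the generic Cassandra 2.2 user-defined function/aggregate pattern, submitted through the Python driver. Everything in it is illustrative: the table, the columns, and the hand-rolled average itself. Also, as I understand it, Cassandra 2.2 ships with user-defined functions disabled, so they have to be switched on in cassandra.yaml before any of this will run.

```python
# Sketch of a Cassandra 2.2 user-defined function/aggregate (the mechanism
# behind the COUNT/SUM/AVG point above). Table and column names are invented;
# requires enable_user_defined_functions: true in cassandra.yaml.
import uuid

from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect('telemetry')

# State function: fold one row's value into a (count, sum) accumulator.
session.execute("""
CREATE FUNCTION IF NOT EXISTS avg_state(state tuple<int, double>, val double)
  CALLED ON NULL INPUT
  RETURNS tuple<int, double>
  LANGUAGE java AS '
    if (val != null) {
      state.setInt(0, state.getInt(0) + 1);
      state.setDouble(1, state.getDouble(1) + val);
    }
    return state;'
""")

# Final function: turn the accumulator into an average.
session.execute("""
CREATE FUNCTION IF NOT EXISTS avg_final(state tuple<int, double>)
  CALLED ON NULL INPUT
  RETURNS double
  LANGUAGE java AS '
    return state.getInt(0) == 0 ? null
         : Double.valueOf(state.getDouble(1) / state.getInt(0));'
""")

# The aggregate itself.
session.execute("""
CREATE AGGREGATE IF NOT EXISTS my_avg(double)
  SFUNC avg_state
  STYPE tuple<int, double>
  FINALFUNC avg_final
  INITCOND (0, 0.0)
""")

# Per the caveat above, keep it to a single partition: the WHERE clause
# pins the aggregate to one sensor_id.
some_sensor = uuid.UUID('00000000-0000-0000-0000-000000000001')  # placeholder key
for row in session.execute(
        "SELECT my_avg(reading) FROM sensor_readings WHERE sensor_id = %s",
        [some_sensor]):
    print(row)
```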
And finally, some general tidbits:
- A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
- There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
- Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
- There are Cassandra users with >1 million reads+writes per second.
Finally, a couple of random notes:
- One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography; a sketch follows below. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
- As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.
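On the one-query text search point: DataStax Enterprise exposes Solr through CQL via a solr_query pseudo-column, so a single statement can pull back documents regardless of where or when they originated. A hedged sketch follows; the keyspace, table, fields, and query are invented, the table is assumed to already have a Solr core defined on it, and solr_query specifics vary by DSE version.

```python
# One-query text search over Cassandra-managed documents via DSE Search (Solr).
# Assumes a Solr core has been created on the (illustrative) documents.title_docs
# table; field names and the query itself are made up.
from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect('documents')

# A Lucene-syntax search plus a filter; matching rows may have originated in
# any region or time period, which is the "one query over many sources" point.
rows = session.execute("""
    SELECT doc_id, title, region
    FROM title_docs
    WHERE solr_query = '{"q": "body:easement AND title:deed", "fq": "region:(CA OR NV)"}'
    LIMIT 20
""")
for row in rows:
    print(row.doc_id, row.title, row.region)
```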
Comments
My guess is that the Apple workload on Cassandra is much more from iMessage than from iTunes.
And MY guess is that I’m emphatically under NDA. 🙂
[…] database guru Mark Callaghan posits that Apple’s Cassandra workload likely relates more to iMessage than iTunes, but whatever the […]
I’m not sure about Cassandra as the go-to choice for extreme write speed. It has good operational stability and some nicer *read/search* capabilities than the KVPs of the world.
High-throughput writes tend to go to memcache or Redis (see a biased benchmark: https://redislabs.com/cbc-2015-15-nosql-benchmark).
Spark adds analytics (or even simple joins and other SQL niceties) and is typically run in parallel to a persistent store.
[…] a post about DataStax, Curt Monash notes synergies between Spark and […]