DataStax and Cassandra update
MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, with whom I’ve had strong consulting relationships both at a user and at a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.
It seems fair to say that in most cases:
- Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
- Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.
Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:
- DataStax trumpets British Gas’ plans to collect a lot of sensor data and immediately offer it up for analysis.*
- Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
- A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.
*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. 🙂
While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:
- You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for. (A minimal ingest sketch follows this list.)
- In the general case, getting it back out for low-latency analytics is problematic …
- … but there’s an increasing list of exceptions.
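To make the “in very fast” point concrete before getting to the exceptions, here’s a minimal ingest sketch using the DataStax Python driver (cassandra-driver). The keyspace, table, and schema are invented for illustration; nobody told me their application looks exactly like this. It’s just the generic asynchronous-write pattern.

```python
# Minimal Cassandra ingest sketch (pip install cassandra-driver).
# Assumed, purely illustrative schema:
#   CREATE TABLE telemetry.sensor_readings (
#       sensor_id uuid, ts bigint, reading double,
#       PRIMARY KEY (sensor_id, ts));
import random
import time
import uuid

from cassandra.cluster import Cluster


def incoming_batch():
    """Stand-in for whatever actually produces events (app servers, a Kafka consumer, etc.)."""
    now = int(time.time() * 1000)
    return [(uuid.uuid4(), now + i, random.random()) for i in range(1000)]


cluster = Cluster(['10.0.0.1', '10.0.0.2'])  # a couple of contact points; the driver finds the rest
session = cluster.connect('telemetry')

insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, ts, reading) VALUES (?, ?, ?)")

# Asynchronous writes against Cassandra's log-structured write path -- this is
# the part NoSQL stores were designed to make cheap.
futures = [session.execute_async(insert, row) for row in incoming_batch()]
for f in futures:
    f.result()  # block until acknowledged, surfacing any write errors

cluster.shutdown()
```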
For DataStax Enterprise, the exceptions start:
- Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
- Between Spark, the new user-defined functions (more on those below), and general scripting, there are several ways to do low-latency aggregations; a SparkSQL sketch follows this list. This can lead to “twinkling dashboards”.*
- DataStax is alert to the need to stream data into Cassandra.
- That’s central to the NoSQL expectation of ingesting internet data very quickly.
- Kafka, Storm and Spark Streaming all seem to be in the mix.
- Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.
*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.
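As for the Spark/SparkSQL route to those dashboards, the usual pattern is to point SparkSQL at Cassandra tables through the DataStax Spark Cassandra Connector and run the rollups there. Here’s a hedged pyspark sketch, with made-up keyspace/table/column names and a 1.4-era connector assumed to be on the classpath; exact package coordinates vary by Spark and connector version.

```python
# Dashboard-style aggregation over Cassandra data from Spark/SparkSQL.
# Assumes a spark-cassandra-connector package is available, e.g. launched with
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 ...
# Keyspace, table, and column names are illustrative.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("twinkling-dashboard")
        .set("spark.cassandra.connection.host", "10.0.0.1"))
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

# Expose a Cassandra table as a Spark DataFrame.
readings = (sql.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="telemetry", table="sensor_readings")
            .load())

# If the working set fits in memory, cache it -- this is the "in-memory speeds" point.
readings.cache()
readings.registerTempTable("readings")

# The sort of per-key rollup a dashboard might poll every few seconds.
per_sensor = sql.sql("""
    SELECT sensor_id, avg(reading) AS avg_reading, count(*) AS n
    FROM readings
    GROUP BY sensor_id
""")
per_sensor.show()
```

In practice the underlying tables are often being loaded continuously on the way in (Kafka, Storm, or Spark Streaming, per the list above), which is what makes the dashboard twinkle rather than sit still.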
DataStax Enterprise:
- Is based on Cassandra 2.1.
- Will probably never include Cassandra 2.2, waiting instead for …
- … Cassandra 3.0, which will feature a storage engine rewrite …
- … and will surely include Cassandra 2.2 features of note.
This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:
- These are functions — not libraries, a programming language, or anything like that.
- The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on. (A sketch follows this list.)
- You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
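To make that concrete, here’s a sketch of the generic Cassandra 2.2 user-defined function/aggregate pattern, submitted through the Python driver. Everything in it is illustrative: the table, the columns, and the hand-rolled average itself. Also, as I understand it, Cassandra 2.2 ships with user-defined functions disabled, so they have to be switched on in cassandra.yaml before any of this will run.

```python
# Sketch of a Cassandra 2.2 user-defined function/aggregate (the mechanism
# behind the COUNT/SUM/AVG point above). Table and column names are invented;
# requires enable_user_defined_functions: true in cassandra.yaml.
import uuid

from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect('telemetry')

# State function: fold one row's value into a (count, sum) accumulator.
session.execute("""
CREATE FUNCTION IF NOT EXISTS avg_state(state tuple<int, double>, val double)
  CALLED ON NULL INPUT
  RETURNS tuple<int, double>
  LANGUAGE java AS '
    if (val != null) {
      state.setInt(0, state.getInt(0) + 1);
      state.setDouble(1, state.getDouble(1) + val);
    }
    return state;'
""")

# Final function: turn the accumulator into an average.
session.execute("""
CREATE FUNCTION IF NOT EXISTS avg_final(state tuple<int, double>)
  CALLED ON NULL INPUT
  RETURNS double
  LANGUAGE java AS '
    return state.getInt(0) == 0 ? null
         : Double.valueOf(state.getDouble(1) / state.getInt(0));'
""")

# The aggregate itself.
session.execute("""
CREATE AGGREGATE IF NOT EXISTS my_avg(double)
  SFUNC avg_state
  STYPE tuple<int, double>
  FINALFUNC avg_final
  INITCOND (0, 0.0)
""")

# Per the caveat above, keep it to a single partition: the WHERE clause
# pins the aggregate to one sensor_id.
some_sensor = uuid.UUID('00000000-0000-0000-0000-000000000001')  # placeholder key
for row in session.execute(
        "SELECT my_avg(reading) FROM sensor_readings WHERE sensor_id = %s",
        [some_sensor]):
    print(row)
```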
And finally, some general tidbits:
- A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
- There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
- Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
- There are Cassandra users with >1 million reads+writes per second.
Finally, a couple of random notes:
- One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography; a sketch follows below. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
- As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.
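On the one-query text search point: DataStax Enterprise exposes Solr through CQL via a solr_query pseudo-column, so a single statement can pull back documents regardless of where or when they originated. A hedged sketch follows; the keyspace, table, fields, and query are invented, the table is assumed to already have a Solr core defined on it, and solr_query specifics vary by DSE version.

```python
# One-query text search over Cassandra-managed documents via DSE Search (Solr).
# Assumes a Solr core has been created on the (illustrative) documents.title_docs
# table; field names and the query itself are made up.
from cassandra.cluster import Cluster

session = Cluster(['10.0.0.1']).connect('documents')

# A Lucene-syntax search plus a filter; matching rows may have originated in
# any region or time period, which is the "one query over many sources" point.
rows = session.execute("""
    SELECT doc_id, title, region
    FROM title_docs
    WHERE solr_query = '{"q": "body:easement AND title:deed", "fq": "region:(CA OR NV)"}'
    LIMIT 20
""")
for row in rows:
    print(row.doc_id, row.title, row.region)
```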
Comments
My guess is that the Apple workload on Cassandra is much more from iMessage than from iTunes.
And MY guess is that I’m emphatically under NDA. 🙂
[…] database guru Mark Callaghan posits that Apple’s Cassandra workload likely relates more to iMessage than iTunes, but whatever the […]
I’m not sure about Cassandra as the go-to choice for extreme write speed. It has good operational stability and some nicer *read/search* capabilities than the KVPs of the world.
High-throughput writes tend to go to memcache or Redis (see a biased benchmark: https://redislabs.com/cbc-2015-15-nosql-benchmark).
Spark adds analytics (or even simple joins and other SQL niceties) and is typically run in parallel to a persistent store.
[…] a post about DataStax, Curt Monash notes synergies between Spark and […]