Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handling a small number of operation types at very high volume and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS.
Further highlights of our talks included, as best I understood them:
- Cassandra is based in part on both Google’s BigTable paper of 2006 and Amazon’s Dynamo paper of 2007.
- The core of what Cassandra takes from BigTable is based on log-structured merge trees, which actually entered the computer science literature in 1996.
- Cassandra’s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.
- There seems to be a logical concept of “row” in Cassandra, or it’s at least meaningful to use the SQL/relational concept of a “row” when talking about Cassandra data. However, Cassandra is closer to being a column-based data store than a row-based one. (Not the same thing, but closer.)
- Even so, it only takes a single seek to return a whole Cassandra “row”.
- Cassandra writes data quite differently from the way a classical OLTP DBMS would.
- Cassandra writes just the data elements – i.e., fields – that are actually being inserted or changed, not whole rows.
- One benefit is that Cassandra data is stored very sparsely. NULLs aren’t stored in any way, and hence take up no space.
- Another benefit – and one of the core concepts of Cassandra – is that you can implicitly assume different schemas for different rows of the same “table.” In particular, you can add data for columns that you didn’t envision when you first started storing “rows” of the same “table.”
- Writes are collected into sorted “memtables,” which from time to time are flushed to disk. Once data gets to disk, it’s immutable, except for occasional merge/reorganization/garbage collection. (There’s a code sketch of this write path right after this list.)
- Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature).
- The default is that a memtable gets flushed to disk after whichever comes first of {64 MB written, 300,000 updates, 1 hour}.
- Cassandra has durability – guaranteed non-loss of data – assuming fsync is turned on. fsync seems to create a 15% or so overhead.
- However, Cassandra has no concept of a “transaction.”
- As one would expect, data can be read even before it has been persisted to disk.
- According to Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per second, on a quad-core server.
- Those figures scale pretty linearly with the number of servers. (There’s some overhead for network latency.)
- Those figures assume a five-column row.
- Cassandra’s write-performance figures are only “mildly sensitive” to the width of the row. E.g., doubling row width only gives a 15-20% throughput hit, because of some fixed per-row overhead. That said, I imagine going 100X in row width would create a major slowdown, although at that point width is probably better measured in bytes than in column count.
- Cassandra’s performance has been growing nicely in each point release. Jonathan thinks this general trend will continue.
- Jonathan thinks Cassandra is pretty good at keeping your data safe.
- Each node has a commit log.
- When a node goes down, writes intended for it are buffered on other nodes until it comes back up.
- You can run Hadoop MapReduce straight against Cassandra files.
- A Cassandra node might hold anything from 10s of gigabytes to multiple terabytes of data. You might want to go with the low end if you want to have lots of cache hits.
- Solid-state storage would speed up Cassandra reads, not writes, and is not widely used with Cassandra yet.
- Jonathan says Cassandra is really good at handling time series data, by which I suspect he means log files. Cloudkick is a user of this capability.
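To make that write path concrete, here is a minimal sketch in Python of the memtable/commit-log scheme described above. Everything in it (the class name, the JSON “segment” files, the method names) is my own toy construction for illustration, not Cassandra’s actual code or on-disk format; only the flush thresholds mirror the defaults quoted in the list.

```python
import json
import os
import time

class ToyLSMStore:
    """Toy model of Cassandra's write path: append to a commit log for
    durability, buffer writes in an in-memory memtable, and flush the
    memtable to immutable, sorted files on disk."""

    FLUSH_BYTES = 64 * 1024 * 1024   # 64 MB written
    FLUSH_OPS = 300_000              # 300k updates
    FLUSH_SECONDS = 3600             # 1 hour

    def __init__(self, data_dir, fsync=True):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.fsync = fsync
        self.log = open(os.path.join(data_dir, "commit.log"), "ab")
        self.memtable = {}    # row key -> {column: value}; sparse by nature
        self.segments = []    # immutable, sorted files already on disk
        self.bytes_written = 0
        self.ops = 0
        self.last_flush = time.time()

    def write(self, row_key, columns):
        """Store only the columns actually supplied. Absent columns (SQL
        NULLs) are never written at all, so they take up no space."""
        record = json.dumps([row_key, columns]).encode() + b"\n"
        self.log.write(record)            # durability comes from the log...
        if self.fsync:
            self.log.flush()
            os.fsync(self.log.fileno())   # ...at the ~15% cost noted above
        self.memtable.setdefault(row_key, {}).update(columns)
        self.bytes_written += len(record)
        self.ops += 1
        if (self.bytes_written >= self.FLUSH_BYTES
                or self.ops >= self.FLUSH_OPS
                or time.time() - self.last_flush >= self.FLUSH_SECONDS):
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable file. The I/O
        is purely sequential; nothing on disk is ever updated in place."""
        path = os.path.join(self.data_dir, f"segment-{len(self.segments)}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.segments.append(path)
        self.memtable = {}
        self.bytes_written = self.ops = 0
        self.last_flush = time.time()
```

Because each flushed file is sorted, range scans become merges of sorted runs rather than scattered seeks. Cassandra applies the same idea to the columns within a row, which as I understand it is what lies behind both the append/range-query claim and the time-series use case above.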
I certainly didn’t grasp everything about Cassandra’s replication and partitioning strategies. That wasn’t the focus of our talks, and anyway I got the impression they are so flexible that there’s little that can firmly be said about them. But I did come away with these impressions:
- You set your consistency rules per operation, in the Cassandra API, not on a per-table basis. (I think this means that a lack of administrative tools is supposedly a feature, not a drawback.)
- As a practical matter, Cassandra users commonly take one of two approaches to consistency:
- RYW (read-your-writes) consistency, most commonly with N = 3 and R = W = 2. (The first sketch after this list shows why those numbers work.)
- Geographically dispersed eventual consistency.
- Cassandra data is most commonly distributed via consistent hashing, but other options are “pluggable.” (The second sketch after this list illustrates the idea.)
- If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node online before the old node is so maxed out that it doesn’t have time to do the shipping.
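To see why N = 3 with R = W = 2 yields read-your-writes, here is a minimal Python sketch of quorum reads and writes. It is my own toy model, not Cassandra’s API; the point is simply the arithmetic that R + W > N forces read and write quorums to overlap.

```python
import itertools

N, R, W = 3, 2, 2   # copies per key; replicas consulted per read / per write

# Toy replica set: each replica maps key -> (timestamp, value).
replicas = [dict() for _ in range(N)]

def write(key, value, ts, targets):
    """A write is acknowledged once W replicas have it; we write to
    exactly W here to model the worst case the guarantee must cover."""
    for r in targets[:W]:
        replicas[r][key] = (ts, value)

def read(key, targets):
    """Ask R replicas and let the highest timestamp win."""
    answers = [replicas[r].get(key, (0, None)) for r in targets[:R]]
    return max(answers, key=lambda a: a[0])[1]

write("user:42", "fresh value", ts=1, targets=[0, 1, 2])

# Since R + W > N, any R replicas must overlap the W that acknowledged
# the write, so every possible read sees it: read-your-writes.
for combo in itertools.combinations(range(N), R):
    assert read("user:42", list(combo)) == "fresh value"
```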
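And here is a similarly toy sketch of consistent hashing: nodes and keys hash onto the same ring, and a key’s replicas are the next N distinct nodes clockwise from its position. The use of MD5 matches Cassandra’s RandomPartitioner; the Ring class and everything else is mine, for illustration only.

```python
import bisect
import hashlib

def token(s):
    """Place a key or node name on the ring by hashing it (MD5, as in
    Cassandra's RandomPartitioner)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas_for(self, key, n=3):
        """Walk clockwise from the key's token, collecting the next n nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect(tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(n)]

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.replicas_for("user:42"))   # three nodes, clockwise from the key

# Adding a node changes ownership only of the arc between the new node and
# its predecessor, which is why a loaded node can ship just part of its
# data to a newcomer, as described in the last bullet above.
ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"])
print(ring.replicas_for("user:42"))
```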
When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw whereby garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce Cassandra developers’ schema-management burden.
Related links
- My March NoSQL links post included the Google and Amazon papers
- The March 2, 2010 Cloudkick post, also linked above, goes into a lot of detail, including what they think is great about Cassandra and what they think is still missing