Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handling a small number of operation types at very high volume and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS.
Further highlights of our talks included, as best I understood them:
- Cassandra is based in part on both Google’s BigTable paper of 2006 and Amazon’s Dynamo paper of 2007.
- The core of what Cassandra takes from BigTable is based on log-structured merge trees, which actually entered the computer science literature in 1996.
- Cassandra’s approach to horizontal scaling, replication, failover, etc. seems to be based on Dynamo.
- There seems to be a logical concept of “row” in Cassandra, or it’s at least meaningful to use the SQL/relational concept of a “row” when talking about Cassandra data. However, Cassandra is closer to being a column-based data store than a row-based one. (Not the same thing, but closer.)
- Even so, it only takes a single seek to return a whole Cassandra “row”.
- Cassandra writes data quite differently from the way a classical OLTP DBMS would.
- Cassandra writes just the data elements – i.e., fields – that are actually being inserted or changed, not whole rows.
- One benefit is that Cassandra data is stored very sparsely. NULLs aren’t stored in any way, and hence take up no space.
- Another benefit – and one of the core concepts of Cassandra – is that you can implicitly assume different schemas for different rows of the same “table.” In particular, you can add data for columns that you didn’t envision when you first started storing “rows” of the same “table.”
- Writes are collected into sorted “memtables,” which from time to time are flushed to disk. Once data gets to disk, it’s immutable, except for occasional merge/reorganization/garbage collection. (There’s a code sketch of this write path right after this list.)
- Jonathan claims, plausibly, that this makes write throughput very fast (because the I/O is fundamentally sequential in nature).
- The default is that a memtable gets flushed to disk after whichever comes first of {64 MB written, 300,000 updates, 1 hour}.
- Cassandra has durability – guaranteed non-loss of data – assuming fsync is turned on. fsync seems to create a 15% or so overhead.
- However, Cassandra has no concept of a “transaction.”
- As one would expect, data can be read even before it has been persisted to disk.
- According to Jonathan, Cassandra can do about 14,000 writes or 7,000 reads per second, on a quad-core server.
- Those figures scale pretty linearly with the number of servers. (There’s some overhead for network latency.)
- Those figures assume a five-column row.
- Cassandra’s write-performance figures are only “mildly sensitive” to the width of the row. E.g., doubling row width only gives a 15-20% throughput hit, because of some fixed per-row overhead. That said, I imagine going 100X in row width would create a major slowdown, although at that point width is probably better measured in bytes than in column count.
- Cassandra’s performance has been growing nicely in each point release. Jonathan thinks this general trend will continue.
- Jonathan thinks Cassandra is pretty good at keeping your data safe.
- Each node has a commit log.
- When a node goes down, writes intended for it are buffered on other nodes until it comes back up.
- You can run Hadoop MapReduce straight against Cassandra files.
- A Cassandra node might hold anything from 10s of gigabytes to multiple terabytes of data. You might want to go with the low end if you want to have lots of cache hits.
- Solid-state storage would speed up Cassandra reads, not writes, and is not widely used with Cassandra yet.
- Jonathan says Cassandra is really good at handling time series data, by which I suspect he means log files. Cloudkick is a user of this capability.
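To make that write path concrete, here is a minimal sketch in Python of the memtable/commit-log scheme described above. Everything in it (the class name, the JSON “segment” files, the method names) is my own toy construction for illustration, not Cassandra’s actual code or on-disk format; only the flush thresholds mirror the defaults quoted in the list.

```python
import json
import os
import time

class ToyLSMStore:
    """Toy model of Cassandra's write path: append to a commit log for
    durability, buffer writes in an in-memory memtable, and flush the
    memtable to immutable, sorted files on disk."""

    FLUSH_BYTES = 64 * 1024 * 1024   # 64 MB written
    FLUSH_OPS = 300_000              # 300k updates
    FLUSH_SECONDS = 3600             # 1 hour

    def __init__(self, data_dir, fsync=True):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.fsync = fsync
        self.log = open(os.path.join(data_dir, "commit.log"), "ab")
        self.memtable = {}    # row key -> {column: value}; sparse by nature
        self.segments = []    # immutable, sorted files already on disk
        self.bytes_written = 0
        self.ops = 0
        self.last_flush = time.time()

    def write(self, row_key, columns):
        """Store only the columns actually supplied. Absent columns (SQL
        NULLs) are never written at all, so they take up no space."""
        record = json.dumps([row_key, columns]).encode() + b"\n"
        self.log.write(record)            # durability comes from the log...
        if self.fsync:
            self.log.flush()
            os.fsync(self.log.fileno())   # ...at the ~15% cost noted above
        self.memtable.setdefault(row_key, {}).update(columns)
        self.bytes_written += len(record)
        self.ops += 1
        if (self.bytes_written >= self.FLUSH_BYTES
                or self.ops >= self.FLUSH_OPS
                or time.time() - self.last_flush >= self.FLUSH_SECONDS):
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable file. The I/O
        is purely sequential; nothing on disk is ever updated in place."""
        path = os.path.join(self.data_dir, f"segment-{len(self.segments)}.json")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.segments.append(path)
        self.memtable = {}
        self.bytes_written = self.ops = 0
        self.last_flush = time.time()
```

Because each flushed file is sorted, range scans become merges of sorted runs rather than scattered seeks. Cassandra applies the same idea to the columns within a row, which as I understand it is what lies behind both the append/range-query claim and the time-series use case above.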
I certainly didn’t grasp everything about Cassandra’s replication and partitioning strategies. That wasn’t the focus of our talks, and anyway I got the impression they are so flexible that there’s little that can firmly be said about them. But I did come away with these impressions:
- You set your consistency rules per operation, in the Cassandra API, not on a per-table basis. (I think this means that a lack of administrative tools is supposedly a feature, not a drawback.)
- As a practical matter, Cassandra users commonly take one of two approaches to consistency:
- RYW (read-your-writes) consistency, most commonly with N = 3 and R = W = 2. (The first sketch after this list shows why those numbers work.)
- Geographically dispersed eventual consistency.
- Cassandra data is most commonly distributed via consistent hashing, but other options are “pluggable.” (The second sketch after this list illustrates the idea.)
- If you add a node, the busiest node automagically decides to ship some data over, reducing its load. Of course, this only works if you get the new node online before the old node is so maxed out that it doesn’t have time to do the shipping.
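To see why N = 3 with R = W = 2 yields read-your-writes, here is a minimal Python sketch of quorum reads and writes. It is my own toy model, not Cassandra’s API; the point is simply the arithmetic that R + W > N forces read and write quorums to overlap.

```python
import itertools

N, R, W = 3, 2, 2   # copies per key; replicas consulted per read / per write

# Toy replica set: each replica maps key -> (timestamp, value).
replicas = [dict() for _ in range(N)]

def write(key, value, ts, targets):
    """A write is acknowledged once W replicas have it; we write to
    exactly W here to model the worst case the guarantee must cover."""
    for r in targets[:W]:
        replicas[r][key] = (ts, value)

def read(key, targets):
    """Ask R replicas and let the highest timestamp win."""
    answers = [replicas[r].get(key, (0, None)) for r in targets[:R]]
    return max(answers, key=lambda a: a[0])[1]

write("user:42", "fresh value", ts=1, targets=[0, 1, 2])

# Since R + W > N, any R replicas must overlap the W that acknowledged
# the write, so every possible read sees it: read-your-writes.
for combo in itertools.combinations(range(N), R):
    assert read("user:42", list(combo)) == "fresh value"
```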
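And here is a similarly toy sketch of consistent hashing: nodes and keys hash onto the same ring, and a key’s replicas are the next N distinct nodes clockwise from its position. The use of MD5 matches Cassandra’s RandomPartitioner; the Ring class and everything else is mine, for illustration only.

```python
import bisect
import hashlib

def token(s):
    """Place a key or node name on the ring by hashing it (MD5, as in
    Cassandra's RandomPartitioner)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas_for(self, key, n=3):
        """Walk clockwise from the key's token, collecting the next n nodes."""
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect(tokens, token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(n)]

ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
print(ring.replicas_for("user:42"))   # three nodes, clockwise from the key

# Adding a node changes ownership only of the arc between the new node and
# its predecessor, which is why a loaded node can ship just part of its
# data to a newcomer, as described in the last bullet above.
ring = Ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"])
print(ring.replicas_for("user:42"))
```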
When we talked in March, the next release of Cassandra was going to be 0.7. Cassandra 0.7 was going to be a performance/scalability release, for example fixing the flaw whereby garbage collection read rows into memory one at a time. After that, Cassandra 0.8 was to be a feature release, with one planned feature being more automatic index management and/or materialized-view-like capability, so as to reduce Cassandra developers’ schema-management burden.
Related links
- My March NoSQL links post included the Google and Amazon papers
- The March 2, 2010 Cloudkick post, also linked above, goes into a lot of detail, including what they think is great about Cassandra and what they think is still missing