Cassandra
Analysis and discussion of the open source data management project Cassandra. Related subjects include:
- Riptano, a company founded to commercialize Cassandra
- The NoSQL movement
- Open source data management technologies
- Facebook, the originator of Cassandra
Cassandra company DataStax (formerly Riptano) is on track
Riptano, the Cassandra company, has changed its name to DataStax. DataStax has opened headquarters in Burlingame and hired some database-experienced folks – notably Ben Werther from Greenplum and Michael Weir from ParAccel, with Zenobia Godschalk (who worked with Aster Data) somewhere in the outside PR mix. Other than that, what’s new at DataStax is pretty much what could have been expected based on what DataStax folks said last spring.
Most notably, DataStax is introducing a software offering, whose full name is DataStax OpsCenter for Apache Cassandra. DataStax OpsCenter for Apache Cassandra seems to be, in essence, a monitoring tool for Cassandra clusters, with a bit of capacity planning bundled in. (If there are any outright operations parts to DataStax OpsCenter, they got overlooked in our conversation.)*
Categories: Cassandra, DataStax, Market share and customer counts, NoSQL, Specific users, Telecommunications | 1 Comment |
More on NoSQL and HVSP (or OLRP)
Since posting last Wednesday morning that I’m looking into NoSQL and HVSP, I’ve had a lot of conversations, including with (among others):
- Dwight Merriman of 10gen (MongoDB)
- Damien Katz of Couchio (CouchDB)
- Matt Pfeil of Riptano (Cassandra)
- Todd Lipcon of Cloudera (HBase committer)
- Tony Falco of Basho (Riak)
- John Busch of Schooner
- Ori Herrnstadt of Akiban
I’m collecting data points on NoSQL and HVSP adoption
I was asked to do a magazine article on NoSQL, where by “NoSQL” is meant “whatever they talk about at NoSQL conferences.” By now the number of publications planning to run the article is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL alike.
It also is understood that, realistically, I can’t be expected to know and mention the very latest news for all the many products in the categories. Even so, I think this would be a fine time to check just where NoSQL and HVSP adoption stand. Here is most of what I know, or links to same; it would be great if you guys would contribute additional data in the comment thread.
In the NoSQL area:
Links and observations
I’m back from a trip to the SF Bay area, with a lot of writing ahead of me. I’ll dive in with some quick comments here, then write at greater length about some of these points when I can. From my trip:
Categories: Analytic technologies, Aster Data, Calpont, Cassandra, Couchbase, Data warehouse appliances, Data warehousing, EMC, Exadata, Facebook, Greenplum, HP and Neoview, Kickfire, NoSQL, OLTP, ParAccel, Sybase, XtremeData | 1 Comment |
Riptano, and Cassandra adoption
Tonight’s Cassandra technology post got plenty long enough on its own, so I’m separating out business and adoption issues here. For starters, known Cassandra users include:
- Facebook, which has said it has 150 or so Cassandra nodes (but see below)
- Twitter, which has said it has 45 or so Cassandra nodes
- Rackspace, which used to be Jonathan Ellis’ employer, and now is backing Cassandra company Riptano
- Digg, which along with Twitter and Rackspace was one of the three major users helping advance the Cassandra project
- OpenX, SimpleGeo, and Digital Reasoning, which Jonathan cited as production users in March
- Cloudkick, as noted and linked in my other post
- Two customers Riptano named at launch (but I’ve forgotten who they were*)
Fetlife, Meebo, and others seem to at least have a healthy interest in Cassandra, based on their level of involvement in a forthcoming Cassandra Summit. That said, the @Fetlife tweetstream features numerous yelps of pain, and I don’t mean the recreational kind.
Categories: Cassandra, DataStax, Facebook, Market share and customer counts, NoSQL, Open source, Parallelization, Pricing, Specific users | 5 Comments |
Cassandra technical overview
Back in March, I talked with Jonathan Ellis of Rackspace, who runs the Apache Cassandra project. I started drafting a blog post then, but never put it up. Then Jonathan cofounded Riptano, a company to commercialize Cassandra, and so I talked with him again in May. Well, I’m finally finding time to clear my Cassandra/Riptano backlog. I’ll cover the more technical parts below, and the more business- or usage-oriented ones in a companion Cassandra/Riptano post.
Jonathan’s core claims for Cassandra include:
- Cassandra is shared-nothing.
- Cassandra has good approaches to replication and partitioning, right out of the box.
- In particular, Cassandra is good for use cases that distribute a database around the world and want to access it at “local” latencies. (Indeed, Jonathan asserts that non-local replication is a significant non-big-data Cassandra use case.)
- Cassandra’s scale-out is application-transparent, unlike sharded MySQL’s.
- Cassandra is fast at both appends and range queries, which would be hard to accomplish in a pure key-value store.
In general, Jonathan positions Cassandra as being best-suited to handle a small number of operations at high volume, throughput, and speed. The rest of what you do, as far as he’s concerned, may well belong in a more traditional SQL DBMS.
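To see why the claim about range queries is a real distinction from a pure key-value store, here is a minimal sketch contrasting a hash-based store with one that keeps its keys sorted. It illustrates the general idea only and is not Cassandra's actual storage engine; the class names `HashStore` and `SortedStore` are hypothetical.

```python
import bisect


class HashStore:
    """Pure key-value store (hypothetical): no key ordering is kept."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def range(self, lo, hi):
        # Nothing to exploit here, so every stored key must be examined.
        return sorted((k, v) for k, v in self._data.items() if lo <= k < hi)


class SortedStore:
    """Store that keeps its keys in sorted order (hypothetical)."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._data = {}   # key -> value

    def put(self, key, value):
        # A key larger than all existing ones (e.g. a timestamp)
        # lands at the end of the list, so appends stay cheap.
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def range(self, lo, hi):
        # Binary search finds both ends of the half-open range [lo, hi);
        # only the matching keys are then touched.
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[start:end]]
```

Both stores return the same answer for a range query; the difference is that the hash-based one has to scan everything to get it, while the sorted one only visits the keys in the range.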
Categories: Amazon and its cloud, Cassandra, DataStax, Facebook, Google, Log analysis, NoSQL, Open source, Parallelization | 4 Comments |
Daniel Abadi on NoSQL design tradeoffs
In a thought-provoking post, Daniel Abadi points out NoSQL-related terminological problems similar to the ones I just railed against, and argues
To me, CAP should really be PACELC — if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)?
and goes on to say
For example, Amazon’s Dynamo (and related systems like Cassandra and SimpleDB) are PA/EL in PACELC — upon a partition, they give up consistency for availability; and under normal operation they give up consistency for lower latency. Giving up C in both parts of PACELC makes the design simpler — once the application is configured to be able to handle inconsistencies, it makes sense to give up consistency for both availability and lower latency.
However, I think Daniel’s improved formulation is still misleading, in at least two ways:
- Daniel implicitly assumes any given NoSQL system makes a fixed set of tradeoffs, when actually — as he in fact notes in his post — some of them offer tradeoffs that are quite tunable.
- I think Daniel is at best oversimplifying when he appears to assert that best-case network latency is an important design criterion for all that many NoSQL systems. Naively, anything that acknowledges reads or writes requires two hops. Two-phase commit (2PC) requires three hops. 33% latency reductions are not the kinds of goals that drive dramatic DBMS redesigns, even though tenths of seconds — i.e. 100s of milliseconds — matter in the kinds of environments where NoSQL is sprouting up.
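The arithmetic behind the 33% figure in the second point can be sketched as follows. The hop counts (two for a plain acknowledged operation, three for 2PC) come from the text above; the 50 ms per-hop latency is an assumed number chosen purely for illustration, and the function names are hypothetical.

```python
def total_latency_ms(hops, per_hop_ms):
    """Naive model: total latency is just hops times per-hop latency."""
    return hops * per_hop_ms


def relative_reduction(before_hops, after_hops):
    """Fraction of latency saved by going from before_hops to after_hops."""
    return (before_hops - after_hops) / before_hops


PER_HOP_MS = 50  # assumed per-hop network latency, for illustration only

two_phase_commit = total_latency_ms(3, PER_HOP_MS)  # three hops
plain_ack = total_latency_ms(2, PER_HOP_MS)         # two hops
saving = relative_reduction(3, 2)                   # one third

print(f"2PC: {two_phase_commit} ms, plain ack: {plain_ack} ms, "
      f"saving: {saving:.0%}")
```

Under this naive model the saving is one third regardless of the per-hop figure, since the per-hop latency cancels out of the ratio.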
Categories: Amazon and its cloud, Cassandra, NoSQL, Theory and architecture | 2 Comments |
Toward a NoSQL taxonomy
I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:
NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions
Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I’d be happier, however, with at least three parts to the taxonomy:
- How data looks logically on a single node
- How data is stored physically on a single node
- How data is distributed, replicated, and reconciled across multiple nodes, and whether applications have to be aware of how the data is partitioned among nodes/shards.
Categories: Cassandra, Data models and architecture, NoSQL, Parallelization, RDF and graphs, Structured documents, Theory and architecture | 13 Comments |
Some NoSQL links
I plan to post a few things soon about MongoDB, Cassandra, and NoSQL in general. So I’m poking around a bit reading stuff on the subjects. Here are some links I found.
Categories: Amazon and its cloud, Cassandra, Continuent, Google, MySQL, NoSQL, Open source, RDF and graphs, Tokutek and TokuDB | 5 Comments |
Cassandra and the NoSQL scalable OLTP argument
Todd Hoff put up a provocative post on High Scalability called MySQL and Memcached: End of an Era? The post itself focuses on observations like:
- Facebook invented and is adopting Cassandra.
- Twitter is adopting Cassandra.
- Digg is adopting Cassandra.
- LinkedIn invented and is adopting Voldemort.
- Gee, it seems as if the super-scalable website biz has moved beyond MySQL/Memcached.
But in addition, he provides a lot of useful links, which DBMS-oriented folks such as myself might have previously overlooked.