Toward a NoSQL taxonomy
I talked Friday with Dwight Merriman, founder of 10gen (the MongoDB company). He more or less convinced me of his definition of NoSQL systems, which in my adaptation goes:
NoSQL = HVSP (High Volume Simple Processing) without joins or explicit transactions
Within that realm, Dwight offered a two-part taxonomy of NoSQL systems, according to their data model and replication/sharding strategy. I’d be happier, however, with at least three parts to the taxonomy:
- How data looks logically on a single node
- How data is stored physically on a single node
- How data is distributed, replicated, and reconciled across multiple nodes, and whether applications have to be aware of how the data is partitioned among nodes/shards.
After talking with Dwight, and also with Cassandra project chair Jonathan Ellis, I feel I’m doing decently in understanding the first of those three areas. But there’s a long way yet to go on the other two.
In Dwight’s opinion, as I understand it, NoSQL data models come in four general kinds.
- Key-value stores, more or less pure. I.e., they store keys+BLOBs (Binary Large OBjects), except that the “Large” part of “BLOB” may not come into play.
- Table-oriented, more or less. The major examples here are Google’s BigTable, and Cassandra.
- Document-oriented, where a “document” is more like XML than free text. MongoDB and CouchDB are the big examples here.
- Graph-oriented. To date, this is the smallest area of the four. I’m reserving judgment as to whether I agree it’s properly included in HVSP and NoSQL.
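As a rough illustration of how the four kinds differ, here is one made-up user record expressed in each shape, using plain Python structures as stand-ins for the stores. Every name and value below is hypothetical, and no real product's API is implied.

```python
# Hypothetical sketch: one user record in each of the four data models.

# 1. Key-value: the store sees only a key and an opaque blob of bytes.
key_value_store = {
    "user:42": b'{"name": "Ada", "city": "London"}',
}

# 2. Table-oriented: row key -> column family -> column -> value.
table_store = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "follows": {"user:7": "2010-01-15"},
    },
}

# 3. Document-oriented: a nested structure the store itself can look inside.
document_store = [
    {"_id": 42, "name": "Ada", "address": {"city": "London"}, "follows": [7]},
]

# 4. Graph-oriented: nodes plus explicit, labeled edges between them.
graph_nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}}
graph_edges = [(42, "follows", 7)]


def find_documents_by_city(docs, city):
    """A document store can filter on a nested field without decoding a blob."""
    return [doc["_id"] for doc in docs if doc.get("address", {}).get("city") == city]


def neighbors(edges, node, label):
    """A graph store answers 'who is connected to this node?' directly."""
    return [dst for src, lbl, dst in edges if src == node and lbl == label]
```

The practical difference is what the store can see. In the key-value case, a query by city means fetching and decoding blobs in the application; in the document and table cases, the store can do that filtering itself.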
As Dwight sees it, JSON (JavaScript Object Notation) is the emerging markup standard for the document-oriented data models, and to some extent the BLOB part of key-value models as well. Reasons seem to include:
- JSON is something web developers are likely to know anyway.
- JSON, unlike XML, is schema-less. In the NoSQL world, that’s perceived as a good thing.
- Perhaps for both these reasons, JSON is perceived as easier to use than XML.
Except as noted, I’m not aware of anything that solidly contradicts the above.
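To make the "schema-less" point concrete, the sketch below parses two JSON documents with different fields into the same collection, with no schema declared anywhere. The documents are invented for illustration. It shows only that JSON does not require a schema, and makes no claim about what XML does or does not require.

```python
import json

# Two hypothetical documents destined for the same collection.
# They share only the "name" field; nothing declares which fields are allowed.
raw_documents = [
    '{"name": "Ada", "email": "ada@example.com"}',
    '{"name": "Grace", "languages": ["COBOL"], "address": {"city": "Arlington"}}',
]

# Parsing maps each document straight onto native dicts and lists.
documents = [json.loads(raw) for raw in raw_documents]

# Each document carries its own structure; the application must cope with the variation.
field_sets = [sorted(doc) for doc in documents]
```

The flip side is that any checking of field names or types has to happen in application code, since the format itself enforces none.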
Dwight went on to say that there are two main NoSQL replication/sharding models, in line with the seminal papers to which I previously linked:
- Based on or resembling Dynamo. The core idea here is accepting eventual consistency among nodes as being good enough, even if that means you sometimes read dirty data. The benefit is that you never are blocked from writing. By way of contrast, systems that enforce true inter-node consistency (think of a two-phase commit) can shut you down from writing if consistency guarantees aren’t being confirmed in a timely manner. Thus, in a Dynamo-like scheme you write data to multiple nodes, via consistent hashing; then when the time comes you read one or more nodes, and hope that what you’re getting back is a correct result.
- Based on or resembling BigTable. In this model you’re trying to keep the nodes fully consistent in the usual way, e.g. by synchronous replication. Indeed, what’s being kept consistent is both data itself, and metadata about the data’s location. Details surely vary a lot from implementation to implementation.
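The Dynamo-like idea can be sketched in a few lines: keys are placed on replica nodes by consistent hashing, writes go to whichever replicas are reachable, and a read from a single replica may return an older value. This is a toy model under stated assumptions, not how Dynamo, Cassandra, or any other real system is implemented. The class names, the integer version numbers, and the "highest version wins" reconciliation rule are all hypothetical simplifications.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a node name or key to a position on the hash ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((ring_position(node), node) for node in nodes)
        self.positions = [position for position, _ in self.ring]

    def nodes_for(self, key):
        """Return the replica nodes for a key, walking clockwise from its position."""
        start = bisect.bisect_right(self.positions, ring_position(key))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.replicas)]


class Cluster:
    def __init__(self, node_names, replicas=2):
        self.ring = HashRing(node_names, replicas)
        self.data = {name: {} for name in node_names}  # node -> {key: (version, value)}
        self.down = set()  # nodes that are currently unreachable

    def write(self, key, value, version):
        """Write to every reachable replica; unreachable replicas are skipped, not waited on."""
        written = 0
        for node in self.ring.nodes_for(key):
            if node not in self.down:
                self.data[node][key] = (version, value)
                written += 1
        return written

    def read(self, key, node):
        """Read from one replica only; the answer may be out of date."""
        return self.data[node].get(key)

    def read_latest(self, key):
        """Read all replicas and keep the highest version (a simplified reconciliation)."""
        found = [self.read(key, node) for node in self.ring.nodes_for(key)]
        found = [item for item in found if item is not None]
        return max(found) if found else None


cluster = Cluster(["node-a", "node-b", "node-c"])
first, second = cluster.ring.nodes_for("user:42")
cluster.write("user:42", "London", version=1)   # reaches both replicas
cluster.down.add(second)
cluster.write("user:42", "Paris", version=2)    # only the first replica sees this
cluster.down.discard(second)
stale = cluster.read("user:42", second)         # (1, "London"): an old value
fresh = cluster.read_latest("user:42")          # (2, "Paris")
```

In a BigTable-like scheme, by contrast, the second write would not be acknowledged until the replicas agreed, which is exactly the blocking the Dynamo-like scheme avoids.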
I’m fuzzier on this stuff than on the data models, because to date nobody has ever explained to me how an actual live system (MongoDB, Cassandra, whatever) implements its replication strategy. Also, while I think that in both these models applications are allowed to be ignorant of the replication/sharding strategy, I’m not as sure of that as I’d like to be.
If we stop here, we already have something useful. MongoDB has a document data model, and is in the BigTable-like replication camp, at least at first. Cassandra has a table-like data model, and is on the Dynamo-like eventual consistency side. But to say those are the only differences that matter would be like saying that all shared-disk RDBMS (e.g., Oracle and Sybase IQ) are essentially alike. That, of course, would be nonsense.
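One way to picture the taxonomy so far is as a record with one field per dimension. The sketch below encodes only the two classifications stated above. The type and field names are hypothetical, and the third dimension is left empty because nothing has been established about it yet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataModel(Enum):
    KEY_VALUE = "key-value"
    TABLE = "table-oriented"
    DOCUMENT = "document-oriented"
    GRAPH = "graph-oriented"


class Replication(Enum):
    DYNAMO_LIKE = "Dynamo-like (eventual consistency)"
    BIGTABLE_LIKE = "BigTable-like (consistent replication)"


@dataclass(frozen=True)
class NoSQLSystem:
    name: str
    data_model: DataModel
    replication: Replication
    physical_storage: Optional[str] = None  # third dimension; None means "not yet characterized"


systems = [
    NoSQLSystem("MongoDB", DataModel.DOCUMENT, Replication.BIGTABLE_LIKE),
    NoSQLSystem("Cassandra", DataModel.TABLE, Replication.DYNAMO_LIKE),
]


def differing_dimensions(a: NoSQLSystem, b: NoSQLSystem):
    """List the dimensions on which two systems are known to differ."""
    dimensions = ("data_model", "replication", "physical_storage")
    differing = []
    for dimension in dimensions:
        left, right = getattr(a, dimension), getattr(b, dimension)
        if left is None or right is None:
            continue  # unknown is not the same as equal, so skip it
        if left != right:
            differing.append(dimension)
    return differing
```

Two systems that match on the first two fields could still differ on the third, which is why the empty field matters.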
So a third dimension needed in this taxonomy is how the systems actually bang data on and off of disk (or silicon, as the case may be). I don’t yet have an overview of that. I know something of how Cassandra does it, and will write about same in a future post, but that’s about it. So please stay tuned.
Comments
13 Responses to “Toward a NoSQL taxonomy”
Not sure where you’d put Keyspace, which is a consistently replicated key-value store which uses Paxos.
You don’t have to use a schema when you use XML; you can, but it is not necessary.
Googling also shows that there are people busy defining a schema language for JSON.
I don’t see a real difference between a document db that uses JSON and a document db that uses XML. The difference between XML users and JSON users is more “cultural”.
I don’t think it is accurate to state that there are two replication/sharding models. Replication and sharding are different aspects of a system. There is sharding versus not sharding. There is strong versus eventual consistency.
I think the replication/consistency description is more complex. Some of us want strong consistency within a datacenter and something more relaxed between distant datacenters. More relaxed might mean single-master with async replication or it might mean multi-master and eventual consistency.
I am definitely not an expert but HBase has a transaction log per region server (range of the key space for a table). There is not a global transaction log per table. If it were to do replication between datacenters, that would likely be done independently for each region and things could get inconsistent between regions (if you write keys ‘A’ and then ‘B’ locally, the write to ‘B’ might replicate before the write to ‘A’). Maybe an HBase expert can respond with their plans.
Cassandra has more flexibility. While described as providing ‘writes never block’ behavior, you can request higher levels of consistency on reads and writes that may lead to writes blocking. This is a good thing, as you pay the price only when you want the behavior. However, I am not sure that quorum writes allow you to get sync replication within a datacenter and async replication between them, unless you always have a quorum in the datacenter.
@RC, I know next to nothing about JSON. But “feature poker” is a very old game, and it makes perfect sense for the supporters of each approach to add in the best points of the other, to the extent they can.
@Mark,
I’m struggling for terminology here. Replication, sharding, synchronization, scale-out, reconciliation — nothing is perfect. But it’s all in the area of “A query or update comes into the system, and somebody has to figure out which server(s) to send it to.”
While you go into details here, IMHO one could put it simply: NoSQL databases are disguised object databases, with relaxed constraints!
Here is my post: http://www.jroller.com/dmdevito/entry/thinking_about_nosql_databases_classification
I’m late to the party – but here goes.
As a web developer, I like JSON because there is a straightforward mapping to JavaScript. This makes it very easy to process the data (iterate through it, read values from it, use it to configure objects).
With something like XML, processing the data requires the DOM (which leads to tedious-to-write, verbose code, and which has some browser compatibility issues) or an extra dependency on a higher-level API (since XPath and CSS selectors are either not available on many browser platforms or again suffer from browser compatibility issues).
As for the schemalessness of XML vs JSON, I don’t think this is a problem in practice for most browser-based web applications. I haven’t seen many cases where people actually use XML validation, so nothing is lost in that respect by moving to JSON. If you did want validation, you’re pretty much stuck with DTDs (that is, if you want a cross-browser compatible solution). XML Schema sadly isn’t really well supported in browsers. In addition, if you really want to have JSON schemas, you can, with the emerging JSON-Schema standard (http://tools.ietf.org/html/draft-zyp-json-schema-03).