Notes on document-oriented NoSQL
When people talk about document-oriented NoSQL or some similar term, they usually mean something like:
Database management that uses a JSON model and gives you reasonably robust access to individual field values inside a JSON (JavaScript Object Notation) object.
Or, if they really mean,
The essence of whatever it is that CouchDB and MongoDB have in common.
well, that’s pretty much the same thing as what I said in the first place. 🙂
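To make "access to individual field values inside a JSON object" concrete, here is a minimal Python sketch. The helper `get_field` and its dotted-path syntax are hypothetical, invented for illustration; they are not the query API of CouchDB, MongoDB, or any other product.

```python
import json
from typing import Any


def get_field(document: Any, path: str, default: Any = None) -> Any:
    """Walk a dotted path such as 'address.city' or 'tags.0' into parsed JSON.

    Hypothetical helper for illustration only; not any product's API.
    """
    current = document
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]
        else:
            # The path does not exist in this document.
            return default
    return current


raw = '{"name": "Ada", "address": {"city": "London", "zip": "NW1"}, "tags": ["admin", "dev"]}'
doc = json.loads(raw)

city = get_field(doc, "address.city")
first_tag = get_field(doc, "tags.0")
missing = get_field(doc, "address.country", default="unknown")
```

The point is only that fields nested inside the object are individually addressable, rather than the whole document being an opaque blob.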
Of the various questions that might arise, three of the more definitional ones are:
- Why JSON rather than XML?
- What’s with this fluidity between the terms “document” and “object”?
- Are you serious about the lack of joins?
Let me take a crack at each.
Like XML, JSON is a data-interchange format that has been repurposed as a data persistence model. JSON is evidently beating out XML in web applications, for reasons including:
- XML is more verbose and slower than JSON. (Whether this matters or not is of course use-case-dependent.)
- Like SQL, XML requires what some web programmers regard as too much formalism and up-front specification.
- JSON is associated with JavaScript.
- JSON is regarded as being more suited to straightforwardly fielded data, while XML is regarded as being more suited to “mixed content” — e.g., real text documents.
- In general, XML feels “enterprisey” to developers who don’t like that feel.
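The verbosity point can be illustrated with a rough Python sketch that serializes one small record both ways using only the standard library. The record and the element-per-field XML layout are assumptions made for the example; an attribute-based XML encoding would be shorter, so treat this as one data point rather than a benchmark.

```python
import json
import xml.etree.ElementTree as ET

# A made-up record, purely for illustration.
record = {"id": 42, "name": "Ada", "active": True}

# Compact JSON encoding (no extra whitespace).
json_text = json.dumps(record, separators=(",", ":"))

# One XML element per field; values must be turned into text by hand.
root = ET.Element("record")
for key, value in record.items():
    child = ET.SubElement(root, key)
    child.text = str(value)
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

With this layout the XML string comes out longer, mostly because every field name is written twice, once in the opening tag and once in the closing tag.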
One good starting point for recent JSON vs. XML discussion is here. My favorite from the 2007 iteration of the debate is this one.
So, in essence:
- The reasons JSON beats XML for web application data interchange have some applicability to web application data storage as well.
- There’s ever more JSON around, at the expense of XML.
But truth be told, I don’t think XML and JSON actually go head to head against each other on the DBMS side very often at all. E.g., Dwight Merriman (the 10gen/MongoDB guy) told me he never, ever competes against MarkLogic, and I found that very credible.*
*Proof point: Dwight was clueless about MarkLogic specifics in a way he never would be if they were any kind of competitive consideration for him. 🙂
Note that the one area where (almost) everybody agrees XML wins is for what one might call “real” documents. By way of contrast, JSON is best suited for stringing data attributes and values together. So the “documents” that JSON models can indeed just as reasonably be called “objects.”
That said, JSON-based DBMS are not what one would normally call object-oriented DBMS; for an example of those, consider InterSystems Caché. And just to close the loop on confusion: Caché can also be used as an XML DBMS.
As I previously noted, one downside to today’s document-oriented DBMS is that you can’t do joins. Let me now add that I think joins will be added to document DBMS in the future. Plausibility arguments for this opinion include:
- MarkLogic — the XML database gold standard — sells to enterprises, and enterprises like joins.
- The alternative to joins in CouchDB and MongoDB is in essence MapReduce. Well, Hive proves that you can do joins on top of MapReduce if you want to. (So, for that matter, does Aster Data nCluster; Aster says its SQL parallelism is built on top of MapReduce.)
- InterSystems quite happily put SQL on top of an object-oriented DBMS, Caché. And Caché is so similar to an XML DBMS that it is in fact sometimes used as one.
But that is indeed a future. For discussion of the current state of affairs, I refer you to my earlier post on the subject of joinlessness linked above.
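As a plausibility sketch of how a join can be expressed in MapReduce terms, here is a generic "reduce-side join" in plain Python. This is not how Hive, CouchDB, or MongoDB actually implement anything; the collections, field names, and the three phase functions are all hypothetical and run in a single process.

```python
from collections import defaultdict

# Two made-up "collections" of documents.
customers = [{"_id": 1, "name": "Ada"}, {"_id": 2, "name": "Grace"}]
orders = [
    {"_id": 101, "customer_id": 1, "total": 30},
    {"_id": 102, "customer_id": 1, "total": 12},
    {"_id": 103, "customer_id": 2, "total": 99},
]


def map_phase(customers, orders):
    """Emit (join_key, tagged_document) pairs from both inputs."""
    for c in customers:
        yield c["_id"], ("customer", c)
    for o in orders:
        yield o["customer_id"], ("order", o)


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every customer with every order (an inner join)."""
    joined = []
    for key in sorted(groups):
        custs = [d for tag, d in groups[key] if tag == "customer"]
        ords = [d for tag, d in groups[key] if tag == "order"]
        for c in custs:
            for o in ords:
                joined.append({"name": c["name"], "order_id": o["_id"], "total": o["total"]})
    return joined


joined = reduce_phase(shuffle(map_phase(customers, orders)))
```

The join key does the work: because both inputs emit under the same key, the grouping step brings matching documents together, and the reducer only has to combine them.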
Comments
16 Responses to “Notes on document-oriented NoSQL”
MongoDB works excellently for one-to-many relationships, but better support for many-to-many relationships would indeed be a plus.
Consistency semantics are simpler in a world without joins and integrity constraints. This is helpful in any distributed system, and more so for mobile (e.g., Couch), where data changes may be reconciled only infrequently.
A different data model in itself is not an architectural differentiator. It’s possible to integrate a document interface into a relational database and get the full advantage of both models. We did this at Clustrix. I have a brief writeup at http://sergeitsar.blogspot.com/2011/02/clustrix-as-document-store-blending-sql.html
MongoDB does not store JSON documents, but rather JSON-style documents – specifically BSON (http://bsonspec.org/). This has important performance benefits for mostly numeric, flexible-schema stores (read – health and social statistics, finance). Effectively, the data does not need to be serialized into and out of a character stream between application objects and the store.
That also allows MongoDB to manage storage as a set of memory-mapped files, so that the DB server has little need, and little overhead, for managing data persistence on disk. A side effect of the efficiency of memory-mapped files is that objects are capped in size. I believe the current limit is 8MB, but do not quote me on that.
Many RDBMS implementations can have explicit (foreign keys) and implicit (joins) references between data items. That allows one to build an arbitrary, albeit complex, data graph and have it persisted in the data’s metadata, or at least somewhere between an application and the DB: for example, in queries, views, and stored procedures.
BSON, like JSON, represents inherently acyclic data graphs – effectively directed trees. It has no built-in mechanism to keep a record of any relationships in data except for containment at or below the object level. That seems to be consistent with MongoDB’s philosophy of disclaiming any significant responsibility for metadata. If schema management is not in the DB engine, then why should the meta-schema be in the DB engine? That is a blessing and a curse, as one needs to use a proprietary format if one wants to persist the structural information in the data store.
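A small Python sketch of the tree-only point, using the standard `json` module rather than BSON (so it illustrates the data model, not MongoDB itself). The `author_id` convention at the end is a hypothetical application-level workaround, not a built-in feature of any store.

```python
import json

# Containment works: a parent object holding a child object is a tree.
parent = {"name": "parent", "children": []}
child = {"name": "child"}
parent["children"].append(child)
tree_text = json.dumps(parent)

# A back-reference turns the tree into a cycle, which JSON cannot express.
child["parent"] = parent
try:
    json.dumps(parent)
    cycle_rejected = False
except ValueError:
    # Python's encoder detects the circular reference and refuses.
    cycle_rejected = True

# Hypothetical workaround: record the relationship as an identifier and
# leave it to the application to follow it.
author = {"_id": "a1", "name": "Ada"}
post = {"_id": "p1", "title": "Notes", "author_id": author["_id"]}
post_text = json.dumps(post)
```

Nothing in the serialized post says that `author_id` points at another document; that knowledge lives in the application.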
In XML-based stores, one has the option of using the family of XML-related standards to record and query the edges of a data graph. One can check an XML document for schema validity. I doubt MarkLogic does it natively today; however, it is very conceivable to have XPath references from inside one document into the innards of another and have the relationship followed as part of a query. This is a bit more than a “true document” notion, as it is a cross-document relationship.
Same thing – it is a blessing and a curse, as it brings to the table a character serialization penalty and an easy way to make a very convoluted data design. And I do not even want to go into the performance issues of random graph traversal.
(cross-posted at http://blog.didenko.com/2011/02/about-notes-on-document-oriented-nosql.html)
Hi Curt,
I think this is a great essay on NoSQL software, clarifying what’s really new/different and what’s not. Definitely need more posts like this!
I just wanted to clarify that our SQL is not built on top of MapReduce like, say, Hive. We have a native SQL stack and a native MapReduce stack, and they roughly sit next to each other with a fast data bus connecting them. Our global planner/optimizer is written to natively understand both. Our SQL queries can use data statistics and cost based optimization, which is essential for SQL performance, in contrast to a Hadoop/HDFS-based model.
Happy to talk more about this, if you like.
Vlad: Why reinvent the wheel? It seems like BSON is an attempt to redefine what the XDM already is.
P.S. Joins are not hard to do in a document DBMS. But joins are by definition a relational thing and make less sense in document databases. XInclude or range indexes can be used to solve joins in a pretty easy way. The reason we try to get away from joins is that needing them probably means you are doing your data model wrong (check http://opensourcebridge.org/proposals/524).
Nuno, that is a questionable impression. The key word (in the context of my comment) is “binary”. In larger context it simply has different purpose.
I am not sure that is THE definition of joins. I prefer to treat them as edges of a graph. We are effectively talking about the same thing on the rest.
The benefits of binary storage are a completely different topic. However, I can tell you that I personally disagree with your opinions about its advantages. As with everything, it’s a trade-off, and in my opinion you receive less than what you have to give.
Nuno, I understand 🙂 For the record, I am not affiliated in any way with either MongoDB or MarkLogic.
* when I wrote binary storage I actually meant binary interchange formats
Thanks for the clarification, Tasso.
I thought the integration was even tighter than that.
From Tasso: “Our SQL queries can use data statistics and cost based optimization, which is essential for SQL performance, in contrast to a Hadoop/HDFS-based model.”
Actually, Hive collects statistics and CBO is being implemented presently: https://issues.apache.org/jira/browse/HIVE-1938.
Curt
Nice article.
Regarding the use of Cache as an XML database, see http://gradvs1.mgateway.com/main/index.html?path=mdbx
Note that this application/database allows JSON interfacing to a Native XML Database which is an interesting hybrid approach.
Also see https://github.com/robtweed/node-mwire for a JSON-interface to Cache.
Finally you may be interested in this document that describes how Cache can be used to model all the main NoSQL database types, including the document orientated ones you’ve described in your article:
http://www.mgateway.com/docs/universalNoSQL.pdf
Rob
The most obvious reason for using JSON is that it maps one-to-one to objects because it is typed, and it has arrays and hashes. This eliminates the need for an ORM.
With XML, you need a schema to do this (is a value a string or an int, is it one item, or a collection?)
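The typing difference can be seen in a short Python sketch using only the standard library. The sample documents are made up for the example; the XML side deliberately uses no schema, which is the situation described above.

```python
import json
import xml.etree.ElementTree as ET

# JSON carries basic types: number, string, and array are distinguishable.
doc = json.loads('{"count": 3, "label": "3", "items": [1, 2]}')
json_types = {key: type(value).__name__ for key, value in doc.items()}

# Schema-less XML: every leaf value arrives as text.
root = ET.fromstring(
    "<doc><count>3</count><label>3</label>"
    "<items><item>1</item><item>2</item></items></doc>"
)
xml_count = root.find("count").text
xml_label = root.find("label").text
xml_items = [item.text for item in root.find("items")]
```

In the JSON case `count` comes back as an integer and `label` as a string; in the XML case both come back as the same text, and whether `items` is "a collection" is something the reader of the document has to decide.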
There’s also YAML, which is similar to JSON. But it’s getting more popular, because it’s easier to type and read than JSON.
A lot of NoSQL solutions use JavaScript as the scripting engine for querying or for MapReduce, which makes the choice of JSON obvious.
[…] The “document”/”object” DBMS distinction has long been blurry. XML is full of documents, but they’re really objects. The same goes for the JSON/quasi-JSON objects of CouchDB/Couchbase and MongoDB. Object-oriented DBMS vendors have dabbled in XML on and off over the years because of technical similarity. Etc. […]