Document-oriented DBMS without joins
When I talked with MarkLogic’s Ken Chestnut about MarkLogic 4.2, I was surprised to learn that MarkLogic really, truly doesn’t do anything like a join. Unlike some other non-SQL DBMS, MarkLogic has no SQL interface, no ODBC or JDBC. Nothing, nada. (MarkLogic has a Java interface for Xquery, but not for anything like SQL.)
Since MarkLogic and other XML DBMS are used in applications for brokerage trades and the like, I used that area as my example for a challenge question: What happens when one brokerage firm buys another? (Similar challenges could be made about medical records or consumer profiling.) The answer was that you just have to update or augment each existing record with the new firm’s information. And by the way, if you choose to augment, then you have the new and old information side-by-side, both of which could conceivably come in handy.
Document-oriented NoSQL DBMS such as CouchDB and MongoDB face similar challenges, of course. I didn’t pursue the matter in depth in either case, but:
- If I understood Damien Katz correctly, CouchDB has a view capability that provides some kind of workaround. (A quick web search turned up this page on a kind of entity-relationship modeling in CouchDB and the associated querying.)
- Dwight Merriman suggested to me that in MongoDB, you can work around the lack of joins via client-side logic, or by embedding lots of data in each document (e.g., all the line items for an order) and extracting what you need via MapReduce jobs.
I’m not totally sure what I think about joinlessness, but one way of looking at it could be:
- The reason we have joins is because we normalize. If it’s OK to be highly denormalized, then it’s less important to have joins.
- When normalization is good and denormalization is bad, one or both of two reasons are commonly in play:
- The logical burden of keeping straight all the different places you’d have to update the same data is too great for the poor, overburdened programmers.
- The performance burden of doing all that updating is too great for the poor, overburdened hardware.
- For the logical reason to have great force, there has to be a pretty complex schema, or else a frequently changing one. But when schemas change frequently, relational designs have their own problems.
- The physical reason automatically has great force if you have huge update volumes and keep many copies of the same data. Otherwise, its strength has a lot to do with the specific architecture of the DBMS. E.g., if it’s a lot cheaper to update a small record than a big one, short rows are better. But otherwise, denormalization may not have that much effect on performance.
Putting all that together, I’m inclined to think that for many applications, it’s OK to denormalize, or to have such a simple schema that normalization is moot. But even so, I’d be a lot more comfortable if a DBMS offered at least some way of doing a join.
All this raises a related question: What are transactions like in document-oriented DBMS? I’ve never pushed the point with MarkLogic, but when they talk of their ACID compliance they sound as if they are using the phrase in the usual way. MongoDB only lets you do transactions in single documents. I’ve never asked the question about CouchDB, but I do note with interest CouchDB’s “crash-only” architecture, which boils down to:
- CouchDB shutdown is “instantaneous.”
- You can only shut down CouchDB by crashing it.
- There’s no way to shut down or crash CouchDB that causes data to be inconsistent.
Comments
8 Responses to “Document-oriented DBMS without joins”
Leave a Reply
I’m not sure what normalization has to do with ACID. Why wouldn’t a documented-oriented database implement ACID, at least as an option?
For MarkLogic Server, my understanding is that transactions are ACID and look much like RDBMS transactions. If an update touches 100 documents, they all update at the same timestamp. The XPath, full-text, and range index entries that point to those documents also update in sync. In short, documents act much like RDBMS rows would.
[…] Monash has a reasonable post where he points out that we need joins because we normalize. Furthermore, he offers reasons for why […]
One of the stereotypical cases for transactional semantics is when you have orders and line items and so on, and they’re all in separate tables. That makes the most sense in an architecture where you can also have joins.
An even more stereotypical case is one in which you credit one account and debit another, and those updates had better either both go through or neither. The tie to joins there isn’t as strong, but stylistically it makes one think of the kind of short-row OLTP where joins are likely to also come into play.
That’s the kind of thing I was thinking of.
There is only one common use case for joins I can think of where denormalization is not an option: transitive closure type queries (e.g. non-trivial graph search and traversal), which are typically implemented using recursive joins.
Denormalization in these cases usually has a prohibitively expensive space complexity such that it is only an option for very “shallow” queries.
I second – Join is critical in some search-type applications.
This is a dirty little secret that search engine vendors hide when they boast their ability to index database content. You have to pre-join all the tables that belong to a searchable object (aka de-normalization or flattening) in order to index them.
I have seen intelligence applications that tried to flatten huge objects (> 20,000 rows) each time a row was updated. The poor indexing machines were working most of the time on few large, highly volatile objects. Lots of hardware was thrown at the problem with little help.
The only once that have Join with text-search are Attivio ( AFAIK). That is the reason we have chosen them for huge intelligence-type application.
FAST search had Join ambitions, but I think they gave it up when they became part of SharePoint.
[…] I previously noted, one downside to today’s document-oriented DBMS is that you can’t do joins. Let me now add that I think joins will be added to document DBMS in the future. Plausibility […]
[…] have indexes that at least support relational-like SELECTs. JOINs can be more problematic, but MarkLogic finally has them. Tokutek even offers a 3rd-party indexing option for […]
[…] with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since […]