When it’s still best to use a relational DBMS
There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. “Ted” Codd himself once suggested that relational DBMS weren’t best for analytics.* Analysis of machine-generated log data doesn’t always have a naturally relational aspect. And I could go on with more examples yet.
*Actually, he didn’t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one — but he was. And he was wrong anyway about the necessity for MOLAP. But let’s overlook those details. 🙂
Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):
- Data re-use. Ted Codd’s famed original paper referred to shared data banks for a reason.
- The benefits of normalization (see the sketch after this list), which include:
- You only have to do the programming work of writing something once …
- … and you don’t have to do the programming work of keeping multiple versions of the information consistent.
- You only have to do the processing work of writing something once.
- You only have to buy storage to hold each fact once.
- Separation of concerns.
- Different people can worry about programming and “database stuff.”
- Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).
- Maturity and momentum, as reflected in the availability of:
- People.
- A broad variety of mature relational DBMS.
- Vast amounts of packaged software that “talks” SQL.
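To make the normalization and indexing points concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the tables and names are invented for illustration, not taken from the post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each customer fact is written down exactly once ...
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    -- ... and orders refer to it by key instead of repeating it.
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        total       REAL
    );
    -- Performance tuning can be a separate concern from programming:
    -- an index on the join key speeds queries without touching
    -- application code.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 'Boston')")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(101, 1, 99.50), (102, 1, 12.00)])

# Re-assembling the information takes a join, which is the price you
# pay for storing each fact only once.
for row in conn.execute("""
        SELECT o.order_id, c.name, c.city, o.total
        FROM orders o JOIN customer c USING (customer_id)"""):
    print(row)
```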
Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as:
- You’re building a low-volume, medium-complexity suite of applications that will evolve over time. This is the use case for which relational DBMS were invented, and they’re still great for it.
- Your (duplicated) data volumes would be ridiculous if you didn’t do a reasonable amount of normalization. Once you need to normalize, you need to do joins — and if you’re doing joins, you’re in relational territory.
- You simply don’t see a cost/benefit advantage to moving away from proven legacy technology. If you’re looking for an off-the-shelf answer to your needs — or if you’re inventorying your own technological shelves — relational-oriented technology has overwhelming share.
For many enterprises, that third point alone should be decisive in a large fraction of cases.
But the advantages of relational technology are less clear when you’re doing serious engineering of path-breaking new applications, where by “serious engineering” I mean:
- The problem is big enough that you simply want the best solution, with only loose coupling needed to the rest of your technical environment.
- Long-lasting “strategic” or legacy technology is not a great concern; you’re willing to keep “rebuilding the 747 while it’s flying” if that’s what’s necessary to get the best possible result.
- You have access to sufficient quantities of sufficiently smart people.
For example:
- I recently suggested that innovative SaaS vendors could adopt object-oriented database technology.
- Major web applications are rarely very relational. Until recently,* the default approach to scaling out web databases was memcached/sharded MySQL (sketched below), hardly a whole-hearted adoption of relational technology. Now NoSQL DBMS are vigorous competitors.
- Analytic challenges that amount to teasing out signals from streams of data are sometimes handled non-relationally as well, although it’s often nice to be able to do a few joins to mix in information from more relationally-structured data.
Not coincidentally, in a lot of those cases, throwing performance concerns “over the wall” to the database administrator isn’t going to work.
*I do expect the pendulum to swing back a bit as high-performance/highly-scalable MySQL implementations mature, but there are relatively few supporting examples to date.
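To make that memcached/sharded-MySQL pattern concrete, here is a toy cache-aside-plus-sharding sketch; plain Python dicts stand in for memcached and the MySQL shards, and every name is illustrative:

```python
import hashlib

NUM_SHARDS = 4
cache = {}                                # stand-in for memcached
shards = [{} for _ in range(NUM_SHARDS)]  # stand-ins for MySQL shards

def shard_for(key):
    """Pick a shard by hashing the key. Cross-shard joins are gone,
    which is why this is only a half-hearted use of relational tech."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return shards[h % NUM_SHARDS]

def get_user(user_id):
    """Cache-aside read: try the cache, fall back to the owning shard."""
    if user_id in cache:
        return cache[user_id]
    row = shard_for(user_id).get(user_id)   # would be a SQL SELECT
    if row is not None:
        cache[user_id] = row
    return row

def put_user(user_id, row):
    """Write to the owning shard, then invalidate the cached copy."""
    shard_for(user_id)[user_id] = row       # would be a SQL INSERT/UPDATE
    cache.pop(user_id, None)

put_user("u42", {"name": "Ada"})
print(get_user("u42"))
```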
To look at it another way, it’s right to be skeptical about relational DBMS when you can defeat all of the reasons to favor them. For example:
- Data re-use may not arise when applications are self-contained and rapidly-changing.
- Sometimes you don’t need to normalize your data.
- It’s not obvious that the relational approach to separation of concerns is the best one. Perhaps you’d be better off with the people who understand a specific application best being responsible for all the decisions connected with it.
- As for that maturity and momentum:
- People don’t actually learn much SQL in school.
- Are any of the mature relational DBMS what you really want?
- Is any of that packaged software out there really helpful for your specific problem?
I should probably stop there. But in an appeal to authority, I’ll close instead with a quote from Codd’s own OLAP paper:
IT should never forget that technology is a means to an end, and not an end in itself. Technologies must be evaluated individually in terms of their ability to satisfy the needs of their respective users. IT should never be reluctant to use the most appropriate interface to satisfy users’ requirements. Attempting to force one technology or tool to satisfy a particular need for which another tool is more effective and efficient is like attempting to drive a screw into a wall with a hammer when a screwdriver is at hand: the screw may eventually enter the wall but at what cost?
Comments
Multi-master update, especially over distance, is another problem that relational DBMS implementations do not handle especially well. The core of the problem is that certain well-loved features of SQL like multi-statement transactions and referential integrity do not scale easily across (for example) Amazon zones, leading to compromised availability.
Our company, Continuent, is working on replication technology to make multi-master work as effectively as possible for SQL apps, so we are pretty familiar with this problem. There are a lot of cases where system designers need to look hard at models like Dynamo or document-oriented approaches. It is hard to see how those models can be reconciled easily with legacy SQL DBMS usage.
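To make the conflict concrete, a toy sketch (not Continuent’s technology, and all names invented): two masters accept writes to the same row, and any merge rule has to discard somebody’s committed update, which is exactly the compromise that multi-statement transactions and referential integrity are supposed to rule out.

```python
# Two masters in different zones accept writes to the same row;
# asynchronous replication must later reconcile the divergence.
master_a = {"row1": {"val": "set in zone A", "ts": 1}}
master_b = {"row1": {"val": "set in zone B", "ts": 2}}

def merge_last_write_wins(a, b):
    """Keep whichever version has the later timestamp. Simple, but it
    silently drops a committed write, something single-master SQL
    semantics never allow."""
    merged = {}
    for key in a.keys() | b.keys():
        versions = [r for r in (a.get(key), b.get(key)) if r is not None]
        merged[key] = max(versions, key=lambda r: r["ts"])
    return merged

print(merge_last_write_wins(master_a, master_b))
# row1 ends up "set in zone B"; zone A's update is silently lost
```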
So you’re tackling multi-master again? 🙂
It’s funny that the concept of “lack of duplication” is so well established in relational. I don’t necessarily find a lot of duplication in other stores, whatever their type.
Normalization to me is how you avoid dups in relational.
In fact, there is duplication in relational. Here:
ORDER_LINE_ITEM

| ORDER | ITEM | QTY |
|-------|------|-----|
| 12345 | 123  | 1   |
| 12345 | 187  | 3   |
| 12345 | 11   | 1   |
| 12345 | 199  | 1   |
In the above the ORDER id repeats four times!
For more on this topic, taking a more theoretical point of view (i.e., not taking into account many of the practical issues that you mentioned above), see my essay at http://danweinreb.org/blog/the-problem-that-relational-databases-solve.
Also, some of the advantages of the relational model apply only when the data is mutable. For example, if you have append-only data such as log files, denormalization isn’t as problematic, since you don’t have to worry about keeping multiple copies of data consistent when it is mutated. (Sometimes, anyway; it depends on the particular case.)
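To illustrate Dan’s append-only point, a minimal sketch (field names invented): each log record carries a denormalized copy of context, and since records are never updated, there is no mutation that could leave the copies inconsistent.

```python
import json

events = []   # stand-in for an append-only log file

def log_event(user_id, user_name, action):
    # The user's name is denormalized into every record. That's safe
    # here: records are immutable, so no later UPDATE can leave the
    # copies disagreeing; they are snapshots, not shared state.
    events.append(json.dumps({
        "user_id": user_id,
        "user_name": user_name,
        "action": action,
    }))

log_event(42, "Ada", "login")
log_event(42, "Ada", "purchase")
print("\n".join(events))
```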
Dwight,
I’d say you avoid duplication via normalization + joins.
It’s when the joins get too burdensome that relational dogma grows dubious.
All normalization says is that you wrote down information in its most concise form, without repeating yourself. No harm in that … except perhaps in the effort of information re-assembly.
I do agree that separating the code from the data is very important in databases — to me this is one of the big ideas from database theory. I should not have to read the code to be able to read the data.
Some of the RDBMS alternatives achieve this. In the JSON-style document stores, I think this happens — JSON gives us a standards-based, language-independent way to store object-style data. We aren’t storing methods, or class hierarchies, etc.
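As a minimal illustration of that point (the class and fields are invented for the example): only the object’s data goes into the JSON document; the methods stay in the code, and the document remains readable without them.

```python
import json

class Order:
    def __init__(self, order_id, items):
        self.order_id = order_id
        self.items = items

    def total_quantity(self):
        # Behavior lives in the code, not in the store.
        return sum(line["qty"] for line in self.items)

order = Order(12345, [{"item": 123, "qty": 1}, {"item": 187, "qty": 3}])

# Only the data crosses into the document store; anyone can read it
# back without knowing about the Order class or its methods.
doc = json.dumps(order.__dict__)
print(doc)
```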
Dan,
Good post and comment. In particular, I like your highlight of the idea that the relational model lets one physically reorganize data without breaking programs. Ergo, when we look for non-relational use cases, we might want to start in places where physical reorg isn’t going to happen, e.g. because performance and scalability are stressed and being seriously optimized.
Also, I hadn’t previously noted ITA’s adoption of eBay-style denormalized BLOBs. One could say you’re doing a partial document-oriented model within an overall relational structure.
One thing I really like about relational is that the schema is generally use-case independent (unless one is ‘cheating’ on the schema because the database is slow). If you give me your ER diagram, I can design your schema without knowing much about your problem.
This has the great side effect that there isn’t a lot of bias in the schema towards a particular use case. This helps with ad hoc queries: if you didn’t know the query ahead of time, you couldn’t have designed your schema for it.
With MongoDB, one designs a schema, but designs it for a ‘bread and butter’ use case. It is then very efficient for that use case and easy to work with for that use case. Other ad hoc queries not anticipated a priori can still be done, but they take a tiny bit more work because of the schema bias.
Thus: I kind of like relational for reporting/analytics. Also SQL is important as a standardized interface between client (e.g. business objects) and server.
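A small sketch of that schema-bias point, with plain Python dicts standing in for rows and documents (all names invented):

```python
# Use-case-independent, normalized layout: nothing about it
# anticipates any particular query.
orders     = [{"order_id": 12345, "customer_id": 7}]
line_items = [{"order_id": 12345, "item": 123, "qty": 1},
              {"order_id": 12345, "item": 187, "qty": 3}]

# Document layout biased toward the "fetch an order with its lines"
# use case: one read, no join, but queries by item now have to look
# inside every document.
order_docs = [{
    "_id": 12345,
    "customer_id": 7,
    "lines": [{"item": 123, "qty": 1}, {"item": 187, "qty": 3}],
}]

# Bread-and-butter query: trivial against the document ...
doc = next(d for d in order_docs if d["_id"] == 12345)

# ... but the ad hoc "which orders contain item 187?" query is more
# natural against the flat, unbiased line_items table.
hits = {li["order_id"] for li in line_items if li["item"] == 187}
print(doc["lines"], hits)
```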
Dwight, is that lower-case or upper-case “business objects”? I.e., I’m asking what the universe is of clients that you have in mind.
Dwight,
I gather it’s your position that MongoDB is in a sweet spot for supporting OO programming? I.e., close enough to the program model to avoid most of the impedance mismatch, yet preserving the benefits of data independence?
Makes sense.
@Curt, yes, I meant “Business Objects” uppercase. Or other client-side tools that talk to databases using SQL. (I haven’t used Business Objects, so I hope that makes sense…) Basically I meant “any reporting or visualization tool as a client.”
re: MongoDB: yes, exactly! Albeit RDBMS is not going away. In particular, if you need deep transactional semantics, it’s a good fit even in the long run.
http://dmerr.tumblr.com/post/409288020/how-i-view-the-emerging-nosql-space-and-how-it
Dwight,
If we look past the implicit assumption of a two-tier client-server architecture rather than a three-tier one, I think I get where you’re coming from with the Business Objects example. 🙂
As for your link, there are naming issues. Vertica and Aster nCluster are analytic RDBMS. OLAP/BI commonly refers to tools such as Business Objects as much as DBMS, especially the “BI” part.
“Perhaps you’d be better off with the people who understand a specific application best being responsible for all the decisions connected with it.”
In general, people overrate their ability to rate their own abilities. Letting them go off like this winds up with a mix of monkey-see/monkey-do sparklies and poorly thought out lack-of-design.
A very small number of successes does not offset a large number of failures. And some of the successes are illusory.
The relational model was invented to support cross-department applications. Pre-relational DBMS like IDS, IMS, Total, System 2000, Adabas, etc. all forced a particular perspective – as do the OODBMS and other post-relational technologies.
That said, scaling RDBMS has always been a huge challenge. System R (IBM’s first stab at relational) couldn’t compete with IMS due to poor performance and was mainly used for decision support. IBM learned from this and rebooted their relational strategy with DB2.
The heart of the relational model is _Normalization_. If your data models aren’t normalized and they take on a particular application’s perspective, there will always be trade-offs.
These days “big data” is a hot topic, and there is a lot of ink spilled on technology like Hadoop. It might be interesting for a marketer to perform sophisticated analysis on petabytes of data to better target advertising or prospects or whatnot, but the sad reality is that very few businesses have a grasp on the basics of master data management. For that problem, I see the solution being more about scaling of users than scaling of data. For this reason, I’m more interested in technologies like SQL Azure or database.com, which don’t support large volumes of data but do scale up for user loads, which is where I see the most pain.
As an example of how little progress we’ve made in master data management, when I call my cableco’s call centre they still ask me to enter my phone number at least 3 times over!!! It’s ridiculous. Keep the sophisticated marketing; I just want a better customer experience.
@Curt, regarding multi-master we never gave up. For geographically distributed data this problem is not going away any time soon. It almost always takes the form of needing to write to the same DBMS table in multiple locations. There are multiple good business reasons for wanting to do this.
[…] As the NoSQL field has taken shape and its products have multiplied, there are more database choices than ever, yet traditional relational databases still command a huge market. In the DBMS2 post “When it’s still best to use a relational DBMS”, the author sums up four advantages of today’s relational databases, such as: […]
[…] of course traditional relational databases, such as MySQL, Postgres, Oracle, and DB2, still have their place and can be accessed via Spring JPA and Spring JDBC Extensions, and made accessible via Spring Data […]
Hi,
Yesterday I read your DBMS2 post from May 29, 2011, “When it’s still best to use a relational DBMS”.
While I have no problems using “Relational Databases” for developing database applications, I departed many years ago from trusting the Relational Model (RM) as a solid base for database theory. Contrary to your post, it was problems with normalized database design that started that movement. Our problems with normalization are:
- It is a performance sink. Browse the net for entries about “denormalization”.
- It destroys the semantic meaning of tables. The meaning of an invoice, a real-life document, is easy to communicate; the split into CUSTOMER, SALESPERSON, INVOICE, LINEITEM and ITEM is much harder to explain, since CUSTOMER, SALESPERSON and ITEMS cannot be stored in the actual catalogs without very sophisticated version control and an additional performance penalty.
- It is bad design when domains are not invariant over their lifetime. We paid dearly for overlooking the fact that a machine of type xyz from manufacturer abc produced in 1990 is not guaranteed to be identical with the 2015 product. Inconsistency cannot always be avoided, but it must be controlled.
- Normalization mutated from a workaround to avoid “more complicated data structures” (Codd 1970) into a sacrosanct part of Codd’s theory. Mathematical relations are not normalized. While reasonable in 1970, Codd’s restriction is nonsense decades later.
- Using catalog data as copy templates and not as references leads to much simpler, more flexible and more robust designs. Even the complexity of the infamous “sales_order sample” could be reduced by a factor of 2 without missing any functionality.
- Without fragmenting entities into a collection of entity and weak entities, “referential integrity” is much easier to maintain. Entities can be viewed as atomic database objects.
In our field of work, consistency is limited to data entry and controlled by catalog data. Stored data are snapshots, valid at the time of data entry and maybe invalid at the current time, but you don’t replace all functional machinery when the manufacturer offers an improved design.
If you know any real-life application where you have to do “programming work of keeping multiple versions of the information consistent”, I would be glad if you sent me the reference.
Regards
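If I’m reading Karl’s “copy template” idea right, a minimal sketch of it would look like this (all names invented): at order entry, the catalog entry is copied into the line item as a snapshot, so later catalog changes cannot silently rewrite stored history.

```python
import copy

catalog = {"widget-9": {"desc": "Widget, type xyz", "price": 19.95}}

def enter_order_line(item_id, qty):
    """Copy the catalog entry into the line item rather than storing a
    reference; the stored data is a snapshot, valid at entry time."""
    line = copy.deepcopy(catalog[item_id])
    line.update(item_id=item_id, qty=qty)
    return line

line = enter_order_line("widget-9", 2)
catalog["widget-9"]["price"] = 24.95   # a later catalog change ...
print(line["price"])                   # ... leaves the snapshot at 19.95
```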
Karl,
Any time you store an address twice, you have to worry about consistency.