NoSQL vs. NewSQL vs. traditional RDBMS
I am frequently asked questions that boil down to:
- When should one use NoSQL?
- When should one use a new SQL product (NewSQL or otherwise)?
- When should one use a traditional RDBMS (most likely Oracle, DB2, or SQL Server)?
The details vary with context — e.g. sometimes MySQL is a traditional RDBMS and sometimes it is a new kid — but the general class of questions keeps coming. And that’s just for short-request use cases; similar questions for analytic systems arise even more often.
My general answers start:
- Sometimes something isn’t broken, and doesn’t need fixing.
- Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
- Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.
In particular, migration away from legacy DBMS raises many issues:
- Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
- Your staff’s programming and administrative skill-sets.
- Your investment in DBMS-related tools.
- Your supply of hockey tickets from the vendor’s salesman.
Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.
Commonly, the good reason to change DBMS is one or more of:
- Programming model. Increasingly often, dynamic schemas seem preferable to fixed ones; internet-tracking nested data structures are just one of the reasons. (A brief sketch of the difference follows this list.)
- Performance (scale-out). DBMS written in this century often scale out better than ones written in the previous millennium. Also, DBMS with fewer features find it easier to scale than more complex ones; distributed join performance is a particular challenge.
- Geo-distribution. A special kind of scale-out is geo-distribution, which is sometimes a compliance requirement, and in other cases can be a response-time nice-to-have.
- Other stack choices. Couchbase gets a lot of its adoption from existing memcached users (although they like to point out that the percentage keeps dropping). HBase gets a lot of its adoption as a Hadoop add-on.
- Licensing cost. Duh.
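To make the programming-model point concrete, here is a minimal sketch of fixed vs. dynamic schemas. It uses Python's bundled sqlite3 for the fixed side and plain dicts as a stand-in for a document collection; all table, field, and data values are invented for illustration.

```python
import sqlite3

# Fixed schema: columns are declared up front; adding a field later
# means an explicit ALTER TABLE plus a migration for existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", ("Alice", "a@example.com"))

# Dynamic schema (document-store style): each record is just a dict,
# so different records can carry different fields with no migration.
events = []  # stand-in for a document collection
events.append({"user": "Alice", "action": "click", "page": "/home"})
events.append({"user": "Bob", "action": "purchase", "items": [1, 7], "coupon": "SPRING"})

# The flip side: the reader, not the database, must cope with variation.
for e in events:
    print(e.get("coupon", "<none>"))
```

The trade-off shows in the last loop: what a fixed schema enforces at write time, a dynamic schema pushes onto every reader.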
NoSQL products commonly make sense for new applications. NewSQL products, to date, have had a harder time crossing that bar. The chief reasons for the difference are, I think:
- Programming model!
- Earlier to do a good and differentiated job in scale-out.
- Earlier to be at least somewhat mature.
And that brings us to the 762-gigabyte gorilla — in-memory DBMS performance — which is getting all sorts of SAP-driven marketing attention as a potential reason to switch. One can of course put any database in memory, provided only that it is small enough to fit in a single server’s RAM, or else that the DBMS managing it knows how to scale out. Still, there’s a genuine category of “in-memory DBMS/in-memory DBMS features”, principally because:
- In-memory database managers can and should take a very different approach to locking and latching than ones that rely on persistent storage. (A toy sketch follows this list.)
- Not all DBMS are great at scale-out.
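To give a flavor of the locking/latching difference, here is a toy sketch of the optimistic versioned-write pattern that latch-free in-memory engines are generally described as using. The record layout and retry loop are my own simplification, not any vendor's actual design.

```python
import threading

class VersionedRecord:
    # Toy optimistic-concurrency record: writers never block readers.
    # Real engines (Hekaton, as publicly described) use hardware
    # compare-and-swap and multiversioning; the lock below merely
    # emulates a single CAS instruction, which pure Python lacks.
    def __init__(self, value):
        self.version = 0
        self.value = value
        self._cas = threading.Lock()

    def read(self):
        # Readers take no locks; they simply see a committed version.
        return self.version, self.value

    def try_update(self, expected_version, new_value):
        # Commit only if nobody else committed since we read; else retry.
        with self._cas:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True

# Typical caller: read, compute, attempt to commit, retry on conflict.
rec = VersionedRecord(100)
while True:
    v, val = rec.read()
    if rec.try_update(v, val + 1):
        break
```

The point is architectural: with no disk latency to hide behind, time spent waiting on locks costs proportionally far more, so these engines prefer retrying to blocking.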
But Microsoft has now launched Hekaton, about which I long ago wrote:
I lack detail, but I gather that Hekaton has some serious in-memory DBMS design features. Specifically mentioned were the absence of locking and latching.
My level of knowledge about Hekaton hasn’t improved in the interim; still, it would seem that in-memory short-request database management is not a reason to switch away from Microsoft SQL Server. Oracle has vaguely promised to get to a similar state one of these years as well.
Of course, HANA isn’t really a short-request DBMS; it’s an analytic DBMS that SAP plausibly claims is sufficiently fast and feature-rich for short-request processing as well.* It remains to be seen whether that difference in attitude will drive enough sustainable product advantage to justify switching.
*Most obviously, HANA is columnar. And it has various kinds of integrated analytics as well.
Related links
- Wants vs. needs (March, 2014)
- The refactoring of everything (July, 2013)
- Notes on memory-centric data management (January, 2014)
- Traditional databases will eventually wind up in RAM (May, 2011)
- Coverage of memory-centric DBMS flag-wavers MemSQL, Aerospike, and SAP HANA
Comments
Curt, do you see or know of any use cases where NoSQL and NewSQL compete, or where the use cases are differentiated? Also, it looks like almost all NewSQL vendors are in-memory.
John,
First, I wouldn’t at all say that NewSQL products are in-memory. VoltDB and MemSQL are in-memory, but Clustrix, Tokutek et al. are no more “in-memory” than NoSQL systems are (and less so than Aerospike). Note also that Couchbase grew out of an in-memory technology as well.
As for competition between NoSQL and NewSQL — absolutely, most especially in the case of aggregations simple enough that NoSQL could conceivably handle them. E.g., MemSQL’s flagship deal is to a considerable extent at Couchbase’s expense, namely at Zynga.
I’m not following how you are parsing the differences. As I see it:
– There are many *for purpose* database engines focusing on specific use cases of data access. You call them NewSQL if they have a SQL API and don’t front a big-data stack. A better name for these may be niche vendors; I’m not sure why anyone using them would care whether they are in-memory or not. More interesting is which use cases they manage well, and how they manage them.
– There are a lot of DBMS APIs, generally SQL, layered on big-data stacks. These are generally big-data vendors trying to move up the food chain from commodity storage/compute services, though some are distributed databases branded as big data. These are niche too, but the key here is a MapReduce-like processing layer.
The way I tend to look at these is:
1. For-purpose DBMS. This includes:
1.1 Specialized indexes (Neo4j, Lucene, replicated distributed KV caches; arguably doc stores fit here, as do the Oracle and SQL Server in-memory column stores).
1.2 OLTP-specialized models (could be monolithic, such as Volt, or distributed, such as Couch).
1.3 Query-specialized models (perhaps with governance and other features allowing scaling): Teradata, Netezza, Vertica, SAS, etc.
2. Scalable general-purpose DBMS. Mostly RDBMS – think Oracle/SQL Server/DB2.
3. Alternative traditional general-purpose DBMS. MySQL, Postgres, etc.
My experience is that for 95% of applications the choice doesn’t really make much of a difference, and people reasonably do what works for them and fits into their existing stacks, as you describe. What is interesting is where there is friction.
The friction could be cost, for example if a company is attempting to compete in a data-driven business on price of service.
The friction could be a programmer-driven, change-driven development march. Programmers, especially in agile environments, have different goals from data modelers. This leads to database-as-persistence designs; dynamic schemas and XML and JSON storage are artifacts of that. It optimizes for app innovation: the database is used for locking and recovery, but not much for query or data consistency.
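A minimal sketch of that database-as-persistence pattern, with Python's bundled sqlite3 standing in for whatever durable store is handy (table and field names are invented):

```python
import json
import sqlite3

# Database-as-persistence: the app serializes whole objects as JSON blobs.
# The DBMS supplies durability, locking, and recovery; querying and
# cross-field consistency live in application code instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, doc TEXT)")

profile = {"id": "u42", "name": "Alice", "prefs": {"theme": "dark"}}
conn.execute(
    "INSERT OR REPLACE INTO objects VALUES (?, ?)",
    (profile["id"], json.dumps(profile)),
)

# Reads rehydrate the blob; the object's shape can drift from release
# to release without a migration, which is the app-innovation trade-off.
row = conn.execute("SELECT doc FROM objects WHERE id = ?", ("u42",)).fetchone()
print(json.loads(row[0])["prefs"]["theme"])
```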
The glamor outliers are the scale issues. These create performance and license and stability and special-case management issues…
Aaron,
I find your categories and phrasing confusing. E.g., you seem to be suggesting that NuoDB is less scalable than Oracle. I think you’re using “OLTP” the way I use “short-request”.
I’m guessing that we see the choice of which product to use in similar ways, but I’m not really sure what you’re actually saying. Sorry!
Hi Curt,
A question and a comment…
I very much like your posts on dynamic schemas… but I wonder, what is the trade-off? If it’s dynamic vs. static, why would anyone want static? There must be a downside, I guess? Any ideas… as this would help with the theme of your post on when to pick what?
I agree that an in-memory DBMS is not just a DBMS with its data in memory. It is a DBMS engine that is optimized around the advantages of doing no I/O. The no-latch approach you mention is a great example, but there are others, including optimizations to utilize processor cache, the ability to use vector instruction sets, and optimizations for NUMA. There will be more coming, I expect, including optimizations around nonvolatile memory.
Rob
Hi Rob,
Queries are easiest and most flexible against a fixed, relational schema. That’s the essence of Ted Codd’s invention.
What we’re seeing now is people trying to invent things that have rather competitive RDBMS functionality and performance while also having rather competitive functionality and performance in dynamic schema uses — and which furthermore have all that goodness against the same data.
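As a tiny illustration (tables and data invented), here is the kind of unanticipated question that a fixed relational schema answers with a single declarative query:

```python
import sqlite3

# With a fixed relational schema, ad-hoc questions nobody anticipated
# at design time are still just declarative queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (name TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO customers VALUES ('Alice', 'EMEA'), ('Bob', 'APAC');
    INSERT INTO orders VALUES (1, 'Alice', 30.0), (2, 'Bob', 45.0), (3, 'Alice', 20.0);
""")

# An unanticipated question: revenue by region. No application change
# needed, because the data's structure was declared up front.
for region, revenue in conn.execute("""
    SELECT c.region, SUM(o.total)
    FROM orders o JOIN customers c ON o.customer = c.name
    GROUP BY c.region
"""):
    print(region, revenue)
```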
Curt,
NoSQL has primarily taken databases back by a couple of decades. It is good for puts and gets (basically single-row operations), but users have to spend time upfront designing a model that serves one particular application. Schema-on-read sounds more like a feature. Reporting and analytics obviously are not in the sweet spot of NoSQL engines. For example, DataStax provides a Hadoop node and Hive for analytics, which is questionable: will the user experience the speed using Hive? Cassandra doesn’t have much to offer when it comes to scanning data, and Hive on top seems like a toy.
Thoughts?
John,
I write about these trade-offs often. If it’s easy to design a relational schema that will serve well, one probably should do so, and use an RDBMS. But that’s not always the case, in practice and often even in theory.
Using a hammer to drive in a screw is rarely optimal, and things are often extreme enough for that metaphor to be apt.
Agreed. Question:
Why would a user choose Cassandra or MongoDB over MemSQL, VoltDB, or Clustrix? All three vendors claim to support schema-on-read with JSON types, and they are blazingly fast.
Trying to learn and educate myself through your wonderful blog!
JSON support is much newer in those RDBMS than in MongoDB. In most cases it also is kludgier and less performant than non-relational alternatives that are designed to be non-relational. (The sketch below gives a flavor.)
MemSQL and VoltDB aren’t designed for databases that spill onto conventional persistent storage. (Basically; there are exceptions, as for example in my recent blog about the latest MemSQL release.)
VoltDB has a weird programming model.
Cassandra has long had geo-distribution.
Etc.
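To give a flavor of the JSON point, here is roughly what JSON access looks like when bolted onto SQL. SQLite's JSON1 functions stand in here for whatever syntax a given RDBMS offers, this assumes a SQLite build that includes JSON1, and the data is invented.

```python
import sqlite3

# JSON bolted onto SQL: documents live in a text column and are reached
# through string path expressions rather than ordinary field access.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
conn.execute(
    "INSERT INTO events VALUES (?)",
    ('{"user": {"name": "Alice"}, "action": "click"}',),
)

# In a document store this is roughly db.events.find({"action": "click"});
# in SQL it becomes json_extract() against path strings.
row = conn.execute(
    "SELECT json_extract(doc, '$.user.name') FROM events "
    "WHERE json_extract(doc, '$.action') = 'click'"
).fetchone()
print(row[0])  # -> Alice
```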
Thanks, Curt…
Is there a dynamic schema implementation you particularly like? I’ll take a look…
Rob
Rob,
I liked Akiban’s hybrid approach, but customers evidently didn’t agree, and the company was acquired.
WibiData Kiji is another kind of best-of-both-worlds attempt.
MarkLogic is also trying best-of-all-worlds, but I haven’t kept up with them since soon after the latest management team change.
And of course there are the various schema-on-need efforts, but those are analytically focused, and I suspect that’s not what you’re asking about.
Very useful discussion.
Scale-out + licensing + the ability to live without joins line up with what I’ve seen most recently in evaluations of traditional SQL versus NoSQL.
I think a more dynamic schema is also important when the rate of change is high (as it is for product hierarchies, content taxonomies, customer profile data, and many other domains).
Great article!
Hi Curt,
I enjoyed the article; I am writing a research paper for a graduate school class on NoSQL, and this is very helpful.
I have a question regarding using NoSQL. What criteria would you use to determine which NoSQL database to use, more specifically MongoDB vs. Cassandra? Also, regardless of the type of NoSQL, what would be the biggest challenges for a company implementing one for the first time?
I appreciate any feedback you can provide!
Thanks!
Charles
A friend on the Redmond campus told me that for many quarters Oracle, DB2, and SQL Server had been numbers 1, 2, and 3 by revenue, until Q1 2014, when this flipped. Perhaps this is more the NoSQL products starting to take revenue away from the higher-end databases.
[…] mixed together true migration and new-app platforms in a post last year about DBMS architecture choices, when I […]
[…] Multiple kinds of data model are viable … […]