The Naming of the Foo
Let’s start from some reasonable premises.
- No technology category name is ever perfect.
- It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about.
- That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.
- Dwight Merriman (founder/CEO of MongoDB vendor 10gen) is heading in the right direction when he says that the unifying ideas of NoSQL are that you do away with transactions and joins. But if he’s ever said something like “NoSQL is Foo without joins and transactions,” I don’t know what Foo is.
- Actually, I do know what Foo is – Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. I just don’t know what Foo is called.
- Obviously, Foo is a lot like OLTP (OnLine Transaction Processing). However, it would be pretty silly for Foo to actually be OLTP, given that one of the core points of NoSQL is that you don’t have transactions.
- It’s not just the “T” part of OLTP that’s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*
*Sure, if you strain you can talk yourself into exceptions. But the point stands.
So we need a name for Foo, where Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. Thus, three major subcategories of more-or-less disk-based Foo are:
- No-compromises ACID-compliant relational OLTP
- Sharded MySQL
- NoSQL
There may be some more purely memory-centric versions too, but let’s put those aside for the moment.
Absent a better idea, I can squeeze Foo into yet another four-letter acronym:
HVSP (High-Volume Simple Processing)
That’s as imperfect as any other category name, and an awkward mouthful to boot. So I’d love to hear a better one; if you have such, please share it! In the meantime, I think “HVSP” has merit because:
- The “Processing” part should be noncontroversial.
- “High-Volume” is inherent to the challenge. If RDBMS scale well enough for your use case, using something less powerful is probably silly.* Similarly, while Oracle shines at high-volume OLTP workloads, there are many cheaper DBMS that do a fine job of OLTP at lower volumes.
- “Simple” is the core principle of NoSQL systems, which drop joins and transactions as being too much foofarah. That only makes sense at all under the assumption that you have bone-simple queries and updates, so that programming around the lack of joins and transactions isn’t all that much of a burden.
- Something similar is true of sharded MySQL.
- Less obviously, “simple” is a core principle of relational OLTP as well. The point of the relational model is to cap the complexity of data operations, or more precisely to hide that complexity from programmers.
- And overloading the word “simple” a bit, it’s fair to say that if you’re reading or writing one record at a time, you’re doing something relatively simple, at least as opposed to what you do in analytic processing. The OLTP vs. OLAP distinction is preserved in this name change.
- The whole thing matches my definition above, namely “what happens when lots of people want to get small amounts each of information in or out of a database at the same time.”
*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.
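To make the “Simple” point concrete, here is a minimal sketch of what “programming around the lack of joins” can look like. Everything in it is an assumption for illustration: the two dictionaries are hypothetical stand-ins for key-value collections, and the names `users`, `orders`, `get_user` and `orders_with_names` do not come from any particular NoSQL product.

```python
# Hypothetical in-memory stand-ins for two key-value collections.
users = {
    "u1": {"name": "Ann"},
    "u2": {"name": "Bob"},
}
orders = {
    "o1": {"user_id": "u1", "total": 30},
    "o2": {"user_id": "u2", "total": 12},
    "o3": {"user_id": "u1", "total": 5},
}


def get_user(user_id):
    """The bone-simple case: one record read by key, no join needed."""
    return users[user_id]


def orders_with_names():
    """A hand-written join, done in application code because the store has none."""
    rows = []
    for order_id, order in sorted(orders.items()):
        user = users[order["user_id"]]  # one extra key lookup per order
        rows.append((order_id, user["name"], order["total"]))
    return rows
```

If nearly all of an application’s traffic looks like `get_user`, the occasional `orders_with_names` is a small burden. If most of it looks like `orders_with_names`, the trade-off is much less attractive.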
Systems I’m leaving out of the HVSP and hence also NoSQL categories include:
- Hadoop and other batch-oriented MapReduce. Hadoop isn’t part of NoSQL. I’m pretty sure that Cloudera CEO Mike Olson agrees with me.
- More generally, non-SQL data stores that don’t meet the HVSP criteria. Dave Kellogg stretches things when he claims that MarkLogic is a NoSQL system. (But then, that was in a post where he seemingly praised a train wreck of an article.)
But hey – what good is a categorization if it doesn’t leave some things out?
Comments
37 Responses to “The Naming of the Foo”
Good work here. I agree the category needs a positive name that says what it is rather than what it is not. However, in my experience a big part of the reluctance to define a category is because it restricts the solution space.
In this case, how would you respond to systems like VoltDB or H-Store, which try to move or remove the boundary between the data management tier and the application tier? If you express your business logic in a language the DB understands (but not SQL), then the DB can do you a lot of favors.
I think a lot of what is driving NoSQL is developers who learned to treat their DB as a dumb key-value store, and are now realizing that if all they want is key-value plus a bit, there are better options than MySQL. That’s a fine conclusion, but maybe they’re not solving the right problem.
In-memory DBMS that only use disk as an afterthought would probably be a fourth subcategory. I ducked that subject because I’m not confident I know the full range of emerging contenders out there.
VoltDB is the one I’m most familiar with.
Sharded MySQL implies no joins over your entire data set. But there are still joins done within the shard.
There are dramatic differences between members of the NoSQL family. Some require sharding, others (HBase, Cassandra) do not. Most are crash safe but a few are not.
MongoDB looks a lot like sharded MyISAM with better replication (not crash safe, tables support single-writer or multiple-readers). Is this a radical change?
I wish there were a better way to describe systems. NoSQL versus SQL doesn’t do it. But a ‘geek code’ for the attributes of a transaction processing system might not catch on — sharded/unsharded, async/sync replication, strong/eventual consistency, …
Massively Distributed Eventually Consistent Processing (MDECP)
maybe?
To me the key characteristics are the built-in distribution and the eventual consistency model.
Eventual consistency is only one of the options. That’s true even w/in NoSQL (I’m just working on that post now), let alone when we also include RDBMS possibilities.
[…] = HVSP (High Volume Simple Processing) without joins or explicit […]
I think the real issue here in nailing down NoSQL, is that I wonder if over time these DBMSs will evolve the same or similar capabilities as a traditional RDBMS.
Application developers tend to view the RDBMS as a “bit bucket” to persist their application data. The problem is that developers don’t have the perspective to see the bigger picture. In particular, they don’t really concern themselves with reporting or data interoperability. Developers also don’t think in a declarative, set-based way. At one of the previous places I worked, it took me 6 months to convince the developers to use an ETL tool. They were adamant, and even coded the ETL in Java. It took a long time to develop and was eventually deemed a failure. It was only then that they decided to look at the ETL approach.
I had the same problem with Hibernate (an ORM layer). They wanted to create all the data models in Hibernate. I allowed this since the applications were all one-offs and the data was never going to be shared outside of the app. But this approach would be problematic once new applications had to share data elements which were tailored to a legacy app.
My big concern with the NoSQL approach is that developers will make a beeline for it, and it will become the de facto way of developing applications. Sure, it’s great if you’re the next Facebook (fat chance), but for most applications, this means putting a tremendous amount of data management back onto the developer.
To wit, over the summer I developed a data entry application for a small pharmaceutical company. They had hundreds of medical records, with a fairly complex schema, that needed entering. However, I was able to develop this application far faster than any Java or C# developer, since I let the RDBMS do all the work for me purely through modeling in the RDBMS (i.e. 3NF, integrity constraints, cascading deletes, etc.). The front-end MS Access forms were purely configured, with not a single line of procedural code. Users could enter data, delete records, search by field, etc. And I was able to run all sorts of reports to monitor data quality, not to mention the extracts that would go to the statistician for analysis.
I look at something like Cassandra, and think that it’s a huge step backwards for all but the biggest web sites out there.
It will be interesting to see what happens. I can see a lot of different scenarios playing out.
[…] is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no […]
I think the most fundamental requirement for these new systems is that they can be partitioned across multiple nodes. As Neil said, after they get the partitioning ironed out, it will be interesting to see them continue to add the same features that make relational databases popular.
Maybe they are all trying to become a free version of Oracle RAC?
Point of order:
Partitioning data is trivially easy. What’s hard is getting the system behavior you want after you’ve partitioned it.
GLENDOWER
I can call spirits from the vasty deep.
HOTSPUR
Why, so can I, or so can any man;
But will they come when you do call for them?
I would replace “Simple” in your definition with “Data”. “Simple” just reflects the current state of NoSQL technology. The key word here is “current”.
MarkLogic Server is designed for scale. Sharding is the technique used for that, which is commonly considered the best approach in the NoSQL community. Of course there’s a lot more to its architecture, but it’s quite obvious after reading your post that you are not aware of any of this.
So how can you call an article a “train wreck” when its author got informed and is only guilty of doing better research than you did?
@Vlad – but it’s not data anymore, it’s about information. That’s part of the change – it’s documents, it’s many things, but mostly it’s not about tables (at least not exclusively)
@Vlad,
If it’s not simple, then SQL or a substitute is harder to live without.
From memory, when last I talked with Mark Logic they weren’t doing much in the way of high throughput, at least on the levels commonly associated with NoSQL. Mark Logic seemed more focused on doing complex things with decent performance than doing simple things with great performance.
If that’s no longer true, Dave and the gang have done an uncharacteristically poor job of marketing MarkLogic’s new capabilities.
@Curt I’m available to clarify, but I think these keywords will help you understand how MarkLogic Server works right now:
sharding, high availability, strict consistency, mvcc, fragmentation, failover.
A lot of information for one line – but probably for you it all makes sense quite easily. Take care
@Nuno,
Do you think any part of http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/ is wrong or out of date? If so, which?
But upon rereading that — on the one hand, I’m wondering whether I was a little too quick to dismiss Dave’s claim. On the other, I still don’t know of Mark Logic pursuing the kinds of applications we’d normally associate w/ MySQL.
Hi Curt,
I don’t think the problem is with the analysis you made but with the focus of the presentation you saw. It was not focused on scaling.
The focus was on the search, the indexing, getting information out of the database really fast. I grant you that integrated full-text search is not common in NoSQL; it normally requires integration with third-party solutions like Lucene and Solr. But I just recently saw someone from MongoDB claiming that one of their goals is to get some further degree of control over search. This is probably because they recognize how much faster their searches would be if they had a “universal index” in a sharding architecture. Like MarkLogic Server does 🙂
I can give you an example of an application that I’m dead sure would run fantastically in MarkLogic Server: Twitter
Some key differentiators would be:
– integrated full text search
– enrichment of the status with common things people query
– geospatial queries in the same indexing space
– sharding
– collections
– reverse queries to find un-anticipated relations.
I think they would actually be able to do things they assume impossible and choose not to support.
“Train wreck” was a little harsh to describe the IEEE article, Curt. I praised it as a good article to hand the CIO (you know, those folks who don’t spend all day worrying about database internals like some of us do).
If you know of a better CIO-level NoSQL article, please share its URL. I don’t doubt that you could write one.
And it would start with a good taxonomy of NoSQL systems. And it’s simply illogical to not include XQuery systems in an [un]category called NoSQL.
Yes, NoSQL is about CAP, but not entirely. And I kind of like your HVSP but I think it will also be about analytics.
Check out the Wikipedia “structured storage” page — perhaps a definition, as opposed to an un-definition — would solve the problem.
Dave,
I admit to some prejudice because of the horrible process that led up to the article. But:
1. He claims that handling unstructured data better than relational systems do is central to NoSQL. Huh?
2. He claims that it’s hard to join across nodes of an MPP RDBMS. Huh? It might be slow, but it’s not hard. Similar errors in that vein abound.
3. He repeats the too-common columnar can’t be relational error.
4. He calls constraints restraints, even though I corrected that error for him during the research process.
5. He’s totally confused as to whether SQL is complicated or querying without SQL is complicated.
6. He randomly calls NoSQL systems “applications”.
7. He doesn’t acknowledge that some NoSQL systems — notably MongoDB — have companies behind them offering support contracts.
Yes, the article is a train wreck. And even if that’s over-harsh, your praise was way over-effusive. 😉
Curt:
“If it’s not simple, then SQL or a substitute is harder to live without.”
So what? Surely it is not going to be SQL. Think of it as a low-level, language-agnostic API that gives developers full access to the internals of a distributed storage system. There will be no joins or transactions, so technically it is still NoSQL. I just do not agree with “simplicity” in your definition of NoSQL.
@Vlad Rodionov
Nothing wrong with a low-level, language-agnostic API. However, I do think that some people will use that API to develop a declarative, set-based query language on top of it.
I spent some time writing MapReduce functions on MongoDB (on the weekends I dabble in MongoDB) to find duplicates. It is certainly possible, but it is time-consuming to write all that code.
select x, count(*)
from mytable
group by x
having count(*) > 1
isn’t so bad at all.
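For readers who want to see the contrast side by side, here is a rough sketch under loud assumptions. Python’s standard `sqlite3` module stands in for the RDBMS, and a plain-Python map/shuffle/reduce stands in for the hand-written approach; it is not MongoDB’s actual MapReduce API. The table `mytable` and column `x` come from the query above; the sample values and the function names are invented.

```python
import sqlite3
from collections import defaultdict

values = ["a", "b", "a", "c", "b", "a"]  # invented sample data


def duplicates_sql(values):
    """Declarative version: let the engine group and filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (x TEXT)")
    conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(v,) for v in values])
    rows = conn.execute(
        "SELECT x, COUNT(*) FROM mytable "
        "GROUP BY x HAVING COUNT(*) > 1 ORDER BY x"
    ).fetchall()
    conn.close()
    return rows


def duplicates_mapreduce(values):
    """Hand-written version: map, shuffle and reduce spelled out by the programmer."""
    # Map: emit a (key, 1) pair for every value.
    emitted = [(v, 1) for v in values]
    # Shuffle: collect the emitted counts by key.
    groups = defaultdict(list)
    for key, count in emitted:
        groups[key].append(count)
    # Reduce: sum each group, then keep only keys seen more than once.
    totals = {key: sum(counts) for key, counts in groups.items()}
    return sorted((key, n) for key, n in totals.items() if n > 1)
```

Both functions should return the same list of (value, count) pairs for the same input. The difference is in how much of the grouping logic the programmer has to write and maintain.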
@Vlad,
Ultimately, a DBMS or substitute technology is a big DML interpreter. So one of the top criteria is that the language(s) supported be conducive to efficient and effective programming. In many use cases, SQL is a fine language for that purpose. E.g., when joins are inherent to the problem (and not just artifacts of low-benefit normalization), SQL is apt to be a great choice.
Some relational advocates would say “low-benefit normalization” is a contradiction in terms. Fine. But even if one disagrees with them, there are plenty of cases where joins come in very handy.
Thanks Curt.
I was previously unaware of your post about the IEEE article and the process behind it.
I do think one of the things NoSQL is about is unstructured data. The value in key-value pairs might be a tweet or a profile entry or a webpage.
I think I will try to write a better CIO’s Guide to NoSQL myself when I get some time.
Best,
Dave
Dave,
I thought my link to that post, in the bullet point that mentioned your name — and indeed in the very words you disputed — would have been your first clue. 😉
I look forward to your guide.
Best,
CAM
[…] Monash took up this naming issue in his recent “naming of the foo” post, one of three that he published in quick succession about NoSQL. In that post he […]
[…] And that, folks, is a big part of why the NoSQL folks are so negative about joins. […]
Interesting post; it helps to see more clearly into the NoSQL offerings.
Thanks.
Here’s how I see NoSQL: IMHO, simply put, NoSQL databases are disguised object databases, with relaxed constraints (e.g. relaxed consistency, no transactions, no joins, etc.).
Here is my post:
http://www.jroller.com/dmdevito/entry/thinking_about_nosql_databases_classification
for more details.
[…] is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL […]
How about
OLRP = Online Request Processing systems.
(As in Web request processing.)
That’s pretty good, actually!
[…] the column-group-architecture guys — have probably had the most bang-in-lots-of-writes HVSP production […]
Curt,
Sorry if I missed your references to this term in other postings.
Daniel Abadi has an interesting post (http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html) that contains the quote: “In other words, NoSQL really means NoACID.”
I don’t know if NoACID is a better term than your proposed HVSP, but in my opinion NoACID does a better job than NoSQL at implicitly conveying the premise behind the cloud (pardon the pun) of NoSQL.
One problem with the terms NoSQL and NoACID is that they try to tell you how that product category is NOT like another product category, rather than how it is similar to yet other product categories, as Dave K. mentions above.
Your thoughts?
Scott R.
Hi Scott,
I certainly agree that ACID is a central issue, as per — for example — http://www.dbms2.com/2010/09/21/acid-compliant-transaction-integrity/ , especially the D.
Marton Trensceni also had a good idea, as per http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/
Best,
CAM
The SQL language is very powerful and much of that power depends upon table joins. I wouldn’t advise using a db without joins unless you have no choice.