March 13, 2010

The Naming of the Foo

Let’s start from some reasonable premises.

No technology category name is ever perfect.
It’s particularly hard to describe NoSQL (Not Only SQL) accurately, given the basic confusion as to what NoSQL is all about.
That said, it seems pretty clear that NoSQL is about making big websites (and perhaps other cloud-like installations) run and scale.
Dwight Merriman (founder/CEO of MongoDB vendor 10gen) is heading in the right direction when he says that the unifying ideas of NoSQL are that you do away with transactions and joins. But if he’s ever said something like “NoSQL is Foo without joins and transactions,” I don’t know what Foo is.
Actually, I do know what Foo is – Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. I just don’t know what Foo is called.
Obviously, Foo is a lot like OLTP (OnLine Transaction Processing). However, it would be pretty silly for Foo to actually be OLTP, given that one of the core points of NoSQL is that you don’t have transactions.
It’s not just the “T” part of OLTP that’s fried. Calling something “OnLine” only makes sense as long as offline is an option, and offline transaction processing has been obsolete for a very long time.*

*Sure, if you strain you can talk yourself into exceptions. But the point stands.

So we need a name for Foo, where Foo is what happens when lots of people want to get small amounts each of information in or out of a database at the same time. Thus, three major subcategories of more-or-less disk-based Foo are:

No-compromises ACID-compliant relational OLTP
Sharded MySQL
NoSQL

There may be some more purely memory-centric versions too, but let’s put those aside for the moment.

Absent a better idea, I can squeeze Foo into yet another four-letter acronym:

HVSP (High-Volume Simple Processing)

That’s as imperfect as any other category name, and an awkward mouthful to boot. So I’d love to hear a better one; if you have such, please share it! In the meantime, I think “HVSP” has merit because:

The “Processing” part should be noncontroversial.
“High-Volume” is inherent to the challenge. If RDBMS scale well enough for your use case, using something less powerful is probably silly.* Similarly, while Oracle shines at high-volume OLTP workloads, there are many cheaper DBMS that do a fine job of OLTP at lower volumes.
“Simple” is the core principle of NoSQL systems, which drop joins and transactions as being too much foofarah. That only makes sense at all under the assumption that you have bone-simple queries and updates, so that programming around the lack of joins and transactions isn’t all that much of a burden.
Something similar is true of sharded MySQL.
Less obviously, “simple” is a core principle of relational OLTP as well. The point of the relational model is to cap the complexity of data operations, or more precisely to hide that complexity from programmers.
And overloading the word “simple” a bit, it’s fair to say that if you’re reading or writing one record at a time, you’re doing something relatively simple, at least as opposed to what you do in analytic processing. The OLTP vs. OLAP distinction is preserved in this name change.
The whole thing matches my definition above, namely “what happens when lots of people want to get small amounts each of information in or out of a database at the same time.”

*Assuming, of course, that rows-and-tables are a good metaphor for your data structure in the first place.
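
To make “programming around the lack of joins” concrete, here is a minimal Python sketch. It assumes a key-value store modeled as plain dictionaries; the data and the names users, posts, get_post_with_author, and posts_by_author are hypothetical and not taken from any particular NoSQL product.

```python
# Hypothetical data: two "tables" held in a key-value store, modeled as dicts.
users = {
    "u1": {"name": "Ann"},
    "u2": {"name": "Bob"},
}
posts = {
    "p1": {"author_id": "u1", "text": "hello"},
    "p2": {"author_id": "u2", "text": "hi"},
    "p3": {"author_id": "u1", "text": "bye"},
}


def get_post_with_author(post_id):
    """Application-side 'join': two single-key reads instead of one SQL join."""
    post = posts[post_id]
    author = users[post["author_id"]]
    return {"text": post["text"], "author": author["name"]}


def posts_by_author(author_id):
    """With no secondary index, the application scans every post itself."""
    return sorted(pid for pid, p in posts.items() if p["author_id"] == author_id)
```

When every request looks like one of these lookups, the extra application code is small. The burden grows quickly once queries stop being bone-simple, which is why the “Simple” in HVSP matters.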

Systems I’m leaving out of the HVSP and hence also NoSQL categories include:

Hadoop and other batch-oriented MapReduce. Hadoop isn’t part of NoSQL. I’m pretty sure that Cloudera CEO Mike Olson agrees with me.
More generally, non-SQL data stores that don’t meet the HVSP criteria. Dave Kellogg stretches things when he claims that MarkLogic is a NoSQL system. (But then, that was in a post where he seemingly praised a train wreck of an article.)

But hey – what good is a categorization if it doesn’t leave some things out?

Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, MarkLogic, NoSQL, OLTP, Theory and architecture

Comments

37 Responses to “The Naming of the Foo”

Richard Tibbetts on March 13th, 2010 8:24 pm

Good work here. I agree the category needs a positive name that says what it is rather than what it is not. However, in my experience a big part of the reluctance to define a category is because it restricts the solution space.

In this case, how would you respond to systems like voltDB or hstore which try to move or remove the boundary between the data management tier and the application tier? If you express your business logic in language the DB understands (but not SQL) then the DB can do you a lot of favors.

I think a lot of what is driving NoSQL is developers who learned to treat their DB as a dumb key-value store, and are now realizing that if all they want is key-value plus a bit, there are better options than MySQL. That’s a fine conclusion, but maybe they’re not solving the right problem.
Curt Monash on March 14th, 2010 2:16 am

In-memory DBMS that only use disk as an afterthought would probably be a fourth subcategory. I ducked that subject because I’m not confident I know the full range of emerging contenders out there.

VoltDB is the one I’m most familiar with.
Mark Callaghan on March 14th, 2010 12:28 pm

Sharded MySQL implies no joins over your entire data set. But there are still joins done within the shard.

There are dramatic differences between members of the NoSQL family. Some require sharding, others (HBase, Cassandra) do not. Most are crash safe but a few are not.

MongoDB looks a lot like sharded MyISAM with better replication (not crash safe, tables support single-writer or multiple-readers). Is this a radical change?

I wish there were a better way to describe systems. NoSQL versus SQL doesn’t do it. But a ‘geek code’ for the attributes of a transaction processing system might not catch on — sharded/unsharded, async/sync replication, strong/eventual consistency, …
Unholyguy on March 14th, 2010 1:05 pm

Massively Distributed Eventually Consistent Processing (MDECP)

maybe?

To me the key characteristics are the built-in distributedness and the eventual consistency model.
Curt Monash on March 14th, 2010 4:58 pm

Eventual consistency is only one of the options. That’s true even w/in NoSQL (I’m just working on that post now), let alone when we also include RDBMS possibilities.
Toward a NoSQL taxonomy | DBMS2 -- DataBase Management System Services on March 14th, 2010 7:25 pm

[…] = HVSP (High Volume Simple Processing) without joins or explicit […]
Neil Hepburn on March 14th, 2010 9:04 pm

I think the real issue here in nailing down NoSQL, is that I wonder if over time these DBMSs will evolve the same or similar capabilities as a traditional RDBMS.
Application developers tend to view the RDBMS as a “bit bucket” to persist their application data. The problem is that developers don’t have the perspective to see the bigger picture. In particular, they don’t really concern themselves with reporting or data interoperability. Developers also don’t think in a declarative set-based way. At one of the previous places I worked, it took me 6 months to convince the developers to use an ETL tool. They were adamant, and even coded the ETL in Java. It took a long time to develop and was eventually deemed a failure. It was only then that they decided to look at the ETL approach.
I had the same problem with Hibernate (an ORM layer). They wanted to create all the data models in Hibernate. I allowed this since the applications were all one-offs and the data was never going to be shared outside of the app. But this approach would be problematic once new applications had to share data elements which were tailored to a legacy app.

My big concern with the NoSQL approach is that developers will make a beeline for it, and it will become the de facto way of developing applications. Sure it’s great if you’re the next Facebook (fat chance), but for most applications, this means putting a tremendous amount of data management back onto the developer.

To wit, over the summer I developed a data entry application for a small pharmaceutical. They had hundreds of medical records, with a fairly complex schema that needed entering. However, I was able to develop this application far faster than any Java or C# developer since I let the RDBMS do all the work for me purely through modeling in the RDBMS (i.e. 3NF, integrity constraints, cascading deletes, etc.). The front-end MS Access forms were purely configured with not a single line of procedural code. Users could enter data, delete records, search by field, etc. And, I was able to run all sorts of reports to monitor data quality, not to mention the extracts that would go to the statistician for analysis.

I look at something like Cassandra, and think that it’s a huge step backwards for all but the biggest web sites out there.

It will be interesting to see what happens. I can see a lot of different scenarios playing out.
Categorizing the “Foo” fighters – making sense of NoSQL — Too much information on March 15th, 2010 1:43 pm

[…] is interesting to see fellow analyst Curt Monash facing the same problem. As he notes, while there seems to be a common theme that “NoSQL is Foo without joins and transactions,” no […]
Matt Corgan on March 15th, 2010 2:19 pm

I think the most fundamental requirement for these new systems is that they can be partitioned across multiple nodes. As Neil said, after they get the partitioning ironed out, it will be interesting to see them continue to add the same features that make relational databases popular.

Maybe they are all trying to become a free version of Oracle RAC?
Curt Monash on March 15th, 2010 2:27 pm

Point of order:

Partitioning data is trivially easy. What’s hard is getting the system behavior you want after you’ve partitioned it.

GLENDOWER

I can call spirits from the vasty deep.

HOTSPUR

Why, so can I, or so can any man;
But will they come when you do call for them?
Vlad Rodionov on March 15th, 2010 2:29 pm

I would replace “Simple” in your definition with “Data”. “Simple” just reflects the current state of NoSQL technology. The key word here is “current”.
Nuno Job on March 15th, 2010 2:29 pm

MarkLogic Server is designed for scale. Sharding is the technique used for that – which is commonly considered the best approach in the NoSQL community. Of course there’s a lot more to its architecture, but it’s quite obvious after reading your post that you are not aware of any of this.

So how can you call an article by someone who got informed a “train-wreck” when they are only guilty of doing better research than you did?
Nuno Job on March 15th, 2010 2:30 pm

@Vlad – but it’s not data anymore, it’s about information. That’s part of the change – it’s documents, it’s many things, but mostly it’s not about tables (at least not exclusively)
Curt Monash on March 15th, 2010 2:36 pm

@Vlad,

If it’s not simple, then SQL or a substitute is harder to live without.
Curt Monash on March 15th, 2010 2:41 pm

From memory, when last I talked with Mark Logic they weren’t doing much in the way of high throughput, at least on the levels commonly associated with NoSQL. Mark Logic seemed more focused on doing complex things with decent performance than doing simple things with great performance.

If that’s no longer true, Dave and the gang have done an uncharacteristically poor job of marketing MarkLogic’s new capabilities.
Nuno Job on March 15th, 2010 3:39 pm

@Curt I’m available to clarify, but I think these keywords will help you understand how MarkLogic Server works right now:

sharding, high availability, strict consistency, mvcc, fragmentation, failover.

A lot of information for one line – but probably for you it all makes sense quite easily. Take care
Curt Monash on March 15th, 2010 4:49 pm

@Nuno,

Do you think any part of http://www.dbms2.com/2008/10/05/marklogic-architecture-deep-dive/ is wrong or out of date? If so, which?

But upon rereading that — on the one hand, I’m wondering whether I was a little too quick to dismiss Dave’s claim. On the other, I still don’t know of Mark Logic pursuing the kinds of applications we’d normally associate w/ MySQL.
Nuno Job on March 15th, 2010 5:39 pm

Hi Curt,

I don’t think the problem is on the analysis you made but the focus of the presentation you saw. It was not focused on scaling.

The focus was on the search, the indexing, getting information out of the database really fast. I give you that integrated full text search is not something that is common in NoSQL; it normally requires integration with third party solutions like lucene and solr. But I just recently saw someone from MongoDB claiming that it is in their goals to get some further degree of control on the search. This is probably because they recognize how much faster their searches would be if they had a “universal index” in a sharding architecture. Like MarkLogic Server does 🙂

I can tell you an example application I’m dead sure would run fantastically in MarkLogic Server: Twitter

Some key differentiators would be:
– integrated full text search
– enrichment of the status with common things people query
– geospatial queries in the same indexing space
– sharding
– collections
– reverse queries to find un-anticipated relations.

I think they would actually be able to do things they assume impossible and choose not to support.
Dave Kellogg on March 15th, 2010 8:59 pm

Train wreck was a little harsh to describe the IEEE article, Curt, and I praised it to say it was a good article to hand the CIO (you know, those folks who don’t spend all day worrying about database internals like some of us do).

If you know of a better CIO-level NoSQL article, please share its URL. I don’t doubt that you could write one.

And it would start with a good taxonomy of NoSQL systems. And it’s simply illogical to not include XQuery systems in an [un]category called NoSQL.

Yes, NoSQL is about CAP, but not entirely. And I kind of like your HVSP but I think it will also be about analytics.

Check out the Wikipedia “structured storage” page — perhaps a definition, as opposed to an un-definition — would solve the problem.
Curt Monash on March 15th, 2010 9:30 pm

Dave,

I admit to some prejudice because of the horrible process that led up to the article. But:

1. He claims that handling unstructured data better than relational systems do is central to NoSQL. Huh?

2. He claims that it’s hard to join across nodes of an MPP RDBMS. Huh? It might be slow, but it’s not hard. Similar errors in that vein abound.

3. He repeats the too-common columnar can’t be relational error.

4. He calls constraints restraints, even though I corrected that error for him during the research process.

5. He’s totally confused as to whether SQL is complicated or querying without SQL is complicated.

6. He randomly calls NoSQL systems “applications”.

7. He doesn’t acknowledge that some NoSQL systems — notably MongoDB — have companies behind them offering support contracts.

Yes, the article is a train wreck. And even if that’s over-harsh, your praise was way over-effusive. 😉
Vlad Rodionov on March 16th, 2010 1:50 am

Curt:
“If it’s not simple, then SQL or a substitute is harder to live without.”

So what? Surely it is going to be not SQL, think about it as a low-level language-agnostic API, which will allow developers to get full access to a distributed storage internals, but there will be no joins and transactions, so technically it is still noSQL. I just do not agree with “simplicity” in your definition of noSQL.
RC on March 16th, 2010 4:57 am

@Vlad Rodionov

Nothing wrong with a low level language-agnostic API. However I do think that some people will use that API to develop a declarative-set-based query language on top of that API.

I spent some time writing MapReduce functions on MongoDB (on the weekends I dabble in MongoDB) to find duplicates. It is certainly possible but time-consuming to write all that code.

select x, count(*)
from mytable
group by x
having count(*) > 1

isn’t so bad at all.
Curt Monash on March 16th, 2010 1:02 pm

@Vlad,

Ultimately, a DBMS or substitute technology is a big DML interpreter. So one of the top criteria is that the language(s) supported be conducive to efficient and effective programming. In many use cases, SQL is a fine language for that purpose. E.g., when joins are inherent to the problem (and not just artifacts of low-benefit normalization), SQL is apt to be a great choice.

Some relational advocates would say “low-benefit normalization” is a contradiction in terms. Fine. But even if one disagrees with them, there are plenty of cases where joins come in very handy.
Dave Kellogg on March 18th, 2010 9:06 am

Thanks Curt.

I was previously unaware of your post about the IEEE article and the process behind it.

I do think one of the things NoSQL is about is unstructured data. The value in key, value pairs might be a tweet or a profile entry or a webpage.

I think I will try to write a better CIO’s Guide to NoSQL myself when I get some time.

Best,
Dave
Curt Monash on March 18th, 2010 9:43 am

Dave,

I thought my link to that post, in the bullet point that mentioned your name — and indeed in the very words you disputed — would have been your first clue. 😉

I look forward to your guide.

Best,

CAM
Search Facets » The hyping of the NoSQL foo on March 22nd, 2010 11:47 am

[…] Monash took up this naming issue in his recent “naming of the foo” post, one of three that he published in quick succession about NoSQL. In that post he […]
RYW (Read-Your-Writes) Consistency explained | DBMS2 -- DataBase Management System Services on May 1st, 2010 12:57 am

[…] And that, folks, is a big part of why the NoSQL folks are so negative about joins. […]
Dominique De Vito on May 18th, 2010 3:01 pm

Interesting post, in order to see clearer into NoSQL offers.
Thanks.

Here’s how I see NoSQL: IMHO, NoSQL databases are simply disguised object databases, with relaxed constraints (e.g. relaxed consistency, no transactions, no joins, etc.).

Here is my post:
http://www.jroller.com/dmdevito/entry/thinking_about_nosql_databases_classification
for more details.
I’m collecting data points on NoSQL and HVSP adoption | DBMS2 -- DataBase Management System Services on August 18th, 2010 9:09 am

[…] is up to 2, the deadline is next week and, crucially, it has been agreed that I may talk about HVSP in general, NoSQL and SQL […]
Marton Trencseni on August 25th, 2010 10:57 am

How about

OLRP = Online Request Processing systems.

(As in Web request processing.)
Curt Monash on August 25th, 2010 2:06 pm

That’s pretty good, actually!
More on NoSQL and HVSP (or OLRP) | DBMS 2 : DataBase Management System Services on August 26th, 2010 5:15 am

[…] the column-group-architecture guys — have probably had the most bang-in-lots-of-writes HVSP production […]
NoSQL Daily – Wed Sep 22 › PHP App Engine on September 22nd, 2010 4:16 am

[…] The Naming of the Foo | DBMS2 : DataBase Management System Services […]
Scott R. on September 30th, 2010 10:17 pm

Curt,

Sorry if I missed your references to this term in other postings.

Daniel Abadi has an interesting post (http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html) that contains the quote: “In other words, NoSQL really means NoACID.”

I don’t know if NoACID is a better term than your proposed HVSP, but in my opinion NoACID does a better job than NoSQL at implicitly conveying the premise behind the cloud (pardon the pun) of NoSQL.

One problem with the terms NoSQL and NoACID is that they try to tell you how that product category is NOT like another product category, rather than how it is similar to yet other product categories, as Dave K. mentions above.

Your thoughts?

Scott R.
Curt Monash on September 30th, 2010 11:15 pm

Hi Scott,

I certainly agree that ACID is a central issue, as per — for example — http://www.dbms2.com/2010/09/21/acid-compliant-transaction-integrity/ , especially the D.

Marton Trencseni also had a good idea, as per http://www.dbms2.com/2010/08/26/nosql-hvsp-olrp/

Best,

CAM
Data warehouse workload classification | Alex’s personal blog on October 14th, 2011 11:13 pm

[…] high-volume simple processing […]
karim on November 26th, 2011 5:27 am

The SQL language is very powerful and much of that power depends upon table joins. I wouldn’t advise using a db without joins unless you have no choice.
