NoSQL overview
My NoSQL article is finally posted; I hope it lives up to all the foreshadowing. It is being run online at Intelligent Enterprise/Information Week, as per the link above, where Doug Henschen edited it with an admirably light touch.
Below please find three excerpts* that convey the essence of my thinking on NoSQL. For much more detail, please see the article itself.
*Notwithstanding my admiration for Doug’s editing, the excerpts are taken from my final pre-editing submission, not from the published article itself.
My quasi-definition of “NoSQL” wound up being:
NoSQL DBMS start from three design premises:
- Transaction semantics are unimportant, and locking is downright annoying.
- Joins are also unimportant, especially joins of any complexity.
- There are some benefits to having a DBMS even so.
NoSQL DBMS further incorporate one or more of three assumptions:
- The database will be big enough that it should be scaled across multiple servers.
- The application should run well if the database is replicated across multiple geographically distributed data centers, even if the connection between them is temporarily lost.
- The database should run well if the database is replicated across a host server and a bunch of occasionally-connected mobile devices.
In addition, NoSQL advocates commonly favor the idea that a database should have no fixed schema, other than whatever emerges as a byproduct of the application-writing process.
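To make that last premise concrete, here is a minimal sketch of schema-free storage, assuming a document store such as MongoDB reached through the pymongo driver; the collection and field names are invented for illustration.

```python
# Minimal sketch of schema flexibility, assuming a local MongoDB instance
# and the pymongo driver; collection and field names are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

# Two records in the same collection, with different "schemas".
# No DDL was run; the structure is whatever the application wrote.
events.insert_one({"type": "pageview", "url": "/pricing", "user": "alice"})
events.insert_one({"type": "purchase", "sku": "X-42", "amount": 19.99,
                   "shipping": {"city": "Boston", "zip": "02134"}})

# Queries still work across the mixed documents.
for doc in events.find({"user": "alice"}):
    print(doc)
```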
I subdivided the space by saying:
If not SQL, then what? A number of possibilities have been tried, with the four main groups being:
- Simple key-value store.
- Quasi-tabular.
- Fully SQL/tabular.
- Document/object.
DBMS based on graph data models are also sometimes suggested to be part of NoSQL, as are the file systems that underlie many MapReduce implementations. But as a general rule, those data models are most effective for analytic use cases somewhat apart from the NoSQL mainstream.
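For readers who have not seen these data models side by side, here is a purely illustrative sketch of how one customer record might be represented under each of the four groups; all names and values are invented.

```python
# Purely illustrative: the same customer record under the four data-model
# groups named above. Keys, fields, and values are invented.

# 1. Simple key-value store: an opaque blob addressed only by its key.
key_value = {"customer:1001": b'{"name": "Alice", "city": "Boston"}'}

# 2. Quasi-tabular (column-family style): rows of named columns, but each
#    row may carry a different set of columns.
quasi_tabular = {
    "customer:1001": {"name": "Alice", "city": "Boston", "segment": "retail"},
    "customer:1002": {"name": "Bob", "country": "Canada"},  # different columns
}

# 3. Fully SQL/tabular: every row conforms to one declared schema, e.g.
#    CREATE TABLE customers (id INT, name TEXT, city TEXT);
sql_row = (1001, "Alice", "Boston")

# 4. Document/object: a nested structure stored and queried as a unit.
document = {
    "_id": 1001,
    "name": "Alice",
    "addresses": [{"city": "Boston", "zip": "02134"}],
    "orders": [{"sku": "X-42", "amount": 19.99}],
}
```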
My conclusion was:
So should you adopt NoSQL technology? Key considerations include:
- Immaturity. The very term “NoSQL” has only been around since 2009. Most NoSQL “products” are open source projects backed by a company of fewer than 20 employees.
- Open source. Many NoSQL adopters are constrained, by money or ideology, to avoid closed-source products. Conversely, it is difficult to deal with NoSQL products’ immaturity unless you’re comfortable with the rough-and-tumble of open source software development.
- Internet orientation. A large fraction of initial NoSQL implementations are for web or other internet (e.g., mobile application) projects.
- Schema mutability. If you like the idea of being able to have different schemas for different parts of the same “table,” NoSQL may be for you. If you like the database reusability guarantees of the relational model, NoSQL may be a poor fit.
- Project size. For a large (and suitable) project, the advantages of NoSQL technology may be large enough to outweigh its disadvantages. For a small, ultimately disposable project, the disadvantages of NoSQL may be minor. In between those extremes, you may be better off with SQL.
- SQL DBMS diversity. The choice of SQL DBMS goes far beyond the “Big 3-4” of Oracle, IBM DB2, Microsoft SQL Server, and SAP/Sybase Adaptive Server Enterprise. MySQL, PostgreSQL, and other mid-range SQL DBMS – open source or otherwise – might meet your needs. So might some of the scale-out-oriented startups cited above. Or if your needs are more analytic, there’s a whole range of powerful and cost-effective specialized products, from vendors such as Netezza, Vertica, Aster Data, or EMC/Greenplum.
Bottom line: For cutting-edge applications – often but not only internet-centric – NoSQL technology can make sense today. In other use cases, its drawbacks are likely to outweigh its advantages.
Comments
One of the best overviews I have seen.
But I’m still puzzled: how do you query (not just look up!) the data locked inside a NoSQL DBMS? The standard answer is: by developing some imperative code in your favorite language and running it with a MapReduce executor. Well… excuse me, but I’d say that is a pretty lame answer. You do need a declarative query language. And I think there is a lot to querying even without joins; Google’s BigQuery/Dremel work is going this way. I wonder whether they are categorized as NoSQL.
You can check my writeup for more on this point: http://bigdatacraft.com/archives/135
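As a rough illustration of this point (data and names invented), compare hand-written map/reduce-style plumbing with the one-line declarative statement of the same question:

```python
# Sketch of the contrast described above, with invented data.
# Question: how many events did each user generate?

events = [
    {"user": "alice", "type": "pageview"},
    {"user": "bob", "type": "purchase"},
    {"user": "alice", "type": "purchase"},
]

# Imperative, MapReduce-style answer: write the plumbing yourself.
def map_phase(doc):
    yield doc["user"], 1

def reduce_phase(key, values):
    return key, sum(values)

grouped = {}
for doc in events:
    for key, value in map_phase(doc):
        grouped.setdefault(key, []).append(value)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'alice': 2, 'bob': 1}

# Declarative answer: state the intent and let the engine plan it, e.g.
#   SELECT user, COUNT(*) FROM events GROUP BY user;
```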
I also wonder why enterprise Java cache solutions (or, as they are called, Enterprise Memory Fabric/Grid and so on) are not considered NoSQL. They were predecessors to all this. Examples include Israeli GigaSpaces, on the market for a decade now, and Tangosol for a similar time. (Tangosol CEO Cameron Purdy has an unfinished book, “Enterprise Hashmap,” that he was writing some 7-8 years ago but seems to have abandoned after selling his company to Oracle; how is that for NoSQL?) Tangosol, for example, had a decent query language as I remember, even special CEP extensions akin to StreamSQL. It could work as a front end for a DBMS or as a stand-alone data store. Is it a NoSQL solution?
A few notes and comments on the above from the MongoDB POV:
– Transactions are important and joins are useful! However, it is very hard, if not impossible, to scale a database with these features horizontally (especially on commodity hardware or on commodity networks). Thus the removal of joins and complex transactions in this space.
– Light transactional semantics can fit with NoSQL: it is really complex transactional semantics that are the problem. MongoDB and BigTable (I think?), for example, support atomic operations on single “objects” (see the sketch after these notes).
– It’s not all about scale. There are many very happy developers using projects that run on MongoDB that will never use more than one or two servers. Easier development, in addition to scale and speed, is one of the NoSQL value propositions. If we must give up relational because distributed joins are hard, we might as well then try to innovate around the data model and do something useful there too. The lower impedance mismatch between programmatic objects and, say, JSON is helpful, as is the dynamic nature of the schemas of these systems.
– I believe that, at least in the case I’m familiar with, 10gen, we have created a development and support track highly suitable for enterprises that want to make commercial use of MongoDB and know they can get support, quality, and a predictable product road map.
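Here is a minimal sketch of the single-object atomicity described in the first note, assuming a local MongoDB instance and the pymongo driver; the collection and field names are invented.

```python
# Sketch of a "light" atomic operation on a single document, assuming a
# local MongoDB instance and the pymongo driver; names are invented.
from pymongo import MongoClient

counters = MongoClient("mongodb://localhost:27017")["demo"]["counters"]
counters.insert_one({"_id": "pageviews", "n": 0})

# $inc is applied atomically to this one document; no multi-document
# transaction wraps it.
counters.update_one({"_id": "pageviews"}, {"$inc": {"n": 1}})

print(counters.find_one({"_id": "pageviews"}))  # {'_id': 'pageviews', 'n': 1}
```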
@Camuel MongoDB has a declarative query language — although it is not SQL.
Hi Dwight,
There are many kinds of joins and transactions that are difficult to scale, but there are also many that can be scaled. I can think of at least four companies attempting to commercialize this approach, and I work for one of them. I disagree that all transactions and joins need to be given up in exchange for scalability.
Transactions and joins scale when operations are done against data available at a single shard via the shard key or against infrequently updated replicated data.
Certain kinds of transactions across a subset of shards can also be scaled.
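A rough sketch of the co-location argument (the shard names and routing function are invented): when everything a transaction touches shares one shard key, the work lands on a single shard and needs no distributed coordination.

```python
# Rough sketch of shard-key routing; shard names and the hash scheme are
# invented for illustration. If every statement in a transaction touches
# data for one customer, and customer is the shard key, the transaction
# runs entirely on one shard.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

def shard_for(shard_key: str) -> str:
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All of customer 1001's rows (orders, payments, addresses) hash to the
# same shard, so a transaction or join over them is local to that shard.
print(shard_for("customer:1001"))

# A join or transaction spanning two different customers may span shards,
# which is where distributed coordination (and the pain) comes in.
print(shard_for("customer:1001"), shard_for("customer:2002"))
```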
I think MongoDB could support transactions, and not just atomicity, within a single document without a huge amount of effort. I also think it’s a matter of when and not if.
@dwight
I don’t consider the MongoDB query language declarative, but maybe that is just me (maybe working with Oracle for a long time has limited my vision too much). When you want to do aggregation, you often need to write MapReduce functions in JavaScript.
But the .NET drivers for MongoDB support LINQ, so I can write declarative code that is translated into the MongoDB query language or JavaScript MapReduce functions by those drivers.
I don’t think it would be very hard to write a ‘translator’ that turns a declarative statement into a statement the NoSQL database understands, mainly because NoSQL databases don’t support joins. This translation could be done in the driver or in the database itself.
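As a toy illustration of that ‘translator’ idea (entirely hypothetical, not modeled on any real driver), a flat declarative filter can be rewritten mechanically into a MongoDB-style query document precisely because no joins are involved:

```python
# Toy illustration of a declarative-to-NoSQL "translator"; entirely
# hypothetical and not modeled on any real driver. With no joins to
# worry about, a flat filter maps mechanically onto a MongoDB-style
# query document. (Toy limitation: at most one condition per column.)

OPERATORS = {"=": None, ">": "$gt", "<": "$lt", ">=": "$gte", "<=": "$lte"}

def translate(conditions):
    """conditions: list of (column, operator, value) tuples, ANDed together."""
    query = {}
    for column, op, value in conditions:
        mongo_op = OPERATORS[op]
        query[column] = value if mongo_op is None else {mongo_op: value}
    return query

# WHERE status = 'open' AND amount > 100
print(translate([("status", "=", "open"), ("amount", ">", 100)]))
# {'status': 'open', 'amount': {'$gt': 100}}
```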
@Ariel,
Agree with a number of things you say. With certain assumptions, you can do joins. For example, in the business intelligence space it can be practical, as a star schema and the read-oriented nature of a data warehouse make it easier. So to me Greenplum/Aster/Vertica make perfect sense.
Also, on a really fast, low-latency network, it might work. But I think that if we don’t make a lot of assumptions about schema, and one wants to scale to hundreds or thousands of machines on commodity hardware, it isn’t going to work.
And yes, one could do a ‘transaction’ on a single document; MongoDB v1.8 has durability, and rollback isn’t very meaningful anyway on a single object. So yes…
@rc The non-aggregate operations in MongoDB are declarative. So, yes, you are right re: aggregation, although we anticipate having more declarative ways to do aggregation in the future (not soon).
@dwight
Thank you for the reply. I checked the MongoDB site for another fresh look. Well… while the query language looks okay and, I guess, provides functionality equivalent to basic SQL (and will catch up with more advanced SQL capabilities), I don’t see the point in bringing a brand new query language in. Why not just implement SQL? At a bare minimum it would instantly look familiar to thousands if not millions of developers. I’m sure the MongoDB team has a good answer; I just haven’t found it yet.
SQL can easily be extended for any special functionality, and it would still be better than a totally new query language.
For example, Google adapted SQL for nested data, and another dialect for BigTable, and there are SQL extensions (at various stages of standardization) for time series, spatial data, bio-data, etc.
I have heard the critique that SQL string literals “pollute” source code, particularly because their syntax is not checked at compile time. That is a very valid point about compile-time syntax checking. But in my opinion the solution looks more like LINQ: making embedded SQL (or something SQL-like) a first-class programming language construct, or, to use the newer term, an embedded DSL :) if you prefer.
What makes MongoDB so special that it requires a completely new query language?
Curt,
I’ve definitely enjoyed reading your article, but I kind of disagree with many of the points you’ve made. While coming here to drop a note, it was interesting to find Dwight’s comment, which seems to confirm my counter-arguments: http://nosql.mypopescu.com/post/1293051486/nosql-basics-benefits-and-best-fit-scenarios
bests
Dwight, Alex, et al.,
Only a NoSQL extremist who says NoSQL should be used for everything could or should be accused of saying transactions, joins, etc. (of more than minor complexity) are universally useless. I don’t know of anybody at that extreme, so that’s a straw man.
Your more substantive disagreement with me, I think, is that I didn’t pay enough attention to the matter of schema flexibility, non-tabular schemas, whatever. You have a point on that one. One can’t emphasize everything in a single article. I didn’t completely leave schema issues out, but I also didn’t dwell on them.
@Camuel It’s a very good question about SQL. In fact this could be added at any time; there is an experimental JDBC driver for MongoDB today. The following post has more details. To me this has always been a close call; I was pretty surprised when the comments went so much the way they did:
http://blog.mongodb.org/post/447761175/should-mongodb-use-sql-as-a-query-language
Here is an example of mapping SQL aggregation to MongoDB MapReduce: http://rickosborne.org/blog/index.php/2010/02/19/yes-virginia-thats-automated-sql-to-mongodb-mapreduce/
Interesting that you should mention Caché. You should probably consider GT.M too. See http://www.mgateway.com/docs/universalNoSQL.pdf
It seems as though most people have forgotten the original reason the RDBMS was invented, and why it took off so quickly: it’s perspective-neutral and provides a declarative query layer.
Databases like IDS/IDMS, IMS, Cincom Total, Adabas, and System 2000 are all “NoSQL” and are still used because of their ability to perform well for certain applications, in particular banking. IBM still sells boatloads of IMS databases.
I also believe the NoSQL movement has not lived up to its promises. Apparently Digg had major uptime and data quality issues after moving over to Cassandra. Apparently Facebook doesn’t even use Cassandra, except for inbox searching (they still mainly rely on sharded MySQL and Memcache). Even Twitter only uses Cassandra for internal analytics.
I agree that non-RDBMS systems have a purpose, but the reasons simply come down to performance. Most businesses are not so much hardware-bound as they are constrained by short-sighted, application-centric data architecture.
The problems we face are more a challenge of management and culture than raw technology.
I tend to agree with what Neil said. The current generation, born with a mobile phone in one hand and a portable PC in the other, does not know what happened in the 1980s: the database debates, the E-R model, CODASYL, Codd and Date, System R, CP/M and DOS. Not everyone needs to know all of these, just as one doesn’t need to read Newton’s original thesis to learn physics today.
But it would be fair if at least database people were aware of some of this.
Relational databases were easy to make: just two-dimensional tables and some ‘SQL’ front end, and you have a database on any platform. Easy to use and easy to implement, and with new vendors and the shift to new platforms, the move to relational was ‘complete’.
But the original DBMSs had much more. Many, like IDMS, came with a fully integrated dictionary, an online development and runtime environment, and a built-in 4GL.
Today the failure of the relational model is apparent in light of Java’s need to implement a domain model.
The basic issue here is that real-life data is NOT relational but hierarchical! Many new applications are now trying to re-invent the wheel: cook up some kind of linked objects from a relational database, or use metadata in relational tables to implement what could be called an ‘old’ CODASYL structure!
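To make the hierarchy point concrete, here is a small, purely illustrative sketch (invented data) of the same order stored as one nested document versus flattened into relational-style rows linked by keys:

```python
# Purely illustrative: the same order as a nested ("hierarchical")
# document versus relational-style rows linked by foreign keys.

# Hierarchical / document form: the parent owns its children directly.
order_document = {
    "order_id": 7,
    "customer": "Alice",
    "lines": [
        {"sku": "X-42", "qty": 2},
        {"sku": "Y-17", "qty": 1},
    ],
}

# Relational form: the hierarchy is reassembled at query time via a join
# (or, as the comment above puts it, via "linked objects" cooked up in
# the application).
orders = [(7, "Alice")]                         # (order_id, customer)
order_lines = [(7, "X-42", 2), (7, "Y-17", 1)]  # (order_id, sku, qty)

reassembled = {
    "order_id": orders[0][0],
    "customer": orders[0][1],
    "lines": [{"sku": sku, "qty": qty}
              for oid, sku, qty in order_lines if oid == orders[0][0]],
}
assert reassembled["lines"] == order_document["lines"]
```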
Lastly, databases like IDMS did implement an SQL layer without compromising performance. This is seldom mentioned by pundits in the field.
See http://www.reocities.com/idmssql/idms-sql-database.htm for an article from 2005.