Attack of the Frankenschemas
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to:
- Retailing, especially online. Different kinds of products have different kinds of attributes, making a Grand Cosmic Schema rather complex. Examples I’ve blogged about include:
- Amazon relied on an in-memory object-oriented DBMS for its used books inventory lookup back in 2005.
- A Microsoft customer managed book and DVD inventory in XML the same year.
- More recently, 10gen spoke of a wireless telco offering cell phones and service plans in the same product catalog, built over MongoDB.
- Human resources. Employee-centric applications are naturally full of hierarchies, which can be annoying to flatten. Non-relational approaches I’ve blogged about include Workday’s object model and Neo4j’s graph-based contribution.
- Web log analysis. Web logs can be particularly hard to flatten, as per my post on (that sense of) nested data structures.
- More generally, marketing and other applications that maintain detailed profiles of customers or prospects. The information in these profiles is often based on a large variety of marketing campaigns, third-party databases, and analytic exercises. As the inputs pile up, the schemas get ever hairier.
- Electronic medical records. Medical records are one area where non-relational approaches may actually have majority share. I blogged about one example in 2008.
Or to quote a 2008 post,
Conor O’Mahony, marketing manager for IBM’s DB2 pureXML, talks a lot about one of my favorite hobbyhorses — schema flexibility* — as a reason to use an XML data model. In a number of industries he sees use cases based around ongoing change in the information being managed:
- Tax authorities change their rules and forms every year, but don’t want to do total rewrites of their electronic submission and processing software.
- The financial services industry keeps inventing new products, which don’t just have different terms and conditions, but may also have different kinds of terms and conditions.
- The same, to some extent, goes for the travel industry, which also keeps adding different kinds of offers and destinations.
- The energy industry keeps adding new kinds of highly complex equipment it has to manage.
Conor also thinks market evidence shows that XML’s schema flexibility is important for data interchange. For example, hospitals (especially in the US) have disparate medical records and billing systems, which can make information interchange a chore.
*I now call that dynamic schemas.
So, for fear of Frankenschemas, should we flee from RDBMS altogether? Hardly. For social proof, please note:
- Every application area I’ve cited can be and often is handled via relational techniques.
- Some of the non-relational alternatives I’ve mentioned, such as XML or object-oriented DBMS, haven’t enjoyed a lot of traction.
- Even the most successful NoSQL vendors are tiny when compared to the relational behemoths.
More conceptually, I’d say that the advantages of a relational DBMS start:
- In theory and practice alike, the advantages of normalization and joins.
- In theory and practice alike, the advantages of loose coupling between your database design and your application. (I think that’s a cleaner way of saying it than to focus on “reusing” the database, but it amounts to the same thing.)
- In practice, performance and functionality in anything using indexes, even if joins aren’t involved.
- In practice, maturity and functionality in general.
Those aren’t chopped liver.
Comments
8 Responses to “Attack of the Frankenschemas”
Leave a Reply
What about the hybrid solution of having XML or JSON or BLOB (like MMORPG as you mentioned) or a key-value store inside a relational database for those corner cases? I saw that many times and while probably not the most efficient it did the job.
Sometimes that works. Oracle is probably world’s biggest vendor of enterprise search, or else was for many years. But RDBMS vendors have been surprisingly ineffective at narrowing the gap vs. more focused systems.
Indeed, I expect the NoSQL/NewSQL winners to wind up being systems that can exhibit multiple “personalities” in terms of logical data model.
There seems to be a lot of dancing around the point here.
– In general, structured data with a variety of use needs is amenable to relational model RDBMS.
– There are edge cases relating to scale, model, etc. that NewSQLs are going after. ACID replication so far doesn’t seem one of these.
– There are edge cases around complex read only that some BI centric stores are attacking.
– There are database-trivial applications where KVPs allow fast, scalable, and redundant caching.
– There are structured apps with “soft edges”, where parts of the app have evolving schemas. These are addressed by document stores and RDBMS with XML and/or blobs.
– There are apps that are just going to evolve, such as ones where business goals are in flux. These are served by a variety of noSQLs which allow deferred and evolving schema definition which allow changing goals of the store.
– There are uses where the critical factor is archival and support of scalable storage and scalable processing where noSQLs (and some doc stores) shine.
Speaking as a practioner, preserving a single view of a schema is really hard. Having multiple views may work in a few areas, but is truly complex and is not for general use.
It would be instructive to look at your customer examples over the past few years. Likely, the ones that succeeded slid into more traditional RDBMS (and some noSQL, especially Hadoop), with only edge cases in newSQL and other specialized niche products.
The one thing I will say in favor of RDBMS is that they actually give you a means to do some analysis on your data. The NoSQL’s give you storage and retrieval at scale but totally throw up their hands when you want to do some analysis. The ability to analyze the data in place is what makes the difference between a database and a datastore.
Brian,
Let’s not be quite so fast in saying that. First, a lot of NoSQL systems have at least some primitive analytic capabilities — aggregates, range queries, perhaps a form of map/reduce. Second, for hard core analytics, why do it in a short-request data store of any kind anyway? Third, there’s HBase/Hadoop integration. Fourth, there’s Cassandra/Hadoop integration. Etc.
Aaron,
NewSQL clearly hasn’t had a lot of traction to date. And the one client I pointed toward a Schooner/dbShards combo probably is looking for alternatives on the Schooner side right now. (Meanwhile, the individual who spearheaded that selection jumped to DataStax.)
But there are plenty of robust NoSQL installations, and I don’t include generic Hadoop as NoSQL. (HBase is NoSQL, however, of course.)
I think Brian is pointing out the obvious why (and again your example is instructive). To a make a decision to go to a niche product implies a lot of trust of the vendor AND a willingness to look at a db API as a construction tool – rather than current practice of integrating vendor products. I can’t think of *any* examples I’ve seen of successful serious and unfrustrating BI against a newSQL, even ones with standard SQL JDBC or ODBC interfaces.
It would be really instructive to review *how and why people choose the new products*. The product vendors are of course dealing with the chicken and egg of competitors against mature stacks – you can do better tech for purpose, but you compete against an entire industry.
(And on Hadoop being a DBMS – Pig and Hive are pretty crude, but they seem to be among the largest use/largest growth BI tools right now, so it can be considered as much a db as SAS or Access.)
[…] for me. They do overlap a lot — and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I’ve written about dynamic schemas for the past […]