Data models
Seven to ten years ago, I repeatedly argued these viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
To make the subject somewhat manageable, I’ll focus on fielded data — i.e. data that represents values of something — rather than, for example, video or images. Fielded data always arrives as a string of bits, whose meaning boils down to a set of <name, value> pairs. Here by “string of bits” I mean mainly a single record or document (for example), although most of what I say can apply to a whole stream of data instead.
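As a purely illustrative Python sketch of that reduction to <name, value> pairs, here is the same record carried with explicit field names and with implicit, metadata-governed ones (all names invented):

```python
import json

# Explicit field names: the record carries its own <name, value> pairs.
explicit_record = json.loads('{"user_id": 42, "event": "login", "ms": 1700}')

# Implicit field names: the record is just values; a separate metadata/schema
# object (as in an RDBMS, or a terse log format) supplies the names.
schema = ["user_id", "event", "ms"]   # hypothetical table metadata
implicit_record = [42, "login", 1700]

# Either way, the meaning boils down to the same set of <name, value> pairs.
pairs = dict(zip(schema, implicit_record))
assert pairs == explicit_record
print(pairs)  # {'user_id': 42, 'event': 'login', 'ms': 1700}
```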
Important distinctions include:
- Are the field names implicit or explicit? In relational use cases field names tend to be implicit, governed by the metadata. In some log files they may be implicit as well, to save space. In other logs, XML streams, JSON streams and so on they are explicit.
- If the field names are implicit, is any processing needed to recover them? Think Hadoop or Splunk acting on “dumb-looking” log data. (See the sketch after this list.)
- In any one record/document/whatever, are the field names unique? If not, then the current data model is not relational.
- Are the field names the same from one record/document/whatever to the next? I.e., does the data fit into a consistent schema?
- Is there a structure connecting the field names (and if so what kind)? E.g., hierarchical documents, or relational foreign keys.
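Here is a minimal sketch of that recovery work on a “dumb-looking” web-server log line, in Python; the line format and field names are assumptions for illustration, not any particular tool’s behavior:

```python
import re

# Field names are implicit in the raw line, so work (here a regex; in
# practice Hadoop jobs or Splunk extractions) is needed to recover the
# <name, value> pairs before the data is conveniently queryable.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

line = '10.0.0.7 - - [22/Feb/2015:10:01:02 +0000] "GET /index.html HTTP/1.1" 200 5120'
print(LOG_PATTERN.match(line).groupdict())
# {'ip': '10.0.0.7', 'ts': '22/Feb/2015:10:01:02 +0000', 'method': 'GET',
#  'path': '/index.html', 'status': '200', 'bytes': '5120'}
```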
Some major data models can be put into a fairly strict ordering of query desirability by noting:
- The best thing to query is a relational DBMS. Everything has a known field name, so SELECTs are straightforward. You also have JOINs, which are commonly very valuable. And RDBMS are a mature technology, in many cases with great query performance.
- The next-best thing to query is another kind of data store with known field names. In such data stores:
  - SQL or SQL-like SELECTs will still work, or can easily be made to.
  - Useful indexing systems can be grafted onto them (although they are typically less mature than in RDBMS).
  - In the (mainly) future, perhaps JOINs can be grafted on as well.
- The worst thing to query is a data store in which you only have schema-on-read. You have to do work to make the thing queryable in the first place. (The sketch below contrasts the two extremes.)
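To make the contrast concrete, here is a small sketch using Python’s built-in sqlite3; the tables and fields are invented for illustration:

```python
import sqlite3

# Best case: a relational store. Field names are known, so SELECTs
# (and JOINs) work immediately.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
db.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
db.execute("INSERT INTO events VALUES (42, 'login')")
db.execute("INSERT INTO users VALUES (42, 'alice')")
rows = db.execute(
    "SELECT u.name, e.event FROM events e JOIN users u ON e.user_id = u.user_id"
).fetchall()
print(rows)  # [('alice', 'login')]

# Worst case: schema-on-read. Before any of the above is possible, each
# raw record must first be parsed into <name, value> pairs, e.g. with
# something like the log-parsing regex sketched earlier.
```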
Unsurprisingly, that ordering is reversed when it comes to writing data.
- The easiest thing to write to is a data store with no structure.
- Next-easiest is to write to a data store that lets you make up the structure as you go along.
- The hardest thing to write to is a relational DBMS, because of the requirements that must be obeyed (illustrated in the sketch after this list), notably:
  - Implicit field names, governed by metadata.
  - Unique field names within any one record.
  - The same (ordered) set of field names for each record — more precisely, a limited collection of such ordered sets, one per table.
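A toy validator, purely illustrative, for those three write-time rules:

```python
TABLE_SCHEMA = ("user_id", "event", "ms")  # hypothetical table metadata

def validate_for_insert(pairs):
    """Check the relational write rules, then emit values in schema order."""
    names = [name for name, _ in pairs]
    if len(set(names)) != len(names):
        raise ValueError("duplicate field names within a record: not relational")
    if sorted(names) != sorted(TABLE_SCHEMA):
        raise ValueError("fields do not match the table's one ordered set")
    values = dict(pairs)
    # Field names stay implicit: only values, in metadata order, get stored.
    return [values[name] for name in TABLE_SCHEMA]

print(validate_for_insert([("event", "login"), ("user_id", 42), ("ms", 1700)]))
# [42, 'login', 1700]
```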
And so, for starters, most large enterprises will have important use cases for data stores in all of the obvious categories. In particular:
- Usually it is best to have separate brands of general-purpose/OLTP (OnLine Transaction Processing) and analytic RDBMS. Further:
  - I have in the past also advocated for a mid-range — i.e. lighter-weight — general-purpose RDBMS.
  - SAP really, really wants you to use HANA to run SAP’s apps.
  - You might want an in-memory RDBMS (e.g., MemSQL) or a particularly cloudy one or whatever.
- Your website alone is reason enough to use a NoSQL DBMS, most likely MongoDB or Cassandra. And it often makes sense to have multiple NoSQL systems used for different purposes, because:
  - They’re all immature right now, each with advantages over the others.
  - The apps you’re using them for are likely to be thrown out in a few years, so you won’t have great pain switching if you ever do decide to standardize.
- Whatever else Hadoop is — and it’s a lot of things — it’s also a happy home for log files. And enterprises have lots of log files.
Beyond that:
- You may want something to manage organizational hierarchies and so on, if you build enough custom systems in areas such as security, knowledge management, or MDM (Master Data Management). I’m increasingly persuaded by the argument that this should be a graph DBMS rather than an LDAP (Lightweight Directory Access Protocol) system. (See the traversal sketch after this list.)
- Splunk is cool.
- Use cases for various other kinds of data stores can often be found.
- Of course you’ll be implicitly using whatever is bundled into your SaaS (Software as a Service) systems, your app-specific appliances and so on.
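On the graph point above, here is a deliberately tiny sketch of why graphs suit hierarchy questions: edges are first-class, so multi-hop questions are simple traversals. A real deployment would of course use an actual graph DBMS; the names here are made up.

```python
# Management edges: person -> manager.
reports_to = {
    "alice": "carol",
    "bob": "carol",
    "carol": "dana",
}

def chain_of_command(person):
    """Walk the management edges upward until we reach the top."""
    chain = []
    while person in reports_to:
        person = reports_to[person]
        chain.append(person)
    return chain

print(chain_of_command("alice"))  # ['carol', 'dana']
```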
And finally, I think in-memory data grids:
- Will be widely used and important.
- Will be used to instantiate multiple data models at once, as sketched below.
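A hand-wavy sketch of what “multiple data models at once” could look like over a single in-memory store; this illustrates the idea only, and is not any particular product’s API:

```python
class TinyGrid:
    """One in-memory store, reachable through two access models."""

    def __init__(self):
        self._rows = {}  # primary key -> record dict

    # Key/value access model.
    def put(self, key, record):
        self._rows[key] = record

    def get(self, key):
        return self._rows.get(key)

    # Relational-style filtering over the very same data.
    def select(self, **where):
        return [r for r in self._rows.values()
                if all(r.get(k) == v for k, v in where.items())]

grid = TinyGrid()
grid.put(42, {"user_id": 42, "name": "alice", "region": "emea"})
grid.put(43, {"user_id": 43, "name": "bob", "region": "amer"})
print(grid.get(42)["name"])        # alice
print(grid.select(region="amer"))  # [{'user_id': 43, 'name': 'bob', 'region': 'amer'}]
```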
Related links
- One reason for writing this post was for some deck-clearing before I revisit the white-hot topic of data streaming. (October, 2014)
- I’ve long mused about the challenges of getting by without joins. (November, 2010)
- In 2013 I observed that data models will be in perpetual, rapid flux.
- In 2013 I also discussed attempts to combine multiple data models (or access methods) in a single DBMS.
- I surveyed data models and access methods back in 2008.
Comments
Thanks for the write-up, Curt; excellent brain food! Funny you should mention it, as I had somewhat related thoughts over the weekend: http://datadventures.ghost.io/2015/02/22/principia-data/
Cheers,
Michael
Hi Curt,
“In relational use cases field names tend to be implicit”
Is that correct, or did you mean explicit?
Thanks,
Chris
For RDF (not necessarily representative of graphs), the same advantages of a declarative query language hold, with a few improvements: no explicit joins (there are no tables/relations), graph-path patterns via path regexes, and schema being expressible as data (i.e., in RDF). The last part means that schema is a perspective, not a fixed thing, and is substitutable over a given ‘base dataset’. This is pretty powerful, as schema information can be added or updated completely after the fact. It’s powerful in a head-exploding kind of way, which probably explains why many people don’t get it or use it.
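To make that last point concrete, here is a minimal sketch using Python’s rdflib, with invented URIs: schema triples live in the same graph as the data and can be added after the fact, and a SPARQL property path replaces explicit joins.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# Instance data and schema are both just triples in the same graph,
# so schema information can be added or updated after the fact.
g.add((EX.alice, EX.reportsTo, EX.carol))
g.add((EX.carol, EX.reportsTo, EX.dana))
g.add((EX.reportsTo, RDF.type, RDF.Property))
g.add((EX.reportsTo, RDFS.label, Literal("reports to")))

# A property path (one or more reportsTo hops) with no explicit joins.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?boss WHERE { ex:alice ex:reportsTo+ ?boss }
""")
for row in results:
    print(row.boss)  # http://example.org/carol, then http://example.org/dana
```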
Chris,
I meant what I wrote. The field names aren’t directly carried with the data.
To be fair, however — I wasn’t thinking of what the SQL would look like.
I also see a schema-less, document-oriented data model, where we store arbitrary JSON documents and can search for them by values in arbitrary fields.
To clarify a bit: each document does have a very clear structure, ready to be queried, but different documents do not have to share the same structure.
This model is used by two very popular systems, MongoDB and Elasticsearch.
Fair enough. Queries can be cross-column, especially if they are resolved by some kind of text search. Attivio’s founders tried to popularize that at FAST and then at Attivio, without a lot of success in either place that I could discern. RDBMS have in some cases added it, I think. I think MarkLogic is pretty good about it too.
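A minimal pymongo sketch of that model, assuming a local MongoDB and invented field names: documents of different shapes sit in one collection, and any field, including a nested one, can be queried by value.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # illustrative connection
events = client.demo.events

# Two documents, two different structures, one collection.
events.insert_many([
    {"user": "alice", "event": "login", "ms": 1700},
    {"user": "bob", "event": "purchase", "cart": {"items": 3, "total": 59.90}},
])

# Query by a field that only some documents have, nested included.
for doc in events.find({"cart.items": {"$gte": 1}}):
    print(doc["user"], doc["cart"]["total"])  # bob 59.9
```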
From my perspective the biggest shift I am seeing is in the analytical/BI space.
I believe the future of data warehousing is not in databases but rather in file systems. I don’t think you even need HDFS. I would argue that NTFS is actually more appropriate for the majority of organizations: frictionless access from pretty much any tool you can think of; most tables are not that big, or can be easily broken down into small partitions; and NTFS has a more robust security model. But there’s not a big push here, due in part to the confluence of incentives and disincentives between vendors and IT staff, and the fact that Windows is generally seen as uncool.
On the transaction processing side of things, I think key/value stores will continue to flourish, but their scope of business consistency is really only a single key/value pair. Relational sharding might not be as cheap to scale as key/value, but it does provide a much larger scope of controlled business consistency. For the time being, though, this fundamental trade-off between availability and consistency gets short shrift in IT and IT-journal circles, or it’s argued in an obtuse, round-about way that nobody understands. So I expect key/value will continue to flourish; there may be a shift back to relational sharding once a broader awareness emerges of what the real trade-offs are, as opposed to the “straw-man” trade-offs many vendors put out there.
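For concreteness, a toy sketch of the core sharding mechanic (the hash function and key names are arbitrary choices): rows that hash to the same shard can share a local transaction, which is what gives relational sharding a larger consistency scope than a single key/value pair.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(customer_id):
    """Route a row to a shard by hashing its key, so one customer's rows
    (orders, payments, ...) land together and can be updated atomically."""
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for cid in ("cust-17", "cust-18", "cust-19"):
    print(cid, "-> shard", shard_for(cid))
```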