Using multiple data stores
I’m commonly asked to assess vendor claims of the kind:
- “Our system lets you do multiple kinds of processing against one database.”
- “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”
So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:
- The most extreme vendor marketing claims are false.
- There are many different choices that make sense in at least some use cases each.
Horses for courses
It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:
- Short-request vs. analytic.
- SQL vs. non-SQL (NoSQL or otherwise).
- Expensive/heavy-duty vs. cheap/easy-to-support.
Vendors are part of this consensus; already in 2005 I observed
For all practical purposes, there are no DBMS vendors left advocating single-server strategies.
Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.
Multiple data stores for a single application
We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree:
- It’s common and sensible to manage authentication and authorization data in its own data store. Commonly, the data format is LDAP (Lightweight Directory Access Protocol).
- It’s common and sensible to manage the “content” and “e-commerce transaction records” aspects of websites separately.
- Even beyond that case, there are often performance reasons to manage BLOBs (Binary Large OBjects) outside your relational database.
- Internet “interaction” data is also often best managed outside an RDBMS, in part because of its very non-tabular data structures.
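The points above can be sketched in a few lines. This is a toy illustration, not anyone's actual architecture: a single application writes its transactional rows to a relational store (SQLite standing in for an RDBMS) while keeping bulky BLOBs in a separate key-value store (a dict standing in for an object store or document database). All names here are hypothetical.

```python
import sqlite3

# Relational store for e-commerce transaction records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, amount REAL)")

# Separate key-value store standing in for a content/BLOB store.
content_store = {}

def place_order(order_id, sku, amount, receipt_pdf):
    """One application, two data stores: the transactional row goes to
    the RDBMS, while the bulky BLOB is kept out of it."""
    db.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, sku, amount))
    content_store[f"receipt/{order_id}"] = receipt_pdf

place_order(1, "WIDGET-9", 19.99, b"%PDF-1.4 ...")

row = db.execute("SELECT sku, amount FROM orders WHERE id = 1").fetchone()
print(row)  # ('WIDGET-9', 19.99)
```

The application-level split is the point: the relational query path never sees the BLOBs, and the BLOB store never has to pretend to be transactional.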
The spectacular 2010 JP Morgan Chase outage was largely caused, I believe, by disregard of these precepts.
There also are cases in which applications dutifully get all their data via SQL queries, but send those queries to two or more DBMS. Teradata is proud that its systems can support fairly transactional queries (for example in call-center use cases), but the same application may read from and write to a true OLTP database as well.
Further, many OLTP (OnLine Transaction Processing) applications do some fraction of their work via inbound or outbound messaging. Many buzzwords can come into play here, including but not limited to:
- SOA (Service-Oriented Architecture). This is the most current and flexible one.
- EAI (Enterprise Application Integration). This was a hot concept in the late 1990s, but was generally implemented with difficulties that SOA was later designed to alleviate.
- Message-oriented middleware (MOM) and Publish/Subscribe. These are even older, and overlap greatly.
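Whatever the buzzword, the underlying pattern is similar. Here is a toy publish/subscribe broker, a minimal sketch of how an OLTP application might hand work to downstream systems via messaging rather than by writing to their databases directly (all names are made up for illustration):

```python
from collections import defaultdict
from queue import Queue

class Broker:
    """Toy publish/subscribe broker: each topic fans out to subscriber queues."""
    def __init__(self):
        self.topics = defaultdict(list)

    def subscribe(self, topic):
        q = Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, message):
        for q in self.topics[topic]:
            q.put(message)

broker = Broker()

# The OLTP app commits an order locally, then emits a message instead of
# writing directly to billing's or shipping's own data stores.
billing_inbox = broker.subscribe("orders")
shipping_inbox = broker.subscribe("orders")

broker.publish("orders", {"order_id": 42, "sku": "WIDGET-9"})

print(billing_inbox.get())   # {'order_id': 42, 'sku': 'WIDGET-9'}
print(shipping_inbox.get())  # same message, delivered independently
```

Each subscriber then persists what it needs in its own data store, which is exactly how messaging multiplies the number of stores one "application" touches.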
Finally, every dashboard that combines information from different data stores could be assigned to this category as well.
Multiple storage approaches in a single DBMS
In theory, a single DBMS could operate like two or more different ones glued together. A few functions should or must be centralized, such as administration, and communication with the outside world (connection handling, parsing, etc.). But data storage, query execution and so on could for the most part be performed by rather loosely coupled subsystems. And so you might have the best of both worlds — something that’s multiple data stores in the ways you want that diversity, but a single system in how it fits into your environment.
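The shape of that idea can be sketched very roughly: a shared front end owns the centralized functions (one entry point, dispatch), while storage and scans are delegated to loosely coupled engines. This is an illustrative toy, not a description of any real product:

```python
class RowEngine:
    """Row store: each record kept together; suited to short requests."""
    def __init__(self):
        self.rows = []
    def insert(self, row):
        self.rows.append(row)
    def scan(self, column):
        return [r[column] for r in self.rows]

class ColumnEngine:
    """Column store: each column kept together; suited to analytic scans."""
    def __init__(self):
        self.columns = {}
    def insert(self, row):
        for k, v in row.items():
            self.columns.setdefault(k, []).append(v)
    def scan(self, column):
        return self.columns.get(column, [])

class FrontEnd:
    """The 'centralized' layer: one connection/dispatch point, with
    storage delegated per-table to loosely coupled engines."""
    def __init__(self):
        self.engines = {}
    def create_table(self, name, engine):
        self.engines[name] = engine
    def insert(self, table, row):
        self.engines[table].insert(row)
    def scan(self, table, column):
        return self.engines[table].scan(column)

db = FrontEnd()
db.create_table("orders", RowEngine())     # transactional table
db.create_table("clicks", ColumnEngine())  # analytic table
db.insert("orders", {"id": 1, "amount": 19.99})
db.insert("clicks", {"page": "/home", "ms": 120})
db.insert("clicks", {"page": "/cart", "ms": 340})
print(db.scan("clicks", "ms"))  # [120, 340]
```

From the outside it administers and connects like one system; inside, the engines barely know about each other.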
I discussed this idea last year with cautious optimism, writing:
So will these trends succeed? The foregoing caveats notwithstanding, my answers are more Yes than No.
- … multi-purpose DBMS will likely always have performance penalties, but over time the penalties should become small enough to be affordable in most cases.
- …
- Machine-generated data and “content” both call for multi-datatype DBMS. And taken together, those are a large fraction of the future of computing. Consequently …
- … strong support for multiple datatypes and DMLs is a must for “general-purpose” RDBMS. Oracle and IBM [have] been working on that for 20 years already, with mixed success. I doubt they’ll get much further without a thorough rewrite, but rewrites happen; one of these decades they’re apt to get it right.
In 2005 I had been more ambivalent, in part because my model was a full 1990s-dream “universal” DBMS:
IBM, Oracle, and Microsoft have all worked out ways to have integrated query parsing and query optimization, while letting storage be more or less separate. More precisely, Oracle actually still sticks everything into one data store (hence the lack of native XML support), but allows near-infinite flexibility in how it is accessed. Microsoft has already had separate servers for tabular data, text, and MOLAP, although like Sybase, it doesn’t have general datatype extensibility that it can expose to customers, or exploit itself to provide a great variety of datatypes. IBM has had Oracle-like extensibility all along, although it hasn’t been quite as aggressive at exploiting it; now it’s introduced a separate-server option for XML.
That covers most of the waterfront, but I’d like to more explicitly acknowledge three trends:
- Among other things, Hadoop is a collection of DBMS (HBase, Impala, et al.) that in some cases are very loosely coupled to each other. The question is less how well the various data stores work together, and more how mature any one of them is on its own.
- The multiple-data-models idea has been extended into schema-on-need, which is sometimes but not always housed in Hadoop.
- Even on the relational side, multiple storage capabilities exist in one product.
- Vertica was designed that way from the get-go. (Like the old joke about police duos: one can read, and one can write.)
- IBM, Microsoft and Oracle have all recently added some kind of in-memory columnar capability.
- Teradata, Aster (before Teradata bought them), Greenplum and Vertica all added some variant on row/column dual stores.
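The schema-on-need point above lends itself to a small sketch. Raw records land with no declared schema, and a field is promoted into a materialized column only the first time a query asks for it. This is a minimal illustration of the idea, with made-up data:

```python
import json

# Raw records land as JSON text, with no declared schema.
raw_store = [
    json.dumps({"user": "a", "event": "click", "ms": 120}),
    json.dumps({"user": "b", "event": "view"}),
]

# Columns materialized only once a query actually needs them.
promoted = {}

def column(name):
    """Schema-on-need: extract a field across all records the first time
    it is requested, then serve it from the materialized column."""
    if name not in promoted:
        promoted[name] = [json.loads(r).get(name) for r in raw_store]
    return promoted[name]

print(column("event"))  # ['click', 'view']
print(column("ms"))     # [120, None] -- missing fields surface as None
```

The trade-off mirrors the row/column discussion: you pay extraction cost at first query time instead of at load time, and fields nobody asks about never get a schema at all.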
Related links
- SQL vs. NoSQL, legacy vs. clean-up (March, 2014)
- The difficulty of DBMS development, including Hadoop-based ones (March, 2013)
Comments
I agree that, especially in the world of big data, there is no silver bullet. We have a number of “full table scan” engines like Impala, Hive, Presto, etc., as well as “index access” engines, also called NoSQL (too many to list). Each one in both families has its own strengths and weaknesses.
At the same time, duplicating (or triplicating) data among engines is expensive.
I think we will get to a multiple-engine, single-data paradigm, where different engines look at data stored in some open format. It is problematic for mutating data, but IMHO it is feasible for read-only data…
As a company man, I can’t help but point out an old work by one of the founders.
http://cs.brown.edu/~ugur/fits_all.pdf
We tried something along these lines at Sybase back in the 90s. We were playing around with microkernels and started a prototype to split the monolithic RDBMS into a set of services. Some services, as you pointed out, are common, but the idea was that multiple data stores could be accommodated. Performance did suffer, but not too badly.
To compete with Informix, ASE was being enhanced to accommodate text, geospatial and time series data stores, but with a Postgres-style foreign data wrapper strategy. The desire for a more coherent approach led to this experiment.
IBM worked with us to see if these services could be run as OS-level services on top of the OS/2 3.0 microkernel, so that they would sit as peers to the AIX and OS/2 virtual machines.
A little too far ahead of its time! Thinking of reviving the approach in open source; Kudu has some useful components for this approach. Drizzle tried the same approach, if I am not mistaken.
Let me see if I can get some fellow co-conspirators.
Hiya!
The impression I got from Jacob Stein was that performance wasn’t good at all, in large part due to the way different kinds of data(type) management wanted to use RAM. Was this perhaps a different Sybase project trying to meet the same market challenge?