DBMS2 revisited
The name of this blog comes from an August 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:
- Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
- Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
  - NoSQL and dynamic schemas.
  - Schema-on-read, and its smarter younger brother schema-on-need. (A minimal sketch of the idea follows this list.)
  - Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
  - Funky database design from major Software as a Service (SaaS) vendors such as Workday and Salesforce.com.
  - A whole lot of logs.
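To illustrate what schema-on-read means in practice, here is a toy sketch (plain Python over JSON strings, not any particular product): the data is stored raw, and a schema is imposed only at query time, when specific fields are projected out. Schema-on-need, roughly, goes a step further by materializing such columns only as queries turn out to need them.

```python
import json

# Raw events are stored as-is; nothing enforces a schema at write time.
raw_events = [
    '{"user": "alice", "action": "login", "device": "mobile"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    '{"user": "carol", "action": "login"}',  # no "device" field at all
]

def project(events, fields):
    """Apply a 'schema' only at read time: parse each raw record and
    pull out just the requested fields, tolerating missing ones."""
    for line in events:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# The schema lives in the query, not in the store.
for row in project(raw_events, ["user", "action", "device"]):
    print(row)
```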
I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.
For starters, let me say:
- I’ve mocked the concept of “logical data warehouse” in the past, for its implausible grandiosity, but Gartner’s thoughts on the subject are worth reviewing even so.
- I generally hear that internet businesses have SOAs (Service-Oriented Architectures) loosely coupling various aspects of their systems, and this is going well. Indeed, it seems to be going so well that it’s not worth talking about, and so I’m unclear on the details; evidently it just works. However …
- … evidently these SOAs are not set up for human real-time levels of data freshness.
- ETL (Extract/Transform/Load) is criticized for two reasons:
  - People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
  - Both analytic RDBMS and now Hadoop offer the alternative of ELT, in which the loading comes before the transformation. (See the sketch following this list.)
- There are some welcome attempts to automate aspects of ETL/ELT schema design. I’ve written about this at greatest length in the context of ClearStory’s “Data Intelligence” pitch.
- Schema-on-need defangs other parts of the ETL/ELT schema beast.
- If your high-volume data transformation is costly or complex but not speed-sensitive, there’s a good chance that Hadoop offers a solution. Much of Hadoop’s adoption is tied to data transformation.
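To spell out the ELT point above with a concrete, if toy, example: the raw data is loaded into the target system first, and the transformation runs afterwards as SQL inside that system. The sketch below uses Python’s built-in SQLite purely as a stand-in for an analytic RDBMS or SQL-on-Hadoop engine; the table names and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the analytic data store

# LOAD first: land the source records as-is, with no up-front remodeling.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, "alice", 1250, "OPEN"), (2, "bob", 4999, "closed"), (3, "alice", 700, "Closed")],
)

# TRANSFORM afterwards: cleansing and aggregation happen inside the
# system that already holds the data, rather than in a separate ETL stage.
conn.execute("""
    CREATE TABLE customer_spend AS
    SELECT customer, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_orders
    WHERE UPPER(status) = 'CLOSED'
    GROUP BY customer
""")
print(conn.execute("SELECT * FROM customer_spend ORDER BY customer").fetchall())
```

The point of the ordering is that the transformation logic can be revised and re-run later, inside the target system, without another extract.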
Next, I’d like to call out what is generally a non-problem — when a query can go to two or more systems for the same information, which one should it tap? In theory, that’s a much harder problem than ordinary DBMS optimization. But in practice, only the simplest forms of the challenge tend to arise, because when data is stored in more than one system, the systems tend to have wildly different use cases, performance profiles and/or permissions.
So what I’m saying is that most traditional kinds of data integration problems are well understood and on their way to being solved in practice. We have our silos; data is replicated as needed between silos; and everything is more or less cool. But of course, as traditional problems get solved, new ones arise, and those turn out to be concentrated among real-time requirements.
“Real-time” of course means different things in different contexts, but for now I think we can safely partition it into two buckets:
- Human real-time — fast enough so that it doesn’t make a human wait.
- Machine real-time — as fast as ever possible, because machines are racing other machines.
The latter category arises in the case of automated bidding, famously in high-frequency securities trading, but now in real-time advertising auctions as well. But those vertical markets aside, human real-time integration generally is fast enough.
Narrowing the scope further, I’d say that real-time transactional integration has worked for a while. I date it back to the initially clunky EAI (Enterprise Application Integration) vendors of the latter 1990s. The market didn’t turn out to be that big, but neither did the ETL market, so it’s all good. SOAs, as previously noted, are doing pretty well.
Where things still seem to be dicier is in the area of real-time analytic integration. How can analytic processing be tougher in this regard than transactional? Two ways. One, of course, is data volume. The second is that it’s more likely to involve machine-generated data streams. That said, while I hear a lot about a BI need-for-speed, I often suspect it of being a want-for-speed instead. So while I’m interested in writing a more focused future post on real-time data integration, there may be a bit of latency before it comes out.
Comments
> People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
That might be a bit strong. I am seeing more examples, in some scenarios, where a fixed schema is the right solution, with some projects migrating away from schemaless designs.
As always, one approach does not solve all problems.
Re: multi-systems of record non-problem
I respectfully disagree that this is a non-problem. A few common real-world examples:
– Same entity, horizontally partitioned between different systems with different data models and reference values. E.g. a customer dimension. How do you combine them into one dimension? A: Nasty rollup logic.
– Same entity and record, but contradictory information across two systems (e.g. one says an order is open, the other says it is closed). Which is it? Business rules come into play.
– Two giant tables that have to be joined at the middle tier. Only the most advanced BI products can break down a query and make each system aggregate its respective tables, and join the result sets in a second step. And it’s not trivial to design.
I’ve never been able to assume that only trivial cross-system queries will be required in the real world.
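To make the third example concrete, here is a rough, hypothetical sketch of the pattern described above: each system aggregates its own large table locally, and only the small partial result sets are joined in a middle-tier second step. The "systems" here are just Python lists; the names and numbers are made up.

```python
from collections import defaultdict

# Hypothetical data; in reality these would live in two separate DBMSs.
system_a_orders = [("alice", 100), ("bob", 250), ("alice", 50)]   # (customer, order_total)
system_b_tickets = [("alice", 2), ("carol", 1), ("alice", 3)]     # (customer, support_tickets)

def local_aggregate(rows):
    """Stand-in for the aggregation each system would run on its own data."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Step 1: push the aggregation down to each system separately.
spend = local_aggregate(system_a_orders)
tickets = local_aggregate(system_b_tickets)

# Step 2: join the much smaller result sets in the middle tier.
joined = {
    customer: {"spend": spend.get(customer, 0), "tickets": tickets.get(customer, 0)}
    for customer in set(spend) | set(tickets)
}
print(joined)
```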