DBMS2 revisited
The name of this blog comes from an August 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:
- Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
- Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
  - NoSQL and dynamic schemas.
  - Schema-on-read, and its smarter younger brother schema-on-need. (A minimal sketch of the idea follows this list.)
  - Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
  - Funky database design from major Software as a Service (SaaS) vendors such as Workday and Salesforce.com.
  - A whole lot of logs.
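To illustrate what schema-on-read means in practice, here is a toy sketch (plain Python over JSON strings, not any particular product): the data is stored raw, and a schema is imposed only at query time, when specific fields are projected out. Schema-on-need, roughly, goes a step further by materializing such columns only as queries turn out to need them.

```python
import json

# Raw events are stored as-is; nothing enforces a schema at write time.
raw_events = [
    '{"user": "alice", "action": "login", "device": "mobile"}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',
    '{"user": "carol", "action": "login"}',  # no "device" field at all
]

def project(events, fields):
    """Apply a 'schema' only at read time: parse each raw record and
    pull out just the requested fields, tolerating missing ones."""
    for line in events:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# The schema lives in the query, not in the store.
for row in project(raw_events, ["user", "action", "device"]):
    print(row)
```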
I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.
For starters, let me say:
- I’ve mocked the concept of “logical data warehouse” in the past, for its implausible grandiosity, but Gartner’s thoughts on the subject are worth reviewing even so.
- I generally hear that internet businesses have SOAs (Service-Oriented Architectures) loosely coupling various aspects of their systems, and this is going well. Indeed, it seems to be going so well that it’s not worth talking about, and so I’m unclear on the details; evidently it just works. However …
- … evidently these SOAs are not set up for human real-time levels of data freshness.
- ETL (Extract/Transform/Load) is criticized for two reasons:
  - People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
  - Both analytic RDBMS and now Hadoop offer the alternative of ELT, in which the loading comes before the transformation. (See the sketch following this list.)
- There are some welcome attempts to automate aspects of ETL/ELT schema design. I’ve written about this at greatest length in the context of ClearStory’s “Data Intelligence” pitch.
- Schema-on-need defangs other parts of the ETL/ELT schema beast.
- If your high-volume data transformation is costly or complex but not speed-sensitive, there’s a good chance that Hadoop offers a solution. Much of Hadoop’s adoption is tied to data transformation.
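To spell out the ELT point above with a concrete, if toy, example: the raw data is loaded into the target system first, and the transformation runs afterwards as SQL inside that system. The sketch below uses Python’s built-in SQLite purely as a stand-in for an analytic RDBMS or SQL-on-Hadoop engine; the table names and figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the analytic data store

# LOAD first: land the source records as-is, with no up-front remodeling.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, "alice", 1250, "OPEN"), (2, "bob", 4999, "closed"), (3, "alice", 700, "Closed")],
)

# TRANSFORM afterwards: cleansing and aggregation happen inside the
# system that already holds the data, rather than in a separate ETL stage.
conn.execute("""
    CREATE TABLE customer_spend AS
    SELECT customer, SUM(amount_cents) / 100.0 AS total_dollars
    FROM raw_orders
    WHERE UPPER(status) = 'CLOSED'
    GROUP BY customer
""")
print(conn.execute("SELECT * FROM customer_spend ORDER BY customer").fetchall())
```

The point of the ordering is that the transformation logic can be revised and re-run later, inside the target system, without another extract.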
Next, I’d like to call out what is generally a non-problem — when a query can go to two or more systems for the same information, which one should it tap? In theory, that’s a much harder problem than ordinary DBMS optimization. But in practice, only the simplest forms of the challenge tend to arise, because when data is stored in more than one system, the systems tend to have wildly different use cases, performance profiles and/or permissions.
So what I’m saying is that most traditional kinds of data integration problems are well understood and on their way to being solved in practice. We have our silos; data is replicated as needed between silos; and everything is more or less cool. But of course, as traditional problems get solved, new ones arise, and those turn out to be concentrated among real-time requirements.
“Real-time” of course means different things in different contexts, but for now I think we can safely partition it into two buckets:
- Human real-time — fast enough so that it doesn’t make a human wait.
- Machine real-time — as fast as ever possible, because machines are racing other machines.
The latter category arises in the case of automated bidding, famously in high-frequency securities trading, but now in real-time advertising auctions as well. But those vertical markets aside, human real-time integration generally is fast enough.
Narrowing the scope further, I’d say that real-time transactional integration has worked for a while. I date it back to the initially clunky EAI (Enterprise Application Integration) vendors of the latter 1990s. The market didn’t turn out to be that big, but neither did the ETL market, so it’s all good. SOAs, as previously noted, are doing pretty well.
Where things still seem to be dicier is in the area of real-time analytic integration. How can analytic processing be tougher in this regard than transactional? Two ways. One, of course, is data volume. The second is that it’s more likely to involve machine-generated data streams. That said, while I hear a lot about a BI need-for-speed, I often suspect it of being a want-for-speed instead. So while I’m interested in writing a more focused future post on real-time data integration, there may be a bit of latency before it comes out.
Comments
> People associate it with the kind of schema-heavy relational database design that’s now widely hated, and the long project cycles it is believed to bring with it.
That might be a bit strong. I am seeing more examples, in some scenarios, where a fixed schema is the right solution, with some projects migrating away from schemaless designs.
As always, one approach does not solve all problems.
Re: multi-systems of record non-problem
I respectfully disagree that this is a non-problem. A few common real-world examples:
– Same entity, horizontally partitioned between different systems with different data models and reference values. E.g. a customer dimension. How do you combine them into one dimension? A: Nasty rollup logic.
– Same entity and record, but contradictory information across two systems (e.g. one says an order is open, the other says it is closed). Which is it? Business rules come into play.
– Two giant tables that have to be joined at the middle tier. Only the most advanced BI products can break down a query and make each system aggregate its respective tables, and join the result sets in a second step. And it’s not trivial to design.
I’ve never been able to assume that only trivial cross-system queries will be required in the real world.
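To make the third example concrete, here is a rough, hypothetical sketch of the pattern described above: each system aggregates its own large table locally, and only the small partial result sets are joined in a middle-tier second step. The "systems" here are just Python lists; the names and numbers are made up.

```python
from collections import defaultdict

# Hypothetical data; in reality these would live in two separate DBMSs.
system_a_orders = [("alice", 100), ("bob", 250), ("alice", 50)]   # (customer, order_total)
system_b_tickets = [("alice", 2), ("carol", 1), ("alice", 3)]     # (customer, support_tickets)

def local_aggregate(rows):
    """Stand-in for the aggregation each system would run on its own data."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

# Step 1: push the aggregation down to each system separately.
spend = local_aggregate(system_a_orders)
tickets = local_aggregate(system_b_tickets)

# Step 2: join the much smaller result sets in the middle tier.
joined = {
    customer: {"spend": spend.get(customer, 0), "tickets": tickets.get(customer, 0)}
    for customer in set(spend) | set(tickets)
}
print(joined)
```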