A new logical data layer?
I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, except perhaps when you’re content with the durability provided by an in-memory data grid.
- Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
Trivial query routing or federation is … trivial.
- Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
- Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.
In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.
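The "trivial" case above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Store`, `LogicalLayer`, and the sample tables are hypothetical names invented here, not any vendor's API, and each table is assumed to live in exactly one store.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Store:
    """Hypothetical stand-in for one underlying data store."""

    def __init__(self, name: str, tables: Dict[str, List[Row]]):
        self.name = name
        self._tables = tables

    def catalog(self) -> Dict[str, List[str]]:
        # Table name -> sorted column names, inferred from the first row.
        return {
            table: sorted(rows[0].keys()) if rows else []
            for table, rows in self._tables.items()
        }

    def scan(self, table: str, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self._tables[table] if predicate(row)]


class LogicalLayer:
    """Combines per-store catalogs and routes each query to the owning store."""

    def __init__(self, stores: List[Store]):
        self._routes: Dict[str, Store] = {}
        for store in stores:
            for table in store.catalog():
                if table in self._routes:
                    # Assumption: name clashes are rejected rather than resolved.
                    raise ValueError(f"table {table!r} exists in more than one store")
                self._routes[table] = store

    def combined_catalog(self) -> Dict[str, List[str]]:
        return {t: s.catalog()[t] for t, s in self._routes.items()}

    def query(self, table: str, predicate: Callable[[Row], bool] = lambda r: True) -> List[Row]:
        return self._routes[table].scan(table, predicate)


# Illustrative data only.
crm = Store("crm", {"customers": [{"id": 1, "name": "Ann"}]})
logs = Store("logs", {"clicks": [{"customer_id": 1, "url": "/a"},
                                 {"customer_id": 2, "url": "/b"}]})
layer = LogicalLayer([crm, logs])
clicks_for_ann = layer.query("clicks", lambda r: r["customer_id"] == 1)
```

Everything hard (cross-store joins, schema conflicts, optimization) is deliberately left out, which is the point: this much really is trivial.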
Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.
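To make the contrast concrete, here is a toy comparison of the careless approach (ship everything, then filter) against pushing the filter down to the store. It is a sketch, not a benchmark: `RemoteStore` and its `rows_shipped` counter are hypothetical, and the counter merely stands in for network and intermediate-server cost.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class RemoteStore:
    """Hypothetical remote store that counts how many rows leave it."""

    def __init__(self, rows: List[Row]):
        self._rows = rows
        self.rows_shipped = 0

    def fetch_all(self) -> List[Row]:
        # Full scan: every row is moved to the caller.
        self.rows_shipped += len(self._rows)
        return list(self._rows)

    def fetch_where(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # Filter runs at the store; only matching rows are moved.
        matches = [row for row in self._rows if predicate(row)]
        self.rows_shipped += len(matches)
        return matches


def careless_query(store: RemoteStore, predicate: Callable[[Row], bool]) -> List[Row]:
    # Pull everything to the intermediate layer, then filter there.
    return [row for row in store.fetch_all() if predicate(row)]


def pushdown_query(store: RemoteStore, predicate: Callable[[Row], bool]) -> List[Row]:
    # Let the underlying store do the filtering.
    return store.fetch_where(predicate)


# Illustrative data: 1 row in 100 matches the predicate.
orders = [{"id": i, "region": "EU" if i % 100 == 0 else "US"} for i in range(10_000)]


def is_eu(row: Row) -> bool:
    return row["region"] == "EU"


careless_store = RemoteStore(orders)
pushdown_store = RemoteStore(orders)
careless_result = careless_query(careless_store, is_eu)
pushdown_result = pushdown_query(pushdown_store, is_eu)
```

Both calls return the same rows, but the careless version moves all 10,000 rows while the pushdown version moves only the 100 that match. A real federation layer also has to decide which predicates each store can actually evaluate, which this sketch assumes away.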
Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.
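A minimal sketch of that idea, assuming a hypothetical `enrich_stream` helper and made-up field names: each newly arrived event is joined against already-persisted customer data before anything is written to disk.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, object]


def enrich_stream(events: Iterable[Row], customers: Dict[int, Row]) -> Iterator[Row]:
    """Join in-flight events with older, persisted data, one event at a time."""
    for event in events:
        # Lookup against the persisted side; unknown customers get a default.
        customer = customers.get(event["customer_id"], {})
        yield {**event, "segment": customer.get("segment", "unknown")}


# Illustrative data only: the dict stands in for a persisted store,
# the list for events that have just arrived.
persisted_customers = {1: {"segment": "enterprise"}, 2: {"segment": "smb"}}
arriving_events = [
    {"customer_id": 1, "action": "login"},
    {"customer_id": 3, "action": "purchase"},
]
enriched = list(enrich_stream(arriving_events, persisted_customers))
```

In practice the persisted side would be a query against a real store rather than an in-memory dict, and that lookup is exactly where the performance concerns above come back in.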
There are ways to navigate schema messes. Sometimes they work.
- Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after the fact may not be terribly painful.
- Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.
Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.
I’m leaving out one part of the story on purpose: how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentiality would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.
One way or another, I don’t think the subject of logical data layers is going away any time soon.
Related link
- Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)
Comments
3 Responses to “A new logical data layer?”
It’s funny to see what’s happening. There used to be good old RDBMSs. They can do lots of things: transactions, aggregations, lookups and joins. They’re just hard to scale. So the gang of NoSQLs showed up, and each one took a piece of RDBMS functionality. Some built indexes but no scans, some had full text search, some had joins and group by (map reduce) but no lookups, etc. And now we see attempts to build a layer on top of all this just to gain back what we had with our trusty Swiss Army knife, the RDBMS.
Did nobody ever consider whether it would be easier to scale an RDBMS by building a sharding layer for it, instead of creating a sharded KV store and adding features on top?
I don’t think federation is only about scaling; rather, it is about preserving and reusing the data, logic and capabilities (e.g. the ability to scale) already built into the underlying / source systems.
My preferred solution, instead of building a superior logical data layer on top of all systems, is to embed the data federation / virtualization functionality into all the databases, so that each can mix data and resources with other systems. This way no trade-offs are required to use one or the other (SQL vs. NoSQL or anything else) all the time. Use the one you actually prefer in front, and utilize the strengths of the rest in the background.
disclaimer:
Not surprisingly, I am a founder of VirtDB, an early-phase data virtualization product that does this.