Juggling analytic databases
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes.
The idea of an analytic data store separate from your transactional one has been around since before the relational era. Its approximate evolution was:
- The DBMS aspects of 4GLs such as Focus.
- IBM’s relational “Information Center”, back when IBM thought transactions should still be in IMS.
- Early days of Teradata.
- Early days of MOLAP, when Ted Codd argued the analytic store should not be relational.
- The growth of relational data warehousing.
- The rise of the big bit bucket.
In the past, large DBMS vendors liked to argue that enterprises should have a single analytic data store — commonly known as an enterprise data warehouse (EDW) — but that theory holds ever less water. A sample of my writing on that subject includes:
- Two July, 2011 posts on eight kinds of analytic database.
- An April, 2010 debunking of the myth of the enterprise data warehouse.
- Lots of Hadoop coverage.
Recently, the big vendors have capitulated. In particular:
- Teradata introduced “purpose-built” data warehousing appliance product lines with a variety of configurations and price points.
- Teradata bought Aster Data.
- Teradata deemphasized the term “EDW” in favor of “IDW” (Integrated Data Warehouse), which is technically like an EDW, but doesn’t have to hold absolutely all your analytic data.*
- IBM bought Netezza.
- IBM then put out a marketing concept of “Smart Consolidation”, which incorporates four different kinds of analytic data store or quasi-store: DB2, Netezza, Hadoop (“BigInsights”), and Streams.**
- Oracle introduced Exalytics and the Oracle Big Data Appliance, to go with Exadata.
- Teradata made a Barney announcement with Hortonworks, emphasizing its love for Hadoop as a companion to Teradata Classic and Aster technology.***
- MarkLogic made an announcement with Hortonworks even more Barney than Teradata’s.
* Teradata also uses the term “ADW”, for Active Data Warehouse, which in essence means “Low latency! High concurrency! Rah rah rah!”
**Calling that “Smart Consolidation” is like naming a swinger club “Smart Fidelity”. But terminology aside, I endorse the idea.
***Teradata definitely expects its Hortonworks relationship to ascend beyond the Barney level; Tasso Argyros gave enough NDA details to be convincing about that. But it’s not there yet.
So data marts should often be managed by different technology than your core IDW. But even if you want to use the same technology, there are good reasons to have separate data marts, including the desire to manage:
- Derived data based on other data already in the data warehouse.
- Data that had never been put into the data warehouse in the first place.
- Data you get from outside your enterprise.
In each case, the point is that:
- Your normal data governance bureaucracy is an obstacle — reasonable or otherwise — to analysis of a particular set of data, but …
- … a separate data mart can serve as a safe workaround.
I call this data mart spin-out, and am no longer sure where I first picked up that term. Oliver Ratzesberger popularized the concept when he was at eBay, and then Greenplum ran with it.
More precisely, Greenplum ran with it from a marketing standpoint. Delivery of what eventually became Chorus was more like a crawl.
Data mart spin-out can be either physical, in which case there’s real data (re)copying going on, or virtual, in which case the whole thing is being done as a trick in the core DBMS software, especially its workload management subsystem. Virtual spin-out is faster, more flexible, and less costly, all else being equal. But it does lead to a more complex mixed-workload scenario, which you’re relying on your workload management technology to sort out.
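To make the physical/virtual distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is only an analogy: the table and mart names (`sales`, `mart_east_physical`, `mart_east_virtual`) are hypothetical, a plain view stands in for the "virtual" case, and SQLite has nothing like the workload management subsystem that real virtual spin-out relies on. What it does show is the copy-versus-no-copy difference, and what that means when the core warehouse changes later.

```python
import sqlite3


def build_warehouse():
    """Create a toy in-memory 'core warehouse' with one hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("west", 20), ("east", 5)],
    )
    return conn


def spin_out_physical(conn):
    # Physical spin-out: the rows are actually copied into a new table.
    conn.execute(
        "CREATE TABLE mart_east_physical AS "
        "SELECT * FROM sales WHERE region = 'east'"
    )


def spin_out_virtual(conn):
    # Virtual spin-out (approximated by a view): no data is copied,
    # so queries against the mart hit the core table's storage.
    conn.execute(
        "CREATE VIEW mart_east_virtual AS "
        "SELECT * FROM sales WHERE region = 'east'"
    )


def total(conn, name):
    """Sum the amount column of the given table or view."""
    row = conn.execute(f"SELECT COALESCE(SUM(amount), 0) FROM {name}").fetchone()
    return row[0]


conn = build_warehouse()
spin_out_physical(conn)
spin_out_virtual(conn)

# New data arrives in the core warehouse after both marts exist.
conn.execute("INSERT INTO sales VALUES ('east', 100)")
conn.commit()

physical_total = total(conn, "mart_east_physical")  # snapshot taken at spin-out
virtual_total = total(conn, "mart_east_virtual")    # reflects the new row
```

The physical mart stays frozen at the moment it was copied, while the virtual one sees the new row, but every query against it competes with whatever else is running on the core table. That contention is the mixed-workload problem a real DBMS would hand to its workload manager.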
Anyhow:
- Star customer Oliver built his version of the idea, which is virtual, on Teradata gear, so virtual is naturally the way Teradata itself has gone.
- Sybase has also gone with virtual data mart spin-out in Sybase IQ.
- ParAccel’s approach is also on the virtual side, but assumes a SAN (Storage-Area Network).
- Greenplum, last I looked, was on the physical side, apparently because that was all they could pull off.
So where is this all going? Mark Beyer of Gartner came up with the term “Logical Data Warehouse” three years ago, and evidently has been trying to refine its definition ever since. Forrester Research has been known to mention similar-sounding ideas. At this point, Gartner still seems to be trying to recreate the EDW fallacy at a higher level of abstraction, which is going to work even less well than EDWs did.
Informatica, which one might think would be the biggest fan of the idea, doesn’t seem to have embraced it yet. But then, the whole thing sounds somewhat like Oracle’s 1990s Project Sedona, which was one of the bigger fiascos in software history, and certainly was the greatest failure of Informatica CEO Sohaib Abbasi’s distinguished career.
My own opinion is:
- It’s good for data stores, data sources, and data sinks to be accessible in as consistent a way as possible, and …
- … cataloguing data stores, sources, and sinks in some sort of live way is a worthy endeavor, but …
- … universal data mediators will never work, because tighter coupling will often be needed for reasons of performance, reliability, security, privacy, and/or economic/legal relationship.
Of course, one can retreat to saying “OK, but how about partly-universal, in line with the quasi-EDWs many enterprises have”? On that basis, I think some of the ideas of the “Logical Data Warehouse” will hold up, for example the ones that amount to glorified MDM (Master Data Management), and probably some of the ILM (Information Lifecycle Management) ones as well. The kind of low-level “Let’s build a mini-Facebook to keep track of and talk about our data stores” collaboration that Oliver open-sourced on his way out of eBay — and that seems to be part of Greenplum Chorus too — could also succeed.
But if you’re looking for some kind of logical/virtual Grand Data Unification, well, that won’t work any better than any other Grand Data Unification idea has over the past 40 years.
Comments
14 Responses to “Juggling analytic databases”
“But if you’re looking for some kind of logical/virtual Grand Data Unification – well, that won’t work any better than any other Grand Data Unification idea has over the past 40 years”
That whole thing, and then you just punt? We’re just stuck with duck tape and bubble gum from now on? I think it won’t work until it does and then these will look like the dark ages. Like when you had to program for every type of computer in a different language, and there were dozens of types of computers.
People still use a multitude of languages to get their programming work done.
Oliver’s ‘virtual’ data marts on Teradata at a previous employer are very much physical and not logical/virtual, a point discussed at some length about 6 months ago on the Teradata masters mailing list.
The concept of a ‘virtual data mart’, consisting of views, has been around at Teradata for 10 years or more.
Anyhow, irrespective of the technologies or architectures in play, the aim should remain the same – to give users what they want/need at an acceptable cost to all involved.
For some this may mean a single ‘EDW’ style platform is sufficient, for others it may mean a core data warehouse with delivery data marts on a different technology.
There are many platform/architecture choices available, which I personally view as ‘a good thing’…so long as they are used appropriately.
Since the 1990s at AT&T and WalMart, Teradata has known customers needed multiple “central data warehouses”. Since 2007, we have been designing machines to fit specific workloads, some of which are your 8 kinds of analytic DB. Nevertheless, we still favor consolidation and integration of data into the fewest number of systems: they provide more value than marts and in the long run are cheaper to own.
Oliver’s data marts inside the Teradata box are now a product called “Data Lab”. These are marts inside the big box and can be joined to the production EDW data. Cool! And every mart has an expiration date on it. So it’s a great sandbox for power users aimed at “agile” analytics. We owe you a demo.
Great blog, especially the last couple paragraphs and the rah rah rah.
We’ve been into this problem for a while eh! 🙂
Great discussion though.
At the moment we run Greenplum with data marts. It let us “unify” more data access types than the previous version of our DW, which used other database technology. However, it is now becoming impossible to analyze certain levels of information (event-level data from the internet) with Greenplum, and Hadoop looks suited to that kind of analytics. EMC has a solution for that, but I do not think it is the way to go in this case, since the needs here are fuzzier and require more “elasticity”. What is your opinion about a virtual DW combining an in-house DW and possibly Hadoop in the cloud? Crazy? Not?
Regards!
The ParAccel link after the physical vs. virtual discussion goes to a news article about a NEO approach – http://www.theregister.co.uk/2012/03/15/asteroid_near_miss/ – is this just happening to me, or to others, too?
John,
That should be http://www.dbms2.com/2011/02/03/paraccel-padb-technical-notes/ Fixing now. Thanks!
Alfredo,
That seems like a lot longer discussion than can be kicked off with a fuzzy couple-sentence question. 🙂
I can’t even tell yet if your real problem is performance/scaling/efficiency, or if it’s achieving tight integration among analytic techniques that your current architecture keeps a bit separate from each other.