The future of data marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither needs to fit into the core enterprise data model nor particularly benefits from being tightly fitted into it. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do interoperate in part with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
Of course, the EDC vision isn’t quite as new or differentiated as Greenplum ideally would wish one to believe.
- To a first approximation, EDC sounds a lot like what eBay has already built on Teradata equipment.
- Greenplum’s EDC vision also sounds a lot like what Stuart Frost was talking about at DATAllegro, what Dell was planning to build on DATAllegro equipment, and what Stuart continues to talk about now that he’s been acquired into Microsoft.
- Something like EDC can also be presumed to be implicit in the strategies of the other one-size-fits-all vendors — i.e., Oracle and IBM.
- Greenplum has only implemented a little more of the EDC vision so far than have other firms, unless you give it credit for being cheap/fast/MPP/running on commodity hardware, but deny that credit to Teradata (specialized hardware, and not cheap in its most popular configurations), Oracle (ditto for Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not released yet).
- Specifically: In Greenplum Release 3.3, which is being announced today, Greenplum is introducing the (enhanced?) ability for data marts to be spun out as a background operation, while the database otherwise remains functional. As of 3.3, spinning out a data mart is a command-line operation. But in Release 3.4, Greenplum plans to offer a web-based interface for the same, at which point the “self-service data mart creation” discussion will become operative. Otherwise, EDC is a roadmap/vision/statement-of-direction much more than it is a fully-baked technical project.
One particular source of potential confusion is Greenplum’s emphasis on the buzzphrase “self-service data mart.” This seems to be a conflation of two related concepts:
- End users should be able to create new data marts themselves. Strictly speaking, I view this ability as useless at most enterprises, and important at very few, because of logistical issues. (Who gives the permissions? Who decides which hardware is used?) That said, such “useless” end-user tools often wind up being important productivity aids for IT professionals, and this kind of “self-service” would surely be another example. Edit: Hmm. Doug Henschen inspired me to think that over again, and I’m beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.
- End users should be able to manage data marts themselves, once created. That’s a great idea, full of agility and don’t-make-IT-a-roadblock goodness. Data miners and similar analytic professionals commonly have the technical ability to manage a simple database, and should be allowed to do so if it’s ensured that they don’t break anything for anybody else.
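To make that second point concrete, here is a minimal sketch in PostgreSQL-flavored SQL (Greenplum descends from PostgreSQL). The role, schema, and table names are hypothetical, and this is my illustration of the general idea, not Greenplum’s implementation:

```sql
-- Hypothetical sketch: give an analyst a sandbox schema she can manage
-- herself, plus read-only access to the central warehouse.
CREATE ROLE jsmith LOGIN;                           -- the analyst
CREATE SCHEMA jsmith_sandbox AUTHORIZATION jsmith;  -- she owns this schema
GRANT USAGE ON SCHEMA warehouse TO jsmith;          -- can see central objects
GRANT SELECT ON warehouse.fact_sales TO jsmith;     -- read, but not modify

-- Inside jsmith_sandbox she can create, load, and drop tables at will,
-- without being able to break anything for anybody else.
```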
One thing that’s needed for this technology to come to full fruition is sophisticated data movement and synchronization. Ideally, some tables in a data mart could be virtual — views against a central database. But others would be physically recopied from the center, with all the ETL/ELT/ETLT/replication issues that entails. Meanwhile, it’s not obvious that the ideal architecture is a simpleminded hub-and-spoke — perhaps one should be able to spin data marts out of other marts, at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g. by starting a table out as a view and then making it a physical copy as usage profiles change.
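As a minimal illustration of that view-first, copy-later progression, in generic SQL (all names hypothetical):

```sql
-- A mart table that starts life as a view against the central database...
CREATE VIEW mart.sales_emea AS
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';

-- ...and is later switched to a physical copy under the same name,
-- once the usage profile justifies the storage and refresh cost.
DROP VIEW mart.sales_emea;
CREATE TABLE mart.sales_emea AS
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';

-- The physical copy now needs scheduled refresh (the ETL/ELT part):
TRUNCATE mart.sales_emea;
INSERT INTO mart.sales_emea
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';
```

Queries against mart.sales_emea keep working across the switch, which is the point: the deployment strategy changes without users having to notice.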
Oliver Ratzesberger of eBay also argues that workload management — not a current Greenplum strength — can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements). More generally, if you’re setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There’s a big virtualization opportunity here — but virtualization requires a lot of system management infrastructure.
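For what it’s worth, Greenplum’s documented workload-management construct is the resource queue. Here is a hedged sketch of SLA tiering along the lines Oliver describes — exact syntax varies by Greenplum release, and the queue and role names are hypothetical:

```sql
-- A high-priority queue backing the CFO's urgent virtual mart...
CREATE RESOURCE QUEUE exec_queue    WITH (ACTIVE_STATEMENTS=5, PRIORITY=MAX);
-- ...and a throttled queue for routine exploratory sandboxes.
CREATE RESOURCE QUEUE sandbox_queue WITH (ACTIVE_STATEMENTS=2, PRIORITY=LOW);

-- Queries inherit the concurrency and priority limits of their role's queue.
ALTER ROLE cfo_analyst RESOURCE QUEUE exec_queue;
ALTER ROLE data_miner  RESOURCE QUEUE sandbox_queue;
```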
Related links
- My recent post on reinventing business intelligence
- Greenplum adviser Joe Hellerstein’s pitch for agile data warehousing
- Charlie Bachman’s “private database” idea, which never went anywhere (pp. 138-139)
- Greenplum’s EDC and Release 3.3 press releases
Comments
So is the only difference between Vertica and GP in the cloud that GP has dedicated infrastructure while Vertica uses EC2?
Not exactly. Greenplum doesn’t offer cloud-based services of any kind at this time. It encourages its customers to build “private clouds”.
Your comment would be closer to accurate if you were contrasting Aster and Kognitio.
Oh, so hang on a second – their big announcement today was about on-premises “private GP clouds,” and not some dedicated hosted service they provide?
Very confused, isn’t that what Greenplum already offered?
Well, it looks like they’re basically saying: go cobble something together out of bare metal, virtual machines, and/or cloud resources, and we will fit on top of that. But it’s still your IT ops handling all the provisioning (for their private cloud). In essence, what it seems like to me at this point is a set of best practices, really, relatively in line with their MAD EDW philosophy. Unless I’m missing something, which is a distinct possibility. I mean, geez, I initially assumed they were providing the cloud infrastructure 🙂
I’d recommend a read of the Greenplum EDC whitepaper at http://www.greenplum.com/resources/complete-library/ (No registration required).
The EDC initiative is about 3 things:
– Platform technology that allows business analysts to self-serve provision warehouses/sandboxes via a web console and access/replicate data into their warehouses from anywhere in the EDC (i.e., a ‘private cloud’ approach applied to scale-out data warehousing). This is not just about spinning up a database in virtual machines. We’re building a new layer of services that really allows business and IT to each focus on what they do best and reduces the areas of friction that exist today — e.g. self-serve cluster provisioning from server pools, local or geographically remote data replication, data lineage and cross-warehouse metadata, and more.
– A new data warehousing methodology that challenges the formal ‘everything in one database and one data model’ approach that has been prevalent over the past 25 years. This isn’t something that Greenplum has cooked up — it is simply a reflection of what our customers are putting into practice today.
– An ecosystem of customers and partners that believe in the vision and are working with us to shape and deliver on it.
Note that most enterprises that we work with aren’t looking to the public cloud for data warehousing – largely because the data is being generated in-house and they don’t want to push TBs over the Internet daily. But they do want to achieve many of the touted ‘cloud’ benefits in-house. That is, they want to empower business analysts to serve themselves without lots of process or IT delays in the way. And they want IT to consolidate infrastructure, get their arms around data mart proliferation, and improve service levels, but without some heavy-handed approach that requires unifying all the data models.
Isn’t this another “buzzword bingo” name for something that is pretty common in mature DW environments? Pretty much anyone who has a data mining team using the DW must do this.
Curt has referred to it before in connection with eBay and Oliver Ratzesberger’s Analytics as a Service blog at http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service. See also a presentation from Oliver on this topic at http://www.teradata.com/t/WorkArea/DownloadAsset.aspx?id=5761
So basically, what companies besides Aster, Kognitio and Vertica currently have production cloud implementations today?
I found this post from Daniel Abadi to be a pretty balanced assessment of this news:
http://dbmsmusings.blogspot.com/2009/06/quick-thoughts-on-greenplum-edc.html
For example:
“7. It appears that the only part of the EDC initiative that Greenplum’s new version (3.3) has implemented is online data warehouse expansion (you can add a new node and the data warehouse/data mart can incorporate it into the parallel storage/processing without having to go down). All this means is that Greenplum has finally caught up to Aster Data along this dimension. I’d argue that since Aster Data also has a public cloud version and has customers using it there, they’re actually farther along the EDC initiative than Greenplum is …”
Aster has multiple customer deployments on public clouds — both Amazon and AppNexus. ShareThis is the largest DW-in-a-real-cloud deployment at Amazon (currently at 10 TB) and will be discussing their deployment at TDWI in San Diego in August at the Executive Summit:
http://www.eiseverywhere.com/ehome/index.php?eventid=4983&tabid=929
We’ve been running Greenplum internally on EC2 for almost 2 years now, and use both EC2 and internal VMware pools for a range of QA and scale testing work.
Making Greenplum run on EC2 is almost zero work — we just haven’t seen material demand from large enterprises wanting to put their production, mission critical data warehouses in the public cloud yet. There’s no doubt it’ll come over time, and we’re supportive of the direction, but it just isn’t here yet.
Matt Aslett of The 451 Group wrote a nice analysis on this topic (unfortunately only available through paid subscription), in which he reinforced this point:
“Enabling cloud-computing deployments is about more than simply offering a version of your product running on Amazon . . . Adoption of data warehousing on public clouds has so far been limited to proofs-of-concept evaluations and trials rather than production deployments, we believe, and Greenplum’s focus on datacenter platforms could serve it well as enterprises look to private cloud architecture as a method of improving datacenter efficiencies before identifying workloads that could be migrated to public clouds.”
We’re encouraged by folks like Aster, Vertica and others finding interest in public cloud offerings that serve the current market of Web 2.0 companies, which is definitely a good use case. If anyone is seeing that large enterprises are ready today for meaningful adoption of public cloud services for data warehousing, we’re ready to serve 😉
Ben,
I didn’t think it was possible to stretch the definition of Web 2.0 to the breaking point, but you may have just accomplished it. 😉
Best,
CAM
Dr.: Hi. I’m a university student, and I would like an example of a data mart application, with an explanation.
Curt, the link to Scott Yara’s own words does not work
Thanks, Naym. That’s totally my fault, and I don’t now know what I had in mind. I’ll just delete the reference.
Big Data Analytics is not only for retail Business Intelligence, even though that is where some of the greatest advancements are currently occurring. Big Data Analytics is also the future of Infrastructure Asset Management. Each industry has core infrastructure that must be asset-managed over its life cycle through predictive modeling. Big Data Analytics will evolve too rapidly (with increasing volume, variety, velocity and complexity of available classes of data) for any industry organization to standardize and maintain THE method of doing Infrastructure Asset Management through Big Data Analytics.

Those wishing to take a leadership role in the Big Data Analytics required for successful Integrated Asset Management of infrastructure need to establish the standards for the backbone of Big Data in their industry sector. That is, industry associations should establish the standards for data governance, management, control, and compliance through a Central Data Warehouse. Then let consultants, utilities, software companies, and academics knock themselves out, do the analytics any way they want, and provide competitive differentiation to the businesses. If you want to work with our data scientists, great; if you have your own data scientists or a third party that helps you, fabulous. But there is only one place you come to get the data, and that’s [the industry association’s Central Data Warehouse].

From the CDW, infrastructure asset design and performance can be independently validated and verified (IV&V), and benchmarked against peers (as is done in the software sector). As a civil engineer, I believe this is the future of the engineering standard of care in all sectors and will change engineering practice as we know it.