The future of data marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither needs to fit into the core enterprise data model nor particularly benefits from being tightly fitted into it. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do interoperate in part with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
Of course, the EDC vision isn’t quite as new or differentiated as Greenplum ideally would wish one to believe.
- To a first approximation, EDC sounds a lot like what eBay has already built on Teradata equipment.
- Greenplum’s EDC vision also sounds a lot like what Stuart Frost was talking about at DATAllegro, what Dell was planning to build on DATAllegro equipment, and what Stuart continues to talk about now that he’s been acquired into Microsoft.
- Something like EDC can also be presumed to be implicit in the strategies of the other one-size-fits-all vendors — i.e., Oracle and IBM.
- Greenplum has only implemented a little more of the EDC vision so far than have other firms, unless you give it credit for being cheap/fast/MPP/running on commodity hardware, but deny that credit to Teradata (specialized hardware, and not cheap in its most popular configurations), Oracle (ditto for Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not released yet).
- Specifically: In Greenplum Release 3.3, which is being announced today, Greenplum is introducing the (enhanced?) ability for data marts to be spun out as a background operation, while the database otherwise remains functional. As of 3.3, spinning out a data mart is a command-line operation. But in Release 3.4, Greenplum plans to offer a web-based interface for the same, at which point the “self-service data mart creation” discussion will become operative. Otherwise, EDC is a roadmap/vision/statement-of-direction much more than it is a fully-baked technical project.
One particular source of potential confusion is Greenplum’s emphasis on the buzzphrase “self-service data mart.” This seems to be a conflation of two related concepts:
- End users should be able to create new data marts themselves. Strictly speaking, I view this ability as useless at most enterprises, and important at very few, because of logistical issues. (Who gives the permissions? Who decides which hardware is used?) That said, such “useless” end-user tools often wind up being important productivity aids for IT professionals, and this kind of “self-service” would surely be another example. Edit: Hmm. Doug Henschen inspired me to think that over again, and I’m beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.
- End users should be able to manage data marts themselves, once created. That’s a great idea, full of agility and don’t-make-IT-a-roadblock goodness. Data miners and similar analytic professionals commonly have the technical ability to manage a simple database, and should be allowed to do so if it’s ensured that they don’t break anything for anybody else.
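To make that second point concrete, here is a minimal sketch in PostgreSQL-flavored SQL (Greenplum descends from PostgreSQL). The role, schema, and table names are hypothetical, and this is my illustration of the general idea, not Greenplum’s implementation:

```sql
-- Hypothetical sketch: give an analyst a sandbox schema she can manage
-- herself, plus read-only access to the central warehouse.
CREATE ROLE jsmith LOGIN;                           -- the analyst
CREATE SCHEMA jsmith_sandbox AUTHORIZATION jsmith;  -- she owns this schema
GRANT USAGE ON SCHEMA warehouse TO jsmith;          -- can see central objects
GRANT SELECT ON warehouse.fact_sales TO jsmith;     -- read, but not modify

-- Inside jsmith_sandbox she can create, load, and drop tables at will,
-- without being able to break anything for anybody else.
```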
One thing that’s needed for this technology to come to full fruition is sophisticated data movement and synchronization. Ideally, some tables in a data mart could be virtual — views against a central database. But others would be physically recopied from the center, with all the ETL/ELT/ETLT/replication issues that entails. Meanwhile, it’s not obvious that the ideal architecture is a simpleminded hub-and-spoke — perhaps one should be able to spin data marts out of other marts, at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g. by starting a table out as a view and then making it a physical copy as usage profiles change.
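As a minimal illustration of that view-first, copy-later progression, in generic SQL (all names hypothetical):

```sql
-- A mart table that starts life as a view against the central database...
CREATE VIEW mart.sales_emea AS
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';

-- ...and is later switched to a physical copy under the same name,
-- once the usage profile justifies the storage and refresh cost.
DROP VIEW mart.sales_emea;
CREATE TABLE mart.sales_emea AS
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';

-- The physical copy now needs scheduled refresh (the ETL/ELT part):
TRUNCATE mart.sales_emea;
INSERT INTO mart.sales_emea
    SELECT * FROM warehouse.fact_sales WHERE region = 'EMEA';
```

Queries against mart.sales_emea keep working across the switch, which is the point: the deployment strategy changes without users having to notice.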
Oliver Ratzesberger of eBay also argues that workload management — not a current Greenplum strength — can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements). More generally, if you’re setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There’s a big virtualization opportunity here — but virtualization requires a lot of system management infrastructure.
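For what it’s worth, Greenplum’s documented workload-management construct is the resource queue. Here is a hedged sketch of SLA tiering along the lines Oliver describes — exact syntax varies by Greenplum release, and the queue and role names are hypothetical:

```sql
-- A high-priority queue backing the CFO's urgent virtual mart...
CREATE RESOURCE QUEUE exec_queue    WITH (ACTIVE_STATEMENTS=5, PRIORITY=MAX);
-- ...and a throttled queue for routine exploratory sandboxes.
CREATE RESOURCE QUEUE sandbox_queue WITH (ACTIVE_STATEMENTS=2, PRIORITY=LOW);

-- Queries inherit the concurrency and priority limits of their role's queue.
ALTER ROLE cfo_analyst RESOURCE QUEUE exec_queue;
ALTER ROLE data_miner  RESOURCE QUEUE sandbox_queue;
```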
Related links
- My recent post on reinventing business intelligence
- Greenplum adviser Joe Hellerstein’s pitch for agile data warehousing
- Charlie Bachman’s “private database” idea, which never went anywhere (pp. 138-139)
- Greenplum’s EDC and Release 3.3 press releases
Comments
So is the only difference between Vertica and GP in the cloud that GP has dedicated infrastructure while Vertica uses EC2?
Not exactly. Greenplum doesn’t offer cloud-based services of any kind at this time. It encourages its customers to build “private clouds”.
Your comment would be closer to accurate if you were contrasting Aster and Kognitio.
Oh, so hang on a second – their big announcement today was about on-premises “private GP clouds,” and not some dedicated hosted service they provide?
Very confused, isn’t that what Greenplum already offered?
Well, it looks like they’re basically saying: go cobble something together out of bare metal, virtual machines, and/or cloud resources, and we will fit on top of that. But it’s still your IT ops handling all the provisioning (for their private cloud). In essence, what it seems like to me at this point is a set of best practices, really, relatively in line with their MAD EDW philosophy. Unless I’m missing something, which is a distinct possibility. I mean, geez, I initially assumed they were providing the cloud infrastructure 🙂
I’d recommend a read of the Greenplum EDC whitepaper at http://www.greenplum.com/resources/complete-library/ (No registration required).
The EDC initiative is about 3 things:
– Platform technology that allows business analysts to self-serve provision warehouses/sandboxes via a web console and access/replicate data into their warehouses from anywhere in the EDC (i.e., a ‘private cloud’ approach applied to scale-out data warehousing). This is not just about spinning up a database in virtual machines. We’re building a new layer of services that really allows business and IT to each focus on what they do best and reduces the areas of friction that exist today — e.g. self-serve cluster provisioning from server pools, local or geographically remote data replication, data lineage and cross-warehouse metadata, and more.
– A new data warehousing methodology that challenges the formal ‘everything in one database and one data model’ approach that has been prevalent over the past 25 years. This isn’t something that Greenplum has cooked up — it is simply a reflection of what our customers are putting into practice today.
– An ecosystem of customers and partners that believe in the vision and are working with us to shape and deliver on it.
Note that most enterprises that we work with aren’t looking to the public cloud for data warehousing – largely because the data is being generated in-house and they don’t want to push TBs over the Internet daily. But they do want to achieve many of the touted ‘cloud’ benefits in-house. That is, they want to empower business analysts to serve themselves without lots of process or IT delays in the way. And they want IT to consolidate infrastructure, get their arms around data mart proliferation, and improve service levels, but without some heavy-handed approach that requires unifying all the data models.
Isn’t this another “buzzword bingo” name for something that is pretty common in mature DW environments? Pretty much anyone who has a data mining team using the DW must do this.
Curt has referred to it before in connection with eBay and Oliver Ratzesberger’s Analytics as a Service blog at http://www.xlmpp.com/articles/16-articles/39-analytics-as-a-service. See also a presentation from Oliver on this topic at http://www.teradata.com/t/WorkArea/DownloadAsset.aspx?id=5761
So basically, what companies besides Aster, Kognitio and Vertica currently have production cloud implementations today?
I found this post from Daniel Abadi to be a pretty balanced assessment of this news:
http://dbmsmusings.blogspot.com/2009/06/quick-thoughts-on-greenplum-edc.html
For example:
“7. It appears that the only part of the EDC initiative that Greenplum’s new version (3.3) has implemented is online data warehouse expansion (you can add a new node and the data warehouse/data mart can incorporate it into the parallel storage/processing without having to go down). All this means is that Greenplum has finally caught up to Aster Data along this dimension. I’d argue that since Aster Data also has a public cloud version and has customers using it there, they’re actually farther along the EDC initiative than Greenplum is …”
Aster has multiple customer deployments on public clouds — both Amazon and AppNexus. ShareThis is the largest DW-in-a-real-cloud deployment at Amazon (currently at 10 TB) and will be discussing their deployment at TDWI in San Diego in August at the Executive Summit:
http://www.eiseverywhere.com/ehome/index.php?eventid=4983&tabid=929
We’ve been running Greenplum internally on EC2 for almost 2 years now, and use both EC2 and internal VMware pools for a range of QA and scale testing work.
Making Greenplum run on EC2 is almost zero work — we just haven’t seen material demand from large enterprises wanting to put their production, mission critical data warehouses in the public cloud yet. There’s no doubt it’ll come over time, and we’re supportive of the direction, but it just isn’t here yet.
Matt Aslett of The 451 Group wrote a nice analysis on this topic (unfortunately only available through paid subscription), in which he reinforced this point:
“Enabling cloud-computing deployments is about more than simply offering a version of your product running on Amazon . . . Adoption of data warehousing on public clouds has so far been limited to proofs-of-concept evaluations and trials rather than production deployments, we believe, and Greenplum’s focus on datacenter platforms could serve it well as enterprises look to private cloud architecture as a method of improving datacenter efficiencies before identifying workloads that could be migrated to public clouds.”
We’re encouraged by folks like Aster, Vertica and others finding interest in public cloud offerings that serve the current market of Web 2.0 companies, which is definitely a good use case. If anyone is seeing that large enterprises are ready today for meaningful adoption of public cloud services for data warehousing, we’re ready to serve 😉
Ben,
I didn’t think it was possible to stretch the definition of Web 2.0 to the breaking point, but you may have just accomplished it. 😉
Best,
CAM
Dr.: Hi. I’m a university student, and I would like an example of a data mart application, with an explanation.
Curt, the link to Scott Yara’s own words does not work
Thanks, Naym. That’s totally my fault, and I don’t now know what I had in mind. I’ll just delete the reference.
Big Data Analytics is not only for retail Business Intelligence, even though that is where some of the greatest advancements are currently occurring. Big Data Analytics is also the future of Infrastructure Asset Management. Each industry has core infrastructure that must be asset-managed over its life cycle through predictive modeling. Big Data Analytics will evolve too rapidly (with increasing volume, variety, velocity and complexity of available classes of data) for any industry organization to standardize and maintain THE method of doing Infrastructure Asset Management through Big Data Analytics.

Those wishing to take a leadership role in the Big Data Analytics required for successful Integrated Asset Management of infrastructure need to establish the standards for the backbone of Big Data in their industry sector. That is, industry associations should establish the standards for data governance, management, control, and compliance through a Central Data Warehouse. Then let consultants, utilities, software companies, and academics knock themselves out, do the analytics any way they want, and provide competitive differentiation to the businesses. If you want to work with our data scientists, great; if you have your own data scientists or a third party that helps you, fabulous. But there is only one place you come to get the data, and that’s [the industry association’s Central Data Warehouse].

From the CDW, infrastructure asset design and performance can be independently validated and verified (IV&V), and benchmarked against peers (as is done in the software sector). As a civil engineer, I believe this is the future of the engineering standard of care in all sectors and will change engineering practice as we know it.