Greenplum
Analysis of data warehouse DBMS vendor Greenplum and its successor, EMC’s Data Computing division. Related subjects include:
- EMC, which bought Greenplum in 2010
- Data warehousing
- Data warehouse appliances
- PostgreSQL
Aster Data sticks by its SQL/MapReduce guns
Aster Data continues to think that MapReduce, integrated with SQL, is an important technology. For example:
- Aster announced today that it’s providing .NET support for SQL/MapReduce. Perhaps not coincidentally, Aster’s biggest customer is MySpace, which is apparently a big Microsoft shop. (And MySpace parent Fox Interactive Media is a SQL/MapReduce fan, albeit running on Greenplum.)
- Aster generally puts more emphasis on MapReduce than SQL/MapReduce rival Greenplum. That’s a non-trivial comparison, because Greenplum is making progress in SQL/MapReduce itself.
- When talking with Aster folks, I hear a lot about SQL/MapReduce.
I was a big fan of SQL/MapReduce when it was first announced last August. Notwithstanding persuasive examples favoring pure DBMS or pure MapReduce over DBMS/MapReduce integration, I continue to think the SQL/MapReduce idea has great potential. But I do wish more successful production examples would become visible …
Categories: Analytic technologies, Aster Data, Data warehousing, Fox and MySpace, Greenplum, MapReduce, Parallelization | 4 Comments |
Per-terabyte pricing
Software-only DBMS vendors sometimes price per terabyte of user data. Vertica’s list price is $100K/TB. Greenplum’s list price is $70K/TB. In practice, both offer substantial discounts, especially at higher volumes. In both cases, this means raw data, uncompressed, without counting indexes or temp space.
Client experience teaches me that this definition is easy to forget, so let me reemphasize the key point:
Per-terabyte pricing is based on a calculated figure. Per-terabyte pricing is not based on the current disk space used by your database when managed by the DBMS you are replacing.
There’s at least one important difference in how Vertica and Greenplum calculate database size. No matter how many times you copy the data, Vertica only charges you for it once.* But if you spin out data marts and recopy data into them (as Greenplum rightly encourages you to do), Greenplum wants to be paid for each copy. Similarly, Vertica charges only for deployment, and not for test or development; I didn’t remember to ask what Greenplum’s policies are in those regards. (Edit: Greenplum says in a comment below that it doesn’t charge for test or development data either.)
*That policy is a great fit with Vertica’s performance recommendation that you should store columns in different sort orders, perhaps an average of two copies per column.
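To make the copy-counting difference concrete, here is a minimal sketch of the list-price arithmetic. The two per-terabyte prices and the "charge once" versus "charge per copy" rules come from the text above; the function name, the 10 TB workload, and the two-copy scenario are hypothetical, and discounts (which both vendors reportedly offer) are ignored.

```python
# List prices quoted above, in US dollars per terabyte of raw user data
# (uncompressed, not counting indexes or temp space).
LIST_PRICE_PER_TB = {"Vertica": 100_000, "Greenplum": 70_000}


def list_price(vendor, raw_tb, copies=1):
    """Undiscounted list price for `raw_tb` terabytes of raw data kept in `copies` copies.

    Simplifying assumption for illustration: Vertica bills the data once no
    matter how many copies exist, while Greenplum bills every copy.
    """
    billable_copies = 1 if vendor == "Vertica" else copies
    return LIST_PRICE_PER_TB[vendor] * raw_tb * billable_copies


# Hypothetical workload: 10 TB of raw data, held twice
# (for example, a warehouse plus one data mart holding the same data).
vertica_cost = list_price("Vertica", 10, copies=2)
greenplum_cost = list_price("Greenplum", 10, copies=2)
greenplum_single_copy = list_price("Greenplum", 10, copies=1)

print(f"Vertica, 2 copies:   ${vertica_cost:,}")
print(f"Greenplum, 2 copies: ${greenplum_cost:,}")
print(f"Greenplum, 1 copy:   ${greenplum_single_copy:,}")
```

Under these assumptions Greenplum is cheaper at list for a single copy, but the ordering flips once the same data is copied into a second mart.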
Categories: Columnar database management, Data warehousing, Greenplum, Pricing, Vertica Systems | 7 Comments |
Greenplum blogs about some customers
I’ve written some about Greenplum’s customers at eBay and Fox Interactive Media. But as I recently grumped, I’m not in the mood right now to write much about other Greenplum customers. Fortunately, Greenplum has filled the gap itself. Marketing chief Paul Salazar just blogged about a number of other big Greenplum customers. And last month Paul blogged in considerable detail about what he characterizes as an enterprise data warehouse (EDW) conversion — Oracle replacement — at a large pharmaceutical company.
Categories: Application areas, Data warehousing, Greenplum, Oracle | Leave a Comment |
The future of data marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept — mixing mine and Greenplum’s together — include:
- Data marts aren’t just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
- Thus, it would be really cool if business users could have their own analytic “sandboxes” — virtual or physical analytic databases that they can manipulate without breaking anything else.
- In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
- Whether or not you agree with that, it’s an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it’s an empirical fact that many business users have the clout to order up new data marts as well.
- Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching the story:
- Thesis: Enterprise Data Warehouses (EDWs)
- Antithesis: Data Warehouse Appliances
- Synthesis: Greenplum’s Enterprise Data Cloud vision
When put that starkly, it’s overstated, not least because
Specialized Analytic DBMS != Data Warehouse Appliance
But basically it makes sense, for two main reasons:
- Analysis is performed on all sorts of novel data, from sources far beyond an enterprise’s core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
- On the other hand, consolidation can be a good idea even when systems don’t particularly interoperate. Data marts, which commonly do in part interoperate with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
More on Fox Interactive Media’s use of Greenplum
Greenplum’s most important reference is probably its energetic advocate Fox Interactive Media, even ahead of much larger Greenplum user eBay, and notwithstanding Aster Data’s large presence in Fox subsidiary MySpace. I just ran across a “review” of Greenplum by FIM’s Brian Dolan, neatly summarizing his views about Greenplum’s strengths, weaknesses, and uses inside Fox. Highlights include: Read more
Categories: Data warehousing, Fox and MySpace, Greenplum, Web analytics | 2 Comments |
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including: Read more
Categories: Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Greenplum, MapReduce, Market share and customer counts, Parallelization, PostgreSQL, Pricing | 11 Comments |
Greenplum will be announcing some stuff
Greenplum is having a webinar Monday to announce “The Next Big Leap in Data Warehousing” (capitalization theirs). The idea they’ll be talking about is a genuinely good one. And off the top of my head I can only think of a few vendors who implemented it before Greenplum, and even fewer who emphasize it explicitly. So if you like webinars, you might want to listen in. I plan to blog about the general concept soon after the 12:01 am Monday embargo lifts. (Uh, guys, it is Monday rather than Tuesday, right?)
Categories: Data warehousing, Greenplum, Specific users | 1 Comment |
eBay’s two enormous data warehouses
A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I’ve already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn’t like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I’m finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.
Metrics on eBay’s main Teradata data warehouse include:
- >2 petabytes of user data
- 10s of 1000s of users
- Millions of queries per day
- 72 nodes
- >140 GB/sec of I/O, or 2 GB/node/sec, or maybe that’s a peak when the workload is scan-heavy
- 100s of production databases being fed in
Metrics on eBay’s Greenplum data warehouse (or, if you like, data mart) include:
- 6 1/2 petabytes of user data
- 17 trillion records
- 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
- 96 nodes
- 200 MB/node/sec of I/O (that’s the order of magnitude difference that triggered my post on disk drives)
- 4.5 petabytes of storage
- 70% compression
- A small number of concurrent users
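The back-of-envelope figures in the two lists above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes), treats the ">" figures as exact lower bounds, and assumes new records are about the same average size as existing ones. The variable names are mine.

```python
# Decimal storage units (an assumption; the post does not say which it uses).
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

# Teradata system: >140 GB/sec of I/O across 72 nodes.
teradata_io_per_node_gb = (140 * GB / 72) / GB

# Greenplum system: 6 1/2 PB of user data in 17 trillion records.
greenplum_bytes_per_record = (6.5 * PB) / 17e12

# 150 billion new records/day at that average record size.
greenplum_ingest_tb_per_day = 150e9 * greenplum_bytes_per_record / TB

# 200 MB/node/sec across 96 nodes, as an aggregate figure.
greenplum_total_io_gb = (200 * MB * 96) / GB

print(f"Teradata I/O per node:     {teradata_io_per_node_gb:.2f} GB/sec")
print(f"Greenplum avg record size: {greenplum_bytes_per_record:.0f} bytes")
print(f"Greenplum ingest rate:     {greenplum_ingest_tb_per_day:.1f} TB/day")
print(f"Greenplum aggregate I/O:   {greenplum_total_io_gb:.1f} GB/sec")
```

On these assumptions the Teradata number works out to just under 2 GB/node/sec, and the Greenplum ingest rate lands in the high 50s of terabytes per day, consistent with "well over 50 terabytes/day".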
Categories: Analytic technologies, Data warehouse appliances, Data warehousing, eBay, Greenplum, Petabyte-scale data management, Teradata, Web analytics | 48 Comments |
There always seems to be a fire drill around MapReduce news
Last August I flew out to see my new clients at Greenplum. They told me they planned to roll out MapReduce in a few weeks, and asked for my help in publicizing it. From their offices I went to dinner with non-clients Aster Data, who told me they’d gotten wind of a Greenplum MapReduce announcement and planned to come out ahead of it. A couple of hours later, Aster signed up as a client. In something of a pickle (but not one of my own making), I knocked heads, and persuaded both vendors to announce MapReduce at the same time, namely the following Monday. Lots of publicity ensued for both vendors, and everybody was reasonably satisfied.
Categories: About this blog, Analytic technologies, Aster Data, Greenplum, MapReduce, Michael Stonebraker, Vertica Systems | 1 Comment |
Lots of analytic DBMS vendors are hiring
After writing about a Twitter jobs page, it occurred to me to check out whether analytic DBMS vendors are still hiring. Based on the Careers pages on their websites, I determined that Aster, Greenplum, Kickfire, and ParAccel all evidently are, in various mixes of (mainly) technical and field positions. At that point I got bored and stopped.
I didn’t choose those vendors entirely at random. If I had to name three vendors who are said to have had small layoffs at some point over the past few quarters, it would be ParAccel, Greenplum, and Kickfire. So if even they are hiring, the analytic DBMS sector is still pretty healthy … or at least thinks it is. 😉
Categories: Aster Data, Data warehousing, Greenplum, Kickfire, ParAccel | 5 Comments |