Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data
Data warehouse load speeds are a contentious issue. Vertica contrived a benchmark with a 5 1/2 terabyte/hour load rate. Oracle has gotten dinged for very low load speeds, which then are hotly debated. I was told recently of a Greenplum partner’s salesman steering a prospect who needed rapid load speeds away from Greenplum, which seemed odd to me.
Now Greenplum has come out swinging, claiming “consistent” load speeds of 4 terabytes/hour at its Fox Interactive Media account, and armed with a customer quote saying just that. Note, however, that load speeds tend to be proportional to the number of disks, and there are a LOT of disks at that installation.
One way to think about load speeds is to ask how long it would take to load the entire database. It seems as if the Fox database could be loaded, perhaps not in one week, but certainly in less than two. Flipping that around, the Fox site only has enough capacity to hold less than 2 weeks of detailed data. (This is not uncommon in network event kinds of databases.) A corollary is that worldwide storage sales are still constrained by cost, not by absolute limits on the amounts of data enterprises would like to store.
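The back-of-the-envelope arithmetic behind that reasoning can be sketched in a few lines of Python. This is only an illustration: the 4 TB/hour rate is the figure Greenplum claims, but the 1,000 TB database size is a hypothetical round number chosen for the example (not Fox's actual size), and the default of loading 24 hours a day is an assumption that the claim itself does not establish.

```python
def hours_to_load(database_tb: float, rate_tb_per_hour: float) -> float:
    """Hours needed to load the whole database at a constant load rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("load rate must be positive")
    return database_tb / rate_tb_per_hour


def retention_days(capacity_tb: float, rate_tb_per_hour: float,
                   load_hours_per_day: float = 24.0) -> float:
    """Days of incoming detail data the system can hold before it is full.

    Assumes data arrives at the load rate for load_hours_per_day hours each day.
    """
    if load_hours_per_day <= 0:
        raise ValueError("load_hours_per_day must be positive")
    daily_tb = rate_tb_per_hour * load_hours_per_day
    return capacity_tb / daily_tb


if __name__ == "__main__":
    CLAIMED_RATE_TB_PER_HOUR = 4.0   # the rate claimed for the Fox account
    HYPOTHETICAL_DB_TB = 1000.0      # assumption for illustration only

    hours = hours_to_load(HYPOTHETICAL_DB_TB, CLAIMED_RATE_TB_PER_HOUR)
    print(f"Full reload: {hours:.0f} hours ({hours / 24:.1f} days)")

    days = retention_days(HYPOTHETICAL_DB_TB, CLAIMED_RATE_TB_PER_HOUR)
    print(f"Retention if loading around the clock: {days:.1f} days")
```

The two functions are the same division read in two directions, which is the "flipping that around" step: if loading really does run around the clock, the time to reload everything equals the amount of history the system can keep. Setting `load_hours_per_day` below 24 stretches the retention window accordingly.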
Comments
3 Responses to “Greenplum claims very fast load speeds, and Fox still throws away most of its MySpace data”
I’m not sure whether to be flattered or insulted by the use of the term “contrived”. 😉 According to Merriam-Webster’s online dictionary, contrived could be considered “artistic or ingenious”, but I suspect most people associate a negative connotation with the term. Just to set the record straight, Vertica and its partner Syncsort followed a precedent set by Informatica and subsequently repeated by Microsoft. They had each published data loading benchmarks using the TPC-H data. Since there is no industry-standard data loading benchmark, we simply followed the precedent, except we had our results audited and we published a disclosure report so the effort would be transparent. The details are available at http://www.vertica.com/etlworldrecord.
While we are on the subject of setting the record straight, readers might want to check out Eric Lai’s article in which Ben Werther of Greenplum explains that the 4TB per hour rate is not 4TB of continuous data loading. It’s 2TB in 30 minutes. http://www.networkworld.com/news/2009/031909-upstarts-speed-past-bi-vendors.html?page=2. Perhaps that rate could be sustained, perhaps not. That’s why full disclosure is important. Readers need to know what the data represents.
Dave
[…] Lai offers more facts, figures, explanation, and competitive insight than I did on Greenplum’s loading of the Fox/MySpace database, including that Greenplum is being loaded […]
Dave wrote:
“4TB per hour rate is not 4TB of continuous data loading. It’s 2TB in 30 minutes.”
FWIW, I have blogged a bit about the topic of load “rates” versus load results:
http://kevinclosson.wordpress.com/2009/03/17/no-proof-means-all-spoof-exadata-lags-competitor-bulk-data-loading-capability-really/
The views expressed in this comment are my own and do not necessarily reflect the views of Oracle. The views and opinions expressed by others on this comment thread are theirs, not mine.