Some issues in comparing analytic DBMS performance
The analytic DBMS/data warehouse appliance market is full of competitive performance claims. Sometimes, they’re completely fabricated, with no basis in fact whatsoever. But often performance-advantage claims are based on one or more head-to-head performance comparisons. That is, System A and System B are used to run the same set of queries, and some function is applied that takes the two sets of query running times as an input, and spits out a relative performance number as an output.
For example, Greg Rahn twittered to me that Oracle Exadata commonly outperforms existing Oracle installations by a factor of 50 or better, based on a “geometric mean”. What I presume he meant by that is:
- At any one user installation, a number of queries were compared on new system vs. old.
- In each case, the ratio between new and old running time was taken.
- The geometric mean of all those ratios was computed.
Math note: Reversing the order of the second and third steps doesn’t change the outcome at all. Either way, you wind up multiplying N things together, dividing by the product of another N things, and taking the Nth root of all that.
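To make that arithmetic concrete, here is a minimal Python sketch with invented query times. It assumes each ratio is old running time divided by new (i.e., the speedup), which is my reading of the claim, and it confirms the math note: swapping the last two steps gives the same number.

```python
import math

# Made-up per-query running times (seconds) on the old and new systems.
old_times = [120.0, 45.0, 600.0, 30.0]
new_times = [4.0, 3.0, 10.0, 2.0]

# Step 2 then step 3: take each speedup ratio, then the geometric mean of the ratios.
ratios = [old / new for old, new in zip(old_times, new_times)]
geo_mean_of_ratios = math.prod(ratios) ** (1 / len(ratios))

# Steps reversed: multiply each set of times, divide, and take the Nth root.
n = len(old_times)
ratio_of_products = (math.prod(old_times) / math.prod(new_times)) ** (1 / n)

print(geo_mean_of_ratios)   # roughly 25.2 with these made-up numbers
print(ratio_of_products)    # identical, as the math note says
```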
Looking just at the arithmetic, a straightforward geometric-mean approach is not a terrible methodology. Theoretically, I’d prefer to just add up the running times for the whole workload (using each system) and divide the two aggregates. But I’ve tested that change in a couple of cases, and it didn’t seem to make a big difference. In particular, the geometric mean is better than the arithmetic mean, which gives huge weight to the most extreme number(s) in the set. (SAP used to do just that in marketing BI Accelerator, getting huge results because one customer once got a better-than-600X speedup on one particular query out of eight or so.)
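And here is a similarly hypothetical sketch of the approach I prefer: total up the running times on each system and divide. The query frequencies are invented too, but they show how weighting the workload falls out naturally, which is the advantage I mention below.

```python
import math

# Made-up per-query running times (seconds) and hypothetical runs per day.
old_times = [120.0, 45.0, 600.0, 30.0]
new_times = [4.0, 3.0, 10.0, 2.0]
frequency = [500, 200, 1, 50]

# Geometric mean of per-query speedups (the approach described above).
speedups = [o / n for o, n in zip(old_times, new_times)]
geo_mean = math.prod(speedups) ** (1 / len(speedups))

# Preferred alternative: total workload time on each system, then one ratio.
# Multiplying by frequency weights each query by how often it actually runs.
old_total = sum(t * f for t, f in zip(old_times, frequency))
new_total = sum(t * f for t, f in zip(new_times, frequency))
workload_speedup = old_total / new_total

print(f"geometric mean of per-query speedups: {geo_mean:.1f}x")
print(f"frequency-weighted workload speedup:  {workload_speedup:.1f}x")
```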
That said, there are a number of factors that can render such comparisons seriously misleading. For starters, most of these comparisons fail to consider how often each query will be run. (One advantage of my preferred approach, adding up total running time before doing any other arithmetic, is that you can easily weight queries by frequency.) Beyond that, especially when a new challenger system is compared to an old incumbent:
- Comparisons aren’t always run on identical or comparable hardware.
- One system might be running an obsolete release of the software.
- Queries may not be equally well written and tuned on both systems.*
- Both systems may not be equally well administered and/or configured.
In addition:
- A small set of queries may not be representative of a whole workload.
- Since complete workloads are rarely tested, concurrency and contention may not be properly modeled.
And last but not least:
- Marketing departments tend to take their best results and claim those are “typical.”
*I suspect that some of the most dramatic speed-ups we see are for queries that are just plain badly written. On the other hand, if you’ve been running your data warehouse software for a few years and still haven’t figured out how to write your queries for decent performance, maybe the software is somewhat too hard to use …
One Response to “Some issues in comparing analytic DBMS performance”
Curt,
Thanks very much for increasing the visibility of this highly debated topic. I’m particularly frustrated with all the claims companies make when they tout their performance numbers. Frankly, I’ve never been particularly fond of industry benchmarks, but hey, that’s my personal opinion. Companies should also be interested in making sure customers are highly satisfied with the performance they are getting in their specific environments, not just in building for a benchmark.
In my years of experience, people don’t buy on performance alone; they take many other variables into account, as you well know. In the meantime, we will continue to see all kinds of marketing games and offers around the performance topic. It seems like it never ends.
Companies should always strive to improve performance in every release they produce. However, as we all know, it is important to focus on listening to what customers and prospects want, and frankly it is not always about better performance.