Relational data warehouse Expansion (or Explosion) Ratios
One of the least understood aspects of data warehouse technology is what may be called the
Expansion Ratio = (Total disk space used, except for mirroring) / (Size of the base database).
This is similar to the explosion ratio discussed in the OLAP Report’s justly famous discussion of database explosion, but I’m going with my own terminology because I don’t want to be tied to their precise terminology, nor to their technical focus. Expansion Ratios are hotly debated, with some figures being:
- Teradata claims an Expansion Ratio of 8-9X for Oracle, 6X for DB2 (open system version), and 2.5X for Teradata. The underlying source is data warehouses they’ve replaced, so there may be a bias toward out-of-control warehouses on the part of their competitors.
- An anonymous appliance vendor exec said to me off the top of his head that Oracle has 6-8X Expansion Ratios.
- Oracle’s TPC-H submissions in the largest size range (10 terabytes) have 9.7-10.5X Expansion Ratios, if I’m reading the TPCs correctly.
- Oracle cites a survey of 8 customers with 10-60 Tb database size in which the Expansion Ratio works out to 1.6X. (More on this anomalous result below.)
I don’t have actual figures from Netezza and DATallegro, but I imagine they’d come out lower than 2X, possibly well below.
Expansion Ratios are a huge factor in data warehouse cost. It’s not just the disks and their controllers; it’s also the processing done of those extra indices or aggregates or whatever, and the human cost of administering them. (E.g., Teradata claims that when moving from Oracle to Teradata, there’s commonly a full order-of-magnitude reduction in both index number and aggregate index size; clearly, the DBA cost in such a scenario would be drastically reduced as well.)
The question naturally arises: What is all this space being used for? Well, some is necessary as a pool for temporary/intermediate tables that will arise in query processing. Some arises if you reserve fixed amounts of space for something before filling them. Some just happens because disks aren’t 100% utilized. But all that shouldn’t be enough to get you up even to 1.5X. All the rest has to come from indices, aggregations, and their equivalents.
Single-column indices, I would think, are far from a complete explanation. Yes, if you indexed every column, you’d clearly have a greater than 1X increment to the Expansion Ratio, certainly before compression and probably after as well. But the increment wouldn’t be much over 1X either. So the big explosion would almost have to be in the multi-column stuff – prejoins, aggregated rollups (with are inherently indexed by multiple dimensions at once), and so on.
But by that logic, if you don’t have a major star/snowflake/multidimensional ROLAP flavor to your database, and you also don’t need any terribly explosive prejoins, you’re not going to have a large Expansion Ratio. Thus, Oracle’s claim to have very low Expansion Ratios in some large warehouses is completely credible. And so I’ll end by quoting Oracle’s claim, verbatim, by permission, from an email I received:
An analysis of eight Oracle data warehouses ranging from 10 TB to 60TB in size revealed that on average 63% of the database was occupied by data within tables, 18% by indexes, and 19% by unused space and temporary space. For the four customers who indicated which tables were summaries (either as materialized views of custom-built summaries), none of these customers dedicated more than 15% of the database to summary data.
Comments
9 Responses to “Relational data warehouse Expansion (or Explosion) Ratios”
Leave a Reply
Curt, I believe that end users of data warehouse technologies have good control over the expansion, or uptake, ratio. It really depends on the level of performance that is needed and the amount of the data that they actually access in online queries. Some sites index everything and others only do it when absolutely necessary. Same goes for creating summary tables and cubes. TPC numbers represent the high end of anything that would be implemented at a real site because the vendors use TPC to brag most about the performance figures, not the price/performance. The Teradata numbers seem about right in practice.
By the way, Teradata has an interesting new feature called “multi temperature” which allows sites to pay less for the data they don’t access as much. Many have been implementing concepts like this for some time to find a good price/performance balance.
[…] When I posted a few days ago about Expansion Ratios, I passed along Teradata’s estimate that there was a significant gap between Teradata and IBM. IBM’s own estimate, however, is that a typical Expansion Ratio for DB2 is a Teradata-like 2-3X in DB2 Version 8, and that it’s falling further in Version 9 due to compression. (DB2’s compression sounds like it’s the most aggressive in the business, specialty columnar or MOLAP products perhaps excepted.) Incidentally, this suggests that the indexing features DB2 has that Teradata doesn’t – e.g., alternate datatypes like geospatial – aren’t heavily used by a large fraction of the customer base. […]
William,
Yes, but the question is how much you need in the way of indexing and summaries to get the same performance from different systems.
Also, the grand hash partition that starts off a typical MPP installation saves some index space. And if table scans are fast enough a lot of the time, you need fewer indicies than if they almost never are.
Anyhow, if you look up to my IBM/Teradata post, IBM claims that indeed their installations wind up very comparable to Teradata’s. But then, IBM and Teradata also have pretty similar overall technical approaches now.
[…] size vs. Oracle for the same data, due to, for example, variable record lengths. Such factors are not unheard of in data warehousing, of course; but for OLTP his comment was quite the jaw-dropper. On the other hand, OLTP databases […]
[…] weak recently, and deservedly so. MOLAP’s problem with database explosion produces ridiculous expansion ratios, recent work to lessen the problem notwithstanding. Nonetheless, the Essbase installed base is […]
[…] Индексы и временное рабочее пространство. Это то, о чем я уже писал пару лет назад в моем посте о коэффициентах расширения. […]
[…] The company says that experience shows >10:1 compression of user data is realistic – i.e., an expansion ratio that’s fractional, and indeed better than 1/10:1. Accordingly, despite the lack of […]
[…] also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for […]
[…] also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for […]