Mike Stonebraker on database compression — comments
In my opinion, the key part of Mike Stonebraker’s fascinating note on data compression was (emphasis mine):
The standard wisdom in most row stores is to use block compression. Hence, a storage block is compressed using a single technique (say Lempel-Ziv or dictionary). The technique chosen then compresses all the attributes in all the columns which occur on the block. In contrast, Vertica compresses a storage block that only contains one attribute. Hence, it can use a different compression scheme for each attribute. Obviously a compression scheme that is type-specific will beat an implementation that is “one size fits all”.
It is possible for a row store to use a type-specific compression scheme. However, if there are 50 attributes in a record, then it must remember the state for 50 type-specific implementations, and complexity increases significantly.
In addition, all row stores we are familiar with decompress each storage block on access, so that the query executor processes uncompressed tuples. In contrast, the Vertica executor processes compressed tuples. This results in better L2 cache locality, less main memory copying and generally much better performance.
Of course, any row store implementation can rewrite their executor to run on compressed data. However, this is a rewrite – and a lot of work.
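To make the quoted argument a little more concrete, here is a minimal Python sketch of type-specific, per-column compression and of answering a query directly on the compressed form. It is an illustration only, not Vertica's implementation: the two encodings (run-length and delta), the function names, and the sample columns are all assumptions chosen for the example.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

def rle_decode(runs):
    """Expand run-length pairs back into the original column."""
    return [v for v, n in runs for _ in range(n)]

def delta_encode(values):
    """Store the first value, then the difference from each predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def count_equal_on_runs(runs, target):
    """COUNT(*) WHERE col = target, computed on the runs without decoding."""
    return sum(n for v, n in runs if v == target)

# Hypothetical table, stored one column at a time.
state = ["CA"] * 4 + ["MA"] * 3 + ["NY"] * 2                       # sorted, low cardinality
order_id = [1000, 1001, 1002, 1005, 1006, 1010, 1011, 1012, 1013]  # slowly increasing

# Each column gets the scheme that suits its type and distribution.
state_runs = rle_encode(state)
order_deltas = delta_encode(order_id)

# The executor can work on the compressed column directly.
ma_rows = count_equal_on_runs(state_runs, "MA")
```

The point of the sketch is that each column picks its own scheme, and that the count touches three runs rather than nine decoded values. A row store that compresses whole blocks with one general-purpose technique would have to decompress the block before evaluating the same predicate.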
That’s a pretty compelling argument. But in theory, I can think of a number of ways for a row store vendor to trump it, including:
1. We agree, and we’ve built a whole set of specialized indices to have the same benefits.
2. (Similarly) We agree, but fortunately we have the money and talent to pull off this very hard development task.
3. That’s nice, but column stores have a natural disadvantage in updates, which matters a lot.
4. Yes, but we have a lot less internode data movement than a column store does.
#1 and #2 surely are not true at this time, as Mike points out. #3 is an area of active debate, and should perhaps be evaluated on an application-by-application basis. #4 is just something I’m throwing out there, which might or might not prove to be valid. (The idea behind it, by the way, is that vertical partitioning comes at the partial expense of other kinds of partitioning, which can sometimes be used to help both sides of a join condition to be satisfied by the data on a particular node.)
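The partitioning idea in that parenthetical can be sketched as follows. This is a toy Python model, assuming hash partitioning on the join key; the table layouts, the node count, and every function name are hypothetical. It shows only why co-partitioning keeps a join node-local, and takes no position on whether a column store actually gives this up.

```python
from collections import defaultdict
from zlib import crc32

def node_for(key, num_nodes):
    """Deterministic hash partitioning: the same key always maps to the same node."""
    return crc32(str(key).encode()) % num_nodes

def partition(rows, key_index, num_nodes):
    """Distribute rows across nodes by hashing the column at key_index."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes

def local_join(order_nodes, customer_nodes, num_nodes):
    """Join each node's fragments independently; no row leaves its node."""
    result = []
    for n in range(num_nodes):
        # Build a hash index over this node's customers only.
        by_id = defaultdict(list)
        for cust in customer_nodes.get(n, []):
            by_id[cust[0]].append(cust)
        # Probe it with this node's orders only.
        for order in order_nodes.get(n, []):
            for cust in by_id.get(order[1], []):
                result.append(order + cust)
    return result

# Hypothetical tables: customers(cust_id, name), orders(order_id, cust_id, amount).
customers = [(1, "Ann"), (2, "Bob"), (3, "Cy"), (4, "Dee")]
orders = [(100, 1, 25), (101, 3, 40), (102, 1, 15), (103, 4, 60), (104, 2, 5)]

NUM_NODES = 3
customer_nodes = partition(customers, 0, NUM_NODES)  # partitioned on cust_id
order_nodes = partition(orders, 1, NUM_NODES)        # partitioned on the same key

joined = local_join(order_nodes, customer_nodes, NUM_NODES)
```

Because both tables are partitioned on `cust_id`, every order lands on the same node as its customer, so the per-node joins together produce the full result with no internode movement.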
But one thing seems sure – unless the row store vendors also come up with great compression stories, Vertica or some other columnar outfit will beat the pants off of them. Compression has arrived, big time.
Comments
8 Responses to “Mike Stonebraker on database compression — comments”
[…] The following is by Mike Stonebraker, CTO of Vertica Systems, copyright 2007, as part of our ongoing discussion of data compression. My comments are in a separate post. […]
[…] I’ve recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon as processor power increases; 10X or more is not unrealistic. True, this applies mainly to data warehouses, but that’s where the big database growth is happening. And new kinds of data — geospatial, telemetry, document, video, whatever — are highly compressible as well. […]
#4 isn’t likely. Internode data movement has to do with what data is on which node, not how the data is organized within the node.
Chuck,
So Vertica does horizontal partitioning (random or range) to the same extent row stores do?
CAM
Looks like Mike’s paid attention to what Sybase IQ (aka Expressway Technologies) developed in the early 90’s. Just look at the TPC-H FDRs for IQ versus Oracle if you want tangible proof on the financial impact of column-level compression, or the lack thereof.
Peter Thawley
Yeah. It’s a pity that you guys never parallelized IQ properly. It might be a fearsome competitor if you had.
Do you have any thoughts on how the choice among bitmaps, other dictionary compression, delta compression, and so on affects updates, loads, and while-compressed processing?
Thanks,
CAM
[…] in March, I suggested that compression was a central and compelling aspect of Vertica’s story. Well, in their new blog, the Vertica guys now strongly reinforce that […]