Compression in columnar data stores
We have lively discussions going on about columnar data stores vs. vertically partitioned row stores. Part of the debate is visible in the comment thread to a recent post; other parts come in private comments from Stuart Frost of DATAllegro and Mike Stonebraker of Vertica et al.
To me, the most interesting part of what the Vertica guys are saying is twofold. One is that data compression just works better in column stores than in row stores, perhaps by a factor of 3, because “the next thing in storage is the same data type, rather than a different one.” Frankly, although Mike has said this a couple of times, I still haven’t understood why row stores can’t be smart enough to compress just as well. Yes, it’s a little harder than it would be in a columnar system, but I don’t see why the challenge would be insuperable.
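To make the intuition concrete, here is a minimal sketch of my own (illustrative only, not any vendor’s actual codec) of why adjacency of same-typed values matters: a sorted, low-cardinality column run-length encodes almost for free, while the same values interleaved row-wise with other fields give a byte-stream compressor much less to work with.

```python
# Run-length encode a sorted column of repeated values, then contrast
# with the same values stored row-wise, interleaved with other fields.

def rle(values):
    """Collapse consecutive duplicates into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

# A low-cardinality "state" column, stored contiguously and sorted.
column = ["CA"] * 4 + ["NY"] * 3 + ["TX"] * 3
print(rle(column))   # [['CA', 4], ['NY', 3], ['TX', 3]] -- 3 entries for 10 rows

# The same values laid out row-wise, interleaved with per-row fields.
# Consecutive items on disk now alternate types, so simple run-length
# encoding over the raw stream finds almost nothing to collapse.
rows = [(i, "CA" if i < 4 else "NY" if i < 7 else "TX", i * 1.5) for i in range(10)]
```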
The second part is even cooler, namely the claim that column stores allow the processors to operate directly on compressed data. But once again, I don’t see why row stores can’t do that too. For example, when you join via bitmapped indices, exactly what you’re doing is operating on highly-compressed data.
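As a toy illustration of that bitmap point, the sketch below evaluates a conjunctive predicate by ANDing bit vectors, i.e. by working on a compact representation of the qualifying row set rather than on the rows themselves. (The data and helper function are made up for illustration; real engines use compressed bitmap formats, not raw Python integers.)

```python
# Evaluate region = 'west' AND status = 'active' by ANDing bitmaps,
# without touching the underlying rows during the AND itself.

rows = [
    {"region": "west", "status": "active"},
    {"region": "east", "status": "active"},
    {"region": "west", "status": "closed"},
    {"region": "west", "status": "active"},
]

def bitmap(pred):
    """One bit per row: bit i is set if row i satisfies the predicate."""
    bits = 0
    for i, r in enumerate(rows):
        if pred(r):
            bits |= 1 << i
    return bits

west = bitmap(lambda r: r["region"] == "west")
active = bitmap(lambda r: r["status"] == "active")

match = west & active   # the conjunction, computed on the bit vectors
print([i for i in range(len(rows)) if match >> i & 1])   # [0, 3]
```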
Comments
One of the often-overlooked reasons that Vertica compresses so well is that we don’t do updates in place. We can squeeze the data down to its entropy without worrying about what happens if an updated value takes more space, because the updated value gets written somewhere else.
The typical update-in-place row store could perhaps compress a little better than it does today, but it could never come close to our compression schemes. Because we compress sorted data by column, we can sometimes fit millions of values into a single block. A row store can’t repeat this trick, since it needs block-level access to whole rows: the number of column values in a block is the same as the number of rows in the block.
The argument extends to query processing as well. A row store has to fetch the whole block and process entire rows, an operation dominated by I/O time, so there isn’t much for it to gain by operating on compressed data.
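For a rough sense of the block-packing claim above, here is some back-of-the-envelope arithmetic; all of the constants are assumptions picked for illustration, not Vertica’s actual figures.

```python
# How many rows' worth of a column can one block describe?
BLOCK = 64 * 1024      # bytes per block (assumed)
ROW = 200              # bytes per row in a row store (assumed)
PAIR = 12              # bytes per (value, run-length) pair under RLE (assumed)
AVG_RUN = 50_000       # average rows per run in a sorted, low-cardinality column (assumed)

rows_per_block_rowstore = BLOCK // ROW                 # ~327 rows
rows_covered_columnar = (BLOCK // PAIR) * AVG_RUN      # ~273 million rows

print(rows_per_block_rowstore, rows_covered_columnar)
```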
Note – I work for Vertica
[…] recently made a lot of posts about database compression. 3X or more compression is rapidly becoming standard; 5X+ is coming soon […]