Are column stores really better at compression?
A consensus has evolved that:
- Columnar compression (i.e., value-based compression) compresses better than block-level compression (i.e., compression of bit strings).
- Columnar compression can be done pretty well in row stores.
Still somewhat controversial is the claim that:
- Columnar compression can be done even better in column stores than in row-based systems.
A strong plausibility argument for the latter point is that new in-memory analytic data stores tend to be columnar — think HANA or Platfora; compression is commonly cited as a big reason for the choice. (Another reason is that I/O bandwidth matters even when the I/O is from RAM, and there are further reasons yet.)
One group that made the in-memory columnar choice is the Spark/Shark guys at UC Berkeley’s AMP Lab. So when I talked with them Thursday (more on that another time, but it sounds like cool stuff), I took some time to ask why columnar stores are better at compression. In essence, they gave two reasons — simplicity, and speed of decompression.
In each case, the main supporting argument seemed to be that finding the values in a column is easier when they’re all together in a column store. That makes sense. I imagine the difference would be smallest if the row store had strictly fixed-length fields, with anything like a VARCHAR being tokenized down to a known length. In that case the database could be treated as an array — but you’d also have to wonder whether any significant row-based benefits still remained.
Let’s draw a further distinction between:
- Columnar compression techniques that depend primarily upon individual values — tokenization, prefix compression, etc.
- Columnar compression techniques that depend upon sequences of values, such as run-length encoding (RLE) or delta compression. (Note: Some people think this is the only kind that should be called “columnar”, but I disagree.)
Row stores seem to do OK with individual-value compression, especially tokenization, albeit with some awkwardness. (For example, if cardinality forces you to change token length, that’s a bigger deal in a row store than a columnar system — which may be why DB2 caps the number of tokens at 4096.) But row stores rarely tackle the sequence-of-values stuff; the only counterexample I can think of is Netezza’s delta compression, and Netezza has FPGAs (Field-Programmable Gate Arrays) to speed the whole thing up.
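To make the distinction concrete, here is a minimal sketch (in Java, with made-up names) of the two families: dictionary encoding as an example of individual-value compression, and run-length encoding as an example of sequence-of-values compression. It illustrates the ideas only, not any particular product's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch contrasting the two families of columnar compression
// discussed above. Class and method names are invented for this example.
public class ColumnCompressionSketch {

    // Individual-value compression: dictionary encoding ("tokenization").
    // Each distinct string is replaced by a small integer token; the
    // dictionary is stored once per column.
    static int[] dictionaryEncode(String[] column, List<String> dictionaryOut) {
        Map<String, Integer> tokens = new HashMap<>();
        int[] encoded = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer token = tokens.get(column[i]);
            if (token == null) {
                token = tokens.size();
                tokens.put(column[i], token);
                dictionaryOut.add(column[i]);
            }
            encoded[i] = token;
        }
        return encoded;
    }

    // Sequence-of-values compression: run-length encoding. It only pays off
    // when equal values are adjacent, which is exactly what a (sorted) column
    // store tends to give you and a row store usually does not.
    static List<int[]> runLengthEncode(int[] column) {
        List<int[]> runs = new ArrayList<>(); // each entry: {value, runLength}
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) j++;
            runs.add(new int[]{column[i], j - i});
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        String[] state = {"CA", "CA", "CA", "NY", "NY", "TX"};
        List<String> dict = new ArrayList<>();
        int[] tokens = dictionaryEncode(state, dict);   // [0,0,0,1,1,2], dict = [CA, NY, TX]
        List<int[]> runs = runLengthEncode(tokens);     // runs: (0,3), (1,2), (2,1)
        System.out.println(dict + " -> " + runs.size() + " runs for " + state.length + " values");
    }
}
```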
Summing up, I think it’s fair to say:
- Column stores will almost always have a compression and/or compression performance advantage over row stores.
- Sometimes that advantage will be small.
- Sometimes it won’t be.
Related links
- To counteract some marketing confusion, I explained the difference between columnar compression and columnar storage. (February, 2011)
- I outlined the Netezza and DB2 approaches to row-store columnar compression. (June, 2010 — before IBM bought Netezza)
- I commented on Mike Stonebraker’s thoughts regarding column stores and compression. (March, 2007 — when Vertica was just a pup)
- Oracle’s Hybrid Columnar Compression (September, 2009) is indeed a columnar architecture from the standpoint of compression, even though it’s very row-based from the standpoint of I/O.
Comments
Excellent post, Curt. Columnar storage definitely makes compression easier and faster, for both on-disk data and in-memory data representations.
In our project (Shark/Spark), another important reason we chose to go columnar is to avoid the in-memory object overhead in the Java virtual machine (JVM). This overhead takes two forms when very large datasets are stored in memory: object storage overhead and excessive garbage collection times.
The JVM’s memory model dictates that every object carries 12–16 bytes of memory overhead. In a row-oriented format, each row is represented as one or more Java objects. When the rows themselves are small, this per-object overhead is substantial: using a row-based approach, 200 MB of uncompressed data can occupy up to 1 GB of space in memory, a 5X overhead!
In addition to storage, there are processing implications due to garbage collection (GC), whose cost increases linearly with the number of objects in the heap. When many rows are being processed in a single heap, the result can be millions or even billions of Java objects. For a large heap (e.g., 32+ GB), a GC can take several minutes.
In our column-oriented format, the space footprint is 1/3 to 1/5 of its row-oriented counterpart even without applying compression, and a full GC takes less than a second. Thus, a columnar approach lets us retain the benefits of the JVM while achieving compact and fast in-memory storage.
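A minimal sketch of the layout difference the comment describes, assuming a typical 64-bit JVM; the class names are hypothetical and the overhead figures in the comments are rules of thumb rather than Shark measurements.

```java
// Illustrative sketch of the comment's point about JVM object overhead. The
// figures in comments are rules of thumb for a 64-bit HotSpot JVM, not
// measurements from Shark; all class names here are made up.
public class RowVsColumnFootprint {

    // Row-oriented: one small object per row. Each instance carries a header
    // of roughly 12-16 bytes plus alignment padding, and every reference to it
    // costs another 4-8 bytes, which is large relative to 12 bytes of payload.
    static final class Row {
        final int userId;     // 4 bytes of payload
        final int itemId;     // 4 bytes of payload
        final float rating;   // 4 bytes of payload
        Row(int userId, int itemId, float rating) {
            this.userId = userId; this.itemId = itemId; this.rating = rating;
        }
    }

    // Column-oriented: one primitive array per column for the whole batch.
    // Four heap objects in total (the batch plus three arrays) regardless of
    // row count, so both the per-row overhead and the number of objects the
    // garbage collector must trace shrink dramatically.
    static final class ColumnBatch {
        final int[] userId;
        final int[] itemId;
        final float[] rating;
        ColumnBatch(int n) {
            userId = new int[n]; itemId = new int[n]; rating = new float[n];
        }
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Row[] rows = new Row[n];
        for (int i = 0; i < n; i++) rows[i] = new Row(i, i, 1.0f);   // n heap objects
        ColumnBatch cols = new ColumnBatch(n);                       // 4 heap objects
        System.out.println("row objects: " + rows.length + ", column-batch objects: 4");
    }
}
```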
On the ‘speed of decompression’ point: Column store gives better L1/L2 CPU cache utilization compared to row store when decompressing column data. CPU data pre-fetches work better with column store.
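A minimal sketch of that point: both loops below compute the same kind of sum, but the column-major loop reads memory sequentially, which is what hardware prefetching and the L1/L2 caches reward, while the row-major loop strides past the other fields on every iteration. The field count and layout are hypothetical.

```java
// Minimal sketch of the cache/prefetch point: both loops sum one logical
// column, but the column-major loop reads memory sequentially while the
// row-major loop strides past the other fields on every iteration.
public class ScanLayoutSketch {
    static final int ROWS = 1_000_000;
    static final int FIELDS = 8;   // pretend each row holds 8 int columns

    public static void main(String[] args) {
        int[] columnMajor = new int[ROWS];           // just the one column, contiguous
        int[] rowMajor = new int[ROWS * FIELDS];     // all columns interleaved row by row

        long colSum = 0;
        for (int i = 0; i < ROWS; i++) {
            colSum += columnMajor[i];                // sequential: prefetch-friendly
        }

        long rowSum = 0;
        for (int i = 0; i < ROWS; i++) {
            rowSum += rowMajor[i * FIELDS + 2];      // strided: neighboring, unneeded fields ride along in each cache line
        }

        System.out.println(colSum + " " + rowSum);
    }
}
```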
I used to work for Sybase a long time ago. One of the IQ engineers explained to me that the compression was a side effect of the aim for better performance. It was an added bonus rather than the goal. As far as I know, Sybase IQ was the first to market with this technology, and when I last used it a few years ago, no technology under $1 million came close. I assume that’s changed by now – all vendors seem to have adopted a similar approach.
>[HCC] is indeed a columnar architecture from the standpoint of compression, even though it’s very row-based from the standpoint of I/O.
…I’m not sure I understand this point. Both column- and row-oriented data end up stuffed into blocks. Scanning pure columnar block data and hybrid columnar block data has, in my experience, the same physical I/O profile, though pure columnar suffers no wasted payload (unneeded columns). But, like I said, I don’t think I understand the (perhaps subtle) point you’re making, Curt. Can you clarify?
…Also, the way I like to help people understand Oracle’s Hybrid Columnar Compression is to point out the fact that Oracle’s engine is a row engine. Oracle groups adjacent data blocks into a logical unit called a Compression Unit. During bulk loading, Oracle more or less treats columns as “single-column rows” which get stuffed into Compression Units. It’s a simple way to grasp the concept.
Kevin,
It is my understanding that, in Oracle HCC, if you retrieve the columns you want, you also have to retrieve the columns you don’t want. Is that accurate?
If so, that’s more I/O than only retrieving the columns you want — potentially 10-100X more. And it could be similarly more use of cache, memory bandwidth, and so on.
Hi Curt,
Yes.
A read of a Compression Unit (generally 32KB) will have a number of rows and all of the columns of those rows–albeit re-oriented in the single-column-row manner I described. So, yes, every physical I/O includes unneeded data–unless the SQL statement is “select * from tab” or the user’s table has only a single column.
To make it easier to picture, consider a row of 32KB with one of the columns being of type char(1). A select length(c) from the table will suffer just shy of 32KB of wasted physical I/O and memory buffering overhead. If that same compression unit were used in Oracle’s In-Memory Parallel Query, it would have to flow over the Exadata iDB interconnect (about 32KB wasted) and be installed into the SGA (about 32KB wasted).
So with that I drew a worst-case scenario to help folks understand the functionality.
The moral of the story (as you know) is reducing effort is not the same as eliminating effort.
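Pulling Kevin’s description together, here is a hedged conceptual sketch of a compression unit as a batch of rows pivoted into per-column segments, where any read of the unit fetches every column whether it is needed or not. It illustrates the idea only; the names and structure are invented for the example and are not Oracle’s actual on-disk format.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of a "compression unit" as described in the comments above:
// a batch of rows pivoted into per-column segments ("single-column rows") that
// are stored together, so reading the unit returns all columns whether they are
// needed or not. An illustration of the idea, not Oracle's actual format.
public class CompressionUnitSketch {

    static final class Unit {
        final List<int[]> columnSegments = new ArrayList<>(); // one segment per column
    }

    // Pivot a row batch (rows x columns) into one unit of column segments.
    static Unit buildUnit(int[][] rowBatch, int numColumns) {
        Unit unit = new Unit();
        for (int c = 0; c < numColumns; c++) {
            int[] segment = new int[rowBatch.length];
            for (int r = 0; r < rowBatch.length; r++) {
                segment[r] = rowBatch[r][c];
            }
            unit.columnSegments.add(segment);   // a "single-column row"
        }
        return unit;
    }

    // Projecting one column happens after the whole unit has been fetched:
    // the physical I/O already included every other column's segment.
    static int[] readColumn(Unit unit, int columnIndex) {
        return unit.columnSegments.get(columnIndex);
    }

    public static void main(String[] args) {
        int[][] rows = { {1, 10, 100}, {2, 20, 200}, {3, 30, 300} };
        Unit unit = buildUnit(rows, 3);
        int[] secondColumn = readColumn(unit, 1);   // [10, 20, 30]
        System.out.println(secondColumn.length + " values projected; unit holds "
                + unit.columnSegments.size() + " column segments");
    }
}
```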