Columnar compression vs. column storage
I’m getting the increasing impression that certain industry observers, such as Gartner, are really confused about columnar technology. (I further suspect that certain vendors are encouraging this confusion, as vendors commonly do.) So here are some basic points.
A simple way to think about the difference between columnar storage and columnar (or any other kind of) compression is this:
- Columnar storage is a reference to how data is grouped together on disk (or in solid-state memory).
- (Columnar) compression is a reference to whether the actual data is on disk, or whether you save space by storing some smaller substitute for the actual data.
Specifically, if data in a relational table is grouped together according to what row it’s in, then the database manager is called “row-based” or a “row store.” If it’s grouped together according to what column it’s in, then the database management system is called “columnar” or a “column store.” Increasingly, row-based and columnar storage are being hybridized.
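For concreteness, here is a minimal sketch of that difference in grouping, using an invented three-column table; a row store lays whole rows out contiguously, while a column store lays each column out contiguously:

```python
# Sketch: the same table grouped on disk by row vs. by column.
# The table and field names are invented for illustration.
rows = [
    (1, "Alice", "Boston"),
    (2, "Bob",   "Boston"),
    (3, "Carol", "Chicago"),
]

# Row store: values for one row sit next to each other.
row_layout = [value for row in rows for value in row]
# -> [1, 'Alice', 'Boston', 2, 'Bob', 'Boston', 3, 'Carol', 'Chicago']

# Column store: values for one column sit next to each other.
column_layout = {
    "id":   [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "city": [r[2] for r in rows],
}
# A query touching only "city" can read just that one contiguous list,
# instead of scanning past ids and names as well.
print(row_layout)
print(column_layout["city"])
```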
There are two main kinds of compression — compression of bit strings and more intelligent compression of actual data values. Compression of actual data values can reasonably be called “columnar,” in that different columns of data can be compressed in different ways, often depending only on the data in that column.*
*In a variant, a few columns can be compressed together, e.g. the various parts of an address. IBM DB2 and RainStor both do this.
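To illustrate the distinction with a deliberately simplified sketch (illustrative Python, not any vendor's implementation): bit-string compression squeezes a block of bytes without interpreting it, while data value compression substitutes smaller stand-ins for the values of a particular column.

```python
import zlib

city = ["Boston", "Boston", "Chicago", "Boston", "Chicago"]

# 1. Bit-string compression: squeeze the raw bytes of a block,
#    without knowing or caring what the values mean.
raw = "\n".join(city).encode()
squeezed = zlib.compress(raw)

# 2. Data value ("columnar") compression: replace each value with a
#    small token, using a dictionary built from this column alone.
dictionary = {v: i for i, v in enumerate(sorted(set(city)))}
tokens = [dictionary[v] for v in city]   # -> [0, 0, 1, 0, 1]

print(len(raw), len(squeezed), tokens)
```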
The most important form of data value compression is commonly called tokenization, although the term dictionary compression is also used. Frequently, a column of data will have many repeated values — person names, city names, product IDs, product prices, etc. If so, these can be replaced by “tokens,” one per value, where the length of the token is just enough to account for all the unique values, and may be much shorter than the original data field. E.g., even if there are 1 million distinct values in a name field, tokenization could theoretically reduce 256 bytes down to 4.* A particular benefit of tokenization is that its compression can often be carried through into RAM, with joins being executed on tokenized columns.
*Admittedly, that extreme case assumes 256-byte character fields are stored fixed-length, which is ridiculous. There’s a reason for VARCHAR fields.
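As a back-of-the-envelope sketch of that arithmetic, and of the "joins on tokenized columns" point, here is some illustrative Python; the column names and dictionaries are invented, not any vendor's format.

```python
import math

# Minimum token width for 1 million distinct values in a name column:
distinct_values = 1_000_000
bits_needed = math.ceil(math.log2(distinct_values))  # 20 bits
token_bytes = math.ceil(bits_needed / 8)             # 3 bytes; a 4-byte integer is the natural fit
# So even a fixed-length 256-byte name field could shrink to a few bytes per value.

# Joins can be executed on the tokens themselves, without expanding them back.
name_dictionary = {"Alice": 0, "Bob": 1, "Carol": 2}  # shared dictionary (not needed for the join itself)
orders_customer_tokens = [0, 1, 1, 2]                 # tokenized fact-table column
customer_tier_by_token = {0: "Gold", 1: "Silver", 2: "Gold"}  # tokenized dimension lookup

order_tiers = [customer_tier_by_token[t] for t in orders_customer_tokens]
print(token_bytes, order_tiers)   # 3 ['Gold', 'Silver', 'Silver', 'Gold']
```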
Where these ideas come together is that columnar compression usually works (even) better in column stores than in row stores. For example, if you’re using tokenization, and you just do your tokenization block-by-block, it will do more for you if there are more total values from the same column in each block — and that will happen when all the data in a single block comes from the same column.
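A small sketch of why that is, assuming block-local dictionaries and invented data: the fewer distinct values a block holds, the narrower its tokens can be, and single-column blocks hold fewer distinct values.

```python
import math

def token_bits(block_values):
    """Bits per token if we build a dictionary for just this block."""
    return max(1, math.ceil(math.log2(len(set(block_values)))))

names  = ["Alice", "Bob", "Alice", "Bob"]           # 2 distinct values
cities = ["Boston", "Chicago", "Boston", "Boston"]  # 2 distinct values
ids    = [101, 102, 103, 104]                       # all distinct

# Row-store-style block: values from every column are interleaved,
# so the block-local dictionary has to cover all of them.
row_block = [v for row in zip(ids, names, cities) for v in row]
print(token_bits(row_block))                  # 8 distinct values -> 3 bits per token

# Column-store-style blocks: each block holds one column only,
# so each dictionary stays small and the tokens stay narrow.
print(token_bits(names), token_bits(cities))  # 1 bit each
```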
Vendors that truly offer some form of hybrid row/column storage include Vertica, EMC/Greenplum, and Aster Data. Oracle Exadata, in my opinion, does not, but I can see why people might get confused and think otherwise. The tangled Oracle columnar story goes something like this:
- As per a white paper on Oracle Exadata’s “hybrid columnar compression,” Oracle Exadata clearly offers columnar compression.
- A diagram in that white paper shows data being partly grouped by column, in a PAX-like way. Oracle marketing sometimes tries to take “columnar” credit for that.
- That same white paper, as Daniel Abadi noted, contains the phrase “entire rows typically being retrieved with a single I/O.” That is NOT how column stores behave. A DBMS doesn’t need to have all the features advocated by Mike Stonebraker or Barry Zane to be a column store; but if it doesn’t have the fundamental feature of “brings back only the columns you want,” calling it “columnar” is pretty silly.
- Kevin Closson of Oracle graciously confessed, in a blog post and comment thread, that Oracle Exadata is not much of a column store.
- Oracle had a true columnar storage project going in 2009, although I don’t know its status. Also, Oracle evidently came pretty close to buying Vertica at one point.
Comments
Curt-
Would you share your criteria on what constitutes a “true” hybrid row/column storage solution, in your opinion, and how the listed vendors’ technologies satisfy those criteria? It seems difficult to say something is or is not something without having a clear definition to measure against.
Greg,
As you surely know, I’m not a big fan of precise category criteria. But if you’re not doing column-at-a-time I/O, you’re not a column store.
Now, everybody has at least a kluge that lets you do something like that — materialized views if nothing else. But if, say, you’re schlepping along row IDs in every quasi-columnar I/O, that’s not much of a column store.
And just to be clear — strong predicate pushdown doesn’t make something a column store. It’s just another way to try to capture one of columnar storage’s signature benefits. Exadata partisans might say “Hey, we have columnar compression plus predicate pushdown — not to mention bitmapped indices — so that meets many of the same needs and specs as column stores.” They might even be right, in which case the fact that Exadata isn’t a column store wouldn’t be very important. But that’s not the point I was addressing this time around.
Now, when somebody like Greenplum or Aster puts out a true column store option and says “Well, we’ve checked off THAT box,” established column-store vendors are entitled to say “Bully for you — now just how mature is it?” But again, that’s not my topic this time around.
Curt-
Thanks. To my knowledge, no one is passing off a feature named Exadata Hybrid Columnar Compression as a pure column store (which it most certainly is not). However, even Daniel Abadi classifies Exadata Hybrid Columnar Compression as a hybrid in his writeup, which specifically mentions PAX, so I’m wondering why you do not.
Greg,
I’m not at liberty to go into full detail about everything I know about sales tactics, marketing tactics, and so on. So let’s just say:
1. I think the “true columnar” meme is getting increased attention.
2. Given that it is, I’d like people to at least make their categorizations CORRECTLY.
If it were up to me, there would be objective third-party benchmarks that showed performance subject to various kinds of sensitivity analysis: database size, concurrent usage, and a variety of schema types. One of the top distinctions IMO would then be whether or not the schema and workload were inherently column-friendly. It may not be necessary for a row-based system to be fully performance-competitive with a column-based one in the columnar sweet spot, but at least it should offer a respectable alternative.
Absent such benchmarks, people use truly-columnar-I/O as a not entirely ridiculous indicator of fast query execution.
Curt-
It seems you missed my question in your response:
Daniel Abadi classifies Exadata Hybrid Columnar Compression as a hybrid in his writeup which specifically mentions PAX. Why is it that you do not also agree with his classification?
Greg,
Because I’m classifying for a different purpose than he is, making different ground rules appropriate.
To be even blunter — people are pretending that “hybrid row-column store” in the Oracle sense is the same thing as “hybrid row-column store” in the Aster and Greenplum sense, and people are getting misled as a result.
There’s a limit even to my bluntness, however, so I won’t spell out exactly who those “people” I’m referring to are.
Curt-
I’m really uninterested in the non-technical side of this dialog, but I am still very interested in having you share what your “ground rules” are in this evaluation. For example, does an RDBMS need to have both pure row and pure columnar storage? Does it need to have both row and column operators? What exactly does it need, in your opinion?
Greg,
Different people have somewhat different interests. I imagine this observation is no surprise to you. 😉
Curt-
Is that a refusal to share your evaluation criteria on which you have based your opinion? You have avoided responding to my direct request now in several comments. I’m ok with you stating you refuse to share your criteria, but then please do so.
Greg,
Based on the fact that you comment here, it is natural for me to think that you read this blog. If you do, it is natural for me to think that you know I don’t believe in precise definitions of product categories. Based on that, it is natural for me to think that you know there’s no answer to your question, and you’re just repeating it for the sake of being annoying.
What am I missing?
Curt-
There certainly is an answer to my question; it is not just for the sake of being annoying, and my apologies for coming across that way. I’m looking for nothing more than the criteria you used to come to your conclusion. Your conclusion is fine and all, and after all it’s your opinion, but I think many people, especially me, find the details used to draw a conclusion more interesting than the conclusion itself. Also, sometimes people base conclusions on misinformation, so including those details makes for a more valuable discussion. It seems to me that open discussions are generally much more valuable than closed ones.
Greg,
All product category definitions, depending upon context, can become misleading in their application.
If we want to assess the progress of database technology, in practice and academia alike, it is perfectly reasonable to refer to a PAX-like system as a hybrid row-column store, something Daniel Abadi and I both have done.
But if we want to assess whether a product is likely to have columnar-like performance on column-friendly workloads, and if salespeople or marketers are implying that “hybrid” row-column systems have that benefit — then a stricter definition is called for.
So there’s no one single definition.
Yes I’m being vague. I gather that you would like me to be clearer. Would you and your Oracle colleagues also like me to be clearer about the second sentence of my blog post, in which I vaguely observed that vendors can deliberately encourage confusion? It’s something of a package deal.
I am surprised to see no mention of Sybase IQ, a true columnar database. All storage, compression, and accesses are indeed at the column level in this database.
In a true columnar database, rows are shredded into columns and each column is stored separately. A row is formed by associating the column values for that row via a rowid. This is done even for composite indexes; i.e., index data is stored by itself, just like a column.
Tokenization may be used in row-based as well as columnar databases. It has nothing to do with whether or not the database is columnar.
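To illustrate the rowid association Anand describes, here is a deliberately simplified sketch (invented data, not Sybase IQ's actual on-disk format): each column is stored as its own array, and a row is reassembled by picking the same position from each column.

```python
# Simplified sketch of row reconstruction in a column store:
# each column is its own array, and the rowid is just the position.
name  = ["Alice", "Bob", "Carol"]
city  = ["Boston", "Boston", "Chicago"]
price = [10.0, 12.5, 9.0]

def fetch_row(rowid):
    # Re-associate the column values for one row by shared position (rowid).
    return (name[rowid], city[rowid], price[rowid])

print(fetch_row(1))   # ('Bob', 'Boston', 12.5)
```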
Anand,
Agreed both re Sybase IQ and tokenization.
SAND’s architecture is built on compression of bit strings. Our technique is to build and manipulate compressed and encoded bit strings (which we call BIT vectors). We don’t see the process of building them as particularly hard; doing so efficiently is the challenge, and one we have overcome. How to leverage this technology, how often, and in what way is a key part of compression. It is one thing to give someone the keys to a Ferrari; they still need to know what to do once they get behind the wheel, and how to use every last ounce of the available power. SAND does this automatically, the database equivalent of intelligent predictive torque distribution. SAND’s architecture is designed to use BIT vectors wherever possible, for much more than simple bitmap indices.
Regarding tokenization, SAND employs database-wide dictionaries, or Domains, which assign a unique token to every value in the database. This tokenization delivers the compression you refer to; SAP calls it dictionary compression when describing the architecture of HANA. It is rewarding to the team at SAND to see the rest of the database world finally recognize what we knew and developed back in the days of the DeLorean. Your comments about tokenized operations, such as join processing, are spot-on. SAND performs database operations on tokens wherever possible, not just for joins: for all forms of predicate evaluation, semi-joins, group-by/order-by, and text search, and in the Domain structure itself.
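For readers unfamiliar with the bit-vector idea, here is a deliberately simplified sketch of predicate evaluation on bit vectors; it uses plain uncompressed Python lists and invented data, and does not attempt to show SAND's compressed and encoded format.

```python
# Simplified sketch: evaluate "city = 'Boston' AND tier = 'Gold'" with bit vectors.
city = ["Boston", "Chicago", "Boston", "Boston"]
tier = ["Gold",   "Gold",    "Silver", "Gold"]

# One bit per row, set to 1 where the predicate holds.
city_bits = [1 if c == "Boston" else 0 for c in city]   # [1, 0, 1, 1]
tier_bits = [1 if t == "Gold" else 0 for t in tier]     # [1, 1, 0, 1]

# Combining predicates is just a bitwise AND over the two vectors.
matches = [a & b for a, b in zip(city_bits, tier_bits)]  # [1, 0, 0, 1]
print([rowid for rowid, m in enumerate(matches) if m])   # qualifying rowids: [0, 3]
```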