PAX Analytica? Row- and column-stores begin to come together
Column-store proponents are prone to argue, in effect, that the only reason to implement an analytic DBMS with row-based storage is laziness. Their case generally runs along these lines:
- Analytic queries commonly return only a fraction of all possible columns.
- Only returning the columns needed:
  - Saves I/O
  - Saves cache space
  - Reduces processing
  - Facilitates compression
- Presumably all those row-based MPP vendors just went row-based because they had a fine row-based DBMS (usually but not always PostgreSQL) to build on.
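To put rough numbers on the I/O part of that argument, here is a back-of-the-envelope sketch in Python. The table, the column widths, and the `bytes_scanned` helper are all made up for illustration, and the model deliberately ignores compression, indexes, and block overhead.

```python
def bytes_scanned(num_rows, column_widths, needed, layout):
    """Estimate bytes read by a full scan (hypothetical, simplified model)."""
    if layout == "row":
        # A row store reads every column of every row, needed or not.
        return num_rows * sum(column_widths.values())
    if layout == "column":
        # A column store reads only the columns the query references.
        return num_rows * sum(column_widths[c] for c in needed)
    raise ValueError(f"unknown layout: {layout}")


# Illustrative table: fixed byte widths per column (assumed values).
widths = {"order_id": 8, "customer_id": 8, "order_date": 4, "amount": 8, "comment": 100}
rows = 1_000_000
query_columns = ["order_date", "amount"]

row_bytes = bytes_scanned(rows, widths, query_columns, "row")
col_bytes = bytes_scanned(rows, widths, query_columns, "column")
print(row_bytes, col_bytes)
```

Under these assumed widths the row layout scans 128 bytes per row and the columnar layout 12, which is the kind of gap the column-store camp likes to point at.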
Pushbacks to this argument from row-based vendors include:
- Yes, but it’s harder to update a column store
- Yes, but there are more steps to retrieving a bunch of columns than there are to retrieving the same information from row stores
plus generous dollops of:
- We’re doing just fine, thank you
- We’re not seeing column stores much in the marketplace
- Don’t believe all that academic hype
- Column stores reek of elderberries, and are powered by hamster wheels
(OK, I made that last one up, but I do hear the other claims frequently.)
However, there are at least two ways in which row- and column-stores are beginning to come together. First, there are lots of rumors about row-store vendors bringing out column-store options, even beyond the recent Ingres/VectorWise announcement. (But anything I may know about same beyond noticing the rumors fly by is surely under NDA.) Second, column-store vendors Vertica and VectorWise are bringing out a kind of row/column hybrid storage option.
Vertica 3.5 introduces what Vertica calls “FlexStore.” A key part of FlexStore is the ability to store data not just in pure columnar format, but also to group columns together in what amounts to sub-rows. This is advantageous when data is retrieved together and, I presume, when it is updated. There’s a tradeoff in giving up column stores’ compression advantages, however, and use of this feature is not recommended for columns that are frequently retrieved independently. Vertica also notes that since it typically uses 1 megabyte block sizes, any table smaller than that shouldn’t be broken into columns at all.
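As a rough illustration of column grouping in general, and emphatically not of Vertica's actual implementation, the following Python sketch stores each group of columns as a list of sub-rows. The `store_with_groups` helper and the sample data are hypothetical.

```python
def store_with_groups(rows, groups):
    """Store each column group as a list of sub-rows (tuples).

    A group with one column behaves like a pure column; a group with
    several columns keeps those values physically together.
    """
    return {group: [tuple(r[c] for c in group) for r in rows] for group in groups}


# Illustrative data: bid and ask are assumed to be retrieved together.
ticks = [
    {"symbol": "AAA", "bid": 10.0, "ask": 10.1},
    {"symbol": "BBB", "bid": 20.0, "ask": 20.2},
]
stored = store_with_groups(ticks, [("symbol",), ("bid", "ask")])
print(stored[("bid", "ask")])
```

The tradeoff described above shows up directly: fetching `bid` and `ask` together is one lookup, but a query that wants only `bid` still has to walk past every `ask` value.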
VectorWise, of course, doesn’t have a product right now, but has gotten a bunch of recent publicity around the column-store product it plans to ship via its partner Ingres in 2010. When I asked Peter Boncz about row/column hybridization inside VectorWise (not federating between Ingres and VectorWise, but rather truly within VectorWise), he said one of the storage options was PAX, and pointed me at a 2001 paper by a group of academics that includes the ubiquitous Dave DeWitt. PAX turns out to stand, in creative spelling, for Partition Attributes Across.
The PAX idea is to store as many rows of data as can fit into a block, but within the block store them in columns. This preserves some of the compression and cache-efficiency benefits of column stores, while also bringing back whole rows in a single step. (I think Vertica’s FlexStore does something similar to this, but I’m not sure.)
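A minimal Python sketch of that layout follows. It is an in-memory toy under my own assumptions (a fixed number of rows per block, and hypothetical helper names `to_pax_pages`, `read_row`, and `scan_column`), not the on-disk format from the PAX paper or from any product.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]
Page = Dict[str, List[Any]]  # column name -> values for the rows in this page


def to_pax_pages(rows: List[Row], rows_per_page: int) -> List[Page]:
    """Split rows into pages, then store each page column by column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        chunk = rows[start:start + rows_per_page]
        pages.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return pages


def read_row(pages: List[Page], rows_per_page: int, row_index: int) -> Row:
    """Rebuild a whole row; every value comes from a single page."""
    page = pages[row_index // rows_per_page]
    offset = row_index % rows_per_page
    return {col: values[offset] for col, values in page.items()}


def scan_column(pages: List[Page], col: str) -> List[Any]:
    """Read one column; within each page its values are contiguous."""
    return [value for page in pages for value in page[col]]


# Illustrative data (assumed schema).
rows = [{"id": i, "region": "EU" if i % 2 else "US", "amount": i * 10} for i in range(5)]
pages = to_pax_pages(rows, rows_per_page=2)
print(pages[0])
print(read_row(pages, 2, 3))
print(scan_column(pages, "amount"))
```

A whole row is reassembled from one page, which is the "single step" point, while each column's values sit next to each other inside the page, which is where the compression and cache benefits would come from.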
Further confusing things, Peter Boncz of VectorWise told me VectorWise can support “any hybrid” of columnar storage and PAX.
Bottom line: The distinction between row- and column-stores isn’t going to go away any time soon, but it is at least beginning to blur a bit.
Comments
11 Responses to “PAX Analytica? Row- and column-stores begin to come together”
[…] VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.) […]
[…] Curt Monash recently noted there are a couple of approaches emerging to hybrid row/column […]
[…] sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” […]
[…] Vertica offers a post on its 3.5 release, with a riff on the popular theme “We’ve fixed some weaknesses in our prior versions that we didn’t previously say we had.” More important, Vertica is pretty clear on the virtues of its hybrid columnar architecture. […]
[…] VectorWise 1.0 is pretty purely columnar. There’s a bit of PAX, but it’s mainly automagic/under the covers. The one user-controlled exception I understood […]
[…] main thing in Aster Data nCluster Version 4.6 is Aster’s version of hybrid row-column store technology. Technical highlights, if I’m getting it right, […]
[…] seems not to be met by any of the vendors cited — including Vertica, which introduced Vertica FlexStore in mid-2009. And while I’m at it — Aster Data nCluster definitely meets criterion […]
[…] to praise Greenplum for true hybrid row/columnar data management, a feature shared by Teradata and Vertica, among others, but not by Oracle, DB2, or […]
We had some talking heads pitching Exadata to us yesterday. I say talking heads because they demonstrated a definite lack of knowledge of certain of their subjects, such as RAC. Anyway, they described the HCC method as storing ‘like’ columns together in a block. So within a bulk load (which, I think, gives us an idea of just what the scope of a Compression Unit is…) they can persist columns that are integers into a block together, columns that are strings into another block together, etc. I’m guessing that they determine the optimal compression algorithm for each block (or set of blocks that hold the same datatype) based on the datatypes in those blocks.
[…] Specifically, if data in a relational table is grouped together according to what row it’s in, then the database manager is called “row-based” or a “row store.” If it’s grouped together according to what column it’s in, then the database management system is called “columnar” or a “column store.” Increasingly, row-based and columnar storage are being hybridized. […]
[…] execution engine such as Impala — can refer to. Within these big blocks, Parquet is PAX-like; i.e., it stores entire rows in the same big block, but does so a column at a time. However, […]