August 5, 2008
Daniel Abadi and Sam Madden on column stores vs. indexed row stores
Daniel Abadi and Sam Madden — for whom I have the highest regard after our discussions regarding H-Store — wrote a blog post on Vertica’s behalf, arguing that column stores are far superior to fully-indexed row stores for not-very-selective queries. They link to a SIGMOD paper backing their argument up, provide some diagrams, and generally make a detailed case. As best I understand, here are some highlights:
- They tested some queries (a benchmark of their own team’s devising, I think) on a column store, a row store normally configured, and a row store with every column indexed and the DBMS forced to use all the indexes. The third option performed TERRIBLY.
- The big reason the third option performed terribly is that the DBMS was forced to do huge amounts of work reconstituting each row — much more than it would have to do in a column store, let alone in ordinary row-store operation.
- They provide an IOU at the end for a follow-on post with a less self-defeating design on the row store. However, it’s not clear whether they plan to consider the case of a row-store with a couple of indexes on each column (one in record order, one in sort order), which would be the most obvious way to simulate the advantages of a columnar system in a conventional row store.
- The benchmark was presumably against Oracle or some such DBMS, rather than a DW-optimized row store such as DATAllegro or Kognitio WX2 (to name two that run on similar hardware to Vertica). It showed a little less than 6X performance advantage for the column store (presumably Vertica or its research predecessor C-Store). It’s not obvious to me that DW-optimized row stores wouldn’t do as well or better.
Comments
8 Responses to “Daniel Abadi and Sam Madden on column stores vs. indexed row stores”
Leave a Reply
[…] Original Patent Prospector […]
[…] Original simplerich […]
Hi Curt,
Thanks for the kind words. Just wanted to add a few clarifications:
1) The column-store we used was C-Store. The reason why we chose to use C-Store over Vertica was that the paper required some additions and subtractions to the column-store code for some experiments not discussed in our blog post (we added a new join technique and made some other changes to the query executer), and it is a lot easier to add and subtract code to a research prototype than to a production ready DBMS (besides, I’m a whole lot more familiar with the C-Store codeline, having written much of it for my PhD thesis).
Unfortunately, C-Store is a pretty slow piece of code. It only runs in single-threaded mode, and was completely unable to take advantage of the extra core on the benchmarking machine. It does not implement horizontal partitioning, which for this benchmark, would have improved performance by at least a factor of two. And the last time anyone attempted to do any serious performance optimization of the code was three years ago (seriously). A whole lot of new code has been written since then.
Hence, C-Store is not representative of commercial column-stores (like Vertica) which do not have these shortcomings. Consequently, that 6X performance difference is much lower than what is typically observed for commercial column-stores.
2) Keep in mind that we benchmarked single-node performance of the database code only. The DW-optimized row-stores such as DATAllegro or Greenplum tend to improve performance by adding parallelism (i.e., MPP/shared-nothing architecture). Their single-node DBMS is typically derived from an open-source DBMS (like Ingres or Postgres) and I would be surprised if they would be significantly faster running on a single machine than the commercial row-store that we used (I would actually guess they would be slower — our DBA used just about every trick in the book, including partitioning, indexing, and materialized views to get the commercial row-store to go as fast as possible).
3) We have come through on our IOU. See: http://www.databasecolumn.com/2008/07/debunking-another-myth-columns.html, though I don’t think our experiments are exactly what you had in mind.
Anyway, I think the bottom line message to take away from our blog post not the comparison of the row-store vs. the column-store; rather it is that an index is a very different data-structure than a column.
[…] Abadi and Sam Madden followed up their post on column stores vs. fully-indexed row stores with one about column stores vs. vertically-partitioned row stores. Once again, the apparently […]
Can you give examples of production implementations of the columnar approach at a Fortune 100 company? If so how does the columnar approach handle mixed work loads? If we have multiple user communities using the same data submitting queries of varying complexity from simple reports to complex analytics, how would this be set up in the columnar approach? Would we need a separate column for each type of query and user? Do you have any real world examples of concurrency tests in mixed work load environments? How would the columnar approach be organized to handle customer level basket analysis with multiple count-distinct aspects in the same query. Can you discuss how the columnar approach would handle data feeds from transactional systems?
Bill,
I presume you think those questions are all rhetorical, but I have in fact discussed how they handle data feeds elsewhere in this blog.
And I addressed the concurrency point on another post where you repeated it.
CAM
I had the great pleasure of reading the paper and attending Daniel’s presentation at SIGMOD. It’s worth mentioning that – besides the main topic discussed here – the paper describes several really cool optimizations helping column stores to achieve better performance, like “invisible join” or “late materialization” which were also considered in some earlier papers.
Going back to “column stores versus row stores”, indeed, the key point seems to compare columns with indices as different types of structures designed to assist in different types of queries. So the question should be whether there is a way to combine advantages of those types of structures to support more mixed workloads including, for example, highly selective ad hoc queries in column stores? How about replacing conventional indices with something new, better adjusted to the columnar-like architectures?
my query is behave like row
and i want to display that as horizontal
e.g.my query is display name like
1. sauraj
2 rajiv
3.vikash
4.vicky
so on…………
but i want to display that like
1.sauraj 2.rajiv 3.vikash 4.vicky
5.vf 6.gsdgASd so on………….