Column stores vs. vertically-partitioned row stores
Daniel Abadi and Sam Madden followed up their post on column stores vs. fully-indexed row stores with one about column stores vs. vertically-partitioned row stores. Once again, the apparently reasonable way to set up the row-store database backfired badly.*
*Presumably, the details of the test were similar to those Abadi laid out in a long comment to the first post linked above.
However, while this experiment highlights the difficulty of using general-purpose/OLTP DBMS for efficient data warehousing, it isn’t necessarily dispositive for a comparison of columnar systems to data-warehouse-specialist row-based systems. The three reasons suggested for the poor performance of vertically-partitioned row stores were:
- Lots of extra data being dragged around in the skinny rows.
- Lack of ability to use vertical and horizontal partitioning together.
- Optimizers that can’t handle the resulting joins.
Well, row-based systems like Netezza and DATAllegro have forms of range partitioning. (Surely Teradata does too, but I don’t know their product as well as the others.) Netezza in particular has zone maps, while DATAllegro largely chose Ingres over the more obvious PostgreSQL as its underpinning largely for its range partitioning capability. And surely their optimizers are going to do decent jobs on basic data warehouse joins.
Comments
10 Responses to “Column stores vs. vertically-partitioned row stores”
Leave a Reply
Range Partition has been their as long as database have been. DatAllegro or Netezza cannot even compare to a Column store when it comes to scan performance or joining large tables. Range partitions are not valid under adhoc and adhoc is what true analytics is all about.
Netezza might have a chance in that arena but Column stores is the way to go for true adhoc.
This strikes me as more theory based discussion that is completely divorced from the realities of corporate computing. Fortune 100 companies that rely on large scale data analytics would be better served by a discussion of real world, verifiable examples of production implementations of the columnar approach that focus on; concurrency in mixed workload situations, the ability to provide real time integration to ETL applications that reside on disparate hardware/operating system platforms, 92 ANSI SQL and 99 Analytical Windows compliance, the ability to load data from transactional systems, and fail over and recovery characteristics. Until the columnar vendors can provide examples of production applications and describe their approach in these areas the entire discussion is entirely intellectual and frankly meaningless.
DW Consultant,
I didn’t bring up the subject of range partitioning. Abadi and Madden — the extremely smart column store proponents — did.
CAM
Bill,
You’re being overharsh. Even very large companies often have huge data marts with fairly low-concurrency analytic workloads.
CAM
On Paper (and following the web blogs) Column store is neat and clean compared to row stores like Oracle or Teradata ( that i have worked extensively) that requires manual tunning w.r.t index (bit map as in the case of Oracle and careful selection of primary index in the case of Teradata (Primary Key and Primary index in teradata are slightly different)) and preparation of Materialized views or Join index in case or teradata. Although similar to materialized view column stores do have “Projections”.. the implementation of Projections on an MPP environment again on Paper seems neat !. What confuses me is the tuple moving insert concept that i am not used to when compared with regular batch movement of data (ETL jobs that start at 2 am and finish by 5 am daily). Otherwise i love column stores and combined with mondrian presentation of BI info is a potent combination. My 2cents ! 😉
Balaji,
Well, all systems are neater on paper than in real life. 🙂
As for your ETL question:
Everybody has a true batch ETL option. But they also need to deal with cases where customers require sub-day latency. As a practical matter, 5-10 minute latency is a common design target. If a user truly needs latency much lower than that, they may have to pay up for a solution with the strengths — and weaknesses! — of an OLTP relational DBMS.
So how do you add records every few minutes to a column store? Adding them one record at a time is NOT a good idea, since each time you update anything you have to touch every column. Hence, the alternative is some kind of “microbatch” — gather all the records that come in in RAM, and write them to disk once every few minutes. The column store has to know to, on any query, look for data in two places — RAM and disk.
Vertica and ParAccel differ in how they do this. ParAccel says “Hmm. That sounds a lot like a DBMS’ ordinary RAM cache. So we’ll accept records straight into cache, and make sure our cache/disk synchronization is robust enough.”
Vertica, by way of contrast, has a separate in-memory data store, and indeed keeps researching whether the in-memory part should be a row or column store. Not coincidental to Vertica’s decision to have a separate store for ingestion is the fact that the ParAccel solution wouldn’t work for them anyway. Vertica takes compressed data from disk and does query operations on it in its compressed form. Hence, you can’t just bang an uncompressed new record into the RAM part of the system and expect it to fit well.
Best,
CAM
Bill,
It used to be that you trolled regarding some patent lawsuit among columnar DBMS vendors, which as I pointed out (after deleting the other 11 or so copies of your spam) is quite irrelevant to software purchase decisions.
Now you seem to be trolling about columnar DBMS’ lack of suitability for uses that they’re neither being sold nor bought for. Not very interesting either.
CAM
Yet Curt you refuse to directly address my questions instead you attempt to dismiss them and then when that doesn’t work you impugn my intentions, hardly the behavior expected from a blog of open ideas. So are you saying that the columnar approach is not intended to support mixed workloads and high levels of concurrency? It’s just a question.
I wouldn’t say that at all about columnar databases.
What I would say is that, Sybase IQ excepted, I can’t think of a columnar product mature enough to do a great job on concurrency or mixed workloads YET.
As for impugning your intentions — I think that’s entirely appropriate, given that you’re being a thorough-going jerk.
CAM
[…] Abadi and Sam Madden are at it again, following up on their blog posts of six months arguing for the general superiority of column stores over row stores (for analytic query processing). The gist is to recite a number of bases for superiority, beyond […]