Are row-oriented RDBMS obsolete?
If Mike Stonebraker is to be believed, the era of columnar data stores is upon us.
Whether or not you buy completely into Mike’s claims, there certainly are cool ideas in his latest columnar offering, from startup Vertica Systems. The Vertica corporate site offers little detail, but Mike tells me that the product’s architecture closely resembles that of C-Store, which is described in this November 2005 paper.
The core ideas behind Vertica’s product are as follows.
- Data warehouse queries only need to retrieve the data in certain columns from disk. Therefore, storing data in columns reduces I/O.
- But a pure column store is hard to update in real time, and data warehouses need real-time updates (both for “real-time” uses and just for error correction). Hence, there is a small (1 gigabyte or so) conventional row store to receive updates, the contents of which are periodically bulk-moved to the column store. The row store lives in main memory, and hence is super-fast. (That’s not how the paper says C-Store was architected, but it seems to be one of the things that got changed for the commercial Vertica implementation.)
- Timestamps are used for inserts and deletes; otherwise, there are no data changes. (Without that kind of approach, the update strategy in Point #2 couldn’t be viable.) A big benefit to these timestamps is that you can assure integrity via “snapshot isolation”; i.e., by a virtual rollback to a recent point in time. Thus, Vertica can get away without any kind of locks or, for that matter, transaction/redo logs. Row-oriented Netezza uses a similar logless, lockless approach.
- Columnar data stores lend themselves to aggressive compression. After all, most sophisticated compression techniques depend upon deltas (or lack of delta!) vs. other values in the same column. And compression works a lot better when the column itself is sorted. Vertica’s compression is carried straight through into query processing. One benefit: It also allows for more use of on-processor Level 2 cache. (Efficient use of Level 2 cache gets mentioned to me a lot these days …)
- Data is stored in overlapping projection views, each of which is sorted by at least one of the columns in the view. Presorting obviously helps with query performance. Of course, this redundancy carries a penalty at load or update time. But the same is true of conventional RDBMS’s indices and, yes, materialized views.
- Data is partitioned “horizontally,” in a shared-nothing environment. I.e., different “rows” go to different nodes. Queries are resolved on each node, and the result sets are combined centrally, with no attempt to ship intermediate results from node to node. Despite the experience of other shared-nothing data warehouse vendors that this approach leads to bottlenecks, Mike is confident it works fine in Vertica’s case.
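A few of the points above can be made concrete with a toy example. The following Python sketch is an illustration only, not Vertica's or C-Store's actual code; every name in it (`build_projection`, `rle_encode`, `count_by_value`, `sum_where`, `visible_at`, and the sample data) is a hypothetical chosen for this example. It assumes a two-column table, stores a projection sorted by one column, run-length encodes that column, answers simple aggregates directly from the compressed runs, and uses insert/delete timestamps to read a snapshot as of a point in time.

```python
from itertools import groupby

# Hypothetical toy table of (region, amount) rows.
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8)]

def build_projection(rows):
    """Sort rows by region, then store each column as its own list."""
    ordered = sorted(rows, key=lambda r: r[0])
    region_col = [r[0] for r in ordered]
    amount_col = [r[1] for r in ordered]
    return region_col, amount_col

def rle_encode(column):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def count_by_value(runs):
    """COUNT(*) GROUP BY region, computed on the compressed runs alone."""
    counts = {}
    for value, length in runs:
        counts[value] = counts.get(value, 0) + length
    return counts

def sum_where(runs, amount_col, wanted):
    """SUM(amount) WHERE region = wanted; run offsets locate matching rows."""
    total, offset = 0, 0
    for value, length in runs:
        if value == wanted:
            total += sum(amount_col[offset:offset + length])
        offset += length
    return total

region_col, amount_col = build_projection(rows)
region_runs = rle_encode(region_col)

# Timestamped versions: rows are only ever inserted or deleted, never changed.
# A row deleted at time t is treated as invisible to snapshots at t or later.
versions = [
    {"row": ("east", 10), "inserted": 1, "deleted": None},
    {"row": ("west", 5), "inserted": 2, "deleted": 4},
    {"row": ("east", 7), "inserted": 3, "deleted": None},
]

def visible_at(versions, snapshot_ts):
    """Return the rows a reader would see as of snapshot_ts, without locks."""
    return [
        v["row"]
        for v in versions
        if v["inserted"] <= snapshot_ts
        and (v["deleted"] is None or v["deleted"] > snapshot_ts)
    ]
```

Because the region column is sorted, five values collapse into three runs, and the count query never touches the amount column at all. A real system would of course work on disk blocks and many nodes rather than Python lists.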
Obviously, my post title was exaggerated; nobody, including Mike, thinks row-oriented data stores are obsolete for OLTP. But what about data warehousing? Will an approach like Vertica’s eventually win versus, say, the shared-nothing row-oriented RDBMS leaders (that would be some combination of IBM, Teradata, Netezza, and DATAllegro, depending on what you mean by “leader”)? Well, apparently Vertica has a bunch of tests going on, at database sizes from the low 100s of gigabytes to the low 10s of terabytes. And of course they have those great-looking benchmark results, for which they swear they tuned competitors’ products with passionate care.
If I have to make an early guess, I’d say that the success of columnar systems will depend in no small part on what kind of data warehouse applications we’re talking about. Referencing a taxonomy I previously posted:
- Pinpoint data lookup doesn’t seem like a great fit for columnar systems. Indeed, traditional rows-and-B-trees would seem to be best.
- Constrained query and reporting would seem to be a sweet spot, even though it’s a sweet spot for some of the best competition as well.
- Cube-filling calculations involve big intermediate result sets. I’m not sure that’s a great fit for columnar systems.
- Hardcore tabular data crunching would seem in many cases to be another sweet spot, again against a lot of competition, at least in some of its sub-categories.
- Text and media search are best done by specialized systems that, at least in the case of text, wind up being quasi-columnar. The same goes for other specialty areas. Systems like Vertica’s have nothing to offer directly to these applications. However, it might be possible for Vertica to integrate with them fairly quickly, given that they’re starting from vaguely similar philosophical roots.
Comments
19 Responses to “Are row-oriented RDBMS obsolete?”
[…] More recently, the argument in that paper has been extended with a benchmark-filled follow-up based on another Stonebraker startup, Vertica. • • • […]
Curt,
I took a look at C-Store a while ago when Vertica first came on the scene. The idea that row-oriented databases are going to be superseded “real soon” by column-oriented has been pushed on and off for around 20 years. Sand and Sybase IQ are OK for small (sub-TB) data warehouses, but they just don’t scale beyond that. Am I missing something in Vertica that would change that?
In practice, our row-oriented DATAllegro appliance is CPU bound for most queries, so I/O isn’t really the bottleneck as it is with most systems. We’re also about to introduce compression to move the bar even further.
Stuart
DATAllegro
Hi Stuart!
Sybase IQ doesn’t scale because it isn’t properly parallelized. I can’t comment on Sand; they certainly claim to scale.
Kognitio, on the other hand, does show every sign of scaling with a columnar, shared-nothing architecture.
The one thing that worries me is what’s highlighted in Point #6 above. Just for what kinds of queries does or doesn’t the system scale? (I also don’t know the answer to that for Kognitio.) Otherwise, the story sounds pretty clean to me.
Best,
CAM
[…] … unless you think that is inherently an oxymoron. I thought I was doing well catching and expanding on a clever pop culture reference. But the folks at columnar DBMS start-up Vertica Systems may have topped that with their slogan […]
[…] IBM sent over a bunch of success stories recently, with DB2’s new aggressive compression prominently mentioned. Mike Stonebraker made a big point of Vertica’s compression when last we talked; other column-oriented data warehouse/mart software vendors (e.g. Kognitio, SAP, Sybase) get strong compression benefits as well. Other data warehouse/mart specialists are doing a lot with compression too, although some of that is governed by please-don’t-say-anything-good-about-us NDA agreements. […]
[…] I’m hard pressed to see why, for some applications, this wouldn’t have all the benefits of the full columnar architectures of, say, Vertica or Kognitio. That said, I can also envision other applications in which Vertica would offer large performance benefits by allowing redundant storage with a variety of sort orders. […]
Hi Curt,
Bloor Research recently published an excellent evaluation white paper on Sybase IQ, authored by Philip Howard, which addresses (among other subjects) the Sybase IQ approach to parallelization.
http://www.sybase.com/content/1035804/SybaseIQ-12.7-010407-wp.pdf
As for Sybase IQ’s ability to scale — it has been dramatically demonstrated in a number of benchmark exercises (up to 155 TB) and customer implementations (40+ TB in production). A few examples:
http://www.sybase.com/detail_list?id=49108
http://www.sybase.com/detail?id=1027323
http://www-03.ibm.com/systems/p/solutions/sybase/iq/index.html
The entry of Vertica and other players into the column-based database market helps to demonstrate the growth potential of this space. We can expect to see more such entrants as database sizes continue to increase and organizations continue to look for technology that can reliably handle their analytics requirements.
Phil Bowermaster
Sybase
Hi Phil,
Nice paper! Did you guys sponsor it? I didn’t see any disclosure statements about that, but I noticed that “evaluation” was in quotes in the title.
Either way, I’m a great admirer of Philip Howard’s unrelentingly optimistic view of technology, as per http://www.dbms2.com/2006/05/15/philip-howard-likes-viper/. And I wonder whether it’s really true that the appliance vendors don’t do tokenization/dictionary compression. If they don’t, they surely should, and probably will soon.
Seriously, I’d be interested to learn what unnatural acts you did or didn’t have to perform to scale that high. And I’d really like to learn about the complexity you do or don’t offer in text analysis, since I’ve long thought that columnar relational indexing and text indexing were apt to fit very well together.
CAM
Hi all,
well, BEFORE you invented Vertica, and BEFORE Sybase shipped its column-oriented product, Valentina Database (www.paradigmasoft.com) was already introduced in 1998, with major development starting in 1994-1995.
Interesting to compare 🙂
Hi Ruslan,
I’m trying to remember when Bob Epstein of Sybase first enthused to me about the Expressway acquisition, and I think it was a little earlier than the timeframe you’re suggesting.
Anyhow, after looking at your website I have a few suggestions:
1. If your main claim is speed, don’t have the benchmark link be dead.
2. Developer pricing is a bad business model in most markets.
3. Your web site doesn’t really say very much.
4. You need a copy editor who is a native English speaker.
Best regards,
CAM
Hi all,
where is the novelty of column-oriented DBMS? Is this storage architecture another name for vertical partitioning in traditional RDBMS?
Ileana
Hi Ileana,
You might want to look through http://www.dbms2.com/category/database-theory-practice/columnar-database-management/ for some ideas and answers. ParAccel and SAP would say that columnar architectures make memory-centric processing easier. Vertica and Infobright would say they make compression easier. DATAllegro and other row-based vendors, however, would offer the same skeptical questions you did.
Best,
CAM
Sybase IQ doesn’t scale beyond one TB… Damn, I must tell my client that; they have been using Sybase IQ for a 7 TB DWH for the past 3 years (40 TB raw data, btw)…
Steve,
As I asked above — are there any unnatural acts of partitioning reflected in the SQL to get that kind of scalability?
Any serious DBMS can scale almost arbitrarily large if you just put a lot of database instances side by side …
CAM
[…] similar arguments to me a few days ago. They are not wholly unbiased; indeed, both are involved in Vertica Systems. With that caveat, they have an interesting three-part […]
[…] entirely in-memory and hence is limited in possible database size. Mike Stonebraker’s startup Vertica is of course the new kid on the block, and there are other columnar startups as well whose names […]
[…] http://www.theregister.co.uk/2009/04/10/ibm_system_s_super/ and http://www.informationweek.com/news/software/database/showArticle.jhtml?articleID=207801436 and http://www.dbms2.com/2007/01/22/are-row-oriented-rdbms-obsolete/ […]
[…] columns themselves can be used as indexes in the usual Vertica-like […]