Load speeds and related issues in columnar DBMS
Please do not rely on the parts of the post below that are about ParAccel. See our February 18 post about ParAccel instead.
I’ve already posted about yesterday’s chat with Mike Stonebraker regarding Vertica. I naturally raised the subject of load speed, unaware that Mike’s colleague Stan Zdonik had posted at length about load speed the day before. Given that post, it seems timely to go into a bit more detail, and in particular to address three questions:
- Can columnar DBMS do operational BI?
- Can columnar DBMS do ELT (Extract-Load-Transform, as opposed to ETL)?
- Are columnar DBMS’ load speeds a problem outside the two contexts above?
The basics of columnar data loading
I’m pretty familiar with three new columnar DBMS vendors — Vertica, ParAccel, and Infobright, all of whom are clients. Infobright’s architecture is quite unusual, however, so I’ll confine the discussion to ParAccel and Vertica, the latter of which is fresher in my mind.
Both ParAccel and Vertica ingest data as fast as they can, arrange and compress it into the same format that will be used on disk, and blast it onto disk in small batches. How small? Vertica always talks of a gigabyte per node, but I don’t know whether that’s a round number or a fairly precise figure. If it’s precise, then the 3-5 megabytes/second figure in my other post suggests one batch every few minutes. ParAccel, meanwhile, talks about five seconds to tens of minutes latency, depending on update speed, but in their case I think higher volumes lead to higher latency.
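To make that latency guess concrete, here’s the back-of-the-envelope arithmetic. The 1-gigabyte batch size and the 3-5 megabyte/second load rate are the figures cited above; everything else is just division:

```python
# Back-of-the-envelope: how often would a ~1 GB/node batch flush
# at the load rates discussed above?
BATCH_SIZE_MB = 1024           # roughly a gigabyte per node, per Vertica's figure
for load_rate_mb_s in (3, 5):  # the 3-5 MB/second range from my other post
    seconds = BATCH_SIZE_MB / load_rate_mb_s
    print(f"{load_rate_mb_s} MB/s -> one batch every {seconds / 60:.1f} minutes")
# 3 MB/s -> one batch every 5.7 minutes
# 5 MB/s -> one batch every 3.4 minutes
```

That works out to a batch every 3.5-6 minutes, which is where the “every few minutes” figure comes from.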
Both vendors, of course, maintain tables in memory for data that hasn’t made it to disk yet. But here there seems to be a huge difference. Although ParAccel has both in-memory and disk-based modes, it doesn’t appear to run in hybrid form. (I would imagine they’re working hard on that for future releases. If not, they should be.) Thus, ParAccel wouldn’t be suitable right now for an application like web personalization, which does analytics on very recent data.
Vertica, however, makes in-memory data immediately available; queries just have to look for the data in two places. If I have this straight, Vertica’s in-memory structure was originally planned to be a row store; it actually is a column store; and they’re researching whether to switch to a row store after all. I think the main issue is speed of loading; for example, unlike the Vertica disk store, the in-memory column store is not compressed.
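Here’s a minimal sketch of that hybrid pattern: new data lands in an uncompressed in-memory structure and is immediately queryable, queries consult both places, and a background step periodically moves batches to disk. The class and method names are mine for illustration, not Vertica’s:

```python
# Minimal sketch of a hybrid memory/disk store, in the spirit of the
# Vertica design described above. Names are illustrative, not Vertica's.
class HybridColumnStore:
    def __init__(self):
        self.memory_rows = []  # write-optimized, uncompressed, per the post
        self.disk_rows = []    # stands in for the compressed on-disk store

    def insert(self, row):
        # New data lands in memory and is immediately queryable.
        self.memory_rows.append(row)

    def query(self, predicate):
        # A query has to look for data "in two places": disk and memory.
        return [r for r in self.disk_rows + self.memory_rows if predicate(r)]

    def move_out(self, batch_size):
        # Background step: flush a batch from memory to disk.
        # (A real system would sort, encode, and compress here.)
        batch = self.memory_rows[:batch_size]
        self.memory_rows = self.memory_rows[batch_size:]
        self.disk_rows.extend(batch)
```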
Vertica insists that adding data on disk does not involve any kind of sort, and hence merging is very fast. At least one Vertica competitor claims that at least one POC proves otherwise, but maybe there was some kind of misunderstanding or benchmarking snafu. In any case, we are not talking about Sybase IQ-style bitmaps, in which the whole index might need to be rearranged each time a new data value is discovered. (At least, that was the case in the old days; I imagine Sybase is well beyond that drawback now.)
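The no-sort claim is plausible if each flushed batch is already stored in sorted order; consolidating batches is then a linear merge of sorted runs rather than a re-sort. A generic illustration of that technique (not Vertica’s actual code path):

```python
import heapq

# If each flushed batch is already sorted on the table's sort key,
# consolidating batches is a linear merge, not an O(n log n) re-sort.
batch_1 = [(1, "a"), (4, "d"), (9, "i")]  # older on-disk run, sorted by key
batch_2 = [(2, "b"), (3, "c"), (8, "h")]  # newly loaded run, also sorted

merged = list(heapq.merge(batch_1, batch_2))  # single linear pass
print(merged)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (8, 'h'), (9, 'i')]
```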
Can columnar DBMS support operational BI?
“Operational BI” means something different to every marketer who uses it. Sometimes it means “BI done against a DBMS that is being transactionally updated.” Clearly, that leaves out the disk-centric version of ParAccel’s product, and purely in-memory operation without a persistence story obviously isn’t transactional either. On the other hand, Vertica’s hybrid memory-centric approach can in principle support transactions just fine. That said, I imagine its transaction-volume limits are well below those of the better shared-nothing row-based data warehouse DBMS.
Another meaning of “operational BI” is “BI-style queries done in the flow of operational applications.” There’s no particular reason column stores would be worse than row stores at those.
ELT in columnar data stores
More and more, data warehouse DBMS specialists — and their users — talk of ETL being replaced by ELT or perhaps ETLT. The point is that, for the same reasons data warehouse appliances and DBMS are fast at everything else, they’re fast at initial data transformations as well. Indeed, they’re particularly good at queries that examine every row in a whole table, and hence particularly well-suited for transformations of that kind.
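To make the ELT pattern concrete, here’s a toy example using SQLite as a stand-in for a warehouse DBMS: raw data is loaded untransformed, then reshaped inside the database by a single every-row SQL statement — exactly the kind of full-table-scan work these systems are built for. The schema and names are invented for illustration:

```python
import sqlite3

# Toy ELT flow: load raw data first, transform inside the database second.
# SQLite stands in for a warehouse DBMS; the schema is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, amount_cents INTEGER)")

# "Load": bulk-insert the extracted data with no transformation at all.
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("2008-01-14", 1250), ("2008-01-15", 990)],
)

# "Transform": one statement that scans every row, run inside the DBMS.
conn.execute("""
    CREATE TABLE sales AS
    SELECT date(sale_date) AS sale_day,
           amount_cents / 100.0 AS amount_dollars
    FROM raw_sales
""")
print(conn.execute("SELECT * FROM sales").fetchall())
```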
However, I hear of this for row-based systems, not columnar ones. Much of the reason is surely market penetration; new-generation columnar vendors are less established than row-based ones such as Teradata, Netezza, or even (despite its small customer base) DATAllegro. But I also suspect technical reasons. Columnar vendors generally do a lot more massaging of data than row-based vendors in their basic data load and indexing. So integrating that with efficient ELT may not be a Release 1/Release 2 kind of task.
Do columnar DBMS have weaknesses other than those already cited?
As I just said, new-generation columnar DBMS are less mature than their row-based competitors. So from that standpoint, of course they have a variety of failings. But if we leave general maturity aside, I think most of columnar systems’ competitive weaknesses will indeed come in quasi-transactional contexts. Row-based RDBMS are pretty ideal for many transactional purposes, both in the writing and reading of rows, and columnar systems may never wholly overcome that. When it comes to analytic queries, however, columnar is the approach with the inherent advantage.