Columnar database management
Analysis of products and issues in column-oriented database management systems.
Notes on the Oracle Database 11g Release 2 white paper
The Oracle Database 11g Release 2 white paper I cited a couple of weeks ago has evidently been edited, given that a phrase I quoted last month is no longer to be found. Anyhow, here are some quotes from and comments on what evidently is the latest version. Read more
HadoopDB
Despite a thoughtful heads-up from Daniel Abadi at the time of his original posting about HadoopDB, I’m just getting around to writing about it now. HadoopDB is a research project carried out by a couple of Abadi’s students. Further research is definitely planned. But it seems too early to say that HadoopDB will ever get past the “research and oh by the way the code is open sourced” stage and become a real code line — whether commercialized, open source, or both.
The basic idea of HadoopDB is to put copies of a DBMS at different nodes of a grid, and use Hadoop to parcel work among them. Major benefits when compared with massively parallel DBMS are said to be:
- Open/cheap/free
- Query fault-tolerance
- The related concept of tolerating node degradation that isn’t an outright node failure.
HadoopDB has actually been built with PostgreSQL. That version achieved performance well below that of a commercial DBMS “DBX”, where X=2. Column-store guru Abadi has repeatedly signaled his intention to try out HadoopDB with VectorWise at the nodes instead. (Recall that VectorWise is shared-everything.) It will be interesting to see how that configuration performs.
The real opportunity for HadoopDB, however, in my opinion may lie elsewhere. Read more
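The following minimal sketch illustrates the pattern, with sqlite3 standing in for the per-node DBMS and a plain loop standing in for Hadoop scheduling one map task per node. It is only an illustration of the idea (HadoopDB itself is Java on Hadoop with PostgreSQL at the nodes), and all names and data below are invented.

```python
# Sketch of the HadoopDB pattern: a single-node DBMS at each grid node,
# with "Hadoop" parceling out work and combining the partial results.
import sqlite3

def make_node(rows):
    """Create one 'node': a single-node DBMS holding a shard of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

nodes = [make_node([("east", 10.0), ("west", 5.0)]),
         make_node([("east", 7.0)]),
         make_node([("west", 2.0), ("east", 1.0)])]

# "Map": Hadoop would push this SQL fragment down to each node's local DBMS
# and run the calls in parallel; here we simply loop over the nodes.
pushed_down_sql = "SELECT region, SUM(revenue) FROM sales GROUP BY region"
partials = [node.execute(pushed_down_sql).fetchall() for node in nodes]

# "Reduce": combine the per-node partial aggregates into the final answer.
totals = {}
for rows in partials:
    for region, revenue in rows:
        totals[region] = totals.get(region, 0.0) + revenue
print(totals)   # east: 18.0, west: 7.0
```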
Oracle Exadata hybrid columnar compression
Oracle Database 11g Release 2 is out, and as usual I wasn’t briefed — perhaps because Oracle is more scared than its competitors are of hard questions, perhaps for some other reason entirely.* Anyhow, Oracle Database 11g Release 2 contains an Exadata-only feature called hybrid columnar compression. The Oracle Database 11g Release 2 white paper says “data is grouped, ordered, and stored one column at a time.” But Kevin Closson clarifies:
The word hybrid is important.
Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.
So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.
That sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” provides the compression benefits of column stores, but not column stores’ I/O benefits, and also not any kind of in-memory compression. Read more
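For intuition about why a PAX-like layout can compress like a column store without necessarily delivering a column store's I/O savings, here is a toy sketch of a "compression unit": a group of rows pivoted to column order within one block and compressed column by column. This is a guess at the general shape of such a layout, not Oracle's actual on-disk format, and the data is invented.

```python
import pickle
import zlib

def build_compression_unit(rows):
    """Pivot a group of rows into per-column arrays, compressing each column.
    Like values sit together, so column-store-style compression applies."""
    columns = list(zip(*rows))                      # column-major within the unit
    return [zlib.compress(pickle.dumps(col)) for col in columns]

def read_column(unit, col_index):
    """In a real block layout the whole unit occupies contiguous disk blocks,
    so touching any one column still costs a unit-sized read; only the
    decompression can be column-granular. Hence compression benefits without
    a column store's full I/O benefits."""
    return pickle.loads(zlib.decompress(unit[col_index]))

rows = [(i, "widget", i * 1.5) for i in range(1000)]    # invented toy data
unit = build_compression_unit(rows)
print(sum(len(c) for c in unit), "compressed bytes;",
      read_column(unit, 2)[:3])                         # -> (0.0, 1.5, 3.0)
```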
Categories: Columnar database management, Data warehousing, Database compression, Exadata, Oracle, Theory and architecture | 20 Comments |
Sybase IQ technical highlights
General highlights of the Sybase IQ technical story include:
- Sybase IQ is an analytic DBMS with a columnar/column-store architecture
- Unlike most analytic DBMS, Sybase IQ has a shared-disk architecture.
- The Sybase IQ indexing story is a bit complicated, with a bunch of different index kinds. Most are focused on columns with low cardinality, and at least in some cases are a lot like bitmaps (see the sketch after this list). (Sybase IQ when first introduced was a pure bitmap index product, with a single index type “Fast Project”.) But one index kind, “High Group”, designed for columns with high cardinality, is an exception to most generalities about Sybase IQ index kinds, and instead is more akin to a b-tree.
- Unlike Vertica, Sybase stores each column of data only once. I don’t see how it would make sense to have multiple indexes on the same column, but I didn’t actually ask whether doing so is possible or common.
- Sybase estimates that Sybase IQ requires ¼ the DBA effort of, say, Oracle. (Frankly, that’s not a particularly good figure.) Obviously, this is just a broad-brush average.
- Sybase recently repurposed an acquired ETL tool to be focused on Sybase IQ. IQ of course also works with various third-party tools, certified or otherwise.
- Sybase’s Power Designer CASE (Computer-Aided Software Engineering)/database design tool works with Sybase IQ.
- Sybase is proud of Sybase IQ’s new in-database analytics capabilities, but I haven’t yet grasped what, if anything, is differentiated about them.
- Sybase has an ILM (Information Lifecycle Management) story built around the point that different columns can be stored on different kinds of media.
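As a generic illustration of the bitmap-index idea that those low-cardinality index kinds resemble, here is a toy sketch: one bit vector per distinct value, so predicates become bitwise ANDs and ORs over the vectors. It is not Sybase IQ's actual structure, and the column data is invented.

```python
from collections import defaultdict

def build_bitmap_index(column_values):
    bitmaps = defaultdict(int)              # distinct value -> bit vector (int)
    for row_id, value in enumerate(column_values):
        bitmaps[value] |= 1 << row_id       # set the bit for this row
    return bitmaps

status = ["open", "closed", "open", "pending", "open", "closed"]
region = ["east", "west", "east", "east", "west", "east"]
idx_status = build_bitmap_index(status)
idx_region = build_bitmap_index(region)

# WHERE status = 'open' AND region = 'east'  ->  AND of two bit vectors
hits = idx_status["open"] & idx_region["east"]
print([row for row in range(len(status)) if hits >> row & 1])   # -> [0, 2]
```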
Highlights of the Sybase IQ compression story include: Read more
Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Sybase, Theory and architecture | 11 Comments |
Kickfire’s FPGA-based technical strategy
Kickfire’s basic value proposition is that, if you have a data warehouse in the 100s of gigabytes, they’ll sell you – for $32,000 – a tiny box that solves all your query performance problems, as per the Kickfire spec sheet. And Kickfire backs that up with a pretty cool product design. However, thanks in no small part to what was heretofore Kickfire’s penchant for self-defeating secrecy, the Kickfire story is not widely appreciated.
Fortunately, Kickfire is getting over its secrecy kick. And so, here are some Kickfire technical basics.
- Kickfire is MySQL-based, with all the SQL functionality and lack of functionality that entails.
- The Kickfire/MySQL DBMS is columnar, with the usual benefits in compression and I/O reduction.
- Kickfire is based on FPGAs (Field-Programmable Gate Arrays).
- The Kickfire DBMS is ACID-compliant.
- Kickfire runs only as a single-box appliance.
- While Kickfire earlier estimated that, at least for data sets that compressed well, a Kickfire box could hold 3-10 terabytes of user data, more recent figures I’ve heard from Kickfire have been in the 1-1/2 terabyte range. (Edit: Karl Van Der Bergh subsequently wrote in to say that the 1-1/2 TB figure refers to raw disk, not user data.)
The new information there is that Kickfire relies on an FPGA; Read more
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Kickfire, MySQL, Theory and architecture | 16 Comments |
FlexStore and the rest of Vertica 3.5
Today, Vertica is announcing its 3.5 release, timed in line with a TDWI conference. Vertica 3.5 is scheduled to go into beta test in mid-August and be released to general availability in early October. Vertica 3.5 highlights include:
- Vertica/MapReduce integration, which I’m covering in a separate post.
- A new storage architecture called Vertica FlexStore, which essentially boils down to three things:
- A sort of row/column hybridization — Vertica would probably prefer to call it something like a column clustering feature — that I’m also covering in a separate post.
- The beginnings of a multi-temperature capability, somewhat akin to Teradata Virtual Storage.
- Enhancements to Vertica’s WOS (Write-Optimized Store, the in-memory part of Vertica that first receives updates). I don’t understand WOS architecture well enough to write about that yet.
- Load-balancing, to route queries evenly among Vertica nodes — probably just round-robin — rather than having them be processed by whichever node happens to receive them.
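Since "round-robin" is only a guess, the following is just a minimal illustration of what round-robin query routing across peer nodes would look like; the node names and routing stub are invented, and this is not Vertica's actual mechanism.

```python
import itertools

class RoundRobinRouter:
    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)    # endless rotation over the nodes

    def route(self, query):
        node = next(self._cycle)                # next node in strict rotation
        return node, query                      # a real client would submit it there

router = RoundRobinRouter(["vertica-node-1", "vertica-node-2", "vertica-node-3"])
for q in ["SELECT 1", "SELECT 2", "SELECT 3", "SELECT 4"]:
    print(router.route(q))                      # nodes 1, 2, 3, then 1 again
```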
PAX Analytica? Row- and column-stores begin to come together
Column-store proponents are prone to argue, in effect, that the only reason to implement an analytic DBMS with row-based storage is laziness. Their case generally runs along the following lines:
- Analytic queries commonly return only a fraction of all possible columns.
- Only returning the columns needed (a back-of-the-envelope sketch follows this list):
  - Saves I/O
  - Saves cache space
  - Reduces processing
  - Facilitates compression
- Presumably all those row-based MPP vendors just went row-based because they had a fine row-based DBMS (usually but not always PostgreSQL) to build on.
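Here is the back-of-the-envelope arithmetic behind the I/O point; the table shape and value widths are invented purely for illustration, and compression is ignored.

```python
row_count       = 100_000_000   # rows in a fact table
total_columns   = 60            # columns in the table
touched_columns = 4             # columns the query actually references
bytes_per_value = 8             # crude uniform width per value

row_store_scan    = row_count * total_columns   * bytes_per_value  # reads every column
column_store_scan = row_count * touched_columns * bytes_per_value  # reads only what it needs

print(f"row store scans    {row_store_scan / 1e9:6.1f} GB")        # 48.0 GB
print(f"column store scans {column_store_scan / 1e9:6.1f} GB")     # 3.2 GB
print(f"roughly {row_store_scan / column_store_scan:.0f}x less I/O for the column store")
```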
Pushbacks to this argument from row-based vendors include:
- Yes, but it’s harder to update a column store
- Yes, but retrieving a bunch of columns takes more steps than retrieving the same information from a row store
Categories: Analytic technologies, Columnar database management, Data warehousing, Theory and architecture, VectorWise, Vertica Systems | 11 Comments |
Vertica’s version of MapReduce integration
I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:
- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
- Cloudera’s DBInputFormat assumes the database runs on a single computer, or a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, and so Vertica adapted the DBInputFormat technology accordingly.
- Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and ones who don’t.
- Vertica lets you do Reduce functions (or Map functions, if you don’t push them down to the database) on a separate cluster from the one that runs the database software. Vertica asserts that its customers and prospects all want to do this. Right here is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
- Indeed, Vertica lets you Reduce into a different DBMS than Vertica, if you choose.
- Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there were some limits on how fast one can add or subtract nodes in a Vertica grid, because there’s data redistribution involved. But one can add/change/delete Hadoop clusters extremely quickly.
Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically: Read more
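The following toy sketch shows where each of those steps runs: the Map pushed down as SQL inside the source database, the Reduce outside it (on what would be a separate Hadoop cluster), and the results landing in a different DBMS. sqlite3 stands in for both databases, the table and column names are invented, and Vertica's real integration is Java code built on Hadoop, not anything like this; the point is only the division of labor.

```python
import sqlite3

source = sqlite3.connect(":memory:")             # stand-in for a Vertica node
source.execute("CREATE TABLE page_views (user_id INT, url TEXT)")
source.executemany("INSERT INTO page_views VALUES (?, ?)",
                   [(1, "/a"), (1, "/b"), (2, "/a")])

# "Map pushed down to the database": per-key partial counts computed as SQL,
# so only (key, partial_count) pairs ever leave the DBMS.
partials = source.execute(
    "SELECT user_id, COUNT(*) FROM page_views GROUP BY user_id").fetchall()

# "Reduce on a separate cluster": merge partials from all nodes (one node here).
merged = {}
for user_id, n in partials:
    merged[user_id] = merged.get(user_id, 0) + n

# "Reduce into a different DBMS than Vertica": land the results elsewhere.
target = sqlite3.connect(":memory:")             # stand-in for some other DBMS
target.execute("CREATE TABLE user_view_counts (user_id INT, views INT)")
target.executemany("INSERT INTO user_view_counts VALUES (?, ?)", merged.items())
print(target.execute("SELECT * FROM user_view_counts ORDER BY user_id").fetchall())
```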
VectorWise, Ingres, and MonetDB
I talked with Peter Boncz and Marcin Zukowski of VectorWise last Wednesday, but didn’t get around to writing about VectorWise immediately. Since then, VectorWise and its partner Ingres have gotten considerable coverage, especially from an enthusiastic Daniel Abadi. Basic facts that you may already know include:
- VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.)
- VectorWise is due to be introduced in 2010. (Peter Boncz said that to me more clearly than I’ve seen in other coverage.)
- VectorWise and Ingres have a deal in which Ingres will at least be the exclusive seller of the VectorWise technology, and hopefully will buy the whole company.
- Notwithstanding that it grew out of a research project with “MonetDB” in its name (MonetDB/X100), VectorWise actually is not the same thing as MonetDB, another open source columnar analytic DBMS from the same research group.
- The MonetDB and VectorWise research groups consist in large part of academics in Holland, specifically at CWI (Centrum voor Wiskunde en Informatica). But Ingres has a research group working on the project too. (Right now there are about seven “highly experienced” people each on the VectorWise and Ingres sides, although at least the VectorWise folks aren’t all full-time. More are being added.)
- Ingres and VectorWise haven’t agreed exactly how VectorWise and Ingres Classic will play together in the Ingres product line. (All of the obvious possibilities are still on the table.)
- VectorWise is shared-everything, just as Ingres is. But plans — still tentative — are afoot to integrate VectorWise with MapReduce in Daniel Abadi’s HadoopDB project.
Categories: Actian and Ingres, Analytic technologies, Columnar database management, Data warehousing, Database compression, MonetDB, Open source, Theory and architecture, VectorWise | 12 Comments |
While I’m venting about benchmarks
Late last year, Vertica made hoo-hah about what it called a world-record data warehouse load speed benchmark. I wrote at the time that this showed Vertica wasn’t painfully slow at loading, always a concern with column stores. But otherwise I mocked the idea that there was something useful to be learned from the whole exercise.
Well, guess what? In a throwaway line in a comment on Daniel Abadi’s blog, Barry Zane of ParAccel pointed out:
we posted a load rate of almost 9TB/hour, which is, of course record breaking on its own
Quite right.
I hope the nonsense stops there, but I’m not optimistic …