VectorWise, Ingres, and MonetDB
I talked with Peter Boncz and Marcin Zukowski of VectorWise last Wednesday, but didn’t get around to writing about VectorWise immediately. Since then, VectorWise and its partner Ingres have gotten considerable coverage, especially from an enthusiastic Daniel Abadi. Basic facts that you may already know include:
- VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.)
- VectorWise is due to be introduced in 2010. (Peter Boncz said that to me more clearly than I’ve seen in other coverage.)
- VectorWise and Ingres have a deal in which Ingres will at least be the exclusive seller of the VectorWise technology, and hopefully will buy the whole company.
- Notwithstanding that it was once named “MonetDB/X100,” VectorWise is actually not the same thing as MonetDB, another open source columnar analytic DBMS from the same research group.
- The MonetDB and VectorWise research groups consist in large part of academics in Holland, specifically at CWI (Centrum voor Wiskunde en Informatica). But Ingres has a research group working on the project too. (Right now there are about seven “highly experienced” people each on the VectorWise and Ingres sides, although at least the VectorWise folks aren’t all full-time. More are being added.)
- Ingres and VectorWise haven’t agreed exactly how VectorWise and Ingres Classic will play together in the Ingres product line. (All of the obvious possibilities are still on the table.)
- VectorWise is shared-everything, just as Ingres is. But plans — still tentative — are afoot to integrate VectorWise with MapReduce in Daniel Abadi’s HadoopDB project.
The MonetDB project is led by Martin Kersten, with whom I chatted at SIGMOD in June (standing up and not taking notes, so I may have some details wrong). Based on that conversation, my VectorWise call, and other data, I get the impression that:
- Martin has been researching analytic DBMS (mainly but not only relational) since the late 1970s, and has been based at CWI since 1985.
- Peter Boncz has been either second in command of that crew or close to it.
- Martin Kersten, Peter Boncz, and the CWI/MonetDB team in general have gotten all sorts of computer science glory for their work.
- Martin has enjoyed generously stable government research funding for his group, but has found commercialization of the technology more difficult than he might have at, say, Stanford. The figure of 15 MonetDB researchers comes to mind, although I see from Martin’s bio that he oversees a team of ~55 in total.
- One early attempt at commercializing MonetDB turned into a company called Data Distilleries that was sold to SPSS. Peter Boncz was chief architect of Data Distilleries.
- Besides VectorWise, there are at least two other recent spin-off companies from the MonetDB project. One is a zero-headcount shell, set up to facilitate MonetDB project members (and others) consulting to users of the open source MonetDB technology. The other is in stealth mode, focusing on some vertical market.
I further get the impression that VectorWise was actually Marcin Zukowski’s Ph.D. project, with Peter Boncz being his advisor. VectorWise also boasts another Peter Boncz student, who wrote about updating column stores.
As one might expect from the name, VectorWise does vector processing. I.e., the hard part of Marcin’s work was developing vectorized algorithms for one SQL operation after another. Vectorization, pipelining, and FPGAs might all seem to go together — XtremeData certainly seems to think so — but the VectorWise folks preferred to develop for Intel CPUs anyway, for pretty much the usual reasons. Another major theme is trying to get the right things into CPU cache, because in their opinion RAM access is just sooooo painfully slow.
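To make “vectorized algorithms” concrete, here is a minimal sketch of my own (not actual VectorWise code) contrasting classical tuple-at-a-time execution with a vectorized primitive. The batch size of 1024 is an assumption, picked so a vector of values fits comfortably in CPU cache:

```c
/* Tuple-at-a-time vs. vectorized execution -- an illustrative sketch,
 * not actual VectorWise code. VECTOR_SIZE is an assumed batch size. */
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1024

/* Classical model: one interpreter round trip per tuple. */
int64_t add_one_tuple(int64_t a, int64_t b) { return a + b; }

/* Vectorized primitive: one call amortized over a whole batch. The tight,
 * branch-free loop is easy for a compiler to pipeline and auto-vectorize. */
void vec_add_int64(const int64_t *a, const int64_t *b, int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The point of the design is that interpretation overhead is paid once per batch rather than once per value, and a batch stays resident in cache while several such primitives run over it in turn.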
Our discussion of VectorWise’s compression was interesting. Highlights included:
- The design requirement is that decompression work at a rate of 3 gigabytes/second or so. That way the system comes out ahead overall versus the alternative, which I gather is scanning uncompressed data at the 1 gigabyte/second or so the disk subsystem delivers. (There’s a worked version of this arithmetic right after this list.)
- VectorWise takes 4-5 CPU cycles to decompress a tuple.
- VectorWise says it sacrificed compression ratio to achieve speed. That said, VectorWise claims 3-4X compression on TPC-H data, which is no worse than what ParAccel reported, and enjoys higher compression rates on other kinds of data.
- VectorWise decompresses data before manipulating it, and claims that the advantages of operating on compressed data are only significant if — like Vertica but apparently unlike VectorWise — the database stores columns in multiple sort orders each.
- VectorWise’s compression is mainly on numerical and numerical-like (e.g. date) datatypes. An exception is that VectorWise uses dictionary compression on string data when it makes sense to do so.
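A bit of arithmetic shows why the 3 gigabytes/second figure matters: at a claimed 3-4X compression ratio, a disk subsystem delivering 1 gigabyte/second of compressed pages represents roughly 3 gigabytes/second of logical data, so decompression has to run at about that speed or it becomes the new bottleneck. And here is a minimal sketch of why lightweight dictionary decoding can cost only a few cycles per element. This is my illustration, not VectorWise’s actual scheme (Marcin mentions PDICT in the comments below, which adds outlier resistance on top of the basic idea):

```c
/* Vectorized dictionary decode -- an illustrative sketch, not PDICT itself.
 * Each element costs a load, a table lookup, and a store: a few CPU cycles,
 * consistent with multi-gigabyte/second decompression on a single core. */
#include <stddef.h>
#include <stdint.h>

void dict_decode(const uint8_t *codes,     /* compressed column: 1 byte/value */
                 const char *const *dict,  /* table of distinct string values */
                 const char **out,         /* decompressed output vector      */
                 size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = dict[codes[i]];
}
```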
Other notes include:
- VectorWise has technology akin to Microsoft SQL Server’s Shared Scans, in which multiple queries that require similar table scans don’t have to repeat all the redundant scanning work. I need to get better at figuring out which other analytic DBMS do similar things. (A sketch of the shared-scan idea follows this list.)
- While VectorWise hasn’t yet been open-sourced, its code is in the hands of some other academic institutions, used mainly for computer science research (as opposed to, say, as a data store for some kind of scientific experiment).
- VectorWise’s scalability has only been tested up to eight cores.
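Here is the promised sketch of the shared-scan idea. It’s my illustration of the concept, not VectorWise’s or SQL Server’s actual implementation: one physical scan reads each block once and fans it out to every query attached to the scan.

```c
/* Shared scan -- an illustrative sketch of the concept only. One scan
 * thread reads each block once and hands it to all attached queries,
 * instead of each query rescanning the table from disk. Real systems
 * also let queries attach mid-scan and wrap around to where they joined. */
#include <stddef.h>

typedef struct { char data[64 * 1024]; } Block;   /* assumed block size */

typedef struct {
    void (*consume)(void *state, const Block *blk);  /* per-query callback */
    void *state;
} ScanConsumer;

void shared_scan(const Block *table, size_t nblocks,
                 const ScanConsumer *consumers, size_t nconsumers)
{
    for (size_t b = 0; b < nblocks; b++)          /* read each block once  */
        for (size_t q = 0; q < nconsumers; q++)   /* fan out to all queries */
            consumers[q].consume(consumers[q].state, &table[b]);
}
```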
Comments
“the advantages of operating on compressed data are only significant if the database stores columns in multiple sort orders each.”
If your table has few dimensions, this makes no sense. But for high dimensional tables, it rings true. Indeed, columnar compression often comes through run-length encoding (RLE), after sorting (lexicographically). Yet, only the first few columns (in sorting order) will end up compressible by RLE after sorting them.
See for example:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering (to appear).
http://arxiv.org/abs/0901.3751
http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them
This suggests that they are not relying much on RLE. It might be that vector processing does not work well in conjunction with RLE?
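A toy illustration of Daniel’s run-length point (my example, not from the cited paper): sort a two-column table lexicographically and count the RLE runs per column. Only the leading sort column collapses into long runs.

```c
/* Count RLE runs in a column: fewer runs means better RLE compression.
 * Illustrative example only. */
#include <stddef.h>
#include <stdio.h>

size_t count_runs(const int *col, size_t n)
{
    size_t runs = n > 0;
    for (size_t i = 1; i < n; i++)
        if (col[i] != col[i - 1]) runs++;
    return runs;
}

int main(void)
{
    /* A table sorted lexicographically on (a, b): */
    int a[] = {1, 1, 1, 1, 2, 2, 2, 2};   /* leading column: 2 runs */
    int b[] = {1, 2, 3, 4, 1, 2, 3, 4};   /* second column:  8 runs */
    printf("a: %zu runs, b: %zu runs\n", count_runs(a, 8), count_runs(b, 8));
    return 0;
}
```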
Hi Curt,
Thank you for a nice writeup on VectorWise. While generally correct, here are some clarifications:
– the VectorWise technology belongs fully to our company (no academic institution, including CWI, can control it)
– the MonetDB open-source system originated from the PhD research of Peter Boncz under the supervision of Martin Kersten, while the VectorWise database engine is a technology generation later and came out of my own PhD (not MSc) research, supervised in turn by Peter Boncz. Other CWI group members have also made significant contributions to both projects.
– we do hope to make VectorWise technology available as early as possible, and 2010 is very possible, but please do not treat it as an official plan
– as for the string compression, we use something called PDICT, which is a new – outlier resistant – form of dictionary encoding.
– like you wrote, the main thing about the compression methods in VectorWise is that they are much faster than existing methods. As for the performance, we take a few “CPU cycles” (not “steps”) for one element. Links to publications with more technical info can be found on: http://www.vectorwise.com/index_js.php?page=company_origins
– the place to visit for more info on the Ingres VectorWise project is http://www.ingres.com/vectorwise
Best regards,
Marcin Zukowski
Thanks, Marcin!
I edited in two corrections (Ph.D, CPU cycles).
Best,
CAM
@Daniel
One thing to note is that the opinion that working on compressed data is mostly useful for the major ordering columns refers only to RLE compression. Like you write, in cases with large domain cardinality RLE won’t do much for non-sorted data.
Still, other forms of compression can be used and data compressed with those can be analyzed without decompressing, see e.g. http://scholar.google.com/scholar?q=%22The+Implementation+and+Performance+of+Compressed+Databases.%22
m.
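To make the “analyze without decompressing” idea concrete, a toy sketch of my own (not from the paper Marcin cites): with dictionary encoding, an equality predicate can be evaluated directly on the small integer codes, never materializing the strings.

```c
/* Predicate evaluation on compressed data -- illustrative sketch only.
 * The dictionary code for the query literal is looked up once per query;
 * the per-row work then touches only 1-byte codes, not strings. */
#include <stddef.h>
#include <stdint.h>

size_t select_eq(const uint8_t *codes, size_t n,
                 uint8_t target_code, uint32_t *out_rowids)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (codes[i] == target_code)       /* compare codes, not strings */
            out_rowids[found++] = (uint32_t)i;
    return found;
}
```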
There’s a 2008 talk by Peter Boncz about the MonetDB/X100 project that illustrates principles that seem to be used by VectorWise’s DBMS:
http://www.youtube.com/watch?v=yrLd-3lnZ58
Cool stuff,
E.