VectorWise, Ingres, and MonetDB
I talked with Peter Boncz and Marcin Zukowski of VectorWise last Wednesday, but didn’t get around to writing about VectorWise immediately. Since then, VectorWise and its partner Ingres have gotten considerable coverage, especially from an enthusiastic Daniel Abadi. Basic facts that you may already know include:
- VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.)
- VectorWise is due to be introduced in 2010. (Peter Boncz said that to me more clearly than I’ve seen in other coverage.)
- VectorWise and Ingres have a deal in which Ingres will at least be the exclusive seller of the VectorWise technology, and hopefully will buy the whole company.
- Notwithstanding that it was once named “MonetDB/X100,” VectorWise is actually not the same thing as MonetDB, another open source columnar analytic DBMS from the same research group.
- The MonetDB and VectorWise research groups consist in large part of academics in Holland, specifically at CWI (Centrum voor Wiskunde en Informatica). But Ingres has a research group working on the project too. (Right now there are about seven “highly experienced” people each on the VectorWise and Ingres sides, although at least the VectorWise folks aren’t all full-time. More are being added.)
- Ingres and VectorWise haven’t agreed exactly how VectorWise and Ingres Classic will play together in the Ingres product line. (All of the obvious possibilities are still on the table.)
- VectorWise is shared-everything, just as Ingres is. But plans — still tentative — are afoot to integrate VectorWise with MapReduce in Daniel Abadi’s HadoopDB project.
The MonetDB project is led by Martin Kersten, with whom I chatted at SIGMOD in June (standing up and not taking notes, so I may have some details wrong). Based on that conversation, my VectorWise call, and other data, I get the impression that:
- Martin has been researching analytic DBMS (mainly but not only relational) since the late 1970s, and has been based at CWI since 1985.
- Peter Boncz has been either second in command of that crew or close to it.
- Martin Kersten, Peter Boncz, and the CWI/MonetDB team in general have gotten all sorts of computer science glory for their work.
- Martin has enjoyed generously stable government research funding for his group, but has found commercialization of the technology more difficult than he might have at, say, Stanford. The figure of 15 MonetDB researchers comes to mind, although I see from Martin’s bio that he oversees a team of ~55 in total.
- One early attempt at commercializing MonetDB turned into a company called Data Distilleries that was sold to SPSS. Peter Boncz was chief architect of Data Distilleries.
- Besides VectorWise, there are at least two other recent spin-off companies from the MonetDB project. One is a zero-headcount shell, set up to facilitate MonetDB project members (and others) consulting to users of the open source MonetDB technology. The other is in stealth mode, focusing on some vertical market.
I further get the impression that VectorWise was actually Marcin Zukowski’s Ph.D. project, with Peter Boncz being his advisor. VectorWise also boasts another Peter Boncz student, who wrote about updating column stores.
As one might expect from the name, VectorWise does vector processing. I.e., the hard part of Marcin’s work was developing vectorized algorithms for one SQL operation after another. Vectorization, pipelining, and FPGAs might all seem to go together — XtremeData certainly seems to think so — but the VectorWise folks preferred to develop for Intel CPUs anyway, for pretty much the usual reasons. Another major theme is trying to get the right things into CPU cache, because in their opinion RAM access is just sooooo painfully slow.
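To make “vectorized algorithms” concrete, here is a minimal sketch of my own (not actual VectorWise code) contrasting classical tuple-at-a-time execution with a vectorized primitive. The batch size of 1024 is an assumption, picked so a vector of values fits comfortably in CPU cache:

```c
/* Tuple-at-a-time vs. vectorized execution -- an illustrative sketch,
 * not actual VectorWise code. VECTOR_SIZE is an assumed batch size. */
#include <stddef.h>
#include <stdint.h>

#define VECTOR_SIZE 1024

/* Classical model: one interpreter round trip per tuple. */
int64_t add_one_tuple(int64_t a, int64_t b) { return a + b; }

/* Vectorized primitive: one call amortized over a whole batch. The tight,
 * branch-free loop is easy for a compiler to pipeline and auto-vectorize. */
void vec_add_int64(const int64_t *a, const int64_t *b, int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

The point of the design is that interpretation overhead is paid once per batch rather than once per value, and a batch stays resident in cache while several such primitives run over it in turn.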
Our discussion of VectorWise’s compression was interesting. Highlights included:
- The design requirement is that decompression work at a rate of 3 gigabytes/second or so. That way the system comes out ahead overall versus the alternative, which I gather is scanning uncompressed data at the 1 gigabyte/second or so the disk subsystem delivers. (There’s a worked version of this arithmetic right after this list.)
- VectorWise takes 4-5 CPU cycles to decompress a tuple.
- VectorWise says it sacrificed compression ratio to achieve speed. That said, VectorWise claims 3-4X compression on TPC-H data, which is no worse than what ParAccel reported, and enjoys higher compression rates on other kinds of data.
- VectorWise decompresses data before manipulating it, and claims that the advantages of operating on compressed data are only significant if — like Vertica but apparently unlike VectorWise — the database stores columns in multiple sort orders each.
- VectorWise’s compression is mainly on numerical and numerical-like (e.g. date) datatypes. An exception is that VectorWise uses dictionary compression on string data when it makes sense to do so.
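A bit of arithmetic shows why the 3 gigabytes/second figure matters: at a claimed 3-4X compression ratio, a disk subsystem delivering 1 gigabyte/second of compressed pages represents roughly 3 gigabytes/second of logical data, so decompression has to run at about that speed or it becomes the new bottleneck. And here is a minimal sketch of why lightweight dictionary decoding can cost only a few cycles per element. This is my illustration, not VectorWise’s actual scheme (Marcin mentions PDICT in the comments below, which adds outlier resistance on top of the basic idea):

```c
/* Vectorized dictionary decode -- an illustrative sketch, not PDICT itself.
 * Each element costs a load, a table lookup, and a store: a few CPU cycles,
 * consistent with multi-gigabyte/second decompression on a single core. */
#include <stddef.h>
#include <stdint.h>

void dict_decode(const uint8_t *codes,     /* compressed column: 1 byte/value */
                 const char *const *dict,  /* table of distinct string values */
                 const char **out,         /* decompressed output vector      */
                 size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = dict[codes[i]];
}
```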
Other notes include:
- VectorWise has technology akin to Microsoft SQL Server’s Shared Scans, in which multiple queries that require similar table scans don’t have to repeat all the redundant scanning work. I need to get better at figuring out which other analytic DBMS do similar things. (A sketch of the shared-scan idea follows this list.)
- While VectorWise hasn’t yet been open-sourced, its code is in the hands of some other academic institutions, used mainly for computer science research (as opposed to, say, as a data store for some kind of scientific experiment).
- VectorWise’s scalability has only been tested up to eight cores.
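Here is the promised sketch of the shared-scan idea. It’s my illustration of the concept, not VectorWise’s or SQL Server’s actual implementation: one physical scan reads each block once and fans it out to every query attached to the scan.

```c
/* Shared scan -- an illustrative sketch of the concept only. One scan
 * thread reads each block once and hands it to all attached queries,
 * instead of each query rescanning the table from disk. Real systems
 * also let queries attach mid-scan and wrap around to where they joined. */
#include <stddef.h>

typedef struct { char data[64 * 1024]; } Block;   /* assumed block size */

typedef struct {
    void (*consume)(void *state, const Block *blk);  /* per-query callback */
    void *state;
} ScanConsumer;

void shared_scan(const Block *table, size_t nblocks,
                 const ScanConsumer *consumers, size_t nconsumers)
{
    for (size_t b = 0; b < nblocks; b++)          /* read each block once  */
        for (size_t q = 0; q < nconsumers; q++)   /* fan out to all queries */
            consumers[q].consume(consumers[q].state, &table[b]);
}
```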
Comments
“the advantages of operating on compressed data are only significant if the database stores columns in multiple sort orders each.”
If your table has few dimensions, this makes no sense. But for high dimensional tables, it rings true. Indeed, columnar compression often comes through run-length encoding (RLE), after sorting (lexicographically). Yet, only the first few columns (in sorting order) will end up compressible by RLE after sorting them.
See for example:
Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering (to appear).
http://arxiv.org/abs/0901.3751
http://www.slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them
This suggests that they are not relying much on RLE. It might be that vector processing does not work well in conjunction with RLE?
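A toy illustration of Daniel’s run-length point (my example, not from the cited paper): sort a two-column table lexicographically and count the RLE runs per column. Only the leading sort column collapses into long runs.

```c
/* Count RLE runs in a column: fewer runs means better RLE compression.
 * Illustrative example only. */
#include <stddef.h>
#include <stdio.h>

size_t count_runs(const int *col, size_t n)
{
    size_t runs = n > 0;
    for (size_t i = 1; i < n; i++)
        if (col[i] != col[i - 1]) runs++;
    return runs;
}

int main(void)
{
    /* A table sorted lexicographically on (a, b): */
    int a[] = {1, 1, 1, 1, 2, 2, 2, 2};   /* leading column: 2 runs */
    int b[] = {1, 2, 3, 4, 1, 2, 3, 4};   /* second column:  8 runs */
    printf("a: %zu runs, b: %zu runs\n", count_runs(a, 8), count_runs(b, 8));
    return 0;
}
```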
Hi Curt,
Thank you for a nice writeup on VectorWise. While generally correct, here are some clarifications:
– the VectorWise technology belongs fully to our company (no academic institution, including CWI, can control it)
– the MonetDB open-source system originated from the PhD research of Peter Boncz under the supervision of Martin Kersten, while the VectorWise database engine is a technology generation later and came out of my own PhD (not MSc) research, supervised in turn by Peter Boncz. Other CWI group members have also made significant contributions to both projects.
– we do hope to make VectorWise technology available as early as possible, and 2010 is very possible, but please do not treat it as an official plan
– as for the string compression, we use something called PDICT, which is a new – outlier resistant – form of dictionary encoding.
– like you wrote, the main thing about the compression methods in VectorWise is that they are much faster than existing methods. As for the performance, we take a few “CPU cycles” (not “steps”) for one element. Links to publications with more technical info can be found on: http://www.vectorwise.com/index_js.php?page=company_origins
– the place to visit for more info on the Ingres VectorWise project is http://www.ingres.com/vectorwise
Best regards,
Marcin Zukowski
Thanks, Marcin!
I edited in two corrections (Ph.D, CPU cycles).
Best,
CAM
@Daniel
One thing to note is that the opinion that working on compressed data is mostly useful for the major ordering columns refers only to RLE compression. Like you write, in cases with large domain cardinality RLE won’t do much for non-sorted data.
Still, other forms of compression can be used and data compressed with those can be analyzed without decompressing, see e.g. http://scholar.google.com/scholar?q=%22The+Implementation+and+Performance+of+Compressed+Databases.%22
m.
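To make the “analyze without decompressing” idea concrete, a toy sketch of my own (not from the paper Marcin cites): with dictionary encoding, an equality predicate can be evaluated directly on the small integer codes, never materializing the strings.

```c
/* Predicate evaluation on compressed data -- illustrative sketch only.
 * The dictionary code for the query literal is looked up once per query;
 * the per-row work then touches only 1-byte codes, not strings. */
#include <stddef.h>
#include <stdint.h>

size_t select_eq(const uint8_t *codes, size_t n,
                 uint8_t target_code, uint32_t *out_rowids)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (codes[i] == target_code)       /* compare codes, not strings */
            out_rowids[found++] = (uint32_t)i;
    return found;
}
```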
There’s a 2008 talk by Peter Boncz about the MonetDB/X100 project that illustrates principles that seem to be used by VectorWise’s DBMS:
http://www.youtube.com/watch?v=yrLd-3lnZ58
Cool stuff,
E.