Database compression
Analysis of technology that compresses data within a database management system. Related subjects include:
Notes on the Oracle Database 11g Release 2 white paper
The Oracle Database 11g Release 2 white paper I cited a couple of weeks ago has evidently been edited, given that a phrase I quoted last month is no longer to be found. Anyhow, here are some quotes from and comments on what evidently is the latest version. Read more
Oracle Exadata hybrid columnar compression
Oracle Database 11g Release 2 is out, and as usual I wasn’t briefed — perhaps because Oracle is more scared than its competitors are of hard questions, perhaps for some other reason entirely.* Anyhow, Oracle Database 11 Release 2 contains an Exadata-only feature called hybrid columnar compression. The Oracle Database 11g Release 2 white paper says “data is grouped, ordered, and stored one column at a time.” But Kevin Closson clarifies:
The word hybrid is important.
Rows are still used. They are stored in an object called a Compression Unit. Compression Units can span multiple blocks. Like values are stored in the compression unit with metadata that maps back to the rows.
So, “hybrid” is the word. But, none of that matters as much as the effectiveness. This form of compression is extremely effective.
That sounds a whole lot like PAX. Specifically, in Oracle’s case I would guess “hybrid columnar compression” provides the compression benefits of column stores, but not column stores’ I/O benefits, and also not any kind of in-memory compression. Read more
Categories: Columnar database management, Data warehousing, Database compression, Exadata, Oracle, Theory and architecture | 20 Comments |
Sybase IQ technical highlights
General highlights of the Sybase IQ technical story include:
- Sybase IQ is an analytic DBMS with a columnar/column-store architecture
- Unlike most analytic DBMS, Sybase IQ has a shared-disk architecture.
- The Sybase IQ indexing story is a bit complicated, with a bunch of different index kinds. Most are focused on columns with low cardinality, and it least in some cases are a lot like bitmaps. (Sybase IQ when first introduced was a pure bitmap index product, with a single index type “Fast Project”.) But one index kind, “High Group” — designed for columns with high cardinality – is an exception to most generalities about other Sybase IQ index kinds, and instead is more akin to a b-tree.
- Unlike Vertica, Sybase stores each column of data only once. I don’t see how it would make sense to have multiple indexes on the same column, but I didn’t actually ask whether doing so is possible or common.
- Sybase estimates that Sybase IQ requires ¼ the DBA effort of, say, Oracle. (Frankly, that’s not a particularly good figure.) Obviously, this is just a broad-brush average.
- Sybase recently repurposed an acquired ETL tool to be focused on Sybase IQ. IQ of course also works with various third-party tools, certified or otherwise.
- Sybase’s Power Designer CASE (Computer-Aided Software Engineering)/database design tool works with Sybase IQ.
- Sybase is proud of Sybase IQ’s new in-database analytics capabilities, but I haven’t yet grasped what, if anything, is differentiated about them.
- Sybase has an ILM (Information Lifecycle Management) story built around the point that different columns can be stored on different kinds of media.
Highlights of the Sybase IQ compression story include: Read more
Categories: Analytic technologies, Columnar database management, Data warehousing, Database compression, EAI, EII, ETL, ELT, ETLT, Sybase, Theory and architecture | 11 Comments |
Kickfire’s FPGA-based technical strategy
Kickfire’s basic value proposition is that, if you have a data warehouse in the 100s of gigabytes, they’ll sell you – for $32,000 – a tiny box that solves all your query performance problems, as per the Kickfire spec sheet. And Kickfire backs that up with a pretty cool product design. However, thanks in no small part to what was heretofore Kickfire’s penchant for self-defeating secrecy, the Kickfire story is not widely appreciated.
Fortunately, Kickfire is getting over its secrecy kick. And so, here are some Kickfire technical basics.
- Kickfire is MySQL-based, with all the SQL functionality and lack of functionality that entails.
- The Kickfire/MySQL DBMS is columnar, with the usual benefits in compression and I/O reduction.
- Kickfire is based on FPGAs (Field-Programmable Gate Arrays).
- The Kickfire DBMS is ACID-compliant.
- Kickfire runs only as a single-box appliance.
- While Kickfire earlier estimated that, at least for data sets that compressed well, a Kickfire box could hold 3-10 terabytes of user data, more recent figures I’ve heard from Kickfire have been in the 1-1 /2 terabyte range. (Edit: Karl Van Der Bergh subsequently wrote in to say that the 1 1/2 TB is raw disk figure, not user data.)
The new information there is that Kickfire relies on an FPGA; Read more
Categories: Analytic technologies, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Kickfire, MySQL, Theory and architecture | 16 Comments |
VectorWise, Ingres, and MonetDB
I talked with Peter Boncz and Marcin Zukowski of VectorWise last Wednesday, but didn’t get around to writing about VectorWise immediately. Since then, VectorWise and its partner Ingres have gotten considerable coverage, especially from an enthusiastic Daniel Abadi. Basic facts that you may already know include:
- VectorWise, the product, will be an open-source columnar analytic DBMS. (But that’s not quite true. Pending productization, it’s more accurate to call the VectorWise technology a row/column hybrid.)
- VectorWise is due to be introduced in 2010. (Peter Boncz said that to me more clearly than I’ve seen in other coverage.)
- VectorWise and Ingres have a deal in which Ingres will at least be the exclusive seller of the VectorWise technology, and hopefully will buy the whole company.
- Notwithstanding that it was once named something like “MonetDB,” VectorWise actually is not the same thing as MonetDB, another open source columnar analytic DBMS from the same research group.
- The MonetDB and VectorWise research groups consist in large part of academics in Holland, specifically at CWI (Centrum voor Wiskunde en Informatica). But Ingres has a research group working on the project too. (Right now there are about seven “highly experienced” people each on the VectorWise and Ingres sides, although at least the VectorWise folks aren’t all full-time. More are being added.)
- Ingres and VectorWise haven’t agreed exactly how VectorWise and Ingres Classic will play together in the Ingres product line. (All of the obvious possibilities are still on the table.)
- VectorWise is shared-everything, just as Ingres is. But plans — still tentative — are afoot to integrate VectorWise with MapReduce in Daniel Abadi’s HadoopDB project.
Categories: Actian and Ingres, Analytic technologies, Columnar database management, Data warehousing, Database compression, MonetDB, Open source, Theory and architecture, VectorWise | 12 Comments |
Hasso Plattner calls for in-memory OLTP column stores
Former SAP CEO Hasso Plattner has written a paper called A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, in association with a SIGMOD keynote address.* The approach Plattner advocates is an MPP in-memory column store, presumably somewhat akin to SAP’s frequently renamed Business Warehouse Accelerator/Business Intelligence Accelerator/BWA/BIA/Son-of-TREX technology. There also are strong similarities to the MPP in-memory row store project H-Store/VoltDB, although I don’t know whether Plattner would go so far as to adopt the H-Store view that all transactions should run in stored procedures. Unsurprisingly, SAP applications are used as the OLTP paradigm throughout.
*Thanks to Dave Kellogg for tipping me off to Plattner’s paper. I only went to two SIGMOD sessions, neither of which was Plattner’s. Nobody actually mentioned Plattner’s talk to me when I was down at SIGMOD.
Perhaps the most interesting part is Plattner’s claim that what’s demanding about OLTP isn’t database updating per se, but rather maintaining aggregates for quick-response analytics. In his main example of that point, Plattner proposes a real-life “more than 18” table schema, of which 2 are base tables, and (most of?) the rest are materialized views that his proposed database architecture dispenses with (because analytic performance is sufficiently good without them). Thus, Plattner’s core columnar argument seemingly is
columnar –> natively fast analytics –> no need to maintain aggregates –> much lower update burden.
That said — if Plattner’s paper contained a clear statement of how much more expensive it is to insert or update a single row in a columnar vs. row-based system, I overlooked it. Instead, Plattner seems to be arguing that the volume of base-table updates is low enough that — whatever it may be — column-store update overhead is an acceptable price to pay. (At one point he claims that only 5% of the data inserted in a financial application ever gets changed.) That may actually be true in a financial accounting system, but seems more questionable in a sufficiently large application that gets its updates from automatic devices, or from the consumer web.
Other highlights include: Read more
Notes on columnar/TPC-H compression
I was chatting with Omer Trajman of Vertica, and he said that a 70% compression figure for ParAccel’s recent TPC-H filing sounded about right.* When I noted that seemed kind of low, Omer pointed out that TPC-H data is pseudo-random, while real-life data has much more correlation among the values in different columns. E.g., in retail, a customer is likely to consistently shop at the same stores and to put similar items into his shopping basket).
*Omer was involved in Vertica’s TPC-H-data-based load speed benchmark, and is Vertica’s representative to the TPC.
But why does this matter? After all, Vertica compresses one column at a time (unlike, say, Clearpace). Well, the reason is that Vertica — like other column stores — wants to store different columns in the same row order, for obvious benefits in both reading and writing. So, for example, if all the rows that include Gotham City are grouped sequentially, then all the rows mentioning Bruce Wayne are likely to be near each other as well, while none of the rows that mention Clark Kent will be mixed in.
And when a set of consecutive entries has low cardinality, it’s easier to get high levels of compression.
Categories: Benchmarks and POCs, Columnar database management, Data warehousing, Database compression, Vertica Systems | Leave a Comment |
Aster Data enters the appliance game
Aster Data is rolling out a line of nCluster appliances today. Highlights include:
- Configurations ranging from 9 6.25 terabytes to 1 petabyte of user data. (Edit: Here’s the up-to-date data sheet.)
- A $50K “Express Edition” price for <1 terabyte of user data. Unfortunately, that’s the only stated price.
- The option of bundled MicroStrategy.
- “MapReduce” in the name, which suggests something about the positioning — i.e., enterprise decision support, rather than Aster’s usual web/”frontline” emphasis. (Edit: That also fits with Aster’s recent MapReduce-for-.NET announcement.) (Edit: Actual name is Aster MapReduce Data Warehouse Appliance.)
- Claims that because Aster runs effectively on cheaper, more truly “commodity” hardware than competitors, you get more hardware bang for the buck if you buy from Aster.
I don’t have a lot more to add right now, mainly because I wrote at some length about Aster’s non-appliance-specific, non-MapReduce technology and positioning a couple of weeks ago.
Categories: Analytic technologies, Aster Data, Business intelligence, Data warehouse appliances, Data warehousing, Database compression, MapReduce, Pricing | 16 Comments |
The TPC-H benchmark is a blight upon the industry
ParAccel has released a 30,000-gigabtye TPC-H benchmark, and no less a sage than Merv Adrian paid attention. Now, the TPCs may have had some use in the 1990s. Indeed, Merv was my analyst relations contact for a visit to my clients at Sybase around the time — 1996 or so — I was advising Sybase on how to market against its poor benchmark results. But TPCs are worthless today.
It’s not just that TPCs are highly tuned (ParAccel’s claim of “load-and-go” is laughable Edit: Looking at Appendix A of the full disclosure report, maybe it’s more justified than I thought.). It’s also not just that different analytic database management products perform very differently on different workloads, making the TPC-H not much of an indicator of anything real-life. The biggest problem is: Most TPC benchmarks are run on absurdly unrealistic hardware configurations.
For example, if you look at some details, the ParAccel 30-terabyte benchmark ran on 43 nodes, each with 64 gigabytes of RAM and 24 terabytes of disk. That’s 961,124.9 gigabytes of disk, officially, for a 32:1 disk/data ratio. By way of contrast, real-life analytic DBMS with good compression often have disk/data ratios of well under 1:1.
Meanwhile, the RAM:data ratio is around 1:11 It’s clear that ParAccel’s early TPC-H benchmarks ran entirely in RAM; indeed, ParAccel even admits that. And so I conjecture that ParAccel’s latest TPC-H benchmark ran (almost) entirely in RAM as well. Once again, this would illustrate that the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.
More generally — I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.
Categories: Analytic technologies, Benchmarks and POCs, Buying processes, Columnar database management, Data warehousing, Database compression, ParAccel | 96 Comments |
Greenplum update — Release 3.3 and so on
I visited Greenplum in early April, and talked with them again last night. As I noted in a separate post, there are a couple of subjects I won’t write about today. But that still leaves me free to cover a number of other points about Greenplum, including: Read more