Database compression is heavily affected by the kind of data
I’ve often written about how different kinds or brands of data warehouse DBMS get very different compression figures. But I haven’t focused enough on how much compression figures can vary among different kinds of data. This was really brought home to me when Vertica told me that web analytics/clickstream data can often be compressed 60X in Vertica, while at the other extreme (some kind of floating-point data, whose details I forget for now) they could only do 2.5X. Edit: Vertica has now posted much more accurate versions of those numbers.
Infobright’s 30X compression reference at TradeDoubler seems to be for a clickstream-type app. Greenplum’s customer getting 7.5X (high for a row-based system) is managing clickstream data and related stuff.
Bottom line:
When evaluating compression ratios — especially large ones — it is wise to inquire about the nature of the data.
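To make the point concrete, here is a minimal sketch, using plain Python and zlib rather than any of the columnar engines mentioned above, of why repetitive clickstream-style values compress so much better than random floating-point values. The five-URL vocabulary and the row count are made up purely for illustration.

```python
# Rough illustration (generic zlib, not Vertica/Infobright/Greenplum) of how
# data characteristics drive compression ratios.
import random
import struct
import zlib

random.seed(42)
N = 100_000

# Clickstream-like column: a handful of URL strings repeated over and over.
urls = ["/home", "/search", "/product/123", "/cart", "/checkout"]
clickstream = "\n".join(random.choice(urls) for _ in range(N)).encode()

# Floating-point column: random doubles, which are close to incompressible.
floats = b"".join(struct.pack("d", random.random()) for _ in range(N))

for name, raw in [("clickstream", clickstream), ("floats", floats)]:
    packed = zlib.compress(raw, 9)
    print(f"{name}: {len(raw)} -> {len(packed)} bytes "
          f"({len(raw) / len(packed):.1f}X)")
```

Even a general-purpose compressor shows a wide gap between the two kinds of data; column stores that add dictionary and run-length encoding typically widen it further.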
Comments
Curt,
I’d love to get a better definition of compression. When they say 60X compression, do they really mean 60X on all of the information, with no data loss? In reality, weblog and clickstream data is full of junk that doesn’t get integrated. I haven’t done this myself in a few years, but I seem to recall that we stripped out and analyzed only a small fraction of each record.
-NR
Neil,
I’m pretty sure it’s lossless. That’s what’s always meant when talking about compression of other kinds of data, and I don’t know why this kind would be an exception.
CAM
[…] It’s clear what Omer means by most of those categories from reading the post, but I’m a little fuzzy on what “Consumer Data” or “Marketing Analytics” comprise in his taxonomy. Anyhow, Omer’s post is a huge improvement over my recent one — based on a conversation with Omer — which featured some far less accurate or complete compression numbers. […]
[…] Data compression. I have written a lot about data compression. […]