The secret sauce to Clearpace’s compression
In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason to believe that NParchive could compress data a lot more effectively than other columnar DBMS. Let me now follow up on that.
To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.
Off the top of my head, examples for which this strategy might be particularly successful include:
- Denormalized databases
- Message stores with lots of header information
- Addresses
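For concreteness, here is a minimal Python sketch of what tokenizing multi-column fragments might look like. The greedy cardinality heuristic, the function names, and the sample rows are all my own illustration, not anything Clearpace has disclosed:

```python
from itertools import combinations

def group_columns(rows, columns, max_group_size=3):
    """Greedy guess at which columns to tokenize together: prefer groups whose
    observed combined cardinality is far below the product of the individual
    cardinalities (i.e. the columns are highly correlated)."""
    best_group, best_ratio = None, 1.0
    for size in range(2, max_group_size + 1):
        for group in combinations(columns, size):
            combined = {tuple(row[c] for c in group) for row in rows}
            product = 1
            for c in group:
                product *= len({row[c] for row in rows})
            ratio = len(combined) / product
            if ratio < best_ratio:
                best_group, best_ratio = group, ratio
    return best_group

def tokenize(rows, group):
    """Dictionary-encode the multi-column fragment: each distinct combination of
    values is stored once, and each row keeps only a small integer token."""
    dictionary, encoded = {}, []
    for row in rows:
        fragment = tuple(row[c] for c in group)
        token = dictionary.setdefault(fragment, len(dictionary))
        encoded.append(token)
    return dictionary, encoded

rows = [
    {"city": "Bristol", "country": "UK", "currency": "GBP", "amount": 10},
    {"city": "Bristol", "country": "UK", "currency": "GBP", "amount": 25},
    {"city": "Boston",  "country": "US", "currency": "USD", "amount": 7},
]
group = group_columns(rows, ["city", "country", "currency"])
dictionary, tokens = tokenize(rows, group)
print(group, len(dictionary), tokens)   # ('city', 'country', 'currency') 2 [0, 0, 1]
```

The point is simply that when columns are highly correlated, one dictionary over their combined values can be far smaller than separate per-column dictionaries.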
Comments
[…] and deduping them. I’m still fuzzy on how that all works. (Edit: I subsequently posted an explanation of that […]
For a good technical discussion of how to trade off row- and column-wise schemes for maximum compression, have a look at the IBM work on Blink. In many cases they show you can approach the optimal compression rate (entropy) this way. I like how they cut through the marketing fog on columns vs. rows and focus on the technical meat of compression, and the costs of coding/decoding vs. I/O.
See this paper on Blink and their study/survey of various compression methods.
That looks like good stuff, Joe. Thanks!
This looks pretty interesting. Log files are often heavily denormalized (since joins in warehouses are so expensive), and the multi-column idea could work very well.
I have been playing around with S3 and EC2 lately, and one of the things that struck me was that the cost of uploading data can also be non-trivial. Besides, if data is not uploaded in an optimally compressed form, then the user has to rent CPU in the cloud to compress it.
I think it would be very interesting if highly efficient compression could be applied right from the moment data originates, all the way to its final long-term store.
[…] patents and hundreds of man years of development. However, following some commentary in a post by Curt Monash this week, I thought I’d shed some light on Clearpace’s “secret sauce”. I’ve tried to […]
To clarify, Clearpace’s underlying technology leverages a tree-based rather than columnar structure that utilizes field- and pattern-level deduplication. When source data is loaded into NParchive, each record is stored as a series of pointers to the location of a single instance of a data value, or pattern of data values. The NParchive data store comprises a tree-based structure that links the various instances of the patterns together to establish the data records. Each record is essentially an independent tree, but records can share leaves and branches. This approach typically delivers 40:1 compression when combined with the additional algorithmic and byte-level compression techniques employed by NParchive, while still allowing the original data records to be reconstituted at any time.
NParchive’s tree-based approach provides the advantages of the columnar structure (column-level access and compression) but also allows additional compression to be applied (based upon “patterns” between columns). Furthermore, the tree structure is used for in-memory querying, so the memory footprint is also significantly reduced.
Take a look at this post http://tinyurl.com/qraffr if you want more information on NParchive’s compression techniques.
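To make the comment above concrete, here is a toy Python sketch of field- and pattern-level deduplication along the lines it describes. The DedupStore class, the choice of which columns form a “pattern,” and the sample records are all hypothetical; this is not meant to depict NParchive’s actual implementation, which presumably picks patterns automatically:

```python
class DedupStore:
    """Toy illustration of field- and pattern-level deduplication: each distinct
    value (leaf) and each distinct pattern of values (branch) is stored once;
    a record is just a set of pointers, so records share leaves and branches."""

    def __init__(self, pattern_columns, other_columns):
        self.pattern_columns = pattern_columns
        self.other_columns = other_columns
        self.leaves = {}     # value -> leaf id (single instance of each data value)
        self.branches = {}   # tuple of leaf ids -> branch id (single instance of each pattern)
        self.records = []    # one (branch id, tuple of leaf ids) per record

    def _leaf(self, value):
        return self.leaves.setdefault(value, len(self.leaves))

    def add(self, row):
        # Deduplicate the multi-column pattern as a whole, then the remaining fields.
        branch_key = tuple(self._leaf(row[c]) for c in self.pattern_columns)
        branch_id = self.branches.setdefault(branch_key, len(self.branches))
        rest = tuple(self._leaf(row[c]) for c in self.other_columns)
        self.records.append((branch_id, rest))

    def reconstruct(self, i):
        """Rebuild the original record from its pointers at any time."""
        branch_id, rest = self.records[i]
        leaf_by_id = {v: k for k, v in self.leaves.items()}
        branch_key = next(k for k, v in self.branches.items() if v == branch_id)
        values = [leaf_by_id[l] for l in branch_key] + [leaf_by_id[l] for l in rest]
        return dict(zip(self.pattern_columns + self.other_columns, values))

# Example: a message store where header fields repeat heavily across records.
store = DedupStore(pattern_columns=["sender", "domain"], other_columns=["subject"])
store.add({"sender": "alice", "domain": "example.com", "subject": "hi"})
store.add({"sender": "alice", "domain": "example.com", "subject": "re: hi"})
print(store.reconstruct(1))   # {'sender': 'alice', 'domain': 'example.com', 'subject': 're: hi'}
```

In this sketch the shared (sender, domain) pattern is stored once and both records merely point to it, which is the sense in which records “share leaves and branches” while remaining fully reconstitutable.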
[…] sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, […]
[…] Except for the 4096 values limit, that sounds at least as flexible as the Rainstor/Clearpace compression approach. […]