The secret sauce to Clearpace’s compression
In an introduction to archiving vendor Clearpace last December, I noted that Clearpace claimed huge compression successes for its NParchive product (Clearpace likes to use a figure of 40X), but didn’t give much reason to believe that NParchive could compress data a lot more effectively than other columnar DBMS. Let me now follow up on that.
To the extent there’s a Clearpace secret sauce, it seems to lie in NParchive’s unusual data access method. NParchive doesn’t just tokenize the values in individual columns; it tokenizes multi-column fragments of rows. Which particular columns to group together in that way seems to be decided automagically; the obvious guess is that this is based on estimates of the cardinality of their Cartesian products.
Off the top of my head, examples for which this strategy might be particularly successful include:
- Denormalized databases
- Message stores with lots of header information
- Addresses
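For concreteness, here is a minimal Python sketch of what tokenizing multi-column fragments might look like. The greedy cardinality heuristic, the function names, and the sample rows are all my own illustration, not anything Clearpace has disclosed:

```python
from itertools import combinations

def group_columns(rows, columns, max_group_size=3):
    """Greedy guess at which columns to tokenize together: prefer groups whose
    observed combined cardinality is far below the product of the individual
    cardinalities (i.e. the columns are highly correlated)."""
    best_group, best_ratio = None, 1.0
    for size in range(2, max_group_size + 1):
        for group in combinations(columns, size):
            combined = {tuple(row[c] for c in group) for row in rows}
            product = 1
            for c in group:
                product *= len({row[c] for row in rows})
            ratio = len(combined) / product
            if ratio < best_ratio:
                best_group, best_ratio = group, ratio
    return best_group

def tokenize(rows, group):
    """Dictionary-encode the multi-column fragment: each distinct combination of
    values is stored once, and each row keeps only a small integer token."""
    dictionary, encoded = {}, []
    for row in rows:
        fragment = tuple(row[c] for c in group)
        token = dictionary.setdefault(fragment, len(dictionary))
        encoded.append(token)
    return dictionary, encoded

rows = [
    {"city": "Bristol", "country": "UK", "currency": "GBP", "amount": 10},
    {"city": "Bristol", "country": "UK", "currency": "GBP", "amount": 25},
    {"city": "Boston",  "country": "US", "currency": "USD", "amount": 7},
]
group = group_columns(rows, ["city", "country", "currency"])
dictionary, tokens = tokenize(rows, group)
print(group, len(dictionary), tokens)   # ('city', 'country', 'currency') 2 [0, 0, 1]
```

The point is simply that when columns are highly correlated, one dictionary over their combined values can be far smaller than separate per-column dictionaries.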
Comments
[…] and deduping them. I’m still fuzzy on how that all works. (Edit: I subsequently posted an explanation of that […]
For a good technical discussion of how to trade off row- and column-wise schemes for maximum compression, have a look at the IBM work on Blink. In many cases they show you can approach the optimal compression rate (entropy) this way. I like how they cut through the marketing fog on columns vs. rows and focus on the technical meat of compression, and the costs of coding/decoding vs. I/O.
See this paper on Blink and their study/survey of various compression methods.
That looks like good stuff, Joe. Thanks!
This looks pretty interesting. Log files are often heavily denormalized (since joins in warehouses are so expensive), and the multi-column idea could work very well.
I have been playing around with S3 and EC2 lately, and one of the things that struck me was that the cost of uploading data can also be non-trivial. Besides, if data is not uploaded in an optimally compressed form, then the user has to rent CPU in the cloud to compress it.
I think it would be very interesting if highly efficient compression could be applied right from the moment data originates, all the way to its final long-term store.
[…] patents and hundreds of man years of development. However, following some commentary in a post by Curt Monash this week, I thought I’d shed some light on Clearpace’s “secret sauce”. I’ve tried to […]
To clarify, Clearpace’s underlying technology leverages a tree-based rather than columnar structure that utilizes field- and pattern-level deduplication. When source data is loaded into NParchive, each record is stored as a series of pointers to the location of a single instance of a data value, or pattern of data values. The NParchive data store comprises a tree-based structure that links the various instances of the patterns together to establish the data records. Each record is essentially an independent tree, but records can share leaves and branches. This approach typically delivers 40:1 compression when combined with the additional algorithmic and byte-level compression techniques employed by NParchive, while still allowing the original data records to be reconstituted at any time.
NParchive’s tree-based approach provides the advantages of the columnar structure (column-level access and compression) but also allows additional compression to be applied (based upon “patterns” between columns). Furthermore, the tree structure is used for in-memory querying, so the memory footprint is also significantly reduced.
Take a look at this post http://tinyurl.com/qraffr if you want more information on NParchive’s compression techniques.
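To make the comment above concrete, here is a toy Python sketch of field- and pattern-level deduplication along the lines it describes. The DedupStore class, the choice of which columns form a “pattern,” and the sample records are all hypothetical; this is not meant to depict NParchive’s actual implementation, which presumably picks patterns automatically:

```python
class DedupStore:
    """Toy illustration of field- and pattern-level deduplication: each distinct
    value (leaf) and each distinct pattern of values (branch) is stored once;
    a record is just a set of pointers, so records share leaves and branches."""

    def __init__(self, pattern_columns, other_columns):
        self.pattern_columns = pattern_columns
        self.other_columns = other_columns
        self.leaves = {}     # value -> leaf id (single instance of each data value)
        self.branches = {}   # tuple of leaf ids -> branch id (single instance of each pattern)
        self.records = []    # one (branch id, tuple of leaf ids) per record

    def _leaf(self, value):
        return self.leaves.setdefault(value, len(self.leaves))

    def add(self, row):
        # Deduplicate the multi-column pattern as a whole, then the remaining fields.
        branch_key = tuple(self._leaf(row[c]) for c in self.pattern_columns)
        branch_id = self.branches.setdefault(branch_key, len(self.branches))
        rest = tuple(self._leaf(row[c]) for c in self.other_columns)
        self.records.append((branch_id, rest))

    def reconstruct(self, i):
        """Rebuild the original record from its pointers at any time."""
        branch_id, rest = self.records[i]
        leaf_by_id = {v: k for k, v in self.leaves.items()}
        branch_key = next(k for k, v in self.branches.items() if v == branch_id)
        values = [leaf_by_id[l] for l in branch_key] + [leaf_by_id[l] for l in rest]
        return dict(zip(self.pattern_columns + self.other_columns, values))

# Example: a message store where header fields repeat heavily across records.
store = DedupStore(pattern_columns=["sender", "domain"], other_columns=["subject"])
store.add({"sender": "alice", "domain": "example.com", "subject": "hi"})
store.add({"sender": "alice", "domain": "example.com", "subject": "re: hi"})
print(store.reconstruct(1))   # {'sender': 'alice', 'domain': 'example.com', 'subject': 're: hi'}
```

In this sketch the shared (sender, domain) pattern is stored once and both records merely point to it, which is the sense in which records “share leaves and branches” while remaining fully reconstitutable.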
[…] sending data to the cloud, you probably want to compress it to the max before sending. Clearpace’s new RainStor structured-data archiving service emphasizes that idea. RainStor marketing says cloud, […]
[…] Except for the 4096 values limit, that sounds at least as flexible as the Rainstor/Clearpace compression approach. […]