Oracle and Vertica on compression and other physical data layout features
In my recent post on Exadata pricing, I highlighted the importance of Oracle’s compression figures to the discussion, and the uncertainty about same. This led to a Twitter discussion featuring Greg Rahn* of Oracle and Dave Menninger and Omer Trajman of Vertica. I also followed up with Omer on the phone.
*Guys like Greg Rahn and Kevin Closson are huge assets to Oracle, which is absurdly and self-defeatingly unhelpful through conventional public/analyst relations channels.
Six key tweets by Greg Rahn said:
I think the HCC 10x compression is a slideware (common) number. Personally I’ve seen it in the 12-17x range on customer data…
This was on a dimensional model. Can’t speak to the specific industry. I do believe Oracle is working on getting industry #s.
As far as I know, Exadata HCC uses a superset of compression algorithms that the commonly known column stores use…
…and it doesn’t require the compression type be in the DDL like Vertica or ParAccel. It figures out the best algo to apply.
The compression number I quoted is sizeof(uncompressed)/sizeof(hcc compressed). No indexes were used in this case.
Exadata HCC is applicable for bulk loaded (fact table) data, so a significant portion (size wise) of most DWs.
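Before summing up, one concrete note on the ratio in that fifth tweet. This is a sketch of mine, not anything Greg posted, of how that ratio could be computed on Oracle by comparing the on-disk size of an uncompressed copy of a table against an HCC-compressed copy; the table names are hypothetical.

```sql
-- Hedged sketch: sizeof(uncompressed)/sizeof(hcc compressed), computed from the
-- segment sizes of an uncompressed copy (SALES_UNCOMP) and an HCC-compressed
-- copy (SALES_HCC) of the same fact table. Both table names are hypothetical.
SELECT u.bytes / c.bytes AS compression_ratio
FROM (SELECT SUM(bytes) AS bytes FROM dba_segments WHERE segment_name = 'SALES_UNCOMP') u,
     (SELECT SUM(bytes) AS bytes FROM dba_segments WHERE segment_name = 'SALES_HCC') c;
```

At the 12-17X Greg cites, a 15 terabyte uncompressed fact table would come out to roughly 0.9 to 1.25 terabytes on disk.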
Summing up, that seems to say:
- Oracle claims 12-17X compression on a kind of data similar to that on which Vertica — which also uses 10X as a single-point overall compression marketing estimate where needed — claims 20X.
- Oracle selects compression algorithms automagically.
- Oracle’s compression doesn’t quite apply to all the data. Actually, this may be more of an issue for the caching benefits of compression than for the I/O or disk storage gains. (If you join a retail transaction fact table to a customer dimension table, and you have a lot of customers, fitting the uncompressed customer table into RAM could be problematic.)
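To make the "automagically" point concrete, here is a hedged sketch (not something Oracle or Greg supplied) of what Exadata HCC DDL looks like: you declare a compression level on a table or partition, and the per-column encodings are chosen internally. Table, column, and partition names below are hypothetical.

```sql
-- Hedged sketch of Exadata Hybrid Columnar Compression DDL.
-- Only a compression level is declared; no per-column encoding appears in the DDL.
-- All object names are hypothetical.
CREATE TABLE sales_fact (
    sale_date   DATE,
    store_id    NUMBER,
    customer_id NUMBER,
    product_id  NUMBER,
    price       NUMBER(10,2)
)
COMPRESS FOR QUERY HIGH            -- other levels: QUERY LOW, ARCHIVE LOW, ARCHIVE HIGH
PARTITION BY RANGE (sale_date) (
    PARTITION sales_2009q3 VALUES LESS THAN (DATE '2009-10-01'),
    PARTITION sales_2009q4 VALUES LESS THAN (DATE '2010-01-01')
);

-- An existing partition can later be rebuilt into a colder compression level:
ALTER TABLE sales_fact MOVE PARTITION sales_2009q3 COMPRESS FOR ARCHIVE LOW;
```

Per Greg's last tweet above, HCC applies to bulk (direct-path) loaded data, which is why he frames it as covering fact tables rather than everything.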
Omer and I happened to have a call scheduled to discuss MapReduce yesterday evening, but wound up using most of the time to talk about Vertica’s compression and physical layout features instead. Highlights included:
- Greg, like many Vertica competitors, was wrong about Vertica requiring manual, low-level DDL (Data Definition Language) for — well, for much of anything. Vertica does all that automatically, at least in theory, and suggests that in real life you can indeed often get by without manual intervention.
- Vertica can do trickle feeds into its compressed columnar storage. Greg seemed to suggest Oracle Exadata can not. (However, I won’t be surprised if, when his comments are expanded to more than 140 characters, he winds up saying the opposite. 🙂 )
- Omer characterized the lowest latency with which you can get data into Vertica and have it be available for query as “seconds”, vs. “minutes” for other columnar vendors.
- Vertica recommends often keeping multiple copies of a column, for high availability and/or performance. This is not directly reflected in compression estimates. In particular, if you’re going to keep redundant copies of data for data-safety reasons anyway, Vertica recommends that you:
- Run queries against more than one copy of the data, for performance/throughput.
- Store different copies of the columns in different sort orders — e.g., according to different likely join keys — so that the copies are optimized for performance on different classes of queries. (A DDL sketch of this multiple-sort-order approach appears after this list.)
- Vertica doesn’t have indexes.
- Vertica sorts columns on ingest. This sorting is, of course, commonly based on attributes from columns other than the one being sorted. Even so, Omer maintains that sorting helps compression, because of the correlation between columns. Examples (and I didn’t get these all from him) might include:
- City/postal code
- Customer_ID/store location
- Customer_ID/product_ID
- Product_ID/price
- Vertica, based on the recent introduction of FlexStore, has an ILM (Information Lifecycle Management) story much like Sybase IQ’s. E.g., you can keep different data ranges for different columns on fast storage, while the rest of the data is relegated to slower/cheaper equipment.
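For contrast with the Oracle sketch above, here is a hedged sketch of Vertica projection DDL showing the multiple-sort-order idea from the list above. Table, column, and projection names are hypothetical; the encodings are spelled out only for illustration, since Vertica can choose them automatically; and segmentation/K-safety clauses are omitted for brevity.

```sql
-- Hedged sketch: one fact table, two projections (physical copies), each sorted
-- differently so that different query patterns hit a favorable sort order.
-- Names are hypothetical; encodings shown explicitly only for illustration.
CREATE TABLE sales_fact (
    sale_date   DATE,
    store_id    INTEGER,
    customer_id INTEGER,
    product_id  INTEGER,
    price       NUMERIC(10,2)
);

-- Copy 1: sorted for date/store-oriented queries; sorting makes RLE effective
-- on the leading columns.
CREATE PROJECTION sales_by_store (
    sale_date   ENCODING RLE,
    store_id    ENCODING RLE,
    customer_id,
    product_id,
    price
) AS
SELECT sale_date, store_id, customer_id, product_id, price
FROM sales_fact
ORDER BY sale_date, store_id;

-- Copy 2: the redundant copy kept for data safety doubles as a customer-oriented
-- sort order, so correlated columns (e.g., product_id and price) also compress well.
CREATE PROJECTION sales_by_customer (
    customer_id ENCODING RLE,
    product_id,
    sale_date,
    store_id,
    price
) AS
SELECT customer_id, product_id, sale_date, store_id, price
FROM sales_fact
ORDER BY customer_id, product_id;
```

The design choice is the one in the list: since a redundant copy exists for data-safety reasons anyway, giving it a different sort order lets queries run against whichever copy matches their access pattern, and sorting on correlated columns is what makes encodings like RLE pay off.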
Comments
@Curt
I would disagree that I am wrong. Perhaps not 100% correct, but let me explain. With Vertica there seem to be two ways to “get” compression. The first is not to specify the type, which then gives the default of “auto”, which seems to be either LZO or delta depending on the data type. All the other types (RLE, dictionary, etc.) need to be specified in the DDL for each column by hand, or via DDL generated by an external tool that looks at the data, the queries, or both. (See my Twitter conversation with Dave Menninger.) ParAccel also has an external tool, called Compalyze, to suggest/generate the compression type per column in the DDL.
My point was that with Exadata Hybrid Columnar Compression the user specifies only the compression level (for query low, for query high, or archive) on a table or partition and internally the optimal encoding/compression algorithms are determined on a per column basis with no prior knowledge of the data or queries.
I hope that clarifies my comment.
Greg,
How do you get from Dave Menninger’s … to your … ?
@Curt
I got there via other sources of information, but I’ll defer to Dave or someone else from Vertica to clarify whether my interpretation is correct: whether the Vertica database software can indeed choose any compression type for columns, based on data or queries or both, after the physical projection has been created, and whether the “auto” mode is more sophisticated than “if this data type, then that compression type” (implicitly applied at projection creation time).
If you read through the FDR (Full Disclosure Report) for Vertica’s ETL Database Load Benchmark, you can see that the projection DDL explicitly contains encoding algorithms for some of the columns (as well as segmentation). I would have to assume that this was done either by 1) a person or 2) an external tool that knows something about the data. If it were truly done automatically inside the database, then there should be no real reason to use explicit encoding declarations in the DDL.
@Greg,
Vertica absolutely gives a manual tuning option.
However, as I understand from the company it doesn’t REQUIRE manual tuning.
Neither of those should be surprising. The devil, of course, is in the details — how much manual tuning is typically called for.
Vertica’s automatic tuning does indeed recommend various encoding types, including RLE. It can run at any point – prior to, during, or after load – and can be rerun after workloads run or as data change over time.
Manual tuning is always available and we chose to use manual tuning for the ETL benchmark.
@Omer
By “automatic tuning” you mean running the external admin tool, versus the default or “auto” choice of the compression type for a column in the DDL, correct? As I understand it, the automation is in the recommendation and the optional implementation of that recommendation (for projection choice and column encoding type). It still requires that a DBA perform those tasks, which are not necessarily difficult, but are nonetheless tasks that need to be done.
@Greg,
I happen to be on the phone w/Omer again at the moment. Short answer to your comments is that your news about Vertica seems several releases old.
@Curt
I (and surely other readers) would be interested in the long answer. I will say that I am doubtful that my information is several releases old. IIRC it is based on version 3.0 which at this point may be several releases old, but probably less than 6 months old.
@Greg,
Fair enough.
On the other hand, Omer has already posted saying what Vertica does. There’s only so much you can reasonably demand of a competitor in outlining what their prior releases, by way of contrast, did not do.
@Curt
I agree – I’m not trying to dwell on the past so an answer on how it works in the current (whatever number that is or will be) release is acceptable.
FWIW my info seems to agree with what Seth Grimes has in this paper:
http://www.sybase.com/files/White_Papers/Sybase-IQ-Competitive-Assessment-070209-WP.pdf
@Greg,
Evidently, you think Vertica is lying through its teeth as to what its products do or don’t do. (This would by no means, of course, be the first time a vendor has done such a thing …)
I have a simple suggestion, then. Vertica’s product is freely available for download. Why don’t you download it, check it out for yourself, and blog about what you discover?
@Greg,
As for Seth’s paper — I like the guy, so I won’t post anything more about it beyond my original harsh words on Twitter.
[…] eschew RAID, instead doing mirroring in its own software. (Other examples of this strategy would be Vertica, Oracle Exadata/ASM, and Teradata Fallback.) Prior to nCluster 4.0, this caused a problem, in that […]
I think you can simply forget about compression in Exadata; if you compress, it gets slow.
Vertica, on the other hand, is columnar and, as explained above, does the job very well. Compare the two and you will see Oracle gets killed.
Overall, on compression and query speed with the same hardware and data volume, Vertica is the fastest and Oracle is the slowest in the market right now.