A data distribution idea at Vertica and Clustrix
Yesterday I wrote:
Clustrix has one cool idea I haven’t heard from anybody else, which I’m calling index distribution. The idea is that each index can be distributed differently across the cluster … i.e. on different distribution keys. Clustrix thinks that paying special attention to index distribution and movement is helpful to the performance of distributed joins.
While that’s true, I thought I’d heard something similar from Vertica; so I checked, and indeed I had. Vertica famously lets you store columns in different sort orders, in both reasonable senses:
- Different columns in a table can be sorted in different ways.
- A single column, which is stored multiple times for usual reasons of replication safety, can be sorted differently in its different copies.
It turns out those columns can also be distributed on different keys as well.
Related link
- Vertica projections explained at length (September, 2011)
Comments
5 Responses to “A data distribution idea at Vertica and Clustrix”
Leave a Reply
Back in the H-Store days, we considered this feature for what became VoltDB as well (some of us were Vertica folks after all).
We decided this was difficult to leverage in always-available high-throughput OLTP workloads. If a machine that has the index you need for a particular query fails, then that query goes from an index scan to a seqential scan. The difference in performance can be as many as 5 orders of magnitude in not unrealistic cases.
To make this feature really useful, you’d have to use non-replicated indexes only for a non-critical part of your application, and you’d have to be able to turn these queries off or throttle them in the face of node failure.
Of course, if you’re replicating just for durability and not for availability, then I suppose you gain some storage efficiency.
Hi John,
Clustrix indexes have replication factors too. Sergei told me reasonable defaults were replication factors of 3 for data, 2 for indexes. But a particular crucial index could have a higher factor.
A typical Clustrix customer has 100s of tables, and an average table will have several indexes (usually more than 2 but less than 10 – but there are outliers of course).
Most of these apps have machine generated queries, with a typical app generating 1,000s of distinct query signatures. That’s a lot of query – index interactions to track.
It would be impractical for most of our customers to identify which sets of indexes are less important than others. And if you get it wrong, some portion of your app becomes broken in a simple failure case. It’s just not worth it to try and save a few writes.
Clustrix has some new technical documentation that describes how data distribution works:
http://docs.clustrix.com/display/CLXDOC/Data+Distribution
[…] stand for now on my previous coverage of Vertica’s database […]