Vertica — just star and snowflake schemas?
One of the longest-running technotheological disputes I know of is the one pitting flat/normalized data warehouse architectures against cubes, stars, and snowflake schemas. Teradata, for example, is a flag-waver for the former camp; Microstrategy is firmly in the latter. (However, that doesn’t keep lots of retailers from running Microstrategy on Teradata boxes.) Attensity (a good Teradata partner) is in the former camp; text-mining rival Clarabridge (sort of a Microstrategy spinoff) is in the latter. And so on.
Vertica is clearly in the star/snowflake camp as well. I asked them about this, and Vertica’s CTO Mike Stonebraker emailed a response. I’m reproducing it below, with light edits; the emphasis is also mine. Key points include:
- Almost everybody (that Vertica sees) wants stars and snowflakes, so that’s what Vertica optimizes for.
- Replicating small dimension tables across nodes is great for performance.
- Even so, Vertica is broadening its support for more general schemas as well.
Great question. This is something that we’ve thought a lot about and have done significant research on with large enterprise customers. … The short answer is as follows:
Vertica supports star and snowflake schemas because that is the desired data structure for data warehousing. The overwhelming majority of the schemas we see are of this form, and we have highly optimized for this case.
This includes horizontally partitioning the fact table and replicating the dimension tables. This generates query plans with maximum parallelism and minimum execution time.
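Ed: as an illustration of this pattern (not Vertica’s actual implementation), here is a toy Python sketch: the fact table is hash-partitioned across nodes, the tiny dimension table is replicated to every node, and each node joins its own fact slice locally, so no fact rows cross the network at join time. All tables, columns, and data are invented.

```python
# Toy distributed star join: hash-partition the fact table,
# replicate the small dimension table, join locally on each node.
# Schema and data are hypothetical.

NUM_NODES = 3

fact = [  # (sale_id, product_id, amount)
    (1, 10, 5.0), (2, 11, 7.5), (3, 10, 2.0), (4, 12, 9.9),
]
dim_product = {10: "widget", 11: "gadget", 12: "gizmo"}  # tiny dimension

# Horizontally partition the fact table by hashing a key column.
nodes = [[] for _ in range(NUM_NODES)]
for row in fact:
    nodes[hash(row[0]) % NUM_NODES].append(row)

# Every node holds a full copy of dim_product, so each join is local.
def local_join(fact_slice, dim):
    return [(sid, dim[pid], amt) for sid, pid, amt in fact_slice]

# The final result is just the union of the per-node local results.
result = [r for fact_slice in nodes for r in local_join(fact_slice, dim_product)]
print(sorted(result))
```

Because every node already holds the dimension data it needs, the join parallelizes perfectly: the only cross-node step is concatenating results.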
There are occasional non-snowflake schemas. For example, one customer insisted on running Vertica on the identical schema he was using for his current row store. To help his row store perform better, he had split his fact table in half* and used what we call a “barbell” schema, and he did not want to go to the effort of changing that design. Obviously, there is no penalty for “fat” fact tables in a column store, so the customer’s kludge was row-store specific. Nevertheless, he wanted to use his existing schema.
*Ed: I assume this means that different columns were stored in different tables, rather than that the rows were split up. Otherwise it wouldn’t really make sense.
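Ed: the “no penalty for fat fact tables” point follows from column-at-a-time storage: a query’s scan cost depends on the columns it touches, not on the table’s width. A toy sketch, with invented sizes and column names:

```python
# Toy column store: each column is stored separately, so a query that
# touches two columns never reads the other 48. Sizes are invented.
import random

NUM_ROWS, NUM_COLS = 1_000, 50
table = {f"c{i}": [random.random() for _ in range(NUM_ROWS)]
         for i in range(NUM_COLS)}

def scan_cost(columns):
    """Number of values actually read to answer a query over `columns`."""
    return sum(len(table[c]) for c in columns)

narrow = scan_cost(["c0", "c1"])   # what a column store reads
fat = scan_cost(table.keys())      # what a row store effectively reads
print(narrow, fat)
```

Splitting a fat fact table in two helps a row store because each row gets shorter; in a column store the unused columns were never being read in the first place.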
For this reason, we are now extending the Vertica engine to support non-star/snowflake schemas, and will deliver this functionality in a few months.
In a similar vein, the overwhelming percentage of the schemas we see have an enormous fact table and (by comparison) tiny dimension tables (made even tinier by Vertica’s aggressive use of data compression). Therefore, replicating the dimension tables to improve query performance is the prudent thing to do — pay a little space to get vastly better query performance.
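Ed: a back-of-the-envelope version of that trade-off, with invented sizes (a 10 TB fact table, 2 GB of compressed dimension tables, 20 nodes):

```python
# Space cost of replicating dimension tables to every node.
# All sizes are invented for illustration.

fact_gb = 10_000           # 10 TB fact table, partitioned (stored once total)
dims_gb = 2                # all dimension tables combined, after compression
num_nodes = 20

partitioned_total = fact_gb              # fact table is split, not copied
replicated_total = dims_gb * num_nodes   # one full dimension copy per node

overhead = replicated_total / (partitioned_total + replicated_total)
print(f"replication overhead: {overhead:.2%}")
```

Under these made-up numbers, replication costs well under one percent of total space, in exchange for every join against a dimension table being node-local.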
According to our customers, the goal is superior performance on star/snowflake schemas, and that is what Vertica delivers.
Author: Curt Monash. Originally published 2007-10-23 on Curt Monash’s blog.