Vertica Systems
Analysis of columnar data warehouse DBMS vendor Vertica Systems. Related subjects include:
How immediate consistency works
This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.
Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:
- Two-phase commit (2PC)
- Read-your-writes (RYW) consistency
- Prudent optimism 🙂
I shall explain.
Two-phase commit has been around for decades. Its core idea is:
- One node commands other nodes (and perhaps itself) to write data.
- The other nodes all reply “Aye, aye; we are ready and able to do that.”
- The first node broadcasts “Make it so!”
Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.
But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC: Read more
Categories: Aster Data, Clustering, Hadoop, HBase, IBM and DB2, Netezza, NoSQL, Teradata, Vertica Systems | 11 Comments |
In-database analytics — analytic glossary draft entry
This is a draft entry for the DBMS2 analytic glossary. Please comment with any ideas you have for its improvement!
Note: Words and phrases in italics will be linked to other entries when the glossary is complete.
“In-database analytics” is a catch-all term for analytic capabilities, beyond standard SQL, running on the same machine as and under the management of an analytic DBMS. These can run in one or both of two modes:
- In-process or unfenced, i.e. in the same process as the DBMS itself. This option gives maximum performance, but any defects in the analytic code may crash the whole DBMS. Also, it generally requires that the code be in the same language as the DBMS, i.e. C++.
- Out-of-process or fenced, i.e. in a separate process. This option sacrifices performance, in favor of reliability and language flexibility.
In-database analytics may offer great performance and scalability advantages versus the alternative of extracting data and having it be processed on a separate server. This is particularly likely to be the case in MPP (Massively Parallel Processing) analytic DBMS environments.
Examples of in-database analytics include:
- Creating temporary data structures that persist past the life of a query.
- Creating temporary data structures that are non-tabular.
- Predictive modeling that uses all the same nodes in an MPP cluster where the data resides.
- Predictive analytics (scoring only).
Other common domains for in-database analytics include sessionization, time series analysis, and relationship analytics.
Notable products offering in-database analytics include:
- Teradata Aster SQL/MR.
- Multiple other analytic platforms, such as Sybase IQ, Vertica, or IBM Netezza. Indeed, in-database analytics are a defining feature of analytic platforms.
- Fuzzy Logix (for predictive analytics).
Categories: Analytic glossary, Aster Data, Data warehousing, IBM and DB2, MapReduce, Netezza, Parallelization, Predictive modeling and advanced analytics, Sybase, Teradata, Vertica Systems | 8 Comments |
Analytic platform — analytic glossary draft entry
This is a draft entry for the DBMS2 analytic glossary. Please comment with any ideas you have for its improvement!
Note: Words and phrases in italics will be linked to other entries when the glossary is complete.
In our usage, an “analytic platform” is an analytic DBMS with well-integrated in-database analytics, or a data warehouse appliance that includes one. The term is also sometimes used to refer to:
- Any analytic DBMS or data warehouse appliance.
- Other kinds of software, or software/hardware combination, that support broad analytic capabilities.
To varying extents, most major vendors of analytic DBMS or data warehouse appliances have extended their products into analytic platforms; see, for example, our original coverage of analytic platform versions of as Aster, Netezza, or Vertica.
Related posts
- Our original definition of “analytic platform” (February, 2011)
- Our original feature list for analytic platforms (January, 2011)
Categories: Analytic glossary, Aster Data, Data warehouse appliances, Data warehousing, Netezza, Vertica Systems | 3 Comments |
Notes on some basic database terminology
In a call Monday with a prominent company, I was told:
- Teradata, Netezza, Greenplum and Vertica aren’t relational.
- Teradata, Netezza, Greenplum and Vertica are all data warehouse appliances.
That, to put it mildly, is not accurate. So I shall try, yet again, to set the record straight.
In an industry where people often call a DBMS just a “database” — so that a database is something that manages a database! — one may wonder why I bother. Anyhow …
1. The products commonly known as Oracle, Exadata, DB2, Sybase, SQL Server, Teradata, Sybase IQ, Netezza, Vertica, Greenplum, Aster, Infobright, SAND, ParAccel, Exasol, Kognitio et al. all either are or incorporate relational database management systems, aka RDBMS or relational DBMS.
2. In principle, there can be difficulties in judging whether or not a DBMS is “relational”. In practice, those difficulties don’t arise — yet. Every significant DBMS still falls into one of two categories:
- Relational:
- Was designed to do relational stuff* from the get-go, even if it now does other things too.
- Supports a lot of SQL.
- Non-relational:
- Was designed primarily to do non-relational things.*
- Doesn’t support all that much SQL.
*I expect the distinction to get more confusing soon, at which point I’ll adopt terms more precise than “relational things” and “relational stuff”.
3. There are two chief kinds of relational DBMS: Read more
Some Vertica 6 features
Vertica 6 was recently announced, and so it seemed like a good time to catch up on Vertica features. The main topics I want to address are:
- External tables and the associated new Hadoop connector.
- Online schema evolution.
- Workload management.
Also:
- I have some tidbits to add to my June, 2011 coverage of Vertica’s analytic functionality.
- I’ll stand for now on my previous coverage of Vertica’s database organization.
In general, the main themes of Vertica 6 appear to be:
- Enterprise/SaaS-friendliness, high uptime, and so on.
- Improved analytic usefulness.
Let’s do the analytic functionality first. Notes on that include:
- Vertica has extended its user-defined function/analytic procedure/whatever functionality to include user-defined load. (Same SDK, different specific classes.)
- One of the languages Vertica supports is R. But for now, parallel R is limited to “Of course, you can run the same functions and procedures on many nodes at once.”
- Based on community activity around bugs and so on, it seems there are users for Vertica’s JSON-based Twitter sentiment analysis plug-in.
I’ll also take this opportunity to expand on something I wrote about a few vendors — including Vertica — at the end of my post on approximate query results. When I probed how customers of Vertica and other RDBMS-based analytic platform vendors used vendor-proprietary advanced analytic SQL and other analytic capabilities, answers included: Read more
The eternal bogosity of performance marketing
Chris Kanaracus uncovered a case of Oracle actually pulling an ad after having been found “guilty” of false advertising. The essence seems to be that Oracle claimed 20X hardware performance vs. IBM, based on a comparison done against 6 year old hardware running an earlier version of the Oracle DBMS. My quotes in the article were:
- “Everybody’s guilty of that kind of exaggeration.”
- “Oracle tends to be even a little guiltier than others.”
- “If your new system can’t outperform somebody else’s old system by a huge factor on at least some queries, you’re doing something wrong.”
- “Use newer, better hardware; use newer, better software; have a top sales engineer do a great job of tuning it and of course you’ll see huge performance results.”
Another example of Oracle exaggeration was around the Exadata replacement of Teradata at Softbank. But the bogosity flows both ways. Netezza used to make a flat claim of 50X better performance than Oracle, while Vertica’s standard press release boilerplate long boasted
50x-1000x faster performance at 30% the cost of traditional solutions
Of course, reality is a lot more complicated. Even if you assume apples-to-apples comparisons in terms of hardware and software versions, performance comparisons can vary greatly depending upon queries, databases, or use cases. For example:
- Many queries are inherently much faster over columnar storage than over row-based.
- Different data sets respond very differently to various compression algorithms.
- Some analytic RDBMS can maintain strong performance at high levels of concurrent usage. Some can’t.
- Some queries that run very fast on one DBMS without tuning might require careful tuning in another system.
- Some DBMS scale out much better than others.
- Vendors optimize for different usage assumptions, which may or may not apply in your particular case.
And so, vendor marketing claims about across-the-board performance should be viewed with the utmost of suspicion.
Related links
Categories: Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Exadata, Netezza, Oracle, Vertica Systems | Leave a Comment |
A data distribution idea at Vertica and Clustrix
Yesterday I wrote:
Clustrix has one cool idea I haven’t heard from anybody else, which I’m calling index distribution. The idea is that each index can be distributed differently across the cluster … i.e. on different distribution keys. Clustrix thinks that paying special attention to index distribution and movement is helpful to the performance of distributed joins.
While that’s true, I thought I’d heard something similar from Vertica; so I checked, and indeed I had. Vertica famously lets you store columns in different sort orders, in both reasonable senses:
- Different columns in a table can be sorted in different ways.
- A single column, which is stored multiple times for usual reasons of replication safety, can be sorted differently in its different copies.
It turns out those columns can also be distributed on different keys as well.
Related link
- Vertica projections explained at length (September, 2011)
Categories: Clustering, Clustrix, Columnar database management, Vertica Systems | 5 Comments |
Approximate query results
In theory:
- A database query is a predicate.
- A DBMS matches the data it manages against the predicate and send back those records for which the predicate is true.
And so it would seem that query results always have to be exact. Even so, there are at least four different practical scenarios in which query results can reasonably be regarded as approximate, each associated with query languages that can supersede standard set-theoretic SQL.
Actually, there’s a fifth, and it’s a huge one — some fraction of your data is just plain wrong. But that’s not what this post is about.
First, some queries don’t have binary results, even in principle. Notably, text queries are answered via relevancy rankings, which fit badly into the relational model.
Second — and this can be combined with the first — you might want to generalize the query to look for partial matches. For example, Yarcdata suggested to me a scenario in which:
- You do a SPARQL query.
- You modify the query to accept results higher up in the taxonomy. (Which is likely to be possible, because where there’s SPARQL, there’s apt to be a taxonomy as well.) For example, if you really want to query on two people living in the house, you might extend the query to cover two people connected by any kind of address or building.
Similarly, if you’re looking for geographic proximity, it’s common to extend the allowed radius to fish for more results. Or one can walk up the hierarchy in a dimensional model.
Third, sometimes you just don’t have the data for any kind of precise answer at all. One adaptation I’ve mentioned before is to interpolate time series with synthetic data, and send back “precise” results based on that. In the same post I mentioned the Vertica “range join”, wherein users deliberately throw away part of their data — only storing the range it was in — and then join accordingly.
As Donald Rumsfeld might have said — and would have done well to reflect upon — you go into decision-making with the data you have, not the data you wish you had.
Finally, sometimes there’s a precise answer in principle, but for performance reasons you accept an approximate one, at least to start with. Numerous companies have told me stories around this, including:
- Infobright, whose “Rough Query” gives fast approximate results to a broad range of queries.
- Metamarkets, which does fast cardinality estimates via HyperLogLog.
- Aster Data, which was the first company to point out to me that median, decile, quintile, and so on calculations are a lot faster in a shared-nothing setting if you’re willing to settle for approximate results.
The latter two categories led me to ask vendors how customers actually make use of their exotic SQL capabilities. Answers boiled down to:
- (Always) Well, there’s a lot of custom coding.
- (Sometimes) We’re working with partner BI vendors to make direct use of the capabilities, but that’s not done yet, so it’s too early to talk about any details.
Perhaps the answers will never get much better; it’s tough to get packaged software vendors to support vendor-specific SQL, unless the vendor is Oracle. Even so, we’re seeing ever more ways in which conventional SQL DBMS are being superseded by data management and analytic alternatives.
Categories: Aster Data, Business intelligence, Data models and architecture, Data warehousing, Database compression, Infobright, Text, Vertica Systems, Yarcdata and Cray | 3 Comments |
Our clients, and where they are located
From time to time, I disclose our vendor client lists. Another iteration is below, the first since a little over a year ago. To be clear:
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
Hope for a new PostgreSQL era?
In a comedy of briefing errors, I’m not too clear on the details of my client salesforce.com’s new PostgreSQL-as-a-service offering, nor exactly on what my clients at VMware are bringing to the PostgreSQL virtualization/cloud party. That said:
- PostgreSQL is good technology.
- MySQL is narrowing the gap, but PostgreSQL is still ahead of MySQL in some ways. (Database extensibility if nothing else.)
- PostgreSQL has a lot of users. (Many of them in academia and/or Russia.)
- Neither EnterpriseDB (which now calls itself “The enterprise PostgreSQL company”) nor the PostgreSQL community leadership have covered themselves with stewardship glory.
- A significant number of interesting DBMS products can be regarded as PostgreSQL forks (e.g. Greenplum, Aster Data nCluster, Netezza if you squint, and Vertica if you stand on your head*).
- PostgreSQL advancement is not dead. For example, Hadapt beta users are running actual PostgreSQL on many nodes each.
- There’s no assurance that Oracle will be a benevolent MySQL steward forever. (Specifically, Oracle’s “Play nicely with others” antitrust commitments expire in 2014.)
So I think it would be cool if one or the other big company put significant wood behind the PostgreSQL arrow.
*While Vertica was originally released using little or no PostgreSQL code — reports varied — it featured high degrees of PostgreSQL compatibility.