Vertica 7
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
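For the curious, here is a minimal illustration of what approximate distinct counting looks like in Vertica SQL. The table and column names are hypothetical; APPROXIMATE_COUNT_DISTINCT is the function name I understand Vertica to use for this.

```sql
-- Hypothetical web-events table: exact vs. approximate distinct visitor counts.
-- The approximate version trades a small, bounded error for much less memory and time.
SELECT
    COUNT(DISTINCT visitor_id)             AS exact_visitors,
    APPROXIMATE_COUNT_DISTINCT(visitor_id) AS approx_visitors
FROM web_events
WHERE event_date >= '2013-12-01';
```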
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
Also, be warned that there are two entirely different key-value things going on in Vertica 7. I was pretty confused until I realized that.
Vertica Flex Zone basics include:
- Flex Zone is targeted at data that originates in, for example, a log file or a NoSQL DBMS.
- Flex Zone data can be stored in a Vertica construct called Flex Tables. It can also be accessed externally, but then of course performance is hampered by what amounts to a load operation for each query.
- Flex Zone data is always in a map datatype (i.e., lots of key-value pairs).
- Vertica automagically creates virtual columns on Flex Zone data. Virtual columns can be accessed (read-only) by SQL in the usual way, DML and DDL alike (Data Manipulation/Definition Language). So in particular, business intelligence tools treat virtual columns the same way they’d treat real ones.
- Flex Zone virtual columns can drill into nested data structures.
- If you retrieve a virtual column, you retrieve the rest of the record (log entry, JSON document, whatever) with it. However …
- … virtual column data can be copied into ordinary Vertica physical columns with no change in SQL access; Vertica will redirect queries for you appropriately. At that point you get customary Vertica performance. (A SQL sketch of the whole flow appears after this list.)
- Vertica (the HP division) points out that loading data into Flex Zone can be much faster than loading it into Vertica Classic.
- Vertica (the product), which is priced by data volume, costs much less for Flex Zone than for standard columnar data.
Basically, Flex Zone is meant to be (among other things) a big bit bucket, perhaps in some cases obviating the need for Hadoop to play the same role.
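To make the Flex Zone flow above concrete, here is a hedged sketch in Vertica SQL. The table name, file path, and column names are hypothetical, and while the flex-table statements and helper functions (fjsonparser, COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW, MATERIALIZE_FLEXTABLE_COLUMNS) are the ones I understand Vertica 7 to provide, treat the details as illustrative rather than definitive.

```sql
-- Create a flex table; no schema is declared up front (schema-on-need).
CREATE FLEX TABLE click_events();

-- Load raw JSON; each document's fields land in a map column as key-value pairs.
COPY click_events FROM '/data/clicks.json' PARSER fjsonparser();

-- Ask Vertica to infer the keys and expose them as virtual columns via a view.
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('click_events');

-- Query virtual columns with ordinary SQL, as a BI tool would.
SELECT user_id, COUNT(*) AS clicks
FROM click_events_view
GROUP BY user_id;

-- Promote hot virtual columns to real columnar storage for full Vertica performance;
-- existing queries keep working unchanged.
SELECT MATERIALIZE_FLEXTABLE_COLUMNS('click_events');
```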
I have less detail on the new short-request query executor, but I gather that:
- It assumes a query can be resolved on a single Vertica node. (Paradigmatic example: single-row lookup.)
- It involves client code that predicts which Vertica node can resolve the query.
- It has a key-value style interface, even though …
- … what is sent to the Vertica cluster is SQL.
- A SQL interface is planned.
I assume this will eventually evolve to the point that you can join a small, broadcasted dimension table to a single node’s portion of a fact table, but Vertica hasn’t actually told me that that kind of functionality is in the works.
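I don’t have details of the key-value interface’s API, so what follows is only a hedged illustration, in plain SQL, of the kind of query the new execution path is built for; the table and key names are hypothetical.

```sql
-- The paradigmatic single-node case: a point lookup on the column the fact table
-- is segmented by, so one node can answer without bothering the rest of the cluster.
SELECT status, last_updated
FROM orders
WHERE order_id = 1234567;

-- The kind of thing I speculate might come later: joining a small, replicated
-- dimension table to one node's slice of the fact table.
SELECT o.status, c.region
FROM orders o
JOIN customers_dim c ON c.customer_id = o.customer_id
WHERE o.order_id = 1234567;
```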
Finally, and as is appropriate for a whole-number release, Vertica 7 has a lot of different performance enhancements, in loads, joins, and more. In particular, workload management has been extended from covering just RAM (which is usually Vertica’s scarcest commodity anyhow) to, in a limited sense, CPU as well. Specifically, queries can be “pinned” to specific cores, which for example lets short-request workloads be isolated from their longer-running brethren.
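As I understand it, the CPU side of this is exposed through resource pools. A hedged sketch, with hypothetical pool names and core ranges, and assuming the CPUAFFINITYSET/CPUAFFINITYMODE parameters I believe Vertica 7 added:

```sql
-- Reserve cores 0-3 exclusively for a short-request pool, so quick lookups
-- aren't starved by long-running analytic queries.
CREATE RESOURCE POOL short_requests
    CPUAFFINITYSET '0-3'
    CPUAFFINITYMODE EXCLUSIVE;

-- Route a short-request session to that pool.
SET SESSION RESOURCE_POOL = short_requests;
```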
Related link
- In-database analytics were first added in Vertica 5.
Comments
Thanks for the insightful post.
Have you heard anything about their “SQL-on-Hadoop” offering? I’m a bit skeptical about whether it’s really SQL-on-Hadoop. Do they store their data in HDFS? Do they use Hadoop nodes to do the processing? Obviously not via MapReduce, but do they, like Impala or Presto, still really run on Hadoop? Or do they just have a connection to Hadoop and do everything on their own nodes?
That’s a big difference in my opinion. The first option will slow down Vertica because of the way HDFS is built. The second option is not really SQL-on-Hadoop.
Vertica does its query execution on its own nodes. I don’t like the label “SQL-on-Hadoop” for that. I’d rather call it SQL/Hadoop integration, of which SQL-on-Hadoop is a particular kind that Vertica doesn’t happen to offer.
E.g. http://www.dbms2.com/2012/10/17/hadooprdbms-integration-aster-sql-h-and-hadapt/