Actian Vector Hadoop Edition
I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.
That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call. 🙂
In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.
Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,
… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.
and
Actian … is now “one company, with one voice and one platform” according to its John Santaferraro
The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.
All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:
- Vectorwise, while being proudly multi-core, was previously single-server. The new Vector Hadoop Edition is the first version with node parallelism.
- Actian’s Vector Hadoop Edition uses HDFS (Hadoop Distributed File System) and YARN to manage an Actian-proprietary file format. There is currently no interoperability whereby Hadoop jobs can read these files. However …
- … Actian’s Vector Hadoop Edition relies on Hadoop for cluster management, workload management and so on.
- Peter thinks there are two paying customers, both too recent to be in production, who between them paid what I’d call a remarkable amount of money.*
- Roadmap futures* include:
  - Being able to update and indeed trickle-update data. Peter is very proud of Vectorwise’s Positional Delta Tree updating.
  - Some elasticity they’re proud of, both in terms of nodes (generally limited to the replication factor of 3) and cores (not so limited).
  - Better interoperability with Hadoop.
Actian actually bundles Vector Hadoop Edition with DataFlow — the old Pervasive DataRush — into what it calls “Actian Analytics Platform – Hadoop SQL Edition”. DataFlow/DataRush has been working over Hadoop since the latter part of 2012, based on a visit with my then clients at Pervasive that December.
*Peter gave me details about revenue, pipeline, roadmap timetables etc. that I’m redacting in case Actian wouldn’t like them shared. I should say that the timetable for some — not all — of the roadmap items was quite near-term; however, pay no attention to any phrasing in Peter’s blog post that suggests the roadmap features are already shipping.
The Actian Vector Hadoop Edition optimizer and query-planning story goes something like this:
- Vectorwise started with the open-source Ingres optimizer. After a query is optimized, it is rewritten to reflect Vectorwise’s columnar architecture. Peter notes that these rewrites rarely change operator ordering; they just add column-specific optimizations, whatever that means.
- Now there are rewrites for parallelism as well.
- These rewrites all seem to be heuristic/rule-based rather than cost-based.
- Once Vectorwise became part of the Ingres company (later renamed to Actian), they had help from Ingres engineers, who helped them modify the base optimizer so that it wasn’t just the “stock” Ingres one.
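To make the "heuristic/rule-based rather than cost-based" distinction concrete, here is a minimal sketch of a rewrite that fires unconditionally on a plan shape and leaves operator ordering alone. It is not Actian's rewriter; the `Op` class, the `Exchange` operator name and the `add_parallelism` rule are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """A node in a toy query plan (hypothetical structure)."""
    name: str
    children: List["Op"] = field(default_factory=list)


def add_parallelism(plan: Op) -> Op:
    """Rule-based rewrite: wrap every Scan in an 'Exchange' operator.

    The rule fires whenever it sees a Scan. No cost estimate is consulted,
    which is what makes it heuristic rather than cost-based.
    """
    children = [add_parallelism(child) for child in plan.children]
    node = Op(plan.name, children)
    if plan.name == "Scan":
        return Op("Exchange", [node])
    return node


def operator_order(plan: Op) -> List[str]:
    """Pre-order list of operator names, used to compare plans."""
    names = [plan.name]
    for child in plan.children:
        names.extend(operator_order(child))
    return names


plan = Op("Aggregate", [Op("Join", [Op("Scan"), Op("Scan")])])
rewritten = add_parallelism(plan)
```

In this toy case the rewritten plan gains `Exchange` nodes, but the relative order of the original operators is unchanged, which matches Peter's description of rewrites that add to a plan rather than reorder it.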
As with most modern MPP (Massively Parallel Processing) analytic RDBMS, there doesn’t seem to be any concept of a head-node to which intermediate results need to be shipped. This is good, because head nodes in early MPP analytic RDBMS were dreadful bottlenecks.
Peter and I also talked a bit about SQL-oriented HDFS file formats, such as Parquet and ORC. He doesn’t like their lack of support for columnar compression. Further, in Parquet there seems to be a requirement to read the whole file, to an extent that interferes with Vectorwise’s form of data skipping, which it calls “min-max indexing”.
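For readers unfamiliar with min-max indexing, the general idea can be sketched as follows. This is a generic illustration of the technique, not Vectorwise's implementation; the `Block` class, the function names and the block size are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus its smallest and largest value."""
    values: List[int]
    lo: int
    hi: int


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and the number of blocks actually read.

    A block whose [lo, hi] interval cannot overlap the predicate is skipped
    without looking at its values.
    """
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # skipped: no value in this block can qualify
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read


column = list(range(100))            # sorted data, the best case for skipping
blocks = build_blocks(column, 10)
matches, blocks_read = scan_range(blocks, 25, 34)
```

With sorted or clustered data, only the blocks overlapping the predicate get read; the benefit shrinks as values become more randomly distributed across blocks. Skipping of this kind only pays off if the file format lets a reader fetch individual blocks rather than the whole file, which is the point at issue in the Parquet discussion.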
Frankly, I don’t think the architectural choice “uses Hadoop for workload management and administration” provides a lot of customer benefit in this case. Given that, I don’t know that the world needs another immature MPP analytic RDBMS. I also note with concern that Actian has two different MPP analytic RDBMS products. Still, Vectorwise and indeed all the stuff that comes out of Martin Kersten and Peter’s group in Amsterdam has always been interesting technology. So the Actian Vector Hadoop Edition might be worth taking a look at before you redirect your attention to products with more convincing track records and futures.
Comments
4 Responses to “Actian Vector Hadoop Edition”
Hi Curt,
Thanks for the write-up! Two minor comments.
In response to the last paragraph, I want to note that Vector(wise) has now been a decade in development, and has been in production for years at more than 100 customers, and I would almost start to call it “mature”. Sure, the Hadoop Edition is new and brings new challenges, but it is not starting from zero.
Final point is that I would like to advertise that Vector Hadoop Edition in my opinion is really really fast – I think it is the fastest SQL-on-Hadoop system out there by some margin.
It beats Impala on its own benchmark of choice by a factor of 14 on the same hardware. If you are interested in reading more on the performance side, I just published an article on that on the little blog I have started with Thomas Neumann of TUM (databasearchitects.blogspot.com):
http://databasearchitects.blogspot.com/2014/08/tpc-ds-with-vector-hadoop-edition.html
enjoy!
Peter Boncz
Curt,
Regarding your “concern that Actian has two different MPP analytic RDBMS products” …
… The way it was explained by Actian at their Boulder Brains Trust briefing was that ParAccel/Matrix is an additional purchase (the “Extreme Performance Edition”) if you need “low latency, very high performance analytics”.
Should we deduce that this other new thing has high latency and so-so performance that only the purchase of another product will mitigate?
Graham,
I wouldn’t take any of those positioning statements too seriously from a company that wants to pretend totally different products are somehow the same thing. This was a disaster when Informix pushed the “one code line” inaccuracy (that’s a euphemism) in the 1990s, and it’s unlikely to work well for Actian now.
Regarding this part:
“Further, in Parquet there seems to be a requirement to read the whole file”
This is incorrect. One of the main goals is to read columns independently so it would be silly to require reading the entire file.
I’m happy to clarify any point that needs it.
Parquet already implements data skipping based on min and max with predicate push down. Further improvements are coming. Building additional indexes around Parquet should be straightforward.