Vertica’s version of MapReduce integration
I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:
- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
  - Cloudera’s DBInputFormat assumes the database runs on a single computer, or on a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, so Vertica adapted the DBInputFormat technology accordingly.
  - Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and those who don’t.
  - Vertica lets you run Reduce functions (or Map functions, if you don’t push them down to the database) on a different cluster from the one running the database software. Vertica asserts that its customers and prospects all want to do this. This is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
  - Indeed, Vertica lets you Reduce into a DBMS other than Vertica, if you choose.
  - Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there are limits on how fast one can add or subtract nodes in a Vertica grid, because data redistribution is involved. But one can add/change/delete Hadoop clusters extremely quickly.
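To make the head-node point concrete, here is a rough sketch of the difference between planning map-task inputs against a single head node and planning them against headless peer nodes. This is not Vertica’s or Cloudera’s actual code; `InputSplit`, `plan_splits_single_head`, `plan_splits_peer`, and the host names are all hypothetical, and round-robin assignment is just one assumed way to spread connections.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InputSplit:
    """Hypothetical description of the work handed to one map task."""
    host: str      # database node the map task would connect to
    segment: int   # which slice of the query result this task reads
    total: int     # total number of slices


def plan_splits_single_head(head: str, num_splits: int) -> List[InputSplit]:
    # Single-node assumption: every map task connects to the same host.
    return [InputSplit(head, i, num_splits) for i in range(num_splits)]


def plan_splits_peer(hosts: List[str], num_splits: int) -> List[InputSplit]:
    # Peer-parallel assumption: there is no head node, so connections
    # are spread round-robin over all database nodes.
    if not hosts:
        raise ValueError("need at least one database host")
    return [
        InputSplit(hosts[i % len(hosts)], i, num_splits)
        for i in range(num_splits)
    ]


if __name__ == "__main__":
    for split in plan_splits_peer(["node1", "node2", "node3"], 5):
        print(split)
```

In the single-head plan every map task funnels through one machine; in the peer plan the same five tasks are spread over three nodes.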
Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically:
- One or more Vertica customers are using MapReduce in production to do relatively simple transforms of web log data.
- Vertica customers are experimenting with — but have not yet put into production — more sophisticated pattern analysis of web log data.
- Financial services customers are using MapReduce for a lot of experimentation in discovering new algorithms. The idea is that DBMS/MapReduce integration offers rapid prototyping of algorithmic ideas. Those that pan out are then reimplemented for production, presumably in some kind of CEP (Complex Event Processing) system. These users seem to be ones that are pushing down a lot of Map functions to the Vertica DBMS.
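For readers unfamiliar with what “pushing down a Map function” means in general, here is a toy illustration. It uses Python’s built-in `sqlite3` module as a stand-in database, not Vertica, and the `weblog` table, its columns, and all function names are invented for the example. The same filter-and-project Map step is run once outside the database and once inside it as SQL; the Reduce step is identical either way.

```python
import sqlite3
from collections import defaultdict


def build_demo_db():
    # Stand-in database with a made-up web log table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weblog (url TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany(
        "INSERT INTO weblog VALUES (?, ?, ?)",
        [("/home", 200, 512), ("/home", 404, 0),
         ("/buy", 200, 2048), ("/buy", 200, 1024)],
    )
    return conn


def map_outside_db(conn):
    # Map runs outside the DBMS: every raw row is shipped out, then filtered.
    rows = conn.execute("SELECT url, status, bytes FROM weblog")
    return [(url, nbytes) for url, status, nbytes in rows if status == 200]


def map_pushed_down(conn):
    # Map is pushed down: the DBMS filters and projects, so only
    # the (key, value) pairs leave the database.
    rows = conn.execute("SELECT url, bytes FROM weblog WHERE status = 200")
    return list(rows)


def reduce_sum(pairs):
    # Reduce step: total bytes per URL.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    conn = build_demo_db()
    print(reduce_sum(map_outside_db(conn)))
    print(reduce_sum(map_pushed_down(conn)))
```

Both paths produce the same totals; the difference is how much data has to leave the database before the Reduce step sees it.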
By the way, Vertica is based on C-Store, the Ph.D. thesis project of Daniel Abadi, who recently wrote:
To me, it is far more efficient from a performance and a “green” perspective to push the computation to the data. Hence, I am not a fan of decoupling the compute grid and the data grid.
Not coincidentally, Daniel also recently wrote that
If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.
The VectorWise guys also told me they are looking forward to seeing how the two projects work together.
Comments
5 Responses to “Vertica’s version of MapReduce integration”
One clarification regarding compute/data locality. MR necessarily has a data re-distribution phase prior to reduce (unless data is distributed by map key on load). When pushing the map down to Vertica there is no more data shuffling beyond what any other MR requires. You do get the added flexibility of being able to reduce on a different collection of nodes.
I agree with Omer’s clarification.
Also, just for the record, it’s probably giving me too much credit to say that C-Store was my PhD thesis. My thesis involved research behind building the query execution engine for C-Store, but the C-Store project was much bigger than just the work that I did.
[…] MapReduce support, but with a difference. Unlike Greenplum and Aster, who are bringing it into the database itself, Vertica is providing a streaming connection to Hadoop instances (the open source implementation of MapReduce; Vertica is contributing the adapter to the community). This architecture mirrors usage patterns we’ve seen, and which Vertica asserts its customers have told them they want. One scenario: use your ADBMS to retrieve stored data, pass it to Hadoop for analysis by staff with different skill sets from the typical ADBMS users, and then bring result sets back. A separate hardware for the Hadoop sandbox is fairly typical among early adopters today, and via a Cloudera partnership, Vertica can offer a deployment architecture that doesn’t break the bank. Curt Monash does the usual excellent summary of Hadoop issues in his blog. […]
[…] (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) Vertica/Hadoop usage seems to have started in Vertica’s financial services stronghold — specifically […]
[…] TechWorld.com ran a story that I thought was interesting and closer to the truth about the relational databases and big data. “Sybase Embraces Google MapReduce” runs down a number of data management companies expressing interest in one of Google’s earlier innovations. One comment worth noting in my opinion was: Relational database pioneer Michael Stonebraker co-authored a paper earlier this year contending that SQL technology still beats MapReduce in most cases. But that conclusion didn’t stop Vertica Systems, the startup where he serves as CTO, from adding Hadoop functionality to its new Vertica 3.5 database. […]