August 4, 2009

Vertica’s version of MapReduce integration

I talked with Omer Trajman of Vertica Monday night about Vertica’s MapReduce integration, part of its Vertica 3.5 release. Highlights included:

- By “integrating Vertica and MapReduce,” Vertica means “integrating Vertica and Hadoop.”
- Vertica’s Hadoop integration is based on Cloudera’s DBInputFormat.
- Omer called out for me several features of Vertica’s Hadoop integration that didn’t just come from Cloudera, namely:
  - Cloudera’s DBInputFormat assumes the database runs on a single computer, or a single head node of an MPP system. Vertica’s technology, however, runs on peer parallel nodes with no head, and so Vertica adapted the DBInputFormat technology accordingly.
  - Vertica lets you push down Map functions to the database. Omer reports a roughly even division among users and prospects between those who want to do this and those who don’t.
  - Vertica lets you run Reduce functions (or Map functions, if you don’t push them down to the database) on a separate cluster from the one running the database software. Vertica asserts that its customers and prospects all want to do this. Right here is the big difference between Vertica’s MapReduce integration and Aster’s or Greenplum’s. (Aster would also say that Vertica’s weaker MapReduce/SQL programming integration is a big difference as well.)
  - Indeed, Vertica lets you Reduce into a DBMS other than Vertica, if you choose.
  - Vertica gives you flexibility on the size of the Map and Reduce clusters. Omer agreed with me when I said there were some limits on how fast one can add or subtract nodes in a Vertica grid, because there’s data redistribution involved. But one can add/change/delete Hadoop clusters extremely quickly.
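
To make the push-down idea concrete, here is a minimal sketch of the two ways to place the Map step, followed by a shuffle and a Reduce that could run on a separate set of nodes. Everything in it is an assumption for illustration: `sqlite3` stands in for the database only so the example runs anywhere, the `weblog` table and its columns are made up, and the function names are hypothetical. None of this is Vertica's or Cloudera's actual API.

```python
import sqlite3
import zlib
from collections import defaultdict


def load_sample_db():
    """Build a tiny in-memory table; sqlite3 is only a stand-in database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weblog (url TEXT, status INTEGER, bytes INTEGER)")
    conn.executemany(
        "INSERT INTO weblog VALUES (?, ?, ?)",
        [
            ("/home", 200, 512),
            ("/home", 200, 256),
            ("/cart", 500, 0),
            ("/cart", 200, 128),
            ("/help", 404, 0),
        ],
    )
    return conn


def map_in_database(conn):
    """Map pushed down: the filter and projection run as SQL in the database."""
    return conn.execute(
        "SELECT url, bytes FROM weblog WHERE status = 200"
    ).fetchall()


def map_outside_database(conn):
    """Map not pushed down: raw rows leave the database, then get filtered."""
    rows = conn.execute("SELECT url, status, bytes FROM weblog").fetchall()
    return [(url, nbytes) for url, status, nbytes in rows if status == 200]


def shuffle(pairs, num_reducers):
    """Partition by key so all values for one key reach the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return partitions


def reduce_partition(partition):
    """Reduce: total bytes served per URL within one partition."""
    return {key: sum(values) for key, values in partition.items()}


def run(num_reducers=2, push_down=True):
    conn = load_sample_db()
    pairs = map_in_database(conn) if push_down else map_outside_database(conn)
    totals = {}
    for partition in shuffle(pairs, num_reducers):
        # Each partition could be handled by a different node.
        totals.update(reduce_partition(partition))
    return totals
```

Both placements of the Map step give the same totals; what changes is where the filtering work happens and how many rows leave the database. The number of reducers is just a parameter here, which loosely mirrors the point about sizing the Reduce cluster independently of the database grid.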

Apparently, the use cases for Vertica/Hadoop integration to date lie in algorithmic trading and two kinds of web analytics. Specifically:

- One or more Vertica customers are using MapReduce in production to do relatively simple transforms of web log data.
- Vertica customers are experimenting with, but have not yet put into production, more sophisticated pattern analysis of web log data.
- Financial services customers are using MapReduce for a lot of experimentation in discovering new algorithms. The idea is that DBMS/MapReduce integration offers rapid prototyping of algorithmic ideas. Those that pan out are then reimplemented for production, presumably in some kind of CEP (Complex Event Processing) system. These users seem to be the ones pushing down a lot of Map functions to the Vertica DBMS.

By the way, Vertica is based on C-Store, the Ph.D. thesis project of Daniel Abadi, who recently wrote:

To me, it is far more efficient from a performance and a “green” perspective to push the computation to the data. Hence, I am not a fan of decoupling the compute grid and the data grid.

Not coincidentally, Daniel also recently wrote that

If the VectorWise/Ingres solution does get released open source, I believe they will be an excellent column-store storage engine for HadoopDB. I have already requested an academic preview edition of their software to play with.

The VectorWise guys also told me they are looking forward to seeing how the two projects work together.

Categories: Analytic technologies, Cloudera, Columnar database management, Data warehousing, Hadoop, Investment research and trading, MapReduce, Parallelization, Theory and architecture, VectorWise, Vertica Systems, Web analytics


Comments

5 Responses to “Vertica’s version of MapReduce integration”

Omer Trajman on August 4th, 2009 8:12 am

One clarification regarding compute/data locality. MR necessarily has a data re-distribution phase prior to reduce (unless data is distributed by map key on load). When pushing the map down to Vertica there is no more data shuffling beyond what any other MR requires. You do get the added flexibility of being able to reduce on a different collection of nodes.

Daniel Abadi on August 4th, 2009 11:01 am

I agree with Omer’s clarification.

Also, just for the record, it’s probably giving me too much credit to say that C-Store was my PhD thesis. My thesis involved research behind building the query execution engine for C-Store, but the C-Store project was much bigger than just the work that I did.

Vertica Projects Leadership, Embraces MapReduce (Sorta) « Market Strategies for IT Suppliers on August 11th, 2009 10:02 pm

[…] MapReduce support, but with a difference. Unlike Greenplum and Aster, who are bringing it into the database itself, Vertica is providing a streaming connection to Hadoop instances (the open source implementation of MapReduce; Vertica is contributing the adapter to the community). This architecture mirrors usage patterns we’ve seen, and which Vertica asserts its customers have told them they want. One scenario: use your ADBMS to retrieve stored data, pass it to Hadoop for analysis by staff with different skill sets from the typical ADBMS users, and then bring result sets back. A separate hardware for the Hadoop sandbox is fairly typical among early adopters today, and via a Cloudera partnership, Vertica can offer a deployment architecture that doesn’t break the bank. Curt Monash does the usual excellent summary of Hadoop issues in his blog. […]

How 30+ enterprises are using Hadoop | DBMS2 -- DataBase Management System Services on December 11th, 2009 11:26 pm

[…] (Vertica recently made its 100th sale, and of course not all those buyers are in production yet.) Vertica/Hadoop usage seems to have started in Vertica’s financial services stronghold — specifically […]

Will a Shotgun Marriage Avert Squabbles among the Data Clans : Beyond Search on December 17th, 2009 2:03 am

[…] TechWorld.com ran a story that I thought was interesting and closer to the truth about the relational databases and big data. “Sybase Embraces Google MapReduce” runs down a number of data management companies expressing interest in one of Google’s earlier innovations. One comment worth noting in my opinion was: Relational database pioneer Michael Stonebraker co-authored a paper earlier this year contending that SQL technology still beats MapReduce in most cases. But that conclusion didn’t stop Vertica Systems, the startup where he serves as CTO, from adding Hadoop functionality to its new Vertica 3.5 database. […]
