Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point where it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra. (There’s a sketch of the striping idea after this list.)
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- Metadata
- SLAs/high availability/other workload management
- Data retention policies
- Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
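To make the striping idea concrete, here is a minimal Python sketch of a column store splitting row-oriented records into one stripe per column. It is purely illustrative: the function and file names are invented, not Zebra’s actual API or on-disk format (neither of which Yahoo has detailed), and a real system would write each stripe as an HDFS file whose blocks get replicated and spread across nodes.

```python
import os

def write_column_stripes(rows, columns, out_dir):
    """Split row-oriented records into one file ("stripe") per column."""
    os.makedirs(out_dir, exist_ok=True)
    for i, col in enumerate(columns):
        # In HDFS, each stripe would be a file whose blocks land on
        # different nodes; here we just use the local filesystem.
        with open(os.path.join(out_dir, col + ".stripe"), "w") as f:
            for row in rows:
                f.write(str(row[i]) + "\n")

# Invented example data: a tiny "pageviews" table.
rows = [("alice", "/a", 3), ("bob", "/b", 1)]
write_column_stripes(rows, columns=["user", "url", "clicks"], out_dir="pageviews")
```

The payoff of such a layout is that a query touching only one or two columns reads only those stripes, rather than scanning entire rows.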
*I also spoke with a couple of Mark’s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas — or lack thereof — for making it happen. But frankly, I don’t think it’s a solvable technical problem. Rather, it should be a huge priority on the legal/political front.
We also talked some about Pig, Yahoo’s non-SQL DML (Data Manipulation Language) for Hadoop, which is nonetheless getting a SQL interface. And we talked about Pig vs. Hive. But I recently heard a rumor that all of that is in flux, so I won’t write it up now.
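That said, for readers who haven’t seen it: Pig Latin, Pig’s DML, is a dataflow language in which each statement names an intermediate result, where SQL would state one declarative query. Here is a rough Python rendering of that style, with the corresponding Pig Latin statements in comments; the data and field names are invented for illustration.

```python
from collections import defaultdict

# views = LOAD 'pageviews' AS (user, url);
views = [("alice", "/a"), ("alice", "/b"), ("bob", "/a")]

# by_user = GROUP views BY user;
by_user = defaultdict(list)
for user, url in views:
    by_user[user].append(url)

# counts = FOREACH by_user GENERATE group, COUNT(views);
counts = {user: len(urls) for user, urls in by_user.items()}
print(counts)  # {'alice': 2, 'bob': 1}
```

A SQL interface would collapse those steps into something like SELECT user, COUNT(*) FROM pageviews GROUP BY user.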
Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post these links to similar material instead.
Comments
Curt, I’m curious: how much data is Yahoo currently managing in total, and do they use commercial RDBMSs at all? If so, which ones?
Thanks.
Jerome,
As per other threads, it’s clear Yahoo is using quite a bit of Oracle.
Otherwise, I couldn’t say.
Besides Oracle and the internal (Everest) engine you mentioned, I mean. (I’m assuming it’s not all Oracle, is it?)
The video is really full of interesting stuff. I wonder how many “nodes” they work with, what kind of fabric is used, and how these things are clustered. Isn’t Xen the same VM that EC2 is using?
Fascinating stuff.
Everest = Yahoo’s proprietary Postgres-based column store that is managing petabytes of data.
Curt, you may want to add that Yahoo’s Hadoop team is growing fast! If anybody wants to join us, we are looking for developers, architects, testers, and managers: http://developer.yahoo.net/blogs/hadoop/2009/10/do_you_have_what_it_takes_to_j.html