Yahoo wants to do decapetabyte-scale data warehousing in Hadoop
My old client Mark Tsimelzon moved over to Yahoo after Coral8 was acquired, and I caught up with him last month. He turns out to be running development for a significant portion of Yahoo’s Hadoop effort — everything other than HDFS (Hadoop Distributed File System). Yahoo evidently plans, within a year or so, to get Hadoop to the point where it is managing tens of petabytes of data for Yahoo, with reasonable data warehousing functionality.
Highlights of our visit included:
- There are dozens of people at Yahoo doing Hadoop development that will wind up getting open sourced. (Full-time or close to it.) In particular, everything Mark’s team does goes to open source.
- Yahoo is moving as much of its analytics to Hadoop as possible. Much of this is being moved away from Oracle and from Yahoo’s own Everest.
- A column store is being put on top of HDFS, based on Yahoo technology. Columns will be striped across nodes. Perhaps that’s why the effort is called Project Zebra. (There’s a sketch of the striping idea after this list.)
- Mark believes that in a year Hadoop will be much further along in meeting traditional data warehousing requirements, in areas such as:
- Metadata
- SLAs/high availability/other workload management
- Data retention policies
- Security/privacy*
- Yahoo views the time-to-market benefits of Hadoop as being more important than TCO.
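To make the striping idea concrete, here is a minimal Python sketch of a column store splitting row-oriented records into one stripe per column. It is purely illustrative: the function and file names are invented, not Zebra’s actual API or on-disk format (neither of which Yahoo has detailed), and a real system would write each stripe as an HDFS file whose blocks get replicated and spread across nodes.

```python
import os

def write_column_stripes(rows, columns, out_dir):
    """Split row-oriented records into one file ("stripe") per column."""
    os.makedirs(out_dir, exist_ok=True)
    for i, col in enumerate(columns):
        # In HDFS, each stripe would be a file whose blocks land on
        # different nodes; here we just use the local filesystem.
        with open(os.path.join(out_dir, col + ".stripe"), "w") as f:
            for row in rows:
                f.write(str(row[i]) + "\n")

# Invented example data: a tiny "pageviews" table.
rows = [("alice", "/a", 3), ("bob", "/b", 1)]
write_column_stripes(rows, columns=["user", "url", "clicks"], out_dir="pageviews")
```

The payoff of such a layout is that a query touching only one or two columns reads only those stripes, rather than scanning entire rows.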
*I also spoke with a couple of Mark’s Yahoo colleagues, on his introduction, who are being less helpful than he is about clarifying what I am or am not allowed to say for publication. But I will say that I was heartened by the degree of concern they showed for doing the right thing with regard to privacy. I was not as heartened by the concrete ideas — or lack thereof — for making it happen. But frankly, I don’t think it’s a solvable technical problem. Rather, it should be a huge priority on the legal/political front.
We also talked some about Pig, Yahoo’s non-SQL DML (Data Manipulation Language) for Hadoop, which is nonetheless getting a SQL interface. And we talked about Pig vs. Hive. But I recently heard a rumor that all of that is in flux, so I won’t write it up now.
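That said, for readers who haven’t seen it: Pig Latin, Pig’s DML, is a dataflow language in which each statement names an intermediate result, where SQL would state one declarative query. Here is a rough Python rendering of that style, with the corresponding Pig Latin statements in comments; the data and field names are invented for illustration.

```python
from collections import defaultdict

# views = LOAD 'pageviews' AS (user, url);
views = [("alice", "/a"), ("alice", "/b"), ("bob", "/a")]

# by_user = GROUP views BY user;
by_user = defaultdict(list)
for user, url in views:
    by_user[user].append(url)

# counts = FOREACH by_user GENERATE group, COUNT(views);
counts = {user: len(urls) for user, urls in by_user.items()}
print(counts)  # {'alice': 2, 'bob': 1}
```

A SQL interface would collapse those steps into something like SELECT user, COUNT(*) FROM pageviews GROUP BY user.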
Mark sent along a couple of interesting slide presentations by a colleague. After some back and forth as to whether I could post them, he suggested I post these links to similar material instead.
Comments
Curt, I’m curious: how much data is Yahoo currently managing in total, and do they use commercial RDBMSs at all? If so, which ones?
Thanks.
Jerome,
As per other threads, it’s clear Yahoo is using quite a bit of Oracle.
Otherwise, I couldn’t say.
Besides Oracle and the internal (Everest) engine you mentioned, I mean. (I’m assuming it’s not all Oracle, is it?)
The video is really full of interesting stuff. I wonder how many “nodes” they work with, what kind of fabric is used, and how these things are clustered. Isn’t Xen the same VM that EC2 is using?
Fascinating stuff.
Everest = Yahoo’s proprietary Postgres-based column store that is managing petabytes of data.
Curt, you may want to add that Yahoo’s Hadoop team is growing fast! If anybody wants to join us, we are looking for developers, architects, testers, and managers: http://developer.yahoo.net/blogs/hadoop/2009/10/do_you_have_what_it_takes_to_j.html