May 29, 2008

Yahoo scales its web analytics database to petabyte range

Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:

The Yahoo web analytics database is over 1 petabyte. Yahoo claims it will be in the tens of petabytes by 2009.
The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store. If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites.
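
To make my guess about ingestion concrete, here is a minimal Python sketch of "fill blocks in RAM, then append them to a single persistent store." Nothing in it comes from Yahoo or from the article. The class name BlockAppender, the tiny block size, and the newline-delimited file format are all hypothetical, chosen only to show the tradeoff: whatever is still in the buffer when a box crashes is lost.

```python
import os
import tempfile


class BlockAppender:
    """Buffer events in RAM and append them to one file, a block at a time.

    Hypothetical illustration only; not Yahoo's actual design.
    """

    def __init__(self, path, block_size=4):
        self.path = path
        self.block_size = block_size  # events per block (made-up value)
        self.buffer = []

    def add(self, event):
        # Events accumulate in memory until a full block is ready.
        self.buffer.append(event)
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        # Append the whole block to the end of the store; never rewrite.
        if not self.buffer:
            return
        with open(self.path, "ab") as f:
            for event in self.buffer:
                f.write(event.encode("utf-8") + b"\n")
        self.buffer = []


def demo():
    path = os.path.join(tempfile.mkdtemp(), "events.log")
    store = BlockAppender(path, block_size=4)
    for i in range(10):
        store.add(f"event-{i}")
    # Two full blocks (8 events) have been appended to disk; the last
    # 2 events are still only in RAM and would vanish in a crash.
    with open(path, "rb") as f:
        persisted = f.read().splitlines()
    return len(persisted), len(store.buffer)
```

Calling demo() returns the number of events on disk and the number still buffered. A real system would presumably use far larger blocks and a columnar layout, but the loss window works the same way.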

Categories: Analytic technologies, Columnar database management, Data warehousing, MySQL, Petabyte-scale data management, PostgreSQL, Specific users, Theory and architecture, Yahoo


Comments

13 Responses to “Yahoo scales its web analytics database to petabyte range”

dave on May 30th, 2008 7:06 am

actually there was a bit more to it – the article described a small startup acquired by yahoo that supported their changes around mysql architecture (mahit? i forget the exact name)

as for their column-store comment, i saw that too, though they offered zero in the way of supporting evidence. one may assume that too many yahoo properties rely on multivariable-intensive searches and so wouldn’t be quite as well served…
Curt Monash on May 30th, 2008 5:04 pm

On the column store point — sometimes there isn’t much difference between a vertically partitioned row store and a true column store, as per http://www.dbms2.com/2007/03/19/datallegro-versus-vertica-columnar-systems/.

CAM
Curt Monash on May 30th, 2008 5:08 pm

Eric Lai of Computerworld has an article too: http://www.infoworld.com/news/feeds/08/05/22/Yahoo-claims-2-petabyte-database-is-worlds-biggest–busiest.html

Again, he says PostgreSQL, not MySQL. The fact that it’s an acquisition may help explain why it’s not MySQL. 🙂

The name was Mahat, which I gather is a philosophically-inspiring word in Sanskrit or something.

CAM
Daniel Weinreb on May 31st, 2008 6:55 am

The claim that they got great performance advantages by optimizing for a specific application sounds very plausible to me. If you look at the published literature from companies like Amazon and Google about their high-performance, high-availability systems, these papers explain all kinds of interesting techniques that buy lots of performance by providing semantics that are unconventional, but carefully optimized for the particular needs and tradeoffs of their applications.
Response to Rita Sallam of Oracle | DBMS2 -- DataBase Management System Services on June 28th, 2008 4:35 am

[…] by the way — the largest Oracle warehouse by far on that list is at Yahoo. But Oracle isn’t Yahoo’s major data warehouse software provider. If a shared disk architecture is not scalable, then how is it that Oracle is the leader in Data […]
david bandel on August 25th, 2008 12:02 pm

because Oracle won’t support their custom DB structure.
Yahoo reaches 1-Petabyte… « Wisps in the Ethereal on August 25th, 2008 2:14 pm

[…] You can read more on the Yahoo side of things HERE. […]
Infobright’s open source move has a lot of potential | DBMS2 -- DataBase Management System Services on September 15th, 2008 8:05 am

[…] data. Those outfits have already been buying massive data warehouse appliances – or doing things even more dramatic — and don’t need Infobright. But for anybody else in the MySQL world who needs […]
Some of Oracle’s largest data warehouses | DBMS2 -- DataBase Management System Services on September 24th, 2008 8:22 pm

[…] one of Greenplum’s flagship accounts. And despite its ongoing Oracle relationship Yahoo has a much bigger data warehouse based on Postgres […]
eBay’s two enormous data warehouses | DBMS2 -- DataBase Management System Services on April 30th, 2009 6:25 am

[…] web/network events database, running on proprietary software, sounded about 1/6th the size of eBay’s Greenplum system when it was described about a year […]
Analytics Team » Blog Archive » Web analytics databases keep getting bigger on April 30th, 2009 10:23 pm

[…] Ebay has a 6.5 petabyte Greenplum warehouse and a 2.5 petabyte Teradata warehouse. This system ingests hundreds of billions of new rows of data every day. Facebook has a 2.5 petabyte Hadoop system Yahoo has more than 1 petabyte running on their homemade system […]
Yahoo is up to 10 petabytes now? | DBMS2 -- DataBase Management System Services on July 6th, 2009 2:03 am

[…] to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, […]
The fall (and rise?) of Yahoo: How the web giant crumbled and built some great tech in the process — Tech News and Analysis on November 27th, 2013 9:01 am

[…] relational databases, NoSQL databases and even a columnar analytic database called Everest that was designed for querying big data related to targeted advertising. He views Yahoo’s decision to port so many workloads to […]

