May 29, 2008
Yahoo scales its web analytics database to petabyte range
Information Week has an article with details on what sounds like Yahoo’s core web analytics database. Highlights include:
- The Yahoo web analytics database is over 1 petabyte. They claim it will be in the tens of petabytes by 2009.
- The Yahoo web analytics database is based on PostgreSQL. So much for MySQL fanboys’ claims of Yahoo validation for their beloved toy … uh, let me rephrase that. The highly-regarded MySQL, although doing a great job for some demanding and impressive applications at Yahoo, evidently wasn’t selected for this one in particular. OK. That’s much better now.
- But the Yahoo web analytics database doesn’t actually use PostgreSQL’s storage engine. Rather, Yahoo wrote something custom and columnar.
- Yahoo is processing 24 billion “events” per day. The article doesn’t clarify whether these are sent straight to the analytics store, or whether there’s an intermediate storage engine. Most likely the system fills blocks in RAM and then just appends them to the single persistent store. If commodity boxes occasionally crash and lose a few megs of data — well, in this application, that’s not a big deal at all.
- Yahoo thinks commercial column stores aren’t ready yet for more than 100 terabytes of data.
- Yahoo says it got great performance advantages from a custom system by optimizing for its specific application. I don’t know exactly what that would be, but I do know that database architectures for high-volume web analytics are still in pretty bad shape. In particular, there’s no good way yet to analyze the specific, variable-length paths users take through websites.
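To make the "fill blocks in RAM, then append" guess above concrete, here is a minimal Python sketch. It is purely illustrative and assumes a lot: the article doesn't describe Yahoo's actual ingest path, and the class name `ColumnBlockBuffer`, the helper `read_blocks`, the integer-only columns, and the block layout are all hypothetical. For scale, 24 billion events per day averages out to roughly 278,000 events per second.

```python
import io
import struct


class ColumnBlockBuffer:
    """Hypothetical sketch: buffer events in RAM column by column,
    then append each full block to a single append-only store."""

    def __init__(self, store, columns, block_rows):
        self.store = store            # binary file-like object, written append-only
        self.columns = columns        # column names; all values assumed to be integers
        self.block_rows = block_rows  # rows held in RAM before a block is appended
        self.pending = {name: [] for name in columns}
        self.rows_buffered = 0
        self.blocks_written = 0

    def add_event(self, event):
        # Split the incoming row into per-column lists (the columnar part).
        for name in self.columns:
            self.pending[name].append(int(event[name]))
        self.rows_buffered += 1
        if self.rows_buffered >= self.block_rows:
            self.flush()

    def flush(self):
        # Append one block: a row count, then each column's values contiguously.
        n = self.rows_buffered
        if n == 0:
            return
        self.store.write(struct.pack("<I", n))
        for name in self.columns:
            self.store.write(struct.pack(f"<{n}q", *self.pending[name]))
            self.pending[name].clear()
        self.rows_buffered = 0
        self.blocks_written += 1


def read_blocks(raw, columns):
    """Decode the appended blocks back into {column: [values]}."""
    out = {name: [] for name in columns}
    offset = 0
    while offset < len(raw):
        (n,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        for name in columns:
            out[name].extend(struct.unpack_from(f"<{n}q", raw, offset))
            offset += 8 * n
    return out
```

Anything still sitting in `pending` when a box crashes is simply gone, which is the tradeoff described above: at most `block_rows` events are lost per crash, in exchange for never doing a small random write.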
Categories: Analytic technologies, Columnar database management, Data warehousing, MySQL, Petabyte-scale data management, PostgreSQL, Specific users, Theory and architecture, Yahoo
Comments
13 Responses to “Yahoo scales its web analytics database to petabyte range”
actually there was a bit more to it – the article described a small startup acquired by yahoo that supported their changes around mysql architecture (mahit? i forget the exact name)
as for their column-store comment, i saw that too, though they offered zero in the way of supporting evidence. one may assume that too many yahoo properties rely on multivariable-intensive searches and so wouldn’t be quite as well served…
On the column store point — sometimes there isn’t much difference between a vertically partitioned row store and a true column store, as per http://www.dbms2.com/2007/03/19/datallegro-versus-vertica-columnar-systems/.
CAM
Eric Lai of Computerworld has an article too: http://www.infoworld.com/news/feeds/08/05/22/Yahoo-claims-2-petabyte-database-is-worlds-biggest–busiest.html
Again, he says PostgreSQL, not MySQL. The fact that it’s an acquisition may help explain why it’s not MySQL. 🙂
The name was Mahat, which I gather is a philosophically-inspiring word in Sanskrit or something.
CAM
The claim that they got great performance advantages by optimizing for a specific application sounds very plausible to me. If you look at the published literature from companies like Amazon and Google about their high-performance, high-availability systems, these papers explain all kinds of interesting techniques that buy lots of performance by providing semantics that are unconventional, but carefully optimized for the particular needs and tradeoffs of their applications.
[…] by the way — the largest Oracle warehouse by far on that list is at Yahoo. But Oracle isn’t Yahoo’s major data warehouse software provider. If a shared disk architecture is not scalable, then how is it that Oracle is the leader in Data […]
because Oracle won’t support their custom DB structure.
[…] You can read more on the Yahoo side of things HERE. […]
[…] data. Those outfits have already been buying massive data warehouse appliances – or doing things even more dramatic — and don’t need Infobright. But for anybody else in the MySQL world who needs […]
[…] one of Greenplum’s flagship accounts. And despite its ongoing Oracle relationship Yahoo has a much bigger data warehouse based on Postgres […]
[…] web/network events database, running on proprietary software, sounded about 1/6th the size of eBay’s Greenplum system when it was described about a year […]
[…] Ebay has a 6.5 petabyte Greenplum warehouse and a 2.5 petabyte Teradata warehouse. This system ingests hundreds of billions of new rows of data every day. Facebook has a 2.5 petabyte Hadoop system. Yahoo has more than 1 petabyte running on their homemade system […]
[…] to somebody (I forget who) who attended Yahoo’s SIGMOD presentation last week, the big Yahoo database is now up to 10 petabytes in size, in line with Yahoo’s predictions last year. Apparently, […]
[…] relational databases, NoSQL databases and even a columnar analytic database called Everest that was designed for querying big data related to targeted advertising. He views Yahoo’s decision to port so many workloads to […]