Petabyte-scale Hadoop clusters (dozens of them)
I recently learned that there are 7 Vertica clusters with a petabyte (or more) each of user data. So I asked around about other petabyte-scale clusters. It turns out that there are several dozen such clusters (at least) running Hadoop.
Cloudera can identify 22 CDH (Cloudera Distribution [of] Hadoop) clusters holding one petabyte or more of user data each, at 16 different organizations. This does not count Facebook or Yahoo, who are huge Hadoop users but not, I gather, running CDH. Meanwhile, Eric Baldeschwieler of Hortonworks tells me that Yahoo’s latest stated figures are:
- 42,000 Hadoop nodes …
- … holding 180-200 petabytes of data.
That works out to roughly 4-5 TB of user data per node, which is near the low end of what the range I came up with for Yahoo’s newest gear, namely 36-90 TB of raw disk per node, would imply once replication and working space are accounted for. Yahoo’s biggest clusters are a little over 4,000 nodes (a limitation that is being worked on), and Yahoo has over 20 clusters in total.
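For anybody who wants to check the arithmetic, here is a minimal sketch of the per-node math, using only the figures stated above and assuming decimal units (1 PB = 1,000 TB):

```python
# Per-node arithmetic from the figures quoted above (decimal units: 1 PB = 1,000 TB).
nodes = 42_000                    # Yahoo's stated Hadoop node count
for data_pb in (180, 200):        # stated petabytes of user data
    tb_per_node = data_pb * 1_000 / nodes
    print(f"{data_pb} PB over {nodes:,} nodes -> {tb_per_node:.1f} TB of user data per node")
# -> roughly 4.3 to 4.8 TB of user data per node, before HDFS replication
#    and MapReduce working space are taken into account.
```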
Based on those numbers, it would seem that 10 or more of Yahoo’s Hadoop clusters are probably in the petabyte range. Facebook no doubt has a few petabyte-scale Hadoop clusters as well. So we’re probably over 3 dozen petabyte+ Hadoop clusters, just counting Yahoo, Facebook, and CDH users. There surely are others too, running Apache Hadoop without Cloudera’s help.
We also have some more information about the scale of Hadoop usage, and the markets it is being used in, because Omer Trajman of Cloudera kindly wrote the following — lightly edited as usual — for quotation:
The number of petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Whereas our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While the mean is not the same as the median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we today count 16 organizations running PB+ CDH clusters, across a diverse set of industries including online advertising, retail, government, financial services, online publishing, web analytics, and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.
Omer went on to add:
The biggest number of PB+ clusters is in the advertising space. I often tell people that every ad you see on the internet touched at least one Hadoop cluster (or the Google equivalent).
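As a toy illustration of the mean-versus-median point in Omer’s first quote, here is a quick sketch; the cluster sizes below are invented for the example, not Cloudera data:

```python
import statistics

# Invented cluster sizes: most clusters are small, a couple of petabyte-scale ones are huge.
cluster_nodes = [10, 15, 20, 25, 25, 28, 30, 40, 60, 800, 1200]

print("mean:  ", round(statistics.mean(cluster_nodes)))   # ~205, pulled up by the big clusters
print("median:", statistics.median(cluster_nodes))        # 28 -- the "typical" cluster is still small
```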
Comments
[…] his DBMS2 blog this morning, database expert Curt Monash quotes Cloudera Vice President of Customer Solutions Omer Trajman as stating that his employer counts 22 Hadoop clusters (not counting non-Cloudera users Facebook […]
[…] claimed to have more than 80 customers running Hadoop in production in March 2011, with 22 clusters running Cloudera’s distribution that are over a petabyte as of July […]
[…] confirms 750 million users, sharing 4 billion items daily; Yahoo: 42,000 Hadoop nodes storing 180-200 petabytes; Formspring hits 25 million […]
What are some companies that work with peta-scale data?…
We count 16 customers with petabyte-sized clusters at Cloudera [1]. Unfortunately, many of our customers choose not to make their names public. It wouldn’t really matter, though: essentially every large company in the web, digital media, telco, financ…
[…] has been called the LAMP stack (Linux, Apache, MySQL, PHP) in the world of “big data”. 22 Hadoop clusters working with more than a petabyte of data have been spotted by Cloudera (a Hadoop vendor), and it seems to be catching on. Hadoop is one of […]
[…] that operate Hadoop production clusters with one petabyte or more of data stored in each cluster. (DBMS2, July 6, […]
[…] Petabyte-scale Hadoop clusters (dozens of them) (dbms2.com) […]
[…] In his Hadoop World 2011 keynote, Mike Olson (Cloudera’s CEO) mentioned that 13.1% of attendees had more than 100 TB of data and 12.8% had more than 1 PB, with the largest single site reaching 20 PB, and that they averaged 120 nodes per Hadoop cluster. Cloudera itself has 22 customers with more than 1 PB of Hadoop data, while Yahoo alone has 20-odd clusters worldwide exceeding a PB of Ha…. […]
[…] Omer Trajman — who by the way has made a long-overdue official move into technical marketing — can no longer keep count of how many petabyte-scale Hadoop clusters Cloudera supports. […]
[…] Yahoo! has up to 42,000 nodes in its Hadoop grids in July 2011 from Hortonworks Hadoop summit 2011 keynote and Petabyte-scale Hadoop clusters […]
Does anyone else notice the math doesn’t add up?
At 36-90 TB/node of capacity, this isn’t an efficient use of the nodes’ storage.
200 petabytes / 48,000 nodes
= 2 petabytes / 480 nodes
= 200 TB / 48 nodes
≈ 4.17 TB/node

In bytes: 200 × 10^15 bytes / 48,000 nodes ≈ 4.17 × 10^12 bytes/node ≈ 4.17 TB/node
Well, there’s a 3X replication factor. And there’s working space. And as old as these figures are, we can assume there isn’t much in the way of compression.
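For concreteness, here is a rough sketch of that arithmetic, assuming 3X HDFS replication and an illustrative 25% working-space reservation (neither figure is something Yahoo has published):

```python
# How much raw disk ~4.8 TB of user data per node can actually occupy.
user_tb_per_node = 200 * 1_000 / 42_000      # ~4.8 TB, from the figures in the post
replication = 3                               # default HDFS replication factor
replicated_tb = user_tb_per_node * replication

node_capacity_tb = 36                         # low end of the estimated 36-90 TB range
working_space_tb = 0.25 * node_capacity_tb    # illustrative MapReduce scratch reservation

print(f"replicated data:    ~{replicated_tb:.1f} TB")
print(f"plus working space: ~{replicated_tb + working_space_tb:.1f} TB of a {node_capacity_tb} TB node")
# With little or no compression, that's roughly 23 TB of a 36 TB node spoken for,
# so the node is far from empty even though the user-data figure alone looks small.
```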