Hardware for Hadoop
After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:
- Hadoop nodes today tend to run on fairly standard boxes.
- Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
- The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.
A key input comes from Cloudera, who to my joy delegated the questions to Omer Trajman, who wrote:
Most Hadoop deployments today use systems with dual socket and quad or hex cores (8 or 12 cores total, 16 or 24 hyper-threaded). Storage has increased as well with 6-8 spindles being common and some deployments going to 12 spindles. These are SATA disks with between 1TB and 2TB capacity. The amount of RAM varies depending on the application. 24GB is common as is 36GB – all ECC RAM. HBase clusters may have more RAM so they can cache more data. Some customers put Hadoop on their “standard box” which may not be perfectly balanced (e.g. more RAM, less disk) and needs to be altered slightly to meet the above specs. The new Dell C2100 series and the HP SL170 series are both popular server lines for Hadoop.
For a year-ago perspective, see this post: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
Bullet points from that year-ago link include:
- 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad-core CPUs, running at least 2-2.5GHz
- 16-24GB of RAM (24-32GB if you’re considering HBase)
- Gigabit Ethernet
So basically we’re talking in the range of 2-3 GB of RAM per core — and 1 spindle per core, up from perhaps half a spindle per core a year ago.
Meanwhile, a 2009 Yahoo slide deck refers to “500 nodes, 4000 cores, 3TB RAM, 1.5PB disk”; that divides out to 8 cores, 6 GB of RAM, and 3 TB of disk per node, all on “commodity hardware.” By 2010 Yahoo was evidently up to 2 GB of RAM per core.
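For anybody who wants to check the division, here’s a quick back-of-the-envelope sketch in Python. The figures are the ones quoted above; the `per_core` helper and the 12-core/12-spindle midpoints are illustrative choices of mine, not a spec any vendor published.

```python
# Back-of-the-envelope check of the per-core and per-node ratios cited above.

def per_core(total, cores):
    """Divide a per-box total by that box's core count."""
    return total / cores

# Cloudera's "typical" 2011 box (midpoint assumption: 12 cores, 12 spindles):
print(per_core(24, 12), per_core(36, 12))  # 2.0 3.0 -> 2-3 GB of RAM per core
print(per_core(12, 12))                    # 1.0     -> ~1 spindle per core

# The year-ago recommendation: 8 cores, 4 x 1TB disks.
print(per_core(4, 8))                      # 0.5     -> half a spindle per core

# Yahoo 2009: 500 nodes, 4000 cores, 3 TB RAM, 1.5 PB disk.
nodes = 500
print(4000 / nodes)        # 8.0   cores per node
print(3 * 1024 / nodes)    # ~6.1  GB of RAM per node
print(1.5 * 1024 / nodes)  # ~3.1  TB of disk per node
```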
There are lots of data points on the Apache Hadoop wiki, but many seem a few years old, and I don’t immediately see how to time-stamp them. Overall, they seem consistent with the trends I noted at the top of the post.
One thing I haven’t done is attempt to price any of these systems.
Contributions in the comment thread would be warmly appreciated.
Comments
You can hit up the major hardware vendors’ sites and get list pricing in the $3-5k/node range.
That’s cheap indeed.
Another site for mining information on Hadoop cluster sizes is the Sort Benchmark page (sortbenchmark.org). Hadoop, sponsored by Yahoo, has won a couple of categories over the last few years, and the disclosure document describes the cluster configuration in detail. The results page shows a summary; for example, in 2010 Hadoop sorted 100 TB in 173 minutes on a cluster of 3,452 nodes, each with 2 quad-core Xeons, 8 GB of memory, and 4 SATA disks. The clusters are standard clusters at Yahoo.
Thanks, Richard. In most cases benchmarks are run on bogus equipment, but it makes sense that this might be an exception to that rule.
Any thoughts on how the Katta project’s (http://katta.sourceforge.net/) implementation of Distributed Lucene might skew the choice of h/w for a Hadoop cluster?