Hardware for Hadoop
After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:
- Hadoop nodes today tend to run on fairly standard boxes.
- Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
- The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.
A key input comes from Cloudera, who to my joy delegated the questions to Omer Trajman, who wrote:
Most Hadoop deployments today use systems with dual socket and quad or hex cores (8 or 12 cores total, 16 or 24 hyper-threaded). Storage has increased as well with 6-8 spindles being common and some deployments going to 12 spindles. These are SATA disks with between 1TB and 2TB capacity. The amount of RAM varies depending on the application. 24GB is common as is 36GB – all ECC RAM. HBase clusters may have more RAM so they can cache more data. Some customers put Hadoop on their “standard box” which may not be perfectly balanced (e.g. more RAM, less disk) and needs to be altered slightly to meet the above specs. The new Dell C2100 series and the HP SL170 series are both popular server lines for Hadoop.
For a year-ago perspective, see this post: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
Bullet points from that year-ago link include:
- 4 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
- 2 quad-core CPUs, running at least 2-2.5GHz
- 16-24GB of RAM (24-32GB if you’re considering HBase)
- Gigabit Ethernet
So basically we’re talking in the range of 2-3 GB of RAM per core — and 1 spindle per core, up from perhaps half a spindle per core a year ago.
Meanwhile, a 2009 Yahoo slide deck refers to “500 nodes, 4000 cores, 3TB RAM, 1.5PB disk”; that divides out to 8 cores, 6 GB of RAM, and 3 TB of disk per node, all on “commodity hardware.” By 2010 Yahoo was evidently up to 2 GB of RAM per core.
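For anybody who wants to check the division, here’s a quick back-of-the-envelope sketch in Python. The figures are the ones quoted above; the `per_core` helper and the 12-core/12-spindle midpoints are illustrative choices of mine, not a spec any vendor published.

```python
# Back-of-the-envelope check of the per-core and per-node ratios cited above.

def per_core(total, cores):
    """Divide a per-box total by that box's core count."""
    return total / cores

# Cloudera's "typical" 2011 box (midpoint assumption: 12 cores, 12 spindles):
print(per_core(24, 12), per_core(36, 12))  # 2.0 3.0 -> 2-3 GB of RAM per core
print(per_core(12, 12))                    # 1.0     -> ~1 spindle per core

# The year-ago recommendation: 8 cores, 4 x 1TB disks.
print(per_core(4, 8))                      # 0.5     -> half a spindle per core

# Yahoo 2009: 500 nodes, 4000 cores, 3 TB RAM, 1.5 PB disk.
nodes = 500
print(4000 / nodes)        # 8.0   cores per node
print(3 * 1024 / nodes)    # ~6.1  GB of RAM per node
print(1.5 * 1024 / nodes)  # ~3.1  TB of disk per node
```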
There are lots of data points on the Apache Hadoop wiki, but many seem a few years old, and I don’t immediately see how to time-stamp them. Overall, they seem consistent with the trends I noted at the top of the post.
One thing I haven’t done is attempt to price any of these systems.
Contributions in the comment thread would be warmly appreciated.
Comments
You can hit up the major hardware vendors’ sites and get list pricing in the $3-5k/node range.
That’s cheap indeed.
Another site for mining information on Hadoop cluster sizes is the Sort Benchmark page (sortbenchmark.org). Hadoop, sponsored by Yahoo, has won a couple of categories over the last few years, and the disclosure document describes the cluster configuration in detail. The results page shows a summary; for example, in 2010 Hadoop sorted 100 TB in 173 minutes on a cluster of 3,452 nodes, each with 2 quad-core Xeons, 8 GB of memory, and 4 SATA disks. The clusters are standard clusters at Yahoo.
Thanks, Richard. In most cases benchmarks are run on bogus equipment, but it makes sense that this might be an exception to that rule.
Any thoughts on how the Katta project’s (http://katta.sourceforge.net/) implementation of Distributed Lucene might skew the choice of h/w for a Hadoop cluster?