Notes on Hadoop hardware
I talked with Cloudera yesterday about an unannounced technology, and took the opportunity to ask some non-embargoed questions as well. In particular, I requested an update to what I wrote last year about typical Hadoop hardware.
Cloudera thinks the picture now is:
- 2-socket servers, with 4- or 6-core chips.
- Increasing number of spindles, with 12 2-TB spindles being common.
- 48 gigs of RAM is most common, with 64-96 fairly frequent.
- A couple of 1GigE networking ports.
Discussion around that included:
- Enterprises had been running out of storage space; hence the increased amount of storage. 🙂
- Even more storage can be stuffed onto a node, and at times is. But at a certain point there's so much data on a node that recovering from a node failure takes forbiddingly long (see the back-of-envelope sketch after this list).
- There are some experiments with 10 GigE.
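To put a rough number on that recovery concern, here is a back-of-envelope sketch in Python. The data volume matches the 12 x 2-TB configuration above, but the number of surviving peers and the rebuild bandwidth each can spare are assumptions I picked for illustration, not Cloudera figures.

```python
# Rough estimate of how long it takes to re-replicate a failed node's data.
# The bandwidth and cluster-size figures are illustrative assumptions.

def recovery_hours(tb_per_node, surviving_peers, rebuild_mb_s_per_peer):
    """Hours to re-copy a dead node's data, assuming the surviving peers
    share the re-replication work evenly."""
    total_mb = tb_per_node * 1_000_000                    # TB -> MB (decimal)
    aggregate_mb_s = surviving_peers * rebuild_mb_s_per_peer
    return total_mb / aggregate_mb_s / 3600

# 12 x 2 TB on the dead node; 20 surviving peers each sparing ~30 MB/s
# of their 1 GigE links for rebuild traffic (assumed).
print(round(recovery_hours(24, 20, 30), 1))   # ~11.1 hours

# A bigger cluster with more spare bandwidth recovers much faster:
print(round(recovery_hours(24, 50, 60), 1))   # ~2.2 hours
```

The point is simply that the more terabytes you pile onto each node, the longer the cluster sits in a degraded, under-replicated state after a failure.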
The foregoing applies to software-only Hadoop, specifically Cloudera’s distribution. The Hadoop appliances that Cloudera is familiar with tend to have higher-end hardware — more CPUs, “fancier” drives, and/or InfiniBand. If I understood correctly, the same is somewhat true of hardware vendors’ pseudo-appliance recommended configurations.
My hunches about all that include:
- Footprint can matter. Not every enterprise has a cheap data center drawing cheap power by the Columbia River.
- As Cray and SAS both teach us, some analytic techniques do require high-speed interconnects.
- There’s nothing wrong with having 2 or more Hadoop clusters. One can have cheap gear, and be the ultimate big bit bucket. The other could have more expensive gear, and perhaps additional software as well. That’s even before you start thinking about cloud vs. on-premise alternatives.
And finally — as long as MapReduce persists intermediate result sets after every computational step, I wonder whether solid-state cache could be useful. An analogy could be the way analytic RDBMS can use flash for temp space, although I must admit that I can’t think of a lot of RDBMS installations configured to take advantage of that possibility.
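To put the same hunch in numbers, here is a minimal sketch comparing how long one step's intermediate output might take to persist on a slice of the spinning disks versus on a dedicated SSD used for temp space. Every figure (the intermediate data size, the bandwidth the spindles can spare after serving HDFS traffic, and the SSD rate) is an assumption chosen for illustration, not a benchmark.

```python
# Toy estimate of time spent persisting one step's intermediate results,
# on a share of the spinning disks vs. a dedicated SSD used as temp space.
# All figures are illustrative assumptions, not measurements.

def spill_minutes(intermediate_gb, mb_per_sec):
    """Minutes to write and then re-read the intermediate data at a given rate."""
    return 2 * intermediate_gb * 1000 / mb_per_sec / 60

intermediate_gb = 500   # intermediate output per node per step (assumed)
disk_share_mb_s = 150   # slice of the 12 spindles left after HDFS traffic (assumed)
ssd_mb_s = 400          # rough sequential rate for a circa-2012 SSD (assumed)

print(round(spill_minutes(intermediate_gb, disk_share_mb_s)))   # ~111 minutes
print(round(spill_minutes(intermediate_gb, ssd_mb_s)))          # ~42 minutes
```

The obvious caveat is that 12 spindles have plenty of raw sequential bandwidth; any win from flash would come from avoiding contention with HDFS traffic and from seek-heavy shuffle patterns, which is why this remains a conjecture rather than a recommendation.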
Comments
[…] to my conjecture that if MapReduce insists on writing to persistent storage at every step, you might want to have flash cache just for […]
Hi Curt,
I think it’s fair to explore all these possibilities, just so long as the thought process is:
($$ per node from adding a new part, e.g. flash for temp) × (total number of nodes in the cluster) / (cost per node).
In other words, how many nodes could each fancy part have bought me? And then would I be better, faster, cheaper with the fancy part or the additional nodes? That’s the “hurdle rate” that new node components or component upgrades have to get over.
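Charles’s hurdle rate translates directly into a few lines of Python; the dollar figures below are placeholders made up for illustration.

```python
# How many plain nodes would the "fancy part" have bought instead?
# All prices are made-up placeholders.

def nodes_foregone(part_cost_per_node, node_count, plain_node_cost):
    """Extra plain nodes you could have bought for the money spent
    adding the part to every existing node."""
    return part_cost_per_node * node_count / plain_node_cost

# e.g. a $1,500 flash card on each of 100 nodes, vs. $6,000 plain nodes
print(nodes_foregone(1500, 100, 6000))   # 25.0
```

In that made-up case, the flash upgrade has to beat whatever 25 additional nodes would have delivered.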
Charles,
Thanks for chiming in!
You’re right of course, except that those nodes aren’t just capex; they’re opex as well.
If I take the total cost of my hardware purchases up 20% — to pick a number rather at random — while taking power consumption and floor space up 0%, that’s not really a 20% cost increase at all.
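A minimal sketch of that arithmetic, assuming hardware is roughly half of per-node total cost of ownership (the 50/50 split is an assumption, not a quoted figure):

```python
# A 20% hardware bump raises total cost by much less than 20%
# if power, space, and support stay flat. The cost split is assumed.

hardware_share = 0.5                   # share of per-node TCO that is hardware (assumed)
opex_share = 1 - hardware_share        # power, floor space, support, etc.

hardware_increase = 0.20               # the 20% picked "rather at random" above
total_increase = hardware_share * hardware_increase + opex_share * 0.0

print(f"{total_increase:.0%}")         # 10%
```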
Best,
CAM
Curt,
As you point out, organizations without a Google or Facebook data center, i.e. 99% of them, can pay anywhere from $3-15K/year to support each OS image. Increasing the node speed & price can dramatically drop total costs for many customers.
And is the changing picture of an ideal server really due to changing requirements, or to a better understanding of the problem? Weren’t I/O constraints always a problem? And are we seeing these nodes slowly evolve to resemble parallel database nodes?
[…] 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like […]
The amount of storage per node is a key determinant of system performance. Less storage per node = more nodes required and vice versa. The number of nodes is obviously central to overall performance in any clustered system.
High storage per node can deliver a relatively cheap ‘high capacity low throughput’ system. Low storage per node can deliver a relatively expensive ‘low capacity high throughput’ system. A balanced system will be somewhere in the middle.
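A minimal sketch of that trade-off, holding total capacity fixed; the capacity and per-node scan-rate figures below are assumptions for illustration.

```python
# For a fixed total capacity, storage per node determines node count,
# and node count drives aggregate scan throughput.
# All figures are illustrative assumptions.

def cluster_shape(total_tb, tb_per_node, scan_mb_s_per_node):
    nodes = total_tb / tb_per_node
    aggregate_gb_s = nodes * scan_mb_s_per_node / 1000
    return nodes, aggregate_gb_s

for tb_per_node in (6, 12, 24):
    nodes, gb_s = cluster_shape(480, tb_per_node, 800)
    print(f"{tb_per_node:>2} TB/node -> {nodes:3.0f} nodes, ~{gb_s:.0f} GB/s aggregate scan")

# 6 TB/node -> 80 nodes, ~64 GB/s
# 12 TB/node -> 40 nodes, ~32 GB/s
# 24 TB/node -> 20 nodes, ~16 GB/s
```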
Teradata’s patented BYNET interconnect demonstrates the value of a high-speed interconnect in a clustered system. Lack of interconnect scalability can quickly become the bottleneck once the number of nodes is high enough to saturate the network, assuming data is routinely moved between compute nodes.
[…] or default Hadoop node these days. Happily, the answers seemed like straightforward upgrades to what Cloudera said in October, 2012. Specifics […]