Hadoop hardware and compression
A month ago, I posted about typical Hadoop hardware. After talking today with Eric Baldeschwieler of Hortonworks, I have an update. I also learned some things from Eric and from Brian Christian of Zettaset about Hadoop compression.
First the compression part. Eric thinks 6-10X compression is common for “curated” Hadoop data — i.e., the data that actually gets used a lot. Brian used an overall figure of 6-8X, and told of a specific customer who got 6X or a little more. By way of comparison, the kinds of data involved sound similar to those for which Vertica claimed 10-60X compression almost three years ago.
Eric also made an excellent point about low-value machine-generated data. I was suggesting that as Moore’s Law made sensor networks ever more affordable:
- There would be lots more data thrown off.
- A lot of it would be repetitive “I’m fine; nothing to report” kinds of events.
- It would be a good idea to filter this low-value information out rather than permanently storing it.
Eric retorted that such data compresses extremely well. He was, of course, correct. If you have a long sequence or other large amount of identical data, and the right compression algorithms* — yeah, that compresses really well.
*Think run-length encoding (RLE), delta, or tokenization with variable-length tokens.
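(To make the footnote concrete, here is a minimal run-length-encoding sketch. It is purely illustrative, not something Hadoop does for you out of the box, but it shows why a million identical “I’m fine” readings cost almost nothing to store once encoded.)

```java
import java.util.ArrayList;
import java.util.List;

// Minimal run-length encoding sketch: repetitive "nothing to report" events
// collapse into (value, count) pairs, which is why such data compresses so well.
public class RunLengthSketch {
    // One run of identical events.
    record Run(String value, long count) {}

    static List<Run> encode(List<String> events) {
        List<Run> runs = new ArrayList<>();
        for (String e : events) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).value().equals(e)) {
                runs.set(last, new Run(e, runs.get(last).count() + 1));  // extend the current run
            } else {
                runs.add(new Run(e, 1));                                 // start a new run
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) events.add("OK");  // a million "I'm fine" readings
        events.add("FAULT: temperature high");                 // one event actually worth keeping
        // Prints two runs: [Run[value=OK, count=1000000], Run[value=FAULT: temperature high, count=1]]
        System.out.println(encode(events));
    }
}
```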
While I was at it, I asked Eric what might be typical for Hadoop temp/working space. He said at Yahoo it was getting down to 1/4 of the disk, from a previous range of 1/3.
Anyhow, Yahoo’s most recent standard Hadoop nodes feature:
- 8-12 cores
- 48 gigabytes of RAM
- 12 disks of 2 or 3 TB each
If you divide those 12 disks by 3 for standard Hadoop redundancy, and take off another 1/4 for temp space, you are left with 6-9 TB/node of usable disk. Multiply that by a compression factor of 6-10X, at least for the “curated” data, and you get to 36-90 TB of user data per node.
As an alternative, suppose we take a point figure from Cloudera’s ranges of 16 TB of spinning disk per node (8 spindles, 2 TB/disk). Go with the 6X compression figure. Lop off 1/3 for temp space. That more conservative calculation leaves us a bit over 20 TB/node, which is probably a more typical figure among today’s Hadoop users.
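If you want that arithmetic spelled out, here is a tiny sketch of both estimates; the disk counts, temp-space fractions, and compression factors are just the figures quoted above, not new data.

```java
// Back-of-envelope sketch of the two per-node capacity estimates above.
// Assumes 3x HDFS replication throughout.
public class NodeCapacity {
    static double userDataTb(double rawTb, double tempFraction, double compression) {
        // Divide by 3 for replication, subtract temp space, then apply compression.
        return rawTb / 3.0 * (1.0 - tempFraction) * compression;
    }

    public static void main(String[] args) {
        // Yahoo-style node: 12 disks x 2-3 TB, 1/4 temp space, 6-10X compression.
        System.out.printf("Yahoo low:  %.0f TB%n", userDataTb(12 * 2, 0.25, 6));      // ~36 TB
        System.out.printf("Yahoo high: %.0f TB%n", userDataTb(12 * 3, 0.25, 10));     // ~90 TB
        // Cloudera-style node: 8 disks x 2 TB, 1/3 temp space, 6X compression.
        System.out.printf("Cloudera:   %.1f TB%n", userDataTb(8 * 2, 1.0 / 3.0, 6));  // ~21 TB
    }
}
```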
Comments
[…] works out near the low end of the range I came up with for Yahoo’s newest gear, namely 36-90 TB/node. Yahoo’s biggest clusters are a little over 4,000 nodes (a limitation that’s getting […]
Are there customer references for any of it? Vendor claims about compression should always be taken with a grain of salt.
Hortonworks and Zettaset gave me one customer name each with specific figures. Indeed, Hortonworks didn’t give me data for anybody except Yahoo.
One thing: As Merv Adrian points out privately, I neglected to check just where and by whom this compression tends to be implemented. I’ll look into it.
The 6-10x number was from Yahoo!, where we store a lot of log data on Hadoop. In general we simply use text or block-compressed sequence files; in both cases the data is gzipped.
Much higher compression ratios are possible for some data types or with more advanced techniques. With HCatalog we intend to support such techniques on columnar and row-based data representations.
Eric,
Do I understand correctly that compression is not now baked into Hadoop, but will be with HCatalog?
You can easily use compression today from map-reduce by specifying your choice of codec to various input / output formats. Yahoo compresses all of its big data sets via this option.
Compression is not implemented in the HDFS layer and we’ve no immediate plans to put it there.
HCatalog will separate application code from data location and representation choices (those input / output formats), making it much easier to roll out new data layouts and compression codecs. I think this will spur rapid improvements in the average level of compression in Hadoop.
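(For concreteness, here is a minimal sketch of what “specifying your choice of codec” to an output format looks like in MapReduce job setup. The class and property names come from the newer Hadoop APIs; this is my illustration, not Eric’s or Yahoo’s actual configuration.)

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Job-setup sketch only: mapper/reducer classes, input paths, etc. are omitted.
public class CompressedOutputSetup {
    public static Job configure() throws IOException {
        Job job = Job.getInstance(new Configuration(), "compressed-output-example");

        // Write block-compressed SequenceFiles, gzipped -- roughly the setup Eric describes.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/compressed-output"));  // placeholder path

        // Compressing intermediate map output is a separate knob (Hadoop 2 property name shown).
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        return job;
    }
}
```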
Eric,
So you’re saying compression happens on the application tier rather than the storage tier? Makes sense to me. In particular, it should be AT LEAST as friendly to late decompression as the alternative.
Thanks,
CAM
[…] – What range is Hadoop compression factor? “6-10X compression is common for “curated” Hadoop […]
[…] told me of one big customer — with an almost-petabyte Hadoop cluster before compression — namely Zions […]
[…] deployment, with about 42,000 nodes with a total of between 180 and 200 petabytes of storage. Their standard node has between 8 and 12 cores, 48 GB of memory, and 24 to 36 terabytes of […]