May 14, 2009
Facebook’s experiences with compression
One little topic didn’t make it into my long post on Facebook’s Hadoop/Hive-based data warehouse: Compression. The story seems to be:
- Facebook uses gzip, and gets a little bit more than 6X compression.
- Experiments suggest bzip2 would reduce data by another 20% or so, increasing compression to the 7.5X range.
- The downside of bzip2 is 15-25% processing overhead, depending on the kind of data (a quick way to check this tradeoff on sample data is sketched below).
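Those numbers are obviously data-dependent. One way to sanity-check the ratio/CPU tradeoff on your own data is to run both codecs over a sample file; here is a minimal sketch using Python's standard gzip and bz2 modules (the file name is a placeholder, and this measures single-threaded in-memory compression rather than a Hadoop job):

```python
import bz2
import gzip
import time

SAMPLE = "sample_log.txt"  # placeholder: any representative chunk of your data


def measure(name, compress_fn):
    """Compress the sample file in memory and report ratio and wall-clock time."""
    with open(SAMPLE, "rb") as f:
        raw = f.read()
    start = time.time()
    packed = compress_fn(raw)
    elapsed = time.time() - start
    print(f"{name}: {len(raw) / len(packed):.1f}X compression in {elapsed:.2f}s")


if __name__ == "__main__":
    measure("gzip (level 6)", lambda data: gzip.compress(data, compresslevel=6))
    measure("bzip2 (level 9)", lambda data: bz2.compress(data, compresslevel=9))
```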
Categories: Data warehousing, Database compression, Facebook, Hadoop
Comments
2 Responses to “Facebook’s experiences with compression”
Also be careful about the memory usage of bzip2, which may be prohibitively high on large data sets.
On gzip, there are nine different compression levels; level 6 seems to give the best balance between cost and data size. 6X seems about right for web log data.
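(One way to see that levels tradeoff on your own data is to sweep gzip's nine levels over a sample file; a minimal Python sketch, with the file name as a placeholder:)

```python
import gzip
import time

SAMPLE = "sample_log.txt"  # placeholder: a representative chunk of log data

with open(SAMPLE, "rb") as f:
    raw = f.read()

# Sweep all nine gzip levels to see where the ratio flattens out relative to CPU time.
for level in range(1, 10):
    start = time.time()
    packed = gzip.compress(raw, compresslevel=level)
    print(f"level {level}: {len(raw) / len(packed):.2f}X in {time.time() - start:.2f}s")
```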
Ivan,
Memory usage is definitely an issue, and it typically means that we would not be able to run as many map/reduce slots in the cluster.
We are targeting this mostly for archival at this point, and there the latency requirements for compressing or decompressing this data are not that high yet.