Facebook, Hadoop, and Hive
A few weeks ago, I posted about a conversation I had with Jeff Hammerbacher of Cloudera, in which he discussed a Hadoop-based effort at Facebook he previously directed. Subsequently, Ashish Thusoo and Joydeep Sarma of Facebook contacted me to expand upon and in a couple of instances correct what Jeff had said. They also filled me in on Hive, a data-manipulation add-on to Hadoop that they developed and subsequently open-sourced.
Updating the metrics in my Cloudera post,
- Facebook has 400 terabytes of disk managed by Hadoop/Hive, with a slightly better than 6:1 overall compression ratio. Since 400 terabytes times a bit more than 6 works out to roughly 2.4-2.5 petabytes of uncompressed user data, the 2 1/2 petabytes figure is reasonable.
- Facebook’s Hadoop/Hive system ingests 15 terabytes of new data per day now, not 10.
- Hadoop/Hive cycle times aren’t as fast as I thought I heard from Jeff. Ad targeting queries are the most frequent, and they’re run hourly. Dashboards are repopulated daily.
Nothing else in my Cloudera post was called out as being wrong.
In a new-to-me metric, Facebook has 610 Hadoop nodes, running in a single cluster, due to be increased to 1000 soon. Facebook thinks this is the second-largest* Hadoop installation, or else close to it. What’s more, Facebook believes it is unusual in spreading all its apps across a single huge cluster, rather than doing different kinds of work on different, smaller sub-clusters.
*Apparently, Yahoo is at 2000 nodes (and headed for 4000), 1000 or so of which are operated as a single cluster for a single app.
Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS. Major drivers of the choice included:
- License/maintenance costs. Free is a good price.
- Open source flexibility. Facebook is one of the few users I’ve ever spoken with that actually cares about modifying open source code.
- Ability to run on cheap hardware. Facebook runs real-time MySQL instances on boxes that cost $10K or so, and would expect to pay at least as much for an MPP DBMS node. But Hadoop nodes run on boxes that cost no more than $4K, and sometimes (depending e.g. on whether they have any disk at all) as little as $2K. These are “true” commodity boxes; they don’t even use RAID.
- Ability to scale out to lots of nodes. Few of the new low-cost MPP DBMS vendors have production systems even today of >100 nodes. (Actually, I’m not certain that any except Netezza do, although Kognitio in a prior release of its technology once built a 900ish node production system.)
- Inherently better performance. Correctly or otherwise, the Facebook guys thought that Hadoop had performance advantages over a DBMS, due to the lack of overhead associated with transactions and so on.
One option Facebook didn’t seriously consider was sticking with the incumbent, which Facebook folks regarded as “horrible” and a “lost cause.” The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold. (But based on my discussion with Cloudera, I gather that vendor’s DBMS is indeed used to run some reporting today.)
Reliability of Facebook’s Hadoop/Hive system seems to be so-so. It’s designed for a few nodes at a time to fail; that’s no biggie. There’s a head node that’s a single point of failure; while there’s a backup node, I gather failover takes 15 minutes or so, a figure the Facebook guys think they could reduce substantially if they put their minds to it. But users submitting long-running queries don’t seem to mind delays of up to an hour, as long as they don’t have to resubmit their queries. Keeping ETL up is a higher priority than keeping query execution going. Data loss would indeed be intolerable, but at that level Hadoop/Hive seems to be quite trustworthy.
There also are occasional longer partial(?) outages, when an upgrade introduces a bug or something, but those don’t seem to be a major concern.
Facebook’s variability in node hardware raises an obvious question — how does Hadoop deal with heterogeneous hardware among its nodes? Apparently a fair scheduling capability has been built for Hadoop, with Facebook as the first major user and Yahoo apparently moving in that direction as well. As for inputs to the scheduler (or any more primitive workload allocator) — well, that depends on the kind of heterogeneity.
- Disk heterogeneity — the distributed file system reports back on each node’s disk capacity and usage.
- CPU heterogeneity — different nodes can be configured to run different numbers of concurrent tasks each (see the configuration sketch after this list).
- RAM heterogeneity — Hadoop does not understand the memory requirements of each task, and does not do a good job of matching tasks to boxes accordingly. But the Hadoop community is working to fix this.
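For the CPU case, the knob is a per-node cap on concurrent map and reduce tasks, set in each TaskTracker’s configuration (hadoop-site.xml or mapred-site.xml, depending on Hadoop version); the fair scheduler is swapped in on the JobTracker the same way. A rough sketch, with the slot counts purely illustrative and property names as of the Hadoop 0.19/0.20 era:

    <!-- On a beefier worker node: allow more concurrent tasks per TaskTracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

    <!-- On the JobTracker: replace the default FIFO scheduler with the fair scheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>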
Further notes on Hive
Without Hive, some basic Hadoop data manipulations can be a pain in the butt. A GROUP BY or the equivalent could take >100 lines of Java or Python code, and unless the person writing it knew something about database technology, it could use some pretty sub-optimal algorithms even then. Enter Hive.
Hive sets out to fix this problem. Originally developed at Facebook (in Java, like Hadoop is), Hive was open-sourced last summer, by which time its SQL interface was in place, and now has 6 main developers. The essence of Hive seems to be:
- An interface that implements a subset of SQL
- Compilation of that SQL into a MapReduce configuration file.
- An execution engine to run same.
The SQL implemented so far seems, unsurprisingly, to be what is most needed to analyze Facebook’s log files. I.e., it’s some basic stuff, plus some timestamp functionality. There also is an extensibility framework, and some ELT functionality.
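To make that concrete, the sort of aggregation that takes pages of hand-written MapReduce code shrinks to a few lines of Hive’s SQL dialect. The table and column names below are made up for illustration; the syntax is plain HiveQL:

    -- Daily hit counts from a hypothetical page-view log, partitioned by date
    SELECT page_url, COUNT(1) AS views
    FROM page_view_log
    WHERE dt = '2009-05-11'
    GROUP BY page_url;

Under the covers, Hive compiles this into a single MapReduce job: the maps emit page_url as the key, and the reduces do the counting.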
Known users of Hive include Facebook (definitely in production) and hi5 (apparently in production as well). Also, there’s a Hive code committer from Last.fm.
Other links about huge data warehouses:
- eBay has a 6 1/2 petabyte database running on Greenplum and a 2 1/2 petabyte enterprise data warehouse running on Teradata.
- Wal-Mart, Bank of America, another financial services company, and Dell also have very large Teradata databases.
- Yahoo’s web/network events database, running on proprietary software, sounded about 1/6th the size of eBay’s Greenplum system when it was described about a year ago.
- Fox Interactive Media/MySpace has multi-hundred terabyte databases running on each of Greenplum and Aster Data nCluster.
- TEOCO has 100s of terabytes running on DATAllegro.
- To a probably lesser extent, the same (i.e., 100s of terabytes on DATAllegro) is now also true of Dell.
- Vertica has a couple of unnamed customers with databases in the 200 terabyte range.
Comments
Hey Curt,
I’m glad you got a chance to speak with Ashish and Joy! They, along with the rest of the Facebook Data team, deserve the credit for designing, building, and operating Facebook’s innovative data infrastructure.
I just wanted to add that the fair scheduler mentioned in your blog post is largely the work of Matei Zaharia, a PhD student at Berkeley’s RAD Lab: http://www.cs.berkeley.edu/~matei/.
Matei’s work is a great example of the power of open source and the Hadoop community. He’s been able to continue his work on the fair scheduler even after he returned to school, and as mentioned in the article, it seems likely that his work is going to be moving into Yahoo!’s clusters as well.
Later,
Jeff
Any idea if there are use-case profiles within government monster stores like Lawrence Livermore, NASA, Argonne’s projects, or elsewhere?
@dave: There is the installation at the University of Nebraska at Lincoln that processes high energy physics data (writeup at cloudera.com: http://www.cloudera.com/blog/2009/05/01/high-energy-hadoop/ ). I don’t know what your definition of monster is, but it appears they house around 200TB per cluster.
Also, I run the Hadoop cluster for storage of smart-grid data in service of efforts for NERC at TVA (data off the US power grid for transmission and generation). We’re in prototype/build-out mode, but we’re heading towards a >100TB Hadoop system in the near future.
Hadoop is a tremendous product, and we believe that its price, ecosystem, and general flexibility are great for big data.
One thing that I am suspicious about is the resource utilization of a 610-node Hadoop cluster running hourly jobs.
I would make a wild guess that Facebook is getting less than 10% average CPU utilization on that cluster over a 24-hour period … please correct me if I am wrong.
Also, what data throughput can they achieve per node?
In other words, it’s great to have a 610-node cluster, but if you can do a similar job with 100 nodes, that would be better 🙂
@Ivan: typically Hadoop-type jobs are disk-bound as opposed to CPU-bound, so they want more disk-access “surface area,” so to speak. More nodes = more places to read data into mappers in a massively parallel sense.
I keep seeing articles comparing RDBMSs to Hadoop (and MapReduce), and it’s really apples to oranges. You just don’t generally apply each of those to the same problem, although there is some cross-over.
@Curt
Just wanted to add that even though there is a single point of failure, reliability problems due to software bugs have not been an issue, and the DFS NameNode has been very stable. The JobTracker crashes that we have seen are due to errant jobs; job isolation is not yet that great in Hadoop, and a bad query from a user can bring down the tracker (though the recovery time for the tracker is literally a few minutes). There is some good work happening in the community, though, to address those issues.
@Ivan
Though the average CPU usage is low, there are many times during the day when we are maxed out on CPU, so we do end up using the entire compute capacity of the cluster during peak hours.
2 recs on Dice currently at Ft Meade requiring Hadoop and security clearances, one at LLNL.
Actually, Yahoo has 24,000 Hadoop computers at this point in clusters of up to 3000 computers.
I also believe that Quantcast is the second largest installation. Four months ago they were at 1000 nodes…
For the Hive people — Can Hive survive the failure of query processing nodes during a long running query? Is this handled by restarting the query, or just the step running on the failed nodes?
For Curt — How do the commercial MPP vendors handle this? I will guess that AsterData and NeoView don’t have to restart the query, but others may.
Mark,
A core idea of MapReduce is that if a node fails, you automagically send its work to another node or nodes and keep going. How much of a slowdown this amounts to would, I guess, depend on the nature and design of the query or task being done, and how you manage your spare copy of the data.
Offhand, I can’t think of a reason why the answer would be any different for Aster or Greenplum (did you mean Greenplum when you wrote Neoview?) than for Hadoop, when doing MapReduce.
Or if your question wasn’t about MapReduce, but rather what happens to a query when a node fails in an MPP SQL DBMS — hmmm. I imagine the step in the query plan that failed would have to be redone, on the mirrored copy of the data. The natural follow-up questions are to ask how badly performance degrades in case of a node failure, and how long the bad degradation lasts, and how much degradation remains after data is redistributed.
The answers to that turn out to vary a lot on a case-by-case basis. For example, on Vertica the mirrored copy of a column is likely to be sorted differently than the copy you were using for the query, so the performance hit varies a lot on a query-by-query basis. Row-based systems can usually be tuned for your choice of how many different nodes a single node’s data is mirrored across. And they also aren’t uniform in having primary data on the fast outer tracks and mirrored data on the slow inner tracks. (I forget at the moment why anybody would ever do it any other way, but at least one vendor told me there was a reason.)
CAM
thanks for the post Curt!
@Ivan – regarding utilization – Facebook’s average CPU utilization for the cluster is more than 50%. This is partly the reason why the cluster is being expanded. Data throughput per node depends a lot on the number of concurrent tasks, whether the data is local or not, and the nature of the computation that is being performed. Isolated controlled runs look very different from real-life environments with heterogeneous task mixes. At this point Yahoo’s Tera-Sort runs on Hadoop are the best reference point on performance.
@Mark – at this point Hive queries don’t restart from the last completed map-reduce step on failure. I have opened a Jira: https://issues.apache.org/jira/browse/HIVE-480 to track this.
To add to the post:
a) Open source really rocks. In addition to the work on scheduling by Matei – there have been many instances where we have been able to modify the core Hadoop code base for our environment:
– the ongoing work by Dhruba to support archival storage (with limited bandwidth) transparently.
– critical patches to not allow runaway memory consumption by job-tracker
– being able to try and fix problems with the bzip2 codec in Hadoop
to just name a few.
b) regarding availability. One of the things we love about Hadoop is how it keeps chugging along even when half the system is down. As I mentioned to Curt – ingesting new log data rapidly is critical for Facebook – and in that sense it’s great that the Hadoop file system can keep accepting new data at a time when it may otherwise be incomplete/corrupt because of missing nodes.
c) In addition to the SQL compiler and execution engine – Hive also provides a metastore that tracks the various data units (tables/partitions, their schemas/formats and other properties), and this is invaluable even for pure map-reduce programs in terms of data addressability.
d) One of the more novel aspects of the SQL that Hive provides is the capability to inject map and reduce transforms (via scripts) into the query (in addition to standard SQL clauses); a rough sketch follows below. This is one of the reasons why it has become so popular amongst data mining engineers at Facebook (since it gives them the ability to write custom logic in the language of their choice).
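To make (c) and (d) concrete with a simplified, made-up schema and script name (the syntax itself is standard Hive):

    -- The metastore tracks this table, its schema and its date partitions
    CREATE TABLE action_log (user_id BIGINT, action STRING, ts STRING)
    PARTITIONED BY (ds STRING);

    -- Ship a custom script to the cluster and embed it in the query;
    -- the script reads tab-separated rows on stdin and writes them to stdout
    ADD FILE sessionize.py;
    SELECT TRANSFORM (user_id, action)
    USING 'python sessionize.py'
    AS (user_id, session_id)
    FROM action_log
    WHERE ds = '2009-05-11';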
Both Hadoop and Hive keep improving rapidly in the areas of reliability, scalability and performance (yay open source community!) and this is one of the other things that makes Hadoop/Hive relatively future-proof (and helps companies like Facebook decide to use them).
Finally – the successful deployment of Hive and Hadoop at Facebook would not have been possible without all the wonderful engineers there – who have guided Hive feature development as well as incurred the pain of living on the bleeding edge.
Joydeep,
Thanks for your help, including all the additional info you just posted!
I must admit to confusion on the re-starting question. Are you saying that if a node goes down, the whole task has to restart? If so, is this common in MapReduce tasks, or is the issue Hive-specific?
Thanks!
CAM
@Curt – for a single map-reduce job – Hadoop takes care of retrying individual tasks (in case they fail). This covers the case of node failures for a single map-reduce job. However – the entire cluster may go down (since as we noted – job tracker and namenodes are single points of failure and today there’s no transparent failover for these components). All currently running map-reduce jobs will fail in that case.
A single Hive query may require multiple map-reduce jobs. Today – if one of the map-reduce jobs fails (for instance if the cluster goes down – or there are connectivity issues with the cluster) – then the entire query aborts. This is bad because some of the expensive map-reduce jobs may already have been completed and we should only re-submit the ones that have not been (hence the JIRA). Obviously – this doesn’t happen that often.
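As a made-up example: a query like the one below compiles into (at least) two map-reduce jobs, roughly one for the join and one for the aggregation. Today, if the cluster goes down after the join job has completed, the whole query has to be resubmitted from scratch (the users table here is hypothetical):

    -- Join then aggregate: each stage runs as its own map-reduce job
    SELECT u.country, COUNT(1) AS actions
    FROM action_log a
    JOIN users u ON (a.user_id = u.user_id)
    WHERE a.ds = '2009-05-11'
    GROUP BY u.country;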
Having said that – there’s also some ongoing work in the Hadoop codebase on making map-reduce jobs survive cluster failures, which may be relevant (and which I am not fully abreast of).
Curt – My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?
Mark,
Let’s find out!
@Mark
The Aster nCluster MPP RDBMS is founded on recovery-oriented computing principles and the user will not have to restart any read-only queries in the event of a failure. More on our blog:
http://tinyurl.com/6fcuj4 or datasheet: http://www.asterdata.com/resources/downloads/datasheets/Aster_nCluster_Availability.pdf
> Few of the new low-cost MPP DBMS vendors have production systems even today of >100 nodes.
MySpace has >200 nodes in production with Aster nCluster. Video about it here: http://tinyurl.com/oz39bf
Part of the reason Hadoop is better at fine-grained failure recovery than a typical parallel DBMS is that Hadoop materializes the results of each operator (e.g. map output is materialized on disk before being sent to the reducers). A parallel DB using pipelined query execution would likely get better performance, at the cost of making fine-grained error recovery more complicated.
@Steve
I think the basic principles described in the Aster datasheet are similar to Hadoop’s. Have a replica (or two, in the case of Hadoop, in order to withstand rack failures) and then fail over the query fragment to the replica if anything happens to one of the compute nodes. Hadoop does that very well, as Joydeep mentioned in his comments, and in fact the way Hive is structured, Hadoop and Hive are able to withstand such failures for DMLs as well and not just read-only queries. Steve, is that true for Aster as well?
Node failures and transient failures are not a problem in Hadoop/Hive; Hadoop solves that systemically. Where things become a bit weak are the central node that does the job submission (the JobTracker) and the central node that maintains file system metadata (the NameNode). By having a backup of these nodes, it is entirely possible to protect the cluster against any hardware failures for these nodes as well.
The lack of query isolation in Hadoop/Hive, however, does mean that a bad query (e.g. a query using, say, 16GB of memory in each of its reduce tasks, on nodes that have 8GB of RAM, and running across the entire cluster) can make the compute nodes and the file system nodes swap and thus bring down the cluster. Most of the availability problems that we have seen have been due to really bad queries having unintended side effects, especially on the memory usage side. We have not had problems withstanding hardware failures or compute node failures.
There’s a very interesting continuum of approaches here – as best described in a recent blog post by Joe Hellerstein.
http://databeta.wordpress.com/2009/05/14/bigdata-node-density/
A few choice extracts:
– Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. (I describe this to my undergrads as the “regurgitation approach” to fault tolerance.) By contrast, classic MPP database approaches (like Graefe’s famous Exchange operator) are wildly optimistic and pipeline everything, requiring restarts of deep dataflow pipelines in the case of even a single fault.
– The Google MapReduce pessimistic fault model requires way more machines, but the more machines you have, the more likely you are to see a fault, which will make you pessimistic….
– It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart. Can’t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?
That last point hits the nail on the head. If a query would run for 1 minute without ‘regurgitation’ and 40 minutes with it (or require 40x the hardware), you’d probably be better off just running it straight and allowing the query to automatically restart if it fails. For longer running queries, a very selective amount of mid-query checkpointing (i.e. not full regurgitation at every step) could start to make sense, but finding the right balance is really an optimization problem based on the expected runtime and characteristics of the query. If only we had a smart query optimizer that we could use to make those kinds of decisions… 🙂
How is load balancing across the different nodes ensured? Does Hadoop handle it internally, or do we need to define a distribution mechanism?
Hadoop has some level of load balancing. That’s essential to any implementation of MapReduce. If it’s not good enough for you — well, Hadoop is open source, so dive in and change it.
Sorry, hit enter too soon. Has anyone tried Datameer’s Excel front end to Hadoop? It makes life easier for IT and the end users. Free copy on the website. Give it a chance. You won’t be disappointed.