May 12, 2009
How much state is saved when an MPP DBMS node fails?
Mark Callaghan raised an interesting question in the comment thread to my recent Facebook/Hadoop/Hive post:
My question is about how commercial MPP RDBMS vendors recover from single or a small number of node failures during a long running SQL query. Do any of them save enough state to avoid starting the query over?
Honestly, I’d just be guessing at the answer.
Would any vendors or other knowledgeable folks care to take a crack at answering directly?
Comments
An MPP RDBMS built by Chuck Norris never experiences node failures.
If a node fails in a clique, there is a performance penalty on any query or load job. Once the node is recovered, it will rejoin the MPP system.
It is important to know what level of fault tolerance is in place.
Steve Wooledge of Aster took a crack at this question over in the Facebook/Hadoop/Hive thread.
LOL, Robert!
As far as DB2 for LUW is concerned, if a query fails in a DPF environment it needs to be restarted. Whether the reason is a node failure or something else is irrelevant.
No intermediate results are rescued – with the obvious exception of a primed bufferpool of course.
Roughly speaking, most MPP databases use dataflow pipelines via Graefe’s famous Exchange model. Those pipelines reflect an extremely optimistic view of reliability, and require expensive restarts of deep dataflow pipelines in the case of even a single fault.
By contrast, Hadoop (as per the Google MapReduce paper) is wildly pessimistic, checkpointing the output of every single Map or Reduce stage to disks, before reading it right back in. As a result, it’s easy to construct cases where a traditional MPP DB would do no more I/O than the scan of the inputs, whereas a Hadoop job might need to write and reread stages of the pipeline over and over. (I describe this to my undergrads as vomiting the data to disk just to immediately swallow it back up.)
The best answer probably lies either in between, or in an entirely different approach. More on that question in this post.
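To make the contrast concrete, here is a toy Python sketch (an illustration only, not code from any of the systems discussed) of the two execution styles: a pipelined plan that streams rows between chained operators in memory, versus a checkpointed plan that writes each stage’s full output to disk and reads it back before the next stage runs.

import json
import tempfile

rows = [{"user": "a", "bytes": 100}, {"user": "b", "bytes": 50},
        {"user": "a", "bytes": 25}]

# Pipelined (MPP-style): operators are chained generators; nothing is written
# to disk between them, but a fault anywhere forces the whole chain to restart.
def scan(data):
    for r in data:
        yield r

def project(rows_iter):
    for r in rows_iter:
        yield (r["user"], r["bytes"])

pipelined_result = {}
for user, b in project(scan(rows)):
    pipelined_result[user] = pipelined_result.get(user, 0) + b

# Checkpointed (MapReduce-style): the "map" stage writes its full output to a
# file, and the "reduce" stage reads it back, so a failed stage can be re-run
# on its own at the cost of the extra I/O.
stage1 = tempfile.NamedTemporaryFile("w", delete=False)
for r in rows:
    stage1.write(json.dumps([r["user"], r["bytes"]]) + "\n")
stage1.close()

checkpointed_result = {}
with open(stage1.name) as f:
    for line in f:
        user, b = json.loads(line)
        checkpointed_result[user] = checkpointed_result.get(user, 0) + b

assert pipelined_result == checkpointed_result

The pipelined version does no intermediate I/O but must restart the whole chain after any fault; the checkpointed version can re-run just the failed stage, at the price of materializing everything.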
Joe, everything you’re saying sounds right.
But why would Hadoop have to reread “over and over”? I thought it only reads that data if it’s recovering from a crash.
Also, is it possible that a Hadoop node writing a checkpoint could do the disk writing and its computation in parallel, which would reduce the cost of the checkpointing at least somewhat, and perhaps greatly?
Daniel:
Google’s MR paper (which Hadoop folks follow closely) sez:
1) Mappers write outputs to their local disk. (And some Map jobs are done redundantly on >1 machine to mitigate “stragglers”)
2) Reducers fetch from mappers over the network some time later
3) Reducers write their outputs to the distributed filesystem (triply replicated)
So in terms of resource consumption (energy, disk utilization) that’s a lot of I/O.
Your points are on target: some reads can be absorbed by filesystem cache, and there is overlap of CPU, disk I/O, and network I/O. The latter only affects completion time, not resource consumption.
J
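To put rough numbers on Joe’s “that’s a lot of I/O” point, here is a back-of-envelope sketch. The input, map-output, and result sizes are made up purely for illustration; the three-way replication factor follows the “triply replicated” output Joe mentions in step 3.

# Back-of-envelope arithmetic (hypothetical sizes, not a benchmark) for one
# job over 1 TB of input, following the three steps above.
TB = 10**12

input_bytes = 1.0 * TB        # scanned from the distributed filesystem
map_output = 0.5 * TB         # written to mappers' local disks (step 1)
shuffle = map_output          # fetched by reducers over the network (step 2)
reduce_output = 0.2 * TB      # final output, triply replicated (step 3)
replication = 3

mapreduce_io = input_bytes + map_output + shuffle + replication * reduce_output
pipelined_io = input_bytes + reduce_output   # idealized MPP plan: scan, then write the result once

print(f"MapReduce-style I/O: {mapreduce_io / TB:.1f} TB")
print(f"Pipelined I/O:       {pipelined_io / TB:.1f} TB")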
Currently, when a Vertica node fails, the cluster cancels any statements that are in flight and the user is immediately able to re-run them. Transactions that are in flight are preserved (i.e., we keep transaction state). The user receives a statement-level error, similar to what they would get in the case of a lock timeout or a policy-based quota or resource rejection. Feedback from our users is that this model makes development very simple: they don’t need to handle the special case of a node failure. When a statement fails they can just re-run it.
Of course if the machine you are connected to fails then the transactions that it initiated are rolled back and the user needs to connect to a new machine. Since all nodes in a Vertica cluster are peers, users can connect to any node and new connections are rerouted to a live node automatically when using load balancing software.
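The “just re-run it” model described above amounts to very little client code. Here is a hedged sketch; the StatementError class, run_with_retry helper, and fake_execute function are all hypothetical placeholders, not Vertica’s actual driver API.

import time

class StatementError(Exception):
    """Stand-in for the statement-level error a driver would raise."""

def run_with_retry(execute, sql, retries=3, delay_secs=5):
    # Re-issue the statement a few times; per the comment above, the cluster
    # cancels the in-flight statement and stays available for a re-run.
    for attempt in range(1, retries + 1):
        try:
            return execute(sql)
        except StatementError:
            if attempt == retries:
                raise
            time.sleep(delay_secs)

# Tiny demonstration with a fake executor that fails once, roughly the way a
# node failure would surface to the client.
calls = {"n": 0}
def fake_execute(sql):
    calls["n"] += 1
    if calls["n"] == 1:
        raise StatementError("node failed mid-query")
    return [("ok",)]

print(run_with_retry(fake_execute, "SELECT count(*) FROM big_table", delay_secs=0))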
With Netezza, when a SPU fails during query execution, all select statements that have not started returning data are restarted from the beginning.
Any data loads are killed and must be restarted manually.
Select statements that have started returning data are also killed and must be restarted manually.
Fortunately failures are not very common so it is rarely an issue.