Byzantine Fault Detection on Charm++ Clusters
Dmitry Mogilevsky
National Center for Supercomputing Applications
University of Illinois ([email protected])
Overview
● Introduction – Faults
● Motivation
● Algorithm Overview
● Results
● Future Work
Introduction – Fault Models
● Three hardware failure models:
– Fail-Stop Faults
● Binary model.
● Either a piece of hardware works or it doesn't.
– Fail-Stutter Faults
● Refines Fail-Stop by addressing the degree of failure.
– Byzantine Faults
● Faults are completely arbitrary and have no easily discernible effect on performance.
Byzantine Faults
● Many good reasons for Byzantine fault detection:
– Avoid errors in computation.
– Early warning for total hardware failure.
– Easier to handle – the job needs to be migrated, as opposed to recovered from a checkpoint.
– Most importantly, Charm++ already has the facilities for handling computational faults through migration and checkpointing, which can be applied to processors exhibiting these faults [1].
Detecting Byzantine Faults
● Current Byzantine tolerance efforts:
– Redundancy in computation/data
– Retrofitting user-level algorithms for error detection [2,3]
● Previous work suggests consensus as a good way of doing Byzantine error detection[4]
● However, continuous consensus analysis over the entire data set (replication) is not efficient
● Instead, test consensus only on some of the data, and only some of the time.
Challenge Based Fault Detection
● Basic idea - Detect faulty processors by periodically issuing challenges and determining whether nodes respond correctly.
● Use some metric to track how ‘reliable’ each node is.
● Key desirable features:
– As transparent to the user as possible.
– Low error-free overhead.
Challenge Based Fault Detection
● The algorithm is based on Lamport's solution to the Byzantine Generals problem [5].
– The Byzantine Generals solution's message-passing constraints:
1. Each message sent is delivered correctly.
2. The receiver of a message knows who sent it.
3. The absence of a message can be detected.
– Constraint #1 is satisfied by assuming a reliable interconnect.
– Constraints #2 and #3 are satisfied as part of the algorithm.
Implementation in Charm++
● Chare group launched at job start-up and maintained until the job reaches its termination condition.
[Diagram: the Byzantine Detection chare group spanning the compute nodes of a node group, one chare per node]
Implementation in Charm++
● Initially, chare 0 starts a challenge round by invoking the entry method startRound().
[Diagram: chare 0 on one compute node initiating the challenge round across the node group]
Implementation in Charm++
● startRound() generates a random challenge (a matrix of floats and ints) and broadcasts it to the chares by invoking the performChallenge(msg) entry method on them.
[Diagram: the challenge message Msg[..] broadcast to every chare in the node group]
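As a rough sketch, the challenge payload and its generation might look like the following plain C++ (the type and field names are assumptions; the slides do not show the actual Charm++ message type). Seeding the generator lets the same challenge be reconstructed deterministically:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical challenge payload: the slides say the challenge is a
// matrix of floats and ints; here both are stored flattened.
struct ChallengeMsg {
    std::uint64_t seed;        // seed used to generate the data below
    std::vector<float> fdata;  // random float matrix, flattened
    std::vector<int> idata;    // random int matrix, flattened
};

// Build a challenge of n elements from a seed. The same seed always
// yields the same data, so the payload is reproducible.
ChallengeMsg makeChallenge(std::uint64_t seed, std::size_t n) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<float> fd(-1.0f, 1.0f);
    std::uniform_int_distribution<int> id(-1000, 1000);
    ChallengeMsg msg{seed, {}, {}};
    msg.fdata.reserve(n);
    msg.idata.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        msg.fdata.push_back(fd(rng));
        msg.idata.push_back(id(rng));
    }
    return msg;
}
```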
Implementation in Charm++
● performChallenge() uses the random data as an input to a computational challenge. Each chare computes the challenge and passes the result back by invoking returnResponse() on the initiating chare.
[Diagram: each chare returning its response Resp to the initiating chare]
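A minimal stand-in for the per-chare challenge kernel is sketched below; the slides only say each chare runs a computational challenge over the random data, so this dot-product-with-fold is illustrative, not the actual kernel. Folding to an integer makes responses comparable exactly during consensus, though in practice bit-identical floating-point behavior across heterogeneous nodes would need care:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical challenge computation: combine the float and int inputs
// and fold the result to an integer so responses can be compared exactly.
std::int64_t computeResponse(const std::vector<float>& f,
                             const std::vector<int>& i) {
    double acc = 0.0;
    std::size_t n = f.size() < i.size() ? f.size() : i.size();
    for (std::size_t k = 0; k < n; ++k)
        acc += static_cast<double>(f[k]) * i[k];
    // Scale and truncate so consensus can use exact equality.
    return static_cast<std::int64_t>(acc * 1e6);
}
```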
Implementation in Charm++
● Once the initiating chare receives responses from all the chares, or a specified amount of time passes, it calls summarizeRound().
[Diagram: the initiating chare collecting responses from across the node group]
Implementation in Charm++
● summarizeRound() performs consensus analysis. If all nodes have consensus in their results, all nodes are assumed to be operating correctly. If full consensus does not exist, the minority is assumed to be wrong.
● The reliability tables are updated based on the consensus results.
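The summarization step might be sketched as follows (the function names and the exact +/- reliability update rule are assumptions; the slides only state that the minority is assumed wrong and the table is updated):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Group identical responses, take the majority as the correct answer,
// reward agreeing nodes and penalize dissenters in the reliability table.
// Returns the indices of the nodes that disagreed with the majority.
std::vector<int> summarize(const std::vector<std::int64_t>& responses,
                           std::vector<double>& reliability) {
    std::map<std::int64_t, int> votes;
    for (std::int64_t r : responses) ++votes[r];
    std::int64_t majority = responses.empty() ? 0 : responses[0];
    int best = 0;
    for (const auto& kv : votes)
        if (kv.second > best) { best = kv.second; majority = kv.first; }
    std::vector<int> suspects;
    for (std::size_t n = 0; n < responses.size(); ++n) {
        if (responses[n] == majority) {
            // Agreement slowly rebuilds trust (0.05 is an assumed tuning knob).
            reliability[n] = std::min(1.0, reliability[n] + 0.05);
        } else {
            // Disagreement is penalized more heavily (0.25 is assumed too).
            reliability[n] -= 0.25;
            suspects.push_back(static_cast<int>(n));
        }
    }
    return suspects;
}
```

Asymmetric increments (slow reward, fast penalty) are a common choice for reputation-style metrics, so that a faulty node cannot quickly launder its score back up.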
Implementation in Charm++
● The initiating chare crafts a token containing the next head chare's identity and the updated reliability table, and broadcasts it by calling getNewToken(tok) on all chares. The next head chare then schedules the next invocation of startRound().
[Diagram: the token Tok broadcast to every chare in the node group, with the next head chare marked]
Some Correctness Issues
● The algorithm is immune to being rendered invalid by a malfunctioning node:
– Complete node loss is accounted for by loosely synchronized timeouts.
– The table of reliability metrics is protected against corruption by a malfunctioning head chare: each chare keeps a copy of the table from the previous round and compares the two for unusually large deviations.
– The effects of malfunctioning hardware on the system are mitigated by the fact that the algorithm is distributed.
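The table-corruption check could look something like this sketch (the function name and the per-round deviation bound are assumptions; the slides only call for rejecting "unusually large deviations"):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compare the reliability table broadcast by the head chare against the
// locally kept copy from the previous round. Reject the new table if any
// entry moved further than a single round could legitimately move it.
bool tableLooksSane(const std::vector<double>& prevRound,
                    const std::vector<double>& fromHead,
                    double maxDeltaPerRound) {
    if (prevRound.size() != fromHead.size()) return false;  // truncated table
    for (std::size_t n = 0; n < prevRound.size(); ++n)
        if (std::fabs(fromHead[n] - prevRound[n]) > maxDeltaPerRound)
            return false;  // one round cannot move a score this far
    return true;
}
```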
Other Key Points
● Can be fully integrated with Charm++:
– When some condition is reached, the chare scheduler can be alerted to migrate chares off the node deemed to be failing.
● Flexible:
– The single-node detection algorithm and failure conditions can be adapted.
● Requires minimal modification to the user's program (a one-line modification to the module file and a one-line modification to the main() function).
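The two-line integration might look like the following fragment (the module and chare names here are illustrative assumptions, not the actual identifiers used by the implementation):

```cpp
// In the application's .ci interface file — the one added line
// pulls in the detector module:
//     extern module byzantine;

// In main() — the one added line launches the detection chare group:
//     CProxy_ByzantineDetector::ckNew();
```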
Testing – So far
● Only limited testing done so far, mostly to measure performance:
– Overhead is quite low (between 2% and 5%).
– However, far more extensive testing is still required into many aspects of the system.
Ongoing Work
● Basic environment is finished, but much work remains to be done:
– Performance optimization
– Better scheduling
– Exploration of alternative single-node detection algorithms
– Possible integration with Charm++ fault-tolerance capabilities
References
1. Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kale. “Proactive Fault Tolerance in Large Systems”. HPCRI Workshop, 2005.
2. G.R. Redinbo and R. Manomohan. “Fault-Tolerant FFT Data Compression”. 2000 Pacific Rim International Symposium on Dependable Computing (PRDC'00).
3. Youngbae Kim, James S. Plank, and Jack Dongarra. “Fault Tolerant Matrix Operations Using Checksum and Reverse Computation”. 6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77. http://www.cs.utk.edu/~plank/plank/papers/Frontiers96.pdf
4. Tushar Deepak Chandra and Sam Toueg. “Unreliable Failure Detectors for Reliable Distributed Systems”. Journal of the ACM, Vol. 43, No. 2, pp. 225-267, March 1996.
5. Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.