Byzantine Fault Detection on Charm++ Clusters
Dmitry Mogilevsky
National Center for Supercomputing Applications
University of Illinois ([email protected])
Overview
● Introduction – Faults
● Motivation
● Algorithm Overview
● Results
● Future Work
Introduction – Fault Models
● Three hardware failure models:
– Fail-Stop Faults
● Binary model.
● Either a piece of hardware works or it doesn't.
– Fail-Stutter Faults
● Refines Fail-Stop by addressing the degree of failure.
– Byzantine Faults
● Faults are completely arbitrary and have no easily discernible effect on performance.
Byzantine Faults
● Many good reasons for Byzantine fault detection:
– Avoid errors in computation.
– Early warning for total hardware failure.
– Easier to handle – the job needs to be migrated, as opposed to recovered from a checkpoint.
– Most importantly, Charm++ already has the facilities for handling computational faults through migration and checkpointing, which can be applied to processors exhibiting these faults [1].
Detecting Byzantine Faults
● Current Byzantine tolerance efforts:
– Redundancy in computation/data
– Retrofitting user-level algorithms for error detection [2,3]
● Previous work suggests consensus as a good way of doing Byzantine error detection[4]
● However, continuous consensus analysis over the entire data set (replication) is not efficient
● Instead, test consensus only on some of the data, and only some of the time.
Challenge Based Fault Detection
● Basic idea - Detect faulty processors by periodically issuing challenges and determining whether nodes respond correctly.
● Use some metric to track how ‘reliable’ each node is.
● Key desirable features:
– As transparent to the user as possible.
– Low error-free overhead.
Challenge Based Fault Detection
● The algorithm is based on Lamport's solution to the Byzantine Generals problem [5].
– The Byzantine Generals solution's message-passing constraints:
1. Each message sent is delivered correctly.
2. The receiver of a message knows who sent it.
3. The absence of a message can be detected.
– Constraint #1 is satisfied by assuming a reliable interconnect.
– Constraints #2 and #3 are satisfied as part of the algorithm.
Implementation in Charm++
● Chare group launched at job start-up and maintained until the job reaches its termination condition.
[Diagram: the Byzantine Detection chare group spanning the compute nodes of a node group, one chare per node]
Implementation in Charm++
● Initially, chare 0 starts a challenge round by invoking the entry method startRound().
[Diagram: chare 0 on one compute node initiating the challenge round across the node group]
Implementation in Charm++
● startRound() generates a random challenge (a matrix of floats and ints) and broadcasts it to the chares by invoking the performChallenge(msg) entry method on them.
[Diagram: the challenge message Msg[..] broadcast to every chare in the node group]
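As a rough sketch, the challenge payload and its generation might look like the following plain C++ (the type and field names are assumptions; the slides do not show the actual Charm++ message type). Seeding the generator lets the same challenge be reconstructed deterministically:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical challenge payload: the slides say the challenge is a
// matrix of floats and ints; here both are stored flattened.
struct ChallengeMsg {
    std::uint64_t seed;        // seed used to generate the data below
    std::vector<float> fdata;  // random float matrix, flattened
    std::vector<int> idata;    // random int matrix, flattened
};

// Build a challenge of n elements from a seed. The same seed always
// yields the same data, so the payload is reproducible.
ChallengeMsg makeChallenge(std::uint64_t seed, std::size_t n) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<float> fd(-1.0f, 1.0f);
    std::uniform_int_distribution<int> id(-1000, 1000);
    ChallengeMsg msg{seed, {}, {}};
    msg.fdata.reserve(n);
    msg.idata.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        msg.fdata.push_back(fd(rng));
        msg.idata.push_back(id(rng));
    }
    return msg;
}
```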
Implementation in Charm++
● performChallenge() uses the random data as an input to a computational challenge. Each chare computes the challenge and passes the result back by invoking returnResponse() on the initiating chare.
[Diagram: each chare returning its response Resp to the initiating chare]
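A minimal stand-in for the per-chare challenge kernel is sketched below; the slides only say each chare runs a computational challenge over the random data, so this dot-product-with-fold is illustrative, not the actual kernel. Folding to an integer makes responses comparable exactly during consensus, though in practice bit-identical floating-point behavior across heterogeneous nodes would need care:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical challenge computation: combine the float and int inputs
// and fold the result to an integer so responses can be compared exactly.
std::int64_t computeResponse(const std::vector<float>& f,
                             const std::vector<int>& i) {
    double acc = 0.0;
    std::size_t n = f.size() < i.size() ? f.size() : i.size();
    for (std::size_t k = 0; k < n; ++k)
        acc += static_cast<double>(f[k]) * i[k];
    // Scale and truncate so consensus can use exact equality.
    return static_cast<std::int64_t>(acc * 1e6);
}
```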
Implementation in Charm++
● Once the initiating chare receives responses from all the chares, or a specified amount of time passes, it calls summarizeRound().
[Diagram: the initiating chare collecting responses from across the node group]
Implementation in Charm++
● summarizeRound() performs consensus analysis. If all nodes have consensus in their results, all nodes are assumed to be operating correctly. If full consensus does not exist, the minority is assumed to be wrong.
● The reliability tables are updated based on the consensus results.
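The summarization step might be sketched as follows (the function names and the exact +/- reliability update rule are assumptions; the slides only state that the minority is assumed wrong and the table is updated):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Group identical responses, take the majority as the correct answer,
// reward agreeing nodes and penalize dissenters in the reliability table.
// Returns the indices of the nodes that disagreed with the majority.
std::vector<int> summarize(const std::vector<std::int64_t>& responses,
                           std::vector<double>& reliability) {
    std::map<std::int64_t, int> votes;
    for (std::int64_t r : responses) ++votes[r];
    std::int64_t majority = responses.empty() ? 0 : responses[0];
    int best = 0;
    for (const auto& kv : votes)
        if (kv.second > best) { best = kv.second; majority = kv.first; }
    std::vector<int> suspects;
    for (std::size_t n = 0; n < responses.size(); ++n) {
        if (responses[n] == majority) {
            // Agreement slowly rebuilds trust (0.05 is an assumed tuning knob).
            reliability[n] = std::min(1.0, reliability[n] + 0.05);
        } else {
            // Disagreement is penalized more heavily (0.25 is assumed too).
            reliability[n] -= 0.25;
            suspects.push_back(static_cast<int>(n));
        }
    }
    return suspects;
}
```

Asymmetric increments (slow reward, fast penalty) are a common choice for reputation-style metrics, so that a faulty node cannot quickly launder its score back up.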
Implementation in Charm++
● The initiating chare crafts a token containing the next head chare's identity and the updated reliability table, and broadcasts it by calling getNewToken(tok) on all chares. The next head chare then schedules the next invocation of startRound().
[Diagram: the token Tok broadcast to every chare in the node group, with the next head chare marked]
Some Correctness Issues
● The algorithm is immune to being rendered invalid by a malfunctioning node:
– Complete node loss is accounted for by loosely synchronized timeouts.
– The table of reliability metrics is protected against corruption by a malfunctioning head chare: each chare keeps a copy of the table from the previous round and compares the two for unusually large deviations.
– The effects of malfunctioning hardware on the system are mitigated by the fact that the algorithm is distributed.
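The table-corruption check could look something like this sketch (the function name and the per-round deviation bound are assumptions; the slides only call for rejecting "unusually large deviations"):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compare the reliability table broadcast by the head chare against the
// locally kept copy from the previous round. Reject the new table if any
// entry moved further than a single round could legitimately move it.
bool tableLooksSane(const std::vector<double>& prevRound,
                    const std::vector<double>& fromHead,
                    double maxDeltaPerRound) {
    if (prevRound.size() != fromHead.size()) return false;  // truncated table
    for (std::size_t n = 0; n < prevRound.size(); ++n)
        if (std::fabs(fromHead[n] - prevRound[n]) > maxDeltaPerRound)
            return false;  // one round cannot move a score this far
    return true;
}
```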
Other Key Points
● Can be fully integrated with Charm++:
– When some condition is reached, the chare scheduler can be alerted to migrate chares off the node deemed to be failing.
● Flexible:
– The single-node detection algorithm and failure conditions can be adapted.
● Requires minimal modification to the user's program (a one-line modification to the module file and a one-line modification to the main() function).
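The two-line integration might look like the following fragment (the module and chare names here are illustrative assumptions, not the actual identifiers used by the implementation):

```cpp
// In the application's .ci interface file — the one added line
// pulls in the detector module:
//     extern module byzantine;

// In main() — the one added line launches the detection chare group:
//     CProxy_ByzantineDetector::ckNew();
```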
Testing – So far
● Only limited testing done so far, mostly to measure performance:
– Overhead is quite low (between 2% and 5%).
– However, far more extensive testing is still required into many aspects of the system.
Ongoing Work
● Basic environment is finished, but much work remains to be done:
– Performance optimization
– Better scheduling
– Exploration of alternative single-node detection algorithms
– Possible integration with Charm++ fault-tolerance capabilities
References
1. Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kale. “Proactive Fault Tolerance in Large Systems”. HPCRI Workshop, 2005.
2. G.R. Redinbo and R. Manomohan. “Fault-Tolerant FFT Data Compression”. 2000 Pacific Rim International Symposium on Dependable Computing (PRDC'00).
3. Youngbae Kim, James S. Plank, and Jack Dongarra. “Fault Tolerant Matrix Operations Using Checksum and Reverse Computation”. 6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77. http://www.cs.utk.edu/~plank/plank/papers/Frontiers96.pdf
4. Tushar Deepak Chandra and Sam Toueg. “Unreliable Failure Detectors for Reliable Distributed Systems”. Journal of the ACM, Vol. 43, No. 2, pp. 225-267, March 1996.
5. Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.