
Page 1:

Byzantine Fault Detection on Charm++ Clusters

Dmitry Mogilevsky
National Center for Supercomputing Applications
University of Illinois (dmogilev@ncsa.uiuc.edu)

Page 2:

Overview

● Introduction – Faults
● Motivation
● Algorithm Overview
● Results
● Future Work

Page 3:

Introduction – Fault Models

● Three hardware failure models:
  – Fail-Stop Faults
    ● Binary model: a piece of hardware either works or it doesn't.
  – Fail-Stutter Faults
    ● Refines Fail-Stop by addressing the degree of failure.
  – Byzantine Faults
    ● Faults are completely arbitrary and have no easily discernible effect on performance.

Page 4:

Byzantine Faults

● Many good reasons for Byzantine fault detection:
  – Avoid errors in computation.
  – Early warning of total hardware failure.
  – Easier to handle: the job needs to be migrated, as opposed to being recovered from a checkpoint.
  – Most importantly, Charm++ already has the facilities for handling computational faults through migration and checkpointing, which can be applied to processors exhibiting these faults [1].

Page 5:

Detecting Byzantine Faults

● Current Byzantine tolerance efforts:
  – Redundancy in computation/data.
  – Retrofitting user-level algorithms for error detection [2,3].
● Previous work suggests consensus as a good way of doing Byzantine error detection [4].
● However, continuous consensus analysis over the entire data set (replication) is not efficient.
● Instead, test consensus only on some of the data, and only some of the time.

Page 6:

Challenge Based Fault Detection

● Basic idea: detect faulty processors by periodically issuing challenges and determining whether nodes respond correctly.
● Use some metric to track how 'reliable' each node is (one candidate metric is sketched after this list).
● Key desirable features:
  – As transparent to the user as possible.
  – Low error-free overhead.
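The slides do not specify the reliability metric. A minimal sketch of one candidate, assuming an exponentially decayed per-node success score; the class name, the decay constant, and the threshold are illustrative assumptions, not taken from the original system:

    // Hypothetical reliability table: one score per node in [0,1],
    // updated as an exponential moving average of challenge outcomes.
    #include <vector>

    class ReliabilityTable {
      std::vector<double> score;
      static constexpr double alpha = 0.9;   // weight given to past history
    public:
      explicit ReliabilityTable(int nNodes) : score(nNodes, 1.0) {}
      void record(int node, bool correct) {
        score[node] = alpha * score[node] + (1.0 - alpha) * (correct ? 1.0 : 0.0);
      }
      double get(int node) const { return score[node]; }
      bool suspect(int node, double threshold = 0.5) const {
        return score[node] < threshold;      // below threshold => flag the node
      }
    };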

Page 7:

Challenge Based Fault Detection

● The algorithm is based on Lamport's solution to the Byzantine Generals Problem [5].
  – The Byzantine Generals solution places three constraints on message passing:
    1. Each message sent is delivered correctly.
    2. The receiver of a message knows who sent it.
    3. The absence of a message can be detected.
  – Constraint #1 is satisfied by assuming a reliable interconnect.
  – Constraints #2 and #3 are satisfied as part of the algorithm.

Page 8:

Implementation in Charm++

● A ChareGroup is launched at job start-up and maintained until the job reaches its termination condition.

[Diagram: four compute nodes forming a node group, each hosting application chares]

Page 9:

Implementation in Charm++

● The ChareGroup is launched at job start-up and maintained until the job reaches its termination condition (a sketch of its interface follows the diagram).

[Diagram: the Byzantine detection ChareGroup spanning the compute nodes of the node group]
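In Charm++ such a per-processor group is declared in a .ci interface file. A minimal sketch, assuming the detector group is named ByzantineDetector; the entry-method names come from the following slides, while the module and message names are hypothetical:

    // byzdetect.ci -- hypothetical interface for the detection ChareGroup.
    // One group element is created per processor and lives for the whole job.
    module byzdetect {
      message ChallengeMsg;
      message ResponseMsg;
      message TokenMsg;
      group ByzantineDetector {
        entry ByzantineDetector();
        entry void startRound();                      // begin a challenge round
        entry void performChallenge(ChallengeMsg *m); // compute the challenge
        entry void returnResponse(ResponseMsg *m);    // deliver a result
        entry void getNewToken(TokenMsg *t);          // receive table + next head
      };
    };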

Page 10:

Implementation in Charm++

● Initially, chare 0 starts a challenge round by invoking the entry method startRound().

[Diagram: chare 0 in the node group initiating a challenge round]

Page 11:

Implementation in Charm++

● startRound() generates a random challenge (a matrix of floats and ints) and broadcasts it to the chares by invoking the performChallenge(msg) entry method on them (a sketch follows the diagram).

[Diagram: challenge messages (Msg) broadcast from the head chare to the other compute nodes]
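A sketch of what startRound() might look like, assuming ChallengeMsg carries fixed-size fdata/idata arrays and a headPe field (all assumptions; only the entry-method names come from the slides):

    // Hypothetical head-chare code: build a random challenge and broadcast it.
    void ByzantineDetector::startRound() {
      ChallengeMsg *m = new ChallengeMsg;
      m->headPe = CkMyPe();                     // responders reply to this element
      for (int i = 0; i < CHALLENGE_SIZE; i++) {
        m->fdata[i] = (float)drand48();         // random float inputs
        m->idata[i] = (int)(lrand48() % 1000);  // random int inputs
      }
      roundStart = CkWallTimer();               // reference point for the timeout
      thisProxy.performChallenge(m);            // broadcast to every group element
    }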

Page 12:

Implementation in Charm++

● performChallenge() uses the random data as input to a computational challenge. Each chare computes the challenge and passes the result back by invoking returnResponse() on the initiating chare (a sketch follows the diagram).

[Diagram: response messages (Resp) returned from each compute node to the initiating chare]
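Each element might implement performChallenge() roughly as below; the challenge kernel itself (a product-sum over the random matrix) is an assumed stand-in for whatever computation the real system uses:

    // Hypothetical: run a deterministic computation over the challenge data
    // and send the result back to the chare that initiated the round.
    void ByzantineDetector::performChallenge(ChallengeMsg *m) {
      double acc = 0.0;
      for (int i = 0; i < CHALLENGE_SIZE; i++)
        acc += m->fdata[i] * m->idata[i];       // any deterministic kernel works
      ResponseMsg *r = new ResponseMsg;
      r->result = acc;
      r->fromPe = CkMyPe();                     // constraint #2: sender is known
      thisProxy[m->headPe].returnResponse(r);   // point-to-point reply to the head
      delete m;
    }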

Page 13:

Implementation in Charm++

● Once the initiating chare receives responses from all the chares, or a specified amount of time passes, it calls summarizeRound().

[Diagram: the initiating chare collecting responses across the node group]

Page 14:

Implementation in Charm++

● summarizeRound() performs the consensus analysis. If all nodes agree on their results, all nodes are assumed to be operating correctly. If full consensus does not exist, the minority is assumed to be wrong.
● The reliability tables are then updated based on the consensus results, as in the sketch below.
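A sketch of the consensus step, assuming the head chare has recorded each element's result in results[] and a responded[] flag array, and that exact equality of results defines agreement (all assumptions):

    // Hypothetical consensus analysis: the most common result is taken as
    // correct; minority or silent nodes (constraint #3) are penalized.
    #include <map>
    #include <algorithm>

    void ByzantineDetector::summarizeRound() {
      std::map<double, int> votes;
      for (int pe = 0; pe < CkNumPes(); pe++)
        if (responded[pe]) votes[results[pe]]++;     // tally identical results
      double majority = std::max_element(votes.begin(), votes.end(),
          [](const auto &a, const auto &b){ return a.second < b.second; })->first;
      for (int pe = 0; pe < CkNumPes(); pe++)
        table.record(pe, responded[pe] && results[pe] == majority);
    }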

Page 15:

Implementation in Charm++

● The initiating chare crafts a token carrying the next head chare's identity and the updated reliability table, and broadcasts it by calling getNewToken(tok) on all chares. The next head chare then schedules the next invocation of startRound() (a sketch follows the diagram).

[Diagram: the token (Tok) broadcast from the head chare to all chares in the node group]
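The hand-off might look like the sketch below. The token contents (next head plus the updated table) follow the slide; the round-robin rotation and the message layout are assumptions, and a real TokenMsg would need to pack the table for the wire:

    // Hypothetical end-of-round hand-off, broadcast by the current head.
    void ByzantineDetector::finishRound() {
      TokenMsg *t = new TokenMsg;
      t->nextHead = (CkMyPe() + 1) % CkNumPes();  // assumed round-robin rotation
      t->table = table;                           // updated reliability scores
      thisProxy.getNewToken(t);                   // broadcast to all elements
    }

    void ByzantineDetector::getNewToken(TokenMsg *t) {
      prevTable = table;                  // keep last round's copy (see Page 16)
      table = t->table;                   // adopt the announced table
      if (CkMyPe() == t->nextHead)
        thisProxy[CkMyPe()].startRound(); // new head schedules the next round
      delete t;                           // (a real version would delay the
    }                                     //  invocation, e.g. with a timer)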

Page 16:

Some Correctness Issues

● The algorithm cannot be rendered invalid by a single malfunctioning node:
  – Complete node loss is accounted for by loosely synchronized timeouts.
  – The table of reliability metrics is protected against corruption by a malfunctioning head chare: each chare keeps a copy of the table from the previous round and compares the two for unusually large deviations (a sketch follows this list).
  – The effects of malfunctioning hardware on the system are mitigated by the fact that the algorithm is distributed.
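One way to realize that deviation check, assuming each element retained prevTable when it adopted the last token; the bound MAX_DELTA is an arbitrary placeholder, not a value from the original system:

    // Hypothetical guard: reject a table whose scores jump implausibly far
    // from the locally kept copy of the previous round's table.
    #include <cmath>

    bool ByzantineDetector::tableLooksCorrupt(const ReliabilityTable &announced) {
      const double MAX_DELTA = 0.3;       // assumed per-round change bound
      for (int pe = 0; pe < CkNumPes(); pe++)
        if (std::fabs(announced.get(pe) - prevTable.get(pe)) > MAX_DELTA)
          return true;                    // suspiciously large deviation
      return false;
    }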

Page 17:

Other Key Points

● Can be fully integrated with Charm++:
  – When some failure condition is reached, the chare scheduler can be alerted to migrate chares off the node deemed to be failing.
● Flexible:
  – The single-node detection algorithm and the failure conditions can be adapted.
● Requires minimal modification to the user's program: a one-line modification to the module file and a one-line modification to the main() function (sketched below).
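The two one-line changes might look like the following, assuming the hypothetical byzdetect module from the earlier sketch; these lines are illustrative, not the actual patch:

    // In the application's .ci module file: pull in the detector module.
    extern module byzdetect;

    // In the application's main() / mainchare constructor: start the detector
    // group, which creates one element per processor.
    CProxy_ByzantineDetector detector = CProxy_ByzantineDetector::ckNew();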

Page 18:

Testing – So far

● Only limited testing has been done so far, mostly to measure performance:
  – Overhead is quite low (between 2% and 5%).
  – However, far more extensive testing of many aspects of the system is still required.

Page 19:

Ongoing Work

● The basic environment is finished, but much work remains to be done:
  – Performance optimization.
  – Better scheduling.
  – Exploration of alternative single-node detection algorithms.
  – Possible integration with Charm++ fault-tolerance capabilities.

Page 20:

References

1. Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kale. "Proactive Fault Tolerance in Large Systems". HPCRI Workshop, 2005.

2. G. R. Redinbo and R. Manomohan. "Fault-Tolerant FFT Data Compression". 2000 Pacific Rim International Symposium on Dependable Computing (PRDC'00).

3. Youngbae Kim, James S. Plank, and Jack Dongarra. "Fault Tolerant Matrix Operations Using Checksum and Reverse Computation". 6th Symposium on the Frontiers of Massively Parallel Computation, Annapolis, MD, October 1996, pp. 70-77. http://www.cs.utk.edu/~plank/plank/papers/Frontiers96.pdf

4. Tushar Deepak Chandra and Sam Toueg. "Unreliable Failure Detectors for Reliable Distributed Systems". Journal of the ACM, Vol. 43, No. 2, pp. 225-267, March 1996.

5. Leslie Lamport, Robert Shostak, and Marshall Pease. "The Byzantine Generals Problem". ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
