“Revisiting Fault Diagnosis Agreement in a New Territory,” S. C. Wang and K. Q. Yan, Operating Systems Review, April 2004, pp. 41–61. An extension of the Byzantine Generals algorithm, and hot off the press.



Agreement Problem

In the Byzantine Generals problem, a commanding general issues an “order” and all loyal lieutenant generals must agree on the same order.

A related subproblem is the consensus problem: each processor has its own initial value and communicates with all other processors so that the healthy processors reach a common value.


Consensus constraints

All healthy processors agree on a common value (Consensus)

If all the processors share a common initial value v_i, then all the healthy processors must agree on v_i

Most protocols for solving Byzantine Agreement or consensus are fault-masking protocols: they reach consensus without the faults affecting the outcome.
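The two constraints can be written as a small check on the outcome of one run. This is only a sketch: the function name `check_consensus` and the dictionary layout are assumptions made for illustration, not something taken from the paper.

```python
def check_consensus(initial, decided, healthy):
    """Return (consensus_ok, common_value_ok) for one finished run.

    initial -- dict: processor id -> initial value (all processors)
    decided -- dict: processor id -> decided value (healthy processors)
    healthy -- set of healthy processor ids
    """
    decisions = {decided[p] for p in healthy}
    # Constraint 1: every healthy processor decided the same value.
    consensus_ok = len(decisions) == 1

    initial_values = set(initial.values())
    if len(initial_values) == 1:
        # Constraint 2: all processors started from v_i, so v_i must be chosen.
        common_value_ok = decisions == initial_values
    else:
        # No common initial value, so the second constraint does not apply.
        common_value_ok = True
    return consensus_ok, common_value_ok


print(check_consensus({1: 0, 2: 0, 3: 0}, {1: 0, 2: 0}, {1, 2}))
print(check_consensus({1: 0, 2: 1, 3: 0}, {1: 0, 2: 1}, {1, 2}))
```

The second call shows a run that violates the first constraint: the two healthy processors decided different values.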


Fault Diagnosis Agreement (FDA)

The goal is for each healthy processor to detect and locate the faulty components in the distributed system

All the healthy processors identify a common set of faulty components in the process of reaching consensus (Agreement)

No healthy component is falsely detected as faulty by any healthy processor (Fairness)
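Agreement and Fairness can likewise be stated as checks over what each healthy processor reports. The sketch below assumes each report is a set of component labels; `check_fda` and the link labels are hypothetical names chosen for the example.

```python
def check_fda(detected, actually_faulty):
    """Return (agreement_ok, fairness_ok).

    detected        -- dict: healthy processor id -> set of components it reports faulty
    actually_faulty -- set of components that really are faulty
    """
    reports = list(detected.values())
    # Agreement: every healthy processor reports the same faulty set.
    agreement_ok = all(r == reports[0] for r in reports)
    # Fairness: no report contains a component that is in fact healthy.
    fairness_ok = all(r <= actually_faulty for r in reports)
    return agreement_ok, fairness_ok


faulty = {"link P1-P4", "link P1-P5"}
print(check_fda({"P2": set(faulty), "P3": set(faulty)}, faulty))
print(check_fda({"P2": set(faulty), "P3": faulty | {"link P2-P3"}}, faulty))
```

In the second call P3 accuses a healthy link, so both properties fail.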


Paper assumes dual failure mode on the network

Most previous papers assume that the faulty components are processors only and that the network is fault-free. Here we assume that the processors are fault-free and that the network may have a fault.

Most other papers also assume that faults are malicious only. Here we assume dual failure:
Malicious faults (a random value is sent)
Dormant faults (no value is sent, as in a crash, or a stuck-at value is sent)

Assume that a healthy processor can detect components with dormant faults.


Assumptions

A synchronous distributed system whose processors are reliable during the protocol execution

Faults such as a crash, a stuck-at value, noise, or an intruder may interfere with message transmission

An n-processor fully connected network with m malicious faults and d dormant faults, where m <= ceiling[(n-d-3)/2]
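As a quick arithmetic check of the bound as written on this slide, the sketch below evaluates it for a given n and d. The helper names are made up for illustration; the values n = 5, d = 1, m = 1 are my reading of the worked example later in the slides.

```python
import math


def max_malicious(n, d):
    """Largest m allowed by m <= ceiling[(n - d - 3) / 2]."""
    return math.ceil((n - d - 3) / 2)


def within_bound(n, m, d):
    """True when m malicious and d dormant faults satisfy the bound."""
    return m <= max_malicious(n, d)


print(max_malicious(5, 1))     # 1
print(within_bound(5, 1, 1))   # True
print(within_bound(5, 2, 1))   # False
```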


Dual Fault Detection Consensus (DFDC) Algorithm

Three phases:
Message exchange phase
Decision making phase
Fault detection phase

The message exchange and decision making phases are similar to OM(1) in the Byzantine Generals paper. They leave each processor i with a matrix of information, MAT_i, which is used to construct a majority vector, MAJ_i.


Fault detection phase

Each processor sends its MAT_i to every other processor. Each healthy processor i then uses the matrices it received to find the faults:
Take the majority value in each position of the matrices to get FDMAT_i
If no majority exists for the (i,j)th position, use the negative value of the (i,j)th position of the MAT_j that was sent


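Example: five fully connected processors with initial values P1=0, P2=0, P3=0, P4=1, P5=1. The figure legend marks one link as dormant faulty and one link as malicious faulty.

Initial values

V1  V2  V3  V4  V5
0   0   0   1   1

Vectors received after the first round (x marks a value that did not arrive)

V1  V2  V3  V4  V5
0   0   0   1   x
0   0   0   0   0
0   0   0   0   0
0   1   1   1   1
x   1   1   1   1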
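The first-round table can be reproduced with a small simulation. The fault placement (link P1-P4 malicious, link P1-P5 dormant) is inferred from the table, and modelling the malicious link as a bit flip is a simplification chosen to keep the sketch deterministic; the slides describe a malicious fault as sending a random value.

```python
# Initial values from the example.
initial = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

# Assumed fault placement, read off the first-round table.
malicious_links = {frozenset((1, 4))}
dormant_links = {frozenset((1, 5))}


def deliver(sender, receiver, value):
    """Value as seen by the receiver after crossing the link."""
    link = frozenset((sender, receiver))
    if link in dormant_links:
        return "x"              # nothing arrives
    if link in malicious_links:
        return str(1 - value)   # assumption: corrupt by flipping the bit
    return str(value)


def first_round():
    """Row k holds P_k's value as received by P1..P5; column j is V_j."""
    ids = sorted(initial)
    return ["".join(deliver(k, j, initial[k]) for j in ids) for k in ids]


for row in first_round():
    print(" ".join(row))
```

Read column by column, the printed rows give the vectors V1 to V5 shown in the table.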


MAT_1 (columns V1 to V5) and MAJ_1 (last column)

V1  V2  V3  V4  V5  MAJ_1
0   0   0   1   x   0
0   0   0   1   x   0
0   0   0   0   x   0
0   1   1   1   x   1
x   1   1   0   x   1
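The MAJ column appears to be the row-wise majority of MAT with x entries left out of the count. That rule is inferred from the numbers on the slides rather than stated there, so the sketch below is an illustration under that assumption; it does not address ties.

```python
from collections import Counter

# MAT_1 from the example, one string per row ("x" = no value).
MAT_1 = ["0001x", "0001x", "0000x", "0111x", "x110x"]


def majority_vector(mat):
    """Row-wise majority, ignoring "x" entries (assumed rule)."""
    maj = []
    for row in mat:
        counts = Counter(v for v in row if v != "x")
        # A row with no usable value stays "x".
        maj.append(counts.most_common(1)[0][0] if counts else "x")
    return "".join(maj)


print(majority_vector(MAT_1))  # 00011
```

The result matches the MAJ_1 column of the table.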


MAT_2 and MAT_3 (columns V1 to V5) with MAJ_2 and MAJ_3 (last column)

V1  V2  V3  V4  V5  MAJ_2,3
0   0   0   1   x   0
0   0   0   0   0   0
0   0   0   0   0   0
0   1   1   1   1   1
x   1   1   1   1   1


MAT_4 (columns V1 to V5) and MAJ_4 (last column)

V1  V2  V3  V4  V5  MAJ_4
0   0   0   1   x   0
1   0   0   0   0   0
1   0   0   0   0   0
1   1   1   1   1   1
0   1   1   1   1   1


MAT_5 (columns V1 to V5) and MAJ_5 (last column)

V1  V2  V3  V4  V5  MAJ_5
x   0   0   1   x   0
x   0   0   0   0   0
x   0   0   0   0   0
x   1   1   1   1   1
x   1   1   1   1   1


Fault detection phase with processor P1

MAT from P1

0   0   0   1   x
0   0   0   1   x
0   0   0   0   x
0   1   1   1   x
x   1   1   0   x

MAT from P2

0   0   0   1   x
0   0   0   0   0
0   0   0   0   0
0   1   1   1   1
x   1   1   1   1

MAT from P3

0   0   0   1   x
0   0   0   0   0
0   0   0   0   0
0   1   1   1   1
x   1   1   1   1

MAT from P4

1   1   0   0   1
0   0   0   0   0
1   0   0   1   0
0   0   1   0   0
0   0   1   1   1

MAT from P5

x   x   x   x   x
x   x   x   x   x
x   x   x   x   x
x   x   x   x   x
x   x   x   x   x

FDMAT

0   0   0   1   x
0   0   0   0   0
0   0   0   0   0
0   1   1   1   1
x   1   1   1   1
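The FDMAT on this slide can be reproduced from the five matrices P1 holds. The sketch rests on two assumptions that are mine, not the paper's: a matrix that is entirely x never arrived and so does not vote, and the most frequent value in each position wins. The last step, which compares P1's own columns against FDMAT to label suspect links, is one plausible reading of how the faults are located and is not confirmed by the slides.

```python
from collections import Counter

# Matrices held by P1 in the example, one string per row ("x" = no value).
received = {
    "P1": ["0001x", "0001x", "0000x", "0111x", "x110x"],
    "P2": ["0001x", "00000", "00000", "01111", "x1111"],
    "P3": ["0001x", "00000", "00000", "01111", "x1111"],
    "P4": ["11001", "00000", "10010", "00100", "00111"],
    "P5": ["xxxxx"] * 5,
}
slide_fdmat = ["0001x", "00000", "00000", "01111", "x1111"]


def build_fdmat(mats):
    # Assumption: an all-"x" matrix never arrived, so it casts no vote.
    voters = [m for m in mats.values() if any(v != "x" for row in m for v in row)]
    size = len(voters[0])
    out = []
    for r in range(size):
        row = ""
        for c in range(size):
            counts = Counter(m[r][c] for m in voters).most_common()
            # Assumption: most frequent value wins; a tie is marked "?".
            tied = len(counts) > 1 and counts[0][1] == counts[1][1]
            row += "?" if tied else counts[0][0]
        out.append(row)
    return out


def suspect_links(own_mat, fdmat):
    """Label each column of P1's own matrix that disagrees with FDMAT."""
    suspects = {}
    for c in range(len(fdmat)):
        own_col = "".join(row[c] for row in own_mat)
        ref_col = "".join(row[c] for row in fdmat)
        if own_col != ref_col:
            kind = "dormant" if set(own_col) == {"x"} else "malicious"
            suspects[f"P1-P{c + 1}"] = kind
    return suspects


fdmat = build_fdmat(received)
print(fdmat == slide_fdmat)
print(suspect_links(received["P1"], fdmat))
```

Under these assumptions the computed matrix equals the FDMAT above, and the links flagged are P1-P4 (malicious) and P1-P5 (dormant), which is consistent with the corrupted and missing entries in the first-round vectors.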