“Revisiting Fault Diagnosis Agreement in a New Territory,” S. C. Wang and K. Q. Yan, Operating Systems Review, April 2004, pp. 41–61.
An extension of the Byzantine Generals algorithm – and hot off the press
Agreement Problem
In the Byzantine Generals problem, a commanding general issues an “order” and all loyal lieutenant generals must agree on the same order.
A related subproblem is the consensus problem – each processor, which has its own initial value, has to communicate with all other processors to reach a common value among the healthy processors.
Consensus constraints
All the healthy processors agree on the common value (Consensus)
If there exists a common initial value v_i among ALL the processors, then all the healthy processors must agree on v_i
Most protocols for solving Byzantine Agreement or consensus are fault-masking protocols: they reach consensus without the faults affecting the outcome.
Fault Diagnosis Agreement (FDA)
Goal is to make each healthy processor able to detect and locate the faulty components in the distributed system
All the healthy processors identify a common set of faulty components in the process of reaching consensus (Agreement)
No healthy component is falsely detected as faulty by any healthy processor (Fairness)
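The Agreement and Fairness conditions can be read as two checks on the fault sets that the healthy processors report. The sketch below is only an illustration: the function name `check_fda`, the representation of components as link tuples, and the sample data are assumptions, not taken from the paper.

```python
def check_fda(detected, truly_faulty):
    """Check the two FDA conditions.

    detected maps each healthy processor to the set of components it
    reports as faulty; truly_faulty is the set of components that are
    actually faulty (known only to the test harness, hypothetical here).
    """
    reports = [frozenset(s) for s in detected.values()]
    # Agreement: every healthy processor reports the same set.
    agreement = len(set(reports)) <= 1
    # Fairness: nobody reports a component that is in fact healthy.
    fairness = all(r <= frozenset(truly_faulty) for r in reports)
    return agreement, fairness


# Hypothetical data: two faulty links, written as (processor, processor) pairs.
faulty = {("P1", "P4"), ("P1", "P5")}
good_run = {p: set(faulty) for p in ["P1", "P2", "P3", "P4", "P5"]}
bad_run = dict(good_run, P2=faulty | {("P2", "P3")})

print(check_fda(good_run, faulty))  # (True, True)
print(check_fda(bad_run, faulty))   # (False, False)
```

In the second run P2 accuses a healthy link, which breaks Fairness and, because its set now differs from the others, Agreement as well.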
Paper assumes dual failure mode on the network
Most previous papers assume that the faulty components are processors only and that the network is fault-free. Here we assume that the processors are fault-free and that the network may be faulty.
Most other papers also assume that faults are malicious only. Here we assume dual failure: malicious faults (a random value is sent) and dormant faults (no value is sent because of a crash, or a stuck-at value is sent).
Assume that a healthy processor can detect components with dormant faults.
Assumptions
A synchronous distributed system whose processors are reliable during the protocol execution
Faults such as a crash, a stuck-at value, noise, or an intruder may interfere with message transmission
n-processor fully connected network with m malicious faults and d dormant faults, where m <= ceiling[(n-d-3)/2]
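As a quick arithmetic check of the bound as written on the slide, the sketch below evaluates ceiling[(n-d-3)/2] for a given n and d. The function names are illustrative only.

```python
import math


def max_malicious(n, d):
    """Largest m that satisfies m <= ceiling[(n - d - 3) / 2]."""
    return math.ceil((n - d - 3) / 2)


def within_bound(n, m, d):
    """True if m malicious and d dormant faults fit the stated bound."""
    return 0 <= m <= max_malicious(n, d)


print(max_malicious(5, 1))    # 1
print(within_bound(5, 1, 1))  # True
print(within_bound(5, 2, 1))  # False
```

With n = 5 and d = 1 the bound allows one malicious fault, which matches the five-processor example used later in these slides.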
Dual Fault Detection Consensus (DFDC) Algorithm
Three phases:
Message exchange phase
Decision making phase
Fault detection phase
The message exchange phase and the decision making phase are similar to OM(1) in the Byzantine Generals paper. They produce a matrix of information at each processor, MAT_i, which is used to construct a majority vector, MAJ_i.
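A minimal sketch of how MAJ_i could be derived from MAT_i, using the MAT_1 values from the worked example in these slides. It assumes that each row is voted on separately and that dormant entries (`x`) are skipped before voting. That skipping rule is an assumption that reproduces the MAJ_1 column of the example; it is not a rule stated in the transcript.

```python
from collections import Counter

DORMANT = "x"  # marker the slides use for a value lost to a dormant fault


def majority_vector(mat):
    """Row-wise majority of a MAT_i, skipping dormant entries (assumption)."""
    maj = []
    for row in mat:
        votes = Counter(v for v in row if v != DORMANT)
        if not votes:
            maj.append(DORMANT)  # nothing usable was received for this row
            continue
        value, count = votes.most_common(1)[0]
        # Require a strict majority of the usable entries; None marks a tie.
        maj.append(value if count * 2 > sum(votes.values()) else None)
    return maj


MAT_1 = [
    [0, 0, 0, 1, "x"],
    [0, 0, 0, 1, "x"],
    [0, 0, 0, 0, "x"],
    [0, 1, 1, 1, "x"],
    ["x", 1, 1, 0, "x"],
]

print(majority_vector(MAT_1))  # [0, 0, 0, 1, 1]
```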
Fault detection phase
Each processor sends its MAT_i to every other processor. Each healthy processor i uses the MATs it receives to find the faults:
Take the majority value in each position of the matrices to get FDMAT_i
If no majority exists for the (i,j)th position, use the negative value of the (i,j)th position of the MAT_j that was sent
[Figure: five processors with initial values P1=0, P2=0, P3=0, P4=1, P5=1; legend: dormant faulty, malicious faulty]
Initial value
V1 0
V2 0
V3 0
V4 1
V5 1
Vectors received after the first round
V1 V2 V3 V4 V5
0  0  0  1  x
0  0  0  0  0
0  0  0  0  0
0  1  1  1  1
x  1  1  1  1
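The table of received vectors can be reproduced with a small simulation of the first round. This sketch rests on several assumptions that are inferred from the table rather than stated in the transcript: the row is the sender and the column V_i is the receiver, the link P1–P5 is the dormant one, and the link P1–P4 is the malicious one. The malicious link is modeled as a bit flip so the output is deterministic, whereas the slides describe a malicious fault as sending a random value.

```python
DORMANT = "x"


def first_round(values, dormant_links=(), malicious_links=()):
    """Return table[j][i]: what processor i+1 receives from processor j+1."""
    n = len(values)
    dormant = {frozenset(link) for link in dormant_links}
    malicious = {frozenset(link) for link in malicious_links}
    table = []
    for sender in range(1, n + 1):          # one row per sender
        row = []
        for receiver in range(1, n + 1):    # one column per vector V_receiver
            link = frozenset((sender, receiver))
            if link in dormant:
                row.append(DORMANT)                 # nothing usable arrives
            elif link in malicious:
                row.append(1 - values[sender - 1])  # assumption: bit flip
            else:
                row.append(values[sender - 1])
        table.append(row)
    return table


table = first_round([0, 0, 0, 1, 1],
                    dormant_links=[(1, 5)],
                    malicious_links=[(1, 4)])
for row in table:
    print(*row)
```

Under these assumptions the printed rows match the table above line for line.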
MAT_1 and MAJ_1 (the last column is MAJ_1)
0 0 0 1 x | 0
0 0 0 1 x | 0
0 0 0 0 x | 0
0 1 1 1 x | 1
x 1 1 0 x | 1
MAT_2,3 and MAJ_2,3 (the last column is MAJ_2,3)
0 0 0 1 x | 0
0 0 0 0 0 | 0
0 0 0 0 0 | 0
0 1 1 1 1 | 1
x 1 1 1 1 | 1
MAT_4 and MAJ_4 (the last column is MAJ_4)
0 0 0 1 x | 0
1 0 0 0 0 | 0
1 0 0 0 0 | 0
1 1 1 1 1 | 1
0 1 1 1 1 | 1
MAT_5 and MAJ_5 (the last column is MAJ_5)
x 0 0 1 x | 0
x 0 0 0 0 | 0
x 0 0 0 0 | 0
x 1 1 1 1 | 1
x 1 1 1 1 | 1
Fault detection phase with processor P1
MAT from P1
0 0 0 1 x
0 0 0 1 x
0 0 0 0 x
0 1 1 1 x
x 1 1 0 x
MAT from P2
0 0 0 1 x
0 0 0 0 0
0 0 0 0 0
0 1 1 1 1
x 1 1 1 1
MAT from P3
0 0 0 1 x
0 0 0 0 0
0 0 0 0 0
0 1 1 1 1
x 1 1 1 1
MAT from P4
1 1 0 0 1
0 0 0 0 0
1 0 0 1 0
0 0 1 0 0
0 0 1 1 1
MAT from P5
x x x x x
x x x x x
x x x x x
x x x x x
x x x x x
FDMAT
0 0 0 1 x
0 0 0 0 0
0 0 0 0 0
0 1 1 1 1
x 1 1 1 1
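The position-by-position vote at P1 can be checked with a short script over the five matrices listed for this phase. The voting rule used here is an assumption chosen because it reproduces the FDMAT shown: a position is `x` when more than half of the received entries are `x`, and otherwise it takes the strict majority of the non-`x` entries. The slides' fallback for positions with no majority is not modeled, because no such position occurs in this example; the sketch returns `None` in that case.

```python
from collections import Counter

DORMANT = "x"


def vote(entries):
    """Vote on one matrix position (rule is an assumption, see lead-in)."""
    dormant = sum(1 for v in entries if v == DORMANT)
    if dormant * 2 > len(entries):
        return DORMANT
    votes = Counter(v for v in entries if v != DORMANT)
    value, count = votes.most_common(1)[0]
    return value if count * 2 > sum(votes.values()) else None


def fdmat(mats):
    """Combine the received MATs position by position."""
    n = len(mats[0])
    return [[vote([m[r][c] for m in mats]) for c in range(n)]
            for r in range(n)]


x = DORMANT
FROM_P1 = [[0, 0, 0, 1, x], [0, 0, 0, 1, x], [0, 0, 0, 0, x],
           [0, 1, 1, 1, x], [x, 1, 1, 0, x]]
FROM_P2 = [[0, 0, 0, 1, x], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0],
           [0, 1, 1, 1, 1], [x, 1, 1, 1, 1]]
FROM_P3 = [row[:] for row in FROM_P2]   # identical to P2 in the example
FROM_P4 = [[1, 1, 0, 0, 1], [0, 0, 0, 0, 0], [1, 0, 0, 1, 0],
           [0, 0, 1, 0, 0], [0, 0, 1, 1, 1]]
FROM_P5 = [[x] * 5 for _ in range(5)]   # nothing arrives from P5

result = fdmat([FROM_P1, FROM_P2, FROM_P3, FROM_P4, FROM_P5])
for row in result:
    print(*row)
```

The result equals the FDMAT above, which in this example is also the matrix sent by P2 and P3.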