UPV / EHU

Distributed Algorithms for Failure Detection and Consensus in Crash, Crash-Recovery and Omission Environments

Mikel Larrea

Distributed Systems Group
University of the Basque Country, UPV/EHU


Mikel Larrea − Mannheim, May 2011

Context and Seminal Papers

• In the Consensus problem, all correct processes propose a value and must reach a unanimous and irrevocable decision on some proposed value

• [FLP85] M. Fischer, N. Lynch, M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985

• [CT96] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996

• [CHT96] T. Chandra, V. Hadzilacos, S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 1996
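
The Consensus definition on this slide can be read as a set of checkable properties. The following is a minimal sketch, not taken from the talk: the function name `check_consensus` and the trace format (a dict of proposals and a dict of per-process decision lists) are assumptions made for illustration only.

```python
def check_consensus(proposals, decisions):
    """Check one finished run against the Consensus properties.

    proposals: dict pid -> proposed value (hypothetical trace format)
    decisions: dict pid -> list of values decided by that correct
               process, in the order it decided them
    """
    first = [vals[0] for vals in decisions.values() if vals]
    return {
        # every correct process eventually decides
        "termination": all(len(v) >= 1 for v in decisions.values()),
        # a process never changes its decision
        "irrevocable": all(len(set(v)) <= 1 for v in decisions.values()),
        # unanimous: no two correct processes decide differently
        "agreement": len(set(first)) <= 1,
        # the decision is one of the proposed values
        "validity": all(d in proposals.values() for d in first),
    }


if __name__ == "__main__":
    print(check_consensus({1: "a", 2: "b", 3: "a"}, {1: ["a"], 2: ["a"]}))
```

A run in which two correct processes decide "a" and "b" respectively would still satisfy validity but would fail agreement.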

Motivation

Motivation++

(Zurich, July 2010)

Crash Failure Detectors [CT96]

Strengthening Completeness

Guest Stars: P and Omega

• P (the Eventually Perfect class of [CT96]): strong completeness, eventual strong accuracy
– Eventually every process that crashes is permanently suspected by every correct process
– There is a time after which correct processes are not suspected by any correct process

• Omega satisfies the following property:
– There is a time after which all the correct processes always trust the same correct process

• What is a correct process?
– It depends on the failure model :-)
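
The two properties of P and the Omega property above can be illustrated with a small heartbeat-based sketch for the crash model. This is not one of the algorithms presented in the talk: the class name, the doubling of the timeout after a false suspicion, and the "trust the smallest non-suspected identifier" rule for deriving a leader are assumptions chosen for illustration.

```python
class EventuallyPerfectDetector:
    """Heartbeat-based sketch of P at one process (crash model only)."""

    def __init__(self, me, peers, initial_timeout=1.0):
        self.me = me
        # One adaptive timeout per monitored process (assumed policy).
        self.timeout = {q: initial_timeout for q in peers if q != me}
        self.last_heard = {q: 0.0 for q in self.timeout}
        self.suspected = set()

    def on_heartbeat(self, q, now):
        self.last_heard[q] = now
        if q in self.suspected:
            # False suspicion: retract it and be more patient next time,
            # so that correct processes are eventually never suspected.
            self.suspected.discard(q)
            self.timeout[q] *= 2

    def tick(self, now):
        # A crashed process stops sending heartbeats, so it is
        # eventually suspected and never trusted again.
        for q, heard in self.last_heard.items():
            if q not in self.suspected and now - heard > self.timeout[q]:
                self.suspected.add(q)

    def leader(self):
        # Omega-style output: smallest identifier not currently suspected.
        trusted = [q for q in self.timeout if q not in self.suspected]
        return min(trusted + [self.me])


if __name__ == "__main__":
    d = EventuallyPerfectDetector(me=2, peers=[1, 2, 3])
    d.on_heartbeat(1, 0.5)
    d.on_heartbeat(3, 1.5)
    d.tick(2.0)
    print(d.suspected, d.leader())
```

Once the suspicion sets of all correct processes stabilize on exactly the crashed processes, every correct process computes the same minimum, which is how a leader is obtained from P in this sketch.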

FD-based Consensus

Fault-tolerant Architecture

Outline

• Part I: Crash Environments
– (Near-) Communication-efficient algorithms for P
– Communication-optimal algorithms for P

• Part II: Crash-Recovery Environments
– Implementing Omega with/without stable storage
– Communication-efficient algorithms for Omega
– From Omega to P
– Fault-tolerant aggregator election and data aggregation in wireless sensor networks

• Part III: Omission Environments
– Secure failure detection and consensus in TrustedPals
– Communication-efficient algorithm for P

Part I: P in Crash Environments

Joint work with Roberto Cortiñas, Alberto Lafuente, Iratxe Soraluze, Joachim Wieland

The First P Algorithm [CT96]

Part I. Summary of Results

• Efficient implementations of P
– Nearly communication-efficient algorithms (n+C links are used forever)
  • Q-based, transformations
– Communication-efficient algorithms (n links)
  • Pure ring-based, optimizations

• Optimal implementations of P
– Communication-optimal algorithms (C links)
  • RBcast-based, one-to-one, one-to-all

Reliable Broadcast [CT96]

“All correct processes deliver the same set of messages”
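
The property quoted on this slide can be illustrated with the message-diffusion idea commonly associated with [CT96]: a process that receives a message for the first time relays it to everyone before delivering it. The simulation below is a sketch under assumptions: the function name `simulate_rbcast` and the `send_budget` fault-injection parameter (how many sends a process completes before crashing) are hypothetical, and links are modeled as a reliable FIFO queue.

```python
from collections import deque


def simulate_rbcast(n, sender, msg, send_budget=None):
    """Simulate one reliable broadcast among processes 0..n-1.

    send_budget: dict pid -> number of sends that pid completes before
                 it crashes (hypothetical fault injection).
    Returns dict pid -> delivered messages, for non-crashed processes.
    """
    send_budget = dict(send_budget or {})
    crashed = set()
    delivered = {p: [] for p in range(n)}
    seen = {p: set() for p in range(n)}
    network = deque()  # pending (destination, message) pairs

    def send_to_all(p):
        for q in range(n):
            if p in send_budget:
                if send_budget[p] == 0:
                    crashed.add(p)  # p crashes in the middle of its sends
                    return
                send_budget[p] -= 1
            network.append((q, msg))

    send_to_all(sender)  # R-broadcast: send to all, including itself
    while network:
        q, m = network.popleft()
        if q in crashed or m in seen[q]:
            continue
        seen[q].add(m)
        if q != sender:
            send_to_all(q)  # relay first, so a later crash cannot hide m
        if q not in crashed:
            delivered[q].append(m)
    return {p: delivered[p] for p in range(n) if p not in crashed}


if __name__ == "__main__":
    # The sender crashes after reaching only processes 0 and 1.
    print(simulate_rbcast(4, 0, "m", send_budget={0: 2}))
```

In this run the sender crashes after two sends, yet process 1 relays the message, so processes 1, 2 and 3 all deliver it. If the sender reaches no other process before crashing, no correct process delivers the message, which is also consistent with the property.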

P in Crash Environments

• [WLL07] J. Wieland, M. Larrea, A. Lafuente. An evaluation of ring-based algorithms for the Eventually Perfect failure detector class. 15th International Conference on Parallel, Distributed and Network-based Processing, 2007

• [LSCL08] M. Larrea, I. Soraluze, R. Cortiñas, A. Lafuente. An Evaluation of Communication-Optimal P Algorithms. 16th International Conference on Parallel, Distributed and Network-based Processing, 2008

Part II: Omega in Crash-Recovery Environments

Joint work with José Javier Astrain, Ernesto Jiménez, Cristian Martín, Iratxe Soraluze

Part II. Summary of Results

• Redefinition of Omega
– Take into account unstable processes
– Take into account the availability of stable storage

• Implementation of Omega
– With and without stable storage
– Efficient algorithms

• From Omega to P

• Fault-tolerant aggregator election and data aggregation in wireless sensor networks

From Omega to P

Part III: P in Omission Environments

Joint work with Roberto Cortiñas, Felix Freiling, Marjan Ghajar-Azadanlou, Alberto Lafuente, Lucia Penso, Iratxe Soraluze

Part III. Summary of Results

• Reduction from Byzantine to omission
– Processes are equipped with tamper-proof security modules (e.g., smartcards)

• Actually, omission + buffering/timing attacks

• Omission models
– send | receive | general
– permanent | transient
– non-selective | selective

Part III. Summary of Results

• Impossibility result
– P is impossible to implement in the (transient) general omission model

• Redefinition and implementation of P
– In-connected and out-connected processes
– All-to-all communication, sequence numbers, connectivity matrix

• P-based Consensus
– Termination: every in-connected process eventually decides
– Adaptation of Chandra-Toueg’s algorithm


Thank you!

[email protected]
