© 2006 Andreas Haeberlen, MPI-SWS

The Case for Byzantine Fault Detection

Andreas Haeberlen

MPI-SWS / Rice University

Petr Kouznetsov MPI-SWS

Peter Druschel MPI-SWS


Challenge: Byzantine faults

- Distributed systems are subject to a variety of failures and attacks
  - Hacker break-in
  - Freeloading
  - Censorship
  - Data corruption
  - Software/hardware failure
- Byzantine failure model: Faulty nodes may exhibit arbitrary behavior
- Dependable systems must be protected against Byzantine faults


Existing approach: Fault tolerance

Byzantine fault tolerance (BFT) can mask a limited number of Byzantine faults

Example: Castro and Liskov [OSDI'99]

[Figure: client and server replicas]


Alternative approach: Fault detection

- Nodes monitor each other for faulty behavior
- When a fault occurs, the correct nodes
  - identify the faulty node(s)
  - distribute evidence of the fault
- Nodes can isolate the faulty node + initiate recovery

Byzantine Fault Detection


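[Figure: example with nodes A, B, C, D, and E, shown in three panels. Labels: "Set X=5", "OK", "X=?", "X=7", "E: X=5", "7!". One panel shows only A, C, D, and E, without B.]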


Best approach depends on the application

Typical application for Fault Tolerance: Air traffic control (figure label: Machine room)

- Failures may be fatal!
- Goal: Mask fault symptoms
- Delays negligible, bandwidth plentiful, few nodes

Typical application for Fault Detection: Inter-domain routing (figure labels: Level3, AT&T, Sprint)

- Best-effort service
- Goal: Find faulty components
- Wide-area delays, limited bandwidth, many nodes


Detection can provide accountability

- In an accountable system:
  - Actions are undeniable
  - State is tamper-evident
  - Correctness can be certified
- Good nodes can provide evidence that they are good
- Bad nodes cannot hide evidence of misbehavior
- Proven concept in society
  - Banking, administration ...
- Desirable for distributed systems [Yumerefendi05]
  - Example: Building trust in federated systems


What about performance?

- If up to f nodes can be faulty, we need f+1 replicas to guarantee detection (fault tolerance: 3f+1)
  - More throughput using the same resources
  - Works even when >33% of the nodes can become faulty
- Detection can defer overhead to periods of low load
  - System can deliver high peak throughput
- Detection does not require consensus
  - Potentially less expensive than BFT
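The replica counts on this slide can be checked with a few lines of arithmetic. The sketch below only encodes the two formulas from the slide (f+1 for detection, 3f+1 for fault tolerance); the function names are hypothetical and chosen for illustration.

```python
def replicas_for_detection(f: int) -> int:
    """Replicas needed to guarantee detection when up to f nodes are faulty (f + 1)."""
    return f + 1


def replicas_for_tolerance(f: int) -> int:
    """Replicas needed to mask up to f Byzantine faults (3f + 1)."""
    return 3 * f + 1


def faulty_fraction(f: int, replicas: int) -> float:
    """Fraction of the replica group that may be faulty."""
    return f / replicas


for f in (1, 2, 3):
    d = replicas_for_detection(f)
    t = replicas_for_tolerance(f)
    print(f"f={f}: detection needs {d} replicas, tolerance needs {t}; "
          f"faulty fraction {faulty_fraction(f, d):.2f} vs {faulty_fraction(f, t):.2f}")
```

With 3f+1 replicas the faulty fraction f/(3f+1) always stays below one third, whereas with f+1 replicas it is f/(f+1), which is at least one half. This is the sense in which detection still works when more than 33% of the nodes can become faulty.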


Outline

- Introduction
- BFD abstraction
- PeerReview algorithm
- Conclusion


How is BFD used?

- Each correct node has state machine + detector
- Detector can inspect all messages at its local node
- When detector observes a fault on another node,
  - it informs its local application, and
  - it provides evidence of the fault to other detectors

[Figure: a node with application, state machine, and detector, attached to the network. Labels: "Node X is faulty!", "?", "No assumptions about faulty nodes".]


Only observable faults can be detected

- Two classes of observable faults:
  - Detectable faultiness: Node breaks the protocol
  - Detectable ignorance: Node refuses to respond
- As long as the faulty node continues to follow the protocol, BFD cannot detect this!

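[Figure: three message diagrams between nodes A, B, and C. In each, "Set X=5" is answered with "OK" and is followed by "Get X".
Correct: the reply to "Get X" is 5.
Detectably ignorant: "Get X" gets no reply.
Detectably faulty: the reply to "Get X" is 7.]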


BFD can give strong guarantees

- Three types of detector output
  - Trusted, suspected, exposed
- Strong completeness
  - "No false negatives"
- Strong accuracy
  - "No false positives"
- Precise definitions are in the paper

[Figure: detector outputs Trusted, Suspected, Exposed]
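A minimal sketch of how a detector might turn one observation into one of the three outputs. The mapping used here (no reply leads to "suspected", a provably wrong reply leads to "exposed") is an assumption made for illustration, and the names `Verdict` and `judge` are hypothetical; the precise definitions are in the paper.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    TRUSTED = "trusted"
    SUSPECTED = "suspected"
    EXPOSED = "exposed"


def judge(expected_reply: str, observed_reply: Optional[str]) -> Verdict:
    """Map a single observed reply to a detector output.

    observed_reply is None when the node never answered.
    """
    if observed_reply is None:
        # Assumption: a missing reply is treated as detectable ignorance.
        return Verdict.SUSPECTED
    if observed_reply != expected_reply:
        # Assumption: a reply that contradicts the protocol is detectable faultiness.
        return Verdict.EXPOSED
    return Verdict.TRUSTED


# The "Get X" example after "Set X=5": the protocol expects the reply "5".
print(judge("5", "5").value)
print(judge("5", None).value)
print(judge("5", "7").value)
```

A node that keeps following the protocol always lands in the first case, which matches the earlier point that only observable faults can be detected.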


Outline

- Introduction
- BFD abstraction
- PeerReview algorithm
- Conclusion


Assumptions

1. Protocol can be modeled as a deterministic state machine

2. Each node has a strong identity, as well as a public/private keypair for signing messages

3. The faulty nodes cannot prevent two correct nodes from communicating, or break the cryptographic keys


Secure logging

- All messages are signed and acknowledged
- Each node keeps a log of all local inputs and outputs
- Nodes must commit to the contents of their log
  - Log is tamper-evident [Maniatis02]

[Figure: nodes A, B, and C with log excerpts]

Log at A: Snd(B, "Set X=5"), Rcv(B, "Okay")
B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(C, "Get X"), Send(C, "5")
Log at C: Snd(B, "Get X"), Rcv(B, "5")
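One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous hash. The sketch below assumes that construction; the slide only cites [Maniatis02] and does not spell out the mechanism. Message signing is omitted, and `HashChainedLog`, `entry_hash`, `verify`, and `GENESIS` are hypothetical names.

```python
import hashlib
import json
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload), e.g. ("Rcv", "A", "Set X=5")

GENESIS = "0" * 64  # assumed placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, entry: Entry) -> str:
    """Hash an entry together with the hash of everything before it."""
    kind, peer, payload = entry
    data = json.dumps([prev_hash, kind, peer, payload]).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class HashChainedLog:
    def __init__(self) -> None:
        self.entries: List[Entry] = []
        self.hashes: List[str] = []

    def append(self, kind: str, peer: str, payload: str) -> str:
        """Append an entry and return the new top hash (the node's commitment)."""
        prev = self.hashes[-1] if self.hashes else GENESIS
        entry = (kind, peer, payload)
        top = entry_hash(prev, entry)
        self.entries.append(entry)
        self.hashes.append(top)
        return top


def verify(entries: List[Entry], commitment: str) -> bool:
    """Recompute the chain and check that it ends in the committed hash."""
    top = GENESIS
    for entry in entries:
        top = entry_hash(top, entry)
    return top == commitment


# B's log from the slide.
log = HashChainedLog()
log.append("Rcv", "A", "Set X=5")
log.append("Send", "A", "Okay")
log.append("Rcv", "C", "Get X")
commitment = log.append("Send", "C", "5")

tampered = list(log.entries)
tampered[3] = ("Send", "C", "7")
print(verify(log.entries, commitment), verify(tampered, commitment))
```

Once a node has handed out a commitment, it cannot later present a log with a changed entry without the recomputed chain failing to match.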


Detecting ignorance

- If a node refuses to acknowledge a message
  - Send message as evidence to other nodes
  - Correct nodes will challenge the ignorant node to prove that its log contains a 'Rcv' entry for that message
  - A correct node can always respond

[Figure: nodes A, B, and C]

B's log: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(C, "Get X")
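The challenge step can be sketched as follows. This is a simplified illustration, not the PeerReview protocol itself: it assumes the challenge carries the unacknowledged message, that a correct node answers by logging it and showing its log, and that a missing answer (in practice a timeout) is modelled as `None`. All names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload)
Responder = Callable[[Entry], Optional[List[Entry]]]


def challenge(respond: Responder, sender: str, message: str) -> bool:
    """Challenge a node to show a log containing Rcv(sender, message).

    Returns True if the challenge was answered satisfactorily.
    """
    wanted: Entry = ("Rcv", sender, message)
    shown_log = respond(wanted)
    return shown_log is not None and wanted in shown_log


def make_correct_node(log: List[Entry]) -> Responder:
    """A correct node accepts the forwarded message if needed, then shows its log."""
    def respond(wanted: Entry) -> Optional[List[Entry]]:
        if wanted not in log:
            log.append(wanted)
        return list(log)
    return respond


def ignorant_node(wanted: Entry) -> Optional[List[Entry]]:
    """A detectably ignorant node never answers."""
    return None


b_log: List[Entry] = [("Rcv", "A", "Set X=5"), ("Send", "A", "Okay")]
print(challenge(make_correct_node(b_log), "C", "Get X"))
print(challenge(ignorant_node, "C", "Get X"))
```

Because a correct node can always pass the challenge by accepting the message, only a node that actually refuses to respond stays unanswered.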


Detecting faultiness

- Nodes can audit each other's log at any time
  - Auditors replay input in the log, compare output
- If a divergence is detected
  - Send log as evidence to other nodes
  - Other nodes can repeat the same procedure to check whether the node is really faulty (no he-said-she-said!)

[Figure: nodes A, B, and C; the log of B is replayed on B', the state machine B is expected to run. Further label: Snapshots.]

Log of B: Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "7")
Replay on B': Rcv(A, "Set X=5"), Send(A, "Okay"), Rcv(B, "Get X"), Send(B, "5")
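An audit by replay can be sketched with a toy key-value state machine. The sketch assumes a protocol in which every input produces exactly one reply to its sender, logged directly after the input; `KeyValueMachine` and `audit` are hypothetical names, snapshots are left out, and the requester of "Get X" is written as C, following the secure-logging slide.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, str, str]  # (kind, peer, payload)


class KeyValueMachine:
    """Deterministic reference state machine the audited node is expected to run."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def handle(self, message: str) -> str:
        if message.startswith("Set "):
            key, value = message[4:].split("=", 1)
            self.store[key] = value
            return "Okay"
        if message.startswith("Get "):
            return self.store.get(message[4:], "")
        return "Error"


def audit(log: List[Entry]) -> Optional[int]:
    """Replay the logged inputs and compare the logged outputs.

    Returns the index of the first divergent entry, or None if the log is consistent.
    """
    machine = KeyValueMachine()
    pending: Optional[Tuple[str, str]] = None  # (peer, expected reply)
    for index, (kind, peer, payload) in enumerate(log):
        if kind == "Rcv":
            if pending is not None:
                return index  # a new input was logged before the previous reply
            pending = (peer, machine.handle(payload))
        elif kind == "Send":
            if pending != (peer, payload):
                return index  # logged output differs from the replayed output
            pending = None
    return None


faulty_log: List[Entry] = [
    ("Rcv", "A", "Set X=5"), ("Send", "A", "Okay"),
    ("Rcv", "C", "Get X"), ("Send", "C", "7"),
]
print(audit(faulty_log))
```

Since the state machine is deterministic, any other node that receives the same log can run the same replay and reach the same result, so the log itself serves as evidence.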


Summary

- New approach: Byzantine Fault Detection
  - Alternative to fault tolerance
  - Provides accountability
- Fault Detection can give strong guarantees
  - Eventual strong accuracy and completeness
- Early results indicate Fault Detection is practical
  - Example: PeerReview algorithm

Thank you!