
COMP 655: Distributed/Operating Systems
Summer 2011
Dr. Chunbo Chu
Week 7: Fault Tolerance

04/20/23 Distributed Systems - COMP 655

Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit

Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing


Fault tolerance concepts
• Availability – can I use it now?
  – Usually quantified as a percentage
• Reliability – can I use it for a certain period of time?
  – Usually quantified as MTBF (mean time between failures)
• Safety – will anything really bad happen if it does fail?
• Maintainability – how hard is it to fix when it fails?
  – Usually quantified as MTTR (mean time to repair)


Comparing nines
• 1 year = 8760 hr
• Availability levels
  – 90% = 876 hr downtime/yr
  – 99% = 87.6 hr downtime/yr
  – 99.9% = 8.76 hr downtime/yr
  – 99.99% = 52.56 min downtime/yr
  – 99.999% = 5.256 min downtime/yr
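The downtime figures above follow directly from the 8760-hour year. A minimal sketch of the arithmetic; the function name `downtime_hours_per_year` is illustrative, not from the slides:

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hr, as on the slide


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for an availability given as a fraction (0.99 = 99%)."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for level in (0.90, 0.99, 0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(level)
        print(f"{level:.3%}: {hours:.4f} hr = {hours * 60:.3f} min per year")
```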


Exercise: how to get five nines

1. Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider:
  – Hardware failures, especially disks
  – Power failures
  – Network outages
  – Software installation
  – What else?
2. Come up with some ideas about how to solve the problems you identify


Multiple machines at 99%

Assuming independent failures
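Under the independence assumption, the availability of a combination of machines can be computed by multiplying probabilities. The sketch below assumes two textbook arrangements that the slide does not spell out: a system that needs every machine up, and a system that needs at least one machine up. Function names are illustrative.

```python
def all_must_work(availability: float, n: int) -> float:
    """System is up only if all n independent machines are up."""
    return availability ** n


def any_one_suffices(availability: float, n: int) -> float:
    """System is up if at least one of n independent machines is up."""
    return 1.0 - (1.0 - availability) ** n


if __name__ == "__main__":
    for n in (1, 2, 10, 100, 1000):
        print(n, all_must_work(0.99, n), any_one_suffices(0.99, n))
```

With 99% machines, requiring all of 1,000 components leaves the system almost never available, while two redundant machines already reach 99.99%.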


1,000 components


Things to watch out for in availability requirements

• What constitutes an outage …
  – A client PC going down?
  – A client applet going into an infinite loop?
  – A server crashing?
  – A network outage?
  – Reports unavailable?
  – If a transaction times out?
  – If 100 transactions time out in a 10 min period?
  – etc.


More to watch out for
• What constitutes being back up after an outage?
• When does an outage start?
• When does it end?
• Are there outages that don’t count?
  – Natural disasters?
  – Outages due to operator errors?
• What about MTBF?


Ways to get 99% availability

1. MTBF = 99 hr, MTTR = 1 hr
2. MTBF = 99 min, MTTR = 1 min
3. MTBF = 99 sec, MTTR = 1 sec
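All three combinations give the same availability, computed as MTBF / (MTBF + MTTR), yet they fail at very different rates. A small sketch, with hypothetical helper names, that makes the difference visible:

```python
SECONDS_PER_DAY = 86_400


def availability(mtbf_s: float, mttr_s: float) -> float:
    """Fraction of time the system is up."""
    return mtbf_s / (mtbf_s + mttr_s)


def failures_per_day(mtbf_s: float, mttr_s: float) -> float:
    """Average number of fail/repair cycles in one day."""
    return SECONDS_PER_DAY / (mtbf_s + mttr_s)


if __name__ == "__main__":
    cases = {"hours": (99 * 3600, 3600), "minutes": (99 * 60, 60), "seconds": (99, 1)}
    for name, (mtbf, mttr) in cases.items():
        print(name, availability(mtbf, mttr), failures_per_day(mtbf, mttr))
```

This is why an availability requirement alone says little about how often users see an outage.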


More definitions

• A fault causes an error
• An error may cause a failure

Fault tolerance is continuing to work correctly in the presence of faults.

Types of faults:
• transient
• intermittent
• permanent


Types of failures


If you remember one thing
• Components fail in distributed systems on a regular basis.
• Distributed systems have to be designed to deal with the failure of individual components so that the system as a whole
  – Is available and/or
  – Is reliable and/or
  – Is safe and/or
  – Is maintainable
  depending on the problem it is trying to solve and the resources available …


Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit


Two-army problem
• Red army has 5,000 troops
• Blue army and White army have 3,000 troops each
• Attack together and win
• Attack separately and lose in serial
• Communication is by messenger, who might be captured
• Blue and White generals have no way to know when a messenger is captured


Activity: outsmart the generals

• Take your best shot at designing a protocol that can solve the two-army problem
• Spend ten minutes
• Did you think of anything promising?


Conclusion: go home
• “agreement between even two processes is not possible in the face of unreliable communication”


Byzantine generals
• Assume perfect communication
• Assume n generals, m of whom should not be trusted
• The problem is to reach agreement on troop strength among the non-faulty generals


Byzantine generals - example

n = 4, m = 1 (units are K-troops)

(a) Multicast troop-strength messages
(b) Construct troop-strength vectors
(c) Compare notes: majority rules in each component

Result: 1, 2, and 4 agree on (1, 2, unknown, 4)
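Steps (a) to (c) can be sketched as a small simulation. This is an illustration under stated assumptions, not the slide's own code: general 3 is taken to be the faulty one, a faulty general sends a different made-up value in every message (modelled by a counter starting at 100), and the function name `byzantine_round` is hypothetical.

```python
from collections import Counter
from itertools import count

UNKNOWN = None  # marks a component with no majority


def byzantine_round(strengths, faulty):
    """strengths: general id -> true troop strength (in K-troops, all below 100).
    faulty: set of ids that lie. Returns each loyal general's decided vector."""
    ids = sorted(strengths)
    lies = count(100)  # assumption: every lie is a fresh, distinct value

    # (a) + (b): each general builds a vector from what it was told.
    vectors = {
        g: tuple(next(lies) if s in faulty and s != g else strengths[s] for s in ids)
        for g in ids
    }

    results = {}
    for g in ids:
        if g in faulty:
            continue
        # (c): g receives a vector from every other general; a faulty one sends garbage.
        received = [
            tuple(next(lies) for _ in ids) if s in faulty else vectors[s]
            for s in ids
            if s != g
        ]
        decided = []
        for i in range(len(ids)):
            value, votes = Counter(v[i] for v in received).most_common(1)[0]
            decided.append(value if votes > len(received) / 2 else UNKNOWN)
        results[g] = tuple(decided)
    return results


if __name__ == "__main__":
    print(byzantine_round({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3}))
    print(byzantine_round({1: 1, 2: 2, 3: 3}, faulty={3}))
```

With four generals the loyal ones agree on every component except the faulty general's. With three generals and one faulty, each loyal general sees only two vectors per component, so no value reaches a majority.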


Doesn’t work with n=3, m=1


Fault Tolerance
• Fault tolerance concepts
• Implementation – distributed agreement
• Distributed agreement meets transaction processing: 2- and 3-phase commit


Distributed commit protocols

• What is the problem they are trying to solve?
  – Ensure that a group of processes all do something, or none of them do
  – Example: in a distributed transaction that involves updates to data on three different servers, ensure that all three commit or none of them do


2-phase commit

[Figure: coordinator and participant]

What to do when P, in READY state, contacts Q


If coordinator crashes
• Participants could wait until the coordinator recovers
• Or, they could try to figure out what to do among themselves
  – Example: if P contacts Q, and Q is in the COMMIT state, P should COMMIT as well


2-phase commit
What to do when P, in READY state, contacts Q

If all surviving participants are in READY state,
1. Wait for coordinator to recover
2. Elect a new coordinator (?)
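The rule P applies when it contacts other participants can be written as a lookup. The COMMIT and all-READY cases come from the slides; the INIT and ABORT rows follow the usual textbook treatment of 2PC and are an assumption here, as are the function names.

```python
# What P (in READY state) does, given the state it learns from Q.
ACTION_FOR_P = {
    "COMMIT": "COMMIT",           # a global commit was decided
    "ABORT": "ABORT",             # a global abort was decided
    "INIT": "ABORT",              # Q never voted, so commit cannot have been decided
    "READY": "CONTACT_ANOTHER",   # Q knows no more than P does
}


def resolve(peer_states) -> str:
    """Ask peers one by one; block if every surviving peer is also READY."""
    for state in peer_states:
        action = ACTION_FOR_P[state]
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"  # wait for the coordinator to recover, or elect a new one


if __name__ == "__main__":
    print(resolve(["READY", "COMMIT"]))
    print(resolve(["READY", "READY"]))
```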


3-phase commit
• Problem addressed:
  – Non-blocking distributed commit in the presence of failures
  – Interesting theoretically, but rarely used in practice


3-phase commit

[Figure: coordinator and participant]


Bonus material
• Implementation – reliable point-to-point communication
• Implementation – process groups
• Implementation – reliable multicast
• Recovery
• Sparing


RPC, RMI crash & omission failures

• Client can’t locate server
• Request lost
• Server crashes after receipt of request
• Response lost
• Client crashes after sending request


Can’t locate server
• Raise an exception, or
• Send a signal, or
• Log an error and return an error code

Note: hard to mask distribution in this case


Request lost
• Timeout and retry
• Back off to “cannot locate server” if too many timeouts occur
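A minimal sketch of timeout-and-retry with the fallback described above. `send_request` stands in for whatever RPC stub is in use and is assumed to raise Python's built-in `TimeoutError` when no reply arrives; `CannotLocateServer` and the parameter defaults are hypothetical.

```python
class CannotLocateServer(Exception):
    """Raised once the client gives up retrying."""


def call_with_retry(send_request, max_attempts: int = 3, timeout_s: float = 1.0):
    """Call send_request(timeout_s), retrying on timeout up to max_attempts times."""
    for _ in range(max_attempts):
        try:
            return send_request(timeout_s)
        except TimeoutError:
            continue  # request or reply was lost: try again
    raise CannotLocateServer(f"no reply after {max_attempts} attempts")
```

Retrying is only safe if repeating the request is harmless or the server can detect duplicates.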


Server crashes after receipt of request

• Possible semantic commitments
  – Exactly once
  – At least once
  – At most once

[Figure: three cases: normal; crash after the work is done; crash before the work is done]


Behavioral possibilities
• Server events
  – Process (P)
  – Send completion message (M)
  – Crash (C)
• Server order
  – P then M
  – M then P
• Client strategies
  – Retry every message
  – Retry no messages
  – Retry if unacknowledged
  – Retry if acknowledged


Combining the options
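The combinations can be enumerated mechanically. The sketch below assumes that a retried request is processed exactly once by the recovered server and that the crash can happen before, between, or after the two server events; the names and the OK / DUP / ZERO labels are illustrative.

```python
from itertools import product

SERVER_ORDERS = {"M then P": ("M", "P"), "P then M": ("P", "M")}

# Each strategy decides whether to retry, given whether an acknowledgement (M) was seen.
CLIENT_STRATEGIES = {
    "retry every message": lambda acked: True,
    "retry no messages": lambda acked: False,
    "retry if acknowledged": lambda acked: acked,
    "retry if unacknowledged": lambda acked: not acked,
}


def outcome(order, events_before_crash, retries) -> str:
    """How many times the work was done: ZERO, OK (once), or DUP (twice)."""
    done = order[:events_before_crash]
    acked = "M" in done
    times_processed = int("P" in done) + int(retries(acked))
    return {0: "ZERO", 1: "OK", 2: "DUP"}[times_processed]


def outcome_table():
    table = {}
    for (sname, retries), (oname, order), crash_at in product(
        CLIENT_STRATEGIES.items(), SERVER_ORDERS.items(), (0, 1, 2)
    ):
        table[(sname, oname, crash_at)] = outcome(order, crash_at, retries)
    return table


if __name__ == "__main__":
    for key, result in outcome_table().items():
        print(key, result)
```

Under these assumptions no client strategy yields OK in every row, which is why exactly-once semantics cannot be obtained by retry policy alone.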


Lost replies
• Make server operations idempotent whenever possible
• Structure requests so that server can distinguish retries from the original
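One common way to let a server tell a retry from the original is a client-chosen request identifier. A minimal sketch, assuming an in-memory reply cache and a hypothetical `deposit` operation that is not idempotent on its own:

```python
class DedupServer:
    """Remembers the reply for each request id so a retry is not executed twice."""

    def __init__(self):
        self.balance = 0
        self._replies = {}  # request id -> reply already sent

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._replies:
            # A retry of a request we already handled: resend the old reply.
            return self._replies[request_id]
        self.balance += amount
        self._replies[request_id] = self.balance
        return self.balance


if __name__ == "__main__":
    server = DedupServer()
    print(server.deposit("req-1", 50))
    print(server.deposit("req-1", 50))  # retry after a lost reply
    print(server.deposit("req-2", 25))
```

A real server would also need to bound or expire the reply cache, which this sketch ignores.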


Client crashes
• The server-side activity is called an orphan computation
• Orphans can tie up resources, hold locks, etc.
• Four strategies (at least)
  – Extermination, based on client-side logs
    • Client writes a log record before and after each call
    • When client restarts after a crash, it checks the log and kills outstanding orphan computations
    • Problems include:
      – Lots of disk activity
      – Grand-orphans


Client crashes, continued
• More approaches for handling orphans
  – Re-incarnation, based on client-defined epochs
    • When client restarts after a crash, it broadcasts a start-of-epoch message
    • On receipt of a start-of-epoch message, each server kills any computation for that client
  – “Gentle” re-incarnation
    • Similar, but server tries to verify that a computation is really an orphan before killing it


Yet more client-crash strategies

• One more strategy
  – Expiration
    • Each computation has a lease on life
    • If not complete when the lease expires, a computation must obtain another lease from its owner
    • Clients wait one lease period before restarting after a crash (so any orphans will be gone)
    • Problem: what’s a reasonable lease period?


Common problems with client-crash strategies

• Crashes that involve network partition (communication between partitions will not work at all)
• Killed orphans may leave persistent traces behind, for example
  – Locks
  – Requests in message queues


How to do it?
• Redundancy applied
  – In the appropriate places
  – In the appropriate ways
• Types of redundancy
  – Data (e.g. error correcting codes, replicated data)
  – Time (e.g. retry)
  – Physical (e.g. replicated hardware, backup systems)


Triple Modular Redundancy
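The core of triple modular redundancy is a voter that passes on whatever at least two of three replicated units produce. A minimal software sketch of such a voter; the function name and the choice to raise an error when all three disagree are assumptions.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes >= 2:
        return value
    raise ValueError("no majority: all three replicas disagree")


if __name__ == "__main__":
    print(tmr_vote(7, 7, 9))  # one faulty replica is outvoted
```

A single faulty replica is masked; two faulty replicas that agree with each other are not.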


Tandem Computers
• TMR on
  – CPUs
  – Memory
• Duplicated
  – Buses
  – Disks
  – Power supplies
• A big hit in operations systems for a while


Replicated processing
• Based on process groups
• A process group consists of one or more identical processes
• Key events
  – Message sent to one member of a group
  – Process joins group
  – Process leaves group
  – Process crashes
• Key requirements
  – Messages must be received by all members
  – All members must agree on group membership


Flat or non-flat?


Effective process groups require

• Distributed agreement
  – On group membership
  – On coordinator elections
  – On whether or not to commit a transaction
• Effective communication
  – Reliable enough
  – Scalable enough
  – Often, multicast
  – Typically looking for atomic multicast


Process groups also require

• Ability to tolerate crash failures and omission failures
  – Need k+1 processes to deal with up to k silent failures
• Ability to tolerate performance, response, and arbitrary failures
  – Need 3k+1 processes to reach agreement with up to k Byzantine failures
  – Need 2k+1 processes to ensure that a majority of the system produces the correct results with up to k Byzantine failures
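The three bounds can be collected in one helper for quick sizing. The goal labels (`"silent"`, `"byzantine_majority"`, `"byzantine_agreement"`) and the function name are made up for this sketch.

```python
def min_group_size(k: int, goal: str) -> int:
    """Smallest group that tolerates up to k failures for the given goal."""
    sizes = {
        "silent": k + 1,                   # crash/omission: one survivor is enough
        "byzantine_majority": 2 * k + 1,   # correct results still form a majority
        "byzantine_agreement": 3 * k + 1,  # non-faulty members can reach agreement
    }
    return sizes[goal]


if __name__ == "__main__":
    for goal in ("silent", "byzantine_majority", "byzantine_agreement"):
        print(goal, [min_group_size(k, goal) for k in (1, 2, 3)])
```

For k = 1 the agreement bound gives 4, matching the n = 4, m = 1 Byzantine generals example.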


Reliable multicasting


Scalability problem
• Too many acknowledgements
  – One from each receiver
  – Can be a huge number in some systems
  – Also known as “feedback implosion”


Basic feedback suppression in scalable reliable multicast

If a receiver decides it has missed a message,
• it waits a random time, then multicasts a retransmission request
• while waiting, if it sees a sufficient request from another receiver, it does not send its own request
• server multicasts all retransmissions


Hierarchical feedback suppression for scalable reliable multicast

• messages flow from root toward leaves
• acks and retransmit requests flow toward root from coordinators
• each group can use any reliable small-group multicast scheme


Atomic multicast
• Often, in a distributed system, reliable multicast is a step toward atomic multicast
• Atomic multicast is atomicity applied to communications:
  – Either all members of a process group receive a message, OR
  – No members receive it
• Often requires some form of order agreement as well


How atomic multicast helps

1. Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database
2. One replica goes down
3. Database activity continues
4. The process comes back up
5. Atomic multicast allows us to figure out exactly which transactions have to be re-played (see pp 386-387)


More concepts
• Group view
• View change
• Virtually synchronous
  – Each message is received by all non-faulty processes, or
  – If sender crashes during multicast, message could be ignored by all processes


Virtual synchrony picture

Basic idea: in virtual synchrony, a multicast cannot cross a view-change


Receipt vs Delivery

Remember totally-ordered multicast …


What about multicast message order?

• Two aspects:
  – Relationship between sending order and delivery order
  – Agreement on delivery order
• Send/delivery ordering relationships
  – Unordered
  – FIFO-ordered
  – Causally-ordered
• If receivers agree on delivery order, it’s called totally-ordered multicast


Unordered

Process P1 | Process P2  | Process P3
sends m1   | delivers m1 | delivers m2
sends m2   | delivers m2 | delivers m1


FIFO-ordered

Agreement on: m1 before m2, m3 before m4

Process P1 | Process P2  | Process P3  | Process P4
sends m1   | delivers m1 | delivers m3 | sends m3
sends m2   | delivers m3 | delivers m1 | sends m4
           | delivers m2 | delivers m2 |
           | delivers m4 | delivers m4 |
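FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue. The sketch below assumes each sender numbers its messages from 0; the class name and fields are illustrative. It shows the difference between receiving a message and delivering it.

```python
class FifoReceiver:
    """Delivers each sender's messages in sending order, whatever the arrival order."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> message received but not yet deliverable
        self.delivered = []  # messages handed to the application, in delivery order

    def receive(self, sender: str, seq: int, message: str) -> None:
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of something already delivered
        self.held[(sender, seq)] = message
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected


if __name__ == "__main__":
    p3 = FifoReceiver()
    p3.receive("P1", 1, "m2")  # arrives early, held back
    p3.receive("P4", 0, "m3")
    p3.receive("P1", 0, "m1")  # releases m1, then m2
    p3.receive("P4", 1, "m4")
    print(p3.delivered)
```

Messages from different senders may still interleave differently at different receivers, as P2 and P3 do in the table above.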


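Six types of virtually synchronous reliable multicast

Relationship between sending order and delivery order | Agreement on delivery order
Unordered        | No
FIFO-ordered     | No
Causally-ordered | No
Unordered        | Yes (totally ordered)
FIFO-ordered     | Yes (totally ordered)
Causally-ordered | Yes (totally ordered)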


Implementing virtual synchrony

Don’t deliver a message until it’s been received everywhere, but “everywhere” can change

(a) 7’s crash is detected by 4, which sends a view-change message

(b) Processes forward unstable messages, followed by a flush

(c) When a process has received a flush from all processes in the new view, it installs the new view


Recovery from error
• Two main types:
  – Backward recovery to a checkpoint (assumed to be error-free)
  – Forward recovery (infer a correct state from available data)


More about checkpoints
• They are expensive
• Usually combined with a message log
• Message logs are cleared at checkpoints
• Recovering a crashed process:
  – Restart it
  – Restore its state to the most recent checkpoint
  – Replay the message log
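The recovery steps above can be sketched for a single process. This is a toy model under stated assumptions: the checkpoint and the log are kept in memory where a real system would use stable storage, message handling is deterministic, and all names are hypothetical.

```python
import copy


class LoggedProcess:
    """Keeps a checkpoint plus a log of messages handled since that checkpoint."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def handle(self, amount: int) -> None:
        self._log.append(amount)  # log first, then apply
        self._apply(amount)

    def _apply(self, amount: int) -> None:
        self.state["total"] += amount

    def checkpoint(self) -> None:
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()  # the log is cleared at each checkpoint

    def crash(self) -> None:
        self.state = None  # volatile state is lost

    def recover(self) -> None:
        self.state = copy.deepcopy(self._checkpoint)  # restore the checkpoint
        for amount in self._log:                      # replay the message log
            self._apply(amount)


if __name__ == "__main__":
    p = LoggedProcess()
    p.handle(5)
    p.checkpoint()
    p.handle(3)
    p.handle(2)
    p.crash()
    p.recover()
    print(p.state)
```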


Recovery line == most recent distributed snapshot


Domino effect


Sparing
• Not really fault tolerance
• But it can be cheaper, and provide fast restoration time after a failure
• Types of spares
  – Cold
  – Hot
  – Warm
• The spare may or may not also have regular responsibilities in the system


Switchover
• Repair is accomplished by switching processing away from a failed server to a spare


Questions on switchover
• Has the failed system really failed?
• Is the spare operational?
• Can the spare handle the load?
  – May need a way to block medium to low priority work during switchovers
• How will the spare get access to the failed server’s data?
• What client session data will be preserved, and how?


More switchover questions
• What about configuration files?
• What about network addressing?
• What about switching back after the failed server has been repaired?
  – Partial shutdown of the spare
  – Updating directories to redirect part of the load
  – Making up for lost medium-to-low priority work
