comp 655: distributed/operating systems summer 2011 dr. chunbo chu week 7: fault tolerance...
TRANSCRIPT
![Page 1: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/1.jpg)
COMP 655:Distributed/Operating
SystemsSummer 2011
Dr. Chunbo ChuWeek 7: Fault Tolerance
04/20/23 1Distributed Systems - COMP 655
![Page 2: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/2.jpg)
04/20/23 Distributed Systems - COMP 655 2
Fault Tolerance• Fault tolerance concepts• Implementation – distributed agreement• Distributed agreement meets transaction
processing: 2- and 3-phase commit
Bonus material• Implementation – reliable point-to-point
communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 3: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/3.jpg)
04/20/23 Distributed Systems - COMP 655 3
Fault tolerance concepts• Availability – can I use it now?
– Usually quantified as a percentage• Reliability – can I use it for a
certain period of time?– Usually quantified as MTBF
• Safety – will anything really bad happen if it does fail?
• Maintainability – how hard is it to fix when it fails?– Usually quantified as MTTR
![Page 4: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/4.jpg)
04/20/23 Distributed Systems - COMP 655 4
Comparing nines• 1 year = 8760 hr• Availability levels
– 90% = 876 hr downtime/yr– 99% = 87.6 hr downtime/yr– 99.9% = 8.76 hr downtime/yr– 99.99% = 52.56 min downtime/yr– 99.999% = 5.256 min downtime/yr
![Page 5: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/5.jpg)
04/20/23 Distributed Systems - COMP 655 5
Exercise: how to get five nines
1. Brainstorm what you would have to deal with to build a single-machine system that could run for five years with 25 min downtime. Consider:
– Hardware failures, especially disks– Power failures– Network outages– Software installation– What else?
2. Come up with some ideas about how to solve the problems you identify
![Page 6: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/6.jpg)
04/20/23 Distributed Systems - COMP 655 6
Multiple machines at 99%
Assuming independent failures
![Page 7: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/7.jpg)
04/20/23 Distributed Systems - COMP 655 7
Multiple machines at 95%
Assuming independent failures
![Page 8: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/8.jpg)
04/20/23 Distributed Systems - COMP 655 8
Multiple machines at 80%
Assuming independent failures
![Page 9: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/9.jpg)
04/20/23 Distributed Systems - COMP 655 9
1,000 components
![Page 10: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/10.jpg)
04/20/23 Distributed Systems - COMP 655 10
Things to watch out for in availability requirements
• What constitutes an outage …– A client PC going down?– A client applet going into an infinite
loop?– A server crashing?– A network outage?– Reports unavailable?– If a transaction times out?– If 100 transactions time out in a 10
min period?– etc
![Page 11: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/11.jpg)
04/20/23 Distributed Systems - COMP 655 11
More to watch out for• What constitutes being back up
after an outage?• When does an outage start?• When does it end?• Are there outages that don’t
count?– Natural disasters?– Outages due to operator errors?
• What about MTBF?
![Page 12: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/12.jpg)
04/20/23 Distributed Systems - COMP 655 12
Ways to get 99% availability
1. MTBF = 99 hr, MTTR = 1 hr2. MTBF = 99 min, MTTR = 1 min3. MTBF = 99 sec, MTTR = 1 sec
![Page 13: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/13.jpg)
04/20/23 Distributed Systems - COMP 655 13
More definitions
failure
error
fault
causes
may causeFault tolerance is continuing to work correctly in the presence of faults.
Types of faults:• transient• intermittent• permanent
![Page 14: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/14.jpg)
04/20/23 Distributed Systems - COMP 655 14
Types of failures
![Page 15: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/15.jpg)
04/20/23 Distributed Systems - COMP 655 15
If you remember one thing• Components fail in distributed systems
on a regular basis.• Distributed systems have to be
designed to deal with the failure of individual components so that the system as a whole– Is available and/or– Is reliable and/or– Is safe and/or– Is maintainable
depending on the problem it is trying to solve and the resources available …
![Page 16: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/16.jpg)
04/20/23 Distributed Systems - COMP 655 16
Fault Tolerance• Fault tolerance concepts• Implementation – distributed
agreement• Distributed agreement meets
transaction processing: 2- and 3-phase commit
![Page 17: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/17.jpg)
04/20/23 Distributed Systems - COMP 655 17
Two-army problem• Red army has 5,000 troops• Blue army and White army have
3,000 troops each• Attack together and win• Attack separately and lose in serial• Communication is by messenger,
who might be captured• Blue and white generals have no
way to know when a messenger is captured
![Page 18: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/18.jpg)
04/20/23 Distributed Systems - COMP 655 18
Activity: outsmart the generals
• Take your best shot at designing a protocol that can solve the two-army problem
• Spend ten minutes• Did you think of anything
promising?
![Page 19: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/19.jpg)
04/20/23 Distributed Systems - COMP 655 19
Conclusion: go home• “agreement between even two
processes is not possible in the face of unreliable communication”
![Page 20: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/20.jpg)
04/20/23 Distributed Systems - COMP 655 20
Byzantine generals• Assume perfect communication• Assume n generals, m of whom
should not be trusted• The problem is to reach agreement
on troop strength among the non-faulty generals
![Page 21: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/21.jpg)
04/20/23 Distributed Systems - COMP 655 21
Byzantine generals - example
n = 4, m = 1(units are K-troops)
(a) Multicast troop-strength messages(b) Construct troop-strength vectors(c) Compare notes: majority rules in each componentResult: 1, 2, and 4 agree on (1,2,unknown,4)
![Page 22: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/22.jpg)
04/20/23 Distributed Systems - COMP 655 22
Doesn’t work with n=3, m=1
![Page 23: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/23.jpg)
04/20/23 Distributed Systems - COMP 655 23
Fault Tolerance• Fault tolerance concepts• Implementation – distributed
agreement• Distributed agreement meets
transaction processing: 2- and 3-phase commit
![Page 24: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/24.jpg)
04/20/23 Distributed Systems - COMP 655 24
Distributed commit protocols
• What is the problem they are trying to solve?– Ensure that a group of processes all
do something, or none of them do– Example: in a distributed transaction
that involves updates to data on three different servers, ensure that all three commit or none of them do
![Page 25: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/25.jpg)
04/20/23 Distributed Systems - COMP 655 25
2-phase commit
Coordinator Participant
What to do when P, in READY state, contacts Q
![Page 26: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/26.jpg)
04/20/23 Distributed Systems - COMP 655 26
If coordinator crashes• Participants could wait until the
coordinator recovers• Or, they could try to figure out
what to do among themselves– Example, if P contacts Q, and Q is in
the COMMIT state, P should COMMIT as well
![Page 27: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/27.jpg)
04/20/23 Distributed Systems - COMP 655 27
2-phase commitWhat to do when P, in READY state, contacts Q
If all surviving participants are in READY state,1. Wait for coordinator to recover2. Elect a new coordinator (?)
![Page 28: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/28.jpg)
04/20/23 Distributed Systems - COMP 655 28
3-phase commit• Problem addressed:
– Non-blocking distributed commit in the presence of failures
– Interesting theoretically, but rarely used in practice
![Page 29: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/29.jpg)
04/20/23 Distributed Systems - COMP 655 29
3-phase commit
Coordinator Participant
![Page 30: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/30.jpg)
04/20/23 Distributed Systems - COMP 655 30
Bonus material• Implementation – reliable point-to-
point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 31: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/31.jpg)
04/20/23 Distributed Systems - COMP 655 31
RPC, RMI crash & omission failures
• Client can’t locate server• Request lost• Server crashes after receipt of
request• Response lost• Client crashes after sending
request
![Page 32: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/32.jpg)
04/20/23 Distributed Systems - COMP 655 32
Can’t locate server• Raise an exception, or• Send a signal, or• Log an error and return an error
code
Note: hard to mask distribution in this case
![Page 33: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/33.jpg)
04/20/23 Distributed Systems - COMP 655 33
Request lost• Timeout and retry• Back off to “cannot locate server”
if too many timeouts occur
![Page 34: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/34.jpg)
04/20/23 Distributed Systems - COMP 655 34
Server crashes after receipt of request
• Possible semantic commitments– Exactly once– At least once– At most once
Normal Work done Work not done
![Page 35: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/35.jpg)
04/20/23 Distributed Systems - COMP 655 35
Behavioral possibilities• Server events
– Process (P)– Send completion message (M)– Crash (C)
• Server order– P then M– M then P
• Client strategies– Retry every message– Retry no messages– Retry if unacknowledged– Retry if acknowledged
![Page 36: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/36.jpg)
04/20/23 Distributed Systems - COMP 655 36
Combining the options
![Page 37: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/37.jpg)
04/20/23 Distributed Systems - COMP 655 37
Lost replies• Make server operations
idempotent whenever possible• Structure requests so that server
can distinguish retries from the original
![Page 38: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/38.jpg)
04/20/23 Distributed Systems - COMP 655 38
Client crashes• The server-side activity is called an orphan computation
• Orphans can tie up resources, hold locks, etc
• Four strategies (at least)– Extermination, based on client-side logs
• Client writes a log record before and after each call• When client restarts after a crash, it checks the log
and kills outstanding orphan computations• Problems include:
– Lots of disk activity– Grand-orphans
![Page 39: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/39.jpg)
04/20/23 Distributed Systems - COMP 655 39
Client crashes, continued• More approaches for handling orphans
– Re-incarnation, based on client-defined epochs• When client restarts after a crash, it
broadcasts a start-of-epoch message• On receipt of a start-of-epoch message, each
server kills any computation for that client
– “Gentle” re-incarnation• Similar, but server tries to verify that a
computation is really an orphan before killing it
![Page 40: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/40.jpg)
04/20/23 Distributed Systems - COMP 655 40
Yet more client-crash strategies
• One more strategy– Expiration
• Each computation has a lease on life• If not complete when the lease expires, a
computation must obtain another lease from its owner
• Clients wait one lease period before restarting after a crash (so any orphans will be gone)
• Problem: what’s a reasonable lease period?
![Page 41: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/41.jpg)
04/20/23 Distributed Systems - COMP 655 41
Common problems with client-crash strategies
• Crashes that involve network partition(communication between partitions will
not work at all)
• Killed orphans may leave persistent traces behind, for example– Locks– Requests in message queues
![Page 42: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/42.jpg)
04/20/23 Distributed Systems - COMP 655 42
Bonus material• Implementation – reliable point-to-
point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 43: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/43.jpg)
04/20/23 Distributed Systems - COMP 655 43
How to do it?• Redundancy applied
– In the appropriate places– In the appropriate ways
• Types of redundancy– Data (e.g. error correcting codes,
replicated data)– Time (e.g. retry)– Physical (e.g. replicated hardware,
backup systems)
![Page 44: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/44.jpg)
04/20/23 Distributed Systems - COMP 655 44
Triple Modular Redundancy
![Page 45: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/45.jpg)
04/20/23 Distributed Systems - COMP 655 45
Tandem Computers• TMR on
– CPUs– Memory
• Duplicated– Buses– Disks– Power supplies
• A big hit in operations systems for a while
![Page 46: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/46.jpg)
04/20/23 Distributed Systems - COMP 655 46
Replicated processing• Based on process groups• A process group consists of one or more
identical processes• Key events
– Message sent to one member of a group– Process joins group– Process leaves group– Process crashes
• Key requirements– Messages must be received by all members– All members must agree on group
membership
![Page 47: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/47.jpg)
04/20/23 Distributed Systems - COMP 655 47
Flat or non-flat?
![Page 48: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/48.jpg)
04/20/23 Distributed Systems - COMP 655 48
Effective process groups require
• Distributed agreement– On group membership– On coordinator elections– On whether or not to commit a
transaction
• Effective communication– Reliable enough– Scalable enough– Often, multicast– Typically looking for atomic multicast
![Page 49: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/49.jpg)
04/20/23 Distributed Systems - COMP 655 49
Process groups also require
• Ability to tolerate crash failures and omission failures– Need k+1 processes to deal with up to k
silent failures
• Ability to tolerate performance, response, and arbitrary failures– Need 3k+1 processes to reach agreement
with up to k Byzantine failures– Need 2k+1 processes to ensure that a
majority of the system produces the correct results with up to k Byzantine failures
![Page 50: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/50.jpg)
04/20/23 Distributed Systems - COMP 655 50
Bonus material• Implementation – reliable point-to-
point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 51: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/51.jpg)
04/20/23 Distributed Systems - COMP 655 51
Reliable multicasting
![Page 52: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/52.jpg)
04/20/23 Distributed Systems - COMP 655 52
Scalability problem• Too many acknowledgements
– One from each receiver– Can be a huge number in some
systems– Also known as “feedback implosion”
![Page 53: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/53.jpg)
04/20/23 Distributed Systems - COMP 655 53
Basic feedback suppression in scalable
reliable multicast
If a receiver decides it has missed a message,• it waits a random time, then multicasts a retransmission request• while waiting, if it sees a sufficient request from another receiver,
it does not send its own request• server multicasts all retransmissions
![Page 54: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/54.jpg)
04/20/23 Distributed Systems - COMP 655 54
Hierarchical feedback suppression for scalable
reliable multicast
• messages flow from root toward leaves• acks and retransmit requests flow toward root from coordinators• each group can use any reliable small-group multicast scheme
![Page 55: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/55.jpg)
04/20/23 Distributed Systems - COMP 655 55
Atomic multicast• Often, in a distributed system,
reliable multicast is a step toward atomic multicast
• Atomic multicast is atomicity applied to communications:– Either all members of a process group
receive a message, OR– No members receive it
• Often requires some form of order agreement as well
![Page 56: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/56.jpg)
04/20/23 Distributed Systems - COMP 655 56
How atomic multicast helps
1. Assume we have atomic multicast, among a group of processes, each of which owns a replica of a database
2. One replica goes down3. Database activity continues4. The process comes back up5. Atomic multicast allows us to figure
out exactly which transactions have to be re-played (see pp 386-387)
![Page 57: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/57.jpg)
04/20/23 Distributed Systems - COMP 655 57
More concepts• Group view• View change• Virtually synchronous
– Each message is received by all non-faulty processes, or
– If sender crashes during multicast, message could be ignored by all processes
![Page 58: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/58.jpg)
04/20/23 Distributed Systems - COMP 655 58
Virtual synchrony picture
Basic idea:in virtual synchrony, a multicast cannot cross a view-change
![Page 59: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/59.jpg)
04/20/23 Distributed Systems - COMP 655 59
Receipt vs Delivery
Remember totally-ordered multicast …
![Page 60: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/60.jpg)
04/20/23 Distributed Systems - COMP 655 60
What about multicast message order?
• Two aspects:– Relationship between sending order and
delivery order– Agreement on delivery order
• Send/delivery ordering relationships– Unordered– FIFO-ordered– Causally-ordered
• If receivers agree on delivery order, it’s called totally-ordered multicast
![Page 61: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/61.jpg)
04/20/23 Distributed Systems - COMP 655 61
UnorderedProcess P1 Process P2 Process P3
sends m1sends m2
delivers m1delivers m2
delivers m2delivers m1
![Page 62: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/62.jpg)
04/20/23 Distributed Systems - COMP 655 62
FIFO-ordered
Agreement on: m1 before m2 m3 before m4
Process P1 Process P2 Process P3
sends m1sends m2
delivers m1delivers m3delivers m2delivers m4
delivers m3delivers m1delivers m2delivers m4
Process P4
sends m3sends m4
![Page 63: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/63.jpg)
04/20/23 Distributed Systems - COMP 655 63
Six types of virtually synchronous reliable
multicast
Relationship between sendingorder and delivery order
Agreement ondelivery order
![Page 64: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/64.jpg)
04/20/23 Distributed Systems - COMP 655 64
Implementing virtual synchrony
Don’t deliver a message until it’s been received everywhere -but “everywhere” can change
(a) 7’s crash is detected by 4, which sends a view-change message
(b) Processes forward unstable messages, followed by flush
(c) When have flush from all processes in new view, install new view
![Page 65: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/65.jpg)
04/20/23 Distributed Systems - COMP 655 65
Bonus material• Implementation – reliable point-to-
point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 66: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/66.jpg)
04/20/23 Distributed Systems - COMP 655 66
Recovery from error • Two main types:
– Backward recovery to a checkpoint (assumed to be error-free)
– Forward recovery (infer a correct state from available data)
![Page 67: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/67.jpg)
04/20/23 Distributed Systems - COMP 655 67
More about checkpoints• They are expensive• Usually combined with a message log• Message logs are cleared at checkpoints• Recovering a crashed process:
– Restart it– Restore its state to the most recent
checkpoint– Replay the message log
![Page 68: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/68.jpg)
04/20/23 Distributed Systems - COMP 655 68
Recovery line == most recent distributed
snapshot
![Page 69: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/69.jpg)
04/20/23 Distributed Systems - COMP 655 69
Domino effect
![Page 70: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/70.jpg)
04/20/23 Distributed Systems - COMP 655 70
Bonus material• Implementation – reliable point-to-
point communication• Implementation – process groups• Implementation – reliable multicast• Recovery• Sparing
![Page 71: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/71.jpg)
04/20/23 Distributed Systems - COMP 655 71
Sparing• Not really fault tolerance• But it can be cheaper, and provide
fast restoration time after a failure• Types of spares
– Cold– Hot– Warm
• The spare may or may not also have regular responsibilities in the system
![Page 72: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/72.jpg)
04/20/23 Distributed Systems - COMP 655 72
Switchover• Repair is accomplished by
switching processing away from a failed server to a spare
![Page 73: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/73.jpg)
04/20/23 Distributed Systems - COMP 655 73
Questions on switchover• Has the failed system really failed?• Is the spare operational?• Can the spare handle the load?
– May need a way to block medium to low priority work during switchovers
• How will the spare get access to the failed server’s data?
• What client session data will be preserved, and how?
![Page 74: COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655](https://reader036.vdocuments.us/reader036/viewer/2022062422/56649f205503460f94c38c5b/html5/thumbnails/74.jpg)
04/20/23 Distributed Systems - COMP 655 74
More switchover questions• What about configuration files?• What about network addressing?• What about switching back after the
failed server has been repaired?– Partial shutdown of the spare– Updating directories to redirect part of the
load– Making up for lost medium-to-low priority
work