fault-tolerance techniques rsm, paxos
DESCRIPTION
Fault-tolerance techniques: RSM, Paxos. Jinyang Li. What we’ve learnt so far. Fault tolerance. Recoverability: all-or-nothing atomicity for updates involving a single server. 2P commit: all-or-nothing atomicity for updates involving >=2 servers.
![Page 1: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/1.jpg)
Fault-tolerance techniques
RSM, Paxos
Jinyang Li
![Page 2: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/2.jpg)
What we’ve learnt so far
• Fault tolerance
  – Recoverability
    • All-or-nothing atomicity for updates involving a single server.
  – 2P commit
    • All-or-nothing atomicity for updates involving >=2 servers.
  – However, system is down while waiting for crashed nodes to reboot
• This class:
  – Ensure high availability through replication
![Page 3: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/3.jpg)
Achieving high availability using replication
[Figure: three replicas A, B and C]
Idea: upon A’s failure, serve requests from either B or C.
Challenge: ensure sequential consistency across such reconfiguration
![Page 4: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/4.jpg)
RSM: Replicated state machine
• RSM is a general replication method
  – Lab 8: apply RSM to lock service
• RSM Rules:
  – All replicas start in the same initial state
  – Every replica applies operations in the same order
  – All operations must be deterministic
• All replicas end up in the same state
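A minimal sketch of the three RSM rules, assuming a toy key-value state machine. The `Replica` class and the `("put", key, value)` / `("incr", key)` operation format are illustrative assumptions, not part of the lab code.

```python
class Replica:
    """Toy replicated state machine: a dict updated by deterministic ops."""

    def __init__(self):
        self.state = {}  # rule 1: every replica starts from the same state

    def apply(self, op):
        # rule 3: each op is deterministic (no clocks, no randomness)
        kind, key, *args = op
        if kind == "put":
            self.state[key] = args[0]
        elif kind == "incr":
            self.state[key] = self.state.get(key, 0) + 1
        else:
            raise ValueError(f"unknown op {kind!r}")


log = [("put", "x", 5), ("incr", "x"), ("put", "y", 1)]

a, b = Replica(), Replica()
for op in log:  # rule 2: same order on every replica
    a.apply(op)
    b.apply(op)

# A replica that applies the same ops in a different order may diverge.
c = Replica()
for op in [log[1], log[0], log[2]]:
    c.apply(op)

print(a.state == b.state)  # same order, same final state
print(a.state == c.state)  # reordered ops, different final state
```

Replicas `a` and `b` agree because they see the same log; `c` ends with a different value of `x`, which is why the ordering rule matters.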
![Page 5: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/5.jpg)
Strawman RSM
• Does it ensure sequential consistency?
[Figure: clients send ops to the replicas]
![Page 6: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/6.jpg)
RSM based on primary/backup
• Primary/backup: ensure a single order of ops:
  – Primary orders operations
  – Backups execute operations in order
[Figure: clients send ops to the primary, which forwards each op to the two backups]
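A rough sketch of how a primary can impose a single order, assuming in-process method calls stand in for network messages. `Primary`, `Backup` and the sequence-number scheme are hypothetical names chosen for illustration.

```python
class Backup:
    def __init__(self):
        self.log = []

    def execute(self, seqno, op):
        # Backups execute strictly in the order chosen by the primary.
        assert seqno == len(self.log), "op arrived out of order"
        self.log.append(op)


class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []

    def submit(self, op):
        seqno = len(self.log)  # the primary alone picks the order
        self.log.append(op)
        for b in self.backups:
            b.execute(seqno, op)
        return seqno


backups = [Backup(), Backup()]
primary = Primary(backups)
for op in ["op-from-client-1", "op-from-client-2", "op-from-client-1-again"]:
    primary.submit(op)

print(all(b.log == primary.log for b in backups))
```

Because every op passes through one node before reaching the backups, concurrent clients cannot cause replicas to disagree on the order.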
![Page 7: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/7.jpg)
RSM: read-only operations
• Read-only operations need not be replicated
[Figure: primary and two backups; W(x) is forwarded to the backups, R(x) is not]
![Page 8: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/8.jpg)
RSM: read-only operations
• Can clients send read-only ops to any server?
[Figure: primary and two backups, with W(x) and R(x) operations]
![Page 9: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/9.jpg)
RSM: read-only operations
• Can clients send read-only ops to any server?
[Figure: primary and two backups; x’s initial value is 0. While W(x)1 propagates from the primary to the backups, reads observe both R(x)1 and R(x)0]
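One possible interleaving behind this scenario, sketched under the assumption that the write has reached the primary and one backup but not yet the other. The `Server` class is a hypothetical stand-in for a replica.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.x = 0  # x's initial value is 0

    def write(self, value):
        self.x = value

    def read(self):
        return self.x


primary, backup1, backup2 = Server("primary"), Server("backup1"), Server("backup2")

# W(x)1 has reached the primary and backup1, but not yet backup2.
primary.write(1)
backup1.write(1)

first = backup1.read()   # client reads from an up-to-date replica
second = backup2.read()  # then from the lagging replica

print(first, second)  # the second read goes "back in time"
```

A client that sees 1 and then 0 has observed an order no single copy of `x` could produce, so reads sent to arbitrary servers can break sequential consistency.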
![Page 10: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/10.jpg)
RSM failure handling
• If primary fails, one backup acts as the new primary
• Challenges:
  1. How to reliably detect primary failure?
  2. How to ensure no 2 backups simultaneously become primary?
  3. How to preserve sequential consistency across primary changes?
    • Primary can fail after sending an operation W to backup A but before sending W to B
    • A and B must agree on whether W is reflected in the new state after reconfiguration
Paxos, a fault-tolerant consensus protocol, addresses these challenges
![Page 11: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/11.jpg)
Case study: Hypervisor [Bressoud and Schneider]
• Goal: fault tolerant computing
  – Banks, NASA etc. need it
  – In the 80s, CPUs were very likely to fail
• Hypervisor: primary/backup replication
  – If primary fails, backup takes over
  – Caveat: assuming perfect failure detection
![Page 12: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/12.jpg)
Hypervisor replicates at VM-level
• Why replicate at VM-level?
  – Hardware fault-tolerant machines were big in the 80s
  – Software solution is more economical
  – Replicating at O/S level is messy (many interfaces)
  – Replicating at app level requires programmer effort
• Primary and backup execute the same sequence of machine instructions
![Page 13: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/13.jpg)
A Strawman design
• Two identical machines
• Same initial memory/disk contents
• Start execution on both machines
• Will they perform the same computation?
[Figure: two machines, each with its own memory (mem)]
![Page 14: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/14.jpg)
Hypervisor’s basic plan
• Execute one instruction at a time using primary/backup
[Figure: timeline of primary and backup executing i0, i1, i2, exchanging a confirmation message (“executed i0?”, “executed inst i1?”, “ok”) after each instruction]
![Page 15: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/15.jpg)
Hypervisor Challenges
• Operations must be deterministic.
  – ADD, MUL etc.
  – Read memory (?)
• How to handle non-deterministic ops?
  – Read time-of-day register
  – Read disk
  – Interrupt timing
  – External input devices (network, keyboard)
• Executing one instruction at a time is VERY SLOW
![Page 16: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/16.jpg)
Handle disk operations
[Figure: primary and backup, each with its own memory (mem), both attached to the devices over a SCSI bus and ethernet]
Strawman replicates disks at both machines
Problem: disks might not behave identically (e.g. fail at different sectors)
Hypervisor connects devices to both machines
• Only primary reads/writes to devices
• Primary sends read values to backup
• Only primary handles interrupts from h/w
• Primary sends interrupts to backup
![Page 17: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/17.jpg)
Hypervisor executes in epochs
• Challenge: executing one instruction at a time is slow
• Hypervisor executes in epochs
  – CPU h/w interrupts every N instructions (so both nodes stop at the same point)
  – Primary delays all interrupts till end of an epoch
  – Primary sends all interrupts to backup
  – Primary/backup execute all interrupts at an epoch’s end.
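The epoch idea can be sketched as a small simulation. This is an illustration only: `EPOCH_LEN`, `run` and `buffer_interrupts` are assumed names, and instructions and interrupts are modelled as plain strings rather than real machine state.

```python
EPOCH_LEN = 4  # assumed: instructions per epoch (the slide's N)


def run(instructions, interrupts_by_epoch):
    """Execute instructions, delivering interrupts only at epoch boundaries.

    interrupts_by_epoch maps epoch number -> interrupts to deliver at the
    end of that epoch. Returns the trace of executed events.
    """
    trace = []
    for i, inst in enumerate(instructions):
        trace.append(inst)
        if (i + 1) % EPOCH_LEN == 0:  # epoch boundary
            epoch = i // EPOCH_LEN
            for irq in interrupts_by_epoch.get(epoch, []):
                trace.append(f"irq:{irq}")
    return trace


def buffer_interrupts(arrivals):
    """Primary side: group each interrupt under the epoch in which it arrived.

    arrivals is a list of (instruction_index, interrupt) pairs.
    """
    buffered = {}
    for index, irq in arrivals:
        buffered.setdefault(index // EPOCH_LEN, []).append(irq)
    return buffered


program = [f"i{k}" for k in range(8)]
arrivals = [(1, "disk"), (2, "timer"), (6, "net")]  # arbitrary timing on the primary

buffered = buffer_interrupts(arrivals)  # primary delays interrupts
primary_trace = run(program, buffered)
backup_trace = run(program, buffered)   # backup receives the same buffer

print(primary_trace == backup_trace)
print(primary_trace)
```

The exact instruction at which an interrupt arrived no longer matters; both machines see it at the same epoch boundary, so interrupt timing stops being a source of non-determinism.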
![Page 18: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/18.jpg)
Hypervisor failover
• Primary fails at epoch E
  – Backup times out waiting for primary to announce end of epoch E
  – Backup delivers all buffered interrupts at the end of E
  – Backup starts epoch E+1
  – Backup becomes primary at epoch E+1
• What about I/O at epoch E?
![Page 19: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/19.jpg)
Hypervisor failover
• Backup does not know whether primary executed the I/O of epoch E
  – Relies on O/S to re-try the I/O
• Device needs to support repeated ops
  – OK for disk writes/reads
  – OK for network (TCP will figure it out)
  – How about keyboard, printer, ATM cash machine?
![Page 20: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/20.jpg)
Hypervisor implementation
• Hypervisor needs to trap every non-deterministic instruction
  – Time-of-day register
  – HP TLB replacement
  – HP branch-and-link instruction
  – Memory-mapped I/O loads/stores
• Performance penalty is reasonable
  – A factor of two slowdown (HP 9000/720 50MHz)
  – How about its performance on modern hardware?
![Page 21: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/21.jpg)
Caveats in Hypervisor
• Hypervisor assumes failure detection is perfect
• What if the network between primary/backup fails?
  – Primary is still running
  – Backup becomes a new primary
  – Two primaries at the same time!
• Can timeouts detect failures correctly?
  – Pings from backup to primary are lost
  – Pings from backup to primary are delayed
Paxos: fault tolerant consensus
![Page 23: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/23.jpg)
Paxos: fault tolerant consensus
• Paxos lets all nodes agree on the same value despite node failures, network failures and delays
• Extremely useful:
  – e.g. Nodes agree that X is the primary
  – e.g. Nodes agree that W should be the most recent operation executed
![Page 24: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/24.jpg)
Requirements of consensus
• Correctness (safety):
  – All nodes agree on the same value
  – The agreed value X has been proposed by some node
• Fault-tolerance:
  – If less than some fraction of nodes fail, the rest should still reach agreement
• Termination:
![Page 25: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/25.jpg)
Fischer-Lynch-Paterson [FLP’85] impossibility result
• It is impossible for a set of processors in an asynchronous system to agree on a binary value, even if only a single processor is subject to an unannounced failure.
• Asynchrony --> timeout is not perfect
![Page 26: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/26.jpg)
Paxos
• Paxos: the best-known fault-tolerant agreement protocol
• Paxos’ properties:
  – Correct
  – Fault-tolerance:
    • If less than N/2 nodes fail, the remaining nodes reach agreement eventually
  – No guaranteed termination
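The "less than N/2" bound follows from majorities: any two majorities share at least one node. A small brute-force check, with helper names (`majority`, `max_failures`, `any_two_majorities_intersect`) that are illustrative assumptions:

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def max_failures(n):
    """Most failures that still leave a majority alive."""
    return n - majority(n)


def any_two_majorities_intersect(n):
    """Brute-force check that every pair of minimal majorities overlaps."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(q1 & q2 for q1 in quorums for q2 in quorums)


for n in (3, 4, 5):
    print(n, majority(n), max_failures(n), any_two_majorities_intersect(n))
```

With 5 nodes, 2 may fail; with 4 nodes, only 1 may fail, so an even cluster size buys no extra fault tolerance.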
![Page 27: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/27.jpg)
Paxos: general approach
• One (or more) node decides to be the leader
• Leader proposes a value and solicits acceptance from others
• Leader announces result or tries again
![Page 28: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/28.jpg)
Paxos’ challenges
• What if >1 nodes become leaders simultaneously?
• What if there is a network partition?
• What if a leader crashes in the middle of solicitation?
• What if a leader crashes after deciding but before announcing results?
• What if the new leader proposes different values than the already decided value?
![Page 29: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/29.jpg)
Paxos setup
• Each node runs as a proposer, acceptor and learner
• Proposer (leader) proposes a value and solicits acceptance from acceptors
• Leader announces the chosen value to learners
![Page 30: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/30.jpg)
Strawman
• Designate a single node X as acceptor (e.g. one with smallest id)
  – Each proposer sends its value to X
  – X decides on one of the values
  – X announces its decision to all learners
• Problem?
  – Failure of the single acceptor halts decision
  – Need multiple acceptors!
![Page 31: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/31.jpg)
Strawman 2: multiple acceptors
• Each proposer (leader) proposes to all acceptors
• Each acceptor accepts the first proposal it receives and rejects the rest
• If the leader receives positive replies from a majority of acceptors, it chooses its own value
  – There is at most 1 majority, hence only a single value is chosen
• Leader sends chosen value to all learners
• Problem:
  – What if multiple leaders propose simultaneously so there is no majority accepting?
  – What if the leader dies?
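The first problem can be illustrated with a small sketch. The function `strawman2` and the acceptor/leader names are hypothetical; message arrival order is given explicitly instead of being simulated.

```python
def strawman2(arrival_order):
    """arrival_order maps acceptor -> proposers in the order their proposals arrive.

    Each acceptor accepts only the first proposal it receives.
    Returns the chosen proposer, or None if nobody wins a majority.
    """
    votes = {}
    for acceptor, proposers in arrival_order.items():
        first = proposers[0]
        votes[first] = votes.get(first, 0) + 1
    needed = len(arrival_order) // 2 + 1
    winners = [p for p, count in votes.items() if count >= needed]
    return winners[0] if winners else None


# One leader reaches most acceptors first: a value is chosen.
print(strawman2({"A1": ["L1", "L2"], "A2": ["L1", "L2"], "A3": ["L2", "L1"]}))

# Three leaders each reach a different acceptor first: no majority.
print(strawman2({"A1": ["L1", "L2", "L3"], "A2": ["L2", "L3", "L1"], "A3": ["L3", "L1", "L2"]}))
```

In the second call every acceptor is already committed to a different leader, and since acceptors never change their mind the protocol is stuck.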
![Page 32: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/32.jpg)
Paxos’ solution
• Each acceptor must be able to accept multiple proposals
• Order proposals by proposal #
  – If a proposal with value v is chosen, all higher proposals have value v
![Page 33: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/33.jpg)
Paxos operation: node state
• Each node maintains:
  – na, va: highest proposal # accepted and its corresponding accepted value
  – nh: highest proposal # seen
  – myn: my proposal # in the current Paxos instance
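A sketch of this per-node state. The encoding of a proposal number as a `(counter, node_id)` tuple is an assumption made here so that numbers from different nodes never collide; the `PaxosState` class and its field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed encoding: (counter, node_id). Tuples compare by counter first and
# node_id breaks ties, so two nodes never produce the same proposal number.
ProposalNum = Tuple[int, int]


@dataclass
class PaxosState:
    node_id: int
    n_a: Optional[ProposalNum] = None   # highest proposal # accepted
    v_a: Optional[str] = None           # value accepted with n_a
    n_h: ProposalNum = (0, 0)           # highest proposal # seen
    my_n: Optional[ProposalNum] = None  # my proposal # in this instance

    def next_proposal(self) -> ProposalNum:
        """Choose my_n > n_h, as a would-be leader must."""
        self.my_n = (self.n_h[0] + 1, self.node_id)
        return self.my_n


n1 = PaxosState(node_id=1)
n1.n_h = (3, 2)            # has seen proposal 3 from node 2
print(n1.next_proposal())  # (4, 1)
print((4, 1) > (3, 2), (3, 1) < (3, 2))
```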
![Page 34: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/34.jpg)
Paxos operation: 3P protocol
• Phase 1 (Prepare)
  – A node decides to be leader (and propose)
  – Leader chooses myn > nh
  – Leader sends <prepare, myn> to all nodes
  – Upon receiving <prepare, n>
      If n < nh
        reply <prepare-reject>
      Else
        nh = n
        reply <prepare-ok, na,va>
        This node will not accept any proposal lower than n
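The prepare handler can be sketched as follows, assuming plain integers as proposal numbers and tuples as messages. `Acceptor` and `on_prepare` are illustrative names.

```python
class Acceptor:
    def __init__(self):
        self.n_h = 0     # highest proposal # seen
        self.n_a = None  # highest proposal # accepted
        self.v_a = None  # value accepted with n_a

    def on_prepare(self, n):
        if n < self.n_h:
            return ("prepare-reject",)
        self.n_h = n  # promise: refuse proposals lower than n from now on
        return ("prepare-ok", self.n_a, self.v_a)


acc = Acceptor()
print(acc.on_prepare(5))  # first prepare is granted
print(acc.on_prepare(3))  # lower-numbered prepare is rejected
print(acc.n_h)
```

The reply carries `n_a` and `v_a` so that the leader learns about any value this node may already have accepted.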
![Page 35: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/35.jpg)
Paxos operation
• Phase 2 (Accept):
  – If leader gets prepare-ok from a majority
      V = non-empty value corresponding to the highest na received
      If V = null, then leader can pick any V
      Send <accept, myn, V> to all nodes
  – If leader fails to get majority prepare-ok
    • Delay and restart Paxos
  – Upon receiving <accept, n, V>
      If n < nh
        reply with <accept-reject>
      else
        na = n; va = V; nh = n
        reply with <accept-ok>
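A sketch of the two halves of phase 2, assuming integer proposal numbers and a plain dict for node state. `choose_value` and `on_accept` are hypothetical helper names.

```python
def choose_value(prepare_oks, my_value):
    """prepare_oks: (n_a, v_a) pairs taken from a majority of prepare-ok replies."""
    accepted = [(n_a, v_a) for n_a, v_a in prepare_oks if n_a is not None]
    if not accepted:
        return my_value  # nothing accepted yet: leader is free to pick
    return max(accepted, key=lambda pair: pair[0])[1]


def on_accept(state, n, v):
    """state: dict with keys n_h, n_a, v_a; mutated in place."""
    if n < state["n_h"]:
        return "accept-reject"
    state.update(n_a=n, v_a=v, n_h=n)
    return "accept-ok"


print(choose_value([(None, None), (None, None)], "mine"))
print(choose_value([(None, None), (2, "old"), (4, "newer")], "mine"))

state = {"n_h": 5, "n_a": None, "v_a": None}
print(on_accept(state, 4, "x"))
print(on_accept(state, 6, "y"), state)
```

Adopting the value with the highest `n_a` is what keeps a new leader from overwriting a value that may already have been chosen.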
![Page 36: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/36.jpg)
Paxos operation
• Phase 3 (Decide)
  – If leader gets accept-ok from a majority
    • Send <decide, va> to all nodes
  – If leader fails to get accept-ok from a majority
    • Delay and restart Paxos
![Page 37: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/37.jpg)
Paxos operation: an example
[Figure: message timeline between N0, N1 (leader) and N2]
Initial state: N0: nh=N0:0, na=va=null; N1: nh=N1:0, na=va=null; N2: nh=N2:0, na=va=null
1. N1 sends <Prepare, N1:1> to N0 and N2
2. N0 and N2 set nh=N1:1 (na=null, va=null) and reply <ok, na=va=null>
3. N1 sends <Accept, N1:1, val1> to N0 and N2
4. N0 and N2 set nh=N1:1, na=N1:1, va=val1 and reply <ok>
5. N1 sends <Decide, val1> to N0 and N2
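The example can be replayed with a small in-memory simulation. This is a sketch under several assumptions: direct method calls replace messages, no node fails, proposal numbers are `(counter, node_name)` tuples, and the leader also sends each message to itself. `Node` and `run_paxos` are illustrative names.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.n_h = (0, name)  # highest proposal # seen, as (counter, node)
        self.n_a = None
        self.v_a = None
        self.decided = None

    def on_prepare(self, n):
        if n < self.n_h:
            return None  # prepare-reject
        self.n_h = n
        return (self.n_a, self.v_a)  # prepare-ok

    def on_accept(self, n, v):
        if n < self.n_h:
            return False  # accept-reject
        self.n_a, self.v_a, self.n_h = n, v, n
        return True  # accept-ok

    def on_decide(self, v):
        self.decided = v


def run_paxos(leader, nodes, my_value):
    majority = len(nodes) // 2 + 1
    n = (leader.n_h[0] + 1, leader.name)  # my_n > n_h

    # Phase 1: prepare
    oks = [r for r in (node.on_prepare(n) for node in nodes) if r is not None]
    if len(oks) < majority:
        return None
    accepted = [(n_a, v_a) for n_a, v_a in oks if n_a is not None]
    value = max(accepted, key=lambda p: p[0])[1] if accepted else my_value

    # Phase 2: accept
    acks = sum(node.on_accept(n, value) for node in nodes)
    if acks < majority:
        return None

    # Phase 3: decide
    for node in nodes:
        node.on_decide(value)
    return value


n0, n1, n2 = Node("N0"), Node("N1"), Node("N2")
nodes = [n0, n1, n2]
first = run_paxos(n1, nodes, "val1")
print(first, [(node.name, node.n_h, node.n_a, node.v_a) for node in nodes])

# A later leader with a different value still ends up proposing val1.
second = run_paxos(n0, nodes, "val2")
print(second)
```

After the first run every node holds nh = na = (1, "N1") and va = "val1", matching the final states in the example. The second run shows a higher-numbered proposal adopting the already accepted value.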
![Page 38: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/38.jpg)
Paxos properties
• When is the value V chosen?
  1. When leader receives a majority prepare-ok and proposes V
  2. When a majority of nodes accept V
  3. When the leader receives a majority accept-ok for value V
![Page 39: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/39.jpg)
Understanding Paxos
• What if more than one leader is active?
• Suppose two leaders use different proposal numbers, N0:10 and N1:11
• Can both leaders see a majority of prepare-ok?
![Page 40: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/40.jpg)
Understanding Paxos
• What if leader fails while sending accept?
• What if a node fails after receiving accept?
  – If it doesn’t restart …
  – If it reboots …
• What if a node fails after sending prepare-ok?
  – If it reboots …
![Page 41: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/41.jpg)
Using Paxos for RSM
• Fault-tolerant RSM requires consistent replica membership
  – Membership: <primary, backups>
• All active nodes must agree on the sequence of view changes:
  – <vid-1, primary, backups> <vid-2, primary, backups> ..
• Use Paxos to agree on the <primary, backups> for a particular vid
  – Many instances of Paxos execution, one for each vid.
  – Each Paxos instance agrees to a single value, e.g. v[1]=x, v[2]=y, …
![Page 42: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/42.jpg)
Lab7: Using Paxos to track view changes
• All nodes start with static config vid1: N1
  – Paxos-instance-1 has static agreement v[1]: N1
• N2 joins
  – Paxos-instance-2: make N1 agree on v[2]
  – V[2]: N1,N2
• N3 joins
  – Paxos-instance-3: make N1,N2 agree on v[3]
  – V[3]: N1,N2,N3
• N3 fails
  – Paxos-instance-4: make N1,N2,N3 agree on v[4]
  – V[4]: N1,N2
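The pattern in this sequence is that the members of view i are the participants of the Paxos instance that picks view i+1. A sketch that replays it, assuming every instance succeeds; `next_view` is a hypothetical helper, not the lab's API.

```python
def next_view(views, change, node):
    """Return (voters, proposed_view) for the next Paxos instance.

    views: list of agreed views, views[0] being v[1].
    The voters are the members of the latest agreed view; the proposed view
    adds or removes `node`.
    """
    current = views[-1]
    if change == "join":
        proposed = current + [node]
    elif change == "fail":
        proposed = [m for m in current if m != node]
    else:
        raise ValueError(change)
    return current, proposed


views = [["N1"]]  # v[1] is the static initial config
for change, node in [("join", "N2"), ("join", "N3"), ("fail", "N3")]:
    voters, proposed = next_view(views, change, node)
    vid = len(views) + 1
    print(f"Paxos-instance-{vid}: make {','.join(voters)} agree on v[{vid}] = {','.join(proposed)}")
    views.append(proposed)  # assumption: the instance succeeds
```

Instance 4 still counts N3 among its participants even though N3 has failed; a majority of N1, N2, N3 is enough to agree on v[4].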
![Page 43: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/43.jpg)
Lab7: Using Paxos to track view changes
[Figure: N1 and N2 each hold V[1]: N1 and V[2]: N1,N2; N3 holds only V[1]: N1. N3 joins and sends <Prepare, i=2, N3:1>; the reply is <oldview, v[2]=N1,N2>]
![Page 44: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/44.jpg)
Lab8: Using Paxos to track view changes
[Figure: N1 and N2 each hold V[1]: N1 and V[2]: N1,N2; N3 joins holding V[1]: N1 and V[2]: N1,N2, and sends <Prepare, i=3, N3:1> to N1 and N2]
![Page 45: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/45.jpg)
Lab8: reconfigurable RSM
• Use RSM to replicate lock_server
• Primary (master) assigns a viewstamp to each client request
  – Viewstamp is a tuple (vid:seqno)
  – (1:1)(1:2)(1:3)(2:1)(2:2)
• Primary can send multiple outstanding requests to backups
  – All replicas execute client requests in viewstamp order
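Viewstamp order is lexicographic order on (vid, seqno). A sketch that reproduces the sequence above, assuming seqno restarts at 1 in each view; `ViewstampClock` is a hypothetical helper, not part of the lab code.

```python
from typing import List, Tuple

Viewstamp = Tuple[int, int]  # (vid, seqno)


class ViewstampClock:
    """Assumed helper: the primary's counter for assigning viewstamps."""

    def __init__(self, vid: int = 1):
        self.vid = vid
        self.seqno = 0

    def next(self) -> Viewstamp:
        self.seqno += 1
        return (self.vid, self.seqno)

    def new_view(self, vid: int) -> None:
        self.vid = vid
        self.seqno = 0  # assumption: seqno restarts in each view


clock = ViewstampClock()
stamps: List[Viewstamp] = [clock.next() for _ in range(3)]
clock.new_view(2)
stamps += [clock.next() for _ in range(2)]
print(stamps)

# Requests may reach a backup out of order; sorting restores viewstamp order.
arrived = [stamps[3], stamps[0], stamps[2], stamps[4], stamps[1]]
print(sorted(arrived) == stamps)
```

Python compares tuples element by element, so every request of view 1 sorts before any request of view 2.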
![Page 46: Fault-tolerance techniques RSM, Paxos](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681472c550346895db46835/html5/thumbnails/46.jpg)
Lab8: Reconfigurable RSM
• What happens during a view change?
  – The last couple of outstanding requests might be executed by some but not all replicas
• Must sync the state of all replicas before accepting requests in new view
  – All replicas transfer state from the primary
  – Since all agree on the primary, all replicas’ states are in sync