12. Recovery: Study Meeting, M1 Yuuki Horita, 2004/5/14
12. Recovery
Study Meeting
M1 Yuuki Horita
2004/5/14
Contents
• Introduction
• Recovery
• Checkpointing
  • Difficulty of checkpointing
  • Synchronous checkpointing / recovery
  • (Asynchronous checkpointing / recovery)
Introduction
• Long computations in distributed environments face a high failure rate
  • Host failure (there are many hosts)
  • Network failure
• One failure may disrupt the entire computation
  ⇒ The computation has to be started again from the beginning: high cost
• Why don’t we reuse the computation done so far?
  ⇒ Recovery
Recovery is not easy
Suppose that a parallel computation is running on distributed resources:
for (i = 0; i < MAXITER; i++) {
    local_compute();          // compute at each host
    global_state_exchange();  // communicate with neighbors
}
• Process states need to be saved periodically
• Usually the other processes also have to be restored to a previous state
• All of this adds overhead
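The idea of saving state periodically can be sketched for a single process as follows. This is only an illustration under simplifying assumptions: the state is a small dictionary, the failure is simulated once, and the names `run_with_checkpoints`, `ckpt_every`, and `fail_at` are hypothetical.

```python
import copy


def run_with_checkpoints(max_iter, ckpt_every, fail_at=None):
    """Run a toy iteration loop, saving the state every `ckpt_every` steps.

    If `fail_at` is given, one failure is simulated when the loop reaches
    that iteration; the loop then resumes from the last saved state instead
    of from iteration 0. Returns (final_state, iterations_executed).
    """
    state = {"i": 0, "total": 0}
    checkpoint = copy.deepcopy(state)          # initial checkpoint
    executed = 0
    failed = False
    while state["i"] < max_iter:
        if fail_at is not None and not failed and state["i"] == fail_at:
            failed = True
            state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
            continue
        state["total"] += state["i"]           # stands in for local_compute()
        state["i"] += 1
        executed += 1
        if state["i"] % ckpt_every == 0:
            checkpoint = copy.deepcopy(state)  # checkpointing
    return state, executed
```

With `max_iter=10`, `ckpt_every=4`, and a failure at iteration 7, only the three iterations after the checkpoint at iteration 4 are repeated, so 13 iterations run in total instead of 17 for a restart from the beginning.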
Recovery
Backward/Forward Error Recovery
• Forward-error recovery
  • Applicable only when the errors can be removed
  • Enables processes to move forward
  • Ex) Redundancy, voting
• Backward-error recovery
  • General
  • Restores a previous error-free state
  • Ex) Checkpoint
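As a small illustration of the "redundancy, vote" example of forward-error recovery, the sketch below masks one faulty replica by majority voting. The function name `majority_vote` and the replica results are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among replica results")
    return value
```

For example, `majority_vote([42, 42, 7])` lets the computation move forward with 42 even though one replica returned a wrong value.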
Backward-error recovery
• Operation-based approach: record all modifications of a process’s state
• State-based approach: record the complete state at certain points
[Figure: taxonomy tree. Recovery splits into forward-error and backward-error recovery; backward-error recovery splits into the operation-based and state-based approaches.]
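The difference between the two backward-error approaches can be sketched as follows, assuming the process state is a plain dictionary. The names `OperationLog`, `take_snapshot`, and `restore` are hypothetical and only serve the illustration.

```python
import copy

_MISSING = object()  # marks "key did not exist before the write"


class OperationLog:
    """Operation-based: record every modification so that it can be undone."""

    def __init__(self, state):
        self.state = state
        self.undo = []  # list of (key, previous value)

    def write(self, key, value):
        self.undo.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def roll_back(self):
        # Undo the recorded modifications in reverse order.
        while self.undo:
            key, old = self.undo.pop()
            if old is _MISSING:
                del self.state[key]
            else:
                self.state[key] = old


def take_snapshot(state):
    """State-based: record the complete state at one point."""
    return copy.deepcopy(state)


def restore(snapshot):
    """State-based: return a fresh copy of the recorded state."""
    return copy.deepcopy(snapshot)
```

The operation log grows with every write but stores only what changed; the snapshot costs a full copy each time but makes restoring trivial.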
State-based approach
Terminology
• checkpointing: the process of saving state
• checkpoint: the recovery point at which checkpointing occurs
• rolling back: the process of restoring a process to a prior state
Checkpointing
Problems of naïve checkpointing
• Orphan messages and the domino effect
  • Orphan message: a message that makes the restored state inconsistent
  • Domino effect: a single rollback forces other processes to roll back in turn
• Lost messages
• Livelocks
Orphan message and Domino Effect
[Figure: time lines of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and a rollback]
• Orphan message: after the rollback, Y has not yet sent the message, but X has already received it.
• Domino effect: this rollback forces further rollbacks in the other processes.
Lost messages
[Figure: time lines of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and a rollback]
• Lost message: X has sent the message, but Y will never receive it.
Livelocks
[Figure: time lines of processes X and Y with checkpoints x1 and y1 and messages m1, m2, n1, n2]
• Livelock: the processes keep forcing each other to roll back, so the computation makes no progress.
Consistency of Checkpoint
• Strongly consistent set of checkpoints: no message crosses the set
• Consistent set of checkpoints: no message crosses the set backward (received before the receiver’s checkpoint but sent after the sender’s checkpoint)
[Figure: checkpoint sets {x1, y1, z1} and {x2, y2, z2} on the time lines of three processes; one set is labeled "strongly consistent", the other "consistent"]
• A set that is only consistent still needs to deal with lost messages
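The two definitions can be checked mechanically. The sketch below is a minimal illustration, assuming every send and receive event carries a timestamp on its process’s time line that can be compared with that process’s checkpoint time; the function name `classify` and the example data are hypothetical.

```python
def classify(checkpoints, messages):
    """Classify a set of checkpoints.

    checkpoints: {process: time of its checkpoint}
    messages:    list of (sender, send_time, receiver, receive_time)
    """
    crosses_forward = False
    for sender, sent, receiver, received in messages:
        sent_before = sent < checkpoints[sender]
        received_before = received < checkpoints[receiver]
        if received_before and not sent_before:
            # Recorded as received but not as sent: an orphan message.
            return "inconsistent"
        if sent_before and not received_before:
            # Recorded as sent but not as received: lost on rollback.
            crosses_forward = True
    return "consistent" if crosses_forward else "strongly consistent"
```

With messages `[("X", 1, "Y", 2), ("Y", 5, "Z", 7)]`, the checkpoints `{"X": 3, "Y": 3, "Z": 3}` are strongly consistent, `{"X": 3, "Y": 6, "Z": 6}` are consistent (the second message would be lost), and `{"X": 3, "Y": 4, "Z": 8}` are inconsistent (the second message becomes an orphan).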
Checkpoint/Recovery Algorithm
• Synchronous: with global synchronization at checkpointing
• Asynchronous: without global synchronization at checkpointing
Preliminary (Assumptions)
Goal
• To make a consistent global checkpoint
Assumptions
• Communication channels are FIFO
• No partition of the network
• End-to-end protocols cope with message loss due to rollback recovery and communication failure
• No failure during the execution of the algorithm
~ Synchronous Checkpoint ~
Preliminary (Two types of checkpoint)
• Tentative checkpoint: a temporary checkpoint; a candidate for a permanent checkpoint
• Permanent checkpoint: a local checkpoint at a process; a part of a consistent global checkpoint
Checkpoint Algorithm
Algorithm
1. An initiating process (the single process that invokes this algorithm) takes a tentative checkpoint.
2. It requests all the processes to take tentative checkpoints.
3. It waits for every process to report whether it succeeded in taking a tentative checkpoint.
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded.
5. It informs all the processes of the decision.
6. The processes that receive the decision act accordingly.
Supplement
• Once a process has taken a tentative checkpoint, it must not send messages until it is informed of the initiator’s decision.
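The six steps amount to a two-phase exchange, which can be sketched in a single address space as follows. This is only an illustration: real processes would exchange messages, and the class `Process`, its `can_checkpoint` switch, and the function `synchronous_checkpoint` are hypothetical names.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.state = 0
        self.tentative = None
        self.permanent = 0
        self.may_send = True

    def take_tentative(self):
        """Steps 1-2: try to take a tentative checkpoint, report success."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.may_send = False  # no sends until the decision arrives
        return True

    def decide(self, commit):
        """Step 6: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.may_send = True


def synchronous_checkpoint(initiator, others):
    # Phase 1 (steps 1-3): collect the result of every tentative checkpoint.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    # Step 4: commit only if every process succeeded.
    commit = all(replies)
    # Phase 2 (steps 5-6): inform everyone of the decision.
    for p in [initiator] + others:
        p.decide(commit)
    return commit
```

If any single process fails to take its tentative checkpoint, all tentative checkpoints are discarded and the previous permanent checkpoints stay in place.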
Diagram of Checkpoint Algorithm
[Figure: time lines of three processes. The initiator takes a tentative checkpoint and sends "request to take a tentative checkpoint"; the other processes reply "OK"; the initiator sends "decide to commit" and the tentative checkpoints become permanent checkpoints. Labels in the figure: "consistent global checkpoint" (twice) and "unnecessary checkpoint".]
Optimized Algorithm
Each message is labeled by its order of sending.
Labeling scheme
• ⊥ : smallest label
• ⊤ : largest label
• last_label_rcvd_X[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
• first_label_sent_X[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
• ckpt_cohort_X : the set of all processes that may have to take checkpoints when X decides to take a checkpoint.
[Figure: time lines of X and Y with checkpoints x2, x3, y1, y2]
• A checkpoint request needs to be sent only to the processes included in ckpt_cohort.
Optimized Algorithm
• ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
• Y takes a tentative checkpoint only if
  last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥
[Figure: time lines of X and Y marking last_label_rcvd_X[Y] and first_label_sent_Y[X]]
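The cohort definition and the checkpoint condition translate directly into two predicates. In this sketch labels are assumed to be positive integers and ⊥ is encoded as 0 (an assumption of the sketch); the function names are hypothetical.

```python
BOTTOM = 0  # assumed encoding of ⊥, the smallest label


def ckpt_cohort(last_label_rcvd_x):
    """last_label_rcvd_x: {Y: last_label_rcvd_X[Y]} for one process X."""
    return {y for y, label in last_label_rcvd_x.items() if label > BOTTOM}


def must_take_tentative(last_label_rcvd_x_y, first_label_sent_y_x):
    """True if Y has to take a tentative checkpoint on X's request."""
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM
```

For instance, `must_take_tentative(2, 1)` and `must_take_tentative(2, 2)` hold, while `must_take_tentative(2, 0)` does not, because Y has sent nothing to X since Y’s last checkpoint.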
Optimized Algorithm
Algorithm
1. An initiating process takes a tentative checkpoint.
2. It requests every p ∈ ckpt_cohort to take a tentative checkpoint (this message includes the sender’s last_label_rcvd[receiver]).
3. If a process that receives the request needs to take a checkpoint, it does the same as steps 1 and 2; otherwise, it returns an OK message.
4. Each such process waits to receive OK from every p ∈ ckpt_cohort.
5. If the initiator learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, that they should be discarded.
6. It informs every p ∈ ckpt_cohort of the decision.
7. The processes that receive the decision act accordingly.
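How the request of steps 1 to 3 spreads through the cohorts can be sketched as a graph traversal. The sketch assumes integer labels with ⊥ encoded as 0, ignores the OK/commit phase, and uses hypothetical names (`tentative_checkpointers`) and hypothetical label tables.

```python
BOTTOM = 0  # assumed encoding of ⊥


def tentative_checkpointers(initiator, last_label_rcvd, first_label_sent):
    """Return the set of processes that end up taking a tentative checkpoint.

    last_label_rcvd[x][y]  : last_label_rcvd_X[Y]
    first_label_sent[y][x] : first_label_sent_Y[X]
    Missing entries are treated as ⊥.
    """
    taking = {initiator}
    pending = [initiator]
    while pending:
        x = pending.pop()
        for y, rcvd in last_label_rcvd.get(x, {}).items():
            if rcvd <= BOTTOM or y in taking:
                continue  # y is outside ckpt_cohort_x or already handled
            first = first_label_sent.get(y, {}).get(x, BOTTOM)
            if rcvd >= first > BOTTOM:
                taking.add(y)      # y checkpoints and forwards the request
                pending.append(y)
    return taking
```

With `last_label_rcvd = {"A": {"B": 2, "C": 0}, "B": {"D": 1}}` and `first_label_sent = {"B": {"A": 1}, "D": {"B": 2}}`, a request started by A makes B checkpoint, while C is not in A’s cohort and D fails the condition, so only A and B take tentative checkpoints.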
Diagram of Optimized Algorithm
[Figure: time lines of processes A, B, C, D exchanging messages ab1, ac1, ac2, ba1, ba2, bd1, ca2, cb1, cb2, cd1, dc1, dc2. A tentative checkpoint is taken, requests are sent to the ckpt_cohort, "OK" is returned, the initiator decides to commit, and the tentative checkpoints become permanent checkpoints.]
Condition checked at each request (⊥ is written as 0): last_label_rcvd_X[Y] >= first_label_sent_Y[X] > ⊥
• 2 >= 1 > 0 : holds
• 2 >= 2 > 0 : holds
• 2 >= 0 > 0 : does not hold
ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
Correctness
A set of permanent checkpoints taken by this algorithm is consistent:
• No process sends messages after taking a tentative checkpoint until it receives the decision.
• New checkpoints include no message from the processes that do not take a checkpoint.
• The set of tentative checkpoints is either made permanent as a whole or discarded as a whole.
Recovery Algorithm
Labeling scheme
• ⊥ : smallest label
• ⊤ : largest label
• last_label_rcvd_X[Y] : the label of the last message that X received from Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
• first_label_sent_X[Y] : the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint; ⊥ if no such message exists.
• roll_cohort_X : the set of all processes that may have to roll back to the latest checkpoint when process X rolls back.
• last_label_sent_X[Y] : the label of the last message that X sent to Y before X took its latest permanent checkpoint; ⊤ if no such message exists.
~ Synchronous Recovery ~
Recovery Algorithm
• roll_cohort_X = { Y | X can send messages to Y }
• Y restarts from its permanent checkpoint only if
  last_label_rcvd_Y[X] > last_label_sent_X[Y]
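The rollback condition can be written as a one-line predicate. The sketch assumes integer labels, with ⊥ encoded as 0 and ⊤ as positive infinity; both encodings and the function name `must_roll_back` are assumptions of this illustration.

```python
import math

BOTTOM = 0      # assumed encoding of ⊥
TOP = math.inf  # assumed encoding of ⊤


def must_roll_back(last_label_rcvd_y_x, last_label_sent_x_y):
    """True if Y has received from X a message that X, after rolling back
    to its permanent checkpoint, no longer remembers sending."""
    return last_label_rcvd_y_x > last_label_sent_x_y
```

For example, `must_roll_back(2, 1)` holds, whereas `must_roll_back(BOTTOM, 1)` and `must_roll_back(BOTTOM, TOP)` do not.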
Recovery Algorithm
Algorithm
1. An initiator requests every p ∈ roll_cohort to prepare to roll back (this message includes the sender’s last_label_sent[receiver]).
2. If a process that receives the request needs to roll back, it does the same as step 1; otherwise, it returns an OK message.
3. Each such process waits to receive OK from every p ∈ roll_cohort.
4. If the initiator learns that every p ∈ roll_cohort has succeeded, it decides to roll back; otherwise, not to roll back.
5. It informs every p ∈ roll_cohort of the decision.
6. The processes that receive the decision act accordingly.
Diagram of Synchronous Recovery
[Figure: time lines of processes A, B, C, D exchanging messages ab1, ac1, ac2, ba1, ba2, bd1, cb1, cb2, dc1, dc2. A "request to roll back" is sent to the roll_cohort, "OK" is returned, and the initiator decides to roll back.]
Condition checked at each request: last_label_rcvd_Y[X] > last_label_sent_X[Y]
• 0 > 1 : does not hold
• 2 > 1 : holds
• 0 > ⊤ : does not hold
roll_cohort_X = { Y | X can send messages to Y }
Drawbacks of Synchronous Approach
• Additional messages are exchanged
• Synchronization delay
• An unnecessary extra load on the system if failures rarely occur
Asynchronous Checkpoint
Characteristics
• Each process takes checkpoints independently
• No guarantee that a set of local checkpoints is consistent
• A recovery algorithm has to search for a consistent set of checkpoints
Benefits
• No additional messages
• No synchronization delay
• Lighter load during normal execution
Preliminary (Assumptions)
Goal
• To find the latest consistent set of checkpoints
Assumptions
• Communication channels are FIFO
• Communication channels are reliable
• The underlying computation is event-driven
~ Asynchronous Checkpoint / Recovery~
Preliminary (Two types of log)
• At the receipt of each message, the event is saved in memory (volatile log)
• The volatile log is periodically flushed to disk (stable log) ⇔ checkpoint
• Volatile log: quick access; lost if the corresponding processor fails
• Stable log: slow access; not lost even if processors fail
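The relation between the two logs can be sketched as follows, assuming the stable log is a JSON file and a crash simply wipes what is held in memory. The class `EventLog` and its method names are hypothetical.

```python
import json
import os


class EventLog:
    def __init__(self, path):
        self.path = path      # location of the stable log on disk
        self.volatile = []    # volatile log, kept in memory

    def record(self, event):
        """Save an event in memory at the receipt of a message."""
        self.volatile.append(event)

    def flush(self):
        """Append the volatile log to the stable log (a checkpoint)."""
        stable = self.load_stable(self.path)
        stable.extend(self.volatile)
        with open(self.path, "w") as f:
            json.dump(stable, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to write the data to disk
        self.volatile.clear()

    def crash(self):
        """Simulate a processor failure: the volatile log is lost."""
        self.volatile = []

    @staticmethod
    def load_stable(path):
        if not os.path.exists(path):
            return []
        with open(path) as f:
            return json.load(f)
```

Events recorded after the last `flush()` disappear on `crash()`, which is why a failed process can only restart from its stable log.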
Preliminary (Definition)
Definition
• CkPt_i : the checkpoint (stable log) that processor i rolls back to when a failure occurs
• RCVD_i←j(CkPt_i / e) : the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e
• SENT_i→j(CkPt_i / e) : the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e
Recovery Algorithm
Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt.
2. It broadcasts a message saying that it has failed. The other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes.
4. Each process waits for the SENT(CkPt) messages from every neighbor.
5. On receiving SENT_j→i(CkPt_j) from j, if i notices that RCVD_i←j(CkPt_i) > SENT_j→i(CkPt_j), it rolls back to the latest event e such that RCVD_i←j(e) = SENT_j→i(CkPt_j).
6. Steps 3, 4, and 5 are repeated N times (N is the number of processes).
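The rounds of steps 3 to 6 can be sketched in a single address space as follows. The sketch assumes that every process keeps a list of logged events with cumulative SENT and RCVD counters, that the first event of each list has received nothing, and that each event receives at most one message; the names `event` and `async_recover` are hypothetical.

```python
def event(sent=None, rcvd=None):
    """One logged event with cumulative per-neighbor message counters."""
    return {"sent": sent or {}, "rcvd": rcvd or {}}


def async_recover(events, start):
    """events[p]: logged events of process p, oldest first.
    start[p]:  index of the event p initially rolls back to
               (stable checkpoint for the failed process, latest event
               for the others).
    Returns {p: index of the event p finally restarts from}.
    """
    ckpt = dict(start)
    procs = list(events)
    for _ in range(len(procs)):  # step 6: repeat N times
        # Steps 3-4: every j announces SENT_j->i(CkPt_j) to its neighbors.
        announced = {j: events[j][ckpt[j]]["sent"] for j in procs}
        for i in procs:
            for j in procs:
                if i == j:
                    continue
                sent_ji = announced[j].get(i, 0)
                # Step 5: go back while i remembers more receipts from j
                # than j remembers sending to i.
                while (ckpt[i] > 0
                       and events[i][ckpt[i]]["rcvd"].get(j, 0) > sent_ji):
                    ckpt[i] -= 1
    return ckpt
```

Because one rollback can invalidate receipts at a third process only in the following round, the loop runs N times rather than once.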
Asynchronous Recovery
[Figure: time lines of processes X, Y, Z with events Ex0 to Ex3, Ey0 to Ey3, Ez0 to Ez2 and checkpoints x1, y1, z1. Messages are annotated (Y,2), (Y,1), (X,2), (X,0), (Z,0), (Z,1). For each ordered pair of processes (X:Y, X:Z, Y:X, Y:Z, Z:X, Z:Y) the figure evaluates the check RCVD_i←j(CkPt_i) <= SENT_j→i(CkPt_j); the values shown are 3 <= 2, 2 <= 2, 0 <= 0, 1 <= 2, 1 <= 1, 0 <= 0, 2 <= 1, 1 <= 1. A check that fails, such as 3 <= 2 or 2 <= 1, triggers a rollback.]