chapter 7 synchronization. topics physical clock synchronization logical clock synchronization...
Post on 18-Dec-2015
250 views
TRANSCRIPT
Chapter 7
Synchronization
Topics Physical clock synchronization Logical clock synchronization
Causality relation Lamport’s logical clock Vector logical clock Multicast ISIS vector clock
Snapshot
New Issues in DS Global time Event order
e1 at 1:00pm on machine m1, e2 at 1:01pm at machine m2. Which event happens earlier?
Global state snapshot
Mutual exclusion & Synchronization
Time and Clock Two roles of time - Defines temporal order among events - Duration (measured by timer) UTC (Universal Coordinated Time) is based on Cesium-133 atom oscillation; located
at over 200 labs in the world With satellites, 0.5ms accuracy is possible. (100 MIPS 50,000 instructions in
0.5ms).
Clock Skew Skew: Clock reading (from single
clock) is location-dependent, e.g., distance from satellite or clock source on a circuit board
Drift: Multiple clocks. t = the real time Cp(t) = the reading of a clock
p at time t (Cp(t)= t for ideal clock)
dCp(t) /dt = ticking rate (dCp(t) /dt = 1 for ideal clock)
Consequence: An Example When each machine has its own clock, an event
that occurred after another event may nevertheless be assigned an earlier time. But it is a different story in DS.
Time
Physical Clock Synchronization Cristian’s Algorithm The Berkeley Algorithm Network Time Protocol (NTP) OSF DCE
Cristian’s Algorithm: Architecture WWV-node receiving UTC-signals,
serving as the central UTC-time server (CUTCS) for the DS WWV is a short wave radio station in
Colorado. Periodically every node in the DS sends
a time request to the central UTC time server CUTCS.
The CUTCS responds with its current time tUTC
Adjust Time When the client gets the reply, it
simply set its clock to tUTC Time may run backward
Introduce the change gradually Consider the time for message
propagation.
Adjust Time (continued)
Tp
Tp
T0
T1
I
t
Client Server Estimate Tp (propagation delay) from T1 – T0 = 2 x Tp + I. where I = processing time. Current time = t (server’s time in message) + Tp.
Assumption?
The Berkeley Algorithm In Cristian’s algorithm the central time
server was passive Now it’s active, i.e. it periodically polls
all other nodes to hand out their current local times ci(t).
Based on the answers it calculates a mean and tells all other machines to advance or slow down their clocks accordingly.
Relative Clock Synchronization Time server periodically sends its time
to clients and asks for theirs. Clients respond with how far ahead or
behind the server they are Time server uses the estimated local
times for building the arithmetic mean Deviations from this arithmetic mean
are sent to nodes enabling them to slow down respectively to speed up.
The Berkeley Algorithm
a) The time daemon asks all the other machines for their clock valuesb) The machines answerc) The time daemon tells everyone how to adjust their clock
Summary Cristian’s method and the Berkeley
algorithm are intended for intranets Both may be improved with fault
tolerance methods Instead of one UTC server in Cristian’s
algorithm use n time servers and always take the first answer from whatever time serve
Instead of taking the arithmetic mean from all clients in the Berkeley algorithm take the fault tolerance mean, i.e. skip deviations with a certain threshold
Network Time Protocol (NTP) Goal
absolute (UTC)-time service in large nets (e.g. Internet) high availability (via fault tolerance) protection against fabrication (via authentication)
Architecture time-servers build up a hierarchical synchronization
subnet all primary servers have an UTC-receiver secondary servers are synchronized by their parent
primary server all other stations are leaves on level 3 being
synchronized by level 2 time servers accuracy of clocks decreases with increasing level
number the net is able to reconfigure
NTP
Reliability from redundant paths, scalability, authenticated time sources
Synchronization of Servers (NTP) Synchronization subnet can reconfigure if failures
occur,e.g. Primary having lost its UTC source can become a
secondary Secondary having lost its primary can use another one
Modes of synchronization: Multicast mode (for quick LANs, low accuracy)
A server within a LAN periodically multicasts time to other leaves in the LAN which set their clocks assuming some delay
Procedure-call mode (Cristian’s algorithm with medium accuracy)
A server responds to requests with its actual timestamp Symmetric mode (high accuracy)
Pairs of servers exchange message containing times
OSF DCE
TS1TS2TS3TS4
New time intervalReject
Time is an interval [t-e, t+e].
Two intervals overlap cannot say which time is earlier(In case of overlap, Unix make should recompile).
Logical Clock Synchronization A powerful building block in DS
Duplicate detection Cache consistency Leases Commitment …
Leslie Lamport
Time, Clocks, and the Ordering of Events in a Distributed System
Microsoft
Best known for his work on
1)Temporal logic
2)LaTeX
The Paper Handles problems of clock drift in a
DS Identifies main function of computer
clocks, i.e. ordering of events Indicates which conditions clocks
must satisfy to fulfill their role Introduces logical clocks Benefits of logical clocks?
Needed for determining causality
Logical Time In many cases it’s sufficient just to order
the related relevant events, i.e. we want to be able to position these events relatively, but not absolutely.
Interesting: Relative position of an event on the time axis no need for any scaling on this time axis
Ordering Events Event ordering linked with concept
of causality: Saying that event a happened before
event b is same as saying that event a could have affected the outcome of event b
If events a and b happen on processes that do not exchange any data, their exact ordering is not important.
Causality Relationship
P
Q
R
p1 p2p3 p4
q1
q2 q3 q4
r1 r2 r3 r4
q5
p1 q2p1p2transitivep1 r3
Time
• Event changes state of process.• State remains same till next event occurs.
Formal Definition a b defined by:
If a and b are in the same process, and a occurs earlier than b, then ab.
If a is a sending event and b is receiving event of same message, then a b.
If ab and bc, then ac. [Transitive] If a b, then a causally precedes (or
happened before) b; a and b are causally related
a and b are concurrent if neither ab nor ba.
Message-Related Events Sending event Receiving event Message arrival (at kernel) and
delivery (to user process): Kernel can
control timing of delivery after arrival.
Example
e11e12e21e22e32 , furthermore e31e32,
whereas e31 is neither related (has happened before) to e11, nor to e12, nor to e21, nor to e22.
e31 is concurrent to e11, e12, e21, and e22.
Time
Lamport Clock Suppose E= {events}, each e in E gets a
Lamport time stamp L(e), as follows: 1. e is a pure local event or a sending-event:
if e has no local predecessor, then L(e) := 1, otherwise there is a local predecessor e’, thus L(e) := L(e’) + 1
2. e is a receiving event, with a corresponding sending-event s: if e has no local predecessor, then L(e) = L(s) +1, otherwise there is a local predecessor e’, thus L(e) := max{L(s),L(e’)} + 1
Example
Note: Each local counter is incremented with each local event. In a communication we adjust the involved counters of the two communicating nodes to be consistent with the happened-before-relation.
Remark: Same mechanism can be used to adjust clocks on different nodes. The Lamport time is consistent with the happened-before-relation, i.e. if xy, then L( x)<L( y), but not vice versa.
Adjusting Clocks
Without adjusting local clocks
With adjusting local clocks
Limitation on Lamport Clocks
From Lamport time values you cannot conclude whether two events are in the happen-before relationship,e.g. e11 ande32.
Total Ordering of Events Lamport-time only gives us a partial-
ordering of distributed events. To implement the total ordering:
Each processor is assigned by a unique id (integer)
Given two events e1 and e2, e1 is ordered before e2 if L(e1) < L(e2) or L(e1)=L(e2) and Id(e1) < Id(e2)
Holding Back Deliveries Delay the delivery of messages
that arrived “too soon” Useful when delivering messages
from kernel to processes Hold back the delivery of M to process
P until there is a guarantee that no message M’ with L(M’) < L(M) will arrive at P in the future.
Implementation Assumption: messages from a particular
source arrive in the FIFO order Each site maintains a set of message
queues, one for each other site When a message arrives, placed in the
corresponding queue When all queues are non-empty, compare
the timestamps of the messages at the heads of the queues, and deliver the messages with the oldest timestamp.
Limitation All message queues need be non-
empty. Normally not true. Require multicast to solve the
problem. With Lamport clock, L(a)<L(b) does
not mean ab. Unnecessarily delay some messages. Vector Clock.
Event CounterP1 P2 P3
1
221
1
2
3
4
3Event counter at Pi : Initialized at 0 andincremented for each event
Vector Time Assumption: n tasks (processes) Pi in DS Each Pi has its own local clock being a
n-dimensional vector (initially zeroed) Vi(a) is timestamp of event a at process
Pi
Vi[i] = number of local events at Pi
Vi[j] is Pi best guess of how many events have been on Pj
Rules There is a DS with n distributed processes. n-
dimensional vector Vi is vector-time of process Pi if it is built according to the following rules:
(1) Initially, Vi = (0, …, 0) for all 1<=i<=n (2) For a local event on process Pi: Vi[i]++ (3) Pi includes the value t = Vi in every msg m (4) When Pj receives a message m with timestamp t,
it sets Vj[k]= max{t[k], Vj [k]}, for 1<=k<=n and k != j
Communication cost? Little overhead compared to Lamport clock
ExampleP1 P2 P3
100 001
300
200
242
243250
260450
550
000 000 000010
220
264273
230240
Time
M1
M2
M 3
Notation We define global V(e) = Vi(e) if event e
happens in Pi
We write V(a) V(b) if V(a)[k] V(b)[k] for all k. Here V(a)[k] denotes the kth component of
V(a). We write V(a) < V(b) if
V(a)[k] V(b)[k] for all k, and V(a)[j] < V(b)[j] for at least one j
Vector Time Characteristics The following inter relationships
between causality or the happened-before relation and vector-time hold: A.) ee’ iff V(e) < V(e) B.) e||e’ iff V(e) ||V(e’)
The vector-time is the best known estimation for global sequencing that is based only on local information.
Proof a b iff V(a) < V(b) Proof :For A fixed b, a b iff a is in
shaded area iff each component of V(a) corresponding component of V(b).
P1 P2 P3
t2
t1 t3
b
a
Multicasting A message is sent to all the
members of a group Sending video stream to a set of
customers Implementing a chat program Sending updates to a group of replica
managers
IPv4 Multicast Addresses Class D (starts with bit
sequence1110) 224.0.0.1 to 239.255.255.255
(about 228268 million) 224.0.0.1 is for “all systems on this
subnet” 224.2.0.0 ~ 224.2.127.253 are for
multimedia conference calls
Causal Ordering of Messages Suppose m1 and m2 are two
messages being received at the same node i. A set of messages is causally ordered if for all pairs <m1, m2> the following holds:
send(m1) send(m2) receive(m1)receive(m2)
Causality Violation Suppose M1’s sending event happened before M3’s sending
event. Causality violation occurs if M1 is delivered after M3 (In particular, non-FIFO delivery is causality violation).
P1 P2 P3Migratefoo to P2 “Do you have foo?”
“Given to P2 (=M2)”
“Do you have foo? (=M3)”M1
“Nope” Time
•Delay the delivery of M3 to P2 until M1 arrives.
•ISIS system using multicast
Formal Description of ISIS Clock ICi
Pi initializes its clock ICi = [0,…,0]. For each msg sending event by Pi
ICi[i]++ Pi attaches ICi to message it sends.
Upon receiving msg M from Pj with M.ts, Pi checks if 1) M.ts[j] == ICi[j] + 1 (M is next msg expected from
Pj) 2) ICi[k] M.ts[k] for all other k (all msgs from Pk that
sender Pj has received have been received by Pi) If both are satisfied, Pi delivers M after ICi[j]++ Otherwise, Pi puts M in hold-back Q until they are satisfied.
ExampleMigrate
foo to P2
“Where is foo?”
“foo is at P2”
P1 P2 P3
Time
100 001
201101
000
M1M2
M2.ts[1] > IC3[1]+1
Put M2 in Hold-back Q
M1.ts[1] = IC3[1]+1
IC3 101; deliver M1
IC3 201; deliver M2
• Note: jth component of M.ts is sequence number of latest msg sent
by Pj that is known to sender of M
Safety Show that msgs are delivered in timestamp order. Suppose not
Let m (m’) be event of sending message M (M’) Assume Pi delivered msg M (from Pk) before M’ (from Pj),
even though M’.ts (= ICj(m’)) < M.ts (=ICk(m)) …….(A) (1)
(a) Just before Pi delivered M’: ICi[j]+1 = ICj(m’)[j] hence ICi[j] < ICj(m’)[j] (2)
(b) Delivery of M would have resulted in ICi[j]* = ICk(m)[j]at time of delivery
(a) and (b) contradict (A) since (b) took place before (a), hence ICi[j]* ICi[j]
Liveness Show the system starvation-free:
no message will wait forever in the hold-back Q
Assume Q is the hold-back queue in Pi and is non-empty. Let M be a msg in Q which is not preceded by any other msg in Q. Suppose M was sent by Pj.
Proof Assume ICi[k] < M.ts[k] for some k (!=j), i.e., condition
(2) is violated; want to derive contradiction from this. Let M’ be latest msg from Pk that Pj delivered prior to
sending M so that M’.ts < M.ts and M’.ts[k] = M.ts[k]. If Pi hasn’t delivered copy of msg M’ from Pk , then M’
with M’.ts < M.ts is in holdback Q of Pi, contradicting assumption that M is not causally preceded by any other msg in holdback Q of Pi.
So Pi must have delivered copy of msg M’ from Pk.Thus ICi[k] M’.ts[k] = M.ts[k], contradicting ICi[k] < M.ts[k]
Must give up assumption that Pi cannot deliver M.
Proof Illustration
Pj Pk
Pi
M’
M’M
M.ts>M’.ts
Pi already delivered M’ ICi[k] M.ts[k]
Global state of a DS Consists of:
Local state of each node (task, process) Messages in transit
Why interested in a global state? Suppose local computation has stopped on each
node and there are no pending messages, then 1. Distributed application has terminated successfully?
or 2.Deadlock?
Problem: lack of global time Consequences?
How: take a snapshot
Snapshots (taken at 2:00pm by local clocks)
$100 $0
$100
$0 $100
$0
A B B BA A
1:59pm
2:01
(a) (b) (c)
sum = $100 sum = $0 sum = $200
Snapshots taken at
$100In channel$100 $100
2:00pm
2:00pm
Census Taking in Ancient Kingdom
Village
Village
Village
Village
Want to take census counting all people, some of whom may be traveling on highways.
Census Taking Algorithm Close all gates into/out of each village
(process) and count people (record process state) in village
Open each outgoing gate and send official with a red cap (special marker message).
Open each incoming gate and count all travelers (record channel state= messages sent but not received yet) who arrive ahead of official (with a red cap).
Tally the counts from all villages.
Chandy/Lamport Snapshot Algorithm
All processes are initially white: Messages sent by white(red) processes are also white (red)
MSend [Marker sending rule for process P] Suspend all other activities until done Record P’s state Turn red Send one marker over each output channel of P.
MReceive [Marker receiving rule for P] On receiving marker over channel C,
if P is white { Record state of channel C as empty; Invoke MSend; }
else record the state of C as sequence of white messages received since P turned red.
Stop when marker is received on each incoming channel MSend and MReceive are atomic.
Assumptions No process failures, no message
loss Point to point message delivery is
ordered How to guarantee it? ISIS clock
Network is strongly connected. Why?
Snapshot (1)
$100 $0 $0$0
A B BA
(a) (b)
sum = $100 sum = $100
OK OKNeed not use time.
$100 inchannel
msgs arrivingbefore maker
constitutechannel state
$100 inchannel
marker
Snapshots (2)
$100
$100
BA
(c)
sum = $200
Cannot happen
$100
$100
BA
(d)
sum = $100
$0
Will be like this
marker marker
marker$100 inchannel
Cuts Cut C divides all events to PC (those
which happened in the past relative to C) and FC (future events).
Cut C is consistent if there is no message whose sending event is in FC and whose receiving event is in PC.
Progress shown by cuts
P
Q
p1 p2 p3p4
q1 q2 q3
Time
There are 5*4 = 20 possible cuts.
1 2 3 4 5 7 8
Example
P
Q
R
p1 p2p3 p4
q1
q2 q3 q4
r1 r2 r3r4
q5
Time
consistent cut
inconsistent cut
M
State recorded by SNAPSHOT consistent cut
Checkpoint Cut C is consistent C doesn’t
contradict sequence of events experienced by any site can assume it did exist at the same time
Can use snapshot as checkpoint, from which activity in distributed system can be resumed after crash