chapter 7 synchronization. topics physical clock synchronization logical clock synchronization...

Chapter 7

Synchronization

Topics Physical clock synchronization Logical clock synchronization

Causality relation Lamport’s logical clock Vector logical clock Multicast ISIS vector clock

Snapshot

New Issues in DS Global time Event order

e1 at 1:00pm on machine m1, e2 at 1:01pm at machine m2. Which event happens earlier?

Global state snapshot

Mutual exclusion & Synchronization

Time and Clock Two roles of time - Defines temporal order among events - Duration (measured by timer) UTC (Universal Coordinated Time) is based on Cesium-133 atom oscillation; located

at over 200 labs in the world With satellites, 0.5ms accuracy is possible. (100 MIPS 50,000 instructions in

0.5ms).

Clock Skew Skew: Clock reading (from single

clock) is location-dependent, e.g., distance from satellite or clock source on a circuit board

Drift: Multiple clocks. t = the real time Cp(t) = the reading of a clock

p at time t (Cp(t)= t for ideal clock)

dCp(t) /dt = ticking rate (dCp(t) /dt = 1 for ideal clock)

Consequence: An Example When each machine has its own clock, an event

that occurred after another event may nevertheless be assigned an earlier time. But it is a different story in DS.

Time

Physical Clock Synchronization Cristian’s Algorithm The Berkeley Algorithm Network Time Protocol (NTP) OSF DCE

Cristian’s Algorithm: Architecture WWV-node receiving UTC-signals,

serving as the central UTC-time server (CUTCS) for the DS WWV is a short wave radio station in

Colorado. Periodically every node in the DS sends

a time request to the central UTC time server CUTCS.

The CUTCS responds with its current time tUTC

Adjust Time When the client gets the reply, it

simply set its clock to tUTC Time may run backward

Introduce the change gradually Consider the time for message

propagation.

Adjust Time (continued)

Tp

Tp

T0

T1

I

t

Client Server Estimate Tp (propagation delay) from T1 – T0 = 2 x Tp + I. where I = processing time. Current time = t (server’s time in message) + Tp.

Assumption?

The Berkeley Algorithm In Cristian’s algorithm the central time

server was passive Now it’s active, i.e. it periodically polls

all other nodes to hand out their current local times ci(t).

Based on the answers it calculates a mean and tells all other machines to advance or slow down their clocks accordingly.

Relative Clock Synchronization Time server periodically sends its time

to clients and asks for theirs. Clients respond with how far ahead or

behind the server they are Time server uses the estimated local

times for building the arithmetic mean Deviations from this arithmetic mean

are sent to nodes enabling them to slow down respectively to speed up.

The Berkeley Algorithm

a) The time daemon asks all the other machines for their clock valuesb) The machines answerc) The time daemon tells everyone how to adjust their clock

Summary Cristian’s method and the Berkeley

algorithm are intended for intranets Both may be improved with fault

tolerance methods Instead of one UTC server in Cristian’s

algorithm use n time servers and always take the first answer from whatever time serve

Instead of taking the arithmetic mean from all clients in the Berkeley algorithm take the fault tolerance mean, i.e. skip deviations with a certain threshold

Network Time Protocol (NTP) Goal

absolute (UTC)-time service in large nets (e.g. Internet) high availability (via fault tolerance) protection against fabrication (via authentication)

Architecture time-servers build up a hierarchical synchronization

subnet all primary servers have an UTC-receiver secondary servers are synchronized by their parent

primary server all other stations are leaves on level 3 being

synchronized by level 2 time servers accuracy of clocks decreases with increasing level

number the net is able to reconfigure

NTP

Reliability from redundant paths, scalability, authenticated time sources

Synchronization of Servers (NTP) Synchronization subnet can reconfigure if failures

occur,e.g. Primary having lost its UTC source can become a

secondary Secondary having lost its primary can use another one

Modes of synchronization: Multicast mode (for quick LANs, low accuracy)

A server within a LAN periodically multicasts time to other leaves in the LAN which set their clocks assuming some delay

Procedure-call mode (Cristian’s algorithm with medium accuracy)

A server responds to requests with its actual timestamp Symmetric mode (high accuracy)

Pairs of servers exchange message containing times

OSF DCE

TS1TS2TS3TS4

New time intervalReject

Time is an interval [t-e, t+e].

Two intervals overlap cannot say which time is earlier(In case of overlap, Unix make should recompile).

Logical Clock Synchronization A powerful building block in DS

Duplicate detection Cache consistency Leases Commitment …

Leslie Lamport

Time, Clocks, and the Ordering of Events in a Distributed System

Microsoft

Best known for his work on

1)Temporal logic

2)LaTeX

The Paper Handles problems of clock drift in a

DS Identifies main function of computer

clocks, i.e. ordering of events Indicates which conditions clocks

must satisfy to fulfill their role Introduces logical clocks Benefits of logical clocks?

Needed for determining causality

Logical Time In many cases it’s sufficient just to order

the related relevant events, i.e. we want to be able to position these events relatively, but not absolutely.

Interesting: Relative position of an event on the time axis no need for any scaling on this time axis

Ordering Events Event ordering linked with concept

of causality: Saying that event a happened before

event b is same as saying that event a could have affected the outcome of event b

If events a and b happen on processes that do not exchange any data, their exact ordering is not important.

Causality Relationship

P

Q

R

p1 p2p3 p4

q1

q2 q3 q4

r1 r2 r3 r4

q5

p1 q2p1p2transitivep1 r3

Time

• Event changes state of process.• State remains same till next event occurs.

Formal Definition a b defined by:

If a and b are in the same process, and a occurs earlier than b, then ab.

If a is a sending event and b is receiving event of same message, then a b.

If ab and bc, then ac. [Transitive] If a b, then a causally precedes (or

happened before) b; a and b are causally related

a and b are concurrent if neither ab nor ba.

Message-Related Events Sending event Receiving event Message arrival (at kernel) and

delivery (to user process): Kernel can

control timing of delivery after arrival.

Example

e11e12e21e22e32 , furthermore e31e32,

whereas e31 is neither related (has happened before) to e11, nor to e12, nor to e21, nor to e22.

e31 is concurrent to e11, e12, e21, and e22.

Time

Lamport Clock Suppose E= {events}, each e in E gets a

Lamport time stamp L(e), as follows: 1. e is a pure local event or a sending-event:

if e has no local predecessor, then L(e) := 1, otherwise there is a local predecessor e’, thus L(e) := L(e’) + 1

2. e is a receiving event, with a corresponding sending-event s: if e has no local predecessor, then L(e) = L(s) +1, otherwise there is a local predecessor e’, thus L(e) := max{L(s),L(e’)} + 1

Example

Note: Each local counter is incremented with each local event. In a communication we adjust the involved counters of the two communicating nodes to be consistent with the happened-before-relation.

Remark: Same mechanism can be used to adjust clocks on different nodes. The Lamport time is consistent with the happened-before-relation, i.e. if xy, then L( x)<L( y), but not vice versa.

Adjusting Clocks

Without adjusting local clocks

With adjusting local clocks

Limitation on Lamport Clocks

From Lamport time values you cannot conclude whether two events are in the happen-before relationship,e.g. e11 ande32.

Total Ordering of Events Lamport-time only gives us a partial-

ordering of distributed events. To implement the total ordering:

Each processor is assigned by a unique id (integer)

Given two events e1 and e2, e1 is ordered before e2 if L(e1) < L(e2) or L(e1)=L(e2) and Id(e1) < Id(e2)

Holding Back Deliveries Delay the delivery of messages

that arrived “too soon” Useful when delivering messages

from kernel to processes Hold back the delivery of M to process

P until there is a guarantee that no message M’ with L(M’) < L(M) will arrive at P in the future.

Implementation Assumption: messages from a particular

source arrive in the FIFO order Each site maintains a set of message

queues, one for each other site When a message arrives, placed in the

corresponding queue When all queues are non-empty, compare

the timestamps of the messages at the heads of the queues, and deliver the messages with the oldest timestamp.

Limitation All message queues need be non-

empty. Normally not true. Require multicast to solve the

problem. With Lamport clock, L(a)<L(b) does

not mean ab. Unnecessarily delay some messages. Vector Clock.

Event CounterP1 P2 P3

1

221

1

2

3

4

3Event counter at Pi : Initialized at 0 andincremented for each event

Vector Time Assumption: n tasks (processes) Pi in DS Each Pi has its own local clock being a

n-dimensional vector (initially zeroed) Vi(a) is timestamp of event a at process

Pi

Vi[i] = number of local events at Pi

Vi[j] is Pi best guess of how many events have been on Pj

Rules There is a DS with n distributed processes. n-

dimensional vector Vi is vector-time of process Pi if it is built according to the following rules:

(1) Initially, Vi = (0, …, 0) for all 1<=i<=n (2) For a local event on process Pi: Vi[i]++ (3) Pi includes the value t = Vi in every msg m (4) When Pj receives a message m with timestamp t,

it sets Vj[k]= max{t[k], Vj [k]}, for 1<=k<=n and k != j

Communication cost? Little overhead compared to Lamport clock

ExampleP1 P2 P3

100 001

300

200

242

243250

260450

550

000 000 000010

220

264273

230240

Time

M1

M2

M 3

Notation We define global V(e) = Vi(e) if event e

happens in Pi

We write V(a) V(b) if V(a)[k] V(b)[k] for all k. Here V(a)[k] denotes the kth component of

V(a). We write V(a) < V(b) if

V(a)[k] V(b)[k] for all k, and V(a)[j] < V(b)[j] for at least one j

Vector Time Characteristics The following inter relationships

between causality or the happened-before relation and vector-time hold: A.) ee’ iff V(e) < V(e) B.) e||e’ iff V(e) ||V(e’)

The vector-time is the best known estimation for global sequencing that is based only on local information.

Proof a b iff V(a) < V(b) Proof :For A fixed b, a b iff a is in

shaded area iff each component of V(a) corresponding component of V(b).

P1 P2 P3

t2

t1 t3

b

a

Multicasting A message is sent to all the

members of a group Sending video stream to a set of

customers Implementing a chat program Sending updates to a group of replica

managers

IPv4 Multicast Addresses Class D (starts with bit

sequence1110) 224.0.0.1 to 239.255.255.255

(about 228268 million) 224.0.0.1 is for “all systems on this

subnet” 224.2.0.0 ~ 224.2.127.253 are for

multimedia conference calls

Causal Ordering of Messages Suppose m1 and m2 are two

messages being received at the same node i. A set of messages is causally ordered if for all pairs <m1, m2> the following holds:

send(m1) send(m2) receive(m1)receive(m2)

Causality Violation Suppose M1’s sending event happened before M3’s sending

event. Causality violation occurs if M1 is delivered after M3 (In particular, non-FIFO delivery is causality violation).

P1 P2 P3Migratefoo to P2 “Do you have foo?”

“Given to P2 (=M2)”

“Do you have foo? (=M3)”M1

“Nope” Time

•Delay the delivery of M3 to P2 until M1 arrives.

•ISIS system using multicast

Formal Description of ISIS Clock ICi

Pi initializes its clock ICi = [0,…,0]. For each msg sending event by Pi

ICi[i]++ Pi attaches ICi to message it sends.

Upon receiving msg M from Pj with M.ts, Pi checks if 1) M.ts[j] == ICi[j] + 1 (M is next msg expected from

Pj) 2) ICi[k] M.ts[k] for all other k (all msgs from Pk that

sender Pj has received have been received by Pi) If both are satisfied, Pi delivers M after ICi[j]++ Otherwise, Pi puts M in hold-back Q until they are satisfied.

ExampleMigrate

foo to P2

“Where is foo?”

“foo is at P2”

P1 P2 P3

Time

100 001

201101

000

M1M2

M2.ts[1] > IC3[1]+1

Put M2 in Hold-back Q

M1.ts[1] = IC3[1]+1

IC3 101; deliver M1

IC3 201; deliver M2

• Note: jth component of M.ts is sequence number of latest msg sent

by Pj that is known to sender of M

Safety Show that msgs are delivered in timestamp order. Suppose not

Let m (m’) be event of sending message M (M’) Assume Pi delivered msg M (from Pk) before M’ (from Pj),

even though M’.ts (= ICj(m’)) < M.ts (=ICk(m)) …….(A) (1)

(a) Just before Pi delivered M’: ICi[j]+1 = ICj(m’)[j] hence ICi[j] < ICj(m’)[j] (2)

(b) Delivery of M would have resulted in ICi[j]* = ICk(m)[j]at time of delivery

(a) and (b) contradict (A) since (b) took place before (a), hence ICi[j]* ICi[j]

Liveness Show the system starvation-free:

no message will wait forever in the hold-back Q

Assume Q is the hold-back queue in Pi and is non-empty. Let M be a msg in Q which is not preceded by any other msg in Q. Suppose M was sent by Pj.

Proof Assume ICi[k] < M.ts[k] for some k (!=j), i.e., condition

(2) is violated; want to derive contradiction from this. Let M’ be latest msg from Pk that Pj delivered prior to

sending M so that M’.ts < M.ts and M’.ts[k] = M.ts[k]. If Pi hasn’t delivered copy of msg M’ from Pk , then M’

with M’.ts < M.ts is in holdback Q of Pi, contradicting assumption that M is not causally preceded by any other msg in holdback Q of Pi.

So Pi must have delivered copy of msg M’ from Pk.Thus ICi[k] M’.ts[k] = M.ts[k], contradicting ICi[k] < M.ts[k]

Must give up assumption that Pi cannot deliver M.

Proof Illustration

Pj Pk

Pi

M’

M’M

M.ts>M’.ts

Pi already delivered M’ ICi[k] M.ts[k]

Global state of a DS Consists of:

Local state of each node (task, process) Messages in transit

Why interested in a global state? Suppose local computation has stopped on each

node and there are no pending messages, then 1. Distributed application has terminated successfully?

or 2.Deadlock?

Problem: lack of global time Consequences?

How: take a snapshot

Snapshots (taken at 2:00pm by local clocks)

$100 $0

$100

$0 $100

$0

A B B BA A

1:59pm

2:01

(a) (b) (c)

sum = $100 sum = $0 sum = $200

Snapshots taken at

$100In channel$100 $100

2:00pm

2:00pm

Census Taking in Ancient Kingdom

Village

Village

Village

Village

Want to take census counting all people, some of whom may be traveling on highways.

Census Taking Algorithm Close all gates into/out of each village

(process) and count people (record process state) in village

Open each outgoing gate and send official with a red cap (special marker message).

Open each incoming gate and count all travelers (record channel state= messages sent but not received yet) who arrive ahead of official (with a red cap).

Tally the counts from all villages.

Chandy/Lamport Snapshot Algorithm

All processes are initially white: Messages sent by white(red) processes are also white (red)

MSend [Marker sending rule for process P] Suspend all other activities until done Record P’s state Turn red Send one marker over each output channel of P.

MReceive [Marker receiving rule for P] On receiving marker over channel C,

if P is white { Record state of channel C as empty; Invoke MSend; }

else record the state of C as sequence of white messages received since P turned red.

Stop when marker is received on each incoming channel MSend and MReceive are atomic.

Assumptions No process failures, no message

loss Point to point message delivery is

ordered How to guarantee it? ISIS clock

Network is strongly connected. Why?

Snapshot (1)

$100 $0 $0$0

A B BA

(a) (b)

sum = $100 sum = $100

OK OKNeed not use time.

$100 inchannel

msgs arrivingbefore maker

constitutechannel state

$100 inchannel

marker

Snapshots (2)

$100

$100

BA

(c)

sum = $200

Cannot happen

$100

$100

BA

(d)

sum = $100

$0

Will be like this

marker marker

marker$100 inchannel

Cuts Cut C divides all events to PC (those

which happened in the past relative to C) and FC (future events).

Cut C is consistent if there is no message whose sending event is in FC and whose receiving event is in PC.

Progress shown by cuts

P

Q

p1 p2 p3p4

q1 q2 q3

Time

There are 5*4 = 20 possible cuts.

1 2 3 4 5 7 8

Example

P

Q

R

p1 p2p3 p4

q1

q2 q3 q4

r1 r2 r3r4

q5

Time

consistent cut

inconsistent cut

M

State recorded by SNAPSHOT consistent cut

Checkpoint Cut C is consistent C doesn’t

contradict sequence of events experienced by any site can assume it did exist at the same time

Can use snapshot as checkpoint, from which activity in distributed system can be resumed after crash

chapter 7 synchronization. topics physical clock synchronization logical clock synchronization...

Documents