d u k e s y s t e m s asynchronous replicated state machines (causal multicast and all that) jeff...

D u k e S y s t e m s

AsynchronousReplicated State Machines

(Causal Multicast and All That)

Jeff ChaseDuke University

A Classic Paper

• Time, Clocks, and the Ordering of Events in Distributed Systems

• Leslie Lamport, CACM, July 1978

• Introduced:– logical clocks (“Lamport clocks”)

– state-machine replication model

– Causal multicast

– physical clocks that respect causality

– (Almost but not) vector clocks

Concurrency and time

A

B

CC

What do these words mean?after?last?subsequent?eventually?

Same world, different timelines

Which of these happened first?

A

B

W(x)=v

R(x) R(x)

Message send

Message receive

“Event e1a wrote W(x)=v”

e1a

e1b

e2

e3b e4

e3a

e1a is concurrent with e1be3a is concurrent with e3bThis is a partial order of events.

Concurrency and time

• Premise: nodes communicate only by messages.

• Nodes can observe a remote event only through some chain of messages.– Message patterns define the observability of events.

• Event e1 can affect or cause event e2 iff the node initiating e2 could have already observed e1.

– Message patterns define a potential causality relation.

– Also known as happened-before ().

• Event e1 precedes e2 iff e1 could have caused e2.

– We can view the potential causality relation as logical time.

• Events are concurrent if neither precedes the other.

Axiom 1: happened-before ()

C

A

B

C

1. If e1, e2 are in the same process/node, and e1 comes before e2, then e1e2.

- Also called program order

Time, Clocks, and the Ordering of Events in Distributed systems, by Leslie Lamport, CACM 21(7), July 1978


C

A

B

C

2. If e1 is a message send, and e2 is the corresponding receive, then e1e2.



C

A

B

C

3. is transitive happened-before is the transitive closure of

the relation defined by #1 and #2


Potential causality: example

A

B

C

A1 A2

A3 A4

B1 B2 B3 B4

C1 C2 C3

A1 < B2 < C2

B3 < A3

C2 < A4

Logical clocks

We need clocks that reflect logical time. Start with logical clocks:

1. Each node maintains a monotonically increasing logical clock value LC.

2. Events are timestamped with the current LC value at the generating node.– Increment LC on each new event: LC = LC + 1

3. Piggyback current clock value on all messages.– On receive, advance receiver’s LC to sender’s LC.

– If LCs > LC then LC = LCs + 1

The Clock Condition

• e1 e2 implies that LC(e1) < LC(e2)

• LC ordering respects potential causality.

• Is the converse also true?

• i.e., Does LC(e1) < LC(e2) imply that e1 e2?

• What if e1 and e2 are concurrent?

Logical clocks: example

0

0

0

1 2 3 9

61 5 8

2 3 4 6 7

4 5 6 7 8 10

7

5

A

B

C

C5: LC update advances receiver’s clock if it is “running slow” relative to sender.

A6-A10: receiver’s clock is unaffected because it is “running fast” relative to sender.

Potential causality

A

SS

C

W(x)=v

R(x) v?

OK “Try it now”


If nodes observe events in an order that violates causality, they may perceive an inconsistency.

Causal consistency

A

SS

C

W(x)=v

R(x) 0????

This ordering violates causal consistency.

“Try it now”


Same world, unified timelines?

A

B

W(x)=v

R(x) R(x)

e1b

e2

e4

e3a

This is a total order of events.Also called a sequential schedule.It is consistent with the partial order induced by happened-before (causal order).

External witness

e1a

e5

X

e3b

Same world, unified timelines?

A

B

W(x)=v

R(x) R(x)

e1b

e2

e3b

e4

e3a

Here is another total order of the same events.Like the last one, it is consistent with the partial order: it does not change any existing orderings; it only assigns orderings to events that are concurrent in the partial order.

External witness

e1aX

X

Replicated State Machines, revisited

Replicas are “consistent” if all replicas apply the same (deterministic) updates in the same total order.

Suppose a client can read from any replica while writes are propagating. Does the order matter?

opA opA opBopB

opA opB opB opA

Challenge: A Decentralized Mutex

Implement a distributed mutex, respecting these conditions:1. Resource holder must release before algorithm

grants the resource to another Process;

2. Different acquire requests must be granted “in the order in which they are made”;

3. If every holder eventually releases, then every request is eventually granted.

Note: ordering is arbitrary for concurrent requests.

Definition of a lock (mutex)

• Acquire + release ops on L are strictly paired.– After acquire completes, the caller holds (owns)

the lock L until the matching release.

• Acquire + release pairs on each L are ordered.– Total order: each lock L has at most one holder.

– That property is mutual exclusion; L is a mutex.

Lamport’s Mutex Problem

• The trick here is to implement a mutex service in a decentralized way, without a central lock server.

• Nodes/processes exchange messages and execute requests locally in the same order.– Replicated State Machine

• Lamport’s condition II says the order must be causal.– (why?).

Lamport’s Proposed Solution

• Process group of N processes/nodes/peers with unique IDs.

• Each process is a state machine:– Two operations: request (acquire) and release

– Internal state per-process: request queue

• Basic idea:– Requester sends request to all peers (multicast)

– Peers have some means to order the requests: receive code may defer or reorder messages before delivery.

– All peers agree on the same order of delivery.

– Therefore, all peers agree on the sequence of acquisitions.

Causal Multicast (“cbcast”)1. Assume FIFO delivery at transport: messages sent by P are

received by others in their send order (program order).

2. Sender timestamps each messages with its logical clock.

3. Receiver acknowledges a received request with a message back to the sender, timestamped with logical clock of receiver (ack sender).

Propagates knowledge of peer timestamp values.

4. The logical clocks induce a partial order on the messages. To make it a total order: if two messages have the same timestamp, use the unique process ID to break the tie.

This total order is “arbitrary”, but it respects potential causality.

5. A node delivers a request R to the local application when R is stable: it has the earliest timestamp T of all undelivered requests, and there can be no preceding request still in the network (e.g., the node has received a later-timestamped message from every peer).

The Lamport Mutex Algorithm

1. Use causal multicast.

2. Processes cache the acquire requests they have received on their request queue, in timestamp order.• Including their own acquire requests

3. If a release message is received, remove the corresponding request from the queue.

4. Take the lock when your request is at the front of the queue.• This request is next, and it is stable.

Enhanced causal multicast: Isis• Later refinements to causal multicast (“cbcast”) improved

concurrency (e.g., in Birman’s Isis group system).– Lamport’s approach delivers messages in a total order.

– Lamport’s use of logical clocks imposes an ordering in some instances where causality does not require it.

– Isis advanced cbcast (causal broadcast): deliver concurrent messages in any order; never delay your own messages to self; use various batching optimizations for message bursts.

– How to know if messages are concurrent?

• Lamport’s state-machine service doesn’t handle failures: addressed in later work (e.g., Isis again).– Requires failure detection, uniform atomicity, and views.

d u k e s y s t e m s asynchronous replicated state machines (causal multicast and all that) jeff...

Documents

ordering of events

e2 iff e1

lcs lc

clocks concurrency

logical clock value

partial order of events

increment lc

senders lc