
  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 1

    Sunday, Aug 03 - 2014

  • Introduction to Distributed Systems

    Theoretical Foundations

    Distributed Mutual Exclusion

    Distributed Deadlock

    Agreement Protocols

    Distributed File Systems

    Distributed Scheduling

    Distributed Shared Memory

    Recovery

    Fault tolerance

    Protection and Security

    Course Overview

    Mid-Semester

  • Text and References

    Text Book:

    Advanced Concepts in Operating Systems:

    Mukesh Singhal & Shivaratri, Tata McGraw-Hill

    References:

    1. Distributed Operating Systems: P. K. Sinha, PHI

    2. Distributed Operating Systems: The Logical Design, A. Goscinski, AW

    3. Modern Operating Systems: A.S.Tanenbaum & VAN, PHI

    4. Distributed Systems: Concepts and Design: G. Coulouris, AW

  • Evaluation Components

    EC-1: Quiz, 15% weightage; duration and schedule details to be announced on LMS Taxila

    EC-2: Mid-Semester Test (Closed Book)*, 2 hours, 35% weightage

    EC-3: Comprehensive Exam (Open Book)*, 3 hours, 50% weightage

  • Distributed Systems

    A Distributed System is a collection of independent computers that appears to its users

    as a single coherent system [Tanenbaum]

    A Distributed System is

    - a system having several computers that do not share a memory or a clock

    - communication is via message passing

    - each computer has its own OS and memory

    [Shivaratri & Singhal]

  • Multiprocessor System Architecture

    Types

    Tightly Coupled Systems

    Loosely Coupled Systems

  • Tightly Coupled Systems

    Systems with a single system-wide memory: parallel processing systems, SMMP (shared-memory multiprocessor systems)

    [Figure: several CPUs connected to a single shared memory through interconnection hardware]

  • Loosely Coupled System

    Distributed Memory Systems (DMS); communication via message passing

    [Figure: several CPUs, each with its own local memory, connected by a communication network]

  • Motivation

    Resource Sharing

    Enhanced Performance

    Improved Reliability & Availability

    Modular expandability

  • Distributed System Architecture Types

    Minicomputer Model

    Workstation Model

    Workstation Server Model

    Processor Pool Model

    Hybrid Model

  • MINICOMPUTER MODEL

    [Figure: minicomputers connected by a communication network, with terminals attached to each minicomputer]

  • WORKSTATION MODEL

    [Figure: workstations connected by a communication network]

  • WORKSTATION SERVER MODEL

    [Figure: workstations connected by a communication network to minicomputers used as a file server, a database server, and a print server]

  • PROCESSOR POOL MODEL

    [Figure: terminals connected through a communication network to a file server, a run server, and a pool of processors]

  • Hybrid Model

    Based upon workstation-server model but with additional pool of processors

    Processors in the pool can be allocated dynamically

    Gives guaranteed response time to interactive jobs

    More expensive to build

  • Distributed OS

    A distributed OS is one that looks to its users like a centralized OS
    but runs on multiple, independent CPUs. The key concept is
    transparency: the use of multiple processors should be invisible to
    the user.

    [Tanenbaum & Van Renesse]

  • Issues

    Global knowledge

    Naming

    Scalability

    Compatibility

    Process Synchronization

    Resource Management

    Security

    Structuring

  • Global Knowledge

    No Global Memory

    No Global Clock

    Unpredictable Message Delays

  • Naming

    Names refer to objects [files, computers, etc.]

    Name Service Maps logical name to physical address

    Techniques

    - LookUp Tables [Directories]

    - Algorithmic

    - Combination of above two

  • Scalability

    Grow with time

    Scaling dimensions: size, geographic, and administrative

    Techniques: hiding communication latencies, distribution, and caching

  • Scaling Techniques (1)

    Hide Communication Latencies

    [Figure 1.4: the difference between letting (a) the server or (b) the client check forms as they are being filled]

  • Scaling Techniques (2)

    Distribution

    [Figure 1.5: an example of dividing the DNS name space into zones]

  • Scaling Techniques (3)

    Replication

    Replicate components across the distributed system

    Replication increases availability and balances the load

    Consistency problem has to be handled

  • Compatibility

    Interoperability among the resources in the system

    Levels of compatibility: binary level, execution level, and protocol level

  • Process Synchronization

    Difficult because of unavailability of shared memory

    Mutual Exclusion Problem

  • Resource Management

    Make both local and remote resources available

    Specific Location of resource should be hidden from user

    Techniques

    - Data Migration [DFS, DSM]

    - Computation Migration [RPC]

    - Distributed Scheduling [ Load Balancing]

  • Security

    Authentication: verifying that an entity is what it claims to be

    Authorization: determining what privileges an entity has and making
    only those privileges available

  • Structuring

    Techniques

    - Monolithic Kernel

    - Collective Kernel [Microkernel based , Mach, V-

    Kernel, Chorus and Galaxy]

    - Object Oriented OS [Services are implemented

    as objects, Eden, Choices, x-kernel, Medusa,

    Clouds, Amoeba & Muse]

    - Client-Server Computing Model [Processes are

    categorized as servers and clients]

  • Communication Primitives

  • Communication Primitives

    High-level constructs that help programs use the underlying
    communication network

    Two Types of Communication Models

    - Message passing

    - Remote Procedure Calls

  • Message Passing

    Two basic communication primitives

    - SEND(a, b): a is the message, b the destination

    - RECEIVE(c, d): c is the source, d a buffer for storing the message

    Client-Server Computation Model

    - Client sends Message to server and waits

    - Server replies after computation
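The request-reply exchange above can be sketched with thread-safe queues standing in for the kernel's message buffers; a minimal illustration where the SEND/RECEIVE names follow the slides and everything else is assumed:

```python
import threading
import queue

# Channels are modeled as thread-safe queues: SEND copies the message into
# the destination's buffer, RECEIVE blocks until a message arrives.
to_server = queue.Queue()
to_client = queue.Queue()

def send(message, destination):
    destination.put(message)        # returns once the message is buffered

def receive(source):
    return source.get()             # blocks until a message is available

def server():
    request = receive(to_server)    # wait for the client's request
    send(request * 2, to_client)    # compute and reply

t = threading.Thread(target=server)
t.start()
send(21, to_server)                 # client sends its message to the server
reply = receive(to_client)          # ... and waits for the reply
t.join()
```

The client thread really does block on `receive` until the server replies, mirroring the client-server computation model on the slide.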

  • Design Issues

    Blocking vs non-blocking primitives

    Non-blocking

    - The SEND primitive returns control to the user process as soon as
      the message is copied from the user buffer to the kernel buffer

    - Advantage: programs have maximum flexibility to interleave
      computation and communication in any order

    - Drawback: programming becomes tricky and difficult

    Blocking

    - The SEND primitive does not return control to the user process
      until the message has been sent or an acknowledgement has been
      received

    - Advantage: program behavior is predictable

    - Drawback: lack of flexibility in programming
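Python's socket API exposes the same choice between blocking and non-blocking calls; a small sketch (the variable names are illustrative):

```python
import socket

# A connected socket pair stands in for the kernel communication path.
a, b = socket.socketpair()

# Blocking mode (the default): recv does not return until data arrives.
a.sendall(b"ping")
data = b.recv(4)            # blocks until the 4 bytes are available

# Non-blocking mode: calls return immediately; if the kernel buffer has
# nothing to deliver, recv raises BlockingIOError instead of waiting.
b.setblocking(False)
pending = None
try:
    pending = b.recv(4)
except BlockingIOError:
    pass                    # nothing buffered yet; the process may do other work
```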

  • Design Issues cont..

    Synchronous vs Asynchronous Primitives

    Synchronous

    - SEND primitive is blocked until corresponding RECEIVE

    primitive is executed at the target computer

    Asynchronous

    - Messages are buffered

    - SEND primitive does not block even if there is no

    corresponding execution of the RECEIVE primitive

    - The corresponding RECEIVE primitive can be either

    blocking or non-blocking

  • Details to be Handled in Message Passing

    Pairing of Response with Requests

    Data Representation

    Sender should know the address of Remote machine

    Communication and System failures

  • Remote Procedure Call (RPC)

    RPC is an interaction between a client and a server

    Client invokes a procedure on the server

    Server executes the procedure and passes the result back to the client

    The calling process is suspended and proceeds only after getting the
    result from the server

  • RPC Design issues

    Structure

    Binding

    Parameter and Result Passing

    Error handling, semantics and Correctness

  • Structure

    The RPC mechanism is based upon stub procedures.

    [Figure: RPC call path between the client and server machines. On the
    client machine, the user program makes a local procedure call to the
    stub procedure, which packs the parameters, transmits them, and
    waits. On the server machine, the stub procedure unpacks the
    parameters and makes a local procedure call to execute the remote
    procedure; it then packs the result and returns. The client stub
    unpacks the result and returns from the local call.]
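The stub structure maps directly onto Python's standard `xmlrpc` module, where `ServerProxy` acts as the client stub and `SimpleXMLRPCServer` hosts the server-side stub; a minimal sketch (the host, the port choice, and the `add` procedure are assumptions for illustration):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server machine: register a procedure. The server-side stub unpacks each
# incoming request, executes the procedure, and packs the result back.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client machine: the proxy is the client-side stub. Calling proxy.add
# packs the parameters, transmits them, waits, and unpacks the result,
# so to the caller it looks like an ordinary local procedure call.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
```

Note how the calling thread is suspended inside `proxy.add` until the server's reply arrives, exactly as the slide describes.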

  • Binding

    Determines remote procedure and machine on which it will be executed

    Check compatibility of the parameters passed

    Use Binding Server

  • Binding

    Binding Server

    [Figure: the RPC call path extended with a binding server. The server
    registers its services with the binding server (step 1); before the
    call, the client stub queries the binding server, which returns the
    server's address (steps 2-4); the call then proceeds through pack,
    transmit, unpack, execute, and return as in the basic RPC structure
    (steps 5-8).]

  • Parameter and Result Passing

    Stub procedures convert parameters and results to the appropriate form

    The sender stub packs the parameters into a buffer; the receiver stub
    unpacks them

    Conversion is expensive if done on every call

    - Send the parameters along with a code that identifies their format,
      so that the receiver can do the conversion

    - Alternatively, each data type may have a standard format: the
      sender converts data to the standard format and the receiver
      converts from the standard format to its local representation

    Passing parameters by reference is problematic (no shared address
    space)

  • Error Handling, Semantics and Correctness

    RPC may fail due to either computer or communication failure

    If the remote server is slow, the program invoking the remote
    procedure may call it twice

    The client may crash after sending the RPC message

    The client may recover quickly after a crash and reissue the RPC

    Orphan execution of remote procedures

    RPC Semantics

    - At least once semantics

    - Exactly Once

    - At most once

  • Correctness Condition

    Given by Panzieri & Shrivastava

    Let Ci denote a call made by a machine and Wi the corresponding
    computation

    If C2 happened after C1 (C1 -> C2) and the computations W1 and W2
    share the same data, then to be correct in the presence of failures
    the RPC should satisfy:

    C1 -> C2 implies W1 -> W2

  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 2

    Sunday, Aug 09 - 2014


  • Theoretical Foundations

  • Theoretical Aspects

    Limitations of Distributed Systems

    Logical Clocks

    Vector Clocks

  • Limitations of Distributed Systems

    Absence of Global clock

    Absence of Shared Memory

  • Example

    [Figure: two sites S1 and S2 connected by a communication channel,
    holding balances (a) $500 and $200, (b) $450 and $200, and (c) $500
    and $250, illustrating how recording local states at different
    moments during a $50 transfer can miss or double-count money in
    transit.]

  • 20

    Happened before relation:

    a -> b : Event a occurred before event b. Events in the same process.

    a -> b : If a is the event of sending a message m in a process and b is the event of receipt of the

    same message m by another process.

    a -> b, b -> c, then a -> c. -> is transitive.

    Causally Ordered Events

    a -> b : Event a causally affects event b

    Concurrent Events

    a || b: if a !-> b and b !-> a

    Lamport's Logical Clock

  • 21

    Space-time Diagram

    Time

    Space P1

    P2

    e11 e12 e13 e14

    e21 e22 e23 e24

    Internal

    Events Messages

  • 22

    Logical Clocks Conditions satisfied:

    Ci is clock in Process Pi.

    If a -> b in process Pi, Ci(a) < Ci(b)

    Let a: sending message m in Pi; b : receiving message m in Pj; then, Ci(a) < Cj(b).

    Implementation Rules: R1: Ci = Ci + d (d > 0); clock is updated between two

    successive events.

    R2: Cj = max(Cj, tm+ d); (d > 0); When Pj receives a message m with a time stamp tm (tm assigned by Pi, the sender; tm = Ci(a), a being the event of sending message m).

    A reasonable value for d is 1
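Rules R1 and R2 can be sketched as a small class (with d = 1; a minimal illustration, not code from the text):

```python
class LamportClock:
    """Sketch of Lamport's logical clock rules R1 and R2 with d = 1."""

    def __init__(self):
        self.c = 0

    def tick(self):
        # R1: update the clock between two successive events
        self.c += 1
        return self.c

    def send(self):
        # Sending is an event; the returned value is the timestamp tm
        return self.tick()

    def receive(self, tm):
        # R2: merge with the incoming message timestamp
        self.c = max(self.c, tm + 1)
        return self.c

p1, p2 = LamportClock(), LamportClock()
tm = p1.send()           # P1 sends: its clock becomes 1
t_recv = p2.receive(tm)  # P2 receives: its clock jumps to max(0, 1 + 1) = 2
```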

  • 23

    Space-time Diagram

    /

    Time

    Space

    P1(0)

    P2 (0)

    e11 e12 e13 e14

    e21 e22 e23 e24

    e15 e16 e17

    e25

    (1) (2) (3) (4) (5) (6) (7)

    (1) (2) (3) (4) (7)

  • 24

    Space-time Diagram

    /

    Time

    Space

    P1(0)

    P2 (0)

    a b c d e

    f g h i

  • 25

    Limitation of Lamports Clock

    Time

    Space P1

    P2

    e11 e12 e13

    e21 e22 e23

    P3 e31 e32 e33

    (1) (2) (3)

    (1)

    (1) (2)

    (3) (4)

    C(e11) < C(e32) but not causally related.

    This inter-dependency not reflected in Lamports Clock.

  • 26

    Vector Clocks Keep track of transitive dependencies among

    processes for recovery purposes.

    Ci[1..n]: is a vector clock at process Pi whose entries are the assumed/best guess clock values of different processes.

    Ci[j] (j != i) is the best guess of Pi for Pjs clock.

    Vector clock rules: Ci[i] = Ci[i] + d, (d > 0); for successive events in Pi

    For all k, Cj[k] = max (Cj[k],tm[k]), when a message m with time stamp tm is received by Pj from Pi.

    : For all

  • 27

    Vector Clocks Comparison

    1. ta = tb iff for all i, ta[i] = tb[i];

    2. ta != tb iff there exists an i, ta[i] != tb[i];

    3. ta
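A sketch of the vector clock rules and comparison relations (0-indexed process ids; incrementing the local entry on receive as well as on send is a common convention and an assumption here):

```python
class VectorClock:
    """Sketch of vector clock rules for process i out of n (d = 1)."""

    def __init__(self, i, n):
        self.i = i
        self.v = [0] * n

    def tick(self):
        # Rule for successive events in Pi: increment own entry
        self.v[self.i] += 1
        return list(self.v)

    def receive(self, tm):
        # Merge rule: Cj[k] = max(Cj[k], tm[k]) for all k, then count
        # the receive itself as a local event (an assumed convention)
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        return self.tick()

def leq(ta, tb):
    # ta <= tb iff every component of ta is <= the matching one in tb
    return all(a <= b for a, b in zip(ta, tb))

def concurrent(ta, tb):
    # ta || tb: neither timestamp causally precedes the other
    return not leq(ta, tb) and not leq(tb, ta)
```

Unlike Lamport clocks, `concurrent` lets us detect the missing inter-dependency from the earlier example: two timestamps that are incomparable component-wise belong to causally unrelated events.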

  • 28

    Vector Clock

    Time

    Space P1

    P2

    e11 e12 e13

    e21 e22 e23 e24

    P3 e31 e32

    (1,0,0) (2,0,0) (3,4,1)

    (0,1,0) (2,2,0) (2,4,1) (2,3,1)

    (0,0,1) (0,0,2)

  • 29

    Next Lecture

    Causal Order of Messages

  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 3

    Saturday, Aug 16 - 2014

  • Today's Lecture

    Causal Order of Messages

    - BSS Algorithm

    - SES Algorithm

    Global State

    Cut

  • Causal ordering of Messages

    8/16/2014 CS ZG623, Adv OS 3

    [Figure: space-time diagram with processes P1, P2, and P3 and sends
    of M1, M2, and M3, illustrating deliveries that would violate causal
    order.]

  • Birman-Schiper-Stephenson (BSS) Protocol

    Before broadcasting m, process Pi increments vector time VT_Pi[i]
    and timestamps m.

    A process Pj (j != i) that receives m with timestamp VT_m from Pi
    delays delivery until both:

    VT_Pj[i] = VT_m[i] - 1    // Pj has received all previous messages from Pi

    VT_Pj[k] >= VT_m[k] for every k in {1, 2, ..., n} - {i}
    // Pj has received all messages also received by Pi before it sent m

    When Pj delivers m, VT_Pj is updated by rule IR2 of the vector clock.
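The two delivery conditions can be written as a predicate (0-indexed process ids; the names are illustrative):

```python
def bss_can_deliver(vt_pj, vt_m, i):
    """BSS delivery test at Pj for a broadcast m sent by Pi (index i).

    Deliver only if m is the next broadcast from Pi that Pj expects,
    and Pj has already received every message Pi had received before
    sending m; otherwise m is buffered.
    """
    next_from_sender = vt_m[i] == vt_pj[i] + 1
    seen_rest = all(vt_pj[k] >= vt_m[k]
                    for k in range(len(vt_m)) if k != i)
    return next_from_sender and seen_rest

# With P1 still at (0,0,0): a broadcast from P3 stamped (0,0,1) is
# deliverable, but one from P2 stamped (0,1,1) must wait for P3's.
```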

  • Example of BSS

    [Figure: P3 broadcasts a message stamped (0,0,1); P2 receives it and
    then broadcasts one stamped (0,1,1). P1 receives the (0,1,1) message
    first and buffers it, delivering it from the buffer only after the
    (0,0,1) message arrives.]

  • Schiper-Eggli-Sandoz (SES) protocol

    No need for broadcast messages.

    Each process maintains a vector V_P of size N - 1, N being the
    number of processes in the system.

    V_P is a vector of tuples (P, t): P is a destination process id and
    t a vector timestamp.

    tm: logical time of sending message m

    Tpi: present logical time at Pi

    Initially, V_P is empty.

  • SES Continued

    Sending a message (from P1 to P2):

    Send message M, timestamped tm, along with V_P1 to P2.

    Insert (P2, tm) into V_P1, overwriting the previous value of
    (P2, t), if any.

    (P2, tm) itself is not sent with M. Any future message carrying
    (P2, tm) in its vector cannot be delivered to P2 until tm < Tp2.

    Delivering a message:

    If V_M (in the message) does not contain any pair (P2, t), the
    message can be delivered.

    Otherwise /* (P2, t) exists in V_M */: if t !< Tp2, buffer the
    message (don't deliver); else (t < Tp2) deliver it.
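A sketch of the delivery test, reading t < Tp as component-wise comparison of vector timestamps (that reading, and all names, are assumptions for illustration):

```python
def ses_can_deliver(v_m, t_p, pid):
    """SES delivery test at process `pid` whose current vector time is t_p.

    v_m maps destination process ids to vector timestamps (the V_M that
    accompanied the message). Deliver unless v_m records a constraint
    for `pid` whose timestamp the process has not yet reached.
    """
    if pid not in v_m:
        return True                                   # no constraint for pid
    t = v_m[pid]
    return all(a <= b for a, b in zip(t, t_p))        # t < Tp: caught up

# From the lecture's Figure 1: M3 reaches P1 carrying {P1: (0,1,0)} while
# P1's clock is still (0,0,0), so M3 is buffered; once P1 has delivered M1
# and its clock is (1,1,0), the buffered M3 becomes deliverable.
```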

  • SES Algorithm Example (Figure 1)

    [Figure 1: processes P1, P2, and P3, each starting at vector time
    (0,0,0) with an empty vector VP1(__, __), VP2(__, __), VP3(__, __).
    P2 sends M1 to P1 (delivered at point y) and M2 to P3 (delivered at
    point a); P3 sends M3 to P1 at point b, and M3 arrives at P1 at
    point x.]

    Initially the vector time is (0,0,0) at all sites and the vector at
    each site is empty. Note that here we take the size of the vector at
    each site as 2; if you want, you can easily extend it to 3.

  • Vectors Maintained at Each Site

    VP1 (__, __): the first entry is updated when P1 sends any message
    to P2; the second when P1 sends any message to P3.

    VP2 (__, __): the first entry is updated when P2 sends any message
    to P1; the second when P2 sends any message to P3.

    VP3 (__, __): the first entry is updated when P3 sends any message
    to P1; the second when P3 sends any message to P2.

  • Consider Sending of Message M1 from P2 to P1 of Figure 1

    Send step at P2:

    1. P2 sends message M1 to P1. The vector accompanying the message,
    VM (__, __), is empty, so P2 sends the message {(__, __), (0,1,0)}
    to P1.

    2. After sending the message, P2 updates its own vector VP2. P1's
    corresponding entry in VP2 (the vector of process P2) is empty, so
    the pair {P1, (0,1,0)} is inserted into VP2. Before sending M1, VP2
    is (__, __); after sending M1 it becomes {(P1, (0,1,0)), __}.

    Important point: M1 is sent with the earlier VP2 (as VM), i.e. the
    VP2 existing at P2 before M1 was sent. VP2 gets its new entry only
    after M1 is sent.

  • Receiving of M1 by P1 at Point y of Figure 1

    P1 receives from P2 a message of the form {(__, __), (0,1,0)}, i.e.
    (VM, t).

    Check:

    1. If the accompanying VM contains no entry corresponding to P1,
    deliver the message. [This condition is satisfied here.]

    Otherwise (only if the condition of step 1 is not satisfied):

    2. Compare the timestamp t with TP1, the current time at P1, which
    is (0,0,0) before the receipt. If t < TP1, deliver the message;

    3. else buffer the message [t !< TP1].

    For Figure 1 only step 1 applies, which is why M1 is delivered at
    point y at P1. [Assume M3 has not been received at x.]

  • What Happens at P1 when the Message Is Delivered at Point y

    STEP A: merge VP1 and VM, updating VP1 with the values of VM.

    1. If some process P (other than P1) has an entry in VM but none in
    VP1, insert that entry into VP1.

    2. If some process P (other than P1) has an entry in both VM and
    VP1, update its time in VP1 to max(a, b), where a is the time in VM
    and b the time in VP1.

    In our case, after receiving M1, both VP1 and VM are (__, __), so
    the merge leaves VP1 empty: (__, __).

    STEP B: update P1's logical clock; the time at P1 becomes (1,1,0).

    STEP C: check for buffered messages at P1; there are none.

  • Complete State of the System after P1 Receives M1

    Process  Vector                                               Time
    P1       VP1 (__, __) [empty]                                 (1,1,0) at y
    P2       VP2 {(P1,(0,1,0)), __} (updated after sending M1)    (0,1,0)
    P3       VP3 (__, __) [empty]                                 (0,0,0)

  • Now Consider the Case when M2 Is Sent from P2 to P3

    Send step of M2 from P2 to P3:

    1. P2 sends message M2 to P3 with VM = {(P1,(0,1,0)), __} and
    timestamp (0,2,0), so the sent message looks like
    {({P1,(0,1,0)}, __), (0,2,0)}.

    2. After sending, P2 updates its own vector VP2. P2's entry
    corresponding to P3 is empty, so VP2 becomes
    {(P1,(0,1,0)), (P3,(0,2,0))}. The state of the system is then:

    Process  Vector                                                Time
    P1       VP1 (__, __) [empty]                                  (1,1,0) at y
    P2       VP2 {(P1,(0,1,0)), (P3,(0,2,0))} after sending M2     (0,2,0)
    P3       VP3 (__, __) [empty]; P3 has not received M2 yet      (0,0,0)

  • Receiving of M2 by P3 at Point a of Figure 1

    P3 receives from P2 a message of the form
    [{(P1,(0,1,0)), __}, (0,2,0)], i.e. [VM, t].

    Check:

    1. If the accompanying VM contains no entry corresponding to P3,
    deliver the message. [This condition is satisfied here: VM's entry
    for P3 is empty.]

    Otherwise (only if the condition of step 1 is not satisfied):

    2. Compare the timestamp t with TP3, the current time at P3, and
    deliver if t < TP3, else buffer. (Not needed here, since step 1
    already allows delivery.)

  • What Happens at P3 when Message M2 Is Delivered at Point a

    STEP A: merge VP3 and VM, updating VP3 with the values of VM.

    1. If some process P (other than P3) has an entry in VM but none in
    VP3, insert that entry into VP3.

    2. If some process P (other than P3) has an entry in both VM and
    VP3, update its time in VP3 to max(a, b), where a is the time in VM
    and b the time in VP3.

    In this case VM is [{P1,(0,1,0)}, __] and VP3 is (__, __). [VM
    contains an entry for P1: the step 1 condition.] So VP3 after the
    merge is [{P1,(0,1,0)}, __].

    STEP B: update P3's logical clock; the time at P3 becomes (0,2,1).

    STEP C: check for buffered messages at P3; there are none.

  • Complete State of the System after P3 Receives M2 at Point a

    Process  Vector                                                Time
    P1       VP1 (__, __) [empty]                                  (1,1,0) at y
    P2       VP2 {(P1,(0,1,0)), (P3,(0,2,0))} (updated after
             sending M2)                                           (0,2,0)
    P3       VP3 {(P1,(0,1,0)), __}                                (0,2,1) at a

  • Now Consider the Case when M3 Is Sent from P3 to P1 at Point b of
    Figure 1

    Send step of M3 from P3 to P1:

    1. P3 sends message M3 to P1 with VM = {(P1,(0,1,0)), __} and
    timestamp (0,2,2), so the sent message looks like
    {({P1,(0,1,0)}, __), (0,2,2)}.

    2. After sending M3, P3 updates its own vector VP3. VP3 already
    contains an entry for P1, so that entry is simply updated with the
    latest time value. The complete state after P3 sends M3:

    Process  Vector                                                Time
    P1       VP1 (__, __); P1 has not received M3 so far           (1,1,0) at y
    P2       VP2 {(P1,(0,1,0)), (P3,(0,2,0))} after sending M2     (0,2,0)
    P3       VP3 {(P1,(0,2,2)), __}; P3 has sent M3                (0,2,2) at b

  • Receiving of M3 by P1 at Point x of Figure 1

    P1 receives from P3 a message of the form
    [{(P1,(0,1,0)), __}, (0,2,2)], i.e. [VM, t].

    Check:

    1. If the accompanying VM contains no entry corresponding to P1,
    deliver the message. [This condition does not hold here, because the
    accompanying VM contains an entry corresponding to P1.]

    Otherwise:

    2. When P1 receives M3 at point x, the current time there, TP1, is
    (0,0,0). The message arrived timestamped t = (0,2,2); clearly
    t !< TP1, so the message is buffered.


  • Global State: The Model


    Node properties:

    No shared memory

    No global clock

    Channel properties:

    FIFO

    loss free

    non-duplicating

  • The Need for Global State

    Many problems in Distributed Computing can be

    cast as executing some action on reaching a

    particular state

    e.g. - Distributed deadlock detection is finding a cycle in the
    wait-for graph

    - Termination detection

    - Checkpointing

    and many more.

  • Difficulties due to Non Determinism

    Deterministic Computation

    - At any point in computation there is at

    most one event that can happen next.

    Non-Deterministic Computation

    - At any point in computation there can be

    more than one event that can happen next.

  • Global State Example

    [Figure: a sequence of recorded states, Global State 1, Global
    State 2, and Global State 3.]

  • Recording Global State

    Suppose the global state of A is recorded in (1) but not in (2),
    while the states of B, C1, and C2 are recorded in (2). An extra $50
    then appears in the global state, because A's state was recorded
    before the message was sent while C1's state was recorded after it
    was sent.

    The global state is inconsistent if n < n', where

    n is the number of messages sent by A along the channel before A's
    state was recorded, and

    n' is the number of messages sent by A along the channel before the
    channel's state was recorded.

    Consistent global state: n = n'

  • Continued

    Similarly, for consistency, m = m', where

    m is the number of messages received along the channel before B's state was recorded, and

    m' is the number of messages received along the channel by B before the channel's state was recorded.

    Also, n >= m, since the number of messages received along a channel cannot exceed the number sent along it; hence n' >= m'.

    A consistent global state must satisfy these relations.

    Channel state in a consistent global state: the sequence of messages sent before the sender's state was recorded, excluding the messages received before the receiver's state was recorded.

    Only in-transit messages are recorded in the channel state.
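In code form, the channel-state rule above amounts to a set difference over the messages crossing the channel (a minimal sketch with made-up message names):

```python
# Messages sent on the channel before the sender's state was recorded.
sent_before_record = ["m1", "m2", "m3"]
# Messages received on the channel before the receiver's state was recorded.
received_before_record = ["m1"]

# Channel state = sent-but-not-yet-received messages (in transit).
channel_state = [m for m in sent_before_record
                 if m not in received_before_record]
# channel_state is ["m2", "m3"]
```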

  • Notion of Consistency: Example

    Figure: space-time diagram of two processes p and q, with local states Sp0, Sp1, Sp2, Sp3 and Sq0, Sq1, Sq2, Sq3 and messages m1, m2, m3 exchanged between them.

  • A Consistent State?

    Figure: the same space-time diagram with a cut through the local states Sp1 and Sq1.

  • Yes

    The cut {Sp1, Sq1} is consistent: every message received before the cut was also sent before it.

  • What about this?

    Figure: a cut through the local states Sp2 and Sq3.

    Yes — this cut is also consistent.

  • What about this?

    Figure: a cut through the local states Sp1 and Sq3.

    No — this cut is inconsistent: a message receive lies before the cut while its send lies after it.

  • Chandy-Lamport GSR algorithm


    Sender (Process p)

    Record the state of p

    For each outgoing channel (c) incident on p, send a marker before sending any other messages

    Receiver (q receives marker on c1)

    If q has not yet recorded its state

    Record the state of q

    Record the state of c1 as null

    For each outgoing channel (c) incident on q , send a marker before sending any other message.

    If q has already recorded its state

    Record the state of c1 as all the messages received on c1 since q's state was recorded.
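The marker rules above can be sketched as follows (a minimal, single-process view of the algorithm; the class and method names are my own, not from the slides):

```python
class Snapshot:
    """Per-process recording logic of the Chandy-Lamport algorithm
    (a sketch: markers and messages are delivered by direct calls)."""

    def __init__(self):
        self.local_state = None      # recorded process state
        self.channel_state = {}      # channel -> messages recorded in transit
        self.recording = set()       # incoming channels still being recorded

    def start(self, state, incoming):
        # Record own state and start recording every incoming channel.
        self.local_state = state
        for c in incoming:
            self.channel_state[c] = []
            self.recording.add(c)

    def on_marker(self, channel, state, incoming):
        if self.local_state is None:
            self.start(state, incoming)        # first marker: record now,
            self.channel_state[channel] = []   # this channel's state is empty
        self.recording.discard(channel)        # a marker closes the channel

    def on_message(self, channel, msg):
        # An application message arriving while the channel is being
        # recorded was in transit at snapshot time.
        if channel in self.recording:
            self.channel_state[channel].append(msg)

# q initiates the snapshot with local state 500, then receives an
# in-transit transfer of 50 on channel "pq" before p's marker arrives.
q = Snapshot()
q.start(state=500, incoming=["pq"])
q.on_message("pq", 50)
q.on_marker("pq", state=None, incoming=["pq"])
```

After the marker arrives, q has recorded its local state (500) and the channel state ([50] in transit on "pq"), matching the rule that only transit messages enter the channel state.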

  • Uses of GSR

    Recording a consistent state of the global computation:

    - checkpointing for fault tolerance (rollback, recovery)
    - testing and debugging
    - monitoring and auditing

    Detecting stable properties in a distributed system via snapshots. A property is stable if, once it holds in a state, it holds in all subsequent states.

    Examples: termination, deadlock, garbage collection.

  • State Recording Example

    Let p transfer $100 to q, let q transfer $50 to p and $30 to r, and let p want to record the global state.

    Figure (three successive snapshots): the recorded values shown are 400 at p, the amounts 50 and 30 in transit, and recorded balances 520 and 530 appearing in the later snapshots.

  • Cut


    A cut is a set of cut events, one per node, each of which

    captures the state of the node on which it occurs. It is

    also a graphical representation of a global state.

  • Consistent Cut

    A cut C = {c1, c2, c3, ...} is consistent if there are no events ei and ej such that:

    (ei --> ej) and (ej --> cj) and (ei -/-> ci), for cut events ci, cj in C

    Figure: an example of an inconsistent cut, in which a message's receive event lies inside the cut while its send event does not.

  • Ordering of Cut events


    The cut events in a consistent cut are not causally related. Thus, the cut is a set of concurrent events and a set of concurrent events is a cut.

    Note, in this inconsistent cut, c3 --> c2.

  • Termination detection

    Question: in a distributed computation, when have all of the processes become idle (i.e., when has the computation terminated)?

  • Huangs algorithm

    The computation starts when the controlling agent sends the first message and terminates when all processes are idle.

    The role of weights:

    - the controlling agent initially has a weight of 1 and all other processes have a weight of zero,
    - when a process sends a message, a portion of the sender's weight is put in the message, reducing the sender's weight,
    - a receiver adds the weight of a received message to its own weight,
    - on becoming idle, a process sends its weight to the controlling agent,
    - the sum of all weights is always 1.

    The computation has terminated when the weight of the controlling agent returns to 1.
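A toy run of this weight-throwing scheme, using exact fractions so the invariant "all weights sum to 1" is easy to check (the node and function names are illustrative, not from the slides):

```python
from fractions import Fraction

class Node:
    def __init__(self):
        self.weight = Fraction(0)
        self.active = False

def send(sender, receiver):
    # The sender puts half of its weight into the activation message.
    part = sender.weight / 2
    sender.weight -= part
    receiver.weight += part
    receiver.active = True

def go_idle(node, agent):
    # An idle process returns its whole weight to the controlling agent.
    agent.weight += node.weight
    node.weight = Fraction(0)
    node.active = False

agent = Node()
agent.weight = Fraction(1)          # controlling agent starts with weight 1
agent.active = True

a, b = Node(), Node()
send(agent, a)                      # computation starts
send(a, b)                          # a activates b
go_idle(b, agent)                   # b finishes, returns 1/4
go_idle(a, agent)                   # a finishes, returns 1/4
# agent.weight is back to 1 -> the computation has terminated
```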


  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 4

    Saturday, Aug 23-2014

  • Distributed Mutual Exclusion

    - Non-Token Based Algorithms (Will be Covered Today)

    1. Lamport's Algorithm

    2. Ricart-Agrawala Algorithm

    3. Maekawa's Algorithm

    - Token Based Algorithms (Next Class)

    1. Suzuki-Kasami Algorithm

    2. Singhal's Heuristic Search

    3. Raymond's Tree-Based Algorithm

    Today's Lecture

  • Mutual Exclusion

    Access to a shared resource by several un-coordinated user requests must be serialized.

    Actions performed on the shared resource on a user's behalf must be atomic.

    Example: directory management

  • Classification of Mutual Exclusion Algorithms

    Non-token based:

    A site/process can enter a critical section when an

    assertion (condition) becomes true.

    Algorithm should ensure that the assertion will be true

    in only one site/process.

    Token based:

    A unique token (a known, PRIVILEGE message) is shared

    among cooperating sites/processes.

    Possessor of the token has access to critical section.

    Need to take care of conditions such as loss of token,

    crash of token holder, possibility of multiple tokens, etc.

  • General System Model

    At any instant, a site may have several requests for the critical section (CS), queued up and serviced one at a time.

    Site states: requesting CS, executing CS, idle (neither requesting nor executing the CS).

    Requesting CS: blocked until granted access; cannot make additional requests for the CS.

    Executing CS: using the CS.

    Idle: no CS activity at the site. In token-based approaches, an idle site can hold the token.

  • Mutual Exclusion: Requirements

    Freedom from deadlocks: two or more sites should not endlessly wait on conditions/messages that never

    become true/arrive.

    Freedom from starvation: No indefinite waiting.

    Fairness: Order of execution of CS follows the order of the requests for CS. (equal priority).

    Fault tolerance: recognize faults, reorganize, continue. (e.g., loss of token).

  • Performance

    Number of messages per CS invocation: should be minimized.

    Synchronization delay, i.e., the time between one site leaving the CS and the next site entering it: should be minimized.

    Response time: the interval between a request's messages being sent and the site's exit from the CS.

    System throughput, i.e., the rate at which the system executes requests for the CS: should be maximized.

    If sd is the synchronization delay and E the average CS execution time: system throughput = 1 / (sd + E).
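For concreteness, the throughput formula with illustrative values T = 10 ms and E = 40 ms, assuming a synchronization delay of T (the values are my own, not from the slides):

```python
# Sample values: T = message delay, E = mean CS execution time (seconds).
T, E = 0.01, 0.04
sd = T                      # synchronization delay, e.g. Lamport / Ricart-Agrawala
throughput = 1 / (sd + E)   # CS executions per second
# throughput == 20.0 (one CS execution every 50 ms)
```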

  • Performance metrics

    Figure: a timeline illustrating the synchronization delay (from the last site exiting the CS to the next site entering it) and the response time (from a CS request arriving and its messages being sent, through entering the CS, to exiting the CS after execution time E).

  • Performance ...

    Low and High Load:

    Low load: No more than one request at a given point in time.

    High load: Always a pending mutual exclusion request at a site.

    Best and Worst Case:

    Best Case (low loads): Round-trip message delay + Execution time. 2T + E.

    Worst case (high loads).

    Message traffic: low at low loads, high at high loads.

    Average performance: when load conditions fluctuate widely.

  • Simple Solution

    Control site: grants permission for CS execution.

    A site sends a REQUEST message to the control site.

    The controller grants access one by one.

    Synchronization delay: 2T -> a site releases the CS by sending a message to the controller, and the controller sends permission to another site.

    System throughput: 1/(2T + E). If the synchronization delay is reduced to T, throughput doubles.

    The controller becomes a bottleneck; congestion can occur.

  • Non-token Based Algorithms

    Notations:

    Si: site i

    Ri: request set, containing the ids of all sites from which Si must receive permission before accessing the CS.

    Non-token based approaches use timestamps to order requests for the CS.

    Smaller timestamps get priority over larger ones.

    Lamport's Algorithm

    Ri = {S1, S2, ..., Sn}, i.e., all sites.

    Request queue: maintained at each Si, ordered by timestamps.

    Assumption: messages are delivered in FIFO order.

  • Lamport's Algorithm

    Requesting CS: send REQUEST(tsi, i) to all sites, where (tsi, i) is the request timestamp, and place the REQUEST in request_queuei.

    On receiving the message, Sj sends a time-stamped REPLY message to Si and places Si's request in request_queuej.

    Executing CS: Si enters the CS when (i) it has received a message with timestamp larger than (tsi, i) from all other sites, and (ii) its own request is at the top of request_queuei.

    Releasing CS: on exiting the CS, send a time-stamped RELEASE message to all sites in the request set.

    On receiving the RELEASE message, Sj removes Si's request from its queue.
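The entry conditions above can be sketched as a small synchronous simulation (a simplification: REPLY timestamps are approximated as ts + 1 and message passing is replaced by direct updates; this is not the full algorithm):

```python
import heapq

class Site:
    def __init__(self, sid):
        self.sid = sid
        self.queue = []    # min-heap of (timestamp, site_id) requests
        self.replies = {}  # highest REPLY timestamp seen from each other site

def request(sites, i, ts):
    # Si broadcasts REQUEST(ts, i); every site queues it; the others reply.
    for s in sites:
        heapq.heappush(s.queue, (ts, i))
        if s.sid != i:
            sites[i].replies[s.sid] = ts + 1  # simplified REPLY clock value

def can_enter(sites, i, ts):
    # Si enters the CS iff its request heads its own queue and every other
    # site has been heard from with a timestamp larger than (ts, i).
    me = sites[i]
    return me.queue[0] == (ts, i) and all(t > ts for t in me.replies.values())

def release(sites, i):
    # RELEASE removes Si's request (now at the head) from every queue.
    for s in sites:
        heapq.heappop(s.queue)

sites = [Site(i) for i in range(3)]
request(sites, 1, 1)   # S1 requests at timestamp 1
request(sites, 0, 2)   # S0 requests at timestamp 2
```

With both requests pending, only site 1 (the smaller timestamp) can enter; after it releases, site 0's request reaches the head of every queue.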

  • Lamport's Algorithm

    Performance:

    3(N-1) messages per CS invocation: (N-1) REQUEST, (N-1) REPLY, (N-1) RELEASE messages.

    Synchronization delay: T.

    Optimization:

    Suppress REPLY messages: e.g., if Sj receives a REQUEST message from Si after sending its own REQUEST message with a timestamp higher than Si's, it need not send a REPLY.

    Messages reduced to between 2(N-1) and 3(N-1).

  • Lamport's Algorithm: Example

    Figure (Steps 1-2): S1 requests with timestamp (2,1) and S2 with (1,2); both requests are broadcast and queued at S1, S2 and S3 in the order (1,2), (2,1). S2 enters the CS, since its request has the smaller timestamp and is at the head of every queue.

    Figure (Steps 3-4): S2 leaves the CS and sends RELEASE messages; its request (1,2) is removed from all queues, leaving (2,1) at the head, and S1 then enters the CS.

  • Ricart-Agrawala Algorithm

    Requesting CS: Si sends a time-stamped REQUEST message.

    Sj sends a REPLY to Si if:

    Sj is neither requesting nor executing the CS, or

    Sj is requesting the CS and Si's timestamp is smaller than that of Sj's own request.

    The request is deferred otherwise.

    Executing CS: after Si has received a REPLY from all sites in its request set.

    Releasing CS: send a REPLY to all deferred requests; i.e., a site's REPLY messages are withheld only by sites with smaller timestamps.
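Sj's reply decision can be captured in a small predicate (a sketch; the state names and parameter names are my own):

```python
def should_reply(j_state, j_req_ts, i_req_ts):
    """Sj's decision on an incoming REQUEST from Si (Ricart-Agrawala).

    j_state: 'idle', 'requesting', or 'executing'.
    Timestamps are (clock, site_id) pairs compared lexicographically,
    so ties on the clock are broken by the site id.
    """
    if j_state == 'idle':
        return True                   # neither requesting nor executing
    if j_state == 'requesting':
        return i_req_ts < j_req_ts    # the smaller timestamp wins
    return False                      # executing: defer the REPLY
```

For example, a requesting Sj with timestamp (2,1) replies to a request stamped (1,2), but defers one stamped (3,2).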

  • Ricart-Agrawala: Performance

    Performance: 2(N-1) messages per CS execution: (N-1) REQUEST + (N-1) REPLY.

    Synchronization delay: T.

    Optimization [proposed by Carvalho and Roucairol]: when Si receives a REPLY message from Sj, it retains the authorization to access the CS until Sj sends a REQUEST message and Si sends a REPLY message.

    Si can access the CS repeatedly until then.

    A site requests permission from a dynamically varying set of sites: 0 to 2(N-1) messages.

  • Ricart-Agrawala: Example

    Figure (Steps 1-2): S1 broadcasts REQUEST (2,1) and S2 broadcasts REQUEST (1,2). S2 receives REPLY messages from S1 and S3, while S2 defers its REPLY to S1 because S1's timestamp is larger. S2 enters the CS.

    Figure (Step 3): on leaving the CS, S2 sends the deferred REPLY to S1, and S1 enters the CS.

  • Maekawa's Algorithm

    A site requests permission only from a subset of sites.

    Request sets of sites Si & Sj: Ri, Rj are such that Ri and Rj have at least one common site (Sk). Sk mediates conflicts between Ri and Rj.

    A site can send only one REPLY message at a time, i.e., a site can send a REPLY message only after receiving a RELEASE message for the previous REPLY.

    Request set rules:

    Sets Ri and Rj have at least one common site.

    Si is always in Ri.

    Cardinality of Ri, i.e., the number of sites in Ri, is K.

    Any site Si is contained in K request sets. N = K(K - 1) + 1 -> K is approximately the square root of N.

  • Maekawa's Algorithm ...

    Requesting CS: Si sends REQUEST(i) to the sites in Ri.

    Sj sends a REPLY to Si if Sj has NOT sent a REPLY message to any site since it received the last RELEASE message.

    Otherwise, Sj queues up Si's request.

    Executing CS: after getting a REPLY from all sites in Ri.

    Releasing CS: send RELEASE(i) to all sites in Ri.

    Any Sj, on receiving a RELEASE message, sends a REPLY message to the next request in its queue.

    If the queue is empty, Sj updates its status to indicate receipt of the RELEASE.

  • Request Subsets

    Example k = 2; (N = 3).

    R1 = {1, 2}; R3 = {1, 3}; R2 = {2, 3}

    Example k = 3; N = 7.

    R1 = {1, 2, 3}; R4 = {1, 4, 5}; R6 = {1, 6, 7};

    R2 = {2, 4, 6}; R5 = {2, 5, 7}; R7 = {3, 4, 7};

    R3 = {3, 5, 6}
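The K = 3, N = 7 request sets above can be checked mechanically against the three rules (membership, cardinality, pairwise intersection):

```python
# The N = 7 request sets from the slide (K = 3).
R = {
    1: {1, 2, 3}, 2: {2, 4, 6}, 3: {3, 5, 6},
    4: {1, 4, 5}, 5: {2, 5, 7}, 6: {1, 6, 7}, 7: {3, 4, 7},
}

for i in R:
    assert i in R[i]           # Si is always in Ri
    assert len(R[i]) == 3      # |Ri| = K = 3

for i in R:
    for j in R:
        assert R[i] & R[j]     # every pair of sets shares a mediator site

# Every site appears in exactly K = 3 request sets.
for i in R:
    assert sum(i in R[j] for j in R) == 3
```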

  • Maekawa's Algorithm ...

    Performance: synchronization delay 2T.

    Messages: 3*sqrt(N) (one REQUEST, REPLY, and RELEASE per site in the request set).

    Deadlocks: message deliveries are not ordered.

    Assume Si, Sj, Sk concurrently request the CS, with Ri ∩ Rj = {Sij}, Rj ∩ Rk = {Sjk}, Rk ∩ Ri = {Ski}.

    It is possible that:

    Sij is locked by Si (forcing Sj to wait at Sij),

    Sjk by Sj (forcing Sk to wait at Sjk),

    Ski by Sk (forcing Si to wait at Ski)

    -> deadlock among Si, Sj, and Sk.

  • Handling Deadlocks in Maekawa's Algorithm

    A site yields to a request with a smaller timestamp.

    A site suspects a deadlock when it is locked by a request with a higher timestamp (lower priority).

    Deadlock handling messages:

    FAILED: from Si to Sj -> Si has granted permission to a higher priority request.

    INQUIRE: from Si to Sj -> Si would like to know whether Sj has succeeded in locking all sites in Sj's request set.

    YIELD: from Si to Sj -> Si is returning permission to Sj so that Sj can yield to a higher priority request.

  • Handling Deadlocks

    REQUEST(tsi, i) arrives at Sj while Sj is locked by Sk -> Sj sends FAILED to Si if Si's request has the higher timestamp; otherwise, Sj sends INQUIRE(j) to Sk.

    INQUIRE(j) arrives at Sk -> Sk sends YIELD(k) to Sj if Sk has received a FAILED message from a site in its request set, or if Sk has sent a YIELD and has not received a new REPLY.

    YIELD(k) arrives at Sj -> Sj assumes it has been released by Sk, places Sk's request in its queue appropriately, and sends REPLY(j) to the top request in its queue.

    Sites may exchange these messages even if there is no real deadlock. Maximum number of messages per CS request: 5*sqrt(N).

  • Next Class

    Token-Based Algorithms For Distributed

    Mutual Exclusion


  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 5

    Sunday, Aug 24-2014

  • Distributed Mutual Exclusion

    - Non-Token Based Algorithms

    1. Lamport's Algorithm (Covered in Previous Class)

    2. Ricart-Agrawala Algorithm (Covered in Previous Class)

    3. Maekawa's Algorithm (Will be Covered Today)

    - Token Based Algorithms (Will be Covered Today)

    1. Suzuki-Kasami Algorithm

    2. Singhal's Heuristic Search

    3. Raymond's Tree-Based Algorithm

    Today's Lecture


  • Token-based Algorithms

    A unique token circulates among the participating sites.

    A site can enter the CS if it holds the token.

    Token-based approaches use sequence numbers instead of timestamps.

    A request for the token contains a sequence number.

    Sequence numbers of sites advance independently.

    The correctness issue is trivial, since only one token is present -> only one site can enter the CS.

    Deadlock and starvation issues remain to be addressed.

  • Suzuki-Kasami Algorithm

    If a site without the token needs to enter the CS, it broadcasts a REQUEST-for-token message to all other sites.

    Token: (a) a queue Q of requesting sites; (b) an array LN[1..N], where LN[j] is the sequence number of the most recent request executed by site j.

    The token holder sends the token to a requestor if it is not inside the CS; otherwise, it sends it after exiting the CS.

    The token holder can make multiple CS accesses.

    Design issues:

    Distinguishing outdated REQUEST messages.

    Format: REQUEST(j, n) -> site j making its nth request.

    Each site keeps RNi[1..N], where RNi[j] is the largest sequence number of any request seen from j.

    Determining which sites have outstanding token requests:

    if LN[j] = RNi[j] - 1, then Sj has an outstanding request.

  • Suzuki-Kasami Algorithm ...

    Passing the token:

    After finishing the CS (assuming Si has the token), set LN[i] := RNi[i].

    The token consists of Q and LN; Q is a queue of requesting sites.

    The token holder checks, for each j, whether RNi[j] = LN[j] + 1; if so, it places j in Q.

    Send the token to the site at the head of Q.

    Performance:

    0 to N messages per CS invocation.

    Synchronization delay is 0 (if the token holder repeats the CS) or T.
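A minimal sketch of the token bookkeeping for three sites (a synchronous simulation in which broadcasts update the RN arrays directly; not a message-passing implementation):

```python
N = 3
# RN[i][j]: the largest request sequence number Si has seen from Sj.
RN = [[0] * N for _ in range(N)]
# The token: served-request counters LN, queue Q, and current holder.
token = {"LN": [0] * N, "Q": [], "holder": 0}

def request_cs(i):
    if token["holder"] == i:
        return                        # already holds the token
    RN[i][i] += 1
    for j in range(N):                # broadcast REQUEST(i, RN[i][i])
        RN[j][i] = max(RN[j][i], RN[i][i])

def release_cs(i):
    token["LN"][i] = RN[i][i]         # record Si's completed request
    for j in range(N):                # enqueue sites with outstanding requests
        if j not in token["Q"] and RN[i][j] == token["LN"][j] + 1:
            token["Q"].append(j)
    if token["Q"]:                    # pass the token to the head of Q
        token["holder"] = token["Q"].pop(0)

request_cs(1)   # site 1 asks for the token, which site 0 holds
release_cs(0)   # site 0 finishes its CS and passes the token to site 1
```

After this run the token holder is site 1 and the queue is empty, mirroring the slide's five-site walk-through on a smaller scale.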

  • Suzuki-Kasami Example

    Consider 5 sites S1..S5. Initially RN1 = RN2 = RN3 = RN4 = RN5 = [0,0,0,0,0], the token's LN = [0,0,0,0,0], the token queue Q is empty, and the token is at site S1.

  • Site S1 wants to enter CS

    S1 does not execute the request part of the algorithm.

    S1 enters the CS, as it holds the token.

    When S1 releases the CS, it performs the following tasks:

    (i) LN[1] := RN1[1] = 0, i.e., the same as before.

    (ii) If, during S1's CS execution, a request from another site (say S2) has arrived, then that site's id is entered into the token queue, provided RN1[2] = LN[2] + 1.

    So if the site holding the token (not having received it from another site) enters the CS, the overall state of the system does not change.

  • Site S2 wants to enter CS

    S2 does not have the token, so it sends REQUEST(2,1) to every other site.

    Sites S1, S3, S4 and S5 update their RN arrays to [0,1,0,0,0].

    S2's request may arrive at S1 while S1 is executing the CS; in that case S1 does not pass the token immediately. It finishes its CS and then adds S2 to the token queue, because at that point RN1[2] = 1 = LN[2] + 1.

    Now, since the queue is not empty, S1 passes the token to S2, removing S2's entry from the token queue.

  • Site S2 Sends REQUEST(2,1)

    Figure: every site's RN becomes [0,1,0,0,0]; the token (still at S1) has LN = [0,0,0,0,0] and Q = {S2} (S2 is inserted only when S1 leaves the CS).

  • Site S1 Sends Token to S2

    Figure: the RN arrays remain [0,1,0,0,0] at every site; the token is now at S2, with LN = [0,0,0,0,0] and Q = {}.

    LN

  • Site S2 Leaves CS

    Figure: the token stays at S2 with Q = {}; LN is updated to [0,1,0,0,0], recording that S2's first request has been served.

  • Now suppose S2 again wants to enter CS

    S2 can enter the CS directly: the token queue is empty (no other site has requested the CS) and the token is currently held by S2.

    When S2 releases the CS, it updates LN[2] := RN2[2], which is still 1.

  • Salient Points

    A site updates its RN table as soon as it receives a REQUEST message from any other site. (At this point the site may hold an idle token or may be executing the CS.)

    When a site enters the CS without sending a request message (i.e., it re-enters the CS while holding the token), its RN entry and the token's LN entry remain unchanged.

    A site can be added to the token queue only by the site that is releasing the CS.

    A site holding the token can repeatedly enter the CS until it sends the token to some other site.

    A site gives priority to other sites with outstanding CS requests over its own pending requests.

  • Singhal's Heuristic Algorithm

    Instead of broadcasting, each site maintains information on the other sites and guesses which sites are likely to have the token.

    Data structures: Si maintains SVi[1..N] and SNi[1..N] for storing information on other sites: their state and highest known sequence number.

    The token contains 2 arrays: TSV[1..N] and TSN[1..N].

    States of a site:

    R : requesting CS

    E : executing CS

    H : holding token, idle

    N : none of the above

    Initialization: SVi[j] := N for j = N .. i; SVi[j] := R for j = i-1 .. 1; SNi[j] := 0 for j = 1..N. S1 (site 1) is in state H.

    Token: TSV[j] := N & TSN[j] := 0, j = 1 .. N.

  • Singhal's Heuristic Algorithm

    Requesting CS: if Si has no token and requests the CS:

    SVi[i] := R. SNi[i] := SNi[i] + 1.

    Send REQUEST(i, sn) to all sites Sj for which SVi[j] = R (sn: sequence number, the updated value of SNi[i]).

    Receiving REQUEST(i, sn) at Sj: if sn <= SNj[i], the message is outdated and is ignored. Otherwise set SNj[i] := sn and act on Sj's own state:

    SVj[j] = N -> SVj[i] := R.

    SVj[j] = R -> if SVj[i] != R, set it to R and send REQUEST(j, SNj[j]) to Si; else do nothing.

    SVj[j] = E -> SVj[i] := R.

    SVj[j] = H -> SVj[i] := R, TSV[i] := R, TSN[i] := sn, SVj[j] := N. Send the token to Si.

    Executing CS: after getting the token, set SVi[i] := E.

  • Singhal's Heuristic Algorithm

    Releasing CS: SVi[i] := N, TSV[i] := N. Then, for every other Sj:

    if SNi[j] > TSN[j], then {TSV[j] := SVi[j]; TSN[j] := SNi[j]}

    else {SVi[j] := TSV[j]; SNi[j] := TSN[j]}

    If SVi[j] = N for all j, then set SVi[i] := H; else send the token to a site Sj with SVi[j] = R.

    Fairness depends on the choice of Sj, since no queue is maintained in the token; arbitration rules are used to ensure fairness.

    Performance: low to moderate loads: an average of N/2 messages.

    High loads: N messages (all sites request the CS).

    Synchronization delay: T.

  • Singhal: Example

    (a) Initial pattern: each row in the SV matrix has an increasing number of R entries. The staircase pattern can be identified by noting that S1 has 1 R, S2 has 2 Rs, and so on; the order of occurrence of the Rs in a row does not matter.

    (b) Pattern after S3 gets the token from S1.

  • Singhal: Example

    Assume there are 3 sites in the system. Initially:

    Site 1: SV1[1] = H, SV1[2] = N, SV1[3] = N. SN1[1], SN1[2], SN1[3] are 0.

    Site 2: SV2[1] = R, SV2[2] = N, SV2[3] = N. SNs are 0.

    Site 3: SV3[1] = R, SV3[2] = R, SV3[3] = N. SNs are 0.

    Token: TSVs are N; TSNs are 0.

    Assume site 2 requests the token:

    S2 sets SV2[2] = R, SN2[2] = 1.

    S2 sends REQUEST(2,1) to S1 (since only S1 is marked R in SV2).

    S1 receives the REQUEST and accepts it, since SN1[2] is smaller than the message sequence number.

    Since SV1[1] is H: SV1[2] = R, TSV[2] = R, TSN[2] = 1, SV1[1] = N. S1 sends the token to S2.

    S2 receives the token and sets SV2[2] = E. After exiting the CS, SV2[2] = TSV[2] = N.

    S2 updates SN, SV, TSN, TSV. Since nobody is requesting, SV2[2] = H.

    Assume S3 makes a REQUEST now. It is sent to both S1 and S2. Only S2 responds, since only SV2[2] is H (SV1[1] is N now).

  • Raymond's Algorithm

    Sites are arranged in a logical directed tree. Root: the token holder. Edges: directed towards the root.

    Every site has a variable holder that points to an immediate neighbor on the directed path towards the root (the root's holder points to itself).

    Requesting CS: if Si does not hold the token and requests the CS, it sends a REQUEST upwards provided its request_q is empty; it then adds its request to request_q.

    Non-empty request_q -> send a REQUEST message for the top entry in the queue (if not done before).

    A site on the path to the root receiving a REQUEST propagates it up if its request_q is empty, and adds the request to its request_q.

    The root, on receiving a REQUEST, sends the token to the site that forwarded the message and sets its holder to that forwarding site.

    Any Si receiving the token deletes the top entry from its request_q, sends the token to that site, and sets holder to point to it. If request_q is non-empty now, it sends a REQUEST message to the holder site.

  • Raymond's Algorithm

    Executing CS: on getting the token with its own entry at the top of request_q, a site deletes the top of request_q and enters the CS.

    Releasing CS: if request_q is non-empty, delete the top entry, send the token to that site, and set holder to that site.

    If request_q is still non-empty, send a REQUEST message to the holder site.

    Performance: an average of O(log N) messages, as the average distance between 2 nodes in the tree is O(log N).

    Synchronization delay: (T log N) / 2, as the average distance between 2 sites successively executing the CS is (log N) / 2.

    Greedy approach: an intermediate site getting the token may enter the CS instead of forwarding it down. This affects fairness and may cause starvation.
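A single-request sketch of the request/token-passing path on a three-node chain, with recursion standing in for message passing (the class and function names are illustrative; the full algorithm also handles concurrent requests via the queues):

```python
class RNode:
    def __init__(self, name):
        self.name, self.request_q = name, []
        self.holder = self   # holder == self means this node has the token

def request(node):
    # One outstanding request: queue it, then walk the REQUEST up the
    # holder pointers toward the current token holder (the root).
    node.request_q.append(node)
    cur = node
    while cur.holder is not cur:
        cur.holder.request_q.append(cur)  # parent remembers which child asked
        cur = cur.holder
    return pass_token(cur)

def pass_token(holder):
    # The holder forwards the token toward its top queued request,
    # reversing the tree edge at each hop.
    nxt = holder.request_q.pop(0)
    if nxt is holder:
        return holder                     # the holder itself enters the CS
    holder.holder = nxt
    nxt.holder = nxt
    return pass_token(nxt)

root, a, b = RNode("root"), RNode("a"), RNode("b")
a.holder, b.holder = root, a              # tree: b -> a -> root (token at root)
winner = request(b)                       # b requests and receives the token
```

After the run, b is the new root: b's holder points to itself, and the edges along the path (a, root) now point towards b, as in the example figure.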

  • Raymond's Algorithm: Example

    Figure (Steps 1-2): in a tree of sites S1..S7 with S1 holding the token, a token request travels up the tree from the requesting site to S1, and the token is then passed down along the reverse path.

    Figure (Step 3): the requesting site becomes the new token holder (root), and the edges on the path are reversed to point towards it.

  • Comparison

    Non-Token          Resp. Time (ll)   Sync. Delay    Messages (ll)   Messages (hl)
    Lamport            2T+E              T              3(N-1)          3(N-1)
    Ricart-Agrawala    2T+E              T              2(N-1)          2(N-1)
    Maekawa            2T+E              2T             3*sqrt(N)       5*sqrt(N)

    Token              Resp. Time (ll)   Sync. Delay    Messages (ll)   Messages (hl)
    Suzuki-Kasami      2T+E              T              N               N
    Singhal            2T+E              T              N/2             N
    Raymond            T(log N)+E        T(log N)/2     log(N)          4

    (ll = light load, hl = heavy load)

  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 6

    Friday, Aug 29-2014

  • Distributed Mutual Exclusion

    - Non-Token Based Algorithms

    1. Lamport's Algorithm (Covered in Previous Class)

    2. Ricart-Agrawala Algorithm (Covered in Previous Class)

    3. Maekawa's Algorithm (Covered in Previous Class)

    - Token Based Algorithms

    1. Suzuki-Kasami Algorithm (Covered in Previous Class)

    2. Singhal's Heuristic Search (Covered in Previous Class)

    * Singhal's Algorithm Example (Will be Covered Today)

    3. Raymond's Algorithm (Will be Covered Today)

    - Comparison of DMS Algorithms (Will be Covered Today)

    - Introduction to Distributed Deadlock (Will be Covered Today)

    Today's Lecture

  • Singhal's DMS Algorithm Example

    Initial state (5 sites; rows S5 down to S1, columns 1..5):

    SV:                    SN:
    S5: R R R R N          0 0 0 0 0
    S4: R R R N N          0 0 0 0 0
    S3: R R N N N          0 0 0 0 0
    S2: R N N N N          0 0 0 0 0
    S1: H N N N N          0 0 0 0 0

    Token: TSV = N N N N N, TSN = 0 0 0 0 0.

    S5, S4, S3, S2 request the CS:
    S5 -> S4, S3, S2, S1;  S4 -> S3, S2, S1;  S3 -> S2, S1;  S2 -> S1

    SV:                    SN:
    S5: R R R R R          0 0 0 0 1
    S4: R R R R N          0 0 0 1 0
    S3: R R R N N          0 0 1 0 0
    S2: R R N N N          0 1 0 0 0
    S1: H N N N N          0 0 0 0 0

  • [Figure: updated SV and SN arrays at sites S1-S5 after this step]

    S5's CS request is received by S4, S3, and S2.

    In response to S5's CS request, sites S4, S3, and S2 will also send their own requests to S5.

  • [Figure: updated SV and SN arrays at sites S1-S5 after this step]

    S4's CS request is received by S3 and S2.

    In response to S4's CS request, sites S3 and S2 will also send their requests back to S4.

  • [Figure: updated SV and SN arrays at sites S1-S5 after this step]

    S3's CS request is received by S2.

    In response to S3's CS request, site S2 will also send its request back to S3.

  • [Figure: SV and SN arrays at the sites, and the token's TSV and TSN, as the requests reach S1]

    S5's CS request is received by S1; S1 will pass the token to S5.

    The CS requests of S4 and S3 are also received by S1.

  • [Figure: SV and SN arrays while S5 executes the CS (state E), and after the release, with the token's TSV and TSN updated]

    S5 receives the token from S1 and enters the CS.

    S5 then releases the CS.

  • 10

    Raymond's Algorithm

    Sites are arranged in a logical directed tree. Root: the token holder. Edges: directed towards the root.

    Every site has a variable holder that points to an immediate neighbour on the directed path towards the root. (The root's holder points to itself.)

    Requesting CS: if Si does not hold the token and wants to enter the CS, it sends a REQUEST upwards, provided its request_q is empty; it then adds its own request to request_q.

    A non-empty request_q -> send a REQUEST message for the top entry in the queue (if not already done).

    A site on the path to the root receiving a REQUEST -> propagate it up if its request_q is empty; add the request to its request_q.

    The root, on receiving a REQUEST -> send the token to the site that forwarded the message; set holder to that forwarding site.

    Any Si receiving the token -> delete the top entry from request_q, send the token to that site, and set holder to point to it. If request_q is now non-empty, send a REQUEST message to the new holder site.

  • 11

    Raymond's Algorithm

    Executing CS: a site enters the CS when it receives the token and its own entry is at the top of its request_q; it deletes the top of request_q and enters the CS.

    Releasing CS: if request_q is non-empty, delete the top entry, send the token to that site, and set holder to that site. If request_q is still non-empty, send a REQUEST message to the new holder site.

    Performance: average number of messages is O(log N), as the average distance between two nodes in the tree is O(log N).

    Synchronization delay: (T log N)/2, as the average distance between two sites that successively execute the CS is (log N)/2.

    Greedy approach: an intermediate site receiving the token may enter the CS instead of forwarding it down. This affects fairness and may cause starvation.
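The request/execute/release rules above can be condensed into a toy, single-threaded simulation. This is a sketch, not the lecture's code: network messages are replaced by direct method calls (so ordering is synchronous), and all names (Node, request_cs, receive_token) are illustrative.

```python
from collections import deque

class Node:
    """One site in Raymond's tree-based token algorithm (toy sketch)."""

    def __init__(self, name):
        self.name = name
        self.holder = self        # neighbour on the path to the token; root points to itself
        self.request_q = deque()  # queued requesters (neighbours, or this site itself)
        self.in_cs = False

    def has_token(self):
        return self.holder is self

    def request_cs(self):
        if self.has_token() and not self.in_cs and not self.request_q:
            self.in_cs = True                  # idle token is already here
            return
        was_empty = not self.request_q
        self.request_q.append(self)            # queue our own request
        if was_empty:
            self.holder.receive_request(self)  # REQUEST goes up only for the queue head

    def receive_request(self, sender):
        was_empty = not self.request_q
        self.request_q.append(sender)
        if self.has_token():
            if not self.in_cs:
                self._forward_token()          # token holder sends the token down
        elif was_empty:
            self.holder.receive_request(self)  # propagate the REQUEST upward

    def receive_token(self):
        self.holder = self
        self._forward_token()

    def _forward_token(self):
        top = self.request_q.popleft()
        if top is self:
            self.in_cs = True                  # our own request is at the head: enter CS
        else:
            self.holder = top                  # tree edge reverses toward the new holder
            top.receive_token()
            if self.request_q:                 # more waiters: ask for the token back
                self.holder.receive_request(self)

    def release_cs(self):
        self.in_cs = False
        if self.request_q:
            self._forward_token()

# Demo: a chain S1 (token holder) <- S2 <- S3; S3 asks for the CS.
s1, s2, s3 = Node("S1"), Node("S2"), Node("S3")
s2.holder, s3.holder = s1, s2
s3.request_cs()   # token travels S1 -> S2 -> S3
```

Note how each edge flips direction as the token passes over it, so holder pointers always lead to the current token holder.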

  • 12

    Raymond's Algorithm: Example

    [Figure: logical tree of sites S1-S7. Step 1: the token holder is the root and a token request travels up the tree. Step 2: the token travels toward the requester.]

  • 13

    Raymond's Algorithm: Example

    [Figure: the same tree after Step 3; the token has reached the new token holder.]

  • 14

    Comparison

    Non-Token         Resp. Time (ll)   Sync. Delay   Messages (ll)   Messages (hl)
    Lamport           2T+E              T             3(N-1)          3(N-1)
    Ricart-Agrawala   2T+E              T             2(N-1)          2(N-1)
    Maekawa           2T+E              2T            3*sqrt(N)       5*sqrt(N)

    Token             Resp. Time (ll)   Sync. Delay   Messages (ll)   Messages (hl)
    Suzuki-Kasami     2T+E              T             N               N
    Singhal           2T+E              T             N/2             N
    Raymond           T(log N)+E        T*log(N)/2    log(N)          4

    (ll = light load, hl = heavy load; T = message propagation delay, E = CS execution time)

  • 15

    Deadlock

    Deadlock is a situation where a process, or a set of processes, is blocked on an event that never occurs.

    Processes, while holding some resources, may request additional resources that are held by other processes.

    The processes are then in a circular wait for the resources.

  • 16

    Deadlock vs Starvation

    Starvation occurs when a process waits for a resource that continuously becomes available but is never allocated to that process.

    Two main differences:

    - In starvation it is not certain whether a process will ever get the requested resource, whereas a deadlocked process is permanently blocked because the required resources never become available.

    - In starvation the resource under contention is in continuous use, whereas this is not true in the case of deadlock.

  • 17

    Necessary Conditions for Deadlock

    Exclusive access

    Wait while hold

    No Preemption

    Circular wait

  • 18

    Models of Deadlock

    Single-Unit Request Model

    - A process is restricted to requesting only one resource at a time

    - The outdegree in the WFG is at most one

    - A cycle in the WFG means deadlock

    [Figure: WFG over processes P1, P2, P3]

  • 19

    Models of Deadlock

    AND Request Model

    - A process can simultaneously request multiple resources

    - The process remains blocked until all the requested resources are granted

    - The outdegree in the WFG can be more than one

    - A cycle in the WFG means the system is deadlocked

    - A process can be involved in more than one deadlock

    [Figure: WFG over processes P1, P2, P3, P4]

  • 20

    Models of Deadlock

    OR Request Model

    - A process can simultaneously request multiple resources

    - The process remains blocked until it is granted any one of the requested resources

    - The outdegree in the WFG can be more than one

    - A cycle in the WFG is not a sufficient condition for deadlock

    - A knot in the WFG is a sufficient condition for deadlock

    - A knot is a subset of the graph such that, starting from any node in the subset, it is impossible to leave the knot by following the edges of the graph

  • Cycle vs Knot

    [Figure: WFG over P1-P5 containing a cycle but no knot: deadlock in the AND model, but no deadlock in the OR model]

    [Figure: WFG over P1-P5 containing both a cycle and a knot: deadlock in both the AND and OR models]
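The knot test can be checked mechanically: a process lies in a knot exactly when every node reachable from it can reach it back. A minimal sketch; the example edge sets are hypothetical (the actual figure edges are not recoverable from the transcript), chosen to reproduce a cycle-without-knot and a cycle-with-knot:

```python
def reachable(wfg, start):
    """All nodes reachable from `start` by following WFG edges."""
    seen, stack = set(), [start]
    while stack:
        for v in wfg.get(stack.pop(), ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def in_knot(wfg, p):
    """p lies in a knot iff every node reachable from p can reach p back:
    once inside the subset, following edges can never lead out."""
    r = reachable(wfg, p)
    return bool(r) and all(p in reachable(wfg, u) for u in r)

# Hypothetical WFGs echoing the figures above (illustrative edges only):
cycle_no_knot  = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1", "P4"], "P4": ["P5"], "P5": []}
cycle_and_knot = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1", "P4"], "P4": ["P5"], "P5": ["P1"]}
```

In the first graph the edge P4 -> P5 leads out of the cycle, so P1 is deadlocked under the AND model (cycle) but not under the OR model (no knot).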

  • Resources

    Reusable (CPU, main memory, I/O devices)

    Consumable (messages, interrupt signals)

  • 23

    Distributed Deadlock Detection

    Assumptions:

    a. System has only reusable resources

    b. Only exclusive access to resources

    c. Only one copy of each resource

    d. States of a process: running or blocked

    e. Running state: process has all the resources

    f. Blocked state: waiting on one or more resources

  • 24

    Resource vs Communication Deadlocks

    Resource Deadlocks

    A process needs multiple resources for an activity. Deadlock occurs if each process in a set requests resources held by another process in the same set, and must receive all the requested resources before it can proceed.

    Communication Deadlocks

    Processes in a set wait to communicate with other processes in the set. Each process in the set is waiting on another process's message, and no process in the set initiates a message until it receives the message for which it is waiting.

  • 25

    Graph Models

    The nodes of the graph are processes; the edges represent pending requests for, or assignments of, resources.

    Wait-for Graph (WFG): P1 -> P2 implies P1 is waiting for a resource held by P2.

    Transaction-Wait-For Graph (TWF): a WFG in databases.

    Deadlock: a directed cycle in the graph.

    Cycle example: [Figure: two-process cycle between P1 and P2]
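The "directed cycle" criterion is what a detector actually runs on the WFG. A minimal sketch over an adjacency mapping (function and variable names are illustrative, not from the lecture):

```python
def find_cycle(wfg):
    """Return one directed cycle in a wait-for graph as a node list, or None.
    `wfg` maps each process to the processes it waits for."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / on current DFS path / finished
    color = {}

    def dfs(p, path):
        color[p] = GREY
        path.append(p)
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GREY:       # back edge: cycle found
                return path[path.index(q):] + [q]
            if color.get(q, WHITE) == WHITE:
                found = dfs(q, path)
                if found:
                    return found
        path.pop()
        color[p] = BLACK
        return None

    for p in list(wfg):
        if color.get(p, WHITE) == WHITE:
            cycle = dfs(p, [])
            if cycle:
                return cycle
    return None
```

For the two-node example above, `find_cycle({"P1": ["P2"], "P2": ["P1"]})` reports the cycle P1 -> P2 -> P1.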

  • 26

    Graph Models

    Wait-for Graph (WFG): P1 -> P2 implies P1 is waiting for a resource held by P2.

    [Figure: resource-allocation graph with processes P1, P2 and resources R1, R2, showing a request edge and an assignment edge]

  • 27

    AND, OR Models

    AND Model

    A process/transaction can simultaneously request multiple resources. It remains blocked until it is granted all of the requested resources.

    OR Model

    A process/transaction can simultaneously request multiple resources. It remains blocked until any one of the requested resources is granted.

  • 28

    Sufficient Condition

    [Figure: WFG over processes P1-P6. Is the system deadlocked?]

  • 29

    AND, OR Models

    AND Model: presence of a cycle.

    [Figure: WFG over P1-P5 containing a cycle]

  • 30

    AND, OR Models

    OR Model: presence of a knot.

    Knot: a subset of the graph such that, starting from any node in the subset, it is impossible to leave the knot by following the edges of the graph.

    [Figure: WFG over P1-P6 containing a knot]

  • 31

    Deadlock Handling Strategies

    Deadlock Prevention: difficult.

    Deadlock Avoidance: before allocation, check for possible deadlocks. Difficult, as it needs global state information at each site that handles resources.

    Deadlock Detection: find cycles. The focus of our discussion.

    Deadlock detection algorithms must satisfy two conditions:

    - No undetected deadlocks.

    - No false deadlocks.

  • 32

    Distributed Deadlocks

    Centralized Control

    A control site constructs wait-for graphs (WFGs) and checks for directed cycles. The WFG can be maintained continuously, or built on demand by requesting WFGs from the individual sites.

    Distributed Control

    The WFG is spread over different sites. Any site can initiate the deadlock detection process.

    Hierarchical Control

    Sites are arranged in a hierarchy. A site checks for cycles only in its descendants.

  • 33

    Centralized Algorithms

    Ho-Ramamoorthy 2-Phase Algorithm

    Each site maintains a status table of all processes initiated at that site; it records all resources locked and all resources being waited on.

    The controller periodically requests the status table from each site.

    The controller then constructs a WFG from these tables and searches for cycles.

    If there are no cycles, there are no deadlocks.

    Otherwise (a cycle exists): request the status tables again.

    Construct a WFG based only on the transactions common to the two consecutive tables.

    If the same cycle is detected again, the system is declared deadlocked.

    It was later proved that a cycle in two consecutive reports need not imply a deadlock; hence, this algorithm can detect false deadlocks.
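The two-phase rule can be sketched as follows. For brevity each report is simplified to a WFG the controller has already assembled from the site tables (a dict mapping each transaction to the transactions it waits on); this data layout is an assumption, since the slides do not fix one.

```python
def cycle_nodes(wfg):
    """Transactions lying on a directed cycle: t is on a cycle iff t can reach itself."""
    def reach(s):
        seen, stack = set(), [s]
        while stack:
            for v in wfg.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    return {t for t in wfg if t in reach(t)}

def two_phase_detect(report1, report2):
    """Ho-Ramamoorthy 2-phase rule: rebuild the WFG over the transactions
    common to two successive reports and look for a cycle there.
    (As noted above, this can still report false deadlocks.)"""
    common = set(report1) & set(report2)
    wfg = {t: {u for u in report2[t] if u in common} for t in common}
    return cycle_nodes(wfg)

# Two successive reports: T1 and T2 wait on each other in both; T3 has vanished.
r1 = {"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}}
r2 = {"T1": {"T2"}, "T2": {"T1"}}
```

Here `two_phase_detect(r1, r2)` flags the persistent T1/T2 cycle, while T3, which appears in only one report, is ignored.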

  • 34

    Centralized Algorithms...

    Ho-Ramamoorthy 1-Phase Algorithm

    Each site maintains two status tables: a resource status table and a process status table.

    Resource table: the transactions that have locked, or are waiting for, each resource.

    Process table: the resources locked by, or waited on by, each transaction.

    The controller periodically collects these tables from each site and constructs a WFG from the transactions common to both tables.

    No cycle: no deadlocks. A cycle means a deadlock.

  • BITS Pilani

    Pilani Campus

    Advance Operating System

    (CS ZG623)

    First Semester 2014-2015

    PANKAJ VYAS

    Department of CS/IS

    ([email protected])

    Lecture 7

    Sunday, Aug 31-2014

  • What Was Covered in the Last Class

    - Introduction to Deadlocks

    - Centralized Algorithms

      - Ho-Ramamoorthy's 2-Phase

      - Ho-Ramamoorthy's 1-Phase

    Today's Lecture

    - Distributed Deadlock Detection Algorithms

      - Path Pushing [will not be covered in the syllabus]

      - Edge Chasing [examples: CMH for the AND model, Sinha-Natarajan, Mitchell-Merritt]

      - Diffusion Computation [examples: CMH for the OR model, Chandy-Herman]

      - Global State Detection [will not be covered in the syllabus]

  • 3

    Distributed Algorithms

    Path-pushing: resource dependency information is disseminated along designated paths in the graph. [Examples: Menasce-Muntz, Obermarck] [Not included in the syllabus]

    Edge-chasing: special messages, or probes, are circulated along the edges of the WFG. A deadlock exists if a probe is received back by its initiator. [Examples: CMH for the AND model, Sinha-Natarajan]

    Diffusion computation: queries on status are sent to the processes in the WFG. [Examples: CMH for the OR model, Chandy-Herman]

    Global state detection: obtain a snapshot of the distributed system. [Examples: Bracha-Toueg, Kshemkalyani-Singhal] [Not included in the syllabus]

  • 4

    Edge-Chasing Algorithm

    Chandy-Misra-Haas Algorithm (AND model):

    A probe(i, j, k) is used by a deadlock detection computation initiated by process Pi; the probe is sent by the home site of Pj to the home site of Pk.

    The probe message is circulated along the edges of the graph. The probe returning to Pi implies deadlock detection.

    Terms used:

    Pj is dependent on Pk if a sequence of waiting processes Pj, Pi1, ..., Pim, Pk exists.

    Pj is locally dependent on Pk if, in addition, Pj and Pk are on the same site.

    Each process Pi maintains an array dependenti: dependenti(j) is true if Pi knows that Pj is dependent on it (initially false for all i and j).

  • 5

    Chandy-Misra-Haas Algorithm

    Sending the probe:
      if Pi is locally dependent on itself then declare deadlock
      else for all Pj and Pk such that
        (a) Pi is locally dependent upon Pj, and
        (b) Pj is waiting on Pk, and
        (c) Pj and Pk are on different sites,
      send probe(i,j,k) to the home site of Pk.

    Receiving the probe:
      if (d) Pk is blocked, and
         (e) dependentk(i) is false, and
         (f) Pk has not replied to all requests of Pj,
      then begin
        dependentk(i) := true;
        if k = i then Pi is deadlocked
        else ...

  • 6

    Chandy-Misra-Haas Algorithm

    Receiving the probe (continued):
      ...
      else for all Pm and Pn such that
        (a) Pk is locally dependent upon Pm, and
        (b) Pm is waiting on Pn, and
        (c) Pm and Pn are on different sites,
      send probe(i,m,n) to the home site of Pn.
      end.

    Performance:
      For a deadlock that spans m processes over n sites, m(n-1)/2 messages are needed.
      Size of a message: 3 words.
      Delay in deadlock detection: O(n).
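The probe rules above can be exercised with a small synchronous simulation. This is a simplified sketch: it ignores the reply bookkeeping of condition (f), delivers probes from a queue instead of over a network, and takes `waits_for` and `site_of` as assumed inputs describing the distributed WFG.

```python
from collections import defaultdict, deque

def cmh_detect(initiator, waits_for, site_of):
    """Chandy-Misra-Haas AND-model probe computation (simplified sketch).

    waits_for[p]: processes that the blocked process p is waiting on.
    site_of[p]:   home site of p.
    Returns True iff a probe initiated by `initiator` returns to it."""
    dependent = defaultdict(set)  # dependent[k]: initiators Pk already knows about
    probes = deque()

    def emit_probes(i, start):
        # Follow local wait edges within a site; emit probe(i, j, k) for each
        # wait edge Pj -> Pk that crosses a site boundary.
        seen, stack = {start}, [start]
        while stack:
            j = stack.pop()
            for k in waits_for.get(j, ()):
                if site_of[j] != site_of[k]:
                    probes.append((i, j, k))
                elif k not in seen:
                    seen.add(k)
                    stack.append(k)

    emit_probes(initiator, initiator)
    while probes:
        i, j, k = probes.popleft()
        if k == i:
            return True            # probe came back to its initiator: deadlock
        if i not in dependent[k]:  # first probe on Pi's behalf reaching Pk
            dependent[k].add(i)
            emit_probes(i, k)
    return False
```

The `dependent` sets are what keep the message count bounded: a process propagates a given initiator's probe at most once.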

  • 7

    C-M-H Algorithm: Example

    [Figure: WFG over processes P1-P7 spanning Site 1, Site 2, and Site 3]

    P1 initiates deadlock detection by sending probe (1,1,2) to P2. The probe then propagates along the WFG edges as (1,2,3), (1,2,4), (1,4,5), (1,5,6), (1,6,7), and finally (1,7,1); the probe (1,7,1) returning to P1 detects the deadlock.

  • 8

    Diffusion-Based Algorithm

    CMH Algorithm for the OR Model

    Initiation by a blocked process Pi:
      send query(i,i,j) to all processes Pj in the dependent set DSi of Pi;
      numi(i) := |DSi|; waiti(i) := true;

    A blocked process Pk receiving query(i,j,k):
      if this is the engaging query for process Pk /* first query from Pi */
      then send query(i,k,m) to all Pm in DSk;
           numk(i) := |DSk|; waitk(i) := true;
      else if waitk(i) then send a reply(i,k,j) to Pj.

    Process Pk receiving reply(i,j,k):
      if waitk(i) then
        numk(i) := numk(i) - 1;
        if numk(i) = 0 then
          if i = k then declare a deadlock
          else send reply(i,k,m) to Pm, which sent the engaging query.
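A compact synchronous sketch of this query/reply wave follows. An `engaged` map stands in for the engaging-query bookkeeping, and the wait flags are implicit because every blocked process in this toy stays blocked; names (`diffusion_detect`, `dep_set`) are illustrative.

```python
def diffusion_detect(initiator, dep_set):
    """CMH OR-model diffusion computation (synchronous toy sketch).

    dep_set[p]: dependent set DS_p of a blocked process p (empty -> p is active).
    Returns True iff the initiator declares deadlock."""
    engaged = {}          # p -> [engager, outstanding reply count]
    deadlocked = False

    def query(i, j, k):   # query(i,j,k) sent by Pj, received by Pk
        if not dep_set.get(k):
            return                               # Pk is active: it never replies
        if k not in engaged:                     # engaging query: spread the wave
            engaged[k] = [j, len(dep_set[k])]
            for m in dep_set[k]:
                query(i, k, m)
        else:                                    # Pk already engaged, still waiting
            reply(i, k, j)

    def reply(i, j, k):   # reply(i,j,k) sent by Pj, received by Pk
        nonlocal deadlocked
        engaged[k][1] -= 1
        if engaged[k][1] == 0:
            if k == i:
                deadlocked = True                # every query answered: knot
            else:
                reply(i, k, engaged[k][0])       # answer our own engaging query

    engaged[initiator] = [None, len(dep_set[initiator])]
    for m in dep_set[initiator]:
        query(initiator, initiator, m)
    return deadlocked
```

An active process silently drops queries, so its engager's reply count never reaches zero; that is exactly why a cycle that can escape to an active process is not reported as an OR-model deadlock.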

  • 9

    Diffusion Algorithm: Example

    [Figure: example run of the diffusion computation; the figure is not recoverable from the transcript]

  • 11

    Engaging Query

    How is an engaging query distinguished? A query(i,j,k) from the initiator carries a unique sequence number in addition to the tuple (i,j,k). This sequence number is used to identify subsequent queries.

    For example, when query(1,7,1) is received by P1 from P7, P1 checks the sequence number along with the tuple and recognizes that the query was initiated by itself, so it is not an engaging query. Hence, P1 sends a reply back to P7 instead of forwarding the query on all its outgoing links.

  • Mitchell-Merritt Algorithm (Edge-Chasing Category)

    Each node has two labels: public and private.