
Synchronization Primitives

Several synchronization primitives have been introduced to aid in multithreading the kernel. These primitives are implemented with atomic operations and use appropriate memory barriers, so that users of these primitives do not have to worry about doing so themselves. The primitives are very similar to those used in other operating systems, including mutexes, condition variables, shared/exclusive locks, and semaphores.

Mutexes

The mutex primitive provides mutual exclusion for one or more data objects. Two versions of the mutex primitive are provided: spin mutexes and sleep mutexes.

Spin mutexes are simple spin locks. If the lock is held by another thread when a thread tries to acquire it, the second thread spins waiting for the lock to be released. Because of this spinning, a context switch cannot be performed while holding a spin mutex; otherwise the kernel could deadlock, with the thread owning the spin lock off the CPU and every other CPU spinning on that lock. An exception is the scheduler lock, which must be held during a context switch. As a special case, ownership of the scheduler lock is passed from the thread being switched out to the thread being switched in, satisfying this requirement while still protecting the scheduler data structures. Since the bottom half code that schedules threaded interrupts and runs non-threaded interrupt handlers also uses spin mutexes, spin mutexes must disable interrupts while held to prevent bottom half code from deadlocking against the top half code it interrupts on the current CPU. Disabling interrupts while holding a spin lock has the unfortunate side effect of increasing interrupt latency.

To work around this, a second mutex primitive is provided that performs a context switch when a thread blocks on a mutex. This second type of mutex is dubbed a sleep mutex. Since a thread that contends on a sleep mutex blocks instead of spinning, it is not susceptible to the first type of deadlock with spin locks. Sleep mutexes cannot be used in bottom half code, so they do not need to disable interrupts while held to avoid the second type of deadlock with spin locks.


As with Solaris, when a thread blocks on a sleep mutex, it propagates its priority to the lock owner. Therefore, if a thread blocks on a sleep mutex and its priority is higher than that of the thread that currently owns the sleep mutex, the current owner inherits the priority of the first thread. If the owner of the sleep mutex is blocked on another mutex, the entire chain of threads is traversed, bumping the priority of any threads as needed until a runnable thread is found. This deals with the problem of priority inversion, where a lower priority thread blocks a higher priority thread. By bumping the priority of the lower priority thread until it releases the lock the higher priority thread is blocked on, the kernel guarantees that the higher priority thread will get to run as soon as its priority allows.

These two types of mutexes are similar to the Solaris spin and adaptive mutexes. One difference from the Solaris API is that acquiring and releasing a spin mutex uses different functions than acquiring and releasing a sleep mutex. A difference from the Solaris implementation is that sleep mutexes are not adaptive. Details of the Solaris mutex API and implementation can be found in section 3.5 of [Mauro01].

Condition Variables

Condition variables provide a logical abstraction for blocking a thread while waiting for a condition. A condition variable does not contain the actual condition to test; instead, one locks the appropriate mutex, tests the condition, and then blocks on the condition variable if the condition is not true. To prevent lost wakeups, the mutex is passed in as an interlock when waiting on a condition. FreeBSD's condition variables use an API quite similar to that of Solaris, the only differences being the lack of cv_wait_sig_swap and the addition of the cv_init and cv_destroy constructor and destructor. The implementation also differs from Solaris in that the sleep queue is embedded in the condition variable itself instead of coming from the hashed pool of sleep queues used by sleep and wakeup.
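
To make the lock, test, and wait pattern concrete, here is a minimal user-space sketch using C++11's std::mutex and std::condition_variable rather than FreeBSD's kernel cv API; the names m, cv, and ready are illustrative. The key point is that wait() atomically releases the mutex and blocks, which is exactly the interlock that prevents lost wakeups.

    #include <condition_variable>
    #include <mutex>

    std::mutex m;                 // protects the condition (the interlock)
    std::condition_variable cv;
    bool ready = false;           // the actual condition lives outside the CV

    void waiter()
    {
        std::unique_lock<std::mutex> lock(m);
        // The mutex is held while testing the condition; wait() releases
        // it and blocks atomically, so a wakeup cannot slip in between.
        while (!ready)
            cv.wait(lock);
        // The condition holds here, with the mutex held again.
    }

    void signaler()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            ready = true;         // change the condition under the mutex
        }
        cv.notify_one();          // then wake a waiter
    }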

Shared/Exclusive Locks

Shared/exclusive locks, also known as sx locks, provide simple reader/writer locks. As the name suggests, multiple threads may hold a shared lock simultaneously, but only one thread may hold an exclusive lock. Also, if one thread holds an exclusive lock, no threads may hold a shared lock.


FreeBSD's sx locks have some limitations not present in other reader/writer lock implementations. First, a thread may not recursively acquire an exclusive lock. Second, sx locks do not implement any sort of priority propagation. Finally, although upgrades and downgrades of locks are implemented, they may not block. Instead, if an upgrade cannot succeed, it returns failure, and the programmer is required to explicitly drop the shared lock and acquire an exclusive lock. This design was intentional, to prevent programmers from making false assumptions about a blocking upgrade function. Specifically, a blocking upgrade must potentially release its shared lock, and another thread may obtain an exclusive lock before the thread trying to perform the upgrade; this can happen, for example, if two threads attempt to upgrade the same lock at the same time. The sketch below illustrates the resulting drop-and-reacquire discipline.
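
The same discipline can be sketched in user space with C++17's std::shared_mutex, which likewise lacks a blocking upgrade. update_if_stale and the stale flag are illustrative; the important step is re-checking the condition after reacquiring, since another writer may have run in the window.

    #include <shared_mutex>

    std::shared_mutex rw_lock;

    void update_if_stale()
    {
        rw_lock.lock_shared();
        bool stale = /* examine the data under the shared lock */ true;
        rw_lock.unlock_shared();

        if (stale) {
            // "Upgrade" by dropping the shared lock and taking the
            // exclusive lock. Another writer may have intervened, so
            // the condition must be re-checked before acting on it.
            rw_lock.lock();
            // re-examine the data, then modify it only if still needed
            rw_lock.unlock();
        }
    }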

Semaphores

A semaphore is used as a synchronization tool. A semaphore S is an integer variable that, apart from initialization, is accessed only through two standard atomic operations: wait and signal, classically written P (for wait) and V (for signal).

The classical definition of wait in pseudocode is

    Wait(S) {
        while (S <= 0)
            ;   // no-op: busy-wait until S becomes positive
        S--;
    }

and of signal:

    Signal(S) {
        S++;
    }

When one process modifies the semaphore value, no other process can simultaneously modify that same semaphore value.
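
The classical definition busy-waits. A blocking user-space version can be sketched with a mutex and condition variable, as below; this is a common formulation rather than anything drawn from the text, and Semaphore, wait, and signal simply mirror P and V.

    #include <condition_variable>
    #include <mutex>

    class Semaphore {
        std::mutex m;
        std::condition_variable cv;
        int count;                    // the semaphore value S
    public:
        explicit Semaphore(int initial) : count(initial) {}

        void wait()                   // P: block until S > 0, then decrement
        {
            std::unique_lock<std::mutex> lock(m);
            while (count <= 0)
                cv.wait(lock);        // sleep instead of spinning
            --count;
        }

        void signal()                 // V: increment and wake one waiter
        {
            std::lock_guard<std::mutex> lock(m);
            ++count;
            cv.notify_one();
        }
    };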


Atomic Primitives

Atomic primitives are arguably the most important tool in programming that requires coordination between multiple threads and/or processors/cores. There are four basic types of atomic primitives: swap, fetch and phi, compare and swap, and load linked/store conditional. Usually these operations are performed on values the same size as or smaller than a word (the size of the processor's registers), though sometimes consecutive (single address) multiword versions are provided.

The most primitive (pun not intended) is the swap operation. This operation exchanges a value in memory with a value in a register, atomically setting the value in memory and returning the original value from memory. This is not actually very useful in terms of multiprogramming, with only a single practical use: the construction of test and set (TAS; or test and test and set, TATAS) spinlocks. In these spinlocks, each thread attempting to enter the spinlock spins, exchanging 1 into the spinlock value in each iteration. If 0 is returned, the spinlock was free, and the thread now owns the spinlock; if 1 is returned, the spinlock was already owned by another thread, and the thread attempting to enter the spinlock must keep spinning. Ultimately, the owning thread leaves the spinlock by setting the spinlock value to 0.

    // ATOMIC marks pseudocode that executes indivisibly; "word" is a
    // machine-word-sized integer.
    ATOMIC word swap(volatile word &value, word newValue)
    {
        word oldValue = value;
        value = newValue;

        return oldValue;
    }

    // Test and test and set spinlock
    void EnterSpinlock(volatile word &value)
    {
        // Spin on plain reads; only attempt the atomic swap once the
        // lock appears free.
        while (value == 1 || swap(value, 1) == 1) {}
    }

    void LeaveSpinlock(volatile word &value)
    {
        value = 0;
    }

This makes for an extremely crude spinlock. The fact that there is only a single value all threads share means that spinning can create a lot of cache coherency traffic, as all spinning processors will be writing to the same address, continually invalidating each other's caches. Furthermore, this mechanism precludes any kind of order preservation, as it's impossible to distinguish when a given thread began waiting; threads may enter the spinlock in any order, regardless of how long any thread has been waiting to enter the spinlock.
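
For reference, the swap-based lock can be rendered in compilable form with C++11 atomics; std::atomic_flag::test_and_set is exactly the atomic exchange of 1 into the lock word. For simplicity this version omits the read-only pre-test, making it plain TAS; a TATAS rendering appears in the Test & Test & Set section below. Names are illustrative.

    #include <atomic>

    std::atomic_flag lock_word = ATOMIC_FLAG_INIT;

    void enter_spinlock()
    {
        // test_and_set atomically stores true and returns the previous
        // value; true means another thread held the lock, so keep spinning.
        while (lock_word.test_and_set(std::memory_order_acquire)) {}
    }

    void leave_spinlock()
    {
        lock_word.clear(std::memory_order_release);
    }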

Next up the power scale is the fetch and phi family. All members of this family follow the same basic process: a value is atomically read from memory, modified by the processor, and written back, with the original value returned to the program. The modification performed can be almost anything; one of the most useful modifications is the add operation (in this case it's called fetch and add). The fetch and add operation is notably more useful than the swap operation, but is still less than ideal; in addition to test and set spinlocks, fetch and add can be used to create thread-safe counters, and spinlocks that both preserve order and (potentially) greatly reduce cache coherency traffic.

    ATOMIC word fetch_and_add(volatile word &value, word addend)
    {
        word oldValue = value;
        value += addend;

        return oldValue;
    }

    // Ticket-style spinlock: each thread draws a sequence number and
    // waits for its turn, so entry order is preserved.
    void EnterSpinlock(volatile word &seq, volatile word &cur)
    {
        word mySeq = fetch_and_add(seq, 1);
        while (cur != mySeq) {}
    }

    void LeaveSpinlock(volatile word &cur)
    {
        cur++;
    }
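
The thread-safe counter mentioned above is essentially a one-line use of fetch and add. This sketch uses C++11's std::atomic::fetch_add; next_id is an illustrative name.

    #include <atomic>

    std::atomic<unsigned> counter{0};

    // Each call returns a unique, monotonically increasing value, no
    // matter how many threads call it concurrently.
    unsigned next_id()
    {
        return counter.fetch_add(1, std::memory_order_relaxed);
    }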

Next is the almost ubiquitous compare and swap (CAS) operation. In this operation, a value is read from memory, and if it matches a comparison value in the processor, a third value in the processor is atomically written into the memory location; ultimately, the original value in memory is returned to the program. This operation is very popular because it allows you to perform almost any operation on a single memory location atomically, by reading the value, modifying it, and then compare-and-swapping it back to memory. For this reason, it is considered to be a universal atomic primitive.

    ATOMIC word compare_and_swap(volatile word &value, word newValue,
                                 word comparand)
    {
        word oldValue = value;
        if (oldValue == comparand)
            value = newValue;

        return oldValue;
    }
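
As an illustration of the read-modify-CAS loop just described, here is an atomic "store the maximum" in C++11; atomic_max is an illustrative name. compare_exchange_weak (which compiles to CAS or LL/SC depending on the architecture) reloads the observed value on failure, so the loop simply re-tests and retries.

    #include <atomic>

    // Atomically update `target` to the maximum of itself and `candidate`.
    void atomic_max(std::atomic<int> &target, int candidate)
    {
        int observed = target.load();
        while (observed < candidate &&
               !target.compare_exchange_weak(observed, candidate)) {
            // On failure, `observed` now holds the current value;
            // loop around to re-test and retry.
        }
    }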

Some architectures (such as the x86) support double-width compare and swap (atomic compare and swap of two consecutive words). While this is convenient, it should not be relied upon in portable software, as many architectures do not support it. Note that double-width compare and swap is NOT the same as double compare and swap (which we'll look at soon).

Another universal atomic primitive is the load linked/store conditional (LL/SC) pair of instructions. In this model, when a value is loaded linked into a register, a reservation is placed on that memory address. If that reservation still exists when the store conditional operation is performed, the store will be performed. If another thread writes to the memory location between a load linked and store conditional, the reservation is cleared, and any other threads will fail their store conditional (resulting in skipping the store altogether). If the store conditional fails, the program must loop back and try the operation again, beginning with the load linked.

In theory, the load linked/store conditional primitive is superior to the compare and swap operation. As any access to the memory address will clear the reservations of other processors, the total number of reads is 1, in contrast to the 2 reads with compare and swap (1 read to get the value initially, and a second read to verify the value is unchanged). Furthermore, the LL/SC operation may distinguish whether a write has occurred, regardless of whether the value was changed by the write (we'll come back to why this is a good thing when we discuss hazard pointers). Unfortunately, my research indicates that most implementations of LL/SC do not guarantee that a write will be distinguished if the written value is the same as the value at the time of the reserve.

Finally, the wild card: the double compare and swap (DCAS). In a double compare and swap, the compare and swap operation is performed simultaneously on the values in two memory addresses. Obviously this provides dramatically more power than any of the previous operations, which operate only on a single address. Unfortunately, support for this primitive is extremely rare in real-world processors, and it is typically only used by lock-free algorithm designers who are unable to reduce their algorithm to single-address compare and swap operations.

    ATOMIC void double_compare_and_swap(volatile word &value1, word newValue1,
                                        word comparand1, word &oldValue1,
                                        volatile word &value2, word newValue2,
                                        word comparand2, word &oldValue2)
    {
        oldValue1 = value1;
        oldValue2 = value2;

        // Both locations must match their comparands for either store
        // to be performed.
        if (oldValue1 == comparand1 && oldValue2 == comparand2)
        {
            value1 = newValue1;
            value2 = newValue2;
        }
    }

Lastly, note that I'm using a rather loose classification of these primitives. Technically, you could place the swap (fetch and store) operation in the fetch and phi family as well, but it seemed more intuitive to me to separate them.

Ticket Lock

A ticket lock is a form of lockless inter-thread synchronization.

Overview

Conventionally, inter-thread synchronization is achieved by using synchronization entities provided by the operating system, such as events, semaphores, and mutexes. For example, a thread will create a mutex and, in the act of creation, "claim" the mutex. A mutex can have only one owner at any time. Other threads, when they come to the point where their behaviour must be synchronized, will attempt to "claim" the mutex; if they cannot, because another thread already owns the mutex, they automatically sleep until the thread which currently owns the mutex gives it up. Then one of the currently sleeping threads is automatically awoken and given ownership of the mutex.

A ticket lock works as follows: there are two integer values which begin at 0. The first value is the queue ticket, the second is the dequeue ticket. When a thread arrives, it atomically obtains and then increments the queue ticket. It then atomically compares its ticket with the dequeue ticket. If they are the same, the thread is permitted to enter the serialised code. If they are not the same, then another thread must already be in the serialised code and this thread must busy-wait or yield. When a thread comes to leave the serialised code, it atomically increments the dequeue ticket, thus permitting the next waiting thread to enter the serialised code. A sketch of this scheme appears below.
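
Here is a minimal sketch of the two-counter scheme using C++11 atomics; TicketLock, enter, and leave are illustrative names, and this version busy-waits rather than yielding.

    #include <atomic>

    struct TicketLock {
        std::atomic<unsigned> queue{0};    // hands out tickets
        std::atomic<unsigned> dequeue{0};  // whose turn it is

        void enter()
        {
            // Atomically obtain and increment the queue ticket.
            unsigned my_ticket = queue.fetch_add(1, std::memory_order_relaxed);
            // Busy-wait until the dequeue ticket reaches our number.
            while (dequeue.load(std::memory_order_acquire) != my_ticket) {}
        }

        void leave()
        {
            // Increment the dequeue ticket, admitting the next waiter.
            dequeue.fetch_add(1, std::memory_order_release);
        }
    };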

The conventional approach, by contrast, is heavyweight (high-overhead, code intensive): operating system synchronization entities have a significant impact upon performance, since the operating system has to switch into a special mode when dealing with operations upon synchronization entities to ensure they provide synchronized behaviour.

A further drawback is that if the thread which owns the mutex fails, the entire application halts. (This type of problem applies to all synchronization entities.)

Lockless locking


Lockless locking achieves inter-thread synchronization without the use of operating system provided synchronization entities. Generally, two techniques are involved.

The first technique is the use of a special set of instructions which are guaranteed atomic by the CPU. This generally centers on an instruction known as compare-and-swap. This instruction compares two variables and, if they are the same, replaces one of the variables' values with a third value. For example:

    compare_and_swap(destination, exchange, comparand);

Here the destination and comparand are compared. If they are identical, the value in exchange is placed in destination.

This first technique provides the vital inter-thread atomicity of operation. It would otherwise be impossible to perform any lockless locking, since on systems with multiple processors, even operations such as a simple assignment (let a = b) could be half-way through occurring while other operations occur on other processors; the software trying to synchronize across processors could never be sure that any operation had reached a sane state permitting it to be used in some way.

The second technique is to busy-wait or yield when it is clear that the current thread cannot be permitted to continue processing. This provides the vital ability to defer the processing done by a thread when such processing would violate the operations which are being serialised between threads.


Test & Test & Set Lock

In computer science, the test-and-set CPU instruction is used to implement mutual exclusion in multiprocessor environments. Although a correct lock can be implemented with test-and-set, it can lead to memory contention on a busy lock (caused by bus locking and cache invalidation when the test-and-set operation needs to access memory atomically). To lower the overhead, a more elaborate locking protocol, test and test-and-set, is used. The main idea is not to spin in test-and-set but to increase the likelihood of a successful test-and-set by using the following entry protocol to the lock:

    boolean locked := false        // shared lock variable

    procedure EnterCritical() {
        do {
            while (locked == true)
                skip               // spin until lock seems free
        } while TestAndSet(locked) // actual atomic locking
    }

The exit protocol is:

    procedure ExitCritical() {
        locked := false
    }

The entry protocol uses normal memory reads to spin, waiting for the lock to become free. Test-and-set is only used to try to get the lock when a normal memory read says it's free. Thus the expensive atomic memory operations happen less often than in a simple spin around test-and-set.

If the programming language used supports short-circuit evaluation, the entry protocol could be implemented as:

    procedure EnterCritical() {
        while (locked == true or TestAndSet(locked) == true)
            skip   // spin until locked
    }
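
The same protocol in compilable form, as a sketch with C++11 atomics (names illustrative): the relaxed load spins in the local cache, and the atomic exchange plays the role of TestAndSet.

    #include <atomic>

    std::atomic<bool> locked{false};

    void enter_critical()
    {
        do {
            // Spin on plain reads while the lock seems taken; this hits
            // the local cache and generates no coherence traffic.
            while (locked.load(std::memory_order_relaxed)) {}
            // Only then attempt the expensive atomic test-and-set.
        } while (locked.exchange(true, std::memory_order_acquire));
    }

    void exit_critical()
    {
        locked.store(false, std::memory_order_release);
    }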


Array Based Locks

In array-based locks, each lock is implemented as an array of size P, where P is the total number of processors. Acquiring the lock involves atomically copying and incrementing a shared index variable and then spinning on the array location indexed by the copied value. Releasing the lock involves resetting the corresponding array location and setting the next array location if a processor is waiting. Each array location should be allocated on a different cache line to avoid invalidations due to false sharing: assume a large enough cache line size (say, 256 bytes) and put that amount of padding between consecutive lock locations. Any atomic code section that is part of lock acquire or release can be implemented with LL/SC; in the original exercise the algorithm is written in C, with the atomic primitives coded in MIPS assembly using LL/SC. A sketch appears below. This scheme solves the O(P^2) traffic problem: each waiter spins on its own location, so a release invalidates only one processor's cache line.
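
Below is a minimal user-space sketch of the scheme, with the assumptions made explicit: C++11 atomics stand in for the LL/SC-coded atomic sections, a 256-byte line is assumed for the padding, at most P threads contend at once, and ArrayLock, acquire, and release are illustrative names.

    #include <atomic>
    #include <cstddef>

    constexpr std::size_t P    = 64;   // assumed maximum number of processors
    constexpr std::size_t LINE = 256;  // assumed cache line size, for padding

    struct ArrayLock {
        // Each slot is padded to its own line to avoid false sharing.
        struct alignas(LINE) Slot { std::atomic<bool> may_enter{false}; };
        Slot slots[P];
        std::atomic<std::size_t> next{0};

        ArrayLock() { slots[0].may_enter.store(true); }

        std::size_t acquire()
        {
            // Atomically copy and increment the shared index, then spin
            // on our own array location only.
            std::size_t my_slot = next.fetch_add(1) % P;
            while (!slots[my_slot].may_enter.load(std::memory_order_acquire)) {}
            return my_slot;            // caller passes this to release()
        }

        void release(std::size_t my_slot)
        {
            // Reset our location and hand the lock to the next slot.
            slots[my_slot].may_enter.store(false, std::memory_order_relaxed);
            slots[(my_slot + 1) % P].may_enter.store(true,
                                                     std::memory_order_release);
        }
    };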

Barriers

Race conditions can cause a program to execute in a non-deterministic fashion, producing inconsistent results. Synchronization routines are used to remove race conditions from code. In certain cases all threads must have executed a portion of code before continuing; barrier synchronization is a technique to achieve this. A thread executing an episode of a barrier waits for all other threads before proceeding to the next. Therefore, when a barrier is reached, all threads are forced to wait for the last thread to arrive. The use of barriers is common in shared memory parallel programming.

Some terminology in the context of this report:

1. Threads – units of execution initiated by the program using OpenMP, which might run on different processors but on the same parallel machine in which they were initiated.

2. Parallel machine – a collection of processors which are tightly coupled and use shared memory. Jedi1 is considered a parallel machine; it has 4 processors.

3. Processor – a subunit of a parallel machine.

The two types of barrier are:

1. Centralized Barrier
2. Tree Barrier


Centralized Barrier

Each of the N threads entering the barrier atomically decrements a shared integer, whose initial value is N. If the value of the integer after the decrement is 0, the thread resets the counter and changes a shared central flag; otherwise, the thread waits for notification. That is, an arriving processor decrements "count" and then waits until "sense" has a different value than it had at the previous barrier; the last arriving processor resets "count" and reverses "sense". The algorithm relies on every thread reading and writing a single memory location, the counter, which is therefore known as a hot-spot.
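
A minimal sketch of this centralized sense-reversing barrier in C++11 atomics follows; CentralBarrier and wait are illustrative names. Each thread's local sense is the opposite of the current shared "sense", so consecutive barrier episodes cannot be confused (see the sense reversal discussion below).

    #include <atomic>

    struct CentralBarrier {
        const int N;                    // number of participating threads
        std::atomic<int> count;         // the hot-spot counter
        std::atomic<bool> sense{false}; // the shared central flag

        explicit CentralBarrier(int n) : N(n), count(n) {}

        void wait()
        {
            bool my_sense = !sense.load(std::memory_order_relaxed);
            if (count.fetch_sub(1) == 1) {
                // Last arrival: reset the counter, then reverse the
                // sense, releasing everyone spinning below.
                count.store(N, std::memory_order_relaxed);
                sense.store(my_sense, std::memory_order_release);
            } else {
                // Wait until "sense" differs from its value at the
                // previous barrier episode.
                while (sense.load(std::memory_order_acquire) != my_sense) {}
            }
        }
    };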

Tree Barrier

A tree barrier does not need a lock; it only uses flags.

- Arrange the processors logically in a binary tree (higher degree is also possible).
- Two siblings tell each other of arrival via simple flags (i.e. one waits on a flag while the other sets it on arrival).
- One of them moves up the tree to participate in the next level of the barrier.
- This introduces concurrency in the barrier algorithm, since independent subtrees can proceed in parallel.
- It takes log(P) steps to complete the acquire.
- A fixed processor starts a downward pass of release, waking up other processors that in turn set other flags.

- Shows much better scalability compared to centralized barriers in DSM multiprocessors. The advantage in small bus-based systems is not much, since all transactions are serialized on the bus anyway; in fact the additional log(P) delay may hurt performance in bus-based SMPs.

[Figures: binary barrier tree signal; binary barrier tree wakeup]

Each processor will need at most log(P) + 1 flags. To avoid false sharing, allocate each processor's flags on a separate chunk of cache lines. With some memory wastage (possibly worth it), allocate each processor's flags on a separate page and map that page locally in that processor's physical memory; this avoids remote misses in a DSM multiprocessor, though it does not matter in bus-based SMPs.


What is Sense Reversal?

• Problem: reinitialization. Each time a barrier is used, it must be reset.

• Difficulty: odd and even barriers can overlap in time. Some processes may still be exiting the kth barrier while others are entering the (k+1)st, so how can one reinitialize?

• Solution: sense reversal. The terminal state of one phase is the initial state of the next phase. For example, odd barriers wait for the flag to transition from true to false, while even barriers wait for it to transition from false to true.

• Benefits: fewer references to shared variables, and one of the spinning episodes is eliminated.