
Page 1: 11 Lock Freedom

Obstruction Freedom Categories Guaranteeing Memory Ordering Lock-Free Algorithms

Lock Freedom

David Chisnall

March 8, 2011 (Pancake Day!)

Page 2: 11 Lock Freedom

Locks are Slow!

• Cost to acquire and release

• System calls often required

• Can cause n threads to block (wait) if a lock is accessible by n + 1 threads

• Possibility of deadlock

• Not ideal for high-performance computing!

Author's Note
Comment
Lock freedom doesn't mean avoiding locks, but it does mean avoiding some of the problems that are typically associated with the simple lock-based solutions that we looked at last lecture.
Page 3: 11 Lock Freedom

Wait Freedom

Every operation completes in a bounded number of steps.

(Never happens, back in the real world)

Author's Note
Comment
Wait free algorithms are really nice, but really rare. Generally, you only see them for embarrassingly parallel problems, like ray tracing.
Page 4: 11 Lock Freedom

Lock Freedom

• At least one thread must be able to make progress at any given time

• Eventually, all threads must make progress

• Given infinite time, infinitely many operations will complete

Author's Note
Comment
Lock free algorithms are a lot more common, and scale pretty well. All wait-free algorithms are also lock free.
Page 5: 11 Lock Freedom

Obstruction Freedom

A single thread, with all other threads paused, may complete its work.

Author's Note
Comment
Obstruction free algorithms are a bit less interesting. All lock-free algorithms are also obstruction free, but an algorithm that is just obstruction free may not scale very well. Obstruction freedom just means no obstructions to algorithm progress. For good performance, you want progress on at least as many threads as you have processors, which means something between obstruction free and lock free.
Page 6: 11 Lock Freedom

Implementing Obstruction Free Algorithms

• Requires strong guarantees on memory ordering

• Needs lots of thought!

Author's Note
Comment
Designing obstruction free algorithms typically involves making sure that operations happen within a thread in a very specific order. There are some serious difficulties with this, however.
Page 7: 11 Lock Freedom

Problem 1: Compiler Reorders Memory Access

```c
a = b;
b = c;
```

• Two store operations

• No dependencies

• Compiler is free to issue them in any order

• May also remove load operations if the value is already in a register!

Author's Note
Comment
Difficulty 1: the compiler hates you and will try to make this thread faster at the expense of breaking other threads. You have to be very careful about this.
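The note above can be made concrete with a small C sketch (the function names here are illustrative, not from the lecture): the compiler sees no dependency between the two stores, and it may keep a value in a register rather than re-reading it from memory.

```c
int a, b, c;

/* The compiler sees no dependency between these two stores, so it may
 * issue them in either order; another thread watching memory could see
 * b change before a does. */
void unordered_stores(void)
{
    a = b;
    b = c;
}

/* Loads can vanish entirely: with no volatile qualifier or barrier, the
 * compiler may hoist the load of *flag out of the loop, turning this
 * into an infinite loop if the flag starts at zero. */
int spin_on_flag_broken(int *flag)
{
    while (*flag == 0)
        ;
    return *flag;
}
```

Single-threaded, both functions behave as written; the danger only appears when a second thread is observing (or updating) the same memory.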
Page 8: 11 Lock Freedom

The volatile Keyword

```c
volatile int a;
```

• The compiler must issue a memory read for every access to a

• The compiler must issue a memory write for every assignment to a

• The compiler may not re-order accesses and assignments to a

• The compiler is free to rearrange accesses to a relative to other memory accesses

• The compiler makes no guarantees about multithreaded access

Author's Note
Comment
Volatile was added to the C spec for doing memory mapped I/O, but accessing memory from two threads has roughly the same set of requirements. Volatile means that the compiler may not remove or reorder memory accesses to a variable.
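As a sketch of what volatile does guarantee, here is a minimal pthreads example (the names volatile_flag_demo and producer_thread are invented for illustration). The volatile qualifier forces the spin loop to reload ready from memory on every iteration; it does not order the payload store relative to the flag store, which is the next problem.

```c
#include <pthread.h>

static volatile int ready;
static int payload;

static void *producer_thread(void *unused)
{
    (void)unused;
    payload = 42;
    ready = 1;          /* volatile write: must be issued */
    return NULL;
}

int volatile_flag_demo(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, producer_thread, NULL);
    while (!ready)      /* volatile read: reloaded every iteration */
        ;
    pthread_join(tid, NULL);
    /* Reading payload after the join is safe; reading it right after
     * the spin loop would need a barrier as well. */
    return payload;
}
```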
Page 9: 11 Lock Freedom

Problem 2: CPU Reorders Memory Access

• Most modern chips issue operations out of order

• Memory reads and writes may be reordered

• The processor will ensure that the current thread doesn't see the reordering...

• ...but other threads still can

Author's Note
Comment
Second problem: the CPU hates you too! Out of order processors will do the same sort of data shuffling as compilers, just to a slightly lesser degree. They'll have some internal logic ensuring that you don't notice this rearrangement from within one core, but you may notice it from concurrent threads running on different cores. This is especially problematic if you have a non-cache-coherent NUMA system.
Page 10: 11 Lock Freedom

Memory Barriers

```c
// GCC extension, full memory barrier:
__sync_synchronize();
```

• Provides a line in the instruction stream

• Memory accesses may not be reordered across the line

• Some architectures provide various forms of relaxed barriers (e.g. only writes may not be reordered)

Author's Note
Comment
Various architectures have different barrier instructions. These prevent the CPU from reordering load / store instructions. GCC provides an intrinsic instruction that issues a full barrier (i.e. all memory operations before it must complete before any memory operations after it).
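Combining a flag with the full barrier gives the usual publish pattern; this is an illustrative sketch, not code from the lecture (barrier_demo and publisher are invented names).

```c
#include <pthread.h>

static int payload2;
static volatile int published;

static void *publisher(void *unused)
{
    (void)unused;
    payload2 = 7;
    __sync_synchronize();   /* payload store must complete before the flag store */
    published = 1;
    return NULL;
}

int barrier_demo(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, publisher, NULL);
    while (!published)
        ;
    __sync_synchronize();   /* flag load must complete before the payload load */
    int v = payload2;
    pthread_join(tid, NULL);
    return v;
}
```

The writer's barrier stops the payload store being reordered after the flag store; the reader's barrier stops the payload load being speculated before the flag load.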
Page 11: 11 Lock Freedom

Example: Xen Time Source

• Hypervisor must provide guest VMs with current time

• Desire to avoid expensive calls from guest to hypervisor

• Lock-free mechanism for updating time

Author's Note
Comment
This is a fairly common problem: you have some shared data which is frequently read and infrequently written to. It's more than one word, so you can't use an atomic operation to access it, what do you do? The simple solution would be to protect each read / write with a lock, but that gets a bit messy.
Page 12: 11 Lock Freedom

Time in Xen

• Hypervisor provides coarse-grained time and the time-stamp counter (TSC) value when it was accurate

• Generating the current time requires reading several values from memory

• What happens if your read overlaps with an update?

Author's Note
Comment
The hypervisor is doing periodic updates, the VM is doing frequent reads.
Page 13: 11 Lock Freedom

Solution: Versioned Reads

```c
struct shared_info
{
    int version, nanosecs, seconds, tscs;
};

struct shared_info atomic_read(volatile struct shared_info *info)
{
    struct shared_info ret;
    /* An odd version means an update is in progress: spin */
    while ((ret.version = info->version) & 1) ;
    ret.nanosecs = info->nanosecs;
    ret.seconds = info->seconds;
    ret.tscs = info->tscs;
    /* Version unchanged: the copy is consistent */
    if (ret.version == info->version)
        return ret;
    /* An update started during the read: try again */
    return atomic_read(info);
}
```

Author's Note
Comment
This algorithm shows how the atomic read works. First, it spins while the low bit of the version is 1, which means that an update is in progress. Then it reads the three values and checks the version again. If the version has changed, then an update started during the read, so it tries again. If not, then it returns.
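As a self-contained, compilable rendering of the versioned-read pattern (with the tail recursion rewritten as an equivalent retry loop; the name versioned_read is invented):

```c
struct shared_info {
    int version, nanosecs, seconds, tscs;
};

/* Spin while an update is in progress (odd version), copy the
 * fields, then retry if the version changed underneath us. */
struct shared_info versioned_read(volatile struct shared_info *info)
{
    struct shared_info ret;
    do {
        while ((ret.version = info->version) & 1)
            ;                               /* writer active: spin */
        ret.nanosecs = info->nanosecs;
        ret.seconds  = info->seconds;
        ret.tscs     = info->tscs;
    } while (ret.version != info->version); /* torn read: retry */
    return ret;
}
```

Called single-threaded with an even version, the first attempt always succeeds; the loop only matters when a concurrent writer is active.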
Page 14: 11 Lock Freedom

Write Algorithm

```c
info->version++;
__sync_synchronize();
info->nanosecs = nanosecs;
info->seconds = seconds;
info->tscs = tscs;
__sync_synchronize();
info->version++;
```

Author's Note
Comment
The update function is also simple. It increments the version, so the other thread will note an update in progress, then it does the update, and increments the counter again. The two memory barriers ensure that the counter increments complete before and after the update, not interleaved with them.
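Wrapped in a function, the writer side might look like this (the struct and function names here are invented for illustration; the barriers are the GCC full-barrier intrinsic from the earlier slide):

```c
struct xen_time {
    volatile int version;
    int nanosecs, seconds, tscs;
};

/* Bump the version to odd (update in progress), write the payload,
 * bump it back to even.  The barriers stop the CPU reordering the
 * payload stores outside the version increments. */
void update_time(struct xen_time *info, int ns, int s, int tsc)
{
    info->version++;            /* now odd: readers will spin */
    __sync_synchronize();
    info->nanosecs = ns;
    info->seconds  = s;
    info->tscs     = tsc;
    __sync_synchronize();
    info->version++;            /* even again: readers may proceed */
}
```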
Page 15: 11 Lock Freedom

Performance

Reader:

• No atomic operations required

• Common case just requires 5 reads

• Very fast!

• May need to retry if concurrent with write

• Unbounded worst-case time

Writer:

• Needs two barriers or atomic increments

• Similar cost to acquiring and releasing a mutex

• But never blocks - hard realtime guarantee for the writer!

Author's Note
Comment
The reader, in the common case (when it's not concurrent with the writer) is incredibly fast. Only slightly slower than a non-thread-safe version. In the less common case, it can be delayed. The writer has hard realtime guarantees, because the reader can never block the writer. Note that this is the opposite of using a read-write lock, where the readers can indefinitely block the writer. This way around is usually better, because readers typically want the most up-to-date value.
Page 16: 11 Lock Freedom

Example: Lockless Ring Buffer

• Producer-consumer problem

• Solution without locks

• Producer and consumer can both access queue concurrently!

Author's Note
Comment
Ring buffers are a constant-space solution to the producer-consumer problem. They are a block of memory with insert and read points. Inserting happens along the buffer, then wraps around to the start. Reading follows the insert point, consuming data in the order that it is inserted. Some mechanism should ensure that the insert point never overtakes the read point, or data will be overwritten before it is read (some sound cards do this, because losing data is better than pausing).
Page 17: 11 Lock Freedom

Simple Ring Buffer

1. Acquire lock

2. Insert object

3. Release lock

1. Acquire lock

2. Collect object

3. Release lock

How do we make this lock free?

Author's Note
Comment
In this simple solution (which we looked at last time), the producer and consumer threads can both block each other while trying to access the queue.
Page 18: 11 Lock Freedom

Potential Concurrency Problems

• Producer must find free space

• Consumer must find next item

• Producer must be able to tell if the buffer is full

• Consumer must be able to tell if the buffer is empty

Author's Note
Comment
All four of these things depend on the state of the ring buffer, so they're interaction points between the threads, and potential places for concurrency bugs to hide.
Page 19: 11 Lock Freedom

Solution: Free-running Counters

```c
volatile uint32_t producer;
volatile uint32_t consumer;

const int shift = 8;
// Must be a power of two!
const int bufferSize = 1 << shift;
const int bufferMask = bufferSize - 1;
void *buffer[bufferSize];
```

Author's Note
Comment
With free-running counters and a power-of-two size buffer, we can translate from the counter to an index by just masking the low bits, and can get the amount of space in the buffer by subtracting the consumer counter from the producer counter.
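The counter arithmetic the note describes can be sketched directly (the names slot and items_in_buffer are invented for illustration):

```c
#include <stdint.h>

enum { SHIFT = 8, BUFFER_SIZE = 1 << SHIFT, BUFFER_MASK = BUFFER_SIZE - 1 };

/* Counter -> slot index: just mask off the low bits. */
uint32_t slot(uint32_t counter)
{
    return counter & BUFFER_MASK;
}

/* Items currently in the buffer: producer minus consumer.  Unsigned
 * subtraction gives the right answer even after either counter has
 * wrapped past 2^32. */
uint32_t items_in_buffer(uint32_t producer, uint32_t consumer)
{
    return producer - consumer;
}
```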
Page 20: 11 Lock Freedom

Inserting into the Ring

```c
void insert(void *v)
{
    // Spin while the buffer is full
    while (producer - consumer >= bufferSize) ;
    buffer[producer & bufferMask] = v;
    producer++;
}
```

Author's Note
Comment
If the buffer is full, then spin. If not, then insert the new value at the current insert point and increment the producer counter.
Page 21: 11 Lock Freedom

Retrieving the Next Value

```c
void *fetch(void)
{
    // Spin while the buffer is empty
    while (producer == consumer) ;
    void *v = buffer[consumer & bufferMask];
    consumer++;
    return v;
}
```

Author's Note
Comment
Spin while the queue is empty. When it isn't, grab the next value and increment the consumer counter.
Page 22: 11 Lock Freedom

Why it Works

• The counters are only ever read from one thread, written from the other

• Each update only increments a single shared variable

• Wrap-around is handled by overflow

Author's Note
Comment
The counters are only touched by one thread, so atomic increments are not needed - the worst that will happen is that the other thread sees the old value. Wrap around to the start of the buffer happens automatically due to standard integer overflow semantics.
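A self-contained version of the ring buffer makes the wrap-around behaviour easy to exercise: starting the counters just below 2^32, inserts and fetches still pair up correctly after both counters overflow. The names here are invented; the logic follows the insert and fetch slides.

```c
#include <stdint.h>

enum { RING_SIZE = 8, RING_MASK = RING_SIZE - 1 };

static volatile uint32_t prod, cons;
static void *ring[RING_SIZE];

/* Single-producer insert: spin while full, then store and publish. */
void ring_insert(void *v)
{
    while (prod - cons >= RING_SIZE)
        ;
    ring[prod & RING_MASK] = v;
    prod++;
}

/* Single-consumer fetch: spin while empty, then load and consume. */
void *ring_fetch(void)
{
    while (prod == cons)
        ;
    void *v = ring[cons & RING_MASK];
    cons++;
    return v;
}
```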
Page 23: 11 Lock Freedom

Better than Condition Variables?

• Very fast when the consumer and producer are both running

• Causes producer to spin when queue is full

• Causes consumer to spin when queue is empty

• Solution: Hybrid - use a condition variable only on boundary conditions (i.e. when transitioning to and from full / empty states)

Author's Note
Comment
This has the same problem as the solution involving a mutex - the threads spin when the queue is empty / full. You can improve this by combining it with a condition variable and using that when the queue is empty.
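One way the suggested hybrid might look, assuming pthreads (all names here are invented; signalling under the mutex avoids the classic lost-wakeup race, though a production version needs more care with memory ordering on the counters):

```c
#include <pthread.h>
#include <stdint.h>

enum { HRING_SIZE = 8, HRING_MASK = HRING_SIZE - 1 };

static volatile uint32_t hprod, hcons;
static void *hring[HRING_SIZE];
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

/* Fast path is the lock-free buffer; the mutex and condition
 * variables are touched only at the full/empty boundaries. */
void hybrid_insert(void *v)
{
    if (hprod - hcons >= HRING_SIZE) {          /* slow path: full */
        pthread_mutex_lock(&ring_lock);
        while (hprod - hcons >= HRING_SIZE)
            pthread_cond_wait(&not_full, &ring_lock);
        pthread_mutex_unlock(&ring_lock);
    }
    hring[hprod & HRING_MASK] = v;
    hprod++;
    if (hprod - hcons == 1) {                   /* was empty: wake consumer */
        pthread_mutex_lock(&ring_lock);
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&ring_lock);
    }
}

void *hybrid_fetch(void)
{
    if (hprod == hcons) {                       /* slow path: empty */
        pthread_mutex_lock(&ring_lock);
        while (hprod == hcons)
            pthread_cond_wait(&not_empty, &ring_lock);
        pthread_mutex_unlock(&ring_lock);
    }
    void *v = hring[hcons & HRING_MASK];
    hcons++;
    if (hprod - hcons == HRING_SIZE - 1) {      /* was full: wake producer */
        pthread_mutex_lock(&ring_lock);
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&ring_lock);
    }
    return v;
}
```

In the common case (buffer neither full nor empty) neither thread touches the mutex, so the fast path keeps the lock-free performance.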
Page 24: 11 Lock Freedom

Questions?