
Page 1: 11 Lock Freedom

Obstruction Freedom Categories Guaranteeing Memory Ordering Lock-Free Algorithms

Lock Freedom

David Chisnall

March 8, 2011 (Pancake Day!)

Page 2: 11 Lock Freedom

Locks are Slow!

• Cost to acquire and release

• System calls often required

• Can cause n threads to block (wait) if a lock is accessible by n + 1 threads

• Possibility of deadlock

• Not ideal for high-performance computing!

Author's Note
Comment
Lock freedom doesn't mean avoiding locks, but it does mean avoiding some of the problems that are typically associated with the simple lock-based solutions that we looked at last lecture.
Page 3: 11 Lock Freedom

Wait Freedom

Every operation completes in a bounded number of steps.

(Never happens, back in the real world)

Author's Note
Comment
Wait free algorithms are really nice, but really rare. Generally, you only see them for embarrassingly parallel problems, like ray tracing.
Page 4: 11 Lock Freedom

Lock Freedom

• At least one thread must be able to make progress at any given time

• Eventually, all threads must make progress

• Given infinite time, infinitely many operations will complete

Author's Note
Comment
Lock free algorithms are a lot more common, and scale pretty well. All wait-free algorithms are also lock free.
Page 5: 11 Lock Freedom

Obstruction Freedom

A single thread, with all other threads paused, may complete its work.

Author's Note
Comment
Obstruction free algorithms are a bit less interesting. All lock-free algorithms are also obstruction free, but an algorithm that is just obstruction free may not scale very well. Obstruction freedom just means no obstructions to algorithm progress. For good performance, you want progress on at least as many threads as you have processors, which means something between obstruction free and lock free.
Page 6: 11 Lock Freedom

Implementing Obstruction Free Algorithms

• Requires strong guarantees on memory ordering

• Needs lots of thought!

Author's Note
Comment
Designing obstruction free algorithms typically involves making sure that operations happen within a thread in a very specific order. There are some serious difficulties with this, however.
Page 7: 11 Lock Freedom

Problem 1: Compiler Reorders Memory Access

```c
a = b;
b = c;
```

• Two store operations

• No dependencies

• Compiler is free to issue them in any order

• May also remove load operations if the value is already in a register!

Author's Note
Comment
Difficulty 1: the compiler hates you and will try to make this thread faster at the expense of breaking other threads. You have to be very careful about this.
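The note above can be made concrete with a small C sketch (the function names here are illustrative, not from the lecture): the compiler sees no dependency between the two stores, and it may keep a value in a register rather than re-reading it from memory.

```c
int a, b, c;

/* The compiler sees no dependency between these two stores, so it may
 * issue them in either order; another thread watching memory could see
 * b change before a does. */
void unordered_stores(void)
{
    a = b;
    b = c;
}

/* Loads can vanish entirely: with no volatile qualifier or barrier, the
 * compiler may hoist the load of *flag out of the loop, turning this
 * into an infinite loop if the flag starts at zero. */
int spin_on_flag_broken(int *flag)
{
    while (*flag == 0)
        ;
    return *flag;
}
```

Single-threaded, both functions behave as written; the danger only appears when a second thread is observing (or updating) the same memory.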
Page 8: 11 Lock Freedom

The volatile Keyword

```c
volatile int a;
```

• The compiler must issue a memory read for every access to a

• The compiler must issue a memory write for every assignment to a

• The compiler may not re-order accesses and assignments to a

• The compiler is free to rearrange accesses to a relative to other memory accesses

• The compiler makes no guarantees about multithreaded access

Author's Note
Comment
Volatile was added to the C spec for doing memory mapped I/O, but accessing memory from two threads has roughly the same set of requirements. Volatile means that the compiler may not remove or reorder memory accesses to a variable.
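As a sketch of what volatile does guarantee, here is a minimal pthreads example (the names volatile_flag_demo and producer_thread are invented for illustration). The volatile qualifier forces the spin loop to reload ready from memory on every iteration; it does not order the payload store relative to the flag store, which is the next problem.

```c
#include <pthread.h>

static volatile int ready;
static int payload;

static void *producer_thread(void *unused)
{
    (void)unused;
    payload = 42;
    ready = 1;          /* volatile write: must be issued */
    return NULL;
}

int volatile_flag_demo(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, producer_thread, NULL);
    while (!ready)      /* volatile read: reloaded every iteration */
        ;
    pthread_join(tid, NULL);
    /* Reading payload after the join is safe; reading it right after
     * the spin loop would need a barrier as well. */
    return payload;
}
```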
Page 9: 11 Lock Freedom

Problem 2: CPU Reorders Memory Access

• Most modern chips issue operations out of order

• Memory reads and writes may be reordered

• The processor will ensure that the current thread doesn't see the reordering...

• ...but other threads still can

Author's Note
Comment
Second problem: the CPU hates you too! Out of order processors will do the same sort of data shuffling as compilers, just to a slightly lesser degree. They'll have some internal logic ensuring that you don't notice this rearrangement from within one core, but you may notice it from concurrent threads running on different cores. This is especially problematic if you have a non-cache-coherent NUMA system.
Page 10: 11 Lock Freedom

Memory Barriers

```c
// GCC extension, full memory barrier:
__sync_synchronize();
```

• Provides a line in the instruction stream

• Memory accesses may not be reordered across the line

• Some architectures provide various forms of relaxed barriers (e.g. only writes may not be reordered)

Author's Note
Comment
Various architectures have different barrier instructions. These prevent the CPU from reordering load / store instructions. GCC provides an intrinsic instruction that issues a full barrier (i.e. all memory operations before it must complete before any memory operations after it).
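Combining a flag with the full barrier gives the usual publish pattern; this is an illustrative sketch, not code from the lecture (barrier_demo and publisher are invented names).

```c
#include <pthread.h>

static int payload2;
static volatile int published;

static void *publisher(void *unused)
{
    (void)unused;
    payload2 = 7;
    __sync_synchronize();   /* payload store must complete before the flag store */
    published = 1;
    return NULL;
}

int barrier_demo(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, publisher, NULL);
    while (!published)
        ;
    __sync_synchronize();   /* flag load must complete before the payload load */
    int v = payload2;
    pthread_join(tid, NULL);
    return v;
}
```

The writer's barrier stops the payload store being reordered after the flag store; the reader's barrier stops the payload load being speculated before the flag load.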
Page 11: 11 Lock Freedom

Example: Xen Time Source

• Hypervisor must provide guest VMs with current time

• Desire to avoid expensive calls from guest to hypervisor

• Lock-free mechanism for updating time

Author's Note
Comment
This is a fairly common problem: you have some shared data which is frequently read and infrequently written to. It's more than one word, so you can't use an atomic operation to access it, what do you do? The simple solution would be to protect each read / write with a lock, but that gets a bit messy.
Page 12: 11 Lock Freedom

Time in Xen

• Hypervisor provides coarse-grained time and the time-stamp counter (TSC) value when it was accurate

• Generating the current time requires reading several values from memory

• What happens if your read overlaps with an update?

Author's Note
Comment
The hypervisor is doing periodic updates, the VM is doing frequent reads.
Page 13: 11 Lock Freedom

Solution: Versioned Reads

```c
struct shared_info
{
    int version, nanosecs, seconds, tscs;
};

struct shared_info atomic_read(volatile struct shared_info *info)
{
    struct shared_info ret;
    /* An odd version means an update is in progress: spin */
    while ((ret.version = info->version) & 1) ;
    ret.nanosecs = info->nanosecs;
    ret.seconds = info->seconds;
    ret.tscs = info->tscs;
    /* Version unchanged: the copy is consistent */
    if (ret.version == info->version)
        return ret;
    /* An update started during the read: try again */
    return atomic_read(info);
}
```

Author's Note
Comment
This algorithm shows how the atomic read works. First, it spins while the low bit of the version is 1, which means that an update is in progress. Then it reads the three values and checks the version again. If the version has changed, then an update started during the read, so it tries again. If not, then it returns.
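As a self-contained, compilable rendering of the versioned-read pattern (with the tail recursion rewritten as an equivalent retry loop; the name versioned_read is invented):

```c
struct shared_info {
    int version, nanosecs, seconds, tscs;
};

/* Spin while an update is in progress (odd version), copy the
 * fields, then retry if the version changed underneath us. */
struct shared_info versioned_read(volatile struct shared_info *info)
{
    struct shared_info ret;
    do {
        while ((ret.version = info->version) & 1)
            ;                               /* writer active: spin */
        ret.nanosecs = info->nanosecs;
        ret.seconds  = info->seconds;
        ret.tscs     = info->tscs;
    } while (ret.version != info->version); /* torn read: retry */
    return ret;
}
```

Called single-threaded with an even version, the first attempt always succeeds; the loop only matters when a concurrent writer is active.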
Page 14: 11 Lock Freedom

Write Algorithm

```c
info->version++;
__sync_synchronize();
info->nanosecs = nanosecs;
info->seconds = seconds;
info->tscs = tscs;
__sync_synchronize();
info->version++;
```

Author's Note
Comment
The update function is also simple. It increments the version, so the other thread will note an update in progress, then it does the update, and increments the counter again. The two memory barriers ensure that the counter increments complete before and after the update, not interleaved with them.
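Wrapped in a function, the writer side might look like this (the struct and function names here are invented for illustration; the barriers are the GCC full-barrier intrinsic from the earlier slide):

```c
struct xen_time {
    volatile int version;
    int nanosecs, seconds, tscs;
};

/* Bump the version to odd (update in progress), write the payload,
 * bump it back to even.  The barriers stop the CPU reordering the
 * payload stores outside the version increments. */
void update_time(struct xen_time *info, int ns, int s, int tsc)
{
    info->version++;            /* now odd: readers will spin */
    __sync_synchronize();
    info->nanosecs = ns;
    info->seconds  = s;
    info->tscs     = tsc;
    __sync_synchronize();
    info->version++;            /* even again: readers may proceed */
}
```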
Page 15: 11 Lock Freedom

Performance

Reader:

• No atomic operations required

• Common case just requires 5 reads

• Very fast!

• May need to retry if concurrent with write

• Unbounded worst-case time

Writer:

• Needs two barriers or atomic increments

• Similar cost to acquiring and releasing a mutex

• But never blocks - hard realtime guarantee for the writer!

Author's Note
Comment
The reader, in the common case (when it's not concurrent with the writer) is incredibly fast. Only slightly slower than a non-thread-safe version. In the less common case, it can be delayed. The writer has hard realtime guarantees, because the reader can never block the writer. Note that this is the opposite of using a read-write lock, where the readers can indefinitely block the writer. This way around is usually better, because readers typically want the most up-to-date value.
Page 16: 11 Lock Freedom

Example: Lockless Ring Buffer

• Producer-consumer problem

• Solution without locks

• Producer and consumer can both access queue concurrently!

Author's Note
Comment
Ring buffers are a constant-space solution to the producer-consumer problem. They are a block of memory with insert and read points. Inserting happens along the buffer, then wraps around to the start. Reading follows the insert point, consuming data in the order that it is inserted. Some mechanism should ensure that the insert point never overtakes the read point, or data will be overwritten before it is read (some sound cards do this, because losing data is better than pausing).
Page 17: 11 Lock Freedom

Simple Ring Buffer

1. Acquire lock

2. Insert object

3. Release lock

1. Acquire lock

2. Collect object

3. Release lock

How do we make this lock free?

Author's Note
Comment
In this simple solution (which we looked at last time), the producer and consumer threads can both block each other while trying to access the queue.
Page 18: 11 Lock Freedom

Potential Concurrency Problems

• Producer must find free space

• Consumer must find next item

• Producer must be able to tell if the buffer is full

• Consumer must be able to tell if the buffer is empty

Author's Note
Comment
All four of these things depend on the state of the ring buffer, so they're interaction points between the threads, and potential places for concurrency bugs to hide.
Page 19: 11 Lock Freedom

Solution: Free-running Counters

```c
volatile uint32_t producer;
volatile uint32_t consumer;

const int shift = 8;
// Must be a power of two!
const int bufferSize = 1 << shift;
const int bufferMask = bufferSize - 1;
void *buffer[bufferSize];
```

Author's Note
Comment
With free-running counters and a power-of-two size buffer, we can translate from the counter to an index by just masking the low bits, and can get the amount of space in the buffer by subtracting the consumer counter from the producer counter.
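The counter arithmetic the note describes can be sketched directly (the names slot and items_in_buffer are invented for illustration):

```c
#include <stdint.h>

enum { SHIFT = 8, BUFFER_SIZE = 1 << SHIFT, BUFFER_MASK = BUFFER_SIZE - 1 };

/* Counter -> slot index: just mask off the low bits. */
uint32_t slot(uint32_t counter)
{
    return counter & BUFFER_MASK;
}

/* Items currently in the buffer: producer minus consumer.  Unsigned
 * subtraction gives the right answer even after either counter has
 * wrapped past 2^32. */
uint32_t items_in_buffer(uint32_t producer, uint32_t consumer)
{
    return producer - consumer;
}
```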
Page 20: 11 Lock Freedom

Inserting into the Ring

```c
void insert(void *v)
{
    // Spin while the buffer is full
    while (producer - consumer >= bufferSize) ;
    buffer[producer & bufferMask] = v;
    producer++;
}
```

Author's Note
Comment
If the buffer is full, then spin. If not, then insert the new value at the current insert point and increment the producer counter.
Page 21: 11 Lock Freedom

Retrieving the Next Value

```c
void *fetch(void)
{
    // Spin while the buffer is empty
    while (producer == consumer) ;
    void *v = buffer[consumer & bufferMask];
    consumer++;
    return v;
}
```

Author's Note
Comment
Spin while the queue is empty. When it isn't, grab the next value and increment the consumer counter.
Page 22: 11 Lock Freedom

Why it Works

• The counters are only ever read from one thread, written from the other

• Each update only increments a single shared variable

• Wrap-around is handled by overflow

Author's Note
Comment
The counters are only touched by one thread, so atomic increments are not needed - the worst that will happen is that the other thread sees the old value. Wrap around to the start of the buffer happens automatically due to standard integer overflow semantics.
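A self-contained version of the ring buffer makes the wrap-around behaviour easy to exercise: starting the counters just below 2^32, inserts and fetches still pair up correctly after both counters overflow. The names here are invented; the logic follows the insert and fetch slides.

```c
#include <stdint.h>

enum { RING_SIZE = 8, RING_MASK = RING_SIZE - 1 };

static volatile uint32_t prod, cons;
static void *ring[RING_SIZE];

/* Single-producer insert: spin while full, then store and publish. */
void ring_insert(void *v)
{
    while (prod - cons >= RING_SIZE)
        ;
    ring[prod & RING_MASK] = v;
    prod++;
}

/* Single-consumer fetch: spin while empty, then load and consume. */
void *ring_fetch(void)
{
    while (prod == cons)
        ;
    void *v = ring[cons & RING_MASK];
    cons++;
    return v;
}
```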
Page 23: 11 Lock Freedom

Better than Condition Variables?

• Very fast when the consumer and producer are both running

• Causes producer to spin when queue is full

• Causes consumer to spin when queue is empty

• Solution: Hybrid - use a condition variable only on boundary conditions (i.e. when transitioning to and from full / empty states)

Author's Note
Comment
This has the same problem as the solution involving a mutex - the threads spin when the queue is empty / full. You can improve this by combining it with a condition variable and using that when the queue is empty.
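One way the suggested hybrid might look, assuming pthreads (all names here are invented; signalling under the mutex avoids the classic lost-wakeup race, though a production version needs more care with memory ordering on the counters):

```c
#include <pthread.h>
#include <stdint.h>

enum { HRING_SIZE = 8, HRING_MASK = HRING_SIZE - 1 };

static volatile uint32_t hprod, hcons;
static void *hring[HRING_SIZE];
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

/* Fast path is the lock-free buffer; the mutex and condition
 * variables are touched only at the full/empty boundaries. */
void hybrid_insert(void *v)
{
    if (hprod - hcons >= HRING_SIZE) {          /* slow path: full */
        pthread_mutex_lock(&ring_lock);
        while (hprod - hcons >= HRING_SIZE)
            pthread_cond_wait(&not_full, &ring_lock);
        pthread_mutex_unlock(&ring_lock);
    }
    hring[hprod & HRING_MASK] = v;
    hprod++;
    if (hprod - hcons == 1) {                   /* was empty: wake consumer */
        pthread_mutex_lock(&ring_lock);
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&ring_lock);
    }
}

void *hybrid_fetch(void)
{
    if (hprod == hcons) {                       /* slow path: empty */
        pthread_mutex_lock(&ring_lock);
        while (hprod == hcons)
            pthread_cond_wait(&not_empty, &ring_lock);
        pthread_mutex_unlock(&ring_lock);
    }
    void *v = hring[hcons & HRING_MASK];
    hcons++;
    if (hprod - hcons == HRING_SIZE - 1) {      /* was full: wake producer */
        pthread_mutex_lock(&ring_lock);
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&ring_lock);
    }
    return v;
}
```

In the common case (buffer neither full nor empty) neither thread touches the mutex, so the fast path keeps the lock-free performance.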
Page 24: 11 Lock Freedom

Questions?