replication and consistency - tu kaiserslautern · but idealized (we forgot to mention a few...

Post on 22-May-2020

7 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Replication and Consistency08 Spin Locking and Contention

Annette Bieniusa

AG SoftechFB Informatik

TU Kaiserslautern

Annette Bieniusa Replication and Consistency 1/ 76

Thank you!

These slides are based on companion material of the following books:

The Art of Multiprocessor Programming by Maurice Herlihyand Nir ShavitSynchronization Algorithms and Concurrent Programmingby Gadi Taubenfeld

Annette Bieniusa Replication and Consistency 2/ 76

Previously on Replication and Consistency

ModelsAccurate (we never lied to you)But idealized (we forgot to mention a few things)

ProtocolsElegantEssentialBut naive

Annette Bieniusa Replication and Consistency 3/ 76

New Focus: Performance in Real Systems

ModelsMore complicated (more details)Still focus on principles (not soon to become obsolete)

ProtocolsElegant (in their fashion)Important (why else would we discuss them)And realistic (more optimizations will be possible, though)

Annette Bieniusa Replication and Consistency 4/ 76

Mutual Exclusion, revisited

Think of performance, not just correctness and progressBegin to understand how performance depends on our softwareproperly utilizing the multiprocessor machine’s hardwareAnd get to know a collection of locking algorithms

Annette Bieniusa Replication and Consistency 5/ 76

If a processor doesn’t get a lock . . .

QuestionWhat can the processor do?

Keep trying“spin” or “busy-wait” as with Filter and Bakery algorithmUseful on multi-processors if expected delays are short

Suspend and allow scheduler to schedule other processes“blocking’ ’ as with Java’s monitorsGood if delays are longAlways good on uniprocessors

In practise, often mix of both strategiesSpin for a short timeThen, suspend

Annette Bieniusa Replication and Consistency 6/ 76

If a processor doesn’t get a lock . . .

QuestionWhat can the processor do?

Keep trying“spin” or “busy-wait” as with Filter and Bakery algorithmUseful on multi-processors if expected delays are short

Suspend and allow scheduler to schedule other processes“blocking’ ’ as with Java’s monitorsGood if delays are longAlways good on uniprocessors

In practise, often mix of both strategiesSpin for a short timeThen, suspend

Annette Bieniusa Replication and Consistency 6/ 76

Basic Spin-Lock

Contention: Multiple threads try to acquire lock at the same timeHoch can we avoid or alleviate contention?

Annette Bieniusa Replication and Consistency 7/ 76

Test-and-Set (TAS) revisitedMachine-instruction on one word (here: for boolean values)Atomically, swap new value with prior value and return prior valueSwapping in true is called Test-And-SetAka getAndSet() in Java

\\ Package java.utitl.concurrent.atomic

public class AtomicBoolean boolean value;

// implemented as one hardware instructionpublic synchronized boolean getAndSet(boolean newValue)

boolean prior = value;value = newValue;return prior;

Annette Bieniusa Replication and Consistency 8/ 76

Task: Design a lock using Test-and-Set (TAS)!

class TASLock implements Lock

// if false, lock is free// if true, lock is takenAtomicBoolean state = new AtomicBoolean(false);

void lock() // TODO

void unlock() // TODO

Annette Bieniusa Replication and Consistency 9/ 76

Test-and-Set Lock

class TASLock

AtomicBoolean state = new AtomicBoolean(false);

void lock() while (state.getAndSet(true))

void unlock() state.set(false);

Annette Bieniusa Replication and Consistency 10/ 76

Space Complexity

TAS spin-lock has small “footprint”N thread spin-lock uses O(1) spaceAs opposed to O(N) Peterson/Bakery

QuestionHow did we overcome the Ω(N) lower bound?

⇒ Use an object with higher consensus number!

Annette Bieniusa Replication and Consistency 11/ 76

Space Complexity

TAS spin-lock has small “footprint”N thread spin-lock uses O(1) spaceAs opposed to O(N) Peterson/Bakery

QuestionHow did we overcome the Ω(N) lower bound?⇒ Use an object with higher consensus number!

Annette Bieniusa Replication and Consistency 11/ 76

Performance Evaluation

ExperimentSpawn N threadsIncrement shared counter 1 million timesWork is split between the threads, i.e. each thread does 106/NincrementsEach thread takes lock, increments a counter, releases lock

How long should it take?How long does it take?

Annette Bieniusa Replication and Consistency 12/ 76

HypothesisNo speedup because lock is sequential bottleneck (Amadahl’slaw!)

Annette Bieniusa Replication and Consistency 13/ 76

Mystery 1

A typical evaluation looks like this:

Annette Bieniusa Replication and Consistency 14/ 76

New approach: Test-and-Test-and-Set Locks

Lurking stageWait until lock seems to be freeSpin while read returns true (lock taken)

Pouncing stateAs soon as lock seems to be availableRead returns false (lock free)Call TAS to acquire lockIf TAS loses, back to lurking

Annette Bieniusa Replication and Consistency 15/ 76

Test-and-Test-and-Set Locks

class TTASLock extends TASLockvoid lock()

while (true) while (state.get()) // Lurkif (!state.getAndSet(true)) // Pounce

return;

Annette Bieniusa Replication and Consistency 16/ 76

Mystery 2

Both TAS and TTAS do the same thing in our modelBut TTAS performs much better in actual evaluationsNeither approach is ideal

Our memory abstraction is broken! We need a more detailed model!

Annette Bieniusa Replication and Consistency 17/ 76

Mystery 2

Both TAS and TTAS do the same thing in our modelBut TTAS performs much better in actual evaluationsNeither approach is ideal

Our memory abstraction is broken! We need a more detailed model!

Annette Bieniusa Replication and Consistency 17/ 76

Mystery 2

Both TAS and TTAS do the same thing in our modelBut TTAS performs much better in actual evaluationsNeither approach is ideal

Our memory abstraction is broken! We need a more detailed model!

Annette Bieniusa Replication and Consistency 17/ 76

Bus-Based Architectures

Random Access Memory (access time: 10s of cycles)Shared Bus as broadcast medium

One broadcaster at a timeOther processors and memory can passively listen

Per-Processor Caches (access time: 1-2 cycles)

Annette Bieniusa Replication and Consistency 18/ 76

Cache Coherence

We have lots of copies of dataOriginal copy in memoryCached copies at processors

If some processor modifies its own copy:What do we do with the others?How to avoid confusion about actual value?

Cache coherence protocol!

Annette Bieniusa Replication and Consistency 19/ 76

Cache Coherence

We have lots of copies of dataOriginal copy in memoryCached copies at processors

If some processor modifies its own copy:What do we do with the others?How to avoid confusion about actual value?

Cache coherence protocol!

Annette Bieniusa Replication and Consistency 19/ 76

Write-Back Caches

Idea: Accumulate changes in cache and write back when neededBecause we need cache for something elseOr because another processor wants to read the changed value

On first modification, invalidate all other entriesCache entry can be marked as dirty (i.e. it must be eventuallywritten back to main memory)

Annette Bieniusa Replication and Consistency 20/ 76

When a thread modifies its cache value, . . .

Annette Bieniusa Replication and Consistency 21/ 76

. . . it invalidates all other caches

Annette Bieniusa Replication and Consistency 22/ 76

When another thread want to read, . . .

Annette Bieniusa Replication and Consistency 23/ 76

. . . the owner responds

Annette Bieniusa Replication and Consistency 24/ 76

Mystery Explained!

TAS-LockSpinning threads invalidate cache line with TAS, keeps bus busyThreads wanting to release lock is delayed behind spinners

TTAS-LockThreads spin on local cacheNo bus use while lock is takenProblem: When lock is released, reads are satisfied sequentially onbusEventually system quiesces after lock has been acquired

→ quiescence time linear in number of threads for bus architecture

Annette Bieniusa Replication and Consistency 25/ 76

Solution: Introduce Delay

“If the lock looks free, but I fail to get it, there must be lots ofcontention!”

⇒ Better to back off than to collide again

Example: Exponential Backoff

If I fail to get lock

Wait random duration before retryEach subsequent failure doubles expected wait (up to fixedmaximum)

Annette Bieniusa Replication and Consistency 26/ 76

Solution: Introduce Delay

“If the lock looks free, but I fail to get it, there must be lots ofcontention!”

⇒ Better to back off than to collide again

Example: Exponential Backoff

If I fail to get lock

Wait random duration before retryEach subsequent failure doubles expected wait (up to fixedmaximum)

Annette Bieniusa Replication and Consistency 26/ 76

Exponential Backoff Lock

class Backoff extends TTASLock

void lock() int delay = MIN_DELAY;while (true)

while (state.get()) if (!lock.getAndSet(true))

return;// if not successful, we waitsleep(random() % delay);if (delay < MAX_DELAY)

delay = 2 * delay;

Annette Bieniusa Replication and Consistency 27/ 76

Exponential Backoff Lock

Easy to implementBut must choose parameters carefullyNot portable across platforms

Idea

Avoid useless invalidations by keeping a queue of threadsEach thread notifies next in line without bothering the others

Annette Bieniusa Replication and Consistency 28/ 76

Exponential Backoff Lock

Easy to implementBut must choose parameters carefullyNot portable across platforms

Idea

Avoid useless invalidations by keeping a queue of threadsEach thread notifies next in line without bothering the others

Annette Bieniusa Replication and Consistency 28/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 29/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 30/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 31/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 32/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 33/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 34/ 76

Anderson Queue Lock

Annette Bieniusa Replication and Consistency 35/ 76

Anderson Queue Lock

class ALock implements Lock boolean[] flags = true,false,...,false; // one per threadAtomicInteger next = new AtomicInteger(0);ThreadLocal<Integer> mySlot; // thread-local per thread

void lock() mySlot = next.getAndIncrement();while (!flags[mySlot % n]) ; //spinflags[mySlot % n] = false; // prepare for re-use (wrong inFigure!)

void unlock() flags[(mySlot+1) % n] = true; // tell next thread

Annette Bieniusa Replication and Consistency 36/ 76

Anderson Lock

FIFO fairness, no lockoutScalable performance

Threads spin on locally cached copy of single array locationBut beware of false sharing of items on the same cache line!Invalidations always per cache lineTrick: Use padding to avoid sharing

Not space-efficientRequires knowledge about number of threads

Annette Bieniusa Replication and Consistency 37/ 76

CLH Lock (by Craig, Landin, Hagersten)

Annette Bieniusa Replication and Consistency 38/ 76

CLH Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 39/ 76

CLH Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 40/ 76

CLH Lock: It’s a Queue!

Annette Bieniusa Replication and Consistency 41/ 76

CLH Lock: Releasing a lock

Annette Bieniusa Replication and Consistency 42/ 76

CLH Lock: Releasing a lock

Annette Bieniusa Replication and Consistency 43/ 76

Remarks

Threads spin on cached copy (efficient)Lock can reuse predecessor’s node for future lock accesses

Annette Bieniusa Replication and Consistency 44/ 76

CLH Lockclass Qnode

AtomicBoolean locked = new AtomicBoolean(true);

class CLHLock implements Lock

AtomicReference<Qnode> tail = new AtomicReference<Qnode>(null);ThreadLocal<Qnode> myNode = new Qnode(); // per thread

void lock() qnolde.locked = true;Qnode pred = tail.getAndSet(myNode); // swap my node intoqueuewhile (pred.locked) // spin

void unlock() myNode.locked = false;myNode = pred; // "reuse" predecessor's qnode (see book)

Annette Bieniusa Replication and Consistency 45/ 76

CLH Lock

Lock release affects only successorDoes not depend on prior knowledge about number of threadsFIFO FairnessBut doesn’t work (efficiently) for uncached NUMA architectures

Annette Bieniusa Replication and Consistency 46/ 76

NUMA Architectures

N on-U niform-M emomory-A rchitectureModel: Flat shared memory, no caches (in most variants)Some memory regions faster accessible than othersSpinning on remote memory is slow

Annette Bieniusa Replication and Consistency 47/ 76

MCS Lock (by Mellor-Crummey and Scott)

FIFO orderSpin on local memory onlySmall, constant-size overhead

Idea:

To acquire lock, place own Qnode at tail of listIf it has a predecessor, modify predecessor’s node to refer to ownQnode

Annette Bieniusa Replication and Consistency 48/ 76

MCS Lock

Annette Bieniusa Replication and Consistency 49/ 76

MCS Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 50/ 76

MCS Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 51/ 76

MCS Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 52/ 76

MCS Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 53/ 76

MCS Lock: Acquiring a lock

Annette Bieniusa Replication and Consistency 54/ 76

MCS Lock: Releasing a lock

Annette Bieniusa Replication and Consistency 55/ 76

MCS Lock

class Qnode boolean locked = false; // only reads/writes requiredQnode next = null;

Annette Bieniusa Replication and Consistency 56/ 76

MCS Lockclass MCSLock implements Lock

AtomicReference tail;ThreadLocal<Qnode> qnode = new Qnode();

void lock() // reset for reuseqnode.next = null;qnode.locked = false;

// swap my node inQnode pred = tail.getAndSet(qnode);

if (pred != null) // lock is taken, so set my status to waitqnode.locked = true;// tell predecessor where to find mepred.next = qnode;// spin on my nodewhile (qnode.locked)

...

Annette Bieniusa Replication and Consistency 57/ 76

MCS Lock: Releasing

Status of qnode.next indicates that other thread is activeNeed to wait for it to finish and start spinning

Annette Bieniusa Replication and Consistency 58/ 76

MCS Lock: Releasing

Annette Bieniusa Replication and Consistency 59/ 76

MCS Lock: Releasing

Annette Bieniusa Replication and Consistency 60/ 76

MCS Lock

void unlock() if (qnode.next == null)

// if really no thread waitingif (tail.compareAndSet(qnode, null)return;

// otherwise, wait for successor to finishwhile (qnode.next == null)

// tell successor that it can startqnode.next.locked = false;

Annette Bieniusa Replication and Consistency 61/ 76

Abortable Locks

What if you want to give up waiting for a lock?For example: timeout, transaction aborted by user, . . .

Simple for Backoff-LockJust return from lock() callNo cleanup, wait-free, immediate

Problematic for Queue LocksCan’t just quitThread in line behind will starve

Idea: Let successor deal with the problem!

⇒ Abortable CLH Lock

Annette Bieniusa Replication and Consistency 62/ 76

Abortable Locks

What if you want to give up waiting for a lock?For example: timeout, transaction aborted by user, . . .

Simple for Backoff-LockJust return from lock() callNo cleanup, wait-free, immediate

Problematic for Queue LocksCan’t just quitThread in line behind will starve

Idea: Let successor deal with the problem!

⇒ Abortable CLH Lock

Annette Bieniusa Replication and Consistency 62/ 76

Timeout Lock

Annette Bieniusa Replication and Consistency 63/ 76

Timeout Lock: Acquire

Annette Bieniusa Replication and Consistency 64/ 76

Timeout Lock: Acquire

Annette Bieniusa Replication and Consistency 65/ 76

Timeout Lock: Acquire

Annette Bieniusa Replication and Consistency 66/ 76

Timeout Lock: Acquire

Annette Bieniusa Replication and Consistency 67/ 76

Timeout Lock: Acquire

Annette Bieniusa Replication and Consistency 68/ 76

Timeout Lock: While waiting, . . .

Annette Bieniusa Replication and Consistency 69/ 76

Timeout Lock: Thread times out

Annette Bieniusa Replication and Consistency 70/ 76

Timeout Lock: Thread times out

Annette Bieniusa Replication and Consistency 71/ 76

Timeout Locks: Implementationclass TOLock

static Qnode AVAILABLE = new Qnode(); // signifies free lockAtomicReference<Qnode> tail;ThreadLocal<Qnode> myNode; // per thread

// Return value indicates successboolean lock(long timeout)

// Initialize nodeQnode qnode = new Qnode();myNode = qnode;qnode.prev = null;

// swap with tailQnode myPred = tail.getAndSet(qnode);

// if predecessor absent or released, we are doneif (myPred == null || myPred.prev == AVAILABLE)

return true;

...

Annette Bieniusa Replication and Consistency 72/ 76

Timeout Locks

...// Keep trying for a whilelong start = now();while (now()- start < timeout)

// Spin on predecessor's prev fieldQnode predPred = myPred.prev;if (predPred == AVAILABLE) // predecessor released lockreturn true;

else if (predPred != null) // predecessor aborted, we advance in queuemyPred = predPred;

...

Annette Bieniusa Replication and Consistency 73/ 76

Timeout Locks

...// In case timeout happened, we waited long enoughif (!tail.compareAndSet(qnode, myPred))

// If CAS fails, tell successor about my predecessorqnode.prev = myPred;

// If CAS succeeds, no successor, nothing to doreturn false;

Annette Bieniusa Replication and Consistency 74/ 76

Timeout Locks

void unlock() Qnode qnode = myNode.get();if (!tail.compareAndSet(qnode, null))

// If CAS failed: there is successor// Notify successor that it can enterqnode.prev = AVAILABLE;

// If CAS succeeds: no successor waiting// Set tail to null, no clean up

Annette Bieniusa Replication and Consistency 75/ 76

Summary: One Lock To Rule Them All?

TTAS+Backoff, CLH, MCS, ToLock . . .Each one better than others in some wayThere is no one solutionDecision really depends on:

the applicationthe hardwarewhich properties are important

Annette Bieniusa Replication and Consistency 76/ 76

top related