The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 1 (of 23): The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. Paper: Thomas E. Anderson. Presentation: Emerson Murphy-Hill



TRANSCRIPT

Page 1: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 1 (of 23)

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Paper: Thomas E. Anderson
Presentation: Emerson Murphy-Hill

Page 2: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 2 (of 23)


Introduction
• Shared-memory multiprocessors
• Mutual exclusion required
• Hardware primitives are almost always provided
  – Direct mutual exclusion
  – Mutual exclusion through locking
• Interest here: short critical regions, spin locks
• The problem: spinning processors cost communication bandwidth. How can we cut it?

Page 3: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 3 (of 23)


Range of Architectures
• Two dimensions:
  – Interconnect type (multistage network or bus)
  – Cache type
• So six architectures considered:
  – Multistage network without private caches
  – Multistage network, invalidation-based cache coherence using RD
  – Bus without coherent private caches
  – Bus with snoopy write-through invalidation-based cache coherence
  – Bus with snoopy write-back invalidation-based cache coherence
  – Bus with snoopy distributed-write cache coherence
• Architectures generally read, modify, and write atomically

Page 4: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 4 (of 23)


Why Spinlocks are Slow
• Tradeoff: frequent polling gets you the lock faster, but slows everyone else down
• Latency is also an issue: a complicated spin-lock algorithm adds some overhead to every acquisition

Page 5: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 5 (of 23)


A Spin-Waiting Algorithm
• Spin on Test-and-Set

    while (TestAndSet(lock) = BUSY);
    <critical section>
    lock := CLEAR;

• Slow, because:
  – The lock holder must contend with non-lock-holders for the bus
  – Spinning requests slow other requests
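To make the tradeoff concrete, here is a minimal CPython sketch of spin-on-Test-and-Set (an illustration, not the paper's code): the hardware TestAndSet instruction is simulated with a non-blocking threading.Lock.acquire, which atomically tests and sets a flag.

```python
import threading
import time

class TASLock:
    """Spin on Test-and-Set. In this sketch the atomic test-and-set is
    simulated by a non-blocking threading.Lock.acquire."""

    def __init__(self):
        self._flag = threading.Lock()

    def lock(self):
        # Every poll is an atomic read-modify-write; on a real machine
        # each one consumes bus/network bandwidth even when it fails.
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield so other simulated processors can run

    def unlock(self):
        self._flag.release()

counter = 0
lk = TASLock()

def worker():
    global counter
    for _ in range(500):
        lk.lock()
        counter += 1          # critical section
        lk.unlock()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000: mutual exclusion held
```

Every failed acquire here stands in for a read-modify-write that, on real hardware, consumes communication bandwidth even though it returns BUSY.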

Page 6: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 6 (of 23)


Another Spin-Waiting Algorithm
• Spin on Read (Test-and-Test-and-Set)

    while (lock = BUSY or TestAndSet(lock) = BUSY);
    <critical section>
    lock := CLEAR;

• For architectures with per-processor caches
• Like the previous algorithm, but reads generate no network/bus communication
• For short critical sections this is still slow, because the time to quiesce (for all processors to resume spinning) dominates
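A CPython sketch of spin-on-read (again an illustration, with TestAndSet simulated by a non-blocking lock): the plain `busy` attribute plays the role of a locally cached copy of the lock, so reading it models a cache hit with no bus traffic.

```python
import threading
import time

class TTASLock:
    """Spin on Read (test-and-test-and-set). The plain `busy` attribute
    stands in for a cached copy of the lock: reading it generates no
    (simulated) bus traffic; only the occasional Test-and-Set does."""

    def __init__(self):
        self._flag = threading.Lock()   # simulated atomic test-and-set
        self.busy = False               # advisory copy, read locally

    def lock(self):
        while True:
            while self.busy:            # spin on the (cached) read
                time.sleep(0)           # yield in this simulation
            # Lock looks free: now issue the expensive Test-and-Set.
            if self._flag.acquire(blocking=False):
                self.busy = True        # `busy` is only a hint; _flag is
                return                  # what guarantees exclusion
            time.sleep(0)               # collided with another waiter

    def unlock(self):
        self.busy = False
        self._flag.release()

counter = 0
lk = TTASLock()

def worker():
    global counter
    for _ in range(500):
        lk.lock()
        counter += 1                    # critical section
        lk.unlock()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000
```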

Page 7: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 7 (of 23)


Reasons Why Quiescence is Slow
• Elapsed time between the Read and the Test-and-Set
• All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails
• Invalidation-based cache coherence requires O(P) bus/network cycles, because the written value has to be propagated to every processor (even though it is the same value!)

Page 8: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 8 (of 23)


Validation

Page 9: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 9 (of 23)


Validation (a bit more)

Page 10: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 10 (of 23)


Now, Speed it Up…
• The author presents 5 alternative approaches
• Interesting approach: 4 are based on the observation that communication during spin-waiting is like CSMA (Ethernet) networking protocols

Page 11: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 11 (of 23)


1/5: Static Delay on Lock Release
• When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set
• Each processor is assigned a static delay (slot)
• Good performance when:
  – There are few slots and few spinning processors
  – There are many slots and many spinning processors
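A toy model of one release under static slots (the function names and the `slot_time`/`tas_time` parameters are illustrative assumptions, not quantities from the paper): with distinct slots and a long enough slot time, only the winner's Test-and-Set ever reaches the bus.

```python
def release_winner(waiting_slots):
    """Each waiter delays (its slot) * slot_time after seeing the
    release; with distinct static slots, the smallest slot wins."""
    return min(waiting_slots)

def wasted_test_and_sets(waiting_slots, slot_time, tas_time):
    """Count losers whose delay expires before the winner's Test-and-Set
    completes: they still issue a failing, bandwidth-consuming
    Test-and-Set."""
    winner = min(waiting_slots)
    deadline = winner * slot_time + tas_time
    return sum(1 for s in waiting_slots
               if s != winner and s * slot_time < deadline)

# Short slots: losers fire before the winner finishes, wasting bus cycles.
print(wasted_test_and_sets([0, 1, 2], slot_time=1, tas_time=5))   # 2
# Long slots: losers see the lock busy again; no wasted Test-and-Sets.
print(wasted_test_and_sets([0, 1, 2], slot_time=10, tas_time=5))  # 0
```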

Page 12: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 12 (of 23)


2/5: Backoff on Lock Release
• Like Ethernet backoff
• Wait a small amount of time between the Read and the Test-and-Set
• If a processor collides with another processor, it backs off for a greater random interval
• Indirectly, processors base their backoff interval on the number of spinning processors
• But…

Page 13: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 13 (of 23)


More on Backoff…
• Processors should not change their mean delay when another processor acquires the lock
• The maximum delay should be bounded
• The initial delay on arrival should be a fraction of the last delay used
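These rules can be sketched as small helper functions (a hypothetical shape; the constants, the doubling factor, and the jitter distribution are assumptions for illustration, not the paper's values):

```python
import random

def next_backoff(current, cap, factor=2.0):
    """After a failed Test-and-Set (a collision), grow the mean delay
    multiplicatively, but never beyond the bound `cap`. Note what is
    deliberately absent: nothing here reacts to another processor
    merely acquiring the lock."""
    return min(current * factor, cap)

def initial_backoff(last_delay, fraction=0.5):
    """A newly arriving processor starts at a fraction of the delay it
    last used, rather than restarting from the minimum."""
    return last_delay * fraction

def jittered(mean_delay, rng):
    """The actual wait is randomized around the mean so waiters do not
    retry in lockstep."""
    return rng.uniform(0.5 * mean_delay, 1.5 * mean_delay)

print(next_backoff(1.0, 8.0))   # 2.0
print(next_backoff(8.0, 8.0))   # 8.0 (bounded)
print(initial_backoff(4.0))     # 2.0
```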

Page 14: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 14 (of 23)


3/5: Static Delay before Reference

    while (lock = BUSY or TestAndSet(lock) = BUSY)
        delay();
    <critical section>

• Here you just check the lock less often
• Good when:
  – Checking frequently and there are few other spinners
  – Checking infrequently and there are many spinners
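A tiny, hypothetical model of "checking less often": a waiter polls once per `period` time units while the lock is held until `release_time`. Longer periods mean fewer polls (less bus traffic) at the price of acquiring the lock later.

```python
def polls_until_acquire(release_time, period):
    """Return (polls issued, time of acquisition) for a waiter that
    checks the lock every `period` time units while it is held until
    `release_time`."""
    t, polls = 0, 0
    while t < release_time:   # each poll before the release finds BUSY
        polls += 1
        t += period
    return polls + 1, t       # the poll at time t finally succeeds

print(polls_until_acquire(10, 1))  # (11, 10): many polls, no extra latency
print(polls_until_acquire(10, 4))  # (4, 12): fewer polls, acquired 2 late
```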

Page 15: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 15 (of 23)


4/5: Backoff before Reference

    while (lock = BUSY or TestAndSet(lock) = BUSY)
        delay();
        delay += randomBackoff();
    <critical section>

• Analogous to backoff on lock release
• Both dynamic and static backoff are bad when the critical section is long: waiters just keep backing off while the lock is held
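One way to sketch the growing delay (the increment's distribution and the cap are assumptions for illustration):

```python
import random

def spin_delay_schedule(failed_checks, base=1.0, cap=64.0, seed=0):
    """Successive delays when every failed check grows the delay by a
    random increment (delay += randomBackoff()), capped at `cap`."""
    rng = random.Random(seed)
    delay, schedule = base, []
    for _ in range(failed_checks):
        schedule.append(delay)
        delay = min(delay + rng.uniform(0.0, delay), cap)
    return schedule

sch = spin_delay_schedule(8)
print(sch[0], sch[-1])  # delays only ever grow toward the cap
```

On a long critical section the schedule keeps growing toward the cap even though no handoff is imminent, which is exactly the weakness noted for long critical sections.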

Page 16: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 16 (of 23)


5/5: Queue
• Can't estimate backoff from the number of waiting processors, and can't keep a process queue in memory (it would be just as slow as the lock itself!)
• This author's contribution (finally):

    Init    flags[0] := HAS_LOCK;
            flags[1..P-1] := MUST_WAIT;
            queueLast := 0;

    Lock    myPlace := ReadAndIncrement(queueLast);
            while (flags[myPlace mod P] = MUST_WAIT);
            <critical section>

    Unlock  flags[myPlace mod P] := MUST_WAIT;
            flags[(myPlace+1) mod P] := HAS_LOCK;
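A runnable CPython sketch of the queue lock above; the atomic ReadAndIncrement is simulated with a small internal mutex (an artifact of the simulation; on real hardware it is a single atomic fetch-and-increment), and at most P processors may contend at once.

```python
import threading
import time

class ArrayQueueLock:
    """Array-based queue lock: each waiter spins on its own flag slot,
    and release hands the lock to the next slot in FIFO order."""
    HAS_LOCK, MUST_WAIT = 0, 1

    def __init__(self, nprocs):
        self.P = nprocs                     # at most P concurrent contenders
        self.flags = [self.MUST_WAIT] * nprocs
        self.flags[0] = self.HAS_LOCK
        self.queue_last = 0
        self._ctr = threading.Lock()        # stands in for ReadAndIncrement

    def lock(self):
        with self._ctr:                     # myPlace := ReadAndIncrement(queueLast)
            my_place = self.queue_last
            self.queue_last += 1
        while self.flags[my_place % self.P] == self.MUST_WAIT:
            time.sleep(0)                   # spin on my own flag only
        return my_place

    def unlock(self, my_place):
        self.flags[my_place % self.P] = self.MUST_WAIT       # recycle my slot
        self.flags[(my_place + 1) % self.P] = self.HAS_LOCK  # FIFO handoff

counter = 0
lk = ArrayQueueLock(nprocs=4)

def worker():
    global counter
    for _ in range(500):
        place = lk.lock()
        counter += 1                        # critical section
        lk.unlock(place)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000
```

Because each waiter polls a different flag slot, a release invalidates (at most) one waiter's location instead of every spinner's cached copy, which is the point of the algorithm.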

Page 17: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 17 (of 23)


More on Queuing
• Works especially well on multistage networks: each flag can live on a separate memory module, so no single memory location is saturated with requests
• Works less well on a bus without cache coherence, because each processor still has to poll a single value in one place
• Lock latency is increased (overhead), so performance is poor when there is no contention

Page 18: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 18 (of 23)


Benchmark Spin-lock Alternatives

Page 19: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 19 (of 23)


Overhead vs. Number of Slots

Page 20: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 20 (of 23)


Spin-waiting Overhead for a Burst

Page 21: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 21 (of 23)


Network Hardware Solutions
• Combining networks
  – Multiple paths to the same memory location
• Hardware queuing
  – Eliminates polling across the network
• Goodman's Queue Links
  – Stores the name of the next processor in the queue directly in each processor's cache
  – Eliminates the need for a memory access for queuing

Page 22: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 22 (of 23)


Bus Hardware Solutions
• Invalidate cached copies ONLY when the Test-and-Set succeeds
• Read broadcast
  – Whenever some other processor reads a value I know is invalid, I get a copy of that value too (piggybacking)
  – Eliminates the cascade of read misses
• Special handling of Test-and-Set
  – Cache and bus controllers don't touch the bus while the lock is busy
  – Essentially, a test-and-set is not issued as long as there is a possibility it might fail

Page 23: The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

Slide 23 (of 23)


Conclusions
• Spin-lock performance doesn't scale
• A variant of Ethernet backoff gives good results when there is little lock contention
• Queuing (parallelizing lock handoff) gives good results when there are waiting processors
• A little supportive hardware goes a long way toward a healthy multiprocessor relationship