SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)
Olivier Giroux, Distinguished Architect, ISO C++ Chair of Concurrency & Parallelism


TRANSCRIPT

Page 1

Olivier Giroux, Distinguished Architect, ISO C++ Chair of Concurrency & Parallelism.

SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)

Page 2

[Diagram: communities: ISO C++ users, NVIDIA GPU engineers, WG21 memory model, architects.]

[email protected]

My coordinates

Page 3

WHAT THIS TALK IS ABOUT:

🛑 Not cudaDeviceSynchronize()

🛑 Not __syncthreads()

🛑 Not __shfl_sync()

✅ Using atomics to do blocking synchronization.

Page 4

[Diagram: execution timeline of threads T0–T6; at any instant one thread runs while the other six sit blocked ("SIGH!!", "Still blocked", "Blocked. Dammit.", "All blocked and no play makes T6 a dull thread…").]

PSA: DON'T RUN SERIAL CODE IN THREADS

Page 5

[Diagram: execution timeline of threads T0–T6; each thread runs nearly continuously, with only brief, occasional blocked intervals.]

PSA: RARE CONTENTION IS FINE

Page 6

struct mutex { // suspend atomic<> disbelief for now
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};
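To make usage concrete, a minimal sketch (mine, not from the slides) of this lock guarding a shared counter from device code; mtx and counter are hypothetical globals, and the atomic<> disbelief is still suspended:

__device__ mutex mtx;        // hypothetical global lock instance
__device__ int counter = 0;  // shared state the lock protects

__global__ void increment_many(int iterations) {
    for(int i = 0; i < iterations; ++i) {
        mtx.lock();          // spin until the exchange observes 0
        ++counter;           // critical section: one thread at a time
        mtx.unlock();        // release so the next contender can enter
    }
}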

🎉 Thanks for attending my talk. 🎉

Awesome.

[Chart: critical sections per second (1e6–1e9, log scale) vs. thread occupancy (1–2048), for V100 and CPU.]

UNCONTENDED EXCHANGE LOCK

Page 7


Deadlock.

🐰🕳

Deadlock.

Deadlock.

Deadlock.

Page 8

😎 Atomic result feeds branch, closes loop, Volta+

😱 Atomic result feeds branch, closes loop

😬 Atomic result feeds branch, inside loop

🤨 Atomic result feeds branch, outside loop

🙂 Atomic result feeds arithmetic

😁 Atomic result ignored

👶 No atomics

SIMT ATOMIC CONCERN SCALE:
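To ground the scale, an illustrative ladder (my examples, not from the slides; sum and flag are hypothetical device-visible globals):

__device__ int sum;   // hypothetical device-visible counters
__device__ int flag;

__device__ void concern_ladder(int x, int n) {
    int y = x + 1;                      // 👶 No atomics.
    atomicAdd(&sum, 1);                 // 😁 Atomic result ignored.
    int t = 2 * atomicAdd(&sum, 1);     // 🙂 Atomic result feeds arithmetic.
    if(atomicAdd(&sum, 1) == 0)         // 🤨 Atomic result feeds branch, outside loop.
        t += y;
    for(int i = 0; i < n; ++i)
        if(atomicAdd(&sum, 1) & 1)      // 😬 Atomic result feeds branch, inside loop.
            t += i;
    while(atomicExch(&flag, 1) == 1)    // 😱→😎 Atomic result feeds branch, closes loop:
        ;                               // a spin-wait; pre-Volta SIMT could starve a warp
                                        // here, Volta+ guarantees forward progress.
    atomicExch(&flag, 0);               // release the spin
    sum += t;                           // keep t observable in this sketch
}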

Page 9

SIMT FAMILY HISTORY

1966: SIMD.
1970: Time zero.
1984: Pixar CHAP. Scalar channel programs.
2007: Tesla SIMT. Scalar thread programs. Forward progress: nope ☹️
2017: Volta SIMT. Scalar thread programs. Forward progress: YES! 🤘

Source: Wikipedia, SIGGRAPH proceedings, IEEE Micro.

Page 10


APPLICABILITY

Page 11


CONs:
1. Serialization is bad.
2. Critical path / Amdahl's law.
3. Latency is high.

PROs:
1. Algorithmic gains.
2. Latency hiding.
3. Throughput is high.

TL;DR: Sometimes, it's a win.

SYNCHRONIZATION DECISION CHECKLIST

Page 12

Keep local state in registers & shared memory, with synchronization. 20× faster for RNNs.

[Diagram: Grid0<<<>>> → state invalidation → Grid1<<<>>>, vs. a single cooperative grid with a global barrier that keeps state resident.]

APP #1: GPU-RESIDENT METHODS

See Greg Diamos' GTC 2016 talk for more.
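As a sketch of the pattern (mine, not Diamos' code): a persistent kernel keeps its state in registers across timesteps and replaces per-step kernel relaunches with a cooperative-groups grid barrier. step_count and the update line are placeholders, and the kernel must be launched via cudaLaunchCooperativeKernel.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void resident_steps(float* state, int step_count) {
    cg::grid_group grid = cg::this_grid();
    float local = state[grid.thread_rank()];  // state stays resident in a register
    for(int t = 0; t < step_count; ++t) {
        local = 0.99f * local + 0.01f;        // placeholder for one timestep's work
        grid.sync();                          // grid-wide barrier instead of a relaunch
    }
    state[grid.thread_rank()] = local;
}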

Page 13

// *continue* to suspend atomic<> disbelief for now
__host__ __device__ bool lock_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected = -1;
    if(a.compare_exchange_strong(expected, 1, memory_order_relaxed))
        b.store(1, memory_order_relaxed);  // exposed dependent latency: this store waits on the CAS
    return expected == -1;
}

// This version is a ~60% speedup at GPU application level, despite progress hazards.
__host__ __device__ bool starvation_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected_a = -1, expected_b = -1;
    bool success_a = a.compare_exchange_strong(expected_a, 1, memory_order_relaxed),  // overlapped:
         success_b = b.compare_exchange_strong(expected_b, 1, memory_order_relaxed);  // both CASes issue together
    if(success_a)          // Note: we almost always succeed at both.
        while(!success_b)  // <-- This loop makes this a deadlock-free algorithm.
            success_b = b.compare_exchange_strong(expected_b = -1, 1, memory_order_relaxed);
    else if(success_b)
        b.store(-1, memory_order_relaxed);
    return expected_a == -1;
}

The rarely-taken loop changes this algorithm to a different category.

APP #2: LOCK-FREE IS NOT ALWAYS FASTER

Page 14

Even if mutexes hide in every node, GPUs can build tree structures fast. For more, see my CppCon 2018 talk on YouTube, and the 'Parallel Forall' blog post.

[Chart: tree-build throughput, multi-threading (CPU) vs. acceleration (RTX 2070).]

APP #3: CONCURRENT DATA STRUCTURES

Page 15


PRE-REQUISITES

Page 16

Compute_6x → Compute_7x.

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS'11).

[Diagram: concurrent algorithm taxonomy: "Every thread succeeds." "Some thread succeeds." "Eventually run isolated." "No scheduling requirements (any thread scheduler)." "Critical sections eventually complete." App #2 sits in the last category.]

PR #1: FORWARD-PROGRESS

Page 17

ISO C++ 11 | CUDA 9.0–10.2, Volta+ (inline PTX; see PTX 6 chapter 8 for the asm) | CUDA 10.3, Volta+ (the ISO spelling itself; 🎉 later this year! 🎉)

int atomic<int>::load(memory_order_seq_cst):
    asm("fence.sc.sys;");
    asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");

int atomic<int>::load(memory_order_acquire):
    asm("ld.acquire.sys.b32 %0, [%1];" ::: "memory");

int atomic<int>::load(memory_order_relaxed):
    asm("ld.relaxed.sys.b32 %0, [%1];" ::: "memory");
    // or, classic CUDA C++: x = *(volatile int*)ptr;

void atomic<int>::store(int, memory_order_seq_cst):
    asm("fence.sc.sys;");
    asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");

void atomic<int>::store(int, memory_order_release):
    asm("st.release.sys.b32 [%0], %1;" ::: "memory");

void atomic<int>::store(int, memory_order_relaxed):
    asm("st.relaxed.sys.b32 [%0], %1;" ::: "memory");
    // or, classic CUDA C++: *(volatile int*)ptr = x;

int atomic<int>::exchange(int, memory_order_seq_cst):
    asm("fence.sc.sys;");
    asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");

int atomic<int>::exchange(int, memory_order_acq_rel):
    asm("atom.exch.acq_rel.sys.b32 %0, [%1], %2;" ::: "memory");

int atomic<int>::exchange(int, memory_order_release):
    asm("atom.exch.release.sys.b32 %0, [%1], %2;" ::: "memory");

int atomic<int>::exchange(int, memory_order_acquire):
    asm("atom.exch.acquire.sys.b32 %0, [%1], %2;" ::: "memory");

int atomic<int>::exchange(int, memory_order_relaxed):
    asm("atom.exch.relaxed.sys.b32 %0, [%1], %2;" ::: "memory");
    // or, classic CUDA C++: y = atomicExch_system(ptr, x);

And so on...

PR #2: MEMORY CONSISTENCY

Our ASPLOS 2019 paper: https://github.com/NVlabs/ptxmemorymodel.
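As a usage sketch of the CUDA 9.0–10.2 column (the wrapper names are mine, not an official API), here are the acquire load and release store spelled as callable device functions with full operand constraints:

__device__ int load_acquire_sys(const int* ptr) {
    int value;
    // ld.acquire.sys: acquire-ordered load at system scope (Volta+, PTX 6.x).
    asm volatile("ld.acquire.sys.b32 %0, [%1];"
                 : "=r"(value) : "l"(ptr) : "memory");
    return value;
}

__device__ void store_release_sys(int* ptr, int value) {
    // st.release.sys: release-ordered store at system scope.
    asm volatile("st.release.sys.b32 [%0], %1;"
                 :: "l"(ptr), "r"(value) : "memory");
}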

Page 18

Columns: Platform / allocator · Load/store sharing · Atomic (low cont'n) · Atomic (high cont'n).

• Any: ARM/Windows/Mac, unmanaged: Nope. Not at all.
• x86 Linux (CPU/GPU), managed: Yes. Technically… but no.
• x86 Linux (GPU/GPU), managed: YES! TRY IT!
• POWER Linux (all pairs), managed: (this row's cells were lost in transcription)

• Concurrent data sharing between CPU and GPU is a new possibility.

• Real usefulness has some more conditions.

PR #3: TRUE SHARING
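A minimal sketch of what the "managed" rows enable (my example; it assumes a Linux system with concurrent managed access, e.g. Pascal+ with hardware page faulting): both processors hammer one managed word with system-scope atomics.

#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void gpu_increments(int* counter, int n) {
    for(int i = 0; i < n; ++i)
        atomicAdd_system(counter, 1);         // system scope: visible to the CPU, too
}

int main() {
    int* counter;
    cudaMallocManaged(&counter, sizeof(int)); // one pointer, both processors
    *counter = 0;
    gpu_increments<<<1, 32>>>(counter, 1000);
    std::thread cpu([&] {                     // CPU contends on the same word
        for(int i = 0; i < 1000; ++i)
            __sync_fetch_and_add(counter, 1); // GCC/Clang builtin atomic add
    });
    cpu.join();
    cudaDeviceSynchronize();
    std::printf("%d\n", *counter);            // expect 32*1000 + 1000 = 33000
    return 0;
}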

Page 19


PRELIMINARIES

Page 20

// a is a shared atomic<int>, assumed visible to every participating thread.
__host__ __device__ void test(int my_thread, int total_threads, int final_value) {
    for(int old; my_thread < final_value; my_thread += total_threads)
        while(!a.compare_exchange_weak(old = my_thread, my_thread + 1, memory_order_relaxed))
            ;
}
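What this measures: each thread spins until the shared counter equals its rank, bumps it by one, then strides forward by total_threads, so ownership of the cache line hands off on every increment. A sketch of driving it from both processors at once (my harness, not from the slides):

// Hypothetical harness: GPU lanes take ranks [0, G), CPU threads take [G, G+C).
__global__ void test_kernel(int first_rank, int total, int final_value) {
    int rank = first_rank + blockIdx.x * blockDim.x + threadIdx.x;
    test(rank, total, final_value);          // same function, device side
}
// Host side: one std::thread per CPU rank calls test(rank, total, final_value) directly.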

[Chart: latency (seconds, 1e-8–1e-5, log scale) vs. contending threads (1–2048), for V100, POWER, and x86.]

BW = 1/Lat. NUMA is a punishing depressor of CPU performance. The bathtub curve is due to the statistical likelihood of finding a peer in the pipeline. Little's Law finally kicks in.

CONTENTION IS THE ISSUE, DIFFERENTLY.
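The Little's Law remark can be made precise (a standard queueing identity, not slide content): with L requests in flight and mean latency W, throughput is λ = L / W. Once the interconnect saturates at some fixed λ, each added contender must raise W proportionally, which matches the latency climb at high thread counts.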

Page 21

[Left chart: latency (seconds, 1e-9–1e-3, log scale) vs. threads (GPU × CPU, 0–2048). Crushed? About ½ millisecond. Right chart: latency (seconds, 1e-9–1e-3) vs. thread occupancy (1–2048), for V100 and x86.]

CONTENDING PROCESSORS ARE CRUSHED…

Page 22

[Charts: latency (seconds, 1e-9–1e-3, log scale) vs. threads (GPU × CPU, 0–2048): left, x86 + V100 (PCIE); right, POWER + V100 (NVLINK).]

…UNLESS THE PROCESSORS ARE NVLINK’ED.

Page 23

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

ALL OF THE FOLLOWING SLIDES ARE NVLINK'ED.

And not log scale, because it's legible in linear scale now. Thanks.

Page 24


CONTENDED MUTEXES

Page 25


CONTENDED MUTEXES
AS AN EXERCISE TO THINK ABOUT THROUGHPUT AND FAIRNESS

Page 26

struct mutex {
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

CONTENDED EXCHANGE LOCK

🎉 Stay. Keep attending my talk. 🎉

Not awesome.

Page 27

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

Heavy system pressure:
• A lot of requests
• Each request is slow

CONTENDED EXCHANGE LOCK

Page 28

• K bounds forecast relative error (orange line): Latency_response > K_delay × Latency_impulse. Pick an arbitrary K_delay; say 1.5 for 50% error. Some benefit to a stochastic choice, to avoid coupling.

• Ceiling trades bandwidth & maximum error: t_polling / (lat_loaded + lat_backoff) > BW_polling. Pick an arbitrary BW_polling; say 0.5 × BW_contended.

• Floor protects the fast corner (green box): Latency_response > Latency_floor. Minimum CPU sleep (Linux) is ≈ 50000 ns. Minimum sleep on V100 is ≈ 0 ns.

[Impulse-response diagram (seconds vs. seconds, 1e-8 to 1e-4 on both axes): instantaneous-response diagonal, with K, Ceiling, Floor (CPU), and Floor (GPU) marked.]

BACKOFF: LESS PRESSURE VIA FORECASTING

Page 29

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

__host__ __device__ void lock() {
    uint32_t history = 1<<8;                       // 256ns
    while(1 == l.exchange(1, memory_order_acquire)) {
        uint32_t delay = history >> 1;             // 50%
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15))                        // 32us
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(delay));
        else {
            std::this_thread::yield();
            delay = 0;
        }
#endif
        history += (1<<8) + delay;
        if(history > (1<<18))                      // 256us
            history = 1<<18;
    }
}
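To see the growth rate (my arithmetic, read off the code above): each failed attempt sleeps half of history, then adds 256 ns plus the delay just taken, i.e. h ← h + 256 + h/2, roughly ×1.5 per round:

h = 256 ns  → delay 128 ns
h = 640 ns  → delay 320 ns
h = 1216 ns → delay 608 ns
h = 2080 ns → delay 1040 ns
…capped at h = 262144 ns (2^18), so the delay tops out near 131 µs.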

[Chart annotations: FLOOR, CEILING, K.]

We seem to have fixed the slow corner.

CONTENDED EXCHANGE LOCK + BACKOFF

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

Page 30

[Charts: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU); critical sections per second (1e-1–1e6, log scale) vs. threads.]

• Fast because: the lock is disproportionately granted to some threads.
• Slow because: top-level performance often depends on fairness.

Single-thread rate is a strong attractor.

FAST LOCKS, SLOW APPLICATIONS 😱

Page 31

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS'11).

Some thread succeeds.

Critical sections eventually complete.

RECALL: FORWARD-PROGRESS

CONTENDED EXCHANGE LOCK WITH BACKOFF

Page 32

Deadlock-free is suitable when:

1. Contention is very low.

2. Top-level algorithms are resilient to tail effects.

Luckily, this is still pretty common!

WHEN IS DEADLOCK-FREE SUITABLE?

Page 33

struct alignas(128) ticket_mutex {
    __host__ __device__ void lock() {
        auto const my = in.fetch_add(1, memory_order_acquire);
        while(1) {
            auto const now = out.load(memory_order_acquire);
            if(now == my) break;
            auto const delta = my - now;
            auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15)) // 32us
                std::this_thread::sleep_for(
                    std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
    }
    __host__ __device__ void unlock() {
        out.fetch_add(1, memory_order_release);
    }
    atomic<unsigned> in = ATOMIC_VAR_INIT(0);
    atomic<unsigned> out = ATOMIC_VAR_INIT(0);
};
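Why delta forecasts well (my reading of the shift): a thread delta tickets from the head sleeps delta × 256 ns before re-checking, i.e. it budgets ~256 ns per critical section ahead of it. For example, tenth in line sleeps ~2560 ns and wakes near its actual turn, instead of polling continuously in between.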

Don’t need either K or ceiling here, delta is an accurate forecast! ☺

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

TICKET LOCK + PROPORTIONAL BACKOFF

Page 34

[Charts: critical sections per second (1e-1–1e6, log scale) vs. threads; latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU).]

TICKET LOCK + PROPORTIONAL BACKOFF

Page 35

Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS'11).

Critical sections eventually complete.

Every thread succeeds.

AGAIN: FORWARD-PROGRESS

TICKET LOCK

Page 36

This is your default, when deadlock-free is unsuitable.

WHEN IS STARVATION-FREE SUITABLE?

Wish we could use queue locks (e.g. MCS) but we can’t.

These use O(P) storage 😱 and local stack pointers (MCS).

WHAT ELSE IS THERE FOR MUTEXES?

Page 37


BARRIERS

Page 38


BARRIERS
AS A TYPICALLY-GPU THING, AND ALSO TO THINK ABOUT LATENCY

Page 39

// phase_bit is assumed here; it isn't shown in the transcript. A plausible
// definition is the high bit of the packed (phase | arrival-count) word:
// static constexpr uint32_t phase_bit = 1u << 31;

__host__ __device__ void arrive_and_wait() {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(1, memory_order_acq_rel);
    auto current = old + 1;
    if((old & phase_bit) != (current & phase_bit)) {
        // This arrival flipped the phase: re-arm the counter for the next phase.
        phase_arrived.fetch_add(phase_bit - _expected);
    }
    else
        while(1) {
            current = phase_arrived.load(memory_order_acquire);
            if((old & phase_bit) != (current & phase_bit))
                break;  // phase flipped: the barrier has released
            auto const delta = phase_bit - (current & ~phase_bit);
            auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15)) // 32us
                std::this_thread::sleep_for(
                    std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
}
uint32_t const expected = 0;
atomic<uint32_t> phase_arrived = ATOMIC_VAR_INIT(0);

CENTRAL BARRIER + PROPORTIONAL BACKOFF

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

Page 40

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

CENTRAL BARRIER + PROPORTIONAL BACKOFF

• Centralized barrier is bad for the CPU.
• Coherence protocols strongly prefer fancy barrier algorithms: tree, tournament, dissemination…
• Because: BW_contended = 1/Lat_NUMA.
• GPU just hangs on for a while longer.
• But: fancy algorithms introduce high-latency levels of indirection.
• Each indirection needs a 1:100x .. 1:1000x improvement in BW to justify itself.

Page 41

EASY AND EFFECTIVE GPU TREE BARRIER

[Chart: latency (seconds, 1e-7–1e-5) vs. threads (GPU × CPU, 0–2048).]

__host__ __device__ void arrive_and_wait() {
#ifdef __CUDA_ARCH__
    // Each block arrives once, with weight c = number of threads in the block.
    auto const c = __syncthreads_count(1);
    if(threadIdx.x == 0)
        __arrive_and_wait(c);
    __syncthreads();
#else
    __arrive_and_wait();
#endif // __CUDA_ARCH__
}

__host__ __device__ void __arrive_and_wait(int c = 1) {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(c, memory_order_acq_rel);
    auto current = old + c;
    //...
}

• 2nd level of hierarchy is ~free, in blocks.

• Up to 1:1024 bandwidth reduction!

Page 42


"Remember, if you actually need a GPU barrier, then you should use cooperative groups instead."

- My inner CUDA engineer voice.

https://devblogs.nvidia.com/cooperative-groups/

Page 43

WHAT ABOUT CPU-GPU BARRIERS, THOUGH?

• As you can see, a new barrier algorithm is necessary.
• Perhaps partitioned strategies, by processor type? Seriously, I'm asking. Somebody should try it! 😅
• I don't know what it would be for, though. So no rush.

Page 44

For multi-GPU systems:
• You can replicate arrivals to trade atomics vs. polling.
• Not done by CG, but it has been done at NVIDIA.

For a DGX-2 (2.6 million threads):
• You might benefit from a 3rd level of barrier, barely.
• I don't think it's been done at NVIDIA yet.

WHAT ELSE IS THERE FOR BARRIERS?

Page 45


IN SHORT

Page 46

[Recap icons: → Compute_7x; "Critical sections eventually complete."]

1. Contention bandwidth is a major issue for synchronization. See: atomic story.
2. If you use back-offs, keep an eye on fairness. See: mutex story.
3. If you use indirection, the GPU needs a 100..1000x saving. See: barrier story.

USE CASES · PRE-REQS · KEEP IN MIND

Page 47

Should come to the CUDA C++ toolkit this year, in 2019.

A preview is here: https://github.com/ogiroux/freestanding.

My CppCon 2018 talk has more; stream it on YouTube.

CUDA::STD::ATOMIC<T> IS COMING SOON
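For a taste of the spelling (a sketch in the form this eventually shipped as, libcu++'s <cuda/std/atomic>; at the time of this talk only the freestanding preview existed, so treat the details as illustrative):

#include <cuda/std/atomic>

__global__ void bump(cuda::std::atomic<int>* a) {
    // Same ISO C++ surface, usable in device code; cuda::std::atomic defaults
    // to system scope, so CPU threads may observe this concurrently.
    a->fetch_add(1, cuda::std::memory_order_relaxed);
}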

Page 48

Concurrency at this scale has never been easier.

If you have IBM + V100 systems, try new algorithms!

We want to see what you’ll do with them.

EXTREME SHARED-MEMORY CONCURRENCY

Page 49