[Slide 1]
Olivier Giroux, Distinguished Architect, Chair of ISO C++ Concurrency & Parallelism.
SYNCHRONIZATION IS BAD, BUT IF YOU MUST… (S9329)
[Slide 2]
My coordinates
[Venn diagram: the ISO C++ users community, the WG21 memory model group, and NVIDIA GPU engineers & architects; the speaker sits at the intersection]
[Slide 3]
🛑 cudaDeviceSynchronize()
🛑 __syncthreads()
🛑 __shfl_sync()
✅ Using atomics to do blocking synchronization.
WHAT THIS TALK IS ABOUT:
[Slide 4]
[Timeline: threads T0–T6, each blocked nearly all the time; T6 muses “All blocked and no play makes T6 a dull thread…”]
PSA: DON’T RUN SERIAL CODE IN THREADS
[Slide 5]
[Timeline: threads T0–T6, each blocked only rarely]
PSA: RARE CONTENTION IS FINE
[Slide 6]
```cpp
struct mutex { // suspend atomic<> disbelief for now
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};
```
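A minimal usage sketch (mine, not from the slides), assuming this `mutex` and an `atomic<int>` that works in device code, e.g. the cuda::std::atomic preview mentioned at the end of the talk; `add_one` and `counter` are illustrative names:

```cpp
// Hypothetical example: guard a plain counter with the exchange lock above.
__global__ void add_one(mutex* m, int* counter) {
    m->lock();       // blocking acquire: spins on exchange()
    *counter += 1;   // critical section: an ordinary, non-atomic update
    m->unlock();     // release: publishes the update to the next owner
}
```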
🎉 Thanks for attending my talk. 🎉
Awesome.
[Chart: critical sections per second (1e6–1e9) vs. thread occupancy (1–2048), V100 vs. CPU]
UNCONTENDED EXCHANGE LOCK
[Slide 7]
Deadlock.
🐰🕳
Deadlock.
Deadlock.
Deadlock.
[Slide 8]
😎 Atomic result feeds branch, closes loop, Volta+
😱 Atomic result feeds branch, closes loop
😬 Atomic result feeds branch, inside loop
🤨 Atomic result feeds branch, outside loop
🙂 Atomic result feeds arithmetic
😁 Atomic result ignored
👶 No atomics
SIMT ATOMIC CONCERN SCALE:
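To make the scary rungs concrete, here is an illustrative spin of my own (not from the slides) where the atomic’s result feeds the branch that closes the loop, the 😱 case that Volta’s scheduling turns into 😎; `spin_lock` is a hypothetical name, `atomicExch` is the classic CUDA intrinsic:

```cpp
// Pre-Volta hazard sketch: if one lane of a warp holds the lock while a
// sibling lane spins here, the scheduler was permitted to never run the
// holder again, so the exit condition might never become true (livelock).
__device__ void spin_lock(int* l) {
    while (atomicExch(l, 1) == 1)  // atomic result feeds branch, closes loop
        ;                          // Volta+ independent thread scheduling
                                   // guarantees the holder eventually runs
}
```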
[Slide 9]
SIMT FAMILY HISTORY
1966: SIMD.
1970: Time zero.
1984: Pixar CHAP: scalar channel programs.
2007: Tesla SIMT: scalar thread programs. Forward-progress = Nope ☹️
2017: Volta SIMT: scalar thread programs. Forward-progress = YES! 🤘
Source: Wikipedia, SIGGRAPH proceedings, IEEE Micro.
[Slide 10]
APPLICABILITY
[Slide 11]
CONs:
1. Serialization is bad.
2. Critical path / Amdahl’s law.
3. Latency is high.

PROs:
1. Algorithmic gains.
2. Latency hiding.
3. Throughput is high.

TL;DR: Sometimes, it’s a win.
SYNCHRONIZATION DECISION CHECKLIST
[Slide 12]
Keep local state in registers & shared memory, with synchronization: 20x faster for RNNs.
[Diagram: launching Grid0<<<>>> then Grid1<<<>>> invalidates state between kernels; a cooperative grid with a global barrier keeps state resident]
APP #1: GPU-RESIDENT METHODS
See Greg Diamos’ GTC 2016 talk for more. A sketch of the pattern follows.
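A minimal sketch of the cooperative-grid pattern, using the CUDA cooperative groups API; the kernel body, `num_steps`, and all names are illustrative, not Diamos’ code:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// GPU-resident loop: one cooperative kernel iterates timesteps internally,
// meeting at a grid-wide barrier instead of being relaunched per step.
__global__ void gpu_resident_loop(float* state, int num_steps) {
    cg::grid_group grid = cg::this_grid();
    for (int step = 0; step < num_steps; ++step) {
        // ... per-timestep work on `state`, kept hot in registers/shared memory ...
        grid.sync();  // global barrier: every block finishes step N before N+1
    }
}
// Note: must be launched with cudaLaunchCooperativeKernel for grid.sync() to be valid.
```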
[Slide 13]
```cpp
// *continue* to suspend atomic<> disbelief for now
__host__ __device__ bool lock_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected = -1;
    if(a.compare_exchange_strong(expected, 1, memory_order_relaxed))
        b.store(1, memory_order_relaxed);
    return expected == -1;
}

// This version is a ~60% speedup at GPU application level, despite progress hazards.
__host__ __device__ bool starvation_free_writer_version(atomic<int>& a, atomic<int>& b) {
    int expected_a = -1,
        expected_b = -1;
    bool success_a = a.compare_exchange_strong(expected_a, 1, memory_order_relaxed),
         success_b = b.compare_exchange_strong(expected_b, 1, memory_order_relaxed);
    if(success_a)            // Note: we almost always succeed at both.
        while(!success_b)    // <-- This loop makes this a deadlock-free algorithm.
            success_b = b.compare_exchange_strong(expected_b = -1, 1, memory_order_relaxed);
    else if(success_b)
        b.store(-1, memory_order_relaxed);
    return expected_a == -1;
}
```
In the lock-free version, the store to b depends on the CAS on a: exposed dependent latency. In the starvation-free version, both CASes issue together: overlapped. The rarely-taken loop changes this algorithm to a different category.
APP #2: LOCK-FREE IS NOT ALWAYS FASTER
[Slide 14]
Even if mutexes hide in every node, GPUs can build tree structures fast. For more, see my CppCon 2018 talk on YouTube, and the ‘Parallel Forall’ blog post.
[Chart: multi-threading (CPU) vs. acceleration (RTX 2070)]
APP #3: CONCURRENT DATA STRUCTURES
[Slide 15]
PRE-REQUISITES
[Slide 16]
Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th International Conference on Principles of Distributed Systems (OPODIS’11).

Concurrent algorithm taxonomy:
• No scheduling requirements (any thread scheduler): every thread succeeds (wait-free), or some thread succeeds (lock-free; App #2).
• Eventually run isolated: obstruction-free.
• Critical sections eventually complete (blocking): every thread succeeds (starvation-free), or some thread succeeds (deadlock-free).
Compute_6x → Compute_7x.
PR #1: FORWARD-PROGRESS
[Slide 17]
| ISO C++11 | CUDA 9.0–10.2, Volta+ (PTX asm) | CUDA 10.3, Volta+ |
| --- | --- | --- |
| `int atomic<int>::load(memory_order_seq_cst)` | `asm("fence.sc.sys;"); asm("ld.acquire.sys.b32 %0, [%1];":::memory);` | `int atomic<int>::load(memory_order_seq_cst)` |
| `int atomic<int>::load(memory_order_acquire)` | `asm("ld.acquire.sys.b32 %0, [%1];":::memory);` | `int atomic<int>::load(memory_order_acquire)` |
| `int atomic<int>::load(memory_order_relaxed)` | `asm("ld.relaxed.sys.b32 %0, [%1];":::memory);` OR `x = *(volatile int*)ptr;` | `int atomic<int>::load(memory_order_relaxed)` |
| `void atomic<int>::store(int, memory_order_seq_cst)` | `asm("fence.sc.sys;"); asm("st.relaxed.sys.b32 [%0], %1;":::memory);` | `void atomic<int>::store(int, memory_order_seq_cst)` |
| `void atomic<int>::store(int, memory_order_release)` | `asm("st.release.sys.b32 [%0], %1;":::memory);` | `void atomic<int>::store(int, memory_order_release)` |
| `void atomic<int>::store(int, memory_order_relaxed)` | `asm("st.relaxed.sys.b32 [%0], %1;":::memory);` OR `*(volatile int*)ptr = x;` | `void atomic<int>::store(int, memory_order_relaxed)` |
| `int atomic<int>::exchange(int, memory_order_seq_cst)` | `asm("fence.sc.sys;"); asm("atom.exch.acquire.sys.b32 %0, [%1], %2;":::memory);` | `int atomic<int>::exchange(int, memory_order_seq_cst)` |
| `int atomic<int>::exchange(int, memory_order_acq_rel)` | `asm("atom.exch.acq_rel.sys.b32 %0, [%1], %2;":::memory);` | `int atomic<int>::exchange(int, memory_order_acq_rel)` |
| `int atomic<int>::exchange(int, memory_order_release)` | `asm("atom.exch.release.sys.b32 %0, [%1], %2;":::memory);` | `int atomic<int>::exchange(int, memory_order_release)` |
| `int atomic<int>::exchange(int, memory_order_acquire)` | `asm("atom.exch.acquire.sys.b32 %0, [%1], %2;":::memory);` | `int atomic<int>::exchange(int, memory_order_acquire)` |
| `int atomic<int>::exchange(int, memory_order_relaxed)` | `asm("atom.exch.relaxed.sys.b32 %0, [%1], %2;":::memory);` OR `y = atomicExch_system(ptr, x);` | `int atomic<int>::exchange(int, memory_order_relaxed)` |

And so on... The middle column is classic CUDA C++; see PTX 6 chapter 8 for the asm. The right column: 🎉 later this year! 🎉
PR #2: MEMORY CONSISTENCY
Our ASPLOS 2019 paper: https://github.com/NVlabs/ptxmemorymodel.
[Slide 18]
| Platform / allocator | Load/store sharing · atomics (low contention) · atomics (high contention) |
| --- | --- |
| Any: ARM / Windows / Mac / unmanaged | Nope. Not at all. |
| x86 Linux (CPU/GPU), managed | Yes. Technically… but no. |
| x86 Linux (GPU/GPU), managed | YES! TRY IT! |
| POWER Linux (all pairs), managed | YES! TRY IT! |
• Concurrent data sharing between CPU and GPU is a new possibility.
• Real usefulness has some more conditions.
PR #3: TRUE SHARING
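A hypothetical allocation helper for this kind of sharing (my sketch; `make_shared_counter` is an invented name, and it assumes an `atomic<int>` usable from both host and device, as in the cuda::std::atomic preview):

```cpp
#include <cuda_runtime.h>
#include <new>

// Place an atomic<int> in managed memory so CPU and GPU threads can share it.
atomic<int>* make_shared_counter() {
    atomic<int>* p = nullptr;
    cudaMallocManaged(&p, sizeof(atomic<int>));  // visible to CPU and GPU
    new (p) atomic<int>(0);                      // construct in place
    return p;
}
```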
[Slide 19]
PRELIMINARIES
[Slide 20]
```cpp
// Microbenchmark: threads increment the shared atomic<int> `a` (declared
// elsewhere) in round-robin order, each waiting for its own turn.
__host__ __device__ void test(int my_thread, int total_threads, int final_value) {
    for(int old; my_thread < final_value; my_thread += total_threads)
        while(!a.compare_exchange_weak(old = my_thread, my_thread + 1, memory_order_relaxed))
            ;
}
```
[Chart: latency (seconds) vs. contending threads (1–2048): V100, POWER, x86]
BW = 1/Lat_NUMA is a punishing depressor of CPU perf. The bathtub curve is due to the statistical likelihood of finding a peer in the pipeline. Little’s Law finally kicks in.
CONTENTION IS THE ISSUE, DIFFERENTLY.
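A quick gloss of that last point (standard queueing identity, my arithmetic): Little’s Law ties these curves together once one atomic location serializes all contenders:

$$ N = \lambda W \quad\Rightarrow\quad \lambda_{\max} \approx \frac{1}{\mathit{Lat}_{\mathrm{NUMA}}} \quad\Rightarrow\quad W \approx N \cdot \mathit{Lat}_{\mathrm{NUMA}} $$

with $N$ contending threads, $\lambda$ completed operations per second, and $W$ the observed per-operation latency; past saturation, latency grows linearly with thread count.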
[Slide 21]
[Chart: latency (seconds) vs. threads (GPU x CPU). Annotation: crushed? ½ millisecond]
[Chart: latency (seconds) vs. thread occupancy (1–2048): V100, x86]
CONTENDING PROCESSORS ARE CRUSHED…
[Slide 22]
[Charts: latency (seconds) vs. threads (GPU x CPU): x86 + V100 (PCIe) beside POWER + V100 (NVLink)]
…UNLESS THE PROCESSORS ARE NVLINK’ED.
[Slide 23]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
ALL OF THE FOLLOWING SLIDES ARE NVLINK’ED.
And not log scale, because it’s legible in linear scale now. Thanks.
[Slide 24]
CONTENDED MUTEXES
[Slide 25]
CONTENDED MUTEXES, AS AN EXERCISE TO THINK ABOUT THROUGHPUT AND FAIRNESS
[Slide 26]
```cpp
struct mutex {
    __host__ __device__ void lock() {
        while(1 == l.exchange(1, memory_order_acquire))
            ;
    }
    __host__ __device__ void unlock() {
        l.store(0, memory_order_release);
    }
    atomic<int> l = ATOMIC_VAR_INIT(0);
};
```
[Chart: latency (seconds) vs. threads (GPU x CPU)]
CONTENDED EXCHANGE LOCK
🎉 Stay. Keep attending my talk. 🎉
Not awesome.
[Slide 27]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
Heavy system pressure:
• A lot of requests.
• Each request is slow.
CONTENDED EXCHANGE LOCK
[Slide 28]
• K bounds forecast relative error (orange line): Latency_response > K_delay × Latency_impulse. Pick an arbitrary K_delay; say 1.5 for 50% error (e.g. a 1 µs impulse bounds the response to 1.5 µs). Some benefit to a stochastic choice, to avoid coupling.
• The ceiling trades bandwidth & maximum error: t_polling / (Lat_loaded + Lat_backoff) > BW_polling. Pick an arbitrary BW_polling; say 0.5 × BW_contended.
• The floor protects the fast corner (green box): Latency_response > Latency_floor. Minimum CPU sleep (Linux) is ~50,000 ns. Minimum sleep on V100 is ~0 ns.
[Impulse-response diagram (seconds vs. seconds, 1e-8 to 1e-4): instantaneous-response line, K, ceiling, floor (CPU), floor (GPU)]
BACKOFF: LESS PRESSURE VIA FORECASTING
[Slide 29]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
```cpp
__host__ __device__ void lock() {
    uint32_t history = 1<<8;                 // 256ns
    while(1 == l.exchange(1, memory_order_acquire)) {
        uint32_t delay = history >> 1;       // 50%
#ifdef __CUDA_ARCH__
        __nanosleep(delay);
#else
        if(delay > (1<<15))                  // 32us
            std::this_thread::sleep_for(
                std::chrono::nanoseconds(delay));
        else {
            std::this_thread::yield();
            delay = 0;
        }
#endif
        history += (1<<8) + delay;
        if(history > (1<<18))                // 256us
            history = 1<<18;
    }
}
```
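The same lock() body runs on both processors; only the waiting primitive differs. A tiny illustrative driver (my names, assuming the mutex from slide 26 with this backoff lock()):

```cpp
// Both sides contend on the same managed-memory mutex; the GPU backs off
// with __nanosleep, the CPU with sleep_for/yield.
__global__ void gpu_work(mutex* m, int* x) { m->lock(); ++*x; m->unlock(); }
void            cpu_work(mutex* m, int* x) { m->lock(); ++*x; m->unlock(); }
```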
FLOOR: the yield / 0 ns fast path. CEILING: the 256 µs clamp on history. K: the 50% margin (delay = history >> 1).
We seem to have fixed the slow corner.
CONTENDED EXCHANGE LOCK + BACKOFF
[Chart: latency (seconds) vs. threads (GPU x CPU)]
[Slide 30]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
[Chart: critical sections (1e-1 to 1e6) vs. threads]
• Fast because: the lock is disproportionately granted to some threads.
• Slow because: top-level performance often depends on fairness.
Single-thread rate is a strong attractor.
FAST LOCKS, SLOW APPLICATIONS 😱
[Slide 31]
Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th
international conference on Principles of Distributed Systems (OPODIS'11)
Some thread succeeds.
Critical sections eventually complete.
RECALL : FORWARD-PROGRESS
CONTENDED EXCHANGE LOCK WITH BACKOFF
[Slide 32]
When you have:
1. Very low contention.
2. Top-level algorithms resilient to tail effects.
Luckily, this is still pretty common!
WHEN IS DEADLOCK-FREE SUITABLE?
[Slide 33]
```cpp
struct alignas(128) ticket_mutex {
    __host__ __device__ void lock() {
        auto const my = in.fetch_add(1, memory_order_acquire);
        while(1) {
            auto const now = out.load(memory_order_acquire);
            if(now == my)
                break;
            auto const delta = my - now;
            auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15)) // 32us
                std::this_thread::sleep_for(
                    std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
    }
    __host__ __device__ void unlock() {
        out.fetch_add(1, memory_order_release);
    }
    atomic<unsigned> in  = ATOMIC_VAR_INIT(0);
    atomic<unsigned> out = ATOMIC_VAR_INIT(0);
};
```
Don’t need either K or ceiling here, delta is an accurate forecast! ☺
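A quick check of that claim (my arithmetic, using the code’s 256 ns unit):

$$ \Delta = \mathit{my} - \mathit{now}, \qquad \mathit{delay} = \Delta \times 256\,\mathrm{ns} $$

a thread $\Delta$ positions from the head of the queue sleeps almost exactly until its turn, assuming critical sections on the order of 256 ns each; the forecast needs no error bound (K) and no ceiling.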
[Chart: latency (seconds) vs. threads (GPU x CPU)]
TICKET LOCK + PROPORTIONAL BACKOFF
[Slide 34]
[Chart: critical sections (1e-1 to 1e6) vs. threads]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
TICKET LOCK + PROPORTIONAL BACKOFF
[Slide 35]
Maurice Herlihy and Nir Shavit. 2011. On the nature of progress. In Proceedings of the 15th
international conference on Principles of Distributed Systems (OPODIS'11)
Critical sections eventually complete.
AGAIN: FORWARD-PROGRESS
TICKET LOCK
Every thread succeeds.
[Slide 36]
This is your default, when deadlock-free is unsuitable.
WHEN IS STARVATION-FREE SUITABLE?
Wish we could use queue locks (e.g. MCS), but we can’t:
these use O(P) storage 😱 and local stack pointers (MCS); a sketch follows below.
WHAT ELSE IS THERE FOR MUTEXES?
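For contrast, an illustrative MCS-style queue lock (my sketch, not from the talk) showing both objections: each contending thread supplies its own node, so P threads need O(P) storage, and the node normally lives on the caller’s stack:

```cpp
// Host-side illustration; the node-per-thread design is exactly what makes
// this style awkward for GPU threads.
struct mcs_node {
    atomic<mcs_node*> next{nullptr};
    atomic<int>       locked{0};
};
struct mcs_mutex {
    atomic<mcs_node*> tail{nullptr};
    void lock(mcs_node* me) {                 // 'me' is caller-provided: O(P) total
        me->next.store(nullptr, memory_order_relaxed);
        me->locked.store(1, memory_order_relaxed);
        mcs_node* prev = tail.exchange(me, memory_order_acq_rel);
        if(!prev)
            return;                           // queue was empty: acquired immediately
        prev->next.store(me, memory_order_release);
        while(me->locked.load(memory_order_acquire))
            ;                                 // spin on our own node only
    }
    void unlock(mcs_node* me) {
        mcs_node* succ = me->next.load(memory_order_acquire);
        if(!succ) {
            mcs_node* expected = me;          // no visible successor: try to close
            if(tail.compare_exchange_strong(expected, nullptr, memory_order_acq_rel))
                return;
            while(!(succ = me->next.load(memory_order_acquire)))
                ;                             // successor is mid-link; wait for it
        }
        succ->locked.store(0, memory_order_release);
    }
};
```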
[Slide 37]
BARRIERS
[Slide 38]
BARRIERS, AS A TYPICALLY-GPU THING, AND ALSO TO THINK ABOUT LATENCY
[Slide 39]
```cpp
// phase_bit is assumed to be a high flag bit, e.g. 1u << 31, that encodes
// the barrier phase; its definition is not shown on the slide.
__host__ __device__ void arrive_and_wait() {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(1, memory_order_acq_rel);
    auto current = old + 1;
    if((old & phase_bit) != (current & phase_bit)) {
        phase_arrived.fetch_add(phase_bit - _expected);
    }
    else
        while(1) {
            current = phase_arrived.load(memory_order_acquire);
            if((old & phase_bit) != (current & phase_bit))
                break;
            auto const delta = phase_bit - (current & ~phase_bit);
            auto const delay = (delta << 8); // * 256
#ifdef __CUDA_ARCH__
            __nanosleep(delay);
#else
            if(delay > (1<<15)) // 32us
                std::this_thread::sleep_for(
                    std::chrono::nanoseconds(delay));
            else
                std::this_thread::yield();
#endif
        }
}
uint32_t const expected = 0;
atomic<uint32_t> phase_arrived = ATOMIC_VAR_INIT(0);
```
CENTRAL BARRIER + PROPORTIONAL BACKOFF
[Chart: latency (seconds) vs. threads (GPU x CPU)]
[Slide 40]
[Chart: latency (seconds) vs. threads (GPU x CPU)]
CENTRAL BARRIER + PROPORTIONAL BACKOFF
• Centralized barrier is bad for the CPU. Coherence protocols strongly prefer fancy barrier algorithms: tree, tournament, dissemination…
• Because: BW_contended = 1/Lat_NUMA.
• The GPU just hangs on for a while longer.
• But: fancy algorithms introduce high-latency levels of indirection.
• Each indirection needs a 1:100x .. 1:1000x improvement in BW to justify itself.
[Slide 41]
EASY AND EFFECTIVE GPU TREE BARRIER
[Chart: latency (seconds) vs. threads (GPU x CPU)]
```cpp
__host__ __device__ void arrive_and_wait() {
#ifdef __CUDA_ARCH__
    auto const c = __syncthreads_count(1); // count the block’s arrivals locally
    if(threadIdx.x == 0)
        __arrive_and_wait(c);              // one thread posts the whole block
    __syncthreads();
#else
    __arrive_and_wait();
#endif // __CUDA_ARCH__
}

__host__ __device__ void __arrive_and_wait(int c = 1) {
    auto const _expected = expected;
    auto const old = phase_arrived.fetch_add(c, memory_order_acq_rel);
    auto current = old + c;
    //...
}
```
• 2nd level of hierarchy is ~free, in blocks.
• Up to 1:1024 bandwidth reduction!
[Slide 42]
“Remember, if you actually need
a GPU barrier, then you should
use cooperative groups instead.”
- My inner CUDA engineer voice.
https://devblogs.nvidia.com/cooperative-groups/
[Slide 43]
WHAT ABOUT CPU-GPU BARRIERS, THOUGH?
• As you can see, a new barrier algorithm is necessary.
• Perhaps partitioned strategies, by processor type?
Seriously, I’m asking. Somebody should try it! 😅
• I don’t know what it would be for, though. So no rush.
[Slide 44]
For multi-GPU systems:
• You can replicate arrivals to trade atomics vs. polling.
• Not done by CG, but it has been done at NVIDIA.
For a DGX-2 (2.6 million threads):
• You might benefit from a 3rd level of barrier, barely.
• I don’t think it’s been done at NVIDIA yet.
WHAT ELSE IS THERE FOR BARRIERS?
[Slide 45]
IN SHORT
[Slide 46]
→ Compute_7x. Critical sections eventually complete.
1. Contention bandwidth is a major issue for synchronization.See: atomic story.
2. If you use back-offs, keep an eye on fairness.See: mutex story.
3. If you use indirection, the GPU needs a 100..1000x saving.See: barrier story.
USE CASES PRE-REQS KEEP IN MIND
[Slide 47]
Should come to the CUDA C++ toolkit this year, in 2019.
A preview is here: https://github.com/ogiroux/freestanding.
My CppCon 2018 talk has more, stream it on YouTube.
CUDA::STD::ATOMIC<T> IS COMING SOON
[Slide 48]
Concurrency at this scale has never been easier.
If you have IBM + V100 systems, try new algorithms!
We want to see what you’ll do with them.
EXTREME SHARED-MEMORY CONCURRENCY