Prepared 8/8/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
CUDA Lecture 7: CUDA Threads and Atomics
The Problem: how do you do global communication?
Finish a kernel and start a new one: all writes from all threads complete before a kernel finishes.
Would need to decompose kernels into before and after parts.
CUDA Threads and Atomics – Slide 2
Topic 1: Atomics
step1<<<grid1,blk1>>>(…);
// The system ensures that all
// writes from step 1 complete
step2<<<grid2,blk2>>>(…);
Alternately, write to a predefined memory location. Race condition! Updates can be lost.
What is the value of a in thread 0? In thread 1917?
CUDA Threads and Atomics – Slide 3
Race Conditions
// vector[0] was equal to zero
threadID: 0            threadID: 1917
vector[0] += 5;        vector[0] += 1;
…                      …
a = vector[0];         a = vector[0];
Thread 0 could have finished execution before 1917 started, or the other way around, or both could be executing at the same time.
Answer: not defined by the programming model; can be arbitrary.
CUDA Threads and Atomics – Slide 4
Race Conditions (cont.)
// vector[0] was equal to zero
threadID: 0            threadID: 1917
vector[0] += 5;        vector[0] += 1;
…                      …
a = vector[0];         a = vector[0];
CUDA provides atomic operations to deal with this problem. An atomic operation guarantees that only a single thread has access to a piece of memory while the operation completes.
The name atomic comes from the fact that it is uninterruptible.
No dropped data, but ordering is still arbitrary.
CUDA Threads and Atomics – Slide 5
Atomics
CUDA provides atomic operations to deal with this problem. Requires hardware with compute capability 1.1 and above.
Different types of atomic instructions:
Addition/subtraction: atomicAdd, atomicSub
Minimum/maximum: atomicMin, atomicMax
Conditional increment/decrement: atomicInc, atomicDec
Exchange/compare-and-swap: atomicExch, atomicCAS
More types in Fermi: atomicAnd, atomicOr, atomicXor
CUDA Threads and Atomics – Slide 6
Atomics (cont.)
CUDA Threads and Atomics – Slide 7
Example: Histogram
// Determine frequency of colors in a picture
// colors have already been converted into integers
// Each thread looks at one pixel and increments
// a counter atomically
__global__ void histogram(int* colors, int* buckets)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int c = colors[i];
  atomicAdd(&buckets[c], 1);
}
atomicAdd returns the previous value at a certain address. Useful for grabbing variable amounts of data from the list.
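The histogram kernel could be launched from the host along these lines. This is a hedged sketch, not from the slides: the 256-bucket count, the pixel count, the block size, and the omission of error checking are all assumptions made for brevity.

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int num_pixels  = 1 << 20;  // assumed image size
    const int num_buckets = 256;      // assumed number of colors

    int *d_colors, *d_buckets;
    cudaMalloc(&d_colors,  num_pixels  * sizeof(int));
    cudaMalloc(&d_buckets, num_buckets * sizeof(int));

    // Counters must start at zero, since atomicAdd only increments
    cudaMemset(d_buckets, 0, num_buckets * sizeof(int));

    // ... copy pixel color indices into d_colors here ...

    // One thread per pixel, 256 threads per block
    histogram<<<num_pixels / 256, 256>>>(d_colors, d_buckets);

    cudaFree(d_colors);
    cudaFree(d_buckets);
    return 0;
}
```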
CUDA Threads and Atomics – Slide 8
Example: Workqueue
// For algorithms where the amount of work per item
// is highly non-uniform, it often makes sense
// to continuously grab work from a queue
__global__ void workq(int* work_q, unsigned int* q_counter,
                      int* output, unsigned int queue_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  unsigned int q_index = atomicInc(q_counter, queue_max);
  int result = do_work(work_q[q_index]);
  output[i] = result;
}
If compare equals the old value stored at address then val is stored instead.
In either case, the routine returns the value of old.
Seems a bizarre routine at first sight, but it can be very useful for atomic locks.
CUDA Threads and Atomics – Slide 9
Compare and Swap
int atomicCAS(int* address, int compare, int val)
Most general type of atomic. Can emulate all others with CAS.
CUDA Threads and Atomics – Slide 10
Compare and Swap (cont.)
// What atomicCAS does, written out (performed atomically in hardware):
int atomicCAS(int* address, int compare, int val)
{
  int old_reg_val = *address;
  if (old_reg_val == compare)
    *address = val;
  return old_reg_val;
}
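As a sketch of the claim that CAS can emulate the other atomics, here is a hypothetical my_atomic_add built from atomicCAS; the helper name and loop structure are mine, not from the slides.

```cuda
// Hypothetical helper: atomicAdd built out of atomicCAS.
// Retries until no other thread changed *address between the read
// and the swap; returns the previous value, like atomicAdd.
__device__ int my_atomic_add(int* address, int inc)
{
    int old = *address;
    int assumed;
    do {
        assumed = old;
        // Attempt to install assumed + inc; succeeds only if
        // *address still holds assumed.
        old = atomicCAS(address, assumed, assumed + inc);
    } while (old != assumed);
    return old;
}
```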
Atomics are slower than normal load/store. Most of these are associative operations on signed/unsigned integers:
quite fast for data in shared memory
slower for data in device memory
You can have the whole machine queuing on a single location in memory.
Atomics unavailable on G80!
CUDA Threads and Atomics – Slide 11
Atomics
CUDA Threads and Atomics – Slide 12
Example: Global Min/Max (Naïve)
// If you require the maximum across all threads
// in a grid, you could do it with a single global
// maximum value, but it will be VERY slow
__global__ void global_max(int* values, int* gl_max)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  atomicMax(gl_max, val);
}
CUDA Threads and Atomics – Slide 13
Example: Global Min/Max (Better)
// introduce intermediate maximum results, so that
// most threads do not try to update the global max
__global__ void global_max(int* values, int* max,
                           int* reg_max, int num_regions)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  int val = values[i];
  int region = i % num_regions;
  if (atomicMax(&reg_max[region], val) < val)
  {
    atomicMax(max, val);
  }
}
Single value causes serial bottleneck. Create hierarchy of values for more parallelism. Performance will still be slow, so use judiciously.
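One way to build the hierarchy the slide suggests is a per-block maximum in shared memory followed by a single device-memory atomicMax per block. This sketch is mine, not from the slides: the kernel name is hypothetical, and shared-memory atomics require compute capability 1.2 or above.

```cuda
#include <limits.h>

// Hypothetical kernel: each block reduces to one shared-memory maximum,
// then only thread 0 of the block touches the single global location.
__global__ void block_then_global_max(int* values, int* gl_max)
{
    __shared__ int block_max;
    if (threadIdx.x == 0)
        block_max = INT_MIN;
    __syncthreads();

    int i = threadIdx.x + blockDim.x * blockIdx.x;
    atomicMax(&block_max, values[i]);   // shared-memory atomic: fast
    __syncthreads();

    if (threadIdx.x == 0)
        atomicMax(gl_max, block_max);   // one device-memory atomic per block
}
```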
CUDA Threads and Atomics – Slide 14
Global Max/Min
Can’t use normal load/store for inter-thread communication because of race conditions
Use atomic instructions for sparse and/or unpredictable global communication
Decompose data (very limited use of single global sum/max/min/etc.) for more parallelism
CUDA Threads and Atomics – Slide 15
Atomics: Summary
How a streaming multiprocessor (SM) executes threads:
Overview of how a streaming multiprocessor works
SIMT execution
Divergence
CUDA Threads and Atomics – Slide 16
Topic 2: Streaming Multiprocessor Execution and Divergence
Hardware schedules thread blocks onto available SMs.
No guarantee of ordering among thread blocks.
Hardware will schedule a thread block as soon as a previous thread block finishes.
CUDA Threads and Atomics – Slide 17
Scheduling Blocks onto SMs
[Diagram: thread blocks 5, 27, 61, and 2001 assigned to a streaming multiprocessor]
A warp = 32 threads launched together; they usually execute together as well.
CUDA Threads and Atomics – Slide 18
Recall: Warps
[Diagram: many small control+ALU pairs versus one control unit shared by a group of ALUs, as in a warp]
Each thread block is mapped to one or more warps
The hardware schedules each warp independently
CUDA Threads and Atomics – Slide 19
Mapping of Thread Blocks
Thread Block N (128 threads)
[Diagram: Thread Block N maps to four warps: TB N W1, TB N W2, TB N W3, TB N W4]
SM implements zero-overhead warp scheduling.
At any time, only one of the warps is executed by the SM.
Warps whose next instruction has its inputs ready for consumption are eligible for execution.
Eligible warps are selected for execution on a prioritized scheduling policy.
All threads in a warp execute the same instruction when selected.
CUDA Threads and Atomics – Slide 20
Thread Scheduling Example
TB = Thread Block, W = Warp
[Timeline diagram: warps from TB1, TB2, and TB3 are interleaved on the SM over time; whenever a warp stalls (TB1 W1, TB3 W2, and TB2 W1 are shown stalling), another eligible warp is scheduled in its place, so the SM always has instructions to issue]
Threads are executed in warps of 32, with all threads in the warp executing the same instruction at the same time
What happens if you have the following code?
CUDA Threads and Atomics – Slide 21
Control Flow Divergence
if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}
This is called warp divergence – CUDA will generate correct code to handle this, but to understand the performance you need to understand what CUDA does with it
CUDA Threads and Atomics – Slide 22
Control Flow Divergence (cont.)
if (foo(threadIdx.x))
{
  do_A();
}
else
{
  do_B();
}
CUDA Threads and Atomics – Slide 23
Control Flow Divergence (cont.)
From Fung et al., MICRO ’07
[Diagram: at a branch the warp serializes; the threads taking Path A execute while the Path B threads are masked off, then the Path B threads execute while the Path A threads are masked off]
Nested branches are handled as well
CUDA Threads and Atomics – Slide 24
Control Flow Divergence (cont.)
if (foo(threadIdx.x))
{
  if (bar(threadIdx.x))
    do_A();
  else
    do_B();
}
else
  do_C();
CUDA Threads and Atomics – Slide 25
Control Flow Divergence (cont.)
[Diagram: nested branches serialize the same way; the warp executes Path A, Path B, and Path C in turn, with the appropriate threads masked off at each step]
You don’t have to worry about divergence for correctness. (Mostly true, except corner cases, for example intra-warp locks.)
You might have to think about it for performance. Depends on your branch conditions.
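One illustration of “depends on your branch conditions”: a condition that is uniform across each 32-thread warp diverges between warps but never within one, so no serialization occurs. This sketch is mine, reusing the slides’ hypothetical do_A/do_B.

```cuda
// Every thread in a warp computes the same value of (threadIdx.x / 32),
// so the whole warp takes the same side of the branch: no divergence.
if ((threadIdx.x / 32) % 2 == 0)
    do_A();
else
    do_B();
```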
CUDA Threads and Atomics – Slide 26
Control Flow Divergence (cont.)
One solution: NVIDIA GPUs have predicated instructions which are carried out only if a logical flag is true.
In the previous example, all threads compute the logical predicate and execute the two predicated instructions.
CUDA Threads and Atomics – Slide 27
Control Flow Divergence (cont.)
 p = (foo(threadIdx.x));
 p: do_A();
!p: do_B();
Performance drops off with the degree of divergence
CUDA Threads and Atomics – Slide 28
Control Flow Divergence (cont.)
[Graph: performance versus degree of divergence; performance drops steadily as divergence increases]
switch (threadIdx.x % N)
{
  case 0: ...
  case 1: ...
}
Performance drops off with the degree of divergence. In the worst case, effectively lose a factor of 32 in performance if one thread needs the expensive branch while the rest do nothing.
Another example: processing a long list of elements where, depending on run-time values, a few require very expensive processing. GPU implementation: first process the list to build two sub-lists of “simple” and “expensive” elements, then process the two sub-lists separately.
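The two-sub-list idea can itself be sketched with the atomics from Topic 1. Everything here (the kernel name and the is_expensive predicate) is hypothetical, not from the slides.

```cuda
// Hypothetical kernel: partition items into "simple" and "expensive"
// sub-lists using atomic counters, so each sub-list can later be
// processed by its own kernel with little divergence.
__global__ void split(const int* items, int n,
                      int* simple,    int* n_simple,
                      int* expensive, int* n_expensive)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= n) return;

    int v = items[i];
    if (is_expensive(v))   // assumed device predicate
        expensive[atomicAdd(n_expensive, 1)] = v;
    else
        simple[atomicAdd(n_simple, 1)] = v;
}
```

Note that the atomic counters make the order within each sub-list arbitrary, which is fine when the per-element work is independent.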
CUDA Threads and Atomics – Slide 29
Control Flow Divergence (cont.)
Already introduced __syncthreads(); which forms a barrier – all threads wait until every one has reached this point.
When writing conditional code, must be careful to make sure that all threads do reach the __syncthreads();
Otherwise, can end up in deadlock
CUDA Threads and Atomics – Slide 30
Synchronization
Fermi supports some new synchronization instructions which are similar to __syncthreads() but have extra capabilities:
int __syncthreads_count(predicate)
  counts how many predicates are true
int __syncthreads_and(predicate)
  returns non-zero (true) if all predicates are true
int __syncthreads_or(predicate)
  returns non-zero (true) if any predicate is true
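A possible use of __syncthreads_count, sketched under my own assumptions (the kernel name and data are hypothetical):

```cuda
// Hypothetical kernel: every thread in the block learns how many
// threads in the block saw a negative value; __syncthreads_count
// both synchronizes the block and reduces the predicate.
__global__ void count_negatives(const float* data, int* per_block_counts)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int n_neg = __syncthreads_count(data[i] < 0.0f);
    if (threadIdx.x == 0)
        per_block_counts[blockIdx.x] = n_neg;
}
```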
CUDA Threads and Atomics – Slide 31
Synchronization (cont.)
There are similar warp voting instructions which operate at the level of a warp:
int __all(predicate)
  returns non-zero (true) if all predicates in the warp are true
int __any(predicate)
  returns non-zero (true) if any predicate is true
unsigned int __ballot(predicate)
  sets the nth bit based on the nth predicate
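A sketch of warp voting used to avoid divergence: when the whole warp agrees, take the uniform path. This is my own illustration; fast_path and slow_path are hypothetical device functions.

```cuda
// If every thread in the warp has a non-negative input, the warp takes
// the fast path together and never diverges; otherwise fall back to
// the per-thread branch.
__device__ float process_one(float x)
{
    if (__all(x >= 0.0f))
        return fast_path(x);                 // uniform across the warp
    else
        return (x >= 0.0f) ? fast_path(x)    // divergent fallback
                           : slow_path(x);
}
```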
CUDA Threads and Atomics – Slide 32
Warp voting
Use very judiciously.
Always include a max_iter in your spinloop!
Decompose your data and your locks.
CUDA Threads and Atomics – Slide 33
Topic 3: Locks
Problem: when a thread writes data to device memory the order of completion is not guaranteed, so global writes may not have completed by the time the lock is unlocked.
CUDA Threads and Atomics – Slide 34
Example: Global atomic lock
// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  lock = 0;
}
CUDA Threads and Atomics – Slide 35
Example: Global atomic lock (cont.)
// global variable: 0 unlocked, 1 locked
__device__ int lock = 0;

__global__ void kernel(...)
{
  ...
  // set lock
  do {} while (atomicCAS(&lock, 0, 1));
  ...
  // free lock
  __threadfence(); // wait for writes to finish
  lock = 0;
}
__threadfence_block();
  wait until all global and shared memory writes are visible to all threads in the block
__threadfence();
  wait until all global and shared memory writes are visible to all threads in the block (or to all threads, for global data)
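Combining the lock with the max_iter advice from the earlier locks slide gives something like the following. This is my own sketch, not from the slides: the bound, the kernel, and the critical section are assumptions, and note that many threads of one warp contending for a single lock can still livelock (the intra-warp corner case mentioned under divergence).

```cuda
__device__ int lock = 0;

// Hypothetical kernel: spin on the lock, but give up after max_iter
// attempts so a scheduling corner case cannot hang the kernel forever.
__global__ void kernel_with_bounded_lock(int* data)
{
    const int max_iter = 10000;   // assumed bound
    int iter = 0;
    while (atomicCAS(&lock, 0, 1) != 0 && ++iter < max_iter) {}

    if (iter < max_iter) {
        data[0] += 1;        // critical section (assumed)
        __threadfence();     // make the write visible before releasing
        lock = 0;            // free lock
    }
}
```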
CUDA Threads and Atomics – Slide 36
Example: Global atomic lock (cont.)
Lots of esoteric capabilities – don’t worry about most of them.
Essential to understand warp divergence – can have a very big impact on performance.
__syncthreads(); is vital.
The rest can be ignored until you have a critical need – then read the documentation carefully and look for examples in the SDK.
CUDA Threads and Atomics – Slide 37
Summary
Based on original material from
Oxford University: Mike Giles
Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 8/8/2011.
CUDA Threads and Atomics – Slide 38
End Credits