Advanced CUDA 02
TRANSCRIPT
-
7/30/2019 Advanced CUDA 02
1/49
NVIDIA Corporation 2010
Advanced CUDA
-
Agenda
System optimizations
Transfers between host and device
Kernel optimizations
Measuring performance: effective bandwidth
Coalescing
Shared memory
Constant memory
Textures
-
System Optimizations: Host Device Transfers
-
PCIe Performance
Device-to-host bandwidth is much lower than device-to-device (global memory) bandwidth
PCIe x16 Gen 2: 8 GB/s
C1060: 102 GB/s
Minimize transfers
Intermediate data can be allocated, operated on, and deallocated without ever copying to host memory
Group transfers together
One large transfer is better than many small transfers
-
Pinned (non-pageable) Memory
Pinned memory enables:
faster PCIe copies (~2x throughput on FSB systems)
memcopies asynchronous with CPU
memcopies asynchronous with GPU
Allocation
cudaHostAlloc() / cudaFreeHost()
i.e., instead of malloc / free
Implication:
Pinned memory is essentially removed from host virtual memory
-
Asynchronous API and Streams
Default API:
Kernel launches are asynchronous with CPU
Memcopies (D2H, H2D) block CPU thread
CUDA calls are serialized by the driver
Streams and async functions provide:
Memcopies (D2H, H2D) asynchronous with CPU
Ability to concurrently execute a kernel and a memcopy
Stream = sequence of operations that execute in issue-order on the GPU
Operations from different streams can be interleaved
A kernel and memcopy from different streams can be overlapped
-
Overlap CPU Work with Kernel
Kernel launches are asynchronous
cudaMemcpy(a_d, a_h, size,          // Blocks CPU until transfer completes
           cudaMemcpyHostToDevice);
kernel<<<grid, block>>>(a_d);       // Asynchronous launch
cpuFunction();                      // Overlaps kernel execution
cudaMemcpy(a_h, a_d, size,          // Blocks until kernel is complete,
           cudaMemcpyDeviceToHost); // then blocks until transfer completes
-
Overlap CPU Work with Kernel/Memcpy
Overlap the CPU with the memory transfer
Default stream is 0
cudaMemcpyAsync(a_d, a_h, size,   // Asynchronous transfer
                cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);     // Asynchronous launch (queued after the
                                  // memcpy since in the same stream)
cpuFunction();                    // Overlaps memcpy and kernel
-
Overlap Kernel with Memcpy
Requirements:
D2H or H2D memcpy from pinned memory
Device with compute capability 1.1 (G84 and later)
Kernel and memcpy in different, non-0 streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);       // Create stream handles
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size,   // Asynchronous transfer in stream1
                dir, stream1);
kernel<<<grid, block, 0, stream2>>>(); // Overlaps the transfer
-
Host Device Synchronization
Context based
Block until all outstanding CUDA commands complete
cudaThreadSynchronize(), cudaMemcpy(...)
Stream based
Block until all outstanding CUDA commands in stream complete
cudaStreamSynchronize(stream)
Test whether stream is idle (empty); returns cudaSuccess or cudaErrorNotReady (non-blocking)
cudaStreamQuery(stream)
-
Host Device Synchronization
Event based (within streams)
Record event in a stream
cudaEventRecord(event, stream)
Event is recorded (assigned a timestamp) when reached by the GPU
Block until event is recorded
cudaEventSynchronize(event)
Test whether event has been recorded; returns cudaSuccess or cudaErrorNotReady (non-blocking)
cudaEventQuery(event)
-
Zero-Copy Memory
Device directly accesses host memory (i.e. no explicit copy)
Check canMapHostMemory device property
Setup from host
cudaSetDeviceFlags(cudaDeviceMapHost);
...
cudaHostAlloc((void **)&a_h, size, cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&a_d, (void *)a_h, 0);
kernel<<<grid, block>>>(a_d, N);
-
Zero-Copy Memory Considerations
Always beneficial for integrated devices
Integrated devices access host memory directly
Test integrated device property
Typically beneficial if data only read/written once
Copy input data from CPU to GPU memory
Run one kernel
Copy output data back from GPU to CPU memory
Coalescing is even more important with zero-copy!
-
Kernel Optimizations: Measuring Performance
-
Theoretical Bandwidth
Device bandwidth for C1060
Memory clock: 800 MHz (800 × 10^6 Hz)
Memory interface: 512 bits (512 / 8 bytes)
DDR: yes, 2 transfers per clock
Theoretical bandwidth = (800 × 10^6) × (512 / 8) × 2 bytes/s
= 102 GB/s
Be consistent in the definition of giga (1024^3 or 10^9)
-
Effective Bandwidth
Measure time and calculate the effective bandwidth
For example, copying array of N floats
Size of data: N * sizeof(float) (N * 4)
Read and write: yes, 2 ops per word
Time: t
Effective bandwidth = (N * 4) * 2 / t bytes/s
-
Global Memory Throughput Metric
Many applications are memory throughput bound
When coding from scratch:
Start with memory operations first, achieve good throughput
Add the arithmetic, measuring performance as you go
When optimizing:
Measure effective memory throughput
Compare to the theoretical bandwidth
70-80% is very good, ~50% is good if arithmetic is nontrivial
Measuring throughput
From the app point of view (useful bytes)
From the hw point of view (actual bytes moved across the bus)
The two are likely to be different
Due to coalescing, discrete bus transaction sizes
-
Measuring Memory Throughput
Visual Profiler reports memory throughput
From the HW point of view
Based on counters for one TPC (3 multiprocessors)
Needs a GPU with compute capability 1.2 or higher
How throughput is calculated
Count load/store bus transactions of each size (32B, 64B, 128B) on the TPC
Extrapolate from one TPC to the entire GPU
Multiply by (total threadblocks / threadblocks on TPC)
i.e. (grid size / cta launched)
-
Kernel Optimizations: Coalescing
-
GMEM Coalescing: Compute Capability 1.2
Possible GPU memory bus transaction sizes:
32B, 64B, or 128B
Transaction segment must be aligned
First address = multiple of segment size
Hardware coalescing for each half-warp (16 threads):
Memory accesses are handled per half-warp
Carry out the smallest possible number of transactions
Reduce transaction size when possible
-
HW Steps when Coalescing for half-warp
Find the memory segment that contains the address requested by the lowest active thread:
32B segment for 8-bit data
64B segment for 16-bit data
128B segment for 32-, 64-, and 128-bit data
Find all other active threads whose requested address lies in the same segment
Reduce the transaction size, if possible:
If size == 128B and only the lower or upper half is used, reduce transaction to 64B
If size == 64B and only the lower or upper half is used, reduce transaction to 32B
Applied even if 64B was a reduction from 128B
Carry out the transaction, mark serviced threads as inactive
Repeat until all threads in the half-warp are serviced
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127
(figure: address ruler 0-224 with the 128B segment 0-127 highlighted; threads t0-t15 begin at address 116)
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 64B)
(figure: same accesses with the transaction reduced to the 64B segment 64-127)
-
Threads 0-15: 4-byte words at address 116
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 32B)
(figure: same accesses with the transaction reduced to the 32B segment 96-127)
-
Threads 0-15: 4-byte words at address 116
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255
(figure: remaining threads t3-t15 fall in the 128B segment 128-255)
-
Threads 0-15: 4-byte words at address 116
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255 (reduce to 64B)
(figure: the transaction is reduced to the 64B segment 128-191)
-
Comparing Compute Capabilities
Compute capability 1.0-1.1
Requires threads in a half-warp to:
Access a single aligned 64B, 128B, or 256B segment
Threads must issue addresses in sequence
If requirements are not satisfied:
Separate 32B transaction for each thread
Compute capability 1.2-1.3
Does not require sequential addressing by threads
Performance degrades gracefully when a half-warp addresses multiple segments
-
Experiment: Impact of Address Alignment
Assume half-warp accesses a contiguous region
Throughput is maximized when the region is aligned on its size:
100% of bytes in a bus transaction are useful
Impact of misaligned addressing:
32-bit words, streaming code, Quadro FX5800 (102 GB/s)
0 word offset: 76 GB/s (perfect alignment, typical perf)
8 word offset: 57 GB/s (75% of aligned case)
All others: 46 GB/s (61% of aligned case)
-
Address Alignment, 32-bit words
8-word (32B) offset from perfect alignment:
Observed 75% of the perfectly aligned perf
Segments starting at a multiple of 32B:
One 128B transaction (50% efficiency)
Segments starting at a multiple of 96B:
Two 32B transactions (100% efficiency)
(figure: address ruler 0-128 showing one 128B transaction vs. two 32B transactions)
-
Address Alignment, 32-bit words
4-word (16B) offset (other offsets have the same perf):
Observed 61% of the perfectly aligned perf
Two types of segments, based on starting address:
One 128B transaction (50% efficiency)
One 64B and one 32B transaction (67% efficiency)
(figure: address ruler 0-128 showing one 128B transaction vs. a 64B plus a 32B transaction)
-
Address Alignment, 64-bit words
Can be analyzed similarly to 32-bit case:
0B offset: 80 GB/s (perfectly aligned)
8B offset: 62 GB/s (78% of perfectly aligned)
16B offset: 62 GB/s (78% of perfectly aligned)
32B offset: 68 GB/s (85% of perfectly aligned)
64B offset: 76 GB/s (95% of perfectly aligned)
Compare 0 and 64B offset performance:
Both consume 100% of the bytes
64B: two 64B transactions
0B: single 128B transaction, slightly faster
-
Kernel Optimizations: Shared Memory
-
Shared Memory
Uses:
Inter-thread communication within a block
Cache data to reduce global memory accesses
Avoid non-coalesced access
Organization:
16 banks, 32-bit wide
Successive 32-bit words belong to different banks
Performance: 32 bits per bank per 2 clocks per multiprocessor
smem accesses are per 16-threads (half-warp)
Serialization: if n threads (out of 16) access the same bank, the n accesses are serialized
Broadcast: n threads access the same word in one fetch
-
Bank Addressing Examples
Left: linear addressing, thread i accesses bank i (no conflicts)
Right: random 1:1 permutation, each thread accesses a distinct bank (no conflicts)
(figure: threads 0-15 mapped one-to-one onto banks 0-15)
-
Bank Addressing Examples
Left: linear addressing with stride 2, two threads per bank (2-way conflicts)
Right: linear addressing with stride 8, eight threads per bank (8-way conflicts)
(figure: threads 0-15 colliding on banks, 2 and 8 threads per bank)
-
Shared Memory Bank Conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
-
Assess Impact On Performance
Replace all SMEM indexes with threadIdx.x
Eliminates conflicts
Shows how much performance could be improved by eliminating conflicts
-
Kernel Optimizations: Constant Memory
-
Constant Memory
Data stored in global memory, read through a constant cache
__constant__ qualifier in declarations
Can only be read by GPU kernels
Limited to 64KB
To be used when all threads in a warp read the same address
Serializes otherwise (indicated by the warp_serialize counter in the profiler)
Throughput: 32 bits per warp per clock per multiprocessor
-
Kernel Optimizations: Textures
-
Textures in CUDA
Texture is an object for reading data
Benefits:
Data is cached
Helpful when coalescing is a problem
Filtering
Linear / bilinear / trilinear interpolation
Dedicated hardware
Wrap modes (for out-of-bounds addresses)
Clamp to edge / repeat
Addressable in 1D, 2D, or 3D
Using integer or normalized coordinates
-
Texture Addressing Modes
(figure: a texture with texels 0-3; example coordinates (5.5, 1.5), (2.5, 0.5), and (1.0, 1.0))
Wrap
Out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp
Out-of-bounds coordinate is replaced with the closest boundary
-
CUDA Texture Types
Bound to linear memory
Global memory address is bound to a texture
Only 1D
Integer addressing, no filtering, no addressing modes
Bound to CUDA arrays
Block linear CUDA array is bound to a texture
1D, 2D, or 3D
Float addressing (size-based or normalized)
Filtering and addressing modes (clamp, repeat) supported
Bound to pitch linear memory
Global memory address is bound to a 2D texture
Float/integer addressing, filtering, and clamp/repeat addressing modes supported
-
Using Textures
Host (CPU) code:
Allocate/obtain memory (global linear, pitch linear, or CUDA array)
Create a texture reference object (currently must be at file scope)
Bind the texture reference to memory/array
Unbind the texture reference when finished, free resources
Device (kernel) code:
Fetch using texture reference
Linear memory textures: tex1Dfetch()
Array textures: tex1D() or tex2D() or tex3D()
Pitch linear textures: tex2D()
-