September 4, 2011 Gabriel Noaje – Source-to-source code translator: OpenMP to CUDA
University of Reims Champagne-Ardenne, France
CUDA Training Day June 26, 2012
GPU Computing with CUDA
Gabriel Noaje
INTRODUCTION TO MASSIVELY PARALLEL COMPUTING
Moore’s Law (paraphrased)
“The number of transistors on an integrated circuit doubles every two years.”
– Gordon E. Moore
Moore’s Law (visualized)
Credits: Wikimedia

Serial Performance Scaling is Over
• Cannot continue to scale processor frequencies
  – no 10 GHz chips
• Cannot continue to increase power consumption
  – can’t melt chip
• Can continue to increase transistor density
  – as per Moore’s Law

How to Use Transistors?
• Instruction-level parallelism
  – out-of-order execution, speculation, …
  – vanishing opportunities in power-constrained world
• Data-level parallelism
  – vector units, SIMD execution, …
  – increasing … SSE, AVX, Cell SPE, Clearspeed, GPU
• Thread-level parallelism
  – increasing … multithreading, multicore, manycore
  – Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …

Why Massively Parallel Processing?
• A quiet revolution and potential build-up
  – Computation: TFLOPs vs. 100 GFLOPs
• GPU in every PC – massive volume & potential impact
[Figure: GPU vs. CPU computation chart. GPU points: NV30, NV40, G70, G80, GT200, T12. CPU points: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]

Why Massively Parallel Processing?
• A quiet revolution and potential build-up
  – Bandwidth: ~10x
• GPU in every PC – massive volume & potential impact
[Figure: GPU vs. CPU bandwidth chart. GPU points: NV30, NV40, G70, G80, GT200, T12. CPU points: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]
KEPLER ARCHITECTURE

Kepler GK110 Block Diagram
Architecture
• 7.1B transistors
• 15 SMX units
• > 1 TFLOP FP64
• 1.5 MB L2 cache
• 384-bit GDDR5
• PCI Express Gen3
Kepler GK110 SMX vs Fermi SM

SMX: Efficient Performance
• 192 CUDA cores
• 64 double-precision (FP64) units
• 32 Special Function Units
• 32 load/store units dedicated to memory access
• 65536 registers
• 64 KB shared memory / L1 cache
• 48 KB read-only data cache

Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D (3D is also allowed on compute capability 2.0 and later)
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Grid 1 holds Block (0, 0), Block (1, 0), Block (0, 1) and Block (1, 1); Block (1, 1) is expanded into a 3D array of threads labeled Thread (0, 0, 0) … Thread (3, 1, 0) and (0, 0, 1) … (3, 0, 1).]

The “New” Moore’s Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is most scalable solution
  – Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores…
  – You will always have more data than cores – build the computation around the data

Generic Multicore Chip
[Figure: processors, each paired with its own memory, above a shared global memory]
• Handful of processors each supporting ~1 hardware thread
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)

Generic Manycore Chip
[Figure: many processors (• • •), each paired with its own memory, above a shared global memory]
• Many processors each supporting many hardware threads
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)

Enter GPU Computing
• Massive economies of scale
• Massively parallel
3D animation
Movie playing
CAD (Computer-Aided Design) Games

GPU Evolution
• High throughput computation
  – GeForce GTX 280: 933 GFLOP/s
• High bandwidth memory
  – GeForce GTX 280: 140 GB/s
• High availability to all
  – 180+ million CUDA-capable GPUs in the wild
[Timeline, 1995 to 2010: RIVA 128 (3M xtors), GeForce® 256 (23M xtors), GeForce 3 (60M xtors), GeForce FX (125M xtors), GeForce 8800 (681M xtors), “Fermi” (3B xtors)]

Why is this different from a CPU?
• Different goals produce different designs
  – GPU assumes work load is highly parallel
  – CPU must be good at everything, parallel or not
• CPU: minimize latency experienced by 1 thread
  – big on-chip caches
  – sophisticated control logic
• GPU: maximize throughput of all threads
  – # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  – multithreading can hide latency => skip the big caches
  – share control logic across many threads
Financial analysis Molecular docking
Medical MRI Geological modeling
Weather simulation
Computer virus detection

CUDA
• = “Compute Unified Device Architecture”
• extends the ANSI C standard
• gentle learning curve (compared to Cg, HLSL, etc.)
• opens the underlying architecture to the user
www.nvidia.com/getcuda (driver + toolkit + SDK + docs + …)
Initially - use graphics calls:
• Cg by NVIDIA
• HLSL by Microsoft
Nowadays - rich GPU programming ecosystem:
• ATI Stream by AMD
• CUDA by NVIDIA (NVIDIA hardware specific)
• OpenCL by Khronos Group
• DirectCompute by Microsoft

CUDA: Scalable parallel programming
• Augment C/C++ with minimalist abstractions
  – let programmers focus on parallel algorithms
  – not mechanics of a parallel programming language
• Provide straightforward mapping onto hardware
  – good fit to GPU architecture
  – maps well to multi-core CPUs too
• Scale to 100s of cores & 10,000s of parallel threads
  – GPU threads are lightweight: create / switch is free
  – GPU needs 1000s of threads for full utilization

Key Parallel Abstractions in CUDA
• Hierarchy of concurrent threads
• Lightweight synchronization primitives
• Shared memory model for cooperating threads

Hierarchy of concurrent threads
• Parallel kernels composed of many threads
  – all threads execute the same sequential program
• Threads are grouped into thread blocks
  – threads in the same block can cooperate
• Threads/blocks have unique IDs
[Figure: a single Thread t, and a Block b made of threads t0 t1 … tB]

CUDA Model of Parallelism
• CUDA virtualizes the physical hardware
  – thread is a virtualized scalar processor (registers, PC, state)
  – block is a virtualized multiprocessor (threads, shared mem.)
• Scheduled onto physical hardware without pre-emption
  – threads/blocks launch & run to completion
  – blocks should be independent
[Figure: many blocks (• • •), each paired with its own memory, above a shared global memory]
Heterogeneous Computing
Multicore CPU

C for CUDA
• Philosophy: provide minimal set of extensions necessary to expose power
• Function qualifiers:
    __global__ void my_kernel() { }
    __device__ float my_device_func() { }
• Variable qualifiers:
    __constant__ float my_constant_array[32];
    __shared__ float my_shared_array[32];
• Execution configuration:
    dim3 grid_dim(100, 50);   // 5000 thread blocks
    dim3 block_dim(4, 8, 8);  // 256 threads per block
    my_kernel <<< grid_dim, block_dim >>> (...); // Launch kernel
• Built-in variables and functions valid in device code:
    dim3 gridDim;    // Grid dimension
    dim3 blockDim;   // Block dimension
    dim3 blockIdx;   // Block index
    dim3 threadIdx;  // Thread index
    void __syncthreads(); // Thread synchronization
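The qualifiers and built-ins listed above can be seen working together in one small program. This is only a sketch: the names `scale`, `scaled` and `reverse_tiles` are invented for illustration, and the array size is chosen so that it divides evenly into blocks of 32 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// All names here are hypothetical; only the qualifiers and built-ins come from the slide.
__constant__ float scale[1];                 // read-only inside kernels, written by the host

__device__ float scaled(float x)             // callable from device code only
{
    return scale[0] * x;
}

// Scales every value, then reverses each 32-element tile of `data` in place.
__global__ void reverse_tiles(float *data)
{
    __shared__ float tile[32];               // one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = scaled(data[i]);
    __syncthreads();                         // wait until the whole tile is written
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    const int n = 64;                        // exactly 2 blocks of 32 threads
    float h[n], s = 2.0f, *d = 0;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(scale, &s, sizeof(float));   // host fills constant memory

    reverse_tiles<<<n / 32, 32>>>(d);        // execution configuration: grid, block
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %.1f, h[31] = %.1f\n", h[0], h[31]);
    cudaFree(d);
    return 0;
}
```

On a CUDA-capable device this should print `h[0] = 62.0, h[31] = 0.0`, since the first tile holds the doubled values 0 to 62 in reverse order.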
CUDA PROGRAMMING BASICS

Outline of CUDA Basics
• Basic Kernels and Execution on GPU
• Basic Memory Management
• Coordinating CPU and GPU Execution
• See the Programming Guide for the full API

CUDA Programming Model
• Parallel code (kernel) is launched and executed on a device by many threads
• Launches are hierarchical
  – Threads are grouped into blocks
  – Blocks are grouped into grids
• Familiar serial code is written for a thread
  – Each thread is free to execute a unique code path
  – Built-in thread and block ID variables

High Level View
[Figure: CPU and chipset connected over PCIe to the GPU, shown as multiprocessors that each have their own SMEM, above a common global memory]

Blocks of threads run on an SM
[Figure: a thread, with its registers and per-thread memory, maps to a streaming processor; a thread block, with its per-block shared memory (SMEM), maps to a streaming multiprocessor]

Whole grid runs on GPU
Many blocks of threads
[Figure: blocks distributed across multiprocessors, each with its own SMEM, all sharing global memory]

Thread Hierarchy
• Threads launched for a parallel section are partitioned into thread blocks
  – Grid = all blocks for a given launch
• Thread block is a group of threads that can:
  – Synchronize their execution
  – Communicate via shared memory

Memory Model
[Figure: sequential kernels, Kernel 0 followed by Kernel 1, each made of many blocks that access the same per-device global memory]

Memory Model
[Figure: host memory connected to Device 0 memory and Device 1 memory through cudaMemcpy()]

Example: Vector Addition Kernel
Device Code
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }
Host Code
    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
    }

Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …, *h_C = …; // h_C starts out empty
    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));
    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Example: Host code for vecAdd (2)
    // execute grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
    // copy result back to host memory
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );
    // do something with the result…
    // free device (GPU) memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
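The host code above assumes that N is a multiple of 256 and that no CUDA call fails. A hedged, self-contained variant that drops both assumptions might look like the following sketch. The `CHECK` macro, the extra `n` parameter and the test values are additions for illustration, not part of the original slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: stop with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Same kernel as on the slide, plus a length parameter and a bounds guard.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];            // threads past the end do nothing
}

int main()
{
    const int N = 1000;                       // deliberately not a multiple of 256
    const size_t bytes = N * sizeof(float);
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    if (!h_A || !h_B || !h_C) return 1;
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc((void**)&d_A, bytes));
    CHECK(cudaMalloc((void**)&d_B, bytes));
    CHECK(cudaMalloc((void**)&d_C, bytes));
    CHECK(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;   // round up: 4 blocks here
    vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, N);
    CHECK(cudaGetLastError());                // reports an invalid launch
    CHECK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != 3.0f * i) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return mismatches != 0;
}
```

Rounding the block count up launches a few surplus threads in the last block, which is why the kernel needs the `i < n` guard.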

IDs and Dimensions
• Threads:
  – 3D IDs, unique within a block
• Blocks:
  – 2D IDs, unique within a grid (3D is also allowed on compute capability 2.0 and later)
• Dimensions set at launch
  – Can be unique for each grid
• Built-in variables:
  – threadIdx, blockIdx
  – blockDim, gridDim
[Figure: the device runs Grid 1, a 3×2 arrangement of blocks from Block (0, 0) to Block (2, 1); Block (1, 1) is expanded into a 5×3 arrangement of threads from Thread (0, 0) to Thread (4, 2)]
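As a rough illustration of how these built-in variables combine, the sketch below flattens a 2D block index and a 3D thread index into one global ID. The kernel name `write_ids` is hypothetical, and the launch shape simply mirrors the figure (3×2 blocks of 5×3 threads).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread stores its own flattened global ID.
__global__ void write_ids(int *out)
{
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;             // 2D block index -> 1D
    int thread_id = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                  + threadIdx.x;                                     // 3D thread index -> 1D
    int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    int gid = block_id * threads_per_block + thread_id;
    out[gid] = gid;
}

int main()
{
    dim3 grid(3, 2);                     // 6 blocks, as in the figure
    dim3 block(5, 3);                    // 15 threads per block (z defaults to 1)
    const int total = 3 * 2 * 5 * 3;     // 90 threads in the whole grid

    int h_out[total];
    int *d_out = 0;
    cudaMalloc((void**)&d_out, total * sizeof(int));
    cudaMemset(d_out, 0xFF, total * sizeof(int));   // every int starts as -1
    write_ids<<<grid, block>>>(d_out);
    cudaMemcpy(h_out, d_out, total * sizeof(int), cudaMemcpyDeviceToHost);

    int wrong = 0;                       // an entry still at -1 was never written
    for (int i = 0; i < total; ++i)
        if (h_out[i] != i) ++wrong;
    printf("%d threads, %d wrong IDs\n", total, wrong);

    cudaFree(d_out);
    return wrong != 0;
}
```

If the flattening is right, every slot is written exactly once and the program reports zero wrong IDs.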

Kernel Variations and Output
    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = 7;
    }
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = blockIdx.x;
    }
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
    __global__ void kernel( int *a )
    {
        int idx = blockIdx.x*blockDim.x + threadIdx.x;
        a[idx] = threadIdx.x;
    }
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
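The three outputs are consistent with a launch of 4 blocks of 4 threads over a 16-element array. The slide does not show the launch, so that configuration is an assumption here, as are the distinct kernel names and the `show` helper used to keep all three variations in one file.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The slide calls all three variations `kernel`; they are renamed here (hypothetical names).
__global__ void fill_seven(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

__global__ void fill_block(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}

__global__ void fill_thread(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}

// Hypothetical helper: copy the 16 device values back and print them on one line.
static void show(const int *d_a)
{
    int h_a[16];
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; ++i) printf("%d ", h_a[i]);
    printf("\n");
}

int main()
{
    int *d_a = 0;
    cudaMalloc((void**)&d_a, 16 * sizeof(int));

    fill_seven<<<4, 4>>>(d_a);   show(d_a);   // 4 blocks of 4 threads (assumed)
    fill_block<<<4, 4>>>(d_a);   show(d_a);
    fill_thread<<<4, 4>>>(d_a);  show(d_a);

    cudaFree(d_a);
    return 0;
}
```

Each blocking `cudaMemcpy` inside `show` waits for the preceding kernel, so no explicit synchronization is needed between launches.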

Code executed on GPU
• C/C++ with some restrictions:
  – Can only access GPU memory
  – No variable number of arguments
  – No static variables
  – No recursion (allowed in __device__ functions on compute capability 2.0 and later)
  – No dynamic polymorphism
• Must be declared with a qualifier:
  – __global__ : launched by CPU, cannot be called from GPU, must return void
  – __device__ : called from other GPU functions, cannot be called by the CPU
  – __host__ : can be called by CPU
  – __host__ and __device__ qualifiers can be combined
      – Function is compiled for both host and device
      – Sample use: overloading operators
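A minimal sketch of the combined `__host__ __device__` case, using operator overloading as the slide suggests. The `Vec2` type and the `add_on_gpu` kernel are invented for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Vec2 { float x, y; };             // hypothetical type for illustration

// Compiled twice: once for the CPU and once for the GPU.
__host__ __device__ Vec2 operator+(Vec2 a, Vec2 b)
{
    Vec2 r = { a.x + b.x, a.y + b.y };
    return r;
}

__global__ void add_on_gpu(const Vec2 *a, const Vec2 *b, Vec2 *out)
{
    *out = *a + *b;                      // device version of operator+
}

int main()
{
    Vec2 h_a = { 1.0f, 2.0f }, h_b = { 3.0f, 4.0f };
    Vec2 cpu = h_a + h_b;                // host version of operator+

    Vec2 *d = 0;                         // d[0], d[1]: inputs; d[2]: result
    cudaMalloc((void**)&d, 3 * sizeof(Vec2));
    cudaMemcpy(d,     &h_a, sizeof(Vec2), cudaMemcpyHostToDevice);
    cudaMemcpy(d + 1, &h_b, sizeof(Vec2), cudaMemcpyHostToDevice);
    add_on_gpu<<<1, 1>>>(d, d + 1, d + 2);

    Vec2 gpu;
    cudaMemcpy(&gpu, d + 2, sizeof(Vec2), cudaMemcpyDeviceToHost);
    printf("CPU: (%.1f, %.1f)  GPU: (%.1f, %.1f)\n", cpu.x, cpu.y, gpu.x, gpu.y);

    cudaFree(d);
    return 0;
}
```

Writing the operator once keeps host-side reference code and device code from drifting apart.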

Memory Spaces
• CPU and GPU have separate memory spaces
  – Data is moved across PCIe bus
  – Use functions to allocate/set/copy memory on GPU
  – Very similar to corresponding C functions
• Pointers are just addresses
  – Can’t tell from the pointer value whether the address is on CPU or GPU
  – Must exercise care when dereferencing:
      – Dereferencing CPU pointer on GPU will likely crash
      – The same holds the other way around

CUDA Device Memory Model
• Device code can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
• Host code can:
  – Transfer data to/from per-grid global and constant memories

GPU Memory Allocation / Release
• Host (CPU) manages device (GPU) memory:
  – cudaMalloc (void ** pointer, size_t nbytes)
  – cudaMemset (void * pointer, int value, size_t count)
  – cudaFree (void* pointer)
    int n = 1024;
    int nbytes = n*sizeof(int);
    int * d_a = 0;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemset( d_a, 0, nbytes);
    cudaFree(d_a);

Data Copies
• cudaMemcpy( void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
  – returns after the copy is complete
  – blocks CPU thread until all bytes have been copied
  – doesn’t start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking copies are also available
  – cudaMemcpyAsync
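For the non-blocking variant, a sketch along these lines shows the usual pattern: page-locked host buffers, a stream, and an explicit synchronization before the result is read. The buffer names and sizes are illustrative assumptions; the slides only name `cudaMemcpyAsync`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_src = 0, *h_dst = 0, *d_buf = 0;
    cudaStream_t stream;

    // Asynchronous copies need page-locked (pinned) host memory to overlap with CPU work.
    cudaMallocHost((void**)&h_src, bytes);
    cudaMallocHost((void**)&h_dst, bytes);
    cudaMalloc((void**)&d_buf, bytes);
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; ++i) { h_src[i] = 1.0f; h_dst[i] = 0.0f; }

    // Both calls may return before the copies finish; the stream runs them in order.
    cudaMemcpyAsync(d_buf, h_src, bytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the CPU could do unrelated work here ...

    cudaStreamSynchronize(stream);       // h_dst is only safe to read after this
    printf("h_dst[0] = %.1f, h_dst[n-1] = %.1f\n", h_dst[0], h_dst[n - 1]);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_src);
    cudaFreeHost(h_dst);
    cudaFree(d_buf);
    return 0;
}
```

Unlike `cudaMemcpy`, nothing here blocks the CPU thread until `cudaStreamSynchronize` is called.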
Code Walkthrough 1
    // walkthrough1.cu
    #include <stdio.h>
    #include <stdlib.h>   // malloc, free
    int main()
    {
        int dimx = 16;
        int num_bytes = dimx*sizeof(int);
        int *d_a=0, *h_a=0; // device and host pointers

        // allocate host and device buffers
        h_a = (int*)malloc(num_bytes);
        cudaMalloc( (void**)&d_a, num_bytes );
        if( 0==h_a || 0==d_a )
        {
            printf("couldn't allocate memory\n");
            return 1;
        }

        // zero the device buffer, then copy it back to the host
        cudaMemset( d_a, 0, num_bytes );
        cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

        for(int i=0; i<dimx; i++)
            printf("%d ", h_a[i] );
        printf("\n");

        free( h_a );
        cudaFree( d_a );
        return 0;
    }

Example: Shuffling Data
    // Reorder values based on keys
    // Each thread moves one element
    __global__ void shuffle(int* prev_array, int* new_array, int* indices)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        new_array[i] = prev_array[indices[i]];
    }
Host Code
    int main()
    {
        // Run grid of N/256 blocks of 256 threads each
        shuffle<<< N/256, 256>>>(d_old, d_new, d_ind);
    }

Kernel with 2D Indexing
    __global__ void kernel( int *a, int dimx, int dimy )
    {
        int ix = blockIdx.x*blockDim.x + threadIdx.x;
        int iy = blockIdx.y*blockDim.y + threadIdx.y;
        int idx = iy*dimx + ix;
        a[idx] = a[idx]+1;
    }
![Page 56: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/56.jpg)
int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); if( 0==h_a || 0==d_a ) { printf("couldn't allocate memory\n"); return 1; } cudaMemset( d_a, 0, num_bytes ); dim3 grid, block; block.x = 4; block.y = 4; grid.x = dimx / block.x; grid.y = dimy / block.y; kernel<<<grid, block>>>( d_a, dimx, dimy ); cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost ); for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col] ); printf("\n"); } free( h_a ); cudaFree( d_a ); return 0; }
![Page 57: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/57.jpg)
Blocks must be independent
• Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
• A thread block is a batch of threads that can cooperate with each other by:
  - synchronizing their execution: __syncthreads()
  - sharing data with shared memory
• Independence requirement gives scalability
![Page 58: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/58.jpg)
CUDA MEMORIES
![Page 59: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/59.jpg)
Hardware Implementation of CUDA Memories
• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read-only per-grid constant memory
[Figure: a grid containing Block (0, 0) and Block (1, 0); each block has its own shared memory and threads (0, 0) and (1, 0), each thread with its own registers; global memory and constant memory belong to the grid and are connected to the host]
![Page 60: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/60.jpg)
CUDA Variable Type Qualifiers
• “automatic” scalar variables without qualifier reside in a register
  - compiler will spill to thread-local memory
• “automatic” array variables without qualifier reside in thread-local memory

| Variable declaration | Memory | Scope | Lifetime |
|---|---|---|---|
| `int var;` | register | thread | thread |
| `int array_var[10];` | local | thread | thread |
| `__shared__ int shared_var;` | shared | block | block |
| `__device__ int global_var;` | global | grid | application |
| `__constant__ int constant_var;` | constant | grid | application |
![Page 61: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/61.jpg)
CUDA Variable Type Performance
• scalar variables reside in fast, on-chip registers
• shared variables reside in fast, on-chip memories
• thread-local arrays & global variables reside in uncached off-chip memory
• constant variables reside in cached off-chip memory

| Variable declaration | Memory | Penalty |
|---|---|---|
| `int var;` | register | 1x |
| `int array_var[10];` | local | 100x |
| `__shared__ int shared_var;` | shared | 1x |
| `__device__ int global_var;` | global | 100x |
| `__constant__ int constant_var;` | constant | 1x |
![Page 62: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/62.jpg)
Where to declare variables?
Can host access it?
• Yes: declare outside of any function
  - __constant__ int constant_var;
  - __device__ int global_var;
• No: declare in the kernel
  - int var;
  - int array_var[10];
  - __shared__ int shared_var;
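As a sketch of how the two placement rules look in practice, the program below declares `constant_var` and `global_var` at file scope and reaches them from the host through `cudaMemcpyToSymbol` and `cudaMemcpyFromSymbol`, while `var` and `shared_var` live inside the kernel. The kernel name `use_vars`, the values, and the single-block launch are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Declared outside of any function: the host can reach these by symbol.
__constant__ int constant_var;
__device__   int global_var;

__global__ void use_vars(int *out)
{
    int var = threadIdx.x;          // automatic scalar: register
    __shared__ int shared_var;      // one copy per thread block

    if (threadIdx.x == 0)
        shared_var = constant_var;  // one thread fills the shared copy
    __syncthreads();                // make it visible to the whole block

    out[threadIdx.x] = var + shared_var;

    if (threadIdx.x == 0)
        global_var = shared_var + 1; // written on the device, read by host
}

int main()
{
    const int n = 32;               // one block of 32 threads (assumption)
    int h_const = 100;
    cudaMemcpyToSymbol(constant_var, &h_const, sizeof(int));

    int *d_out = 0;
    cudaMalloc((void**)&d_out, n * sizeof(int));
    use_vars<<<1, n>>>(d_out);

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    int h_global = 0;
    cudaMemcpyFromSymbol(&h_global, global_var, sizeof(int));

    printf("out[5] = %d, global_var = %d\n", h_out[5], h_global);
    cudaFree(d_out);
    return 0;
}
```

The variables declared inside the kernel have no host-visible name, which is why the decision on the slide starts from whether the host needs access.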
![Page 63: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/63.jpg)
Example – thread-local variables
// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
    // p goes in a register
    float2 p = ps[threadIdx.x];

    // per-thread heap goes in off-chip memory
    float2 heap[10];

    // read through num_qs points, maintaining
    // the nearest 10 qs to p in the heap
    ...

    // write out the contents of heap to result
    ...
}
![Page 64: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/64.jpg)
Example – shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // each thread loads two elements from global memory
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}
![Page 65: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/65.jpg)
Example – shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // what are the bandwidth requirements of this kernel?
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}
Two loads
![Page 66: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/66.jpg)
Example – shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // How many times does this kernel load input[i]?
        int x_i = input[i];              // once by thread i
        int x_i_minus_one = input[i-1];  // again by thread i+1

        result[i] = x_i - x_i_minus_one;
    }
}
![Page 67: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/67.jpg)
Example – shared variables
// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // Idea: eliminate redundancy by sharing data
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}
![Page 68: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/68.jpg)
Example – shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    // shorthand for threadIdx.x
    int tx = threadIdx.x;

    // allocate a __shared__ array, one element per thread
    __shared__ int s_data[BLOCK_SIZE];

    // each thread reads one element to s_data
    unsigned int i = blockDim.x * blockIdx.x + tx;
    s_data[tx] = input[i];

    // avoid race condition: ensure all loads
    // complete before continuing
    __syncthreads();
    ...
}
![Page 69: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/69.jpg)
Example – shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    ...
    if(tx > 0)
        result[i] = s_data[tx] - s_data[tx-1];
    else if(i > 0)
    {
        // handle thread block boundary
        result[i] = s_data[tx] - input[i-1];
    }
}
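The two halves of the optimized kernel can be put together into one runnable sketch. The name `adj_diff_full`, the `n` parameter with its bounds checks, the value of `BLOCK_SIZE`, and the host driver are assumptions added here; the slides show only the kernel body.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // assumed; must match the launch configuration

__global__ void adj_diff_full(int *result, const int *input, unsigned int n)
{
    int tx = threadIdx.x;
    __shared__ int s_data[BLOCK_SIZE];
    unsigned int i = blockDim.x * blockIdx.x + tx;

    // Guarded load: threads past the end load nothing, but every
    // thread still reaches the barrier.
    if (i < n)
        s_data[tx] = input[i];
    __syncthreads();

    if (i < n) {
        if (tx > 0)
            result[i] = s_data[tx] - s_data[tx - 1];
        else if (i > 0)
            result[i] = s_data[tx] - input[i - 1];  // block boundary
        // result[0] has no left neighbour and is left untouched
    }
}

int main()
{
    const unsigned int n = 1000;
    const size_t bytes = n * sizeof(int);
    int *h_in  = (int*)malloc(bytes);
    int *h_out = (int*)malloc(bytes);
    for (unsigned int i = 0; i < n; ++i)
        h_in[i] = (int)(i * i);

    int *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);

    unsigned int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    adj_diff_full<<<blocks, BLOCK_SIZE>>>(d_out, d_in, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Compare against the definition result[i] = input[i] - input[i-1].
    int errors = 0;
    for (unsigned int i = 1; i < n; ++i)
        if (h_out[i] != h_in[i] - h_in[i - 1]) ++errors;
    printf("%s\n", errors == 0 ? "adj_diff OK" : "adj_diff FAILED");

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return errors != 0;
}
```

Note that the barrier sits outside the `i < n` guard: `__syncthreads()` must be reached by all threads of the block, so only the memory accesses are made conditional.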
![Page 70: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/70.jpg)
Example – shared variables
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
    // use extern to indicate a __shared__ array will be
    // allocated dynamically at kernel launch time
    extern __shared__ int s_data[];
    ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r,i);
![Page 71: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/71.jpg)
A Common Programming Strategy
• Partition data into subsets that fit into shared memory
![Page 72: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/72.jpg)
A Common Programming Strategy
• Handle each data subset with one thread block
![Page 73: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/73.jpg)
A Common Programming Strategy
• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
![Page 74: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/74.jpg)
A Common Programming Strategy
• Perform the computation on the subset from shared memory
![Page 75: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/75.jpg)
A Common Programming Strategy
• Copy the result from shared memory back to global memory
![Page 76: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/76.jpg)
A Common Programming Strategy
• Carefully partition data according to access patterns
  - Read-only → __constant__ memory (fast)
  - R/W & shared within block → __shared__ memory (fast)
  - R/W within each thread → registers (fast)
  - Indexed R/W within each thread → local memory (slow)
  - R/W inputs/results → cudaMalloc'ed global memory (slow)
![Page 77: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/77.jpg)
Communication Through Memory
• Question:

__global__ void race(void)
{
    __shared__ int my_shared_variable;
    my_shared_variable = threadIdx.x;

    // what is the value of
    // my_shared_variable?
}
![Page 78: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/78.jpg)
Communication Through Memory
• This is a race condition
  - The result is undefined
  - The order in which threads access the variable is undefined without explicit coordination
• Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics
![Page 79: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/79.jpg)
Communication Through Memory
• Use __syncthreads to ensure data is ready for access

__global__ void share_data(int *input)
{
    __shared__ int data[BLOCK_SIZE];
    data[threadIdx.x] = input[threadIdx.x];
    __syncthreads();
    // the state of the entire data array
    // is now well-defined for all threads
    // in this block
}
![Page 80: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/80.jpg)
Communication Through Memory
• Use atomic operations to ensure exclusive access to a variable

// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
    atomicAdd(result, input[threadIdx.x]);
    // after this kernel exits, the value of
    // *result will be the sum of the input
}
![Page 81: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/81.jpg)
Hierarchical Atomics
• Divide & Conquer
  - Per-thread atomicAdd to a __shared__ partial sum
  - Per-block atomicAdd to the total sum
[Figure: per-block partial sums Σ0, Σ1, …, Σi combined into the total sum Σ]
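The two-level scheme can be sketched as follows. The kernel name `sum_hierarchical`, the `n` parameter, and the host test are assumptions for illustration; the sketch also assumes a GPU that supports atomic operations on shared memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void sum_hierarchical(const int *input, int *result, int n)
{
    __shared__ int partial_sum;            // one partial sum per block

    if (threadIdx.x == 0)
        partial_sum = 0;
    __syncthreads();                       // partial_sum is initialized

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&partial_sum, input[i]); // per-thread, shared memory
    __syncthreads();                       // all contributions are in

    if (threadIdx.x == 0)
        atomicAdd(result, partial_sum);    // per-block, global memory
}

int main()
{
    const int n = 10000;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int *h_in = (int*)malloc(n * sizeof(int));
    int expected = 0;
    for (int i = 0; i < n; ++i) {
        h_in[i] = i % 3;
        expected += h_in[i];
    }

    int *d_in = 0, *d_result = 0;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_result, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(int));  // *result must start at 0

    sum_hierarchical<<<blocks, threads>>>(d_in, d_result, n);

    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%s\n", h_result == expected ? "sum OK" : "sum FAILED");

    cudaFree(d_in); cudaFree(d_result);
    free(h_in);
    return h_result != expected;
}
```

With this layout the number of atomic operations on the single global counter drops from one per element to one per block; the per-thread atomics contend only within their own block.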
![Page 82: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/82.jpg)
SMX EXECUTION
![Page 83: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/83.jpg)
How an SM executes threads
• Overview of how a Streaming Multiprocessor works
• SIMT Execution
• Divergence
![Page 84: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/84.jpg)
Scheduling Blocks onto SMs
[Figure: thread blocks (Thread Block 5, 27, 61, …, 2001) being assigned to a Streaming Multiprocessor]
• HW schedules thread blocks onto available SMs
  - No guarantee of ordering among thread blocks
  - HW will schedule thread blocks as soon as a previous thread block finishes
![Page 85: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/85.jpg)
Warps
• Each thread block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in SMs
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 x 3 = 24 warps
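The count above generalizes to block sizes that are not a multiple of 32, as long as the division is rounded up. Written out, with the slide's numbers substituted:

$$
\text{warps per block} = \left\lceil \frac{\text{threads per block}}{32} \right\rceil,
\qquad
\text{warps per SM} = \text{blocks per SM} \times \text{warps per block}
$$

$$
3 \times \left\lceil \frac{256}{32} \right\rceil = 3 \times 8 = 24
$$

For example, a 100-thread block occupies $\lceil 100/32 \rceil = 4$ warps, and the last warp has only 4 active threads.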
![Page 86: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/86.jpg)
Mapping of Thread Blocks
• Each thread block is mapped to one or more warps
• The hardware schedules each warp independently
[Figure: Thread Block N (128 threads) split into four warps, TB N W1 to TB N W4]
![Page 87: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/87.jpg)
Thread Scheduling Example
• SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by SM
  - Warps whose next instruction has its inputs ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
[Figure: timeline of warp execution on one SM (TB = Thread Block, W = Warp). Warps from TB1, TB2 and TB3 are interleaved over time, each issuing a few instructions; stalls are marked for TB1 W1, TB2 W1 and TB3 W2]
![Page 88: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/88.jpg)
Control Flow Divergence
• What happens if you have the following code?

if(foo(threadIdx.x))
{
    do_A();
}
else
{
    do_B();
}
![Page 89: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/89.jpg)
Control Flow Divergence
[Figure: threads reaching a branch and following Path A or Path B]
![Page 90: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/90.jpg)
Control Flow Divergence
• Nested branches are handled as well

if(foo(threadIdx.x))
{
    if(bar(threadIdx.x))
        do_A();
    else
        do_B();
}
else
    do_C();
![Page 91: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/91.jpg)
Control Flow Divergence
[Figure: nested branches leading to Path A, Path B and Path C]
![Page 92: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/92.jpg)
Control Flow Divergence
• You don't have to worry about divergence for correctness
• You might have to think about it for performance
  - Depends on your branch conditions
![Page 93: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/93.jpg)
Control Flow Divergence
• Performance drops off with the degree of divergence

switch(threadIdx.x % N)
{
    case 0: ...
    case 1: ...
}
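To make "depends on your branch conditions" concrete, here is a sketch contrasting a condition that differs between neighbouring threads with one that is uniform across each warp. The kernel names and the test harness are assumptions, blocks are assumed one-dimensional, and no timings are claimed; for branches this short a compiler may emit predicated instructions instead of real branches, so the sketch shows only the pattern.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Condition alternates between adjacent threads: both paths are taken
// inside every warp, so each warp diverges.
__global__ void branch_per_thread(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;
    else
        out[i] = 2 * i + 1;
}

// Condition is constant across the 32 threads of a warp: different
// warps take different paths, but no single warp diverges.
__global__ void branch_per_warp(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0)
        out[i] = 2 * i;
    else
        out[i] = 2 * i + 1;
}

int main()
{
    const int threads = 64, blocks = 2, n = threads * blocks;
    int h_a[n], h_b[n];
    int *d_a = 0, *d_b = 0;
    cudaMalloc((void**)&d_a, n * sizeof(int));
    cudaMalloc((void**)&d_b, n * sizeof(int));

    branch_per_thread<<<blocks, threads>>>(d_a);
    branch_per_warp<<<blocks, threads>>>(d_b);
    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Host-side check of both branch patterns (warp size taken as 32).
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        int tid = i % threads;
        int expect_a = (tid % 2 == 0) ? 2 * i : 2 * i + 1;
        int expect_b = ((tid / 32) % 2 == 0) ? 2 * i : 2 * i + 1;
        if (h_a[i] != expect_a || h_b[i] != expect_b) ++errors;
    }
    printf("%s\n", errors == 0 ? "branches OK" : "branches FAILED");

    cudaFree(d_a); cudaFree(d_b);
    return errors != 0;
}
```

Both kernels are correct, which matches the point that divergence is a performance concern rather than a correctness one.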
![Page 94: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/94.jpg)
Divergence
[Plot: performance (vertical axis, 0 to 35) against degree of divergence (horizontal axis, 0 to 18)]
![Page 95: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/95.jpg)
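Compiling a CUDA program
[Figure: compilation flow. CUDA source is split into host code (.c / .cpp) and device code (.gpu); these produce an object file and a fat binary, which are linked with the C/C++ libraries and the CUDA libraries into an executable with embedded GPU code]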
![Page 96: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/96.jpg)
CUDA Makefile
CC=nvcc
CUDA_DIR= /opt/cuda/4.1
SDK_DIR= ${CUDA_DIR}/sdk
CFLAGS= -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc
LDFLAGS= -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib
LIB= -lm -lrt
SOURCES= vector_add.cu
EXECNAME= vector_add

all:
	$(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

clean:
	$(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)
	rm -f *.o core
![Page 97: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/97.jpg)
OPENACC API
![Page 98: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/98.jpg)
3 Ways to Accelerate Applications
Applications
• Libraries: “Drop-in” Acceleration
• OpenACC Directives: Easily Accelerate Applications
• Programming Languages: Maximum Flexibility
![Page 99: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/99.jpg)
OpenACC Directives
Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

• Your original Fortran or C code
• Simple compiler hints
• Compiler parallelizes code
• Works on many-core GPUs & multicore CPUs
[Figure: the program shown next to a CPU and a GPU; the !$acc lines are labeled "OpenACC Compiler Hint"]
![Page 100: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/100.jpg)
OpenACC Open Programming Standard for Parallel Computing
![Page 101: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/101.jpg)
OpenACC: The Standard for GPU Directives
• Easy: Directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU Directives allow complete access to the massive parallel power of a GPU
![Page 102: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/102.jpg)
High-level, with low-level access
• Compiler directives to specify parallel regions in C, C++, Fortran
  - OpenACC compilers offload parallel regions from host to accelerator
  - Portable across OSes, host CPUs, accelerators, and compilers
• Create high-level heterogeneous programs
  - Without explicit accelerator initialization
  - Without explicit data or program transfers between host and accelerator
• Programming model allows programmers to start simple
  - Enhance with additional guidance for compiler on loop mappings, data location, and other performance details
• Compatible with other GPU languages and libraries
  - Interoperate between CUDA C/Fortran and GPU libraries, e.g. CUFFT, CUBLAS, CUSPARSE, etc.
![Page 103: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/103.jpg)
OpenACC Specification and Website
• Full OpenACC 1.0 Specification available online: http://www.openacc-standard.org
• Quick reference card also available
• Available compilers:
  - Caps HMPP OpenACC (soon at Romeo)
  - PGI OpenACC (soon at Romeo)
![Page 104: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/104.jpg)
A Very Simple Exercise: SAXPY

SAXPY in C
void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran
subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...
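For comparison with the directive-based version, a hand-written CUDA SAXPY might look like the sketch below. The kernel name `saxpy_kernel`, the block size, and the test values are assumptions; the point is only to show which steps (allocation, transfers, launch configuration) the directive approach leaves to the compiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: y[i] = a*x[i] + y[i]
__global__ void saxpy_kernel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;                 // 1M elements, as on the slide
    const size_t bytes = n * sizeof(float);
    float *h_x = (float*)malloc(bytes);
    float *h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // Explicit device allocation and host-to-device transfers
    float *d_x = 0, *d_y = 0;
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Explicit launch configuration
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    saxpy_kernel<<<blocks, threads>>>(n, 2.0f, d_x, d_y);

    // Explicit device-to-host transfer of the result
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    // 2.0 * 1.0 + 2.0 is exactly 4.0 in single precision
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (h_y[i] != 4.0f) ++errors;
    printf("%s\n", errors == 0 ? "saxpy OK" : "saxpy FAILED");

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return errors != 0;
}
```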
![Page 105: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/105.jpg)
Directive Syntax
• Fortran
  !$acc directive [clause [[,] clause] …]
  Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive
• C
  #pragma acc directive [clause [[,] clause] …]
  Often followed by a structured code block
![Page 106: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/106.jpg)
CUDA = much more
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign
• Atomic, shuffle operations
• Streams
• GPU Direct
• CUBLAS, NPP, CUFFT, …
• OpenCL
![Page 107: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/107.jpg)
References
• www.nvidia.com/cuda (CUDA zone, developer zone)
• Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu
• CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot
• CUDA Application Design and Development, by Rob Farber
![Page 108: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/108.jpg)
Full Version Questions ?
![Page 109: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/109.jpg)
Matrix Multiplication Example
• Generalize adjacent_difference example
• AB = A * B
  - Each element AB_ij = dot(row(A,i), col(B,j))
• Parallelization strategy
  - Thread → AB_ij
  - 2D kernel
![Page 110: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/110.jpg)
First Implementation
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // calculate the row & col index of the element
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float result = 0;

    // do dot product between row of a and col of b
    for(int k = 0; k < width; ++k)
        result += a[row*width+k] * b[k*width+col];

    ab[row*width+col] = result;
}
![Page 111: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/111.jpg)
Idea: Use __shared__ memory to reuse global data
• Each input element is read by width threads
• Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth
![Page 112: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/112.jpg)
Tiled Multiply
• Partition kernel loop into phases
• Load a tile of both matrices into __shared__ each phase
• Each phase, each thread computes a partial result
[Figure: input matrices partitioned into tiles of size TILE_WIDTH]
![Page 113: GPU Computing with CUDA - univ-reims.frcosy.univ-reims.fr/~cjaillet/www/pub/fichiers/enseignement/Info...GPU Computing with CUDA ... • ATI Stream by AMD • CUDA by NVIDIA • OpenCL](https://reader031.vdocuments.us/reader031/viewer/2022022005/5ab619047f8b9a7c5b8d5cb0/html5/thumbnails/113.jpg)
Better Implementation
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // shorthand
    int tx = threadIdx.x, ty = threadIdx.y;
    int bx = blockIdx.x, by = blockIdx.y;

    // allocate tiles in __shared__ memory
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    // calculate the row & col index
    int row = by*blockDim.y + ty;
    int col = bx*blockDim.x + tx;

    float result = 0;
Better Implementation (continued)

        // loop over the tiles of the input in phases
        for(int p = 0; p < width/TILE_WIDTH; ++p)
        {
            // collaboratively load tiles into __shared__
            s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
            s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
            __syncthreads();

            // dot product between row of s_a and col of s_b
            for(int k = 0; k < TILE_WIDTH; ++k)
                result += s_a[ty][k] * s_b[k][tx];
            __syncthreads();
        }

        ab[row*width+col] = result;
    }
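The kernel above relies on width being a multiple of TILE_WIDTH and on blocks of exactly TILE_WIDTH x TILE_WIDTH threads. The sketch below is an extension that is not in the slides: it pads out-of-range tile entries with zeros so that any width works. The names `mat_mul_any_width` and `launch_mat_mul_any_width`, and the choice `TILE_WIDTH = 16`, are assumptions.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16  // assumed tile edge; blocks must be TILE_WIDTH x TILE_WIDTH

// Variant of the tiled kernel that tolerates any width (an extension, not from the slides).
__global__ void mat_mul_any_width(const float *a, const float *b, float *ab, int width)
{
    int tx = threadIdx.x, ty = threadIdx.y;
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    float result = 0;

    int phases = (width + TILE_WIDTH - 1) / TILE_WIDTH;  // round up to cover partial tiles
    for (int p = 0; p < phases; ++p) {
        int a_col = p * TILE_WIDTH + tx;
        int b_row = p * TILE_WIDTH + ty;
        // out-of-range entries are stored as 0 so they add nothing to the sum
        s_a[ty][tx] = (row < width && a_col < width) ? a[row * width + a_col] : 0.0f;
        s_b[ty][tx] = (b_row < width && col < width) ? b[b_row * width + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            result += s_a[ty][k] * s_b[k][tx];
        __syncthreads();
    }

    // every thread reaches both barriers above; only in-range threads write a result
    if (row < width && col < width)
        ab[row * width + col] = result;
}

// Hypothetical launcher: d_a, d_b, d_ab are device pointers to width*width floats.
cudaError_t launch_mat_mul_any_width(const float *d_a, const float *d_b,
                                     float *d_ab, int width)
{
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    int tiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 grid(tiles, tiles);
    mat_mul_any_width<<<grid, block>>>(d_a, d_b, d_ab, width);
    return cudaGetLastError();
}
```

The guard is placed on the loads and on the final store rather than as an early return, so that no thread skips a `__syncthreads()` that the rest of its block executes.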
Use of Barriers in mat_mul
• Two barriers per phase:
  • __syncthreads after all data is loaded into __shared__ memory
  • __syncthreads after all data is read from __shared__ memory
  • Note that the second __syncthreads in phase p guards the load in phase p+1
• Use barriers to guard data
  • Guard against using uninitialized data
  • Guard against bashing live data
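The same two-barrier pattern can be seen in isolation in a smaller kernel. The sketch below is an illustrative example, not taken from the deck: the names `sum_all`, `run_sum_all`, and `CHUNK` are hypothetical, a single block of `CHUNK` threads is assumed, and `n` is assumed to be a multiple of `CHUNK`.

```cuda
#include <cuda_runtime.h>

#define CHUNK 64  // assumed block size; the kernel must be launched with CHUNK threads

// Every thread ends up with the sum of all n inputs (n assumed to be a multiple of CHUNK).
__global__ void sum_all(const float *in, float *out, int n)
{
    __shared__ float s_buf[CHUNK];
    float total = 0;

    for (int p = 0; p < n / CHUNK; ++p) {
        // each thread loads one element of the current chunk
        s_buf[threadIdx.x] = in[p * CHUNK + threadIdx.x];
        __syncthreads();  // barrier 1: s_buf is fully written before anyone reads it

        for (int k = 0; k < CHUNK; ++k)
            total += s_buf[k];
        __syncthreads();  // barrier 2: all reads finish before phase p+1 overwrites s_buf
    }
    out[threadIdx.x] = total;
}

// Hypothetical host helper: h_in holds n floats, h_out receives CHUNK floats.
cudaError_t run_sum_all(const float *h_in, float *h_out, int n)
{
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, CHUNK * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_all<<<1, CHUNK>>>(d_in, d_out, n);  // one block, so one shared buffer

    // the copy waits for the kernel and reports any earlier failure
    cudaError_t err = cudaMemcpy(h_out, d_out, CHUNK * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

Removing the first barrier lets a thread read entries of `s_buf` that have not been written yet. Removing the second lets a fast thread start the next phase and overwrite entries that slower threads are still reading.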
First Order Size Considerations
• Each thread block should have many threads
  • TILE_WIDTH = 16 → 16*16 = 256 threads
• There should be many thread blocks
  • 1024*1024 matrices → 64*64 = 4096 thread blocks
  • TILE_WIDTH = 16 → gives each SM 4 blocks, 1024 threads
  • Full occupancy
• In each phase, each thread block performs 2 * 256 = 512 x 4B loads for 256 * (2 * 16) = 8,192 fp ops (0.25 B/op)
  • Compare to 4 B/op for the first implementation
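The two bandwidth figures can be checked by hand. The working below assumes 4-byte floats, TILE_WIDTH = 16, and counts one multiply plus one add per inner-loop iteration:

```latex
\begin{align*}
\text{tiled, per block and phase:}\quad
  & \text{bytes loaded} = 2 \cdot 256 \cdot 4\,\mathrm{B} = 2048\,\mathrm{B} \\
  & \text{fp ops} = 256 \cdot (2 \cdot 16) = 8192 \\
  & \frac{2048\,\mathrm{B}}{8192\ \text{ops}} = 0.25\ \mathrm{B/op} \\[4pt]
\text{first implementation, per thread and iteration:}\quad
  & \text{bytes loaded} = 2 \cdot 4\,\mathrm{B} = 8\,\mathrm{B} \\
  & \text{fp ops} = 2 \\
  & \frac{8\,\mathrm{B}}{2\ \text{ops}} = 4\ \mathrm{B/op}
\end{align*}
```

The ratio between the two is 16, which equals TILE_WIDTH: each value loaded into __shared__ memory is reused by 16 threads.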
TILE_SIZE Effects
Memory Resources as Limit to Parallelism
• Effective use of different memory resources reduces the number of accesses to global memory
• These resources are finite!
  • The more memory locations each thread requires → the fewer threads an SM can accommodate
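| Resource          | Per GTX480 SM | Full Occupancy on GTX480                |
|-------------------|---------------|-----------------------------------------|
| Registers         | 32768         | <= 32768 / 1024 threads = 32 per thread |
| __shared__ Memory | 48KB          | <= 48KB / 8 blocks = 6KB per block      |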
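As a worked check against the table, the tiled mat_mul kernel can be compared with the 6KB-per-block budget. This is a sketch that uses only the figures above and the two tile declarations in the kernel, assuming TILE_WIDTH = 16 and 4-byte floats; T denotes a candidate tile width.

```latex
\begin{align*}
\text{\_\_shared\_\_ per block}
  &= 2 \cdot \mathrm{TILE\_WIDTH}^2 \cdot 4\,\mathrm{B}
   = 2 \cdot 16^2 \cdot 4\,\mathrm{B} = 2048\,\mathrm{B} = 2\,\mathrm{KB} \le 6\,\mathrm{KB} \\
2 \cdot T^2 \cdot 4\,\mathrm{B} \le 6144\,\mathrm{B}
  &\;\Rightarrow\; T^2 \le 768 \;\Rightarrow\; T \le 27
\end{align*}
```

Under these assumptions the 16 x 16 tiles fit the per-block budget with room to spare, while 32 x 32 tiles would need 8KB per block and so could not reach 8 blocks per SM.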