University of Reims Champagne-Ardenne, France
CUDA Training Day June 26, 2012
GPU Computing with CUDA
Gabriel Noaje gabriel.noaje@univ-reims.fr
INTRODUCTION TO MASSIVELY PARALLEL COMPUTING
Moore’s Law (paraphrased)
“The number of transistors on an integrated circuit doubles every two years.”
– Gordon E. Moore
Moore’s Law (visualized)
Credits: Wikimedia
Serial Performance Scaling is Over
• Cannot continue to scale processor frequencies – no 10 GHz chips
• Cannot continue to increase power consumption – can't melt the chip
• Can continue to increase transistor density – as per Moore's Law
How to Use Transistors?
• Instruction-level parallelism – out-of-order execution, speculation, … – vanishing opportunities in a power-constrained world
• Data-level parallelism – vector units, SIMD execution, … – increasing: SSE, AVX, Cell SPE, ClearSpeed, GPU
• Thread-level parallelism – increasing: multithreading, multicore, manycore – Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …
Why Massively Parallel Processing?
• A quiet revolution and potential build-up
  – Computation: TFLOPs vs. 100 GFLOPs
• GPU in every PC – massive volume & potential impact
[Chart: peak computation over time — GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3 GHz Dual-Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
Why Massively Parallel Processing?
• A quiet revolution and potential build-up
  – Bandwidth: ~10x
• GPU in every PC – massive volume & potential impact
[Chart: memory bandwidth over time — GPUs (NV30, NV40, G70, G80, GT200, T12) vs. CPUs (3 GHz Dual-Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
KEPLER ARCHITECTURE
Kepler GK110 Block Diagram
• 7.1B transistors
• 15 SMX units
• > 1 TFLOP FP64
• 1.5 MB L2 cache
• 384-bit GDDR5
• PCI Express Gen3
Kepler GK110 SMX vs Fermi SM
SMX: Efficient Performance
• 192 CUDA cores
• 64 double-precision (FP64) units
• 32 Special Function Units
• 32 load/store units for memory access
• 65,536 registers
• 64 KB shared memory / L1 cache
• 48 KB read-only cache
[Diagram: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the Device; each grid contains blocks such as Block (0,0) … Block (1,1), and each block contains threads indexed in up to three dimensions, e.g. Thread (0,0,0) … Thread (3,1,0)]
Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
The “New” Moore’s Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution
  – Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores, …
  – You will always have more data than cores – build the computation around the data
Generic Multicore Chip
[Diagram: a few processors, each with nearby memory, sharing a global memory]
• Handful of processors, each supporting ~1 hardware thread
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)
Generic Manycore Chip
[Diagram: many processors, each with nearby memory, sharing a global memory]
• Many processors, each supporting many hardware threads
• On-chip memory near processors (cache, RAM, or both)
• Shared global memory space (external DRAM)
Enter GPU Computing
• Massive economies of scale
• Massively parallel
[Images: 3D animation, movie playback, CAD (Computer-Aided Design), games]
GPU Evolution
• High-throughput computation – GeForce GTX 280: 933 GFLOP/s
• High-bandwidth memory – GeForce GTX 280: 140 GB/s
• High availability to all – 180+ million CUDA-capable GPUs in the wild
[Timeline 1995–2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]
Why is this different from a CPU?
• Different goals produce different designs
  – GPU assumes the workload is highly parallel
  – CPU must be good at everything, parallel or not
• CPU: minimize latency experienced by one thread
  – big on-chip caches
  – sophisticated control logic
• GPU: maximize throughput of all threads
  – # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  – multithreading can hide latency => skip the big caches
  – share control logic across many threads
[Example GPU-computing applications: financial analysis, molecular docking, medical MRI, geological modeling, weather simulation, computer virus detection]
• CUDA = "Compute Unified Device Architecture"
• extends the ANSI C standard
• gentle learning curve (compared to Cg, HLSL, etc.)
• opens the underlying architecture to the user
www.nvidia.com/getcuda (driver + toolkit + SDK + docs + …)

Initially – use graphics calls:
• Cg by NVIDIA
• HLSL by Microsoft
Nowadays – a rich GPU programming ecosystem:
• ATI Stream by AMD
• CUDA by NVIDIA (NVIDIA hardware specific)
• OpenCL by Khronos Group
• DirectCompute by Microsoft
CUDA: Scalable parallel programming
• Augment C/C++ with minimalist abstractions
  – let programmers focus on parallel algorithms
  – not on the mechanics of a parallel programming language
• Provide a straightforward mapping onto hardware
  – good fit to GPU architecture
  – maps well to multi-core CPUs too
• Scale to 100s of cores & 10,000s of parallel threads
  – GPU threads are lightweight – create/switch is essentially free
  – GPU needs 1000s of threads for full utilization
Key Parallel Abstractions in CUDA
• Hierarchy of concurrent threads
• Lightweight synchronization primitives
• Shared memory model for cooperating threads
Hierarchy of concurrent threads
• Parallel kernels are composed of many threads
  – all threads execute the same sequential program
• Threads are grouped into thread blocks
  – threads in the same block can cooperate
• Threads/blocks have unique IDs
[Diagram: Thread t; Block b containing threads t0, t1, …, tB]
CUDA Model of Parallelism
• CUDA virtualizes the physical hardware
  – a thread is a virtualized scalar processor (registers, PC, state)
  – a block is a virtualized multiprocessor (threads, shared memory)
• Scheduled onto physical hardware without pre-emption
  – threads/blocks launch & run to completion
  – blocks should be independent
[Diagram: blocks with per-block memory sharing a global memory]
Heterogeneous Computing
[Diagram: a multicore CPU working alongside the manycore GPU]
C for CUDA
• Philosophy: provide the minimal set of extensions necessary to expose the GPU's power

• Function qualifiers:
  __global__ void my_kernel() { }
  __device__ float my_device_func() { }

• Variable qualifiers:
  __constant__ float my_constant_array[32];
  __shared__   float my_shared_array[32];

• Execution configuration:
  dim3 grid_dim(100, 50);   // 5000 thread blocks
  dim3 block_dim(4, 8, 8);  // 256 threads per block
  my_kernel<<<grid_dim, block_dim>>>(...);  // launch the kernel

• Built-in variables and functions valid in device code:
  dim3 gridDim;          // grid dimension
  dim3 blockDim;         // block dimension
  dim3 blockIdx;         // block index
  dim3 threadIdx;        // thread index
  void __syncthreads();  // thread synchronization
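To see these pieces together, here is a minimal hedged sketch (hypothetical names, not from the slides) that defines a __device__ helper, calls it from a __global__ kernel using the built-in index variables, and launches the kernel with an execution configuration:

// a __device__ helper: callable from device code only
__device__ float scale(float x)
{
    return 2.0f * x;
}

__global__ void scale_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in variables
    data[i] = scale(data[i]);
}

// Host-side launch: 100 blocks of 256 threads (hypothetical sizes)
// dim3 grid_dim(100);
// dim3 block_dim(256);
// scale_kernel<<<grid_dim, block_dim>>>(d_data);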
CUDA PROGRAMMING BASICS
Outline of CUDA Basics
• Basic kernels and execution on the GPU
• Basic memory management
• Coordinating CPU and GPU execution
• See the Programming Guide for the full API
CUDA Programming Model
• Parallel code (kernel) is launched and executed on a device by many threads
• Launches are hierarchical
  – Threads are grouped into blocks
  – Blocks are grouped into grids
• Familiar serial code is written for a single thread
  – Each thread is free to execute a unique code path
  – Built-in thread and block ID variables
High Level View
[Diagram: GPU with several SMs (each with SMEM) and global memory, connected to the CPU and chipset over PCIe]
Blocks of threads run on an SM
[Diagram: a thread maps onto a streaming processor with its registers and per-thread memory; a thread block maps onto a streaming multiprocessor with its per-block shared memory (SMEM)]
Whole grid runs on GPU
[Diagram: many blocks of threads spread across the SMs (each with SMEM), all sharing global memory]
Thread Hierarchy
• Threads launched for a parallel section are partitioned into thread blocks
  – Grid = all blocks for a given launch
• A thread block is a group of threads that can:
  – Synchronize their execution
  – Communicate via shared memory
Memory Model
[Diagram: sequential kernels (Kernel 0, then Kernel 1, …) all read and write the same per-device global memory]
Memory Model
[Diagram: host memory and per-device memories (Device 0 memory, Device 1 memory); data moves between them with cudaMemcpy()]
Example: Vector Addition Kernel

// Device code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …, *h_C = …;  // h_C starts empty

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
Example: Host code for vecAdd (2)
// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// do something with the result…

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
IDs and Dimensions
• Threads: 3D IDs, unique within a block
• Blocks: 2D IDs, unique within a grid
• Dimensions are set at launch time and can be different for each grid
• Built-in variables: threadIdx, blockIdx, blockDim, gridDim
[Diagram: the Device runs Grid 1 containing Blocks (0,0) … (2,1); Block (1,1) is expanded to show its threads, Thread (0,0) … Thread (4,2), arranged in two dimensions]
Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
// Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
// Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
// Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
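For reference, a hedged sketch of host code that could produce the outputs above — 16 ints written by a launch of 4 blocks of 4 threads, then copied back and printed (allocation sizes and names are assumptions, not from the slides):

#include <stdio.h>

int main()
{
    const int n = 16;                       // matches the 16 values printed above
    const int num_bytes = n * sizeof(int);

    int *d_a = 0;
    int h_a[16];

    cudaMalloc( (void**)&d_a, num_bytes );

    kernel<<<4, 4>>>( d_a );                // 4 blocks of 4 threads: any of the
                                            // three kernel variations above

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i = 0; i < n; i++)
        printf("%d ", h_a[i]);
    printf("\n");

    cudaFree( d_a );
    return 0;
}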
Code executed on the GPU
• C/C++ with some restrictions:
  – Can only access GPU memory
  – No variable number of arguments
  – No static variables
  – No recursion
  – No dynamic polymorphism
• Must be declared with a qualifier:
  – __global__ : launched by the CPU, cannot be called from the GPU, must return void
  – __device__ : called from other GPU functions, cannot be called by the CPU
  – __host__ : can be called by the CPU
  – __host__ and __device__ qualifiers can be combined
• A __host__ __device__ function is compiled for both host and device
  – Sample use: overloading operators
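As an illustration of the combined qualifiers, a minimal hedged sketch (hypothetical function, not from the slides) of a function compiled for both host and device:

// clamp01() can be called from ordinary CPU code and from kernels alike
__host__ __device__ float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

__global__ void clamp_kernel(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = clamp01(a[i]);      // device-side call
}

// ... and on the host:
// float y = clamp01(1.7f);    // returns 1.0f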
Memory Spaces
• CPU and GPU have separate memory spaces
  – Data is moved across the PCIe bus
  – Use functions to allocate/set/copy memory on the GPU
  – Very similar to the corresponding C functions
• Pointers are just addresses
  – Can't tell from the pointer value whether the address is on the CPU or the GPU
  – Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice versa
CUDA Device Memory Model
• Device code can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read-only per-grid constant memory
• Host code can:
  – Transfer data to/from per-grid global and constant memories
GPU Memory Allocation / Release
• Host (CPU) manages device (GPU) memory:
  – cudaMalloc(void **pointer, size_t nbytes)
  – cudaMemset(void *pointer, int value, size_t count)
  – cudaFree(void *pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
Data Copies
• cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
  – returns after the copy is complete
  – blocks the CPU thread until all bytes have been copied
  – doesn't start copying until previous CUDA calls complete
• enum cudaMemcpyKind
  – cudaMemcpyHostToDevice
  – cudaMemcpyDeviceToHost
  – cudaMemcpyDeviceToDevice
• Non-blocking copies are also available
  – cudaMemcpyAsync (see the sketch below)
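A hedged sketch of a non-blocking copy with cudaMemcpyAsync (buffer names and sizes are assumptions); asynchronous copies need page-locked (pinned) host memory, allocated here with cudaMallocHost, to actually overlap with other work:

float *h_buf = 0, *d_buf = 0;
size_t nbytes = 1 << 20;

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMallocHost( (void**)&h_buf, nbytes );   // pinned host allocation
cudaMalloc( (void**)&d_buf, nbytes );

cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
// ... the CPU is free to do other work here ...
cudaStreamSynchronize(stream);              // wait for the copy to finish

cudaFreeHost(h_buf);
cudaFree(d_buf);
cudaStreamDestroy(stream);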
Code Walkthrough 1
// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0;  // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Example: Shuffling Data
// Reorder values based on keys
// Each thread moves one element
__global__ void shuffle(int* prev_array, int* new_array, int* indices)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    new_array[i] = prev_array[indices[i]];
}

// Host code
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}
Kernel with 2D Indexing
__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0;  // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}
Blocks must be independent
• Any possible interleaving of blocks should be valid
  – presumed to run to completion without pre-emption
  – can run in any order
  – can run concurrently OR sequentially
• A thread block is a batch of threads that can cooperate with each other by:
  – synchronizing their execution: __syncthreads()
  – sharing data through shared memory
• The independence requirement gives scalability
CUDA MEMORIES
Hardware Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read-only per-grid constant memory
[Diagram: a grid of blocks; each block has shared memory and per-thread registers; all blocks access global memory; the host reads/writes the per-grid global and constant memories]
CUDA Variable Type Qualifiers
• "Automatic" scalar variables without a qualifier reside in a register
  – the compiler may spill them to thread-local memory
• "Automatic" array variables without a qualifier reside in thread-local memory

Variable declaration              Memory     Scope    Lifetime
int var;                          register   thread   thread
int array_var[10];                local      thread   thread
__shared__ int shared_var;        shared     block    block
__device__ int global_var;        global     grid     application
__constant__ int constant_var;    constant   grid     application
CUDA Variable Type Performance
• Scalar variables reside in fast, on-chip registers
• Shared variables reside in fast, on-chip memories
• Thread-local arrays & global variables reside in uncached off-chip memory
• Constant variables reside in cached off-chip memory

Variable declaration              Memory     Penalty
int var;                          register   1x
int array_var[10];                local      100x
__shared__ int shared_var;        shared     1x
__device__ int global_var;        global     100x
__constant__ int constant_var;    constant   1x
Where to declare variables?
• Can the host access it?
  – Yes → declare outside of any function:
    __constant__ int constant_var;
    __device__ int global_var;
  – No → declare in the kernel:
    int var;
    int array_var[10];
    __shared__ int shared_var;
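Putting the declaration rules together, a small hedged sketch (hypothetical kernel, not from the slides) showing where each kind of variable is declared:

// Host-accessible variables: declared outside of any function
__constant__ float coeffs[4];   // set from the host, e.g. with cudaMemcpyToSymbol
__device__   float global_var;  // per-grid global variable

__global__ void example_kernel(float *out)
{
    // Device-only variables: declared in the kernel
    __shared__ float tile[256];  // per-block shared memory (assumes blockDim.x <= 256)
    float x = 0.0f;              // automatic scalar: lives in a register
    float scratch[4];            // automatic array: thread-local memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = out[i];
    __syncthreads();

    for(int k = 0; k < 4; ++k)
    {
        scratch[k] = coeffs[k] * tile[threadIdx.x];
        x += scratch[k];
    }

    out[i] = x + global_var;
}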
Example – thread-local variables
// motivate per-thread variables with
// a Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
    // p goes in a register
    float2 p = ps[threadIdx.x];

    // per-thread heap goes in off-chip local memory
    float2 heap[10];

    // read through num_qs points, maintaining
    // the nearest 10 qs to p in the heap
    ...

    // write out the contents of heap to result
    ...
}
Example – shared variables
// motivate shared variables with
// an Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // each thread loads two elements from global memory
        // How many times does this kernel load input[i]?
        //   once by thread i, and again by thread i+1
        // Idea: eliminate the redundancy by sharing data
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}
Example – shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    // shorthand for threadIdx.x
    int tx = threadIdx.x;

    // allocate a __shared__ array, one element per thread
    __shared__ int s_data[BLOCK_SIZE];

    // each thread reads one element into s_data
    unsigned int i = blockDim.x * blockIdx.x + tx;
    s_data[tx] = input[i];

    // avoid race condition: ensure all loads
    // complete before continuing
    __syncthreads();

    ...
}
Example – shared variables
// optimized version of adjacent difference (continued)
__global__ void adj_diff(int *result, int *input)
{
    ...
    if(tx > 0)
        result[i] = s_data[tx] - s_data[tx-1];
    else if(i > 0)
    {
        // handle thread block boundary
        result[i] = s_data[tx] - input[i-1];
    }
}
Example – shared variables
// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
    // use extern to indicate a __shared__ array will be
    // allocated dynamically at kernel launch time
    extern __shared__ int s_data[];
    ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);
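For reference, a minimal assembled sketch combining the fragments above into one complete kernel (static __shared__ version; BLOCK_SIZE is assumed to be a compile-time constant matching the launch configuration):

#define BLOCK_SIZE 256   // assumed block size; must match the launch

__global__ void adj_diff_full(int *result, int *input)
{
    int tx = threadIdx.x;
    __shared__ int s_data[BLOCK_SIZE];

    unsigned int i = blockDim.x * blockIdx.x + tx;
    s_data[tx] = input[i];          // one global load per thread
    __syncthreads();                // all loads complete before any read

    if(tx > 0)
        result[i] = s_data[tx] - s_data[tx-1];
    else if(i > 0)
        result[i] = s_data[tx] - input[i-1];   // block boundary: reread from global

}

// Hypothetical launch, assuming n is a multiple of BLOCK_SIZE:
// adj_diff_full<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_result, d_input);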
A Common Programming Strategy
• Partition data into subsets that fit into shared memory
• Handle each data subset with one thread block
• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
• Perform the computation on the subset from shared memory
• Copy the result from shared memory back to global memory

A Common Programming Strategy
• Carefully partition data according to access patterns
  – Read-only → __constant__ memory (fast)
  – R/W & shared within a block → __shared__ memory (fast)
  – R/W within each thread → registers (fast)
  – Indexed R/W within each thread → local memory (slow)
  – R/W inputs/results → cudaMalloc'ed global memory (slow)
Communication Through Memory
• Question: what is the value of my_shared_variable at the end of this kernel?

__global__ void race(void)
{
    __shared__ int my_shared_variable;
    my_shared_variable = threadIdx.x;

    // what is the value of
    // my_shared_variable?
}
Communication Through Memory
• This is a race condition
  – The result is undefined
  – The order in which threads access the variable is undefined without explicit coordination
  – Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics
Communication Through Memory
• Use __syncthreads to ensure data is ready for access

__global__ void share_data(int *input)
{
    __shared__ int data[BLOCK_SIZE];
    data[threadIdx.x] = input[threadIdx.x];
    __syncthreads();

    // the state of the entire data array
    // is now well-defined for all threads
    // in this block
}
Communication Through Memory
• Use atomic operations to ensure exclusive access to a variable

// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
    atomicAdd(result, input[threadIdx.x]);

    // after this kernel exits, the value of
    // *result will be the sum of the input
}
Hierarchical Atomics
• Divide & conquer (see the sketch below)
  – Per-thread atomicAdd into a __shared__ partial sum
  – Per-block atomicAdd of the partial sum into the total sum
[Diagram: per-block partial sums Σ0, Σ1, …, Σi combine into the total sum Σ]
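A minimal sketch of this two-level scheme (integer data; *total assumed initialized to 0 before the launch):

__global__ void hierarchical_sum(int *input, int *total)
{
    __shared__ int partial;                // per-block partial sum

    if(threadIdx.x == 0)
        partial = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&partial, input[i]);         // per-thread atomic into shared memory
    __syncthreads();

    if(threadIdx.x == 0)
        atomicAdd(total, partial);         // one global atomic per block
}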
SMX EXECUTION
How an SM executes threads
• Overview of how a Streaming Multiprocessor works
• SIMT execution
• Divergence
Scheduling Blocks onto SMs
[Diagram: thread blocks 5, 27, 61, …, 2001 waiting to be scheduled onto a Streaming Multiprocessor]
• HW schedules thread blocks onto available SMs
  – No guarantee of ordering among thread blocks
  – HW schedules a new thread block as soon as a previous thread block finishes
Warps
• Each thread block is executed as 32-thread warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 × 3 = 24 warps
Mapping of Thread Blocks
• Each thread block is mapped to one or more warps
• The hardware schedules each warp independently
[Diagram: Thread Block N (128 threads) split into warps TB N W1 … TB N W4]
Thread Scheduling Example
• The SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by the SM
  – Warps whose next instruction has its inputs ready for consumption are eligible for execution
  – Eligible warps are selected for execution based on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when it is selected
[Timeline (TB = Thread Block, W = Warp): TB1 W1 issues instructions until it stalls, then TB2 W1, TB3 W1, TB3 W2, … are scheduled, with stalled warps resuming as soon as their inputs become ready]
Control Flow Divergence
• What happens if you have the following code?

if(foo(threadIdx.x))
{
    do_A();
}
else
{
    do_B();
}

[Diagram: at the branch, the threads of a warp taking Path A and the threads taking Path B are executed serially]
Control Flow Divergence
• Nested branches are handled as well

if(foo(threadIdx.x))
{
    if(bar(threadIdx.x))
        do_A();
    else
        do_B();
}
else
    do_C();

[Diagram: nested branches serialize Path A, Path B, and Path C within the warp]
Control Flow Divergence
• You don't have to worry about divergence for correctness
• You might have to think about it for performance
  – Depends on your branch conditions

Control Flow Divergence
• Performance drops off with the degree of divergence (see the plot and sketch below)

switch(threadIdx.x % N)
{
    case 0:
        ...
    case 1:
        ...
}
[Plot: performance vs. degree of divergence (0 to 18 divergent paths); performance falls steadily as divergence grows]
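Since a warp serializes only when its own threads disagree on a branch, one common mitigation — shown here as a hedged sketch, assuming the 32-thread warp size above — is to make the branch condition uniform within each warp:

__global__ void per_warp_branch(float *a)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;   // all 32 threads of a warp agree on this value

    if(warp % 2 == 0)
        a[i] *= 2.0f;              // the whole warp takes path A
    else
        a[i] += 1.0f;              // the whole warp takes path B
}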
Compiling a CUDA program
[Diagram: nvcc splits CUDA source into .c/.cpp host code and .gpu device code; these are compiled into an object file / fat binary, linked against the C/C++ and CUDA libraries, and produce an executable with embedded GPU code]
CUDA Makefile
CC       = nvcc
CUDA_DIR = /opt/cuda/4.1
SDK_DIR  = ${CUDA_DIR}/sdk
CFLAGS   = -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc
LDFLAGS  = -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib
LIB      = -lm -lrt
SOURCES  = vector_add.cu
EXECNAME = vector_add

all:
	$(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

clean:
	$(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)
	rm -f *.o core
OPENACC API
3 Ways to Accelerate Applications
[Diagram: three paths from Applications — Libraries ("drop-in" acceleration), OpenACC Directives (easily accelerate applications), and Programming Languages (maximum flexibility)]
OpenACC Directives
Program myscience
    ... serial code ...
    !$acc kernels
    do k = 1,n1
        do i = 1,n2
            ... parallel code ...
        enddo
    enddo
    !$acc end kernels
    ...
End Program myscience

• Your original Fortran or C code, plus simple compiler hints (OpenACC directives)
• The OpenACC compiler parallelizes the hinted code for the GPU
• Works on many-core GPUs & multicore CPUs
OpenACC: Open Programming Standard for Parallel Computing
• Easy: directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU directives allow complete access to the massive parallel power of a GPU
OpenACC: The Standard for GPU Directives
• High-level, with low-level access: compiler directives specify parallel regions in C, C++, and Fortran
• OpenACC compilers offload parallel regions from the host to an accelerator; portable across OSes, host CPUs, accelerators, and compilers
• Create high-level heterogeneous programs without explicit accelerator initialization and without explicit data or program transfers between host and accelerator
• The programming model allows programmers to start simple, then enhance with additional guidance for the compiler on loop mappings, data location, and other performance details (see the sketch below)
• Compatible with other GPU languages and libraries: interoperates with CUDA C/Fortran and GPU libraries, e.g. CUFFT, CUBLAS, CUSPARSE, etc.
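Building on the "additional guidance" point above, a hedged sketch (not from the slides) of adding data-location guidance to the C SAXPY example: an OpenACC data region keeps x and y on the accelerator across two kernels regions instead of transferring them at each one.

/* Hypothetical example: x and y are copied once for the whole region. */
void saxpy_twice(int n, float a, float *x, float *restrict y)
{
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
}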
OpenACC Specification and Website
• Full OpenACC 1.0 specification available online: http://www.openacc-standard.org
• Quick reference card also available
• Available compilers:
  – CAPS HMPP OpenACC (soon on ROMEO)
  – PGI OpenACC (soon on ROMEO)
A Very Simple Exercise: SAXPY

SAXPY in Fortran:
subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    !$acc kernels
    do i = 1, n
        y(i) = a*x(i) + y(i)
    enddo
    !$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

SAXPY in C:
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
Directive Syntax
• Fortran
  !$acc directive [clause [,] clause] …]
  Often paired with a matching end directive surrounding a structured code block:
  !$acc end directive
• C
  #pragma acc directive [clause [,] clause] …]
  Often followed by a structured code block
CUDA = much more
• Atomic and shuffle operations
• Streams
• GPU Direct
• CUBLAS, NPP, CUFFT, …
• OpenCL
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
References
• www.nvidia.com/cuda (CUDA Zone, developer zone)
• Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu
• CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot
• CUDA Application Design and Development, by Rob Farber
Questions?
Matrix Multiplication Example
• Generalize the adjacent_difference example
• AB = A * B
  – Each element ABij = dot(row(A,i), col(B,j))
• Parallelization strategy
  – Thread → ABij
  – 2D kernel
First Implementation
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // calculate the row & col index of the element
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float result = 0;

    // do dot product between row of a and col of b
    for(int k = 0; k < width; ++k)
        result += a[row*width+k] * b[k*width+col];

    ab[row*width+col] = result;
}
Idea: Use __shared__ memory to reuse global data
• Each input element is read by width threads
• Load each element into __shared__ memory and have several threads use the local copy to reduce the memory bandwidth
Tiled Multiply
• Partition the kernel loop into phases
• Load a tile of both matrices into __shared__ memory each phase
• Each phase, each thread computes a partial result
[Diagram: TILE_WIDTH × TILE_WIDTH tiles of A and B combine to produce one tile of AB]
Better Implementation
__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // shorthand
    int tx = threadIdx.x, ty = threadIdx.y;
    int bx = blockIdx.x,  by = blockIdx.y;

    // allocate tiles in __shared__ memory
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    // calculate the row & col index
    int row = by*blockDim.y + ty;
    int col = bx*blockDim.x + tx;

    float result = 0;
    // loop over the tiles of the input in phases
    for(int p = 0; p < width/TILE_WIDTH; ++p)
    {
        // collaboratively load tiles into __shared__
        s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
        s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
        __syncthreads();

        // dot product between row of s_a and col of s_b
        for(int k = 0; k < TILE_WIDTH; ++k)
            result += s_a[ty][k] * s_b[k][tx];
        __syncthreads();
    }

    ab[row*width+col] = result;
}
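A matching launch for the tiled kernel, as a hedged sketch assuming width is a multiple of TILE_WIDTH (e.g., TILE_WIDTH = 16, as discussed next):

// one thread per element of an output tile
dim3 block(TILE_WIDTH, TILE_WIDTH);
dim3 grid(width / TILE_WIDTH, width / TILE_WIDTH);
mat_mul<<<grid, block>>>(d_a, d_b, d_ab, width);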
Use of Barriers in mat_mul
• Two barriers per phase:
  – __syncthreads after all data is loaded into __shared__ memory
  – __syncthreads after all data is read from __shared__ memory
  – Note that the second __syncthreads in phase p guards the load in phase p+1
• Use barriers to guard data:
  – Guard against using uninitialized data
  – Guard against clobbering live data
First Order Size Considerations
• Each thread block should have many threads
  – TILE_WIDTH = 16 → 16*16 = 256 threads
• There should be many thread blocks
  – 1024*1024 matrices → 64*64 = 4096 thread blocks
  – TILE_WIDTH = 16 gives each SM 4 blocks = 1024 threads → full occupancy
• Each thread block performs 2 * 256 = 512 loads of 4 B each for 256 * (2 * 16) = 8,192 fp ops (0.25 B/op)
  – Compare to 4 B/op without tiling
TILE_SIZE Effects
Memory Resources as Limit to Parallelism
• Effective use of the different memory resources reduces the number of accesses to global memory
• These resources are finite!
  – The more memory locations each thread requires, the fewer threads an SM can accommodate

Resource             Per GTX 480 SM   For full occupancy on GTX 480
Registers            32768            <= 32768 / 1024 threads = 32 per thread
__shared__ memory    48 KB            <= 48 KB / 8 blocks = 6 KB per block