CUDA Materials
TRANSCRIPT
CUDA Teaching Center, Department of Computer Applications, NIT, Trichy
Course Outline
- GPU Architecture Overview (this module)
- Tools of the Trade
- Introduction to CUDA C
- Patterns of Parallel Computing
- Thread Cooperation and Synchronization
- The Many Types of Memory
- Atomic Operations
- Events and Streams
- CUDA in Advanced Scenarios
Note: to follow this course, you require a CUDA-capable GPU.
History of GPU Computation
- No GPU: the CPU handled graphics output
- Dedicated GPU: separate graphics card (PCI, AGP)
- Programmable GPU: shaders
Shaders
- Small programs that run on the GPU
- Types
  - Vertex (used to calculate the location of a vertex)
  - Pixel (used to calculate the color components of a single pixel)
- Shader languages
  - High Level Shader Language (HLSL, Microsoft DirectX)
  - OpenGL Shading Language (GLSL, OpenGL)
  - Both are C-like
  - Both are intended for graphics (i.e., not general-purpose)
- Pixel shaders used for math
  - Convert data to a texture
  - Run the texture through a pixel shader
  - Get the result texture and convert it back to data
Why GPGPU?
- General-Purpose computation on GPUs
- Highly parallel architecture
  - Lots of concurrent threads of execution (SIMT)
  - Higher throughput compared to CPUs
    - Even taking into account many cores, hyperthreading, SIMD
    - Thus more FLOPS (floating-point operations per second)
- Commodity hardware
  - Commonly available (mainly used by gamers)
  - Relatively cheap compared to custom solutions (e.g., FPGAs)
- Sensible programming model
  - Manufacturers realized GPGPU market potential
  - Graphics made optional
  - NVIDIA offers the dedicated GPU platform "Tesla"
    - No output connections
GPGPU Frameworks
- Compute Unified Device Architecture (CUDA)
  - Developed by NVIDIA Corporation
  - Extensions to programming languages (C/C++)
  - Wrappers for other languages/platforms (e.g., FORTRAN, PyCUDA, MATLAB)
- Open Computing Language (OpenCL)
  - Supported by many manufacturers (incl. NVIDIA)
  - The high-level way to perform computation on ATI devices
- C++ Accelerated Massive Parallelism (C++ AMP)
  - C++ superset
  - A standard by Microsoft, part of MSVC++
  - Supports both ATI and NVIDIA
- Other frameworks and cross-compilers
  - Alea, Aparapi, Brook, etc.
Graphics Processor Architecture
- Warning: NVIDIA terminology ahead
- Streaming Multiprocessor (SM)
  - Contains several CUDA cores
  - Can have more than one SM per card
- CUDA Core (a.k.a. Streaming Processor, Shader Unit)
  - Number of cores per SM is tied to compute capability
- Different types of memory
  - Means of access
  - Performance characteristics
Graphics Processor Architecture
[Diagram: two SMs (SM1, SM2), each containing streaming processors (SP1-SP4), registers, and shared memory; device memory, constant memory, and texture memory sit outside the SMs]
Compute Capability
- A number indicating what the card can do
  - Current range: 1.0, 1.x, 2.x, 3.0, 3.5
- Affects both hardware and API support
  - Number of CUDA cores per SM
  - Max number of 32-bit registers per SM
  - Max number of instructions per kernel
  - Support for double-precision ops
  - ... and many other parameters
- Higher is better
- See http://en.wikipedia.org/wiki/CUDA
Choosing a Graphics Card
- Look at performance specs (peak FLOPS)
  - Pay attention to single vs. double precision
  - E.g., http://www.nvidia.com/object/tesla-servers.html
- Number of GPUs
- Compute capability/architecture
  - COTS cards are cheaper than dedicated compute cards
- Memory size
- Ensure the PSU/motherboard is good enough to handle the card(s)
- Can have more than one graphics card
  - YMMV (PCI saturation)
  - Can mix architectures (e.g., NVIDIA + ATI)
Overview
- Windows, Mac OS, Linux
  - This course uses Windows/Visual Studio
  - This course focuses on desktop GPU development
- CUDA Toolkit
  - LLVM-based compiler
  - Headers & libraries
  - Documentation
  - Samples
- Nsight
  - Visual Studio: plugin that allows debugging
  - Eclipse IDE
- http://developer.nvidia.com/cuda
CUDA Tools
What is Nsight?
- Many developers, few GPUs
  - No need for each dev to have a GPU
- Client-server architecture
  - Server with GPU (target)
    - Runs the Nsight service
    - Accepts connections from developers
  - Clients with Visual Studio + Nsight plugin (host)
    - Use Nsight to connect to the server to run the application and debug GPU code
- Caveats
  - Need to use VS Remote Debugging to debug CPU code
  - Need your own mechanism for syncing output
[Diagram: Visual Studio with the Nsight plugin (host) connects to the Nsight service running on the target machine with the GPU(s)]
Overview
- Compilation Process
- Obligatory "Hello CUDA" demo
- Location Qualifiers
- Execution Model
- Grid and Block Dimensions
- Error Handling
- Device Introspection
NVIDIA CUDA Compiler (nvcc)
- nvcc is used to compile CUDA-specific code
  - Not a full compiler itself: it drives the host C or C++ compiler (MSVC, GCC)
  - Some aspects are C++ specific
- Splits code into GPU and non-GPU parts
  - Host code is passed to the native compiler
- Accepts project-defined GPU-specific settings
  - E.g., compute capability
- Translates code written in CUDA C into PTX
  - The graphics driver turns PTX into binary code
NVCC Compilation Process
[Diagram: nvcc takes a source file containing __global__ void a(){} and void b(){}; the GPU part (a) is compiled by nvcc itself, while the host part (b) is passed to the host compiler (e.g., MSVC's cl.exe); the results are linked into a single executable. A minimal example follows.]
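To make the split concrete, here is a minimal "Hello CUDA" sketch: one .cu file containing both a kernel (the GPU part) and host code (the CPU part). The file name, launch configuration, and nvcc invocation are illustrative, not from the course materials.

// hello.cu -- compile with, e.g.: nvcc hello.cu -o hello (illustrative invocation)
#include <cstdio>

// GPU part: nvcc compiles this to PTX
__global__ void hello()
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Host part: passed through to the host compiler (e.g., cl.exe or gcc)
int main()
{
    hello<<<2, 4>>>();          // 2 blocks of 4 threads each
    cudaDeviceSynchronize();    // wait for the kernel (and its printf) to finish
    return 0;
}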
Parallel Thread Execution (PTX)
- PTX is the "assembly language" of CUDA
  - Similar to .NET IL or Java bytecode
  - Low-level GPU instructions
- Can be generated from a project
- Typically useful to compiler writers
  - E.g., GPU Ocelot https://code.google.com/p/gpuocelot/
- Inline PTX (asm)
Location Qualifiers
- __global__: defines a kernel
  - Runs on the GPU, called from the CPU
  - Executed with <<<dim3>>> arguments
- __device__: runs on the GPU, called from the GPU
  - Can be used for variables too
- __host__: runs on the CPU, called from the CPU
- Qualifiers can be mixed (see the sketch below)
  - E.g., __host__ __device__ foo()
  - Code compiled for both CPU and GPU
  - Useful for testing
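A short sketch of how the qualifiers combine; the function names are illustrative:

// Callable from both host and device; compiled for both (handy for testing)
__host__ __device__ float square(float x) { return x * x; }

// Device-only helper: runs on the GPU, callable only from GPU code
__device__ float scaled(float x) { return 2.0f * square(x); }

// Kernel: runs on the GPU, launched from the CPU with <<<...>>>
__global__ void apply(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaled(data[i]);
}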
Execution Model
[Diagram: doSomething<<<2,3>>>() launches a grid of 2 thread blocks with 3 threads each; doSomethingElse<<<4,2>>>() launches a grid of 4 blocks of 2 threads. Each launch forms a grid, made of thread blocks, made of threads.]
Execution Model
- Thread blocks are scheduled to run on available SMs
- Each SM executes one block at a time
- A thread block is divided into warps
  - Number of threads per warp depends on compute capability
- All warps are handled in parallel
- CUDA Warp Watch
Dimensions
- We defined execution as <<<a,b>>>
  - A grid of a blocks of b threads each
  - The grid and each block are 1D structures
- In reality, these constructs are 3D
  - A 3-dimensional grid of 3-dimensional blocks
  - You can define (a×b×c) blocks of (x×y×z) threads
  - Can have 2D or 1D by setting extra dimensions to 1
- Defined in the dim3 structure (see the sketch below)
  - Simple container with x, y and z values
  - Some constructors defined for C++
  - Automatic conversion: <<<a,b>>> means (a,1,1) by (b,1,1)
[Diagram: hierarchy of grid, blocks within the grid, and threads within each block]
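A minimal sketch of a 2D launch with dim3; the kernel name and block size are illustrative:

__global__ void kernel2d(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;    // each thread handles one element
}

void launch(float* d_out, int width, int height)
{
    dim3 threads(16, 16);   // 2D block: 16x16 = 256 threads (z defaults to 1)
    dim3 blocks((width  + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);  // enough blocks to cover
    kernel2d<<<blocks, threads>>>(d_out, width, height);
}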
Thread Variables
- Execution parameters & current position
- blockIdx
  - Where we are in the grid
- gridDim
  - The size of the grid
- threadIdx
  - Position of the current thread in its thread block
- blockDim
  - Size of the thread block
- Limitations
  - Grid & block sizes
  - Number of threads
Error Handling
- CUDA does not throw
  - Silent failure
- Core functions return cudaError_t (see the sketch below)
  - Can check against cudaSuccess
  - Get a description with cudaGetErrorString()
- Libraries may have different error types
  - E.g., cuRAND has curandStatus_t
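A common pattern, shown here as a sketch rather than anything from the course materials, is to wrap calls in a checking macro:

#include <cstdio>
#include <cstdlib>

// Check the cudaError_t returned by core CUDA functions
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage (illustrative): CUDA_CHECK(cudaMalloc(&d_ptr, bytes));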
Rules of the Game
- Different types of memory
  - Shared vs. private
  - Access speeds
- Data is in arrays
  - No parallel data structures
  - No other data structures
  - No auto-parallelization/vectorization compiler support
  - No CPU-type SIMD equivalent
- Compiler constraint
  - No C++11 support
Overview
- Data Access
- Map
- Gather
- Reduce
- Scan
Data Access
- A real problem!
- Thread space can be up to 6D
  - 3D grid of 3D thread blocks
- Input space is typically 1D
  - 2D arrays are possible
- Need to map threads to inputs
- Some examples (see the sketch below)
  - 1 block, N threads: threadIdx.x
  - 1 block, M×N threads: threadIdx.y * blockDim.x + threadIdx.x
  - N blocks, M threads: blockIdx.x * blockDim.x + threadIdx.x
  - ... and so on
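A sketch of the most common mapping, the N-blocks-of-M-threads case above; the kernel name and operation are illustrative:

__global__ void mapThreadsToInput(const float* in, float* out, int n)
{
    // Global index: N blocks of M threads, flattened to one dimension
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: thread count may exceed input size
        out[i] = in[i] * 2.0f;
}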
Map
Gather
Black-Scholes Formula
Reduce
[Diagram: pairwise summation tree; elements are added in pairs over successive steps until one value remains]
Reduce in Practice
- Adding up N data elements
- Use 1 block of N/2 threads
- Each thread does x[i] += x[j]; (see the sketch below)
- At each step
  - Number of threads is halved
  - Distance (j-i) is doubled
- x[0] is the result
[Diagram: reducing 1 2 3 4 5 6 7 8; step 1 gives pairwise sums 3 7 11 15 (threads 0-3), step 2 gives 10 26 (threads 0-1), step 3 gives 36 (thread 0)]
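A sketch of this single-block reduction, assuming n is a power of two and the data already sits in device memory; the kernel name is illustrative:

// Launch as reduceSum<<<1, n/2>>>(d_x, n)
__global__ void reduceSum(float* x, int n)
{
    for (int step = 1; step < n; step *= 2) {
        int i = 2 * step * threadIdx.x;  // distance (j-i) doubles each round,
        if (i + step < n)                // so half as many threads stay active
            x[i] += x[i + step];
        __syncthreads();                 // all threads wait between rounds
    }
    // x[0] now holds the total
}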
Scan
Example: the input 4 2 5 3 6 produces the running sums 4 6 11 14 20
Scan in Practice
- Similar to reduce
- Requires N-1 threads
- Step size keeps doubling
- Number of threads is reduced by the step size
- Each thread n does x[n+step] += x[n]; (see the sketch below)
[Diagram: scanning 1 2 3 4; after step 1 the array is 1 3 5 7, after step 2 it is 1 3 6 10]
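A sketch of an inclusive scan for a single block. Applied literally in parallel, x[n+step] += x[n] races with neighboring threads, so this version double-buffers in shared memory (the naive Hillis-Steele formulation); the names are illustrative:

// Launch (illustrative): scanInclusive<<<1, n, 2 * n * sizeof(float)>>>(d_in, d_out, n)
__global__ void scanInclusive(const float* in, float* out, int n)
{
    extern __shared__ float buf[];   // 2*n floats, size passed at launch
    float* src = buf;
    float* dst = buf + n;
    int tid = threadIdx.x;

    if (tid < n) src[tid] = in[tid];
    __syncthreads();

    for (int step = 1; step < n; step *= 2) {
        if (tid < n)   // each round reads the previous round's values safely
            dst[tid] = (tid >= step) ? src[tid] + src[tid - step] : src[tid];
        __syncthreads();
        float* t = src; src = dst; dst = t;   // swap buffers
    }
    if (tid < n) out[tid] = src[tid];
}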
Graphics Processor Architecture
[Diagram: as before, two SMs with SPs, registers, and shared memory over device memory, but the constant and texture memories are now shown with their on-chip caches inside each SM]
Device Memory
- Grid scope (i.e., available to all threads in all blocks in the grid)
- Application lifetime (exists until the app exits or it is explicitly deallocated)
- Dynamic (see the sketch below)
  - cudaMalloc() to allocate
  - Pass the pointer to the kernel
  - cudaMemcpy() to copy to/from host memory
  - cudaFree() to deallocate
- Static
  - Declare a global variable as device: __device__ int sum = 0;
  - Use freely within the kernel
  - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
  - No need to explicitly deallocate
- Slowest and most inefficient
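A sketch of the dynamic pattern; the kernel, buffer sizes, and names are illustrative:

#include <cstdlib>

__global__ void doubleAll(float* data, int n)   // illustrative kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int n = 1024;
    size_t bytes = n * sizeof(float);
    float* h_data = (float*)calloc(n, sizeof(float));   // host buffer
    float* d_data = nullptr;

    cudaMalloc(&d_data, bytes);                                  // allocate on the GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device
    doubleAll<<<n / 256, 256>>>(d_data, n);                      // pass pointer to kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(d_data);                                            // deallocate
    free(h_data);
    return 0;
}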
Constant & Texture Memory
- Read-only: useful for lookup tables, model parameters, etc.
- Grid scope, application lifetime
- Resides in device memory, but cached in a constant memory cache
- Constrained by MAX_CONSTANT_MEMORY
  - Expect 64 KB
- Similar operation to statically-defined device memory (see the sketch below)
  - Declare as __constant__
  - Use freely within the kernel
  - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
- Very fast provided all threads read from the same location
- Used for kernel arguments
- Texture memory: similar to constant, optimized for 2D access patterns
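A sketch of the __constant__ pattern; the coefficient table and names are illustrative:

__constant__ float coeffs[16];   // lives in constant memory, grid scope

__global__ void weigh(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[0];   // fast: every thread reads the same slot
}

// Host side: copy the table to the symbol (no cudaMalloc/cudaFree needed)
void setup(const float* h_coeffs)
{
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
}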
Shared Memory
- Block scope
  - Shared only within a thread block
  - Not shared between blocks
- Kernel lifetime
- Must be declared within the kernel function body
- Very fast (see the sketch below)
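A sketch of staging data in shared memory; the kernel is illustrative and assumes a launch with 256 threads per block:

__global__ void blockSum(const float* in, float* blockTotals)
{
    __shared__ float tile[256];          // one value per thread in this block
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];   // stage from device memory
    __syncthreads();                     // all loads done before anyone reads

    if (tid == 0) {                      // thread 0 sums the block's tile
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) sum += tile[i];
        blockTotals[blockIdx.x] = sum;
    }
}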
Register & Local Memory
- Memory can be allocated right within the kernel
- Thread scope, kernel lifetime
- Non-array memory
  - int tid = ...
  - Stored in a register
  - Very fast
- Array memory (see the sketch below)
  - Stored in "local memory"
  - Local memory is an abstraction, actually placed in global memory
  - Thus, as slow as global memory
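A sketch contrasting the two, per the rule of thumb above; the kernel is illustrative (in practice the compiler may promote small, statically-indexed arrays to registers):

__global__ void example(float* out)
{
    int tid = threadIdx.x;   // scalar: lives in a register (fast)
    float history[10];       // per-thread array: "local" memory,
                             // actually backed by slow global memory
    for (int i = 0; i < 10; ++i)
        history[i] = tid * i;
    out[tid] = history[9];
}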
Summary
Declaration             Memory    Scope   Lifetime     Slowdown
int foo;                Register  Thread  Kernel       1x
int foo[10];            Local     Thread  Kernel       100x
__shared__ int foo;     Shared    Block   Kernel       1x
__device__ int foo;     Global    Grid    Application  100x
__constant__ int foo;   Constant  Grid    Application  1x
Thread Synchronization
- Threads can take different amounts of time to complete a part of a computation
- Sometimes, you want all threads to reach a particular point before continuing their work
- CUDA provides a thread barrier function: __syncthreads()
  - A thread that calls __syncthreads() waits for the other threads to reach this line
  - Then, all threads can continue executing
[Diagram: threads A, B, C arrive at the barrier at different times and proceed together afterwards]
Restrictions
- __syncthreads() only synchronizes threads within a block
- A thread that calls __syncthreads() waits for other threads to reach this location
- All threads must reach __syncthreads()
  if (x > 0.5f)
  {
    __syncthreads(); // bad idea: threads that skip the branch never arrive
  }
- Each call to __syncthreads() is unique
  if (x > 0.5f)
  {
    __syncthreads();
  }
  else
  {
    __syncthreads(); // also a bad idea: these are two different barriers
  }
Branch Divergence
- Once a block is assigned to an SM, it is split into several warps
- All threads within a warp must execute the same instruction at the same time
  - I.e., the if and else branches cannot be executed concurrently
- Avoid branching if possible
- Ensure if statements cut on warp boundaries
[Diagram: at a branch, the warp serializes the two paths, causing a performance loss]
Summary
- Threads are synchronized with __syncthreads()
  - Block scope
  - All threads must reach the barrier
  - Each __syncthreads() creates a unique barrier
- A block is split into warps
  - All threads in a warp execute the same instruction
  - Branching leads to warp divergence (performance loss)
  - Avoid branching, or ensure branch cases fall in different warps
Overview
- Why atomics?
- Atomic Functions
- Atomic Sum
- Monte Carlo Pi
Why Atomics?
- x++ is a read-modify-write operation
  - Read x into a register
  - Increment the register value
  - Write the register back into x
  - Effectively { temp = x; temp = temp+1; x = temp; }
- If two threads do x++
  - Each thread has its own temp (say t1 and t2)
  - { t1 = x; t1 = t1+1; x = t1; }
    { t2 = x; t2 = t2+1; x = t2; }
  - Race condition: the thread that writes to x last wins
  - Whoever wins, x gets incremented only once
Atomic Functions
- Problem: many threads accessing the same memory location
- Atomic operations ensure that only one thread at a time can access the location
  - Grid scope!
- atomicOP(x,y) effectively performs, as one uninterruptible step:
  - t1 = *x;      // read
  - t2 = t1 OP y; // modify
  - *x = t2;      // write
- Atomics need to be configured (see the sketch below)
  - #include "sm_20_atomic_functions.h"
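A sketch of the "atomic sum" idea from the overview; the kernel name is illustrative (atomicAdd() on float requires compute capability 2.0 or higher, matching the sm_20 header above):

__global__ void sumAll(const float* in, float* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, in[i]);   // read-modify-write as one uninterruptible step
}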
Monte Carlo Pi
[Diagram: a circle of radius r inscribed in a square; the fraction of random points landing inside the circle estimates pi/4]
Summary
- Atomic operations ensure operations on a variable cannot be interrupted by a different thread
- CUDA supports several atomic operations
  - atomicAdd()
  - atomicOr()
  - atomicMin()
  - ... and others
- Atomics incur a heavy performance penalty
Overview
- Events
- Event API
- Event example
- Pinned memory
- Streams
- Stream API
- Example (single stream)
- Example (multiple streams)
Events
- How to measure performance?
  - Use OS timers
    - Too much noise
  - Use a profiler
    - Times only kernel duration + other invocations
  - CUDA Events
    - Event = timestamp
    - Timestamp recorded on the GPU
    - Invoked from the CPU side
Event API
- cudaEvent_t
  - The event handle
- cudaEventCreate(&e)
  - Creates the event
- cudaEventRecord(e, 0)
  - Records the event, i.e. the timestamp
  - Second param is the stream to which to record the event
- cudaEventSynchronize(e)
  - CPU and GPU are async, can be doing things in parallel
  - cudaEventSynchronize() blocks all instruction processing until the GPU has reached the event
- cudaEventElapsedTime(&f, start, stop)
  - Computes the elapsed time (msec) between start and stop, stored as a float
  (see the sketch below)
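Putting the API together, a sketch of timing a kernel; someKernel and its launch configuration are illustrative placeholders:

__global__ void someKernel() { /* work to be timed */ }

void timeKernel()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);        // timestamp before the work
    someKernel<<<128, 256>>>();
    cudaEventRecord(stop, 0);         // timestamp after the work

    cudaEventSynchronize(stop);       // wait until the GPU reaches 'stop'
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}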
Pinned Memory
- CPU memory is pageable
  - Can be swapped to disk
- Pinned (page-locked) memory stays in place
  - Performance advantage when copying to/from the GPU
  - Use cudaHostAlloc() instead of malloc() or new (see the sketch below)
  - Use cudaFreeHost() to deallocate
- Cannot be swapped out
  - Must have enough physical memory
  - Proactively deallocate
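A sketch of allocating and releasing pinned memory; the size is illustrative:

float* h_buf = nullptr;
size_t bytes = 1 << 20;   // 1 MB, illustrative

// Page-locked host allocation instead of malloc/new
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);

// ... fill h_buf, copy to/from the device (faster, and required for cudaMemcpyAsync) ...

cudaFreeHost(h_buf);      // deallocate proactively: pinned memory is a scarce resource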
Streams
- Remember cudaEventRecord(event, stream)?
- A CUDA stream is a queue of GPU operations
  - Kernel launch
  - Memory copy
- Streams allow a form of task-based parallelism
  - Performance improvement
- To leverage streams you need device overlap support
  - GPU_OVERLAP
Stream API
- cudaStream_t
- cudaStreamCreate(&stream)
- kernel<<<blocks,threads,shared,stream>>>
- cudaMemcpyAsync()
  - Must use pinned memory!
  - Takes a stream parameter
- cudaStreamSynchronize(stream)
  (see the sketch below)
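A sketch of queueing a copy-kernel-copy pipeline on one stream. It assumes d_in/d_out are device buffers, h_in/h_out are pinned host buffers from cudaHostAlloc(), and process is an illustrative kernel:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Async copy requires pinned host memory
cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
process<<<blocks, threads, 0, stream>>>(d_in, d_out);   // queued behind the copy
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // block the CPU until the queue drains
cudaStreamDestroy(stream);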
Summary
- CUDA events let you time your code on the GPU
- Pinned memory speeds up data transfers to/from the device
- CUDA streams allow you to queue up operations asynchronously
  - Lets you do different things in parallel on the GPU
  - Use of pinned memory is required