TRANSCRIPT
CUDA C/C++ BASICS (cont.)
NVIDIA Corporation
© NVIDIA 2013
COOPERATING THREADS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
  – Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements:
[Figure: a 7-element input window contributing to one output element, with "radius" labeling the 3 elements on each side of the center]
Implementing Within a Block
• Each thread processes one output element
  – blockDim.x elements per block
• Input elements are read several times
  – With radius 3, each input element is read seven times
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
  – Each block needs a halo of radius elements at each boundary
[Figure: blockDim.x output elements (16 in the example), with a halo on the left and a halo on the right]
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!

! The stencil example will not work…
! Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                           // Store at temp[18]
if (threadIdx.x < RADIUS) {                          // Skipped, threadIdx > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                          // Load from temp[19]
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block
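A hedged sketch of the uniformity requirement: if __syncthreads() sits behind a per-thread condition, threads that skip the branch never reach the barrier, and the block can hang. The kernel names `bad_barrier` and `good_barrier` below are hypothetical, chosen only for illustration.

```cuda
// WRONG: only threads with i < n reach the barrier -> possible deadlock
__global__ void bad_barrier(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        data[i] *= 2;
        __syncthreads();   // threads with i >= n never get here
    }
}

// RIGHT: every thread in the block executes the barrier
__global__ void good_barrier(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= 2;
    __syncthreads();       // uniform: reached by all threads in the block
}
```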
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Review (1 of 2)
• Launching parallel threads
  – Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
  – Use blockIdx.x to access block index within grid
  – Use threadIdx.x to access thread index within block
• Allocate elements to threads:
  int index = threadIdx.x + blockIdx.x * blockDim.x;
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
  – Data is shared between threads in a block
  – Not visible to threads in other blocks
• Use __syncthreads() as a barrier
  – Use to prevent data hazards
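The pieces reviewed above fit together in a host-side driver. This is a sketch assuming the stencil_1d kernel and the BLOCK_SIZE/RADIUS constants from the earlier slides; the problem size N and the all-ones input are illustrative. The device pointers are offset by RADIUS so that gindex 0 maps to the first interior element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define RADIUS     3
#define BLOCK_SIZE 16
#define N          (4 * BLOCK_SIZE)     // number of output elements

int main(void) {
    int size = (N + 2 * RADIUS) * sizeof(int);
    int *in  = (int *)malloc(size);
    int *out = (int *)malloc(size);
    for (int i = 0; i < N + 2 * RADIUS; i++)
        in[i] = 1;

    // Allocate device memory and copy the input over
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    // Launch N/BLOCK_SIZE blocks of BLOCK_SIZE threads each
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // cudaMemcpy blocks: it starts after the kernel finishes
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    printf("out[%d] = %d\n", RADIUS, out[RADIUS]);   // 7 for all-ones input

    cudaFree(d_in);  cudaFree(d_out);
    free(in);        free(out);
    return 0;
}
```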
MANAGING THE DEVICE
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
Coordinating Host & Device
• Kernel launches are asynchronous
  – Control returns to the CPU immediately
• CPU needs to synchronize before consuming the results
  cudaMemcpy()             Blocks the CPU until the copy is complete; copy begins when all preceding CUDA calls have completed
  cudaMemcpyAsync()        Asynchronous, does not block the CPU
  cudaDeviceSynchronize()  Blocks the CPU until all preceding CUDA calls have completed
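The three calls above can be contrasted in a short fragment. This is a sketch: `kernel`, `d_data`, `h_data`, `bytes`, and `do_cpu_work()` are placeholders, not names from the slides.

```cuda
// Kernel launch returns immediately; the host must synchronize
// before it can safely read results produced on the device.
kernel<<<blocks, threads>>>(d_data);            // asynchronous

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
// ^ blocking: begins after the kernel finishes, returns when the copy is done

// Alternatively, overlap CPU work with the copy using the async variant
// (stream 0 here, the default stream):
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
do_cpu_work();                                  // runs while the copy is in flight
cudaDeviceSynchronize();                        // after this, h_data is valid
```

Note that for cudaMemcpyAsync to actually overlap with host work, h_data must be pinned host memory (allocated with cudaMallocHost); with pageable memory the copy behaves synchronously.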
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
  – Error in the API call itself, OR
  – Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
  cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
  const char *cudaGetErrorString(cudaError_t)
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
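A common way to apply the calls above is a checking macro. `CUDA_CHECK` below is our own helper name, not part of the CUDA API; it is a sketch of the usual pattern of wrapping every runtime call and checking again after each kernel launch.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   kernel<<<blocks, threads>>>(d_ptr);
//   CUDA_CHECK(cudaGetLastError());        // launch errors (bad config, etc.)
//   CUDA_CHECK(cudaDeviceSynchronize());   // errors raised by the kernel itself
```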
Device Management
• Application can query and select GPUs
  cudaGetDeviceCount(int *count)
  cudaSetDevice(int device)
  cudaGetDevice(int *device)
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
  cudaSetDevice(i) to select current device
  cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support
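The query-and-select calls above combine into a small enumeration sketch (error checking omitted for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    // List every visible GPU by name
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
    }

    if (count > 0)
        cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls
    return 0;
}
```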
Introduction to CUDA C/C++
• What have we learned?
  – Write and launch CUDA C/C++ kernels
    • __global__, blockIdx.x, threadIdx.x, <<<>>>
  – Manage GPU memory
    • cudaMalloc(), cudaMemcpy(), cudaFree()
  – Manage communication and synchronization
    • __shared__, __syncthreads()
    • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
Compute Capability
• The compute capability of a device describes its architecture, e.g.
  – Number of registers
  – Sizes of memories
  – Features & capabilities
• The following presentations concentrate on Fermi devices
  – Compute Capability >= 2.0

Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
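At runtime, the compute capability can be read from the device properties as a major.minor pair. A minimal sketch, assuming device 0:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Compute capability is reported as major.minor
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);

    if (prop.major >= 2)
        printf("Fermi-class or newer\n");
    return 0;
}
```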
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
  – blockIdx and threadIdx are 3D
  – We showed only one dimension (x)
• Built-in variables:
  – threadIdx
  – blockIdx
  – blockDim
  – gridDim
[Figure: a Device containing Grid 1, a 3×2 arrangement of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded to show a 5×3 arrangement of threads from Thread (0,0,0) to Thread (4,2,0)]
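The multi-dimensional launch pictured above uses the dim3 type for grid and block sizes. A hypothetical 2D sketch, one thread per pixel of a WIDTH × HEIGHT image (the kernel `invert` and the sizes are illustrative, not from the slides):

```cuda
#define WIDTH  1024
#define HEIGHT  768

__global__ void invert(unsigned char *img) {
    // Combine 2D block and thread indices into pixel coordinates
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < WIDTH && y < HEIGHT)
        img[y * WIDTH + x] = 255 - img[y * WIDTH + x];
}

// Launch: dim3 gives grid and block up to three dimensions each
dim3 block(16, 16);                              // 256 threads per block
dim3 grid((WIDTH  + block.x - 1) / block.x,      // round up so every pixel
          (HEIGHT + block.y - 1) / block.y);     // is covered by a thread
// invert<<<grid, block>>>(d_img);
```

The guard `if (x < WIDTH && y < HEIGHT)` is needed because the rounded-up grid may launch more threads than there are pixels.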
Topics we skipped
• We skipped some details, you can learn more:
  – CUDA Programming Guide
  – CUDA Zone – tools, training, webinars and more: developer.nvidia.com/cuda