TRANSCRIPT
CUDA C/C++ BASICS (cont.)
NVIDIA Corporation
© NVIDIA 2013
COOPERATING THREADS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
1D Stencil
• Consider applying a 1D stencil to a 1D array of elements
  – Each output element is the sum of input elements within a radius
• If radius is 3, then each output element is the sum of 7 input elements:
[Figure: a 7-element input window contributing to one output element, with "radius" labeling the 3 elements on each side of the center]
Implementing Within a Block
• Each thread processes one output element
  – blockDim.x elements per block
• Input elements are read several times
  – With radius 3, each input element is read seven times
Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks
Implementing With Shared Memory
• Cache data in shared memory
  – Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
  – Compute blockDim.x output elements
  – Write blockDim.x output elements to global memory
  – Each block needs a halo of radius elements at each boundary
[Figure: blockDim.x output elements (16 in the example), with a halo on the left and a halo on the right]
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Data Race!

! The stencil example will not work…
! Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                           // Store at temp[18]
if (threadIdx.x < RADIUS) {                          // Skipped, threadIdx > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                          // Load from temp[19]
__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
  – Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
  – In conditional code, the condition must be uniform across the block
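A hedged sketch of the uniformity requirement: if __syncthreads() sits behind a per-thread condition, threads that skip the branch never reach the barrier, and the block can hang. The kernel names `bad_barrier` and `good_barrier` below are hypothetical, chosen only for illustration.

```cuda
// WRONG: only threads with i < n reach the barrier -> possible deadlock
__global__ void bad_barrier(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        data[i] *= 2;
        __syncthreads();   // threads with i >= n never get here
    }
}

// RIGHT: every thread in the block executes the barrier
__global__ void good_barrier(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= 2;
    __syncthreads();       // uniform: reached by all threads in the block
}
```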
Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
Review (1 of 2)
• Launching parallel threads
  – Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
  – Use blockIdx.x to access block index within grid
  – Use threadIdx.x to access thread index within block
• Allocate elements to threads:
  int index = threadIdx.x + blockIdx.x * blockDim.x;
Review (2 of 2)
• Use __shared__ to declare a variable/array in shared memory
  – Data is shared between threads in a block
  – Not visible to threads in other blocks
• Use __syncthreads() as a barrier
  – Use to prevent data hazards
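The pieces reviewed above fit together in a host-side driver. This is a sketch assuming the stencil_1d kernel and the BLOCK_SIZE/RADIUS constants from the earlier slides; the problem size N and the all-ones input are illustrative. The device pointers are offset by RADIUS so that gindex 0 maps to the first interior element.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define RADIUS     3
#define BLOCK_SIZE 16
#define N          (4 * BLOCK_SIZE)     // number of output elements

int main(void) {
    int size = (N + 2 * RADIUS) * sizeof(int);
    int *in  = (int *)malloc(size);
    int *out = (int *)malloc(size);
    for (int i = 0; i < N + 2 * RADIUS; i++)
        in[i] = 1;

    // Allocate device memory and copy the input over
    int *d_in, *d_out;
    cudaMalloc(&d_in, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);

    // Launch N/BLOCK_SIZE blocks of BLOCK_SIZE threads each
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // cudaMemcpy blocks: it starts after the kernel finishes
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    printf("out[%d] = %d\n", RADIUS, out[RADIUS]);   // 7 for all-ones input

    cudaFree(d_in);  cudaFree(d_out);
    free(in);        free(out);
    return 0;
}
```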
MANAGING THE DEVICE
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS
Coordinating Host & Device
• Kernel launches are asynchronous
  – Control returns to the CPU immediately
• CPU needs to synchronize before consuming the results
  cudaMemcpy()             Blocks the CPU until the copy is complete; copy begins when all preceding CUDA calls have completed
  cudaMemcpyAsync()        Asynchronous, does not block the CPU
  cudaDeviceSynchronize()  Blocks the CPU until all preceding CUDA calls have completed
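The three calls above can be contrasted in a short fragment. This is a sketch: `kernel`, `d_data`, `h_data`, `bytes`, and `do_cpu_work()` are placeholders, not names from the slides.

```cuda
// Kernel launch returns immediately; the host must synchronize
// before it can safely read results produced on the device.
kernel<<<blocks, threads>>>(d_data);            // asynchronous

cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
// ^ blocking: begins after the kernel finishes, returns when the copy is done

// Alternatively, overlap CPU work with the copy using the async variant
// (stream 0 here, the default stream):
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, 0);
do_cpu_work();                                  // runs while the copy is in flight
cudaDeviceSynchronize();                        // after this, h_data is valid
```

Note that for cudaMemcpyAsync to actually overlap with host work, h_data must be pinned host memory (allocated with cudaMallocHost); with pageable memory the copy behaves synchronously.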
Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
  – Error in the API call itself, OR
  – Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
  cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
  const char *cudaGetErrorString(cudaError_t)
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
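A common way to apply the calls above is a checking macro. `CUDA_CHECK` below is our own helper name, not part of the CUDA API; it is a sketch of the usual pattern of wrapping every runtime call and checking again after each kernel launch.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
//   kernel<<<blocks, threads>>>(d_ptr);
//   CUDA_CHECK(cudaGetLastError());        // launch errors (bad config, etc.)
//   CUDA_CHECK(cudaDeviceSynchronize());   // errors raised by the kernel itself
```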
Device Management
• Application can query and select GPUs
  cudaGetDeviceCount(int *count)
  cudaSetDevice(int device)
  cudaGetDevice(int *device)
  cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple host threads can share a device
• A single host thread can manage multiple devices
  cudaSetDevice(i) to select current device
  cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support
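The query-and-select calls above combine into a small enumeration sketch (error checking omitted for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    // List every visible GPU by name
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
    }

    if (count > 0)
        cudaSetDevice(0);   // make device 0 current for subsequent CUDA calls
    return 0;
}
```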
Introduction to CUDA C/C++
• What have we learned?
  – Write and launch CUDA C/C++ kernels
    • __global__, blockIdx.x, threadIdx.x, <<<>>>
  – Manage GPU memory
    • cudaMalloc(), cudaMemcpy(), cudaFree()
  – Manage communication and synchronization
    • __shared__, __syncthreads()
    • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
Compute Capability
• The compute capability of a device describes its architecture, e.g.
  – Number of registers
  – Sizes of memories
  – Features & capabilities
• The following presentations concentrate on Fermi devices
  – Compute Capability >= 2.0

Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
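At runtime, the compute capability can be read from the device properties as a major.minor pair. A minimal sketch, assuming device 0:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Compute capability is reported as major.minor
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);

    if (prop.major >= 2)
        printf("Fermi-class or newer\n");
    return 0;
}
```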
IDs and Dimensions
• A kernel is launched as a grid of blocks of threads
  – blockIdx and threadIdx are 3D
  – We showed only one dimension (x)
• Built-in variables:
  – threadIdx
  – blockIdx
  – blockDim
  – gridDim
[Figure: a Device containing Grid 1, a 3×2 arrangement of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded to show a 5×3 arrangement of threads from Thread (0,0,0) to Thread (4,2,0)]
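The multi-dimensional launch pictured above uses the dim3 type for grid and block sizes. A hypothetical 2D sketch, one thread per pixel of a WIDTH × HEIGHT image (the kernel `invert` and the sizes are illustrative, not from the slides):

```cuda
#define WIDTH  1024
#define HEIGHT  768

__global__ void invert(unsigned char *img) {
    // Combine 2D block and thread indices into pixel coordinates
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < WIDTH && y < HEIGHT)
        img[y * WIDTH + x] = 255 - img[y * WIDTH + x];
}

// Launch: dim3 gives grid and block up to three dimensions each
dim3 block(16, 16);                              // 256 threads per block
dim3 grid((WIDTH  + block.x - 1) / block.x,      // round up so every pixel
          (HEIGHT + block.y - 1) / block.y);     // is covered by a thread
// invert<<<grid, block>>>(d_img);
```

The guard `if (x < WIDTH && y < HEIGHT)` is needed because the rounded-up grid may launch more threads than there are pixels.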
Topics we skipped
• We skipped some details, you can learn more:
  – CUDA Programming Guide
  – CUDA Zone – tools, training, webinars and more: developer.nvidia.com/cuda