cudahome.deib.polimi.it › ... › 2017 › doc › slides › 2.hsa_cuda_v1.pdf · 2017-12-12 ·...
TRANSCRIPT
Advanced Topics on Heterogeneous System Architectures
Politecnico di Milano
12 December, 2017
Antonio R. Miele, Marco D. Santambrogio
CUDA
Introduction
• CUDA (Compute Unified Device Architecture) was introduced in 2007 with the NVIDIA Tesla architecture
• A C-like language for writing programs that use the GPU as a processing resource to accelerate functions
• CUDA's abstraction level closely matches the characteristics and organization of NVIDIA GPUs
2
Structure of a program
• A CUDA program is organized in
– a serial C code executed on the CPU
– a set of independent kernel functions parallelized on the GPU
3
Parallelization of a kernel
• The set of input data is decomposed into a stream of elements
• A single computational function (the kernel) operates on each element
– a "thread" is the execution of the kernel on one data element
• Multiple cores can process multiple threads in parallel
• Suitable for data-parallel problems
4
CUDA hierarchy of threads
5
• A kernel executes in parallel across a set of parallel threads
– each thread executes an instance of the kernel on a single data chunk
• Threads are grouped in thread blocks
• Thread blocks are grouped in a grid
• Blocks and grids are organized as N-dimensional (up to 3) arrays
– IDs are used to identify elements in the array
CUDA hierarchy of threads
• The CUDA hierarchy of threads perfectly matches the NVIDIA two-level architecture
– a GPU executes one or more kernel grids
– a streaming multiprocessor (SM) concurrently executes one or more interleaved thread blocks
– CUDA cores in the SM execute threads
– the SM executes threads of the same block in groups of 32 threads called a warp
6
CUDA device memory model
7
• There are mainly three distinct memories visible in the device-side code
CUDA device memory model
• Matching between the memory model and the NVIDIA GPU architecture
8
Execution flow
1. Copy input data from CPU memory to GPU memory
9
Execution flow
2. Load GPU program and execute
10
Execution flow
3. Transfer results from GPU memory to CPU memory
11
Example of CUDA kernel
• Compute the sum of two vectors
• C program:
12

```c
#define N 256

void vsum(int* a, int* b, int* c) {
    int i;
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

void main() {
    int va[N], vb[N], vc[N];
    ...
    vsum(va, vb, vc);
    ...
}
```
Example of CUDA kernel
• CUDA C program:
13

```c
#define N 256

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);  // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}
```
Example of CUDA kernel
• CUDA C program:
14

```c
#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);  // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}
```

– blockIdx: gets the x (y, z) value of the block ID
– blockDim: gets the length of the block in the x (y, z) direction
– threadIdx: gets the value of the thread ID in the x (y, z) direction
Example of CUDA kernel
• CUDA C program:
15

```c
#define N 1024

__global__ void vsumKernel(int* a, int* b, int* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

void main() {
    ...
    dim3 blocksPerGrid(N/256, 1, 1);  // assuming 256 divides N exactly
    dim3 threadsPerBlock(256, 1, 1);
    vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(va, vb, vc);
    ...
}
```

– vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(...): executes the kernel
– blocksPerGrid: number of blocks in the grid
– threadsPerBlock: number of threads in each block
Data transfer
• GPU memory is separate from CPU memory
– GPU memory has to be allocated
– data has to be transferred explicitly into GPU memory
16

```c
void main() {
    int *va;
    int mya[N];  // to be populated
    ...
    cudaMalloc(&va, N*sizeof(int));
    cudaMemcpy(va, mya, N*sizeof(int), cudaMemcpyHostToDevice);
    ...  // execute kernel
    cudaMemcpy(mya, va, N*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(va);
}
```
Host/device synchronization
• Kernel calls are non-blocking: the host program continues immediately after it calls the kernel
– allows overlap of computation on CPU and GPU
• Use cudaThreadSynchronize() to wait for the kernel to finish (deprecated in later CUDA releases in favor of cudaDeviceSynchronize())
17
• Standard cudaMemcpy calls are blocking
– non-blocking variants exist
• It is also possible to synchronize threads within the same block with the __syncthreads() call

```c
vsumKernel<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
// do work on host (that doesn't depend on c)
cudaThreadSynchronize();  // wait for kernel to finish
```
Another example of CUDA kernel
• Sum of matrices: 2D decomposition of the problem
18

```c
__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N]) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}

int main() {
    dim3 blocksPerGrid(N/16, N/16, 1);  // (N/16)x(N/16) blocks/grid (2D)
    dim3 threadsPerBlock(16, 16, 1);    // 16x16 = 256 threads/block (2D)
    matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
}
```
References
• NVIDIA website:
– http://www.nvidia.com/object/what-is-gpu-computing.html
19