Realtime Computer Graphics on GPUs
TRANSCRIPT
Parallel Algorithms Deep Neural Networks OpenCL Compute Shaders Other APIs
Realtime Computer Graphics on GPUs: GPGPU II
Jan Kolomaznı́k
Department of Software and Computer Science Education, Faculty of Mathematics and Physics
Charles University in Prague
1 / 33
Parallel Algorithms
INTRODUCTION
- Algorithms for massively parallel architectures:
  - Often bottom-up design
  - Shallow data structures
  - Memory access patterns considered first
  - Try to make all operations local only
- Problem reformulation:
  - Search for possible constraints
  - Solve the dual problem
  - Cellular automata
  - ...
BASIC META-ALGORITHMS
- Map:
  - ForEach (in-place?)
  - Transform
- Spatial filters with limited support:
  - Convolution
  - Morphological operations
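The map-style meta-algorithms above can be sketched sequentially in plain C++. The helper names (`foreach_scale`, `transform_add`, `convolve3`) are illustrative, not from the slides; the point is that every output element depends only on its own input (or a fixed small neighborhood), so each loop iteration below could run as an independent GPU thread.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// In-place "ForEach": every element is updated independently.
void foreach_scale(std::vector<float>& data, float s) {
    std::for_each(data.begin(), data.end(), [s](float& x) { x *= s; });
}

// "Transform": reads the inputs, writes a separate output.
std::vector<float> transform_add(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    std::vector<float> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   [](float x, float y) { return x + y; });
    return out;
}

// 1-D convolution with a 3-tap (limited-support) kernel; border clamped.
std::vector<float> convolve3(const std::vector<float>& in,
                             const std::array<float, 3>& k) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {   // one GPU thread each
        float acc = 0.0f;
        for (int d = -1; d <= 1; ++d) {
            std::size_t j = std::clamp<long>((long)i + d, 0, (long)in.size() - 1);
            acc += k[d + 1] * in[j];
        }
        out[i] = acc;
    }
    return out;
}
```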
REDUCE (FOLD)

- Associative binary operation used to combine the input elements into a single value:
  - Sum
  - Multiplication
  - Min/Max
- On GPU:
  1. Tree-based approach used within each thread block
  2. Global sync (multiple kernels)
  3. Reduce the block results
- Optimizations:
  - Prevent thread idling
  - Shared memory access patterns
  - Ref: Mark Harris, "Optimizing Parallel Reduction in CUDA" (30x speedup)
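A sequential C++ sketch of the two-pass reduction scheme, with illustrative helper names (`block_reduce`, `reduce`) that are not from the slides: the inner loop of `block_reduce` models the log2(n) tree steps the threads of one block perform in parallel, and the recursion models the second kernel launch over the per-block results.

```cpp
#include <algorithm>
#include <vector>

// Tree reduction within one "block": in each step, every active "thread"
// adds the element one stride away, halving the active count (log2 n steps).
float block_reduce(std::vector<float> block) {
    std::size_t n = 1;                   // pad to a power of two with the
    while (n < block.size()) n *= 2;     // identity of the operation (0 for +)
    block.resize(n, 0.0f);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)   // parallel on a GPU
            block[i] += block[i + stride];
    return block[0];
}

// Two-pass reduce: one "kernel" per block, then reduce the per-block results.
// The global sync between the phases corresponds to separate kernel launches.
float reduce(const std::vector<float>& data, std::size_t block_size) {
    if (data.empty()) return 0.0f;
    std::vector<float> partial;
    for (std::size_t b = 0; b < data.size(); b += block_size) {
        std::size_t end = std::min(b + block_size, data.size());
        std::vector<float> blk(data.begin() + b, data.begin() + end);
        partial.push_back(block_reduce(blk));
    }
    return partial.size() == 1 ? partial[0] : reduce(partial, block_size);
}
```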
SCAN (PREFIX-SUM)
- Associative binary operation combining, for each processed element, all input elements in front of it
- Inclusive vs. exclusive
- Similar implementation and optimizations as reduction
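A minimal C++ sketch of the two scan variants, plus a Hillis-Steele formulation whose inner loop is fully parallel (the structure a GPU implementation exploits). The function names are illustrative, not from the slides.

```cpp
#include <utility>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = acc += in[i];
    return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1]; out[0] is the identity.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = acc;
        acc += in[i];
    }
    return out;
}

// Hillis-Steele form of the inclusive scan: log2(n) sweeps, and every
// iteration of the inner loop is independent, i.e. one GPU thread.
std::vector<int> scan_hillis_steele(std::vector<int> a) {
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        std::vector<int> next = a;
        for (std::size_t i = stride; i < a.size(); ++i)  // parallel sweep
            next[i] = a[i] + a[i - stride];
        a = std::move(next);
    }
    return a;
}
```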
EXAMPLE: SELECT ELEMENTS BY INDICATOR
- Task:
  - Output the elements selected by a predicate
- Naive approach:
  - Adding elements to the output array directly
  - Bottleneck: the size update
- Better solution:
  - Two-level solution
  - Store the local selection into shared memory
  - Update the global output size once per block
- Advanced solution:
  - Run a parallel prefix-sum on the indicator set (0/1 for each element)
  - Computes the indices for all selected elements and the final size
  - Final step: store all selected elements at the precomputed positions
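The advanced, prefix-sum-based selection can be sketched in sequential C++ (the helper name `compact` is illustrative). Each of the three phases (indicator map, scan, scatter) is a fully parallel kernel on a GPU; the exclusive scan gives every kept element its output index with no contended size counter.

```cpp
#include <vector>

// Stream compaction: keep the elements for which pred holds.
template <typename Pred>
std::vector<int> compact(const std::vector<int>& in, Pred pred) {
    // 1. Indicator set: 1 where the predicate holds, 0 elsewhere (parallel map).
    std::vector<int> flag(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flag[i] = pred(in[i]) ? 1 : 0;

    // 2. Exclusive prefix sum of the indicators: idx[i] is the output
    //    position of element i (if kept); the total is the final size.
    std::vector<int> idx(in.size());
    int total = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {   // parallel scan on a GPU
        idx[i] = total;
        total += flag[i];
    }

    // 3. Scatter each selected element to its precomputed position.
    std::vector<int> out(total);
    for (std::size_t i = 0; i < in.size(); ++i)     // parallel scatter
        if (flag[i]) out[idx[i]] = in[i];
    return out;
}
```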
EXAMPLE: INTEGRAL IMAGES
- Apply prefix-sum to rows and columns
- Sum/average queries for rectangular regions in constant complexity
- Used in feature detectors:
  - Haar features
  - Blur filter approximation
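A small C++ sketch of the construction and the constant-time query (the names `integral` and `rect_sum` are illustrative). Each of the two prefix-sum passes parallelizes over rows and columns respectively; the query then needs only four lookups regardless of rectangle size.

```cpp
#include <vector>

// Integral image: ii[y][x] = sum of img over the rectangle [0..y] x [0..x],
// built by prefix-summing the rows and then the columns.
std::vector<std::vector<long>> integral(const std::vector<std::vector<int>>& img) {
    std::size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<long>> ii(h, std::vector<long>(w, 0));
    for (std::size_t y = 0; y < h; ++y)        // prefix-sum each row (parallel over rows)
        for (std::size_t x = 0; x < w; ++x)
            ii[y][x] = img[y][x] + (x ? ii[y][x - 1] : 0);
    for (std::size_t x = 0; x < w; ++x)        // then each column (parallel over columns)
        for (std::size_t y = 1; y < h; ++y)
            ii[y][x] += ii[y - 1][x];
    return ii;
}

// Sum of img over rows [y0, y1] and columns [x0, x1]: four lookups, O(1).
long rect_sum(const std::vector<std::vector<long>>& ii,
              std::size_t y0, std::size_t x0, std::size_t y1, std::size_t x1) {
    long s = ii[y1][x1];
    if (y0) s -= ii[y0 - 1][x1];
    if (x0) s -= ii[y1][x0 - 1];
    if (y0 && x0) s += ii[y0 - 1][x0 - 1];
    return s;
}
```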
DATA STRUCTURES

- Basic categories:
  - Dense arrays
  - Sparse structures:
    - Matrices
    - Graphs
  - Hash tables
- Different criteria:
  - Read/write
  - Space waste
  - How many threads are writing
- Two-level design:
  - Local data structure living in shared memory
  - Main data structure living in global memory
  - Write the local instances out at the end of the computation
Deep Neural Networks
DEEP NEURAL NETWORKS
- Another neural network renaissance
- Large neural networks with many layers
- Convolutional networks
- Large numbers of identical neurons: highly parallel by nature
- Backpropagation
- Millions of parameters
- Large training sets
- Training vs. inference
EXAMPLE
CUDNN
- GPU-accelerated library of primitives for deep neural networks
- Used to speed up DNN frameworks such as:
  - TensorFlow
  - Caffe
  - PyTorch
  - Keras
  - MATLAB
  - ...
- Highly optimized special implementations for selected common cases
TENSORRT
- Also by Nvidia
- Inference optimization:
  - Inference is significantly less computationally demanding than training
  - Deployed on embedded systems with memory constraints
- Changes the network topology without sacrificing inference precision
- Uses lower numerical precision:
  - Less space occupied by weights
  - Usage of tensor cores
TENSOR CORES
- New in the Volta architecture
- Accelerate matrix problems of the form D = A * B + C
- A single core works on 4x4 matrices:
  - fp16 precision for multiplication
  - fp16 or fp32 for accumulation
- Warp matrix functions:
  - Special calls for specific matrix sizes
- Significant speedup of NN training and inference
- Denoising in raytracing APIs
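Functionally, one tensor-core operation is the fused multiply-accumulate D = A * B + C on 4x4 tiles. The C++ sketch below (illustrative name `mma4x4`) uses plain `float` throughout for portability; on real hardware the products of A and B are fp16 while the accumulator is fp16 or fp32, and a whole warp cooperates on larger tiles via the warp matrix functions.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;

// One tensor-core-style fused operation: D = A * B + C on a 4x4 tile.
Mat4 mma4x4(const Mat4& A, const Mat4& B, const Mat4& C) {
    Mat4 D{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // accumulator (fp32 on hardware)
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // products (fp16 on hardware)
            D[i][j] = acc;
        }
    return D;
}
```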
OpenCL
OPENCL
- Alternative to C for CUDA
- Basic ideas taken from C for CUDA; 1:1 equivalence in some parts
- Programming model for the execution of massively parallel tasks on CPU, GPU, Cell, ...
- Language:
  - Originally a subset of C99
  - Subset of C++14 in OpenCL 2.1
  - Just-in-time compilation
HOST CODE
- Code structure similar to shader programming
- API:
  - clBuildProgram()
  - clCreateCommandQueue()
  - clCreateBuffer()
  - clEnqueueWriteBuffer()
  - ...
OPENCL CONCEPTS
OpenCL                 CUDA equivalent
kernel                 kernel
host program           host program
NDRange (index space)  grid
work item              thread
work group             block
OPENCL THREADS
OpenCL               CUDA equivalent
get_global_id(0)     blockIdx.x * blockDim.x + threadIdx.x
get_local_id(0)      threadIdx.x
get_global_size(0)   gridDim.x * blockDim.x
get_local_size(0)    blockDim.x
OPENCL MEMORY
OpenCL           CUDA equivalent
global memory    global memory
constant memory  constant memory
local memory     shared memory
private memory   local memory
SAMPLE OPENCL KERNEL
// Kernel definition
__kernel void VecAdd(
    __global const float *A,
    __global const float *B,
    __global float *C)
{
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}
SPIR
- Standard Portable Intermediate Representation
- Distribute device-specific pre-compiled binaries
- SPIR-V incorporated in the core specification of:
  - OpenCL 2.1
  - Vulkan API
  - OpenGL 4.6
Compute Shaders
MOTIVATION
- Why not OpenCL or CUDA?
  - One API for graphics and GP processing
  - Avoid interop
  - Avoid context switches
  - You already know GLSL
- APIs:
  - Core since OpenGL 4.3
  - Part of OpenGL ES 3.1
  - Vulkan
WHERE DOES IT BELONG?
USAGE
- Write a compute shader in GLSL:
  - Define memory resources
  - Write the main() function
- Initialization:
  - Allocate GPU memory (buffers, textures)
  - Compile the shader, link the program
- Run it:
  - Bind buffers, textures, images, uniforms
  - Call glDispatchCompute(...)
Other APIs
DIRECTCOMPUTE
- Part of the Microsoft DirectX collection of APIs
- Needs GPU support for DX10 or DX11
C++ AMP, OPENACC
- C++ Accelerated Massive Parallelism:
  - Open specification from Microsoft
  - Builds on DX11
  - Language extensions, runtime library
  - Heterogeneous computation
- OpenACC:
  - Similar to OpenMP
  - Compiler directives (pragmas) + runtime library
SYCL
- Standard by Khronos
- Cross-platform abstraction layer
- Builds on the underlying concepts, portability and efficiency of OpenCL
- Single-source style using completely standard C++
SAMPLE SYCL CODE
#include <CL/sycl.hpp>
#include <iostream>

int main() {
    using namespace cl::sycl;
    int data[1024];
    // initialize data to be worked on
    // By including all the SYCL work in a {} block, we ensure
    // all SYCL tasks must complete before exiting the block
    {
        queue myQueue;
        buffer<int, 1> resultBuf(data, range<1>(1024));
        // create a command group to issue commands to the queue
        myQueue.submit([&](handler& cgh) {
            // request access to the buffer
            auto writeResult = resultBuf.get_access<access::mode::write>(cgh);
            // enqueue a parallel_for task
            cgh.parallel_for<class simple_test>(range<1>(1024), [=](id<1> idx) {
                writeResult[idx] = idx[0];
            });
        });
    } // end of scope, so we wait for the queued work to complete
    for (int i = 0; i < 1024; i++)
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    return 0;
}
Thank you for your attention!