Realtime Computer Graphics on GPUs
TRANSCRIPT
Parallel Algorithms Deep Neural Networks OpenCL Compute Shaders Other APIs
Realtime Computer Graphics on GPUs: GPGPU II
Jan Kolomaznı́k
Department of Software and Computer Science Education, Faculty of Mathematics and Physics
Charles University in Prague
1 / 33
Parallel Algorithms
INTRODUCTION
- Algorithms for massively parallel architectures:
  - Often bottom-up design
  - Shallow data structures
  - Memory access patterns considered first
  - Try to make all operations local only
- Problem reformulation:
  - Search for possible constraints
  - Solve the dual problem
  - Cellular automata
  - ...
BASIC META-ALGORITHMS
- Map:
  - ForEach (in-place?)
  - Transform
- Spatial filters with limited support:
  - Convolution
  - Morphological operations
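The map-style meta-algorithms above can be sketched sequentially in plain C++. The helper names (`foreach_scale`, `transform_add`, `convolve3`) are illustrative, not from the slides; the point is that every output element depends only on its own input (or a fixed small neighborhood), so each loop iteration below could run as an independent GPU thread.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// In-place "ForEach": every element is updated independently.
void foreach_scale(std::vector<float>& data, float s) {
    std::for_each(data.begin(), data.end(), [s](float& x) { x *= s; });
}

// "Transform": reads the inputs, writes a separate output.
std::vector<float> transform_add(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    std::vector<float> out(a.size());
    std::transform(a.begin(), a.end(), b.begin(), out.begin(),
                   [](float x, float y) { return x + y; });
    return out;
}

// 1-D convolution with a 3-tap (limited-support) kernel; border clamped.
std::vector<float> convolve3(const std::vector<float>& in,
                             const std::array<float, 3>& k) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {   // one GPU thread each
        float acc = 0.0f;
        for (int d = -1; d <= 1; ++d) {
            std::size_t j = std::clamp<long>((long)i + d, 0, (long)in.size() - 1);
            acc += k[d + 1] * in[j];
        }
        out[i] = acc;
    }
    return out;
}
```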
REDUCE (FOLD)

- Associative binary operation used to combine the input elements into a single value:
  - Sum
  - Multiplication
  - Min/Max
- On GPU:
  1. Tree-based approach used within each thread block
  2. Global sync (multiple kernels)
  3. Reduce the block results
- Optimizations:
  - Prevent thread idling
  - Shared memory access patterns
  - Ref: Mark Harris, "Optimizing Parallel Reduction in CUDA" (30x speedup)
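A sequential C++ sketch of the two-pass reduction scheme, with illustrative helper names (`block_reduce`, `reduce`) that are not from the slides: the inner loop of `block_reduce` models the log2(n) tree steps the threads of one block perform in parallel, and the recursion models the second kernel launch over the per-block results.

```cpp
#include <algorithm>
#include <vector>

// Tree reduction within one "block": in each step, every active "thread"
// adds the element one stride away, halving the active count (log2 n steps).
float block_reduce(std::vector<float> block) {
    std::size_t n = 1;                   // pad to a power of two with the
    while (n < block.size()) n *= 2;     // identity of the operation (0 for +)
    block.resize(n, 0.0f);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)   // parallel on a GPU
            block[i] += block[i + stride];
    return block[0];
}

// Two-pass reduce: one "kernel" per block, then reduce the per-block results.
// The global sync between the phases corresponds to separate kernel launches.
float reduce(const std::vector<float>& data, std::size_t block_size) {
    if (data.empty()) return 0.0f;
    std::vector<float> partial;
    for (std::size_t b = 0; b < data.size(); b += block_size) {
        std::size_t end = std::min(b + block_size, data.size());
        std::vector<float> blk(data.begin() + b, data.begin() + end);
        partial.push_back(block_reduce(blk));
    }
    return partial.size() == 1 ? partial[0] : reduce(partial, block_size);
}
```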
SCAN (PREFIX-SUM)
- Associative binary operation combining, for each processed element, all input elements in front of it
- Inclusive vs. exclusive
- Similar implementation and optimizations as reduction
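A minimal C++ sketch of the two scan variants, plus a Hillis-Steele formulation whose inner loop is fully parallel (the structure a GPU implementation exploits). The function names are illustrative, not from the slides.

```cpp
#include <utility>
#include <vector>

// Inclusive scan: out[i] = in[0] + ... + in[i].
std::vector<int> inclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = acc += in[i];
    return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1]; out[0] is the identity.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int acc = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = acc;
        acc += in[i];
    }
    return out;
}

// Hillis-Steele form of the inclusive scan: log2(n) sweeps, and every
// iteration of the inner loop is independent, i.e. one GPU thread.
std::vector<int> scan_hillis_steele(std::vector<int> a) {
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        std::vector<int> next = a;
        for (std::size_t i = stride; i < a.size(); ++i)  // parallel sweep
            next[i] = a[i] + a[i - stride];
        a = std::move(next);
    }
    return a;
}
```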
EXAMPLE: SELECT ELEMENTS BY INDICATOR
- Task:
  - Output the elements selected by a predicate
- Naive approach:
  - Adding elements to the output array directly
  - Bottleneck: the size update
- Better solution:
  - Two-level solution
  - Store the local selection into shared memory
  - Update the global output size once per block
- Advanced solution:
  - Run a parallel prefix-sum on the indicator set (0/1 for each element)
  - Computes the indices for all selected elements and the final size
  - Final step: store all selected elements at the precomputed positions
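The advanced, prefix-sum-based selection can be sketched in sequential C++ (the helper name `compact` is illustrative). Each of the three phases (indicator map, scan, scatter) is a fully parallel kernel on a GPU; the exclusive scan gives every kept element its output index with no contended size counter.

```cpp
#include <vector>

// Stream compaction: keep the elements for which pred holds.
template <typename Pred>
std::vector<int> compact(const std::vector<int>& in, Pred pred) {
    // 1. Indicator set: 1 where the predicate holds, 0 elsewhere (parallel map).
    std::vector<int> flag(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        flag[i] = pred(in[i]) ? 1 : 0;

    // 2. Exclusive prefix sum of the indicators: idx[i] is the output
    //    position of element i (if kept); the total is the final size.
    std::vector<int> idx(in.size());
    int total = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {   // parallel scan on a GPU
        idx[i] = total;
        total += flag[i];
    }

    // 3. Scatter each selected element to its precomputed position.
    std::vector<int> out(total);
    for (std::size_t i = 0; i < in.size(); ++i)     // parallel scatter
        if (flag[i]) out[idx[i]] = in[i];
    return out;
}
```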
EXAMPLE: INTEGRAL IMAGES
- Apply prefix-sum to rows and columns
- Sum/average queries for rectangular regions in constant complexity
- Used in feature detectors:
  - Haar features
  - Blur filter approximation
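A small C++ sketch of the construction and the constant-time query (the names `integral` and `rect_sum` are illustrative). Each of the two prefix-sum passes parallelizes over rows and columns respectively; the query then needs only four lookups regardless of rectangle size.

```cpp
#include <vector>

// Integral image: ii[y][x] = sum of img over the rectangle [0..y] x [0..x],
// built by prefix-summing the rows and then the columns.
std::vector<std::vector<long>> integral(const std::vector<std::vector<int>>& img) {
    std::size_t h = img.size(), w = img[0].size();
    std::vector<std::vector<long>> ii(h, std::vector<long>(w, 0));
    for (std::size_t y = 0; y < h; ++y)        // prefix-sum each row (parallel over rows)
        for (std::size_t x = 0; x < w; ++x)
            ii[y][x] = img[y][x] + (x ? ii[y][x - 1] : 0);
    for (std::size_t x = 0; x < w; ++x)        // then each column (parallel over columns)
        for (std::size_t y = 1; y < h; ++y)
            ii[y][x] += ii[y - 1][x];
    return ii;
}

// Sum of img over rows [y0, y1] and columns [x0, x1]: four lookups, O(1).
long rect_sum(const std::vector<std::vector<long>>& ii,
              std::size_t y0, std::size_t x0, std::size_t y1, std::size_t x1) {
    long s = ii[y1][x1];
    if (y0) s -= ii[y0 - 1][x1];
    if (x0) s -= ii[y1][x0 - 1];
    if (y0 && x0) s += ii[y0 - 1][x0 - 1];
    return s;
}
```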
DATA STRUCTURES

- Basic categories:
  - Dense arrays
  - Sparse structures:
    - Matrices
    - Graphs
  - Hash tables
- Different criteria:
  - Read/write
  - Space waste
  - How many threads are writing
- Two-level design:
  - Local data structure living in shared memory
  - Main data structure living in global memory
  - Write the local instances out at the end of the computation
Deep Neural Networks
DEEP NEURAL NETWORKS
- Another neural network renaissance
- Large neural networks with many layers
- Convolutional networks
- Large numbers of identical neurons: highly parallel by nature
- Backpropagation
- Millions of parameters
- Large training sets
- Training vs. inference
EXAMPLE
CUDNN
- GPU-accelerated library of primitives for deep neural networks
- Used to speed up DNN frameworks such as:
  - TensorFlow
  - Caffe
  - PyTorch
  - Keras
  - MATLAB
  - ...
- Highly optimized special implementations for selected common cases
TENSORRT
- Also by Nvidia
- Inference optimization:
  - Inference is significantly less computationally demanding than training
  - Deployed on embedded systems with memory constraints
- Changes the network topology without sacrificing inference precision
- Uses lower numerical precision:
  - Less space occupied by weights
  - Usage of tensor cores
TENSOR CORES
- New in the Volta architecture
- Accelerate matrix problems of the form D = A * B + C
- A single core works on 4x4 matrices:
  - fp16 precision for multiplication
  - fp16 or fp32 for accumulation
- Warp matrix functions:
  - Special calls for specific matrix sizes
- Significant speedup of NN training and inference
- Denoising in raytracing APIs
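Functionally, one tensor-core operation is the fused multiply-accumulate D = A * B + C on 4x4 tiles. The C++ sketch below (illustrative name `mma4x4`) uses plain `float` throughout for portability; on real hardware the products of A and B are fp16 while the accumulator is fp16 or fp32, and a whole warp cooperates on larger tiles via the warp matrix functions.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;

// One tensor-core-style fused operation: D = A * B + C on a 4x4 tile.
Mat4 mma4x4(const Mat4& A, const Mat4& B, const Mat4& C) {
    Mat4 D{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // accumulator (fp32 on hardware)
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // products (fp16 on hardware)
            D[i][j] = acc;
        }
    return D;
}
```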
OpenCL
OPENCL
- Alternative to C for CUDA
- Basic ideas taken from C for CUDA; 1:1 equivalence in some parts
- Programming model for the execution of massively parallel tasks on CPU, GPU, Cell, ...
- Language:
  - Originally a subset of C99
  - Subset of C++14 in OpenCL 2.1
  - Just-in-time compilation
HOST CODE
- Code structure similar to shader programming
- API:
  - clBuildProgram()
  - clCreateCommandQueue()
  - clCreateBuffer()
  - clEnqueueWriteBuffer()
  - ...
OPENCL CONCEPTS
OpenCL                 CUDA equivalent
kernel                 kernel
host program           host program
NDRange (index space)  grid
work item              thread
work group             block
OPENCL THREADS
OpenCL               CUDA equivalent
get_global_id(0)     blockIdx.x * blockDim.x + threadIdx.x
get_local_id(0)      threadIdx.x
get_global_size(0)   gridDim.x * blockDim.x
get_local_size(0)    blockDim.x
OPENCL MEMORY
OpenCL           CUDA equivalent
global memory    global memory
constant memory  constant memory
local memory     shared memory
private memory   local memory
SAMPLE OPENCL KERNEL
// Kernel definition
__kernel void VecAdd(
    __global const float *A,
    __global const float *B,
    __global float *C)
{
    int id = get_global_id(0);
    C[id] = A[id] + B[id];
}
SPIR
- Standard Portable Intermediate Representation
- Distribute device-specific pre-compiled binaries
- SPIR-V incorporated in the core specification of:
  - OpenCL 2.1
  - Vulkan API
  - OpenGL 4.6
Compute Shaders
MOTIVATION
- Why not OpenCL or CUDA?
  - One API for graphics and GP processing
  - Avoid interop
  - Avoid context switches
  - You already know GLSL
- APIs:
  - Core since OpenGL 4.3
  - Part of OpenGL ES 3.1
  - Vulkan
WHERE DOES IT BELONG?
USAGE
- Write a compute shader in GLSL:
  - Define memory resources
  - Write the main() function
- Initialization:
  - Allocate GPU memory (buffers, textures)
  - Compile the shader, link the program
- Run it:
  - Bind buffers, textures, images, uniforms
  - Call glDispatchCompute(...)
Other APIs
DIRECTCOMPUTE
- Part of the Microsoft DirectX collection of APIs
- Needs GPU support for DX10 or DX11
C++ AMP, OPENACC
- C++ Accelerated Massive Parallelism:
  - Open specification from Microsoft
  - Builds on DX11
  - Language extensions, runtime library
  - Heterogeneous computation
- OpenACC:
  - Similar to OpenMP
  - Compiler directives (pragmas) + runtime library
SYCL
- Standard by Khronos
- Cross-platform abstraction layer
- Builds on the underlying concepts, portability and efficiency of OpenCL
- Single-source style using completely standard C++
SAMPLE SYCL CODE
#include <CL/sycl.hpp>
#include <iostream>

int main() {
    using namespace cl::sycl;
    int data[1024];
    // initialize data to be worked on
    // By including all the SYCL work in a {} block, we ensure
    // all SYCL tasks must complete before exiting the block
    {
        queue myQueue;
        buffer<int, 1> resultBuf(data, range<1>(1024));
        // create a command group to issue commands to the queue
        myQueue.submit([&](handler& cgh) {
            // request access to the buffer
            auto writeResult = resultBuf.get_access<access::mode::write>(cgh);
            // enqueue a parallel_for task
            cgh.parallel_for<class simple_test>(range<1>(1024), [=](id<1> idx) {
                writeResult[idx] = idx[0];
            });
        });
    } // end of scope, so we wait for the queued work to complete
    for (int i = 0; i < 1024; i++)
        std::cout << "data[" << i << "] = " << data[i] << std::endl;
    return 0;
}
Thank you for your attention!