realtime computer graphics on gpus

Parallel Algorithms, Deep Neural Networks, OpenCL, Compute Shaders, Other APIs. Realtime Computer Graphics on GPUs, GPGPU II. Jan Kolomazník, Department of Software and Computer Science Education, Faculty of Mathematics and Physics, Charles University in Prague.


Post on 25-Dec-2021


TRANSCRIPT

Page 1: Realtime Computer Graphics on GPUs

Realtime Computer Graphics on GPUs
GPGPU II

Jan Kolomazník

Department of Software and Computer Science Education
Faculty of Mathematics and Physics
Charles University in Prague

Page 2: Realtime Computer Graphics on GPUs

Parallel Algorithms

Page 3: Realtime Computer Graphics on GPUs

INTRODUCTION

- Algorithms for massively parallel architectures:
  - Often bottom-up design
  - Shallow data structures
  - Memory access patterns considered first
- Try to make all operations local only
- Problem reformulation:
  - Search for possible constraints
  - Solve the dual problem
  - Cellular automata
  - ...

Page 4: Realtime Computer Graphics on GPUs

BASIC META-ALGORITHMS

- Map:
  - ForEach (in place?)
  - Transform
- Spatial filters with limited support:
  - Convolution
  - Morphological operations
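
The two patterns on this slide can be sketched on the CPU as follows. This is only an illustration, not code from the lecture: the function names `map_scale` and `convolve1d` are hypothetical, the kernel is assumed to have odd length, and clamping at the borders is an assumption. The point is that every output element is computed independently, so each loop iteration could be one GPU thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Map: each output element depends only on the input element at the same index.
std::vector<float> map_scale(const std::vector<float>& in, float factor) {
    std::vector<float> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [factor](float x) { return x * factor; });
    return out;
}

// Spatial filter with limited support: output i reads only a small
// neighborhood of the input around index i.
// Assumptions: odd kernel length, border indices clamped to the valid range.
std::vector<float> convolve1d(const std::vector<float>& in,
                              const std::vector<float>& kernel) {
    const int n = static_cast<int>(in.size());
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> out(in.size(), 0.0f);
    for (int i = 0; i < n; ++i) {                  // independent per output element
        float acc = 0.0f;
        for (int k = -radius; k <= radius; ++k) {
            const int j = std::min(std::max(i - k, 0), n - 1);
            acc += in[j] * kernel[k + radius];
        }
        out[i] = acc;
    }
    return out;
}
```

With the kernel {0, 1, 0} the filter returns its input unchanged, which is a quick sanity check for the indexing.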

Page 5: Realtime Computer Graphics on GPUs

REDUCE (FOLD)

- Associative binary operation to combine input elements into a single value:
  - Sum
  - Multiplication
  - Min/Max
- On GPU:
  1. Tree-based approach used within each thread block
  2. Global sync (multiple kernels)
  3. Reduce block results
- Optimizations:
  - Prevent thread idling
  - Shared memory access patterns
  - Ref: Mark Harris, Optimizing Parallel Reduction in CUDA (30x speedup)
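
The three GPU steps listed above can be emulated serially to show the data flow. This is a sketch under assumptions, not the implementation from the referenced talk: `reduce_block` and `reduce_sum` are hypothetical names, `block_size` is assumed to be greater than zero, and the loop over `tid` stands in for threads that would run in parallel with a barrier after every step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tree reduction inside one "block": in each step the first `stride` slots
// add their partner at distance `stride` (sequential addressing).
int reduce_block(std::vector<int> shared) {
    if (shared.empty()) return 0;
    std::size_t size = 1;                       // pad to a power of two with the
    while (size < shared.size()) size *= 2;     // identity element of addition
    shared.resize(size, 0);
    for (std::size_t stride = size / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid)   // parallel on a GPU
            shared[tid] += shared[tid + stride];
    }
    return shared[0];
}

// Step 1: reduce every block. Step 2 (global sync) is implicit here because
// the code is serial. Step 3: reduce the per-block results.
int reduce_sum(const std::vector<int>& input, std::size_t block_size) {
    std::vector<int> partial;
    for (std::size_t begin = 0; begin < input.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, input.size());
        partial.push_back(reduce_block(
            std::vector<int>(input.begin() + begin, input.begin() + end)));
    }
    return reduce_block(partial);
}
```

Halving the stride keeps the active threads contiguous, which is one way to limit idle threads within a warp.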

Page 6: Realtime Computer Graphics on GPUs

SCAN (PREFIX-SUM)

- For each processed element, an associative binary operation combines all input elements in front of it
- Inclusive vs. exclusive (whether the element itself is part of its own result)
- Similar implementation and optimizations as reduction
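
The difference between the inclusive and the exclusive variant is easiest to see on a small example. The sketch below uses the Hillis-Steele scheme, chosen here for brevity and not named in the slides; the function names are hypothetical, and the inner loop over `i` represents work that a GPU would do in parallel between barriers.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan, Hillis-Steele style: in the step with offset d, element i
// adds the value d positions to its left. Reading from `in` and writing to
// `out` (double buffering) mimics a barrier between the steps.
std::vector<int> inclusive_scan_steps(std::vector<int> in) {
    std::vector<int> out(in.size());
    for (std::size_t d = 1; d < in.size(); d *= 2) {
        for (std::size_t i = 0; i < in.size(); ++i)   // parallel on a GPU
            out[i] = (i >= d) ? in[i] + in[i - d] : in[i];
        in.swap(out);
    }
    return in;
}

// Exclusive scan: shift the inclusive result right and insert the identity (0).
std::vector<int> exclusive_scan_steps(const std::vector<int>& in) {
    const std::vector<int> inc = inclusive_scan_steps(in);
    std::vector<int> exc(in.size(), 0);
    for (std::size_t i = 1; i < in.size(); ++i) exc[i] = inc[i - 1];
    return exc;
}
```

For the input {3, 1, 7, 0, 4} the inclusive result is {3, 4, 11, 11, 15} and the exclusive result is {0, 3, 4, 11, 11}.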

Page 7: Realtime Computer Graphics on GPUs

EXAMPLE: SELECT ELEMENTS BY INDICATOR

- Task:
  - Output elements selected by predicate
- Naive approach:
  - Adding elements to the output array directly
  - Bottleneck: size update
- Better solution:
  - Two-level solution
  - Store local selection into shared memory
  - Update global output size once per block
- Advanced solution:
  - Run parallel prefix-sum on indicator set (0/1 for each element)
  - Computes indices for all selected elements and final size
  - Final step: store all selected elements on precomputed positions
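
The advanced solution can be written down in three passes. The following is a serial sketch with a hypothetical function name; the scan loop is sequential here, whereas a GPU version would use a parallel prefix-sum, and the first and last loops are the independent per-element steps.

```cpp
#include <cstddef>
#include <vector>

// Select elements by predicate using an exclusive prefix-sum of 0/1 flags.
// The scan gives every selected element its output index and, as a
// by-product, the final output size.
template <typename Pred>
std::vector<int> select_by_indicator(const std::vector<int>& in, Pred pred) {
    const std::size_t n = in.size();
    std::vector<std::size_t> flag(n), index(n);
    for (std::size_t i = 0; i < n; ++i)            // indicator set
        flag[i] = pred(in[i]) ? 1 : 0;

    std::size_t running = 0;                       // exclusive scan of the flags
    for (std::size_t i = 0; i < n; ++i) {
        index[i] = running;
        running += flag[i];
    }

    std::vector<int> out(running);                 // `running` is the final size
    for (std::size_t i = 0; i < n; ++i)            // scatter to precomputed positions
        if (flag[i]) out[index[i]] = in[i];
    return out;
}
```

Because every selected element already knows its position, the scatter step needs no shared counter, and the output keeps the input order.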

Page 8: Realtime Computer Graphics on GPUs

EXAMPLE: INTEGRAL IMAGES

- Apply prefix-sum to rows and columns
- Sum/average queries for rectangular regions with constant complexity
- Used in feature detectors:
  - Haar features
  - Blur filter approximation
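
A minimal CPU sketch of the idea follows. The names `integral_image` and `rect_sum` are hypothetical, and the extra zero row and column are a convenience chosen here so that the query needs no special cases at the image border.

```cpp
#include <cstddef>
#include <vector>

// Integral image with one extra zero row and column:
// S[y][x] holds the sum of all pixels img[j][i] with j < y and i < x.
std::vector<std::vector<long>> integral_image(
        const std::vector<std::vector<int>>& img) {
    const std::size_t h = img.size();
    const std::size_t w = h ? img[0].size() : 0;
    std::vector<std::vector<long>> S(h + 1, std::vector<long>(w + 1, 0));
    for (std::size_t y = 0; y < h; ++y)            // prefix-sum along each row
        for (std::size_t x = 0; x < w; ++x)
            S[y + 1][x + 1] = S[y + 1][x] + img[y][x];
    for (std::size_t x = 1; x <= w; ++x)           // prefix-sum along each column
        for (std::size_t y = 1; y <= h; ++y)
            S[y][x] += S[y - 1][x];
    return S;
}

// Sum over the rectangle [x0, x1) x [y0, y1) from four lookups,
// independent of the rectangle size.
long rect_sum(const std::vector<std::vector<long>>& S,
              std::size_t x0, std::size_t y0, std::size_t x1, std::size_t y1) {
    return S[y1][x1] - S[y0][x1] - S[y1][x0] + S[y0][x0];
}
```

Dividing the result of `rect_sum` by the rectangle area gives the average, which is the basis of the box-blur approximation.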

Page 9: Realtime Computer Graphics on GPUs

DATA STRUCTURES

- Basic categories:
  - Dense arrays
  - Sparse structures
    - Matrices
    - Graphs
  - Hash tables
- Different criteria:
  - Read/write
  - Space waste
  - How many threads are writing
- Two-level design:
  - Local data structure living in shared memory
  - Main data structure living in global memory
  - Write local instances at the end of computation
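
A histogram is a common case where the two-level design pays off, so it is used here as an illustration; the slides do not name a specific example. In this serial sketch the bin count `kBins` and the function name are hypothetical, `block_size` is assumed to be greater than zero, and the local array stands in for shared memory of one thread block.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBins = 4;  // hypothetical bin count for the sketch

// Two-level histogram: every block fills a small local histogram and merges
// it into the global one only once, at the end of its computation.
std::array<int, kBins> histogram_two_level(const std::vector<std::uint8_t>& data,
                                           std::size_t block_size) {
    std::array<int, kBins> global{};                    // "global memory"
    for (std::size_t begin = 0; begin < data.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, data.size());
        std::array<int, kBins> local{};                 // "shared memory" of one block
        for (std::size_t i = begin; i < end; ++i)
            ++local[data[i] % kBins];
        for (std::size_t b = 0; b < kBins; ++b)         // one write-back per block;
            global[b] += local[b];                      // an atomic add on a real GPU
    }
    return global;
}
```

The number of writes to the global structure then depends on the number of blocks and bins, not on the number of input elements.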

Page 10: Realtime Computer Graphics on GPUs

Deep Neural Networks

Page 11: Realtime Computer Graphics on GPUs

DEEP NEURAL NETWORKS

- Another neural networks renaissance
- Large neural networks with lots of layers
  - Convolutional networks
  - Large numbers of identical neurons: highly parallel by nature
  - Backpropagation
    - Millions of parameters
    - Large training set
- Training vs. inference

Page 12: Realtime Computer Graphics on GPUs

EXAMPLE

Page 13: Realtime Computer Graphics on GPUs

CUDNN

- GPU-accelerated library of primitives for deep neural networks
- Used to speed up DNN frameworks like:
  - TensorFlow
  - Caffe
  - PyTorch
  - Keras
  - MATLAB
  - ...
- Special, highly optimized implementations for selected common cases

Page 14: Realtime Computer Graphics on GPUs

TENSORRT

- Also by NVIDIA
- Inference optimization
  - Significantly less computationally demanding than training
  - Deployed on embedded systems: memory constraints
- Change network topology without sacrificing inference precision
- Use lower numerical precision
  - Less space occupied by weights
  - Usage of tensor cores

Page 15: Realtime Computer Graphics on GPUs

TENSOR CORES

- New in Volta architecture
- Accelerate matrix problems of the form D = A * B + C
  - Single core works on 4x4 matrices
  - fp16 precision for multiplication
  - fp16 or fp32 for accumulation
- Warp matrix functions
  - Special calls for specific matrix sizes
- Significant speedup of NN training and inference
- Denoising in raytracing APIs
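
To make the operation D = A * B + C concrete, here is a plain C++ reference for the 4x4 case. It is only a sketch of the arithmetic: `Mat4`, `identity4` and `mma4x4` are hypothetical names, everything is computed in `float`, and neither the fp16 inputs nor the warp-level API of real tensor cores are modelled.

```cpp
#include <array>
#include <cstddef>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Helper for the sketch: 4x4 identity matrix.
Mat4 identity4() {
    Mat4 I{};                                   // value-initialized to zeros
    for (std::size_t i = 0; i < 4; ++i) I[i][i] = 1.0f;
    return I;
}

// Computes D = A * B + C for 4x4 matrices. The accumulator starts from C,
// and every product A[i][k] * B[k][j] is added on top of it.
Mat4 mma4x4(const Mat4& A, const Mat4& B, const Mat4& C) {
    Mat4 D = C;
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                D[i][j] += A[i][k] * B[k][j];
    return D;
}
```

Larger matrix products are built by tiling the operands into such small blocks and chaining the accumulator from one tile product to the next.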

Page 16: Realtime Computer Graphics on GPUs

OpenCL

Page 17: Realtime Computer Graphics on GPUs

OPENCL

- Alternative to C for CUDA
- Basic idea from C for CUDA, 1:1 equivalence in some parts
- Programming model for execution of massively parallel tasks on CPU, GPU, Cell, ...
- Language:
  - Originally subset of C99
  - Subset of C++14 in OpenCL 2.1
  - Just-in-time compilation

Page 18: Realtime Computer Graphics on GPUs

HOST CODE

- Code structure similar to shader programming
- API:
  - clBuildProgram()
  - clCreateCommandQueue()
  - clCreateBuffer()
  - clEnqueueWriteBuffer()
  - ...

Page 19: Realtime Computer Graphics on GPUs

OPENCL CONCEPTS

OpenCL                   CUDA equivalent
kernel                   kernel
host program             host program
NDRange (index space)    grid
work item                thread
work group               block

Page 20: Realtime Computer Graphics on GPUs

OPENCL THREADS

OpenCL                   CUDA equivalent
get_global_id(0)         blockIdx.x * blockDim.x + threadIdx.x
get_local_id(0)          threadIdx.x
get_global_size(0)       gridDim.x * blockDim.x
get_local_size(0)        blockDim.x
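
The first row of the table can be checked with a small serial emulation of a 1D launch. This is a sketch with a hypothetical function name; it assumes a zero global offset, so the global ID is exactly group index times local size plus local ID.

```cpp
#include <cstddef>
#include <vector>

// Emulates a 1D launch serially: the outer loop walks work groups (CUDA
// blocks), the inner loop walks work items (CUDA threads) inside a group.
// Returns the global ID computed for every work item, in launch order.
std::vector<std::size_t> enumerate_global_ids(std::size_t num_groups,
                                              std::size_t local_size) {
    std::vector<std::size_t> ids;
    ids.reserve(num_groups * local_size);
    for (std::size_t group = 0; group < num_groups; ++group)       // blockIdx.x
        for (std::size_t local = 0; local < local_size; ++local)   // threadIdx.x
            ids.push_back(group * local_size + local);             // get_global_id(0)
    return ids;
}
```

Two groups of three work items yield the IDs 0 to 5, so every work item gets a unique index into the data.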

Page 21: Realtime Computer Graphics on GPUs

OPENCL MEMORY

OpenCL                   CUDA equivalent
global memory            global memory
constant memory          constant memory
local memory             shared memory
private memory           local memory

Page 22: Realtime Computer Graphics on GPUs

SAMPLE OPENCL KERNEL

    // Kernel definition
    __kernel void VecAdd(
        __global const float *A,
        __global const float *B,
        __global float *C)
    {
        // One work item handles one element of the vectors
        int id = get_global_id(0);
        C[id] = A[id] + B[id];
    }

Page 23: Realtime Computer Graphics on GPUs

SPIR

- Standard Portable Intermediate Representation
- Distribute device-specific pre-compiled binaries
- SPIR-V incorporated in the core specification of:
  - OpenCL 2.1
  - Vulkan API
  - OpenGL 4.6

Page 24: Realtime Computer Graphics on GPUs

Compute Shaders

Page 25: Realtime Computer Graphics on GPUs

MOTIVATION

- Why not OpenCL or CUDA?
  - One API for graphics and GP processing
  - Avoid interop
  - Avoid context switches
  - You already know GLSL
- APIs:
  - Core since OpenGL 4.3
  - Part of OpenGL ES 3.1
  - Vulkan

Page 26: Realtime Computer Graphics on GPUs

WHERE DOES IT BELONG?

Page 27: Realtime Computer Graphics on GPUs

USAGE

- Write compute shader in GLSL
  - Define memory resources
  - Write main() function
- Initialization
  - Allocate GPU memory (buffers, textures)
  - Compile shader, link program
- Run it
  - Bind buffers, textures, images, uniforms
  - Call glDispatchCompute(...)
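
glDispatchCompute takes the number of work groups per dimension, not the number of items, so the host usually has to round up. The helper below is a small sketch with a hypothetical name; it assumes `local_size` is greater than zero, matches the local size declared in the shader, and that `items + local_size` does not overflow 32 bits.

```cpp
#include <cstdint>

// Number of work groups needed so that groups * local_size covers `items`.
// The result would be passed as one of the group counts to glDispatchCompute;
// the shader then has to skip invocations whose index is >= items.
std::uint32_t group_count(std::uint32_t items, std::uint32_t local_size) {
    return (items + local_size - 1) / local_size;   // integer ceiling division
}
```

For 1025 items and a local size of 64 this gives 17 groups, with the last group mostly idle.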

Page 28: Realtime Computer Graphics on GPUs

Other APIs

Page 29: Realtime Computer Graphics on GPUs

DIRECTCOMPUTE

- Part of Microsoft DirectX collection of APIs
- Needs GPU support for DX10 or DX11

Page 30: Realtime Computer Graphics on GPUs

C++ AMP, OPENACC

- C++ Accelerated Massive Parallelism:
  - Open specification from Microsoft
  - Builds on DX11
  - Language extensions, runtime library
  - Heterogeneous computation
- OpenACC:
  - Similar to OpenMP
  - Compiler directives (pragmas) + runtime library
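
The directive-based style of OpenACC can be illustrated with a SAXPY loop. This is a sketch and not code from the lecture: the function name and its by-value return are choices made here, and the data clauses assume the two raw pointers address at least `n` elements. A compiler without OpenACC support ignores the unknown pragma (possibly with a warning) and runs the loop serially, which gives the same result.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns a * x + y, element by element. The loop body is ordinary C++;
// only the directive asks an OpenACC compiler to offload it.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         std::vector<float> y) {
    const std::size_t n = std::min(x.size(), y.size());
    const float* xp = x.data();
    float* yp = y.data();
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
    return y;
}
```

As with OpenMP, the source stays a valid serial program, which is the main contrast to the kernel-based APIs earlier in the deck.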

Page 31: Realtime Computer Graphics on GPUs

SYCL

- Standard by Khronos
- Cross-platform abstraction layer
- Builds on the underlying concepts, portability and efficiency of OpenCL
- Single-source style using completely standard C++

Page 32: Realtime Computer Graphics on GPUs

SAMPLE SYCL CODE

    #include <CL/sycl.hpp>
    #include <iostream>

    int main() {
        using namespace cl::sycl;
        int data[1024]; // initialize data to be worked on

        // By including all the SYCL work in a {} block, we ensure
        // all SYCL tasks must complete before exiting the block
        {
            queue myQueue;
            buffer<int, 1> resultBuf(data, range<1>(1024));

            // create a command group to issue commands to the queue
            myQueue.submit([&](handler& cgh) {
                // request access to the buffer
                auto writeResult = resultBuf.get_access<access::mode::write>(cgh);

                // enqueue a parallel for task
                cgh.parallel_for<class simple_test>(range<1>(1024), [=](id<1> idx) {
                    writeResult[idx] = idx[0];
                });
            });
        } // end of scope, so we wait for the queued work to complete

        for (int i = 0; i < 1024; i++)
            std::cout << "data[" << i << "] = " << data[i] << std::endl;
        return 0;
    }

Page 33: Realtime Computer Graphics on GPUs

Thank you for your attention!