GPU Programming Workshop (NUS Extension)

Solving problems with thousands of “CPU”s

18-19 November 2010

Instructor: Dr Tan Tiow Seng ([email protected]), http://www.comp.nus.edu.sg/~tants

Assistants: Cao Thanh Tung, Tang Ke

Session (1 of 8)
Day 1, Morning, 9:00am – 10:30am - Introduction

1. Why do we need GPU (thousands of “CPUs”)?

2. Demo of GPU Examples

3. From Sequential to Parallel Computational Thinking

4. How to work with GPU?
- CPU “talks” to GPUs through global memory
- GPUs “perform” the work with kernels (CUDA, an extension of C/C++)
- CUDA: 1st PP – Vector Addition

5. CUDA “Concepts” of threads, blocks and grid

Break: 10:30am – 10:50am

1. Why do we need GPU?

CPU vs GPU Performance Animation from 2000 till 2009: http://www.youtube.com/watch?v=eH96JE-CnHw

Fermi architecture discussion (video): http://www.nvidia.com/object/fermi_architecture.html

1. Why do we need GPU?
CPU: Moore’s Law
GPU: Moore’s Law Cubed (?)

Schematic difference between CPU and GPU
Interview with David Kirk: http://www.gamespot.com/news/2836326.html

Analogy: Developed Country Style vs Developing Country Style

Who is winning (country) currently? Tianhe-1A, 2.507 petaflops: http://gpgpu.org/2010/10/28/nvidia-tesla-gpus-power-worlds-fastest-supercomputer#more-2931

It is all about Performance & Scalability
Scalability is also key to power efficiency

“Development” (Programming) is to be done differently

Key Techniques:
- To regularize work & data for massively parallel execution
- To localize data for conserving memory bandwidth

Key Hurdles:
- Serialization due to conflicting updates
- Oversubscription of resources
- Load imbalance among threads

http://www.gpucomputing.net/?q=node/2376

2. Demo of GPU Examples
Graphics with GPU
- Shadow computation
- Real-time rendering
- Maya©, Photoshop©, etc.

General Purpose GPU (www.gpgpu.org)
- Matlab©
- Geometric Computation
- Wen-mei Hwu’s talk: http://www.gpucomputing.net/?q=node/2376

David Kirk & Wen-mei Hwu: More parallel computers were sold in the recent year than in the past 30 years.

2. Demo of GPU Examples
Examples in CUDA SDK
- Particle system
- Image denoising
- Smoke particle

G3 Research Lab
- Shadow computation video
- Jump Flooding Algorithm (JFA): Voronoi Diagram video
- Parallel Banding Algorithm (PBA) video
- 2D Delaunay Triangulation (gpuDT) video
- 3D gHull video
- 3D gpu3DT

http://www.comp.nus.edu.sg/~tants

3. From Sequential to Parallel Computational Thinking

Direct translation from sequential to parallel threads
- Example problem: known input to known output (location) … need gather rather than scatter (Vector addition)
- Example problem: known input to unknown output (location) … need privatization (Histogram)

More than the above… one possible approach
(1) Sequential solution
(2) Sequential solution with the right data structure
(3) Parallel solution
(4) Parallel solution with work (algorithmic) optimization
(5) Parallel solution with hardware consideration

Possibility of a different sequential thinking
- Any algorithmic paradigms
- Will revisit this point at the end

http://www.gpucomputing.net/?q=node/2376
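The histogram case above (known input, unknown output location) can be sketched as follows. This is a minimal illustration of privatization, not the workshop's own code: the kernel name histogramPrivatized, the helper runHistogramDemo, the bin count NUM_BINS and the launch sizes are all assumptions made for the example. Each block keeps a private histogram in shared memory and merges it into the global one at the end. Atomics on shared memory need compute capability 1.2 or higher.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 256   // hypothetical: one bin per possible byte value

// Each block accumulates into its own private histogram in shared memory,
// then merges that private copy into the global histogram.
__global__ void histogramPrivatized(const unsigned char *data, int n,
                                    unsigned int *hist)
{
    __shared__ unsigned int localHist[NUM_BINS];

    // Clear the block-private histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    // The output bin depends on the data, so updates must be atomic.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&localHist[data[i]], 1u);
    __syncthreads();

    // Merge the private histogram into the global one.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&hist[b], localHist[b]);
}

// Returns 1 if every bin holds the expected count, 0 otherwise.
int runHistogramDemo()
{
    const int n = 1 << 20;
    unsigned char *h_data = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % NUM_BINS);

    unsigned char *d_data;
    unsigned int *d_hist;
    cudaMalloc((void **)&d_data, n);
    cudaMalloc((void **)&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));

    histogramPrivatized<<<64, 256>>>(d_data, n, d_hist);

    unsigned int h_hist[NUM_BINS];
    cudaMemcpy(h_hist, d_hist, sizeof(h_hist), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int b = 0; b < NUM_BINS; ++b)
        if (h_hist[b] != (unsigned int)(n / NUM_BINS)) ok = 0;

    cudaFree(d_data); cudaFree(d_hist); free(h_data);
    return ok;
}

int main()
{
    int ok = runHistogramDemo();
    printf("%s\n", ok ? "histogram OK" : "histogram MISMATCH");
    return ok ? 0 : 1;
}
```

Privatization reduces contention: most atomic updates hit fast shared memory, and only NUM_BINS updates per block reach global memory.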

3. Differences in Computational Thinking
Sequential Computational Thinking
- Do not care much about the “machine” (in general)
- Avoid any wastage of computing at all times
- Just do it; there is no one to cooperate with, and work that has to be done still has to be done. Nothing to complain about: just one person on the project
- Optimization should be invariant with time…

Multi-core Parallel Computational Thinking
- Still ok to ignore the underlying architecture
- Redundant work still undesirable
- Cooperation between “workers” becomes a problem: who does what, avoiding overlapping/conflicting interests (work balancing)

Parallel Computational Thinking
- Understanding every bit of the “machine” is important. Resources are shared by many, so one needs to know the underlying “rules” for best utilization
- Redundant computations are acceptable for the overall performance of the “team”
- How to divide the work becomes a significant problem (work balancing); cooperation in a team with thousands of people is not easy. Communication adds another layer of complexity. More people does not necessarily mean faster
- An optimization that holds today may not hold “tomorrow”

3. Differences in Computational Thinking
Difficulties in Massively Multi-threaded Programming

1. Indexing mechanism – mapping of data to “computer” (learn throughout the course)
- Total “global store”
- “global store” in “tile form”
- # of threads in a block to handle “global store”
- # of threads in a block may form a tile of a different size from the tile of the global store

2. Memory (hardware) consideration (Session #5)
- Shared memory with no bank conflict
- Global memory access in a coalesced manner

3. Program correctness without race condition (Session #6)

4a. CPU (HOST) Talking to GPU (DEVICE)
Pass Input
- Write to Global Memory
  - Ask GPU where to put the data
  - Put the data there
- (more to come)…

Get Output
- Copy from Global Memory
  - Know where to pick up the data
- Leave it within device for subsequent computation

Ask for the work to be done (outsourcing)
- Launching Kernel

4a. What is CUDA
“Compute Unified Device Architecture”
NVIDIA’s solution to parallel computation (for G80 series onward)
- Parallel programming model
- Massively multithreaded hardware

GPU = dedicated many-core co-processor
General purpose programming model
- User launches batches of threads on the GPU
- Fully general load/store memory model (concurrent read concurrent write)
- Simple extension to standard C/C++
- Mature(d) software stack (high-level and low-level access)

4a. What is CUDA
At its core are three key abstractions exposed to programmers:
- A hierarchy of thread groups
- Shared memories, and
- Barrier synchronization

The abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition:
- the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and
- each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability.

Difficulty to us: NOT ALL BLOCKS ARE IN THE COMPUTATION AT THE SAME TIME.

4a. Talking through Global Memory
GPU memory allocation / release
- cudaMalloc(void **pointer, size_t nbytes);
- cudaMemset(void *pointer, int value, size_t count);
- cudaFree(void *pointer);

Data copy
- cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
- direction specifies locations (host or device) of src and dst
- Blocking (there is a corresponding non-blocking version)
- Doesn’t start copying until previous CUDA calls complete

enum cudaMemcpyKind
- cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
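The calls listed above can be exercised together in a small round trip. The sketch below is only an illustration under assumptions: the helper names check and roundTrip are made up for this example, and the error handling is one possible style. It copies data host to device, device to device, and back to the host, then compares the result with the input.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

// Returns 1 if the data survives host -> device -> device -> host copies.
int roundTrip(int n)
{
    size_t nbytes = n * sizeof(int);
    int *h_in  = (int *)malloc(nbytes);
    int *h_out = (int *)malloc(nbytes);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_a, *d_b;
    check(cudaMalloc((void **)&d_a, nbytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, nbytes), "cudaMalloc d_b");
    check(cudaMemset(d_b, 0, nbytes), "cudaMemset d_b");

    // Each cudaMemcpy blocks the host until the copy has finished.
    check(cudaMemcpy(d_a, h_in, nbytes, cudaMemcpyHostToDevice), "copy H2D");
    check(cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice), "copy D2D");
    check(cudaMemcpy(h_out, d_b, nbytes, cudaMemcpyDeviceToHost), "copy D2H");

    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ok = 0;

    check(cudaFree(d_a), "cudaFree d_a");
    check(cudaFree(d_b), "cudaFree d_b");
    free(h_in); free(h_out);
    return ok;
}

int main()
{
    int ok = roundTrip(1024);
    printf("%s\n", ok ? "round trip OK" : "round trip MISMATCH");
    return ok ? 0 : 1;
}
```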

4b. Launching Kernels
Modified C/C++ function call syntax

    kernel<<<dim3 grid, dim3 block, int smem, int stream>>>(…)

Execution Configuration (“<<< >>>”)
- grid: grid dimensions x and y
- block: thread-block dimensions x, y, and z
- smem: dynamic shared memory size, in bytes per block (optional, 0 by default)
- stream: stream ID (optional, 0 by default)

4c. CUDA : 1st PP – Vector Addition (1 of 3)

CUDA kernel (device code):

    __global__ void VecAdd( float *A, float *B, float *C )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < N ) C[i] = A[i] + B[i];
    }

CUDA C program – main part (host code):

    int main()
    {
        ......
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C );
    }

4c. CUDA : 1st PP – Vector Addition (2 of 3)

    #include <cutil_inline.h>
    #define N 204800

    // Device code
    __global__ void VecAdd( float *A, float *B, float *C )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ( i < N ) C[i] = A[i] + B[i];
    }

    // Host code
    int main()
    {
        // Allocate vectors in device memory
        size_t size = N * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc( (void**)&d_A, size );
        cudaMalloc( (void**)&d_B, size );
        cudaMalloc( (void**)&d_C, size );

4c. CUDA : 1st PP – Vector Addition (3 of 3)

        float* h_A = (float *) malloc ( size );
        float* h_B = (float *) malloc ( size );
        float* h_C = (float *) malloc ( size );
        ... add code to initialize h_A and h_B

        // Copy vectors from host memory to device memory
        cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
        cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );

        // Invoke kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
        VecAdd<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C );

        // Copy result from device memory to host memory
        // h_C contains the result in host memory
        cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost );

        // Free device memory
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }

Demo: Study MS Visual Studio Project on Vector Addition, different block size, timing with/out copying etc.
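One way to try the timing part of the demo outside Visual Studio is with CUDA events. The following is a sketch under assumptions, not the workshop project: the function timedVecAdd and its parameters are hypothetical, and the choice to time with events (rather than a host timer) is ours. It reports the kernel-only time and the total time including the copies, for several block sizes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 204800

__global__ void VecAdd(const float *A, const float *B, float *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Runs one vector addition; reports kernel-only and total (with copies)
// times in milliseconds. Returns 1 if the result is correct.
int timedVecAdd(int threadsPerBlock, float *kernelMs, float *totalMs)
{
    size_t size = N * sizeof(float);
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaEvent_t start, copiedIn, kernelDone, stop;
    cudaEventCreate(&start);      cudaEventCreate(&copiedIn);
    cudaEventCreate(&kernelDone); cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaEventRecord(copiedIn, 0);

    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C);
    cudaEventRecord(kernelDone, 0);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until all recorded work is done

    cudaEventElapsedTime(kernelMs, copiedIn, kernelDone);
    cudaEventElapsedTime(totalMs, start, stop);

    int ok = 1;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != 3.0f * i) ok = 0;   // exact for these small integers

    cudaEventDestroy(start);      cudaEventDestroy(copiedIn);
    cudaEventDestroy(kernelDone); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return ok;
}

int main()
{
    int sizes[] = { 64, 128, 256, 512 };
    int allOk = 1;
    for (int s = 0; s < 4; ++s) {
        float kernelMs = 0.0f, totalMs = 0.0f;
        int ok = timedVecAdd(sizes[s], &kernelMs, &totalMs);
        printf("block size %3d: kernel %.3f ms, with copies %.3f ms, %s\n",
               sizes[s], kernelMs, totalMs, ok ? "OK" : "MISMATCH");
        allOk = allOk && ok;
    }
    return allOk ? 0 : 1;
}
```

The measured times depend on the card and driver, so no expected numbers are given here.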

4c. CUDA : Matrix Addition (2D case)

CUDA kernel (device code):

    __global__ void addMatrixG( float *a, float *b, float *c, int N )
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j * N;
        if ( i < N && j < N )
            c[index] = a[index] + b[index];
    }

CUDA C program – main part (matrix N x N, host code):

    int main()
    {
        ......
        dim3 dimBlk( 16, 16 );
        dim3 dimGrd( (N-1)/dimBlk.x + 1, (N-1)/dimBlk.y + 1 );
        addMatrixG<<<dimGrd, dimBlk>>>( a, b, c, N );
    }

5. CUDA : Thread Hierarchy
A kernel is executed by a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by
- Sharing data through shared memory
- Synchronizing their execution

Threads from different blocks cannot synchronize (within a kernel)
- Some blocks may not yet have been created when others have already finished their computation

5. CUDA : Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data
- Image processing
- Solving PDEs on volumes
- …

Session (2 of 8)
Day 1, Morning, 10:50am – 12:30pm – Get it going

5. CUDA basic concepts (organizing the resources)
- Threads, blocks and grid

6. Hands-on #1: 2nd PP – image processing
- Learn to deal with texture/display

7. Working with Microsoft Visual Studio
- Learn to create project to work from any computer

8. Hands-on #2: 3rd PP – Voronoi Diagram (part 1)
- Learn to create a (quick) solution from scratch

Lunch break: 12:30pm – 2:00pm

5. CUDA at work
[Figure: Host and Device. The device has Global Memory, Shared memory, Texture, and registers as working space]

5. CUDA at work
For the host:
- How “best” to pass the data to the device
  - Options of global memory, constant memory, function parameters (and CUDA array, texture)
- How “best” to get the work organized (in passes) to the device
  - Options to do filtering before actual work
  - Options to cut work into pieces to be done
  - Option to do the work yourself instead of outsourcing

For the device:
- How “best” to deploy the resources to get the work done
  - Options to organize into a grid of up to 2D blocks, each of up to 3D threads
  - Options to cooperate among threads through shared memory
  - Options to cooperate across blocks through global memory
  - Options on when to move data around for computation

5. CUDA
Comes with
- Hardware architecture
- Programming model
- Programming interface
  - C/C++ for CUDA (extension to the C/C++ language)
  - CUDA Driver API
- Software development environment
  - Tools (compiler, debugger, profiler), C/C++ runtime, libraries

Only works on NVIDIA G80 (GeForce 8000 series) and newer cards
Designed to scale well over time
Meant to be more general than just graphics cards

5. CUDA Programming Model
Some design goals
- Transparently scales to hundreds of cores and thousands of parallel threads
  - Allows software to run efficiently across varying hardware
- Let programmers focus on parallel algorithms
  - Not mechanics of a parallel programming language
  - C/C++ for CUDA plus runtime API
- Enable heterogeneous systems (i.e. CPU + GPU)
  - CPU and GPU are separate devices with separate DRAMs

5. CUDA : Kernels and Threads
Definitions
- Device = GPU
- Host = CPU
- Kernel = function that runs on the device

Parallel portions of an application are executed on the device as kernels
- One kernel is executed at a time (Fermi can run multiple kernels concurrently, using stream numbers other than 0)
- Many threads execute the same kernel (on different data)

CUDA threads are extremely lightweight
- Very little creation overhead
- Instant switching

CUDA uses thousands of threads to achieve efficiency

5. CUDA : Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
- All threads run the same code
- SPMD model – Single Program, Multiple Data

Each thread has an ID to identify itself so that it knows what computation or control to work on (e.g. address computation)

threadID: 0 1 2 3 4 5 6 7

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

5. CUDA : Transparent Scalability
Hardware is free to schedule thread blocks to any processor
- A kernel scales across any number of parallel multiprocessors

[Figure: a kernel grid of 8 blocks (Block 0 to Block 7) scheduled over time on a device that runs 2 blocks at a time and on a device that runs 4 blocks at a time]

Each block can execute in any order relative to other blocks.

5. CUDA : Thread Hierarchy (duplicate)

A kernel is executed by a grid of thread blocks

A thread block is a batch of threads that can cooperate with each other by
- Sharing data through shared memory
- Synchronizing their execution

Threads from different blocks “cannot” cooperate

5. CUDA : Block IDs and Thread IDs (duplicate)

Each thread uses IDs to decide what data to work on
- Block ID: 1D or 2D
- Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data
- Image processing
- Solving PDEs on volumes
- …

5. CUDA : Execution Model
Kernels are launched in grids
- One kernel executes at a time (not true for Fermi)

A block executes on one multiprocessor
- Does not migrate

Several blocks can reside concurrently on one multiprocessor
- Control limitations (of G8X/G9X GPUs; 128 Scalar Processors grouped into 16 Streaming Multiprocessors, SMs)
  - At most 8 concurrent blocks per SM
  - At most 768 concurrent threads per SM
- Number is further limited by SM resources
  - Register file is partitioned among all resident threads
  - Shared memory is partitioned among all resident thread blocks

5. CUDA : Device Code
C/C++ functions with some restrictions
- Can only access GPU memory (except for Zero Copy feature)
- No variable number of arguments (“varargs”)
- No static variable

Function qualifiers
- __global__
  - Function is a kernel
  - Must have void return type
  - A call to function must specify execution configuration
- __device__
  - Only for the device to call
- __host__ (default)
  - Only for the host to call
- __host__ and __device__ can be used together

5. CUDA : Built-In Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
- dim3 gridDim;
  - Dimensions of the grid in blocks (gridDim.z must be 1)
- dim3 blockDim;
  - Dimensions of the block in threads
- uint3 blockIdx;
  - Block index within the grid
- uint3 threadIdx;
  - Thread index within the block

5. CUDA : Variable Qualifiers (GPU Code)
__device__
- Stored in global memory (large, high latency, no cache)

__shared__
- Stored in on-chip shared memory (very low latency)
- Accessible concurrently by all threads in the same thread block

Unqualified variables
- Scalars and built-in vector types are stored in registers
- Arrays of more than 4 elements or run-time indices stored in local memory
- Limited per thread (16KB – 32KB for old architecture, 512KB for Fermi)

__constant__ (not in kernel; limited in size, about 64KB; global)
- Same as __device__, but cached and read-only by GPU
- Written by CPU via cudaMemcpyToSymbol()
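As an illustration of the __constant__ qualifier, the sketch below keeps a small coefficient table in constant memory and evaluates a polynomial for every input. The names d_coeffs, evalPoly and runConstantDemo, and the polynomial itself, are assumptions made for this example only. All threads read the same coefficient at the same time, which is the uniform access pattern constant memory is cached for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_COEFFS 4

// Declared at file scope (not inside a kernel); read-only on the device.
__constant__ float d_coeffs[NUM_COEFFS];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;                     // unqualified scalar: a register
        for (int k = NUM_COEFFS - 1; k >= 0; --k)
            acc = acc * x[i] + d_coeffs[k];   // same coefficient for all threads
        y[i] = acc;
    }
}

// Returns 1 if the device results match the host computation.
int runConstantDemo()
{
    const int n = 8;
    float h_coeffs[NUM_COEFFS] = { 1.0f, 2.0f, 3.0f, 4.0f };
    float h_x[n], h_y[n];
    for (int i = 0; i < n; ++i) h_x[i] = (float)i;

    // The host writes constant memory through the symbol.
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    evalPoly<<<1, n>>>(d_x, d_y, n);
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int i = 0; i < n; ++i) {
        float expected = 0.0f;
        for (int k = NUM_COEFFS - 1; k >= 0; --k)
            expected = expected * h_x[i] + h_coeffs[k];
        if (h_y[i] != expected) ok = 0;       // small integers, so exact
    }
    cudaFree(d_x); cudaFree(d_y);
    return ok;
}

int main()
{
    int ok = runConstantDemo();
    printf("%s\n", ok ? "constant memory OK" : "constant memory MISMATCH");
    return ok ? 0 : 1;
}
```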

6. Hands-on #1: 2nd PP – Process Image
.cpp and .cu programs are given
- Briefing on reading image (as array of imgWidth * imgHeight * 4 GLbyte)
- Briefing on OpenGL texture, texture swapping for display
- Briefing on GLUT for keyboard input

Task 1: Get the application working (see also Section #7)
- Include the necessary CUDA build rules to get .cu program to be compiled
- Include cudart.lib to perform linking
- Include correct path for CUDA stuff, e.g. $(CUDA_LIB_PATH)
- Execute and run the program

Task 2: Perform, if you like, other filtering of the image
- For example, remove red, or average surrounding pixels, etc.

Option: 1 (original) Option: 2 (inverted)

6. Hands-on #1: 2nd PP – Process Image
OpenGL / Glut Program

[Figure: call structure of the program. CPU side: main, glutMainLoop, glut_mouseMotion, glut_mouse, glut_keyboard, glutSwapBuffers, invertImage, cudaInvertImage (allocate memory etc. to execute kernel). GPU side: KernelInvertImage]

7. Working with Microsoft VS
Download & install CUDA Toolkit (for run time) + SDK (for development) from Developer Zone (once)
- http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html

To create a new Win32 Console Application with Visual Studio:
- File > New > Project. Select Win32 Console Application
- Enter MyApp as the name of the project. Under Application Settings, check Empty Project

To make sure nvcc compiles your .cu file (before CUDA ver 3.2)
- Assume Cuda.rules is somewhere in your system (C:\Program Files\Microsoft Visual Studio 9.0\VC\VCProjectDefaults).
- Right click on MyApp in the Solution Explorer, choose Custom Build Rules, find existing Cuda.rules files to be included, and then check the Cuda.rules file.

To exclude (sometimes) some .cu files from build:
- Right click on it in the Solution Explorer, and select Properties
- Under General, set Excluded From Build to Yes

7. Working with Microsoft VS
To configure the project properties
- Select Project > MyApp Properties from the main menu
- Select All Configurations for Configuration
- Fill in Configuration Properties > Linker > General > Additional Library Directories with $(CUDA_LIB_PATH) (and any other paths, such as “$(NVSDKCOMPUTE_ROOT)/C/common/lib ; ..\lib”)
- Fill in Configuration Properties > Linker > Input > Additional Dependencies with cudart.lib (and other libraries if needed, for example, cutil32D.lib cutil32.lib glut32.lib)
- Fill in Configuration Properties > CUDA Build Rule vX.X.0 > General with “sm_11” (or higher) instead of “sm_10” in GPU Architecture (in case you use atomic operations)

7. CUDA Configuration
You ask and we try to answer

7. CUDA Programming Interfaces
CUDA driver API
- Low-level C/C++ API to load compiled kernels, inspect their parameters, and to launch them
- Kernels are written in C/C++ and compiled to CUDA binary or assembly code
- Requires more code, harder to program and debug

C/C++ for CUDA
- Minimal set of extensions to the C/C++ language
- Kernels defined as C/C++ functions embedded in application source
- Requires a runtime API (built on top of CUDA driver API)

7. CUDA Layers
[Figure: CUDA layers, with cudart.lib]

7. Compiling a CUDA Program
[Figure: compiling a CUDA program]

7. Compilation
Any source file containing CUDA language extensions must be compiled with NVCC

NVCC is a compiler driver
- Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...

NVCC outputs
- C/C++ code (host CPU code)
  - Must then be compiled with the rest of the application using another tool
- PTX (Parallel Thread Execution)
  - Object code directly
  - Or, PTX source, interpreted at runtime

8. Hands-on #2: 3rd PP – Voronoi Diagram
[Figure: Input and Output]

8. Hands-on #2: 3rd PP – Voronoi Diagram
Input: A collection of numPoints colored 2D sites

Output: A map that colors the 2D (digital) space into various regions that “belong” to the sites

Processing:

    for each point p in the 2D space (height x width)
        record the distance from p to site[0] as the min distance
        take note of index 0 achieving the min distance
        for each site from 1 to (numPoints-1)
            if (distance from p to this site is shortest so-far)
                record this distance as the min distance
                take note of this index for p
            endIf
        endFor
        record the closest site for p
    endFor

Remarks:
1. Toggle between CPU & GPU options to check for correctness
2. Toggle between CPU & GPU options to experience the speed
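The processing loop above maps naturally to one thread per point p. The sketch below is one possible naive version, written for illustration and not taken from the workshop project: the names closestSite, naiveVoronoi and runVoronoiDemo, the use of int2 for sites, and the 16x16 block size are assumptions. The inner loop is shared between host and device so that the CPU and GPU results can be compared, in the spirit of the remarks above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Index of the site closest to pixel (x, y); ties go to the lower index.
// Squared distances are enough for comparison, so no sqrt is needed.
__host__ __device__ int closestSite(int x, int y, const int2 *sites, int numPoints)
{
    int dx = x - sites[0].x, dy = y - sites[0].y;
    int minDist = dx * dx + dy * dy;
    int best = 0;
    for (int s = 1; s < numPoints; ++s) {
        dx = x - sites[s].x;
        dy = y - sites[s].y;
        int d = dx * dx + dy * dy;
        if (d < minDist) { minDist = d; best = s; }
    }
    return best;
}

// One thread per pixel replaces the outer "for each point p" loop.
__global__ void naiveVoronoi(const int2 *sites, int numPoints,
                             int *result, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        result[y * width + x] = closestSite(x, y, sites, numPoints);
}

// Returns 1 if the GPU map matches the CPU map.
int runVoronoiDemo()
{
    const int width = 100, height = 60, numPoints = 5;
    int2 h_sites[numPoints] = { {10, 10}, {80, 15}, {50, 30}, {20, 50}, {90, 55} };
    size_t mapBytes = width * height * sizeof(int);
    int *h_result = (int *)malloc(mapBytes);

    int2 *d_sites;
    int *d_result;
    cudaMalloc((void **)&d_sites, sizeof(h_sites));
    cudaMalloc((void **)&d_result, mapBytes);
    cudaMemcpy(d_sites, h_sites, sizeof(h_sites), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    naiveVoronoi<<<grid, block>>>(d_sites, numPoints, d_result, width, height);
    cudaMemcpy(h_result, d_result, mapBytes, cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_result[y * width + x] != closestSite(x, y, h_sites, numPoints))
                ok = 0;

    cudaFree(d_sites); cudaFree(d_result); free(h_result);
    return ok;
}

int main()
{
    int ok = runVoronoiDemo();
    printf("%s\n", ok ? "GPU matches CPU" : "GPU differs from CPU");
    return ok ? 0 : 1;
}
```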

8. Hands-on #2: 3rd PP – Voronoi Diagram

[Figure: call structure of the program. CPU side: glut_keyboard, computeVoronoiDiagram, cpuVoronoiDiagram, cudaVoronoiDiagram. GPU side: kernelNaiveVoronoiDiagram, kernelSharedVoronoiDiagram, kernalTextureVoronoiDiagram]

Convert *result using colorMap, and generate texture from outputTexture

*output is broken into 16x16 tiles, each for a thread block, and there are 2D blocks in a grid

Session (3 of 8)
Day 1, Afternoon, 2:00pm – 3:30pm – Get it good

9. CUDA Hardware Implementation
- Product vs Architecture
- CUDA mode vs Graphics mode
- Execution Model
- Memory Model: Shared memory, Texture…

10. Optimization I
- Thread Cooperation (through [static] Shared Memory)
- Texture Fetch
- CUDA: 4th PP – Max Distance Computation

11. Hands-on #3: 3rd PP – Voronoi Diagram (Part 2)
- Modify to use shared memory, texture

Tea break: 3:30pm – 4:00pm

9a. Tesla S1070 1U rack mount
- 4 Tesla cards in each system
- Each 1U connects to 2 PC systems

http://electronicdesign.com/article/embedded/simt-architecture-delivers-double-precision-terafl.aspx

9a. Tesla C1060
- Desktop version of Tesla
- This is for computing, and thus needs a separate display card for the PC
- 240 CUDA Cores, 4GB memory
- Newer version: C2050/C2070: 448 CUDA Cores, 3GB/6GB memory

9a. Fermi Architecture
[Figure: Fermi architecture] http://www.hwmania.org/forum/schede-video-thread-ufficiali/9776-nvidia-fermi-gtx-470-gtx-480-a.html

CUDA v2.0, v2.1
512 CUDA Cores

9a. Product vs Architecture

Product                         Architecture G80/G90          Architecture GT200             Architecture Fermi
GeForce (consumer)              e.g. 9800GTX                  e.g. GTX280                    e.g. GTX480
Quadro (professional)           e.g. Quadro FX3700            e.g. FX5800                    e.g. Quadro 6000
Tesla (server/desktop server)                                 e.g. C1060, & S1070 (4 cards)  e.g. C2050, C2070
CUDA Version                    1.0, 1.1, 1.2                 1.3                            2.0, 2.1
Max CUDA Cores                  128                           240                            512
Major feature                   1.1: atomic (global memory)   1.3: double precision          2.0: L1/L2 cache
                                1.2: atomic (shared mem)

9a. Specification of Different Architectures
[Figure: specification table] http://www.hwmania.org/forum/schede-video-thread-ufficiali/9776-nvidia-fermi-gtx-470-gtx-480-a.html

9b. G80 Device (In Graphics Mode)
For reference

[Figure: G80 block diagram in graphics mode. Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Thread Processor, eight groups of SP pairs each with TF and L1, and six L2 / FB partitions]

9b. G80 Device (In CUDA Mode)
- Thread Execution Manager issues threads
- 128 Scalar Processors grouped into 16 Streaming Multiprocessors (SMs)
- Parallel Data Cache (Shared Memory) enables thread cooperation

[Figure: G80 block diagram in CUDA mode. Host, Input Assembler, Thread Execution Manager, eight processor groups each with a Parallel Data Cache and a Texture unit, Load/store units, Global Memory]

9c. Execution Model
Kernels are launched in grids
- One kernel executes at a time

A block executes on one multiprocessor
- Does not migrate

Several blocks can reside concurrently on one multiprocessor
- Control limitations (of G8X/G9X GPUs):
  - At most 8 concurrent blocks per SM
  - At most 768 concurrent threads per SM
- Number is further limited by SM resources
  - Register file is partitioned among all resident threads
  - Shared memory is partitioned among all resident thread blocks

9c. Execution Model (different CUDA ver)

Compute Capability                            1.0    1.1    1.2    1.3    2.0    2.1
SM Version                                    sm_10  sm_11  sm_12  sm_13  sm_20  sm_21
Threads / Warp                                32     32     32     32     32     32
Warps / Multiprocessor                        24     24     32     32     48     48
Threads / Multiprocessor                      768    768    1024   1024   1536   1536
Thread Blocks / Multiprocessor                8      8      8      8      8      8
Max Shared Memory / Multiprocessor (bytes)    16384  16384  16384  16384  49152  49152
Register File Size                            8192   8192   16384  16384  32768  32768
Register Allocation Unit Size                 256    256    512    512    64     64
Allocation Granularity                        block  block  block  block  warp   warp
Shared Memory Allocation Unit Size            512    512    512    512    128    128
Warp allocation granularity (for registers)   2      2      2      2
Max Thread Block Size                         512    512    512    512    1024   1024
Shared Memory Size Configurations (bytes)     16384  16384  16384  16384  49152  49152
  [note: default at top of list]                                          16384  16384

9d. CUDA : Memory Hierarchy
- Registers
- L1 / Shared memory per block
- L2 / Texture memory: cached, read-only
- Global memory: uncached, slow random/uncoalesced access

L1 : Shared memory (in Fermi) is configurable
- Either 16K:48K, or 48K:16K

9d. CUDA : Memory Model
Each thread can
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- RO per-grid constant memory
- RO per-grid texture memory

Host can R/W global, constant and texture memory

9d. CUDA : Memory Model
[Figure: memory model]

9d. CUDA Memory Spaces
Global and Shared Memory
- Most important, commonly used

Local, Constant, and Texture Memory for convenience / performance
- Local: automatic array variables allocated there by compiler
- Constant: useful for uniformly-accessed read-only data
  - Cached
- Texture: useful for spatially coherent random-access read-only data
  - Cached
  - Provides filtering, address clamping and wrapping

10a. Thread Cooperation
Threads in a thread block can cooperate
- Sharing data through shared memory
- Synchronizing their execution (using __syncthreads())

Thread cooperation is valuable
- Share results to save computation
- Share memory accesses
  - Drastic bandwidth reduction

10a. CUDA : Thread Synchronization Fn
void __syncthreads();
- Synchronizes all threads in a block
- Generates barrier synchronization instruction
- No thread can pass this barrier until all threads in the block reach it
- Used to avoid RAW (read after write) / WAR / WAW hazards when accessing shared memory
  - To maintain the sequential order in a concurrent environment

10a. Thread Blocks: Scalable Cooperation
Divide monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks “cannot” cooperate

Enables programs to transparently scale to any number of processors

[Figure: Thread Block 0 … Thread Block N - 1, each with threadID 0 to 7 running the code below]

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

10a. Improving Memory Bandwidth
Global memory has much higher latency (roughly 100x) and lower bandwidth than on-chip shared memory
- Should minimize global memory access
- Can stage data in shared memory before operating on it

Typical programming pattern using shared memory. For each thread of a block
(1) Load data from global memory to shared memory
(2) Synchronize so that it is safe to read shared memory locations written by other threads
(3) Process data in shared memory
(4) Synchronize again if necessary to make sure shared memory has been updated with results
(5) Write results to global memory
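The five-step pattern above can be sketched with a per-block sum. This is an illustrative example only, not the workshop's 4th PP: the kernel name blockSum, the helper runBlockSumDemo and the fixed BLOCK_SIZE (assumed to be a power of two) are choices made for this sketch. The numbered comments refer to the steps of the pattern.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // hypothetical; must be a power of two for this reduction

// Each block sums its BLOCK_SIZE inputs in shared memory.
__global__ void blockSum(const int *in, int *blockSums, int n)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = (i < n) ? in[i] : 0;            // (1) load global -> shared
    __syncthreads();                          // (2) loads visible to all threads

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];    // (3) process in shared memory
        __syncthreads();                      // (4) partial sums are up to date
    }

    if (t == 0) blockSums[blockIdx.x] = tile[0];   // (5) write result to global
}

// Returns the sum of n ones computed via per-block sums (expected: n).
int runBlockSumDemo(int n)
{
    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    int *h_in = (int *)malloc(n * sizeof(int));
    int *h_sums = (int *)malloc(numBlocks * sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_sums;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_sums, numBlocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_sums, n);
    cudaMemcpy(h_sums, d_sums, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    int total = 0;                            // blocks cannot synchronize, so the
    for (int b = 0; b < numBlocks; ++b)       // final combine happens on the host
        total += h_sums[b];

    cudaFree(d_in); cudaFree(d_sums); free(h_in); free(h_sums);
    return total;
}

int main()
{
    int total = runBlockSumDemo(1000);
    printf("total = %d\n", total);
    return total == 1000 ? 0 : 1;
}
```

Each input is read from global memory once; all further traffic stays in shared memory.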

10a. Using Shared Memory
Static shared memory

    __shared__ int2 tile[TILE_SIZE];

Dynamic shared memory
- Pass the amount through kernel invocation

    Kernelname<<<gridsize, blocksize, sMemByte, …>>>

- Inside the kernel, we have an array that starts at the beginning of the dynamic shared memory

    extern __shared__ int2 smem_ptr[]; // already initialized to the start
    float *smem_pt2 = (float *) (smem_ptr + 100); // offset
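A small end-to-end use of dynamic shared memory might look like the sketch below. It is an assumption-laden illustration: the kernel scaleAndReverse, the helper runDynamicSmemDemo, and the decision to split the buffer into an int part followed by a float part are all invented for this example. It runs a single block whose size equals n, and the third launch parameter gives the number of bytes for both parts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Single-block kernel: out[t] = scale * in[n - 1 - t].
// Assumes blockDim.x == n.
__global__ void scaleAndReverse(const int *in, float *out, int n, float scale)
{
    extern __shared__ int smem[];             // size set at launch time
    int   *ints   = smem;                     // first n ints
    float *floats = (float *)(smem + n);      // next n floats (offset)

    int t = threadIdx.x;
    ints[t] = in[t];
    __syncthreads();                          // all ints written before reading

    floats[t] = scale * ints[n - 1 - t];      // read another thread's element
    out[t] = floats[t];
}

// Returns 1 if the output matches the expected reversed, scaled input.
int runDynamicSmemDemo()
{
    const int n = 128;
    int h_in[n];
    float h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in;
    float *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t smemBytes = n * (sizeof(int) + sizeof(float));
    scaleAndReverse<<<1, n, smemBytes>>>(d_in, d_out, n, 0.5f);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    int ok = 1;
    for (int t = 0; t < n; ++t)
        if (h_out[t] != 0.5f * (n - 1 - t)) ok = 0;

    cudaFree(d_in); cudaFree(d_out);
    return ok;
}

int main()
{
    int ok = runDynamicSmemDemo();
    printf("%s\n", ok ? "dynamic shared memory OK" : "dynamic shared memory MISMATCH");
    return ok ? 0 : 1;
}
```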

10a. Thread Cooperation from Diff Blocks?
Threads in different thread blocks “can” cooperate but cannot synchronize
- Cooperate using atomic operations (using atomic*())
  - An atomic function performs a read-modify-write atomic operation, without interference from other threads
- Cannot synchronize ALL threads within kernel
  - May cause deadlock
  - Need a lot of hardware resources to store the states of all unfinished threads

All threads of a kernel are synchronized at the end of kernel execution or between kernel launches
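A minimal sketch of cooperation across blocks through an atomic counter in global memory follows. The names countAbove and runAtomicDemo, and the data pattern, are assumptions for this example. Threads from all blocks update one counter with atomicAdd (global-memory atomics need sm_11 or higher), and the host reads the counter only after the kernel has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Threads from all blocks increment one shared counter in global memory.
__global__ void countAbove(const int *data, int n, int threshold,
                           unsigned int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold)
        atomicAdd(counter, 1u);     // read-modify-write without interference
}

// Returns the number of elements greater than threshold, counted on the GPU.
unsigned int runAtomicDemo(int n, int threshold)
{
    int *h_data = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) h_data[i] = i % 100;

    int *d_data;
    unsigned int *d_counter;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMalloc((void **)&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    countAbove<<<blocks, threads>>>(d_data, n, threshold, d_counter);

    // The blocking copy starts only after the kernel has completed, so the
    // counter already includes the contribution of every block.
    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaFree(d_data); cudaFree(d_counter); free(h_data);
    return h_counter;
}

int main()
{
    // Values cycle through 0..99, so 10 of every 100 exceed 89.
    unsigned int count = runAtomicDemo(100000, 89);
    printf("count = %u\n", count);
    return count == 10000u ? 0 : 1;
}
```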

10b. Texture Memory
CUDA supports the use (reading) of texture memory (for graphics purposes)

Using texture memory enjoys caching benefit. Note that in Fermi, this becomes less significant as global memory also allows caching

A texture can be any region of linear memory or a CUDA array

Texture can be 1D, 2D, 3D and composed of elements, each of which has 1, 2, or 4 components that may be signed or unsigned 8-, 16- or 32-bit integers, 16-bit floats, or 32-bit floats.

10b. CUDA Texture Types

Bound to linear memory
    Global memory address is bound to a texture
    Only 1D
    Integer addressing
    No filtering, no addressing modes (clamping, repeat)

Bound to CUDA arrays
    Block linear CUDA array is bound to a texture (in texture memory space)
    1D, 2D, or 3D
    Float addressing (size-based or normalized)
    Filtering
    Addressing modes (clamping, repeat)

Bound to pitch linear
    Global memory address is bound to a texture
    2D
    Float/integer addressing, filtering, and clamp/repeat addressing modes similar to CUDA arrays

10b. Memory Optimization Using Textures

Texture fetches vs. global memory reads
    Texture fetches are cached
    Optimized for 2D spatial locality
    Useful when threads of a warp do random reads from locations that are close together in 2D

Textures can be used to avoid uncoalesced loads from global memory
Also useful when it is awkward to organize data in shared memory

10b. Using Texture Memory

Host (CPU) code
    Allocate/obtain memory (global linear/pitch linear, or CUDA array)
    Create a texture reference object
        Currently must be at file-scope
    Bind the texture reference to memory/array
    When done, unbind the texture reference, free resources

Device (kernel) code
    Fetch using texture reference
    Linear memory textures: tex1Dfetch()
    Array textures: tex1D() or tex2D() or tex3D()
    Pitch linear textures: tex2D()

10b. Using Texture Memory

Declare a global variable at the beginning of the .cu file

texture<type> globalTexRef;

Before invoking the kernel
Bind allocatedData (cudaMalloc memory) to globalTexRef

cudaBindTexture(0, globalTexRef, allocatedData);

Inside the kernel
Read data as:

Point = tex1Dfetch(globalTexRef, locationNeeded);

After returning from the kernel

cudaUnbindTexture(globalTexRef);

10b. Example

Unoptimized data shifts

__global__ void shiftCopy(float *odata, float *idata, int shift)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    odata[xid] = idata[xid + shift];
}

Optimized using texture fetches

// texRef: file-scope texture reference, assumed bound to idata before the launch
__global__ void textureShiftCopy(float *odata, float *idata, int shift)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    odata[xid] = tex1Dfetch(texRef, xid + shift);
}

The gain is significant before compute capability 1.2

10b. Performance Comparison

[Figure: performance comparison, panels for Ver 1.0 and Ver 1.3]

10c. 4th PP – Distance Computation (1 of 4)

CPU version (inside maxDist.cu)

static void CPU_MaxDist( int2 *pointArray, int numPoints, float *maxDistArray )
{
    for ( int i = 0; i < numPoints; i++ ) {
        float maxSqrDist = 0.0f;
        const int2 *pi = &pointArray[i];
        for ( int j = 0; j < numPoints; j++ ) {
            const int2 *pj = &pointArray[j];
            float sqrDist = sqrf( pj->x - pi->x ) + sqrf( pj->y - pi->y );
            if ( sqrDist > maxSqrDist ) maxSqrDist = sqrDist;
        }
        maxDistArray[i] = sqrt( maxSqrDist );
    }
}

10c. 4th PP – Distance Computation (2 of 4)

GPU version without shared memory (inside maxDist.cu)

__global__ void GPU_MaxDist1( int2 *pointArray, int numPoints, float *maxDistArray )
{
    // one thread per point; assumes the grid covers exactly numPoints threads
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    float max = 0.0, temp = 0.0;
    int2 point = pointArray[tx];
    for (int i = 0; i < numPoints; i++) {
        int2 point1 = pointArray[i];
        temp = sqrf(point1.x-point.x) + sqrf(point1.y-point.y);
        if (temp > max)
            max = temp;
    }
    maxDistArray[tx] = sqrt(max);
}

10c. 4th PP – Distance Computation (3 of 4)

GPU version with shared memory (inside maxDist.cu)

__global__ void GPU_MaxDist2( int2 *pointArray, int numPoints, float *maxDistArray )
{
    // assumes blockDim.x == TILE_SIZE; the tile load below is unguarded,
    // so numPoints should be a multiple of TILE_SIZE
    __shared__ int2 tile[ TILE_SIZE ];
    int2 point = pointArray[blockIdx.x * blockDim.x + threadIdx.x];
    float temp = 0.0, max = 0.0;
    for ( int a = 0; a < numPoints; a += TILE_SIZE ) {
        // each thread loads one point of the current tile
        tile[threadIdx.x] = pointArray[a + threadIdx.x];
        __syncthreads();    // tile is fully loaded
        for ( int k = 0; k < TILE_SIZE; k++ ) {
            int2 point1 = tile[k];
            temp = sqrf(point1.x-point.x) + sqrf(point1.y-point.y);
            if ( a + k < numPoints && temp > max ) max = temp;
        }
        __syncthreads();    // everyone is done before the tile is overwritten
    }
    maxDistArray[blockIdx.x*blockDim.x + threadIdx.x] = sqrt(max);
}

10c. 4th PP – Distance Computation (4 of 4)

GPU version with texture memory (inside maxDist.cu)

texture<int2> texPoints;
// binding needed before coming into the kernel
__global__ void GPU_MaxDist3( int numPoints, float *maxDistArray )
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int2 myPoint, point;
    float temp = 0.0, max = 0.0;
    myPoint = tex1Dfetch( texPoints, tx );
    for (int i = 0; i < numPoints; i++) {
        point = tex1Dfetch( texPoints, i );
        temp = sqrf(myPoint.x-point.x) + sqrf(myPoint.y-point.y);
        if ( temp > max ) max = temp;
    }
    maxDistArray[blockIdx.x*blockDim.x + threadIdx.x] = sqrt(max);
}

11. Hands-on #3: 3rd PP – VD (Part 2)

Do the Voronoi Diagram exercise with shared memory (& texture if time permits)

• Assume numPoints is divisible by the tile size
• One shared memory solution with 16x16 tile size
    • The image is divided into 16x16 thread blocks
    • Threads in a block are to cooperate in reading in sites info in batches of 16x16 items for processing
• Specifically, within a thread block
    • We have point (tx, ty) with reference to the image
    • We have tid with reference to the position in a tile
    • Go thru all the sites in 16x16 batches as follows:
        • thread with tid reads site i+tid into shared memory location tid, where i is in increments of 16x16
        • __syncthreads()
        • find the closest site among these 16x16 sites for thread tid
    • Record the found site to result[ty*width + tx]

Toggle between modes to check for correctness & efficiency

11. Hands-on #3: 3rd PP – VD (Part 2)

Do the Voronoi Diagram exercise with shared memory, & texture

[Figure: the image with a thread at (tx = 1, ty = 2) and its tid within the tile; the sites array is read in batches (batch 0, batch 1, …)]

Session (4 of 8)
Day 1, Afternoon, 4:00pm – 5:30pm – Get it better

12. Hands-on #4: 5th PP – Parallel Prefix Operation
13. Optimization II
a) Avoiding divergence
b) Avoiding idle thread
c) “Algorithmic” optimization (& no bank conflict)
d) Avoiding idle computation
e) “auto” synchronization within a warp (without __syncthreads)

14. CUDA Occupancy Calculator
- No. thread per block, Amt shared memory, Amt registers

Home-sweet-home

12. Parallel Prefix Operation

The operation takes a binary associative operator ⊕ and an array of n elements

[a0, a1, ..., an–1]

and returns the array

[a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an–1)]

Example (with ⊕ = +):
Input = [3 1 7 0 4 1 6 3]
Output = [3 4 11 11 15 16 22 25]

Good example of a computation that seems inherently sequential, but in fact has a work-efficient parallel algorithm
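As a reference point for the parallel versions, the definition can be written as a short sequential C++ routine. This is only a sketch: the name inclusiveScan is hypothetical and not part of the workshop code, and the operator is passed in so that any associative ⊕ (sum, max, …) can be tried.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sequential inclusive scan: out[i] = a[0] op a[1] op ... op a[i].
// inclusiveScan is a hypothetical helper name used for illustration only.
// op should be associative if the result is to match a parallel scan.
template <typename T, typename Op>
std::vector<T> inclusiveScan(const std::vector<T>& a, Op op) {
    std::vector<T> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = (i == 0) ? a[0] : op(out[i - 1], a[i]);
    return out;
}
```

Calling inclusiveScan on the example input with std::plus<int>() reproduces the output array of the example; swapping in a max operator gives a running maximum instead.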

12. Naive Parallel Scan

[Figure: naive parallel scan diagram]

12. Naive Parallel Scan

An “in-place” naive parallel scan (incorrect)

for d := 0 to log2 n – 1 do
    forall k in parallel do
        if k ≥ 2^d then
            x[k] := x[k − 2^d] + x[k]

Incorrect because a thread may read x[k − 2^d] after another thread has already overwritten it in the same step

A double-buffered version (correct)

for d := 0 to log2 n – 1 do
    forall k in parallel do
        if k ≥ 2^d then
            x[out][k] := x[in][k − 2^d] + x[in][k]
        else
            x[out][k] := x[in][k]
    swap(in, out)
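The double-buffered scheme can be checked on the CPU before writing a kernel. The sketch below is a plain C++ model, not the workshop solution: the function name naiveScanDoubleBuffered and the optional additions counter are assumptions made for illustration, and the inner loop stands in for "forall k in parallel".

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// CPU model of the double-buffered naive scan (hypothetical helper name).
// Each pass reads only from `in` and writes only to `out`, so the order in
// which k is visited does not matter; that is what makes it safe in parallel.
// If `additions` is non-null it receives the total number of + operations.
std::vector<int> naiveScanDoubleBuffered(std::vector<int> in, long* additions = nullptr) {
    const std::size_t n = in.size();
    std::vector<int> out(n);
    long adds = 0;
    for (std::size_t offset = 1; offset < n; offset *= 2) {   // offset = 2^d
        for (std::size_t k = 0; k < n; ++k) {                 // "forall k in parallel"
            if (k >= offset) { out[k] = in[k - offset] + in[k]; ++adds; }
            else             { out[k] = in[k]; }
        }
        std::swap(in, out);                                   // swap(in, out)
    }
    if (additions) *additions = adds;
    return in;   // after the last swap, `in` holds the scanned array
}
```

For the 8-element example this performs 7 + 6 + 4 = 17 additions, against 7 for the sequential loop, which is the O(n log n) versus O(n) gap in work.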

12. Algorithm Complexity

Step complexity, S(n) = O(log n)

Work complexity, W(n) = O(n log n)

Number of addition operations = $\sum_{d=1}^{\log_2 n} (n - 2^{d-1}) = O(n \log_2 n)$
Sequential scan performs O(n) additions
Naive parallel scan is not work-efficient

Work-efficient parallel scan algorithms exist

12. Hands-on #4: 5th PP – Parallel Prefix

Input: X[0], X[1], X[2], …, X[n–1]

Output: Y[0], Y[1], Y[2], …, Y[n–1] s.t. Y[i] = X[0] + X[1] + … + X[i]

E.g. Input: 5, 3, 2, 6, 7, 1, 1, 2
Output: 5, 8, 10, 16, 23, 24, 25, 27

CPU Program

Y[ 0 ] = X[ 0 ];
for (int i = 1; i < n; i++)
{
    Y[ i ] = Y[ i - 1 ] + X[ i ];
}

12. Hands-on #4: 5th PP – Parallel Prefix

Write a (naïve) GPU version
“2” possibilities ??
Multiple rounds: call kernel with different step length in each round
One round: call kernel to do different rounds in one go, use synchronization (??)

Questions: How big can the array be? What else can improve?

If time permits, write a GPU version (2 kernels) using shared memory
Handle input as tiles of 256 items
Declare __shared__ int tile[2][TILE_SIZE] for double buffering
Cooperate to read in a tile of items
Do parallel prefix for the tile with multiple rounds inside the kernel
Write results to global memory, and also write (for thread 0) the result of the last item to another array in global memory
Do (recursively) the next higher level of 256 items (which is the tile level)
Add results of each tile back to the lower level
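The two-kernel scheme can be prototyped on the CPU first. The following C++ sketch models it under stated assumptions: tiledScan is a hypothetical name, each tile is scanned with a simple running sum instead of the in-kernel parallel rounds, and the tile size is a parameter (256 in the exercise) so that small inputs exercise the recursion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of the tiled scan (hypothetical helper, not the workshop solution).
// Step 1 scans every tile independently and records each tile's total.
// Step 2 scans the tile totals, recursively, with the same routine.
// Step 3 adds the total of all preceding tiles back to every item.
std::vector<int> tiledScan(const std::vector<int>& x, std::size_t tileSize) {
    assert(tileSize >= 2);                        // needed for the recursion to shrink
    const std::size_t n = x.size();
    std::vector<int> y(n);
    if (n == 0) return y;
    const std::size_t numTiles = (n + tileSize - 1) / tileSize;
    std::vector<int> tileTotals(numTiles);

    for (std::size_t t = 0; t < numTiles; ++t) {  // "kernel 1": one block per tile
        int running = 0;
        for (std::size_t i = t * tileSize; i < n && i < (t + 1) * tileSize; ++i) {
            running += x[i];
            y[i] = running;
        }
        tileTotals[t] = running;                  // result of the tile's last item
    }
    if (numTiles == 1) return y;

    const std::vector<int> scannedTotals = tiledScan(tileTotals, tileSize);

    for (std::size_t t = 1; t < numTiles; ++t)    // "kernel 2": add back to lower level
        for (std::size_t i = t * tileSize; i < n && i < (t + 1) * tileSize; ++i)
            y[i] += scannedTotals[t - 1];
    return y;
}
```

With tileSize = 2 the hands-on input 5, 3, 2, 6, 7, 1, 1, 2 goes through three levels and still yields 5, 8, 10, 16, 23, 24, 25, 27, which is a convenient way to test the add-back step.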

12. Parallel Scan of Large Arrays

[Figure: parallel scan of large arrays]

13. Optimization II: Parallel Reduction

Reduce
out = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[n–1]
where ⊕ is associative

Parallel Prefix / Scan
out[i] = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[i]
where ⊕ is associative
E.g. all-prefix-sums

13. Optimization II: Parallel Reduction

Common and important data parallel primitive

Easy to implement in CUDA
Naïve version uses global memory as in the parallel prefix
Hard to get it fast

Serves as a great optimization example
We’ll walk step-by-step through 5 different versions
Demonstrates several important optimization strategies

Based on Mark Harris’ “Optimizing Parallel Reduction in CUDA” tutorial
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf

13. Optimization II: Parallel Reduction

Tree-based approach used within each thread block using shared memory

Need to be able to use multiple thread blocks
To process very large arrays
To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array

But how do we communicate partial results between thread blocks?

13. Solution: Kernel Decomposition

Solution: decompose into multiple kernels
Kernel launch serves as a global synchronization point
Kernel launch has negligible HW overhead, low SW overhead

In the case of reductions, code for all levels is the same
Recursive kernel invocation

Therefore, we focus on reduction within a single thread block

13. Optimization Goal and Metric

We should strive to reach GPU peak performance

Choose the right metric
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels

Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)

Therefore we should strive for peak bandwidth

Will use G80 GPU for this example
384-bit memory interface, 900 MHz DDR
384 * 1800 / 8 = 86.4 GB/s
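The peak-bandwidth arithmetic is easy to wrap in a helper so that measured kernels can be compared against it. This is a small C++ sketch; the function names peakBandwidthGBs and bandwidthUtilization are hypothetical, and 1 GB is taken as 10^9 bytes as in the figure above.

```cpp
#include <cassert>

// Peak memory bandwidth in GB/s (1 GB = 10^9 bytes).
// busWidthBits: width of the memory interface.
// effectiveMHz: data rate after DDR doubling (900 MHz DDR -> 1800).
double peakBandwidthGBs(int busWidthBits, double effectiveMHz) {
    return busWidthBits / 8.0 * effectiveMHz * 1e6 / 1e9;
}

// Fraction of the peak reached by a kernel that moves `bytes` in `seconds`.
double bandwidthUtilization(double bytes, double seconds, double peakGBs) {
    return (bytes / seconds / 1e9) / peakGBs;
}
```

peakBandwidthGBs(384, 1800) gives the 86.4 GB/s quoted for G80; a kernel measured at 17 GB/s would then sit at roughly 20% of peak.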

13a. Reduction #1: Interleaved Addressing

__global__ void reduce0(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Solution to hands-on #4 with shared memory

13a. Reduction #1: Interleaved Addressing

[Figure: interleaved addressing reduction tree]

13a. Reduction #1: Interleaved Addressing

__global__ void reduce0(int *g_idata, int *g_odata) {
    …
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    …
}

Problem: highly divergent warps are very inefficient, and the % operator is very slow

13a. Performance for 4M Element Reduction

Note: Block Size = 128 threads for all tests

Divergent Threads (in the same warp)

Possibility #1: Two threads doing different things

Possibility #2: Some threads do the same thing, and the others do nothing … our example (not very serious)




13b. Reduction #2: Interleaved Addressing

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x)
            sdata[index] += sdata[index + s];
        __syncthreads();
    }

13b. Reduction #2: Interleaved Addressing

[Figure: interleaved addressing with strided index]

New Problem: Shared Memory Bank Conflicts (Session #5)

* The difficulty of indexing / addressing of elements appears here.

13b. Reduction #2: Interleaved Addressing

s    tid   index   effect
1    0     0       sdata[0] += sdata[1]
     1     2       sdata[2] += sdata[3]
     …     …       …
2    0     0       sdata[0] += sdata[2]
     3     12      sdata[12] += sdata[14]
     …     …       …
4    0     0       sdata[0] += sdata[4]
     2     16      sdata[16] += sdata[20]
     3     24      sdata[24] += sdata[28]
     …     …       …
8    0     0       sdata[0] += sdata[8]
     1     16      sdata[16] += sdata[24]
     2     32      sdata[32] += sdata[40]
     3     48      sdata[48] += sdata[56]
     …     …       …

13b. Performance for 4M Element Reduction

[Figure: performance table]

13c. Reduction #3: Sequential Addressing

[Figure: sequential addressing reduction tree]

Sequential addressing is shared memory bank conflict free




13c. Reduction #3: Sequential Addressing

    // do reduction in shared mem
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
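The three addressing schemes differ only in which thread touches which element, so they can be compared on the CPU. The C++ sketch below models a single thread block; the function names are hypothetical, the loop over tid stands in for the threads of the block, and the end of each s iteration stands in for __syncthreads(). It assumes a non-empty block whose size is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduction #1: interleaved addressing with the divergent "tid % (2*s)" test.
int reduceInterleavedDivergent(std::vector<int> sdata) {
    const std::size_t blockDim = sdata.size();
    for (std::size_t s = 1; s < blockDim; s *= 2)
        for (std::size_t tid = 0; tid < blockDim; ++tid)
            if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
    return sdata[0];
}

// Reduction #2: interleaved addressing with a strided index, so the active
// threads are the contiguous range tid = 0, 1, 2, ...
int reduceInterleavedStrided(std::vector<int> sdata) {
    const std::size_t blockDim = sdata.size();
    for (std::size_t s = 1; s < blockDim; s *= 2)
        for (std::size_t tid = 0; tid < blockDim; ++tid) {
            const std::size_t index = 2 * s * tid;
            if (index < blockDim) sdata[index] += sdata[index + s];
        }
    return sdata[0];
}

// Reduction #3: sequential addressing; thread tid adds the element s away.
int reduceSequential(std::vector<int> sdata) {
    const std::size_t blockDim = sdata.size();
    for (std::size_t s = blockDim / 2; s > 0; s >>= 1)
        for (std::size_t tid = 0; tid < s; ++tid)
            sdata[tid] += sdata[tid + s];
    return sdata[0];
}
```

All three return the same block sum; what changes on the GPU is the divergence and bank-conflict behaviour, which this model does not measure.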

13c. Performance for 4M Element Reduction

[Figure: performance table]

13d. Problem: Idle Threads

    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

Problem: half of the threads are idle on the first loop iteration
This is very wasteful

13d. Reduction #4: First Add During Load

Halve the number of blocks, and replace the single load

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

with two loads and the first add of the reduction

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();

In effect, 1 block of threads handles 2 blocks of global memory, so the total number of thread blocks is half the number of global memory blocks.

Each thread reads the corresponding elements in the 2 blocks of global memory and adds them.
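The index arithmetic of the two-load version can be verified with a CPU model of the whole grid. In this C++ sketch the name reduceFirstAddDuringLoad is hypothetical; it assumes the input length is a multiple of 2 * blockDim and that blockDim is a power of two, and it returns one partial sum per thread block, as g_odata would hold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU model of reduction #4 over a whole grid (hypothetical helper name).
// Each simulated block covers 2 * blockDim input elements.
std::vector<int> reduceFirstAddDuringLoad(const std::vector<int>& g_idata,
                                          std::size_t blockDim) {
    const std::size_t numBlocks = g_idata.size() / (2 * blockDim);  // half as many blocks
    std::vector<int> g_odata(numBlocks);
    std::vector<int> sdata(blockDim);
    for (std::size_t blockIdx = 0; blockIdx < numBlocks; ++blockIdx) {
        // two loads and the first add of the reduction
        for (std::size_t tid = 0; tid < blockDim; ++tid) {
            const std::size_t i = blockIdx * (blockDim * 2) + tid;
            sdata[tid] = g_idata[i] + g_idata[i + blockDim];
        }
        // sequential addressing on the remaining blockDim values
        for (std::size_t s = blockDim / 2; s > 0; s >>= 1)
            for (std::size_t tid = 0; tid < s; ++tid)
                sdata[tid] += sdata[tid + s];
        g_odata[blockIdx] = sdata[0];
    }
    return g_odata;
}
```

For 1024 inputs and blockDim = 128 this produces 4 partial sums instead of 8, each covering 256 consecutive elements.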

13d. Performance for 4M Element Reduction

[Figure: performance table]

13e. Instruction Bottleneck

At 17 GB/s, we’re far from bandwidth bound

And we know reduction has low arithmetic intensity

Therefore a likely bottleneck is instruction overhead
Ancillary instructions that are not loads, stores, or arithmetic for the core computation
In other words: address arithmetic and loop overhead

Strategy: unroll loops

13e. Reduction #5: Unroll The Last Warp

As reduction proceeds, # “active” threads decreases

When s <= 32, we have only one warp left

Instructions are SIMD synchronous within a warp

That means when s <= 32
We don’t need to __syncthreads()
We don’t need “if (tid < s)” because it doesn’t save any work, and it does not affect the correctness

Let’s unroll the last 6 iterations of the inner loop

13e. Reduction #5: Unroll The Last Warp

    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // last warp (assumes blockDim.x >= 64): no __syncthreads() needed, but
    // sdata should be accessed through a volatile pointer here so that the
    // compiler does not keep intermediate values in registers
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

This saves useless work in all warps, not just the first one!
Without unrolling, all warps execute every iteration of the for loop and if statement

13e. Performance for 4M Element Reduction

[Figure: performance table]

13. Further Optimizations

Further optimizations can be applied

E.g. complete loop unrolling, algorithm cascading

13. Parallel Reduction Algorithmic Complexity

Step complexity, S(n) = O(log n)
Step complexity is the number of steps required to complete the algorithm given an infinite number of threads/processors

Work complexity, W(n) = O(n)
Work complexity is the total number of operations the algorithm performs
Our parallel reduction algorithm is work-efficient (i.e. same as optimal sequential algorithm)

Time complexity
p is the number of available processors
When p → ∞, T(n) → S(n)
When p = 1, T(n) = W(n)

14. CUDA Occupancy Calculator

Limits on:
# of threads per block
Amount of shared memory
# registers per thread

Occupancy: the ratio of active warps to the maximum number of warps supported on a multiprocessor (SM).

Aim for 100% occupancy?
Actually, should aim for all CUDA cores to be busy

Demonstration on GPU Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
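The idea behind the calculator can be sketched in a few lines of C++. This is a deliberately simplified model, not the spreadsheet's exact logic: the names SmLimits and estimateOccupancy are hypothetical, and register and shared memory allocation granularity are ignored, so the real calculator may report lower values.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits (hypothetical struct for illustration).
struct SmLimits {
    int maxWarps;        // active warps per multiprocessor
    int maxBlocks;       // active thread blocks per multiprocessor
    int registers;       // register file size
    int sharedMemBytes;  // shared memory per multiprocessor
};

// Simplified occupancy estimate: find how many blocks fit under every limit,
// then divide the resulting active warps by the maximum number of warps.
// Allocation granularity is ignored (assumption of this sketch).
double estimateOccupancy(const SmLimits& sm, int threadsPerBlock,
                         int regsPerThread, int smemPerBlock) {
    const int warpSize = 32;
    const int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    int blocks = std::min(sm.maxBlocks, sm.maxWarps / warpsPerBlock);
    if (regsPerThread > 0)
        blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    if (smemPerBlock > 0)
        blocks = std::min(blocks, sm.sharedMemBytes / smemPerBlock);
    return static_cast<double>(blocks * warpsPerBlock) / sm.maxWarps;
}
```

With the compute capability 1.1 limits (24 warps, 8 blocks, 8192 registers, 16384 bytes of shared memory), a 128-thread block using 10 registers per thread reaches full occupancy in this model, while 16 registers per thread drops it to two thirds because only 4 blocks fit in the register file.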

9c. Execution Model (different CUDA ver)

Compute Capability                           1.0    1.1    1.2    1.3    2.0    2.1
SM Version                                   sm_10  sm_11  sm_12  sm_13  sm_20  sm_21
Threads / Warp                               32     32     32     32     32     32
Warps / Multiprocessor                       24     24     32     32     48     48
Threads / Multiprocessor                     768    768    1024   1024   1536   1536
Thread Blocks / Multiprocessor               8      8      8      8      8      8
Max Shared Memory / Multiprocessor (bytes)   16384  16384  16384  16384  49152  49152
Register File Size                           8192   8192   16384  16384  32768  32768
Register Allocation Unit Size                256    256    512    512    64     64
Allocation Granularity                       block  block  block  block  warp   warp
Shared Memory Allocation Unit Size           512    512    512    512    128    128
Warp allocation granularity (for registers)  2      2      2      2
Max Thread Block Size                        512    512    512    512    1024   1024
Shared Memory Size Configurations (bytes)    16384  16384  16384  16384  49152  49152
  [note: default at top of list]                                         16384  16384

14. CUDA Occupancy Calculator

[Figure: CUDA Occupancy Calculator spreadsheet]

14. CUDA Occupancy Calculator

The compiler option for the maximum number of registers per thread is:

-maxrregcount N

When there are not enough registers, variables will go into “local” memory, which is actually the global memory

Leave it to the compiler?

Session (5 of 8)
Day 2, Morning, 9:00am – 10:30am – More about Memory

1. Review of CUDA Hardware Implementation
[ Revisit #13 in Session #4]
[ Revisit #14 in Session #4]

2. Overall Optimization

3. Optimization III
- Coalescing Memory Access (Global memory)
- Avoiding (shared) memory bank conflict

4. CUDA: 6th PP – Matrix Transpose

Break: 10:30am – 11:00am

1. Streaming Multiprocessors

CUDA architecture consists of a scalable array of multi-threaded Streaming Multiprocessors (SM)
All threads of each thread block are executed concurrently on one multiprocessor
But each multiprocessor can execute multiple blocks concurrently

Each multiprocessor has (in G80)
1 multi-threaded instruction unit
8 Scalar Processor (SP) cores
2 function units for transcendental functions (e.g. log, sin)
On-chip memory: registers, shared memory, constant cache, texture cache

1. Streaming Multiprocessors

[Figure: streaming multiprocessor block diagram]

1. Thread Execution

SIMT (single-instruction, multiple-thread) execution model

Multiprocessor creates, manages, schedules and executes threads in warps

A SIMT warp is a group of 32 parallel threads

When executing a (common) instruction in a warp, the multiprocessor maps threads to the (8) scalar processor cores in turn
One thread to one scalar processor core
A warp can be processed in 4 clock cycles (32 threads / 8 cores)

A block is always split into warps the same way
Each warp contains threads of consecutive, increasing thread IDs (0–31, 32–63, …)

1. Flow Control and Branching

Individual threads in a warp are free to branch independently
Can have diverging execution paths

A warp executes one common instruction at a time, so full efficiency is achieved when all threads in a warp agree on the same execution path

If threads in a warp diverge, the warp serially executes each branch path taken, disabling threads that are not on that path

1. Active Threads and Blocks

Numbers of threads and blocks that can be active at the same time in a multiprocessor are limited by the size of the register file and shared memory

Each active thread is allocated some registers for the entire lifetime of the thread
Each active block is allocated some shared memory for the entire lifetime of the block

1. Some Spec for Compute Capability 1.1

Maximum threads per block = 512
Maximum block size = 512 x 512 x 64
Maximum grid size = 65535 x 65535
Warp size = 32 threads
Registers per multiprocessor = 8192
Shared memory per multiprocessor = 16 KB (in 16 banks)
Constant memory = 64 KB
Local memory per thread = 16 KB
Constant cache = 8 KB per multiprocessor
Texture cache = 6 KB to 8 KB per multiprocessor
Maximum active blocks per multiprocessor = 8
Maximum active warps per multiprocessor = 24
Maximum active threads per multiprocessor = 768
Limit on kernel size = 2 million PTX instructions
Support atomic functions on 32-bit words in global memory (1.1)

2. Overall Optimization Strategies

(1) Maximizing parallel execution
Restructure algorithm to achieve as much data parallelism as possible
Map to hardware by carefully choosing the execution configuration of each kernel invocation

(2) Optimizing memory usage to achieve maximum memory bandwidth
Different memory spaces and access patterns have vastly different performance
We have a lot to say about this

(3) Optimizing instruction usage to achieve maximum instruction throughput
Use high throughput arithmetic instructions
Avoid different execution paths within same warp

2. Execution Configuration

Number of thread blocks should be larger than the number of multiprocessors
So that all multiprocessors have at least one block to execute
Should have multiple blocks per multiprocessor so that it does not idle when a block synchronizes or accesses memory

Threads per block should be a multiple of the warp size

Should have at least 192 (= 6 x 32) active threads per multiprocessor to hide register read-after-write latency

Block size is also limited by the amount of registers and shared memory used

3. Memory Optimizations

Distinguish between data intensive vs compute intensive
Data intensive: consider memory access carefully
Compute intensive: not so much, as there is always enough to work on

Possibilities:
(1) Minimize data transfer between host and device
(2) Global memory usage: ensure accesses are coalesced whenever possible (coming up)
(3) Minimize global memory accesses by using shared memory, constant memory, etc.
(4) Shared memory usage: minimize bank conflicts in accesses (coming up)

3. Data Transfer Between Host and Device

Peak bandwidth between device memory and GPU is much higher than that between host memory and device memory

Should minimize data transfer between host and device memory
Even if that means running kernels on GPU that do not have any speed-up over CPU

Use page-locked or pinned memory transfer

Overlap asynchronous transfers with CPU computations

3a. Coalesced Access to Global Memory

Simultaneous accesses to global memory by threads in a half-warp (16 threads) can be coalesced into as few as a single memory transaction of 32, 64 or 128 bytes

Requirements for coalescing are based on compute capability
(1) Compute capability 1.0 and 1.1
(2) Compute capability 1.2 and higher

If a half-warp fulfills the requirements, coalescing occurs even if the half-warp is divergent and some threads do not actually access memory

3a. Coalescing: Compute Capability 1.0 & 1.1

Global memory accesses by all threads in a half-warp are coalesced if threads access
(1) 4-byte words, resulting in one 64-byte memory transaction
(2) 8-byte words, resulting in one 128-byte memory transaction
(3) 16-byte words, resulting in two 128-byte memory transactions

All 16 words must lie in the same segment of size equal to the memory transaction size (i.e. 64 bytes, 128 bytes, 256 bytes)

The kth thread in the half-warp must access the kth word
Not all threads need to participate

16 separate memory transactions are issued if the requirements are not fulfilled

Minimum memory transaction size is 32 bytes
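These rules are mechanical enough to check in code. The C++ sketch below is an illustrative model only: halfWarpTransactionsCC10 is a hypothetical name, the half-warp is given as 16 byte addresses with a negative value marking a thread that does not participate, and the word size is assumed to be 4, 8 or 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of memory transactions for one half-warp under the compute
// capability 1.0 / 1.1 rules listed above (hypothetical helper).
// addr[k] is the byte address requested by thread k, or negative if inactive.
int halfWarpTransactionsCC10(const std::vector<long long>& addr, int wordSize) {
    const long long segSize = 16LL * wordSize;    // 64, 128 or 256 bytes
    long long base = 0;
    bool anyActive = false;
    for (std::size_t k = 0; k < addr.size(); ++k) {
        if (addr[k] < 0) continue;                // thread k does not participate
        if (!anyActive) {
            anyActive = true;
            base = addr[k] - static_cast<long long>(k) * wordSize;
            // all 16 words must lie in one aligned segment
            if (base < 0 || base % segSize != 0) return 16;
        }
        // the kth thread must access the kth word
        if (addr[k] != base + static_cast<long long>(k) * wordSize) return 16;
    }
    if (!anyActive) return 0;
    return (wordSize == 16) ? 2 : 1;              // two 128-byte transactions for 16-byte words
}
```

A sequential, aligned float access gives 1 transaction even when some threads are inactive, while swapping two threads or shifting the start address by one word falls back to 16 transactions.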

3a. Examples of Coalesced Memory Access

For Compute Capability 1.0 and 1.1

Left: coalesced float memory access, resulting in a single 64-byte memory transaction

Right: coalesced float memory access (divergent warp), resulting in a single 64-byte memory transaction

3a. Examples of Non-Coalesced Memory Access

For Compute Capability 1.0 and 1.1

Left: non-sequential float memory access, resulting in 16 memory transactions

Right: access with a misaligned starting address, resulting in 16 memory transactions

3a. Coalescing: Compute Capability 1.2 & Higher

A global memory transaction for a half-warp is determined based on the following rules

Find the memory segment that contains the address requested by the lowest numbered active thread
    Segment size is 32 bytes for 1-byte data, 64 bytes for 2-byte data, 128 bytes for 4-, 8- and 16-byte data
Find all other active threads whose requested address lies in the same segment
Reduce the transaction size, if possible
    If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes
    If the transaction size is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes
Carry out the transaction and mark the serviced threads as inactive
Repeat until all threads in the half-warp are serviced
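The rules above amount to a small loop, sketched here in C++ as a model for experimenting with access patterns. The name halfWarpTransactionsCC12 is hypothetical; addresses are byte addresses with a negative value for an inactive thread, and each access is assumed to be aligned to its word size so that it never straddles a segment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes (in bytes) of the memory transactions issued for one half-warp under
// the compute capability 1.2+ rules listed above (hypothetical helper).
std::vector<int> halfWarpTransactionsCC12(const std::vector<long long>& addr,
                                          int wordSize) {
    const int segSize = (wordSize == 1) ? 32 : (wordSize == 2) ? 64 : 128;
    std::vector<bool> pending(addr.size());
    for (std::size_t k = 0; k < addr.size(); ++k) pending[k] = addr[k] >= 0;

    std::vector<int> transactions;
    for (std::size_t k = 0; k < addr.size(); ++k) {
        if (!pending[k]) continue;                 // k = lowest numbered active thread
        long long lo = addr[k] / segSize * segSize;
        long long minUsed = addr[k], maxUsed = addr[k] + wordSize;
        // service every active thread whose address lies in the same segment
        for (std::size_t j = k; j < addr.size(); ++j) {
            if (pending[j] && addr[j] >= lo && addr[j] < lo + segSize) {
                minUsed = std::min(minUsed, addr[j]);
                maxUsed = std::max(maxUsed, addr[j] + wordSize);
                pending[j] = false;
            }
        }
        // reduce the transaction size while only one half is used
        int size = segSize;
        while (size > 32) {
            const long long mid = lo + size / 2;
            if (maxUsed <= mid)      { size /= 2; }            // lower half only
            else if (minUsed >= mid) { lo = mid; size /= 2; }  // upper half only
            else break;
        }
        transactions.push_back(size);
    }
    return transactions;
}
```

A permuted float access inside one 64-byte region yields a single 64-byte transaction, a float access shifted by one word inside a 128-byte segment yields a single 128-byte transaction, and an access that crosses a segment boundary yields two transactions.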

3a. Eg of Global Memory Access

For Compute Capability 1.2 and higher

Left: random float memory access within a 64B segment, resulting in one memory transaction

Center: misaligned float memory access, resulting in one transaction

Right: misaligned float memory access, resulting in two transactions

3b. Shared Memory and Memory Banks

Very fast on-chip memory
Can be used to avoid non-coalesced global memory accesses
Can be used to reduce global memory accesses

Shared memory is organized into 16 banks in G80 (32 banks in Fermi), where successive 4-byte words are assigned to successive banks

Memory load or store of n addresses by a half-warp that span n distinct memory banks can be serviced simultaneously

If multiple addresses map to the same memory bank, the accesses are serialized

If multiple threads read the same memory address, a broadcast occurs

3b. Eg Without Bank Conflicts

Left: Sequential access

Right: Random permutation

3b. Eg With Bank Conflicts

Left: 2-way bank conflicts

Right: 8-way bank conflicts

3b. Eg With Broadcast

Left: No conflict

Right: No conflict if word in Bank 5 is chosen for broadcast first, or 2-way conflicts otherwise

3b. A Common Technique (Trick)

Example: The following causes a 16-way bank conflict if TILE_DIM is a multiple of 16

...
__shared__ float tile[ TILE_DIM ][ TILE_DIM ];
...
float a = tile[ threadIdx.y ][ column ];

The bank conflict can be removed entirely by padding the shared memory array with an extra column

...
__shared__ float tile[ TILE_DIM ][ TILE_DIM + 1 ];
...
float a = tile[ threadIdx.y ][ column ];
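Why one extra column is enough can be seen by counting how many threads of a half-warp land in each bank. The C++ sketch below is a model under stated assumptions: the names columnReadIndices and bankConflictDegree are hypothetical, the 16 threads of the half-warp are assumed to read rows 0 to 15 of the same column, all words are distinct (so broadcast does not apply), and there are 16 banks of 4-byte words as in G80.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Word indices touched when thread t of a half-warp reads tile[t][column]
// from a float array whose rows are rowPitchWords words long.
std::vector<int> columnReadIndices(int rowPitchWords, int column, int threads = 16) {
    std::vector<int> idx(threads);
    for (int t = 0; t < threads; ++t) idx[t] = t * rowPitchWords + column;
    return idx;
}

// Largest number of accesses that fall into the same bank:
// 1 means conflict free, 16 means fully serialized.
int bankConflictDegree(const std::vector<int>& wordIndex, int numBanks = 16) {
    std::vector<int> perBank(numBanks, 0);
    int worst = 0;
    for (int w : wordIndex) worst = std::max(worst, ++perBank[w % numBanks]);
    return worst;
}
```

With a row pitch of 16 or 32 words every row starts in the same bank, giving a degree of 16; with a pitch of 17 or 33 words successive rows are shifted by one bank and the degree drops to 1.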

3b. Shared Memory Strategy

Better coalescing means less time waiting for data, which means less latency

In general, you should break up the problem so you can do coalesced copies into shared memory

Then do processing in shared memory without bank conflicts

Then do a coalesced copy back into global memory

4. CUDA: 6th PP – Matrix Transpose

Simple direct copy (not yet transposed)

TILE_DIM = 32
BLOCK_ROWS = 8
Matrix is of dimension width x height

__global__ void copy(float *odata, float *idata, int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index = xIndex + width * yIndex;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index + i*width] = idata[index + i*width];
}

4. CUDA: 6th PP – Matrix Transpose

Direct copy using shared memory (not yet transposed)

__global__ void copySharedMem(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index = xIndex + width*yIndex;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index + i*width];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index + i*width] = tile[threadIdx.y+i][threadIdx.x];
}

4. CUDA: 6th PP – Matrix Transpose

Naïve transpose

__global__ void transposeNaive(float *odata, float *idata, int width, int height)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

    int index_in = xIndex + width * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i] = idata[index_in + i*width];
}

4. CUDA: 6th PP – Matrix Transpose
Coalesced transpose (still with bank conflicts)

__global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
}

4. CUDA: 6th PP – Matrix Transpose
Transpose without bank conflicts

__global__ void transposeNoBankConflicts(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + (yIndex)*width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + (yIndex)*height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
}
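The slides give the kernels but not the launch configuration. The sketch below is one plausible host-side driver, not the workshop's own code: it assumes a block of TILE_DIM x BLOCK_ROWS threads (which is what the loops over i imply), a grid of (width / TILE_DIM) x (height / TILE_DIM) blocks, and matrix dimensions that are multiples of TILE_DIM, since the kernel has no bounds checks. The helper name runTransposeCheck is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Kernel as on the slide: tiled transpose through a padded shared-memory tile.
__global__ void transposeNoBankConflicts(float *odata, const float *idata,
                                         int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}

// Hypothetical helper: transposes a row-major matrix with `width` columns and
// `height` rows, then checks out[x * height + y] == in[y * width + x].
// Assumes width and height are multiples of TILE_DIM.
bool runTransposeCheck(int width, int height)
{
    size_t n = (size_t)width * height;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t k = 0; k < n; ++k) h_in[k] = (float)k;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Each block handles one TILE_DIM x TILE_DIM tile with TILE_DIM x BLOCK_ROWS threads.
    dim3 threads(TILE_DIM, BLOCK_ROWS);
    dim3 grid(width / TILE_DIM, height / TILE_DIM);
    transposeNoBankConflicts<<<grid, threads>>>(d_out, d_in, width, height);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_out[(size_t)x * height + y] != h_in[(size_t)y * width + x])
                return false;
    return true;
}
```

Calling runTransposeCheck(64, 96) exercises a non-square matrix, which catches a mix-up between width and height in the output index.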


Session (6 of 8)
Day 2, Morning, 11:00am – 12:30pm – More on Correctness

5. Review on thread cooperation
- CUDA: 7th PP – Histogram (Demo)
- Cooperation within & across blocks
- Atomic functions

6. Hands-on #5: 8th PP – Centroidal Voronoi Diagram

7. Nvidia Profiler Demo (by Cao Thanh Tung)

Lunch: 12:30pm – 2:00pm

5a. CUDA: 7th PP – Histogram (1 of 2)

__global__ void kernelHistogram64(int *d_Histogram, uchar *d_Data, int n)
{
    __shared__ uchar s_hist[BLOCK][HISTOGRAM_SIZE + 1];
    __shared__ int s_sum[HISTOGRAM_SIZE];

    // Each thread initializes its own histogram
    for (int i = 0; i < HISTOGRAM_SIZE; i++)
        s_hist[threadIdx.x][i] = 0;

    int totalThreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread works on its own set of data
    for (int i = tid; i < n; i += totalThreads)
        s_hist[threadIdx.x][d_Data[i]]++;

    __syncthreads();

5a. CUDA: 7th PP – Histogram (2 of 2)

    // Merge the sub-histograms
    s_sum[threadIdx.x] = s_hist[0][threadIdx.x];
    for (int i = 1; i < blockDim.x; i++)
        s_sum[threadIdx.x] += s_hist[i][threadIdx.x];

    __syncthreads();

    // Merge with the global histogram
    //d_Histogram[threadIdx.x] += s_sum[threadIdx.x];
}

5a. CUDA: 7th PP – Histogram (2 of 2, corrected)

    // Merge the sub-histograms
    s_sum[threadIdx.x] = s_hist[0][threadIdx.x];
    for (int i = 1; i < blockDim.x; i++)
        s_sum[threadIdx.x] += s_hist[i][threadIdx.x];

    __syncthreads();

    // Merge with the global histogram
    // Without using atomic --> Wrong
    // d_Histogram[threadIdx.x] += s_sum[threadIdx.x];

    // With atomic --> Correct
    atomicAdd(&d_Histogram[threadIdx.x], s_sum[threadIdx.x]);
}

5b. Cooperation within & across blocks
Cooperation within a block: use shared memory

Difficulties of cooperation across blocks
Situation #1: Some blocks may be dead (completed) before others are born (started)
Situation #2: More than one block wants to access the “same” information – race condition

Cooperation across blocks: use global memory that is common to all blocks

5b. Race Condition

E.g. Two threads try to increase a global counter

Thread 1: R1 = counter
          R1 = R1 + 1
          counter = R1

Thread 2: R2 = counter
          R2 = R2 + 1
          counter = R2

Incorrect result! If both threads read counter before either writes it back, the counter increases by 1 instead of 2.
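The race on a global counter can be reproduced with two tiny kernels. This is a minimal sketch, not workshop code: the kernel and helper names (incrementRacy, incrementAtomic, runCounter) are made up for illustration. The racy kernel spells out the three steps from the slide; the atomic kernel replaces them with a single atomicAdd.

```cuda
#include <cuda_runtime.h>

// Racy update: read, add and write are separate steps, so increments can be lost.
__global__ void incrementRacy(unsigned int *counter)
{
    unsigned int r = *counter;  // R = counter
    r = r + 1;                  // R = R + 1
    *counter = r;               // counter = R
}

// Atomic update: the read-modify-write happens as one indivisible operation.
__global__ void incrementAtomic(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches blocks x threadsPerBlock threads that each
// increment one global counter once, and returns the final counter value.
unsigned int runCounter(int blocks, int threadsPerBlock, bool useAtomic)
{
    unsigned int *d_counter = 0;
    unsigned int h_counter = 0;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_counter, &h_counter, sizeof(unsigned int), cudaMemcpyHostToDevice);

    if (useAtomic)
        incrementAtomic<<<blocks, threadsPerBlock>>>(d_counter);
    else
        incrementRacy<<<blocks, threadsPerBlock>>>(d_counter);

    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

With the atomic kernel the result equals the number of threads launched. With the racy kernel the result is typically far smaller and can change from run to run, so only an upper bound can be relied on.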

5c. Atomic Functions
An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory.

atomicAdd( address, val) … addition
atomicSub( address, val) … subtraction
atomicExch(address, val) … exchange old with val
atomicMin( address, val) … min of old and val
atomicMax( address, val) … max of old and val
atomicInc( address, val) … (old >= val) ? 0 : (old + 1)
atomicDec( address, val) … ((old == 0) | (old > val)) ? val : (old - 1)
atomicCAS( address, compare, val) … compare and swap: (old == compare) ? val : old
bitwise atomicAnd( address, val) … ( old & val )
bitwise atomicOr( address, val) … ( old | val )
bitwise atomicXor( address, val) … ( old ^ val )

5c. Atomic Functions
atomicAdd( address, val) … addition
* read address for old
* add val to old
* store result to address
--- all steps are done in one indivisible transaction

atomicCAS( address, compare, val) … compare and swap: (old == compare) ? val : old
* read address for old
* check old against compare
* if same, store val, else keep the old value
* return old
--- all steps are done in one indivisible transaction

A way of implementing a semaphore: proceed only when the value read equals compare, else don’t do anything

NOTE: no locking mechanism available…
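The "expecting compare, else don't do anything" idea can be sketched with a slot that exactly one thread is allowed to claim. This is an illustrative example under assumed names (claimSlot, runClaim, NO_OWNER are not from the workshop): every thread tries to swap its own id into the slot, and only the thread whose atomicCAS still sees NO_OWNER succeeds.

```cuda
#include <cuda_runtime.h>

#define NO_OWNER (-1)

// Every thread tries to claim the slot. atomicCAS returns the old value, so
// exactly one thread sees NO_OWNER and becomes the owner; the rest do nothing.
__global__ void claimSlot(int *owner, int *winners)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int old = atomicCAS(owner, NO_OWNER, tid);
    if (old == NO_OWNER)
        atomicAdd(winners, 1);   // count how many threads succeeded
}

// Hypothetical host helper: returns the number of threads that claimed the
// slot and writes the winning thread id to *ownerOut.
int runClaim(int blocks, int threadsPerBlock, int *ownerOut)
{
    int h_owner = NO_OWNER, h_winners = 0;
    int *d_owner = 0, *d_winners = 0;
    cudaMalloc(&d_owner, sizeof(int));
    cudaMalloc(&d_winners, sizeof(int));
    cudaMemcpy(d_owner, &h_owner, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_winners, &h_winners, sizeof(int), cudaMemcpyHostToDevice);

    claimSlot<<<blocks, threadsPerBlock>>>(d_owner, d_winners);

    cudaMemcpy(&h_owner, d_owner, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_owner);
    cudaFree(d_winners);

    *ownerOut = h_owner;
    return h_winners;
}
```

Which thread wins is not deterministic, but the number of winners is always one. The losing threads return without waiting, which sidesteps the lack of a locking mechanism noted above.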

6. Hands-on #5: 8th PP – Centroidal VD

Input: A collection of numPoints colored 2D sites

Output: A map that colors the 2D (digital) space into various regions that “belong” to the sites, with each site at the centroid of its region.

Processing:
repeat until each site is at the centroid of its region

1. compute Voronoi diagram for sites (previous hands-on)

2. for each pixel (i,j) do
   accumulate the sums needed for the centroid of its region as
   2.1. int k = voronoi[id]; // id is a running number
   2.2. sumVectors[k].x += i;
   2.3. sumVectors[k].y += j;
   2.4. weights[k]++;
   (in kernels.cu)

3. execute movePoints() to move each site to its centroid (in main.cpp)

endRepeat
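Step 2 of the loop above is a scatter into per-site accumulators, so many pixels update the same entry and plain += would race. The sketch below shows one way to write it with atomicAdd. It is not the hands-on solution: the kernel name, its signature, the use of int2 for sumVectors, the row-major id = j * width + i, and the two-site demo map are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per pixel (i, j); the three sums for site k are updated atomically
// because many pixels belong to the same site.
__global__ void kernelCentroidsAtomicSketch(const int *voronoi, int2 *sumVectors,
                                            int *weights, int width, int height)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i >= width || j >= height) return;

    int id = j * width + i;          // running pixel number (assumed row-major)
    int k = voronoi[id];             // site that owns this pixel
    atomicAdd(&sumVectors[k].x, i);
    atomicAdd(&sumVectors[k].y, j);
    atomicAdd(&weights[k], 1);
}

// Hypothetical demo: a 64 x 32 map whose left half belongs to site 0 and whose
// right half belongs to site 1. Writes the two centroids to centroids[0..1].
void computeCentroidsDemo(float2 *centroids)
{
    const int width = 64, height = 32, numPoints = 2;
    std::vector<int> h_voronoi(width * height);
    for (int j = 0; j < height; ++j)
        for (int i = 0; i < width; ++i)
            h_voronoi[j * width + i] = (i < width / 2) ? 0 : 1;

    int *d_voronoi = 0, *d_weights = 0;
    int2 *d_sum = 0;
    cudaMalloc(&d_voronoi, width * height * sizeof(int));
    cudaMalloc(&d_sum, numPoints * sizeof(int2));
    cudaMalloc(&d_weights, numPoints * sizeof(int));
    cudaMemcpy(d_voronoi, h_voronoi.data(), width * height * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, numPoints * sizeof(int2));
    cudaMemset(d_weights, 0, numPoints * sizeof(int));

    dim3 threads(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    kernelCentroidsAtomicSketch<<<grid, threads>>>(d_voronoi, d_sum, d_weights,
                                                   width, height);

    int2 h_sum[numPoints];
    int h_weights[numPoints];
    cudaMemcpy(h_sum, d_sum, numPoints * sizeof(int2), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_weights, d_weights, numPoints * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_voronoi);
    cudaFree(d_sum);
    cudaFree(d_weights);

    // Centroid = sum of pixel coordinates / number of pixels in the region.
    for (int k = 0; k < numPoints; ++k)
        centroids[k] = make_float2((float)h_sum[k].x / h_weights[k],
                                   (float)h_sum[k].y / h_weights[k]);
}
```

Integer accumulators are used so the result does not depend on the order in which the atomic updates land.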

6. Hands-on #5: 8th PP – Centroidal VD

Tasks:
(1) Complete kernelCentroidsNoAtomic
(2) Complete kernelCentroidAtomic

[Figure: call graph with nodes labelled CPU or GPU. Functions shown: glut_idle, computeVoronoiDiagram, computeCentroids, cudaComputeCentroids, cpuComputeCentroids, movePoints, glutPostRedisplay. Kernels shown: kernelSharedVoronoiDiagram, KernelCentroidsNoAtomic, KernelCentroidsAtomic.]

7. Demo: Nvidia Profiler
Helps measure and find potential performance problems

GPU and CPU timing for all kernel invocations and memcpys
It has access to hardware performance counters

Demo by Cao Thanh Tung

Session (7 of 8)
Day 2, Afternoon, 2:00pm – 3:30pm – PP Primitive

8. CUDA: 9th PP – Sum of Array
- __threadfence( )

9. Optimization IV: Basic Parallel Primitives
- Reduction
- Parallel Prefix
- Compaction
- Sorting
- Working with CUDA Libraries (Visual Studio)
- CUDA: 10th PP – sorting (demo) w/o code

10. Hands-on #6: 10th PP – Sorting an image with CUDPP

Tea break: 3:30pm – 4:00pm

8. CUDA: 9th PP – Sum of Array (1 of 3)
Want just a single kernel to compute the sum of an array of N numbers (not a practical one, but just for the sake of discussion)

Each block first sums a subset of the array and stores the result in global memory
When all blocks are done, the last block done reads these partial sums from global memory and sums them to obtain the final result
To determine which block is last, each block atomically increments a counter
The last block is the one that receives a counter value equal to gridDim.x – 1
A memory fence ( __threadfence( ) ) is used to ensure the partial sum is stored before the counter is incremented

__threadfence() is a memory fence, not a barrier: it orders the calling thread’s memory writes as seen by other threads but, unlike __syncthreads(), it does not make threads wait for one another
From CUDA C Programming Guide Version 3.2, Appendix B.

8. CUDA: 9th PP – Sum of Array (2 of 3)

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N, float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores partial sum to global memory
        result[blockIdx.x] = partialSum;
        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();
        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);
        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

8. CUDA: 9th PP – Sum of Array (3 of 3)

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}

9. Optimization IV: Basic Parallel Primitives
We have previously covered parallel prefix (Session #4)

Examples of Parallel Primitives:
Reduction
Parallel Prefix
Compaction
Sorting

CUDPP is the CUDA Data Parallel Primitives Library, a library of data-parallel algorithm primitives

Primitives such as these are important building blocks for a wide variety of data-parallel algorithms.

9. Definitions of Basic Operations
Reduce
out = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[n–1]
where ⊕ is associative

Parallel Prefix / Scan
out[i] = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[i]
where ⊕ is associative
E.g. all-prefix-sums

(Stream) Compaction
out[0 .. m–1] = subset of in[0 .. n–1] that satisfies some condition
where m ≤ n

Sort
Non-decreasing, non-increasing, stable, non-stable, in-place
Radix sort
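As an illustration of the Reduce definition with ⊕ taken to be addition, the sketch below sums an array with a shared-memory tree reduction per block and adds the per-block partial sums on the host. It is a simple, non-optimized variant written for this note, not the workshop's or CUDPP's implementation; the names blockSum, sumArray, BLOCK_SIZE and NUM_BLOCKS are assumptions. Floating-point addition is only approximately associative, so results may differ slightly from a sequential sum in general.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256   // threads per block; must be a power of two
#define NUM_BLOCKS 64

// Each block reduces a strided subset of `in` to one partial sum.
__global__ void blockSum(const float *in, float *blockSums, unsigned int n)
{
    __shared__ float s[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    // Each thread first accumulates its own elements (grid-stride loop).
    float acc = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    s[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = s[0];
}

// Hypothetical host helper: returns the sum of h_in.
float sumArray(const std::vector<float> &h_in)
{
    unsigned int n = (unsigned int)h_in.size();
    if (n == 0) return 0.0f;

    float *d_in = 0, *d_partial = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, NUM_BLOCKS * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_in, d_partial, n);

    float h_partial[NUM_BLOCKS];
    cudaMemcpy(h_partial, d_partial, NUM_BLOCKS * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    // Final combine of the per-block partial sums on the host.
    float total = 0.0f;
    for (int b = 0; b < NUM_BLOCKS; ++b)
        total += h_partial[b];
    return total;
}
```

Finishing on the host keeps the sketch short; the single-kernel version in the Sum of Array slides replaces that last step with the "last block done" pattern.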

9. Compaction
Stream compaction is a filtering operation

Takes an input vector vi and a predicate p, and outputs only those elements in vi for which p(vi) is true, preserving the ordering of the input elements

Parallel stream compaction can be implemented (on our own, non-optimized) using:

Scan and scatter

9. Compaction: Using Scan and Scatter

[Figure: compaction by scan and scatter; the scan used is a pre-scan]

No conflict here because only elements “C” and “G” can write to the output array
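The scan-and-scatter scheme can be sketched in three steps: mark each element with a 0/1 flag, take an exclusive (pre-)scan of the flags to get each kept element's output position, then scatter. The example below is an assumption-laden illustration rather than the workshop's code: the predicate (keep positive integers) and all function names are made up, and the scan is done with Thrust's exclusive_scan instead of CUDPP simply to keep the block self-contained.

```cuda
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <vector>

// Step 1: flag[i] = 1 if the element satisfies the predicate (here: in[i] > 0).
__global__ void markFlags(const int *in, int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] > 0) ? 1 : 0;
}

// Step 3: each kept element writes itself to its scanned position. Positions
// are distinct, so no two threads write to the same output slot.
__global__ void scatter(const int *in, const int *flags, const int *offsets,
                        int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = in[i];
}

// Hypothetical host helper: returns the positive elements of h_in, in order.
std::vector<int> compactPositive(const std::vector<int> &h_in)
{
    int n = (int)h_in.size();
    if (n == 0) return std::vector<int>();

    int *d_in = 0, *d_flags = 0, *d_offsets = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_flags, n * sizeof(int));
    cudaMalloc(&d_offsets, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    markFlags<<<blocks, threads>>>(d_in, d_flags, n);

    // Step 2: exclusive scan (pre-scan) of the flags gives the output offsets.
    thrust::exclusive_scan(thrust::device, d_flags, d_flags + n, d_offsets);

    // Number of kept elements m = last offset + last flag.
    int lastOffset = 0, lastFlag = 0;
    cudaMemcpy(&lastOffset, d_offsets + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastFlag, d_flags + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    int m = lastOffset + lastFlag;

    scatter<<<blocks, threads>>>(d_in, d_flags, d_offsets, d_out, n);

    std::vector<int> h_out(m);
    if (m > 0)
        cudaMemcpy(h_out.data(), d_out, m * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_flags);
    cudaFree(d_offsets);
    cudaFree(d_out);
    return h_out;
}
```

Because the offsets come from a scan over the original order, the output preserves the ordering of the input elements, as the definition of stream compaction requires.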

9. Working with CUDPP (1 of 3)

CUDPP documentation: http://cudpp.googlecode.com/svn/tags/1.1.1/cudpp/doc/html/index.html
Example: http://cudpp.googlecode.com/svn/tags/1.1.1/cudpp/doc/html/example_simplecudpp.html

9. Working with CUDPP (2 of 3)
Easy stuff: link cudpp32.lib when building, and make cudpp32_xxxx.dll from the CUDA SDK directory available at run time

Steps to getting it working (e.g. scan):
1. Configure cudpp to work with a plan

9. Working with CUDPP (3 of 3)
Steps to getting it working (e.g. scan):

2. Once okay, run the plan to do the scan

3. Next is to read the result back

4. Finally, clean up the memory used

10. Hands-on #6: 10th PP – Sorting Colors

Demo of sorting of image colors (rgba) for each row of the image in increasing order

Perform the above sorting steps using the image processing program

Results of the 4 options are shown on the left.

[Figure: four result images. Option 1 (original), Option 2 (inverted), Option 3 (sorted by color), Option 4 (sorted by color in each row)]

10. Hands-on #6: 10th PP – Sorting Colors

[Figure: call graph with nodes labelled CPU or GPU. Names shown: glut_keyboard, invertImage, cudaSortImage (the usual setting up; prepare sorting plan), sortImage, initializeRows, cudppSort (applied twice), kernelInvertImage.]

Each pixel is given a row number to facilitate 2 rounds of sorting


Session (8 of 8)
Day 2, Afternoon, 4:00pm – 5:30pm – Roundup

11. Roundup Optimization
- Lecture by Prof Hwu Wen-mei

12. Roundup Theory of Parallel Computation
- Amdahl’s Law
- Gustafson’s Law
- NC
- Work Efficiency

13. Roundup Parallel Computational Thinking

14. Outlook

Keep-in-touch!

11. What we did not cover?
Here are some samples of stuff not covered:

C/C++ Runtime for CUDA vs CUDA Driver API

Memory optimizations
- E.g. Data transfer between host and device
  Pinned memory & zero copy
  Asynchronous transfers (both CPU & GPU are busy)
- Global memory bank conflicts
- CUDA array (for textures)

CUDA and OpenGL/DirectX interoperability
- CUDA with OpenGL – only a special type of texture (RGBA, 32 bits x 4) that can be processed w/o reading and writing from one to another

Multi-GPUs
- Windows OS limits computation to 5 sec; it uses 100-200MB for display purposes.

11. Roundup Optimization
Parallel Problem Solving: (1) Sequential solution (2) Sequential solution with the right data structure (3) Parallel solution (4) Parallel solution with work (algorithmic) optimization (5) Parallel solution with hardware consideration

Lecture by Prof Hwu Wen-mei: http://www.gpucomputing.net/?q=node/2376, also: Books: GPU Computing Gems, vol. 1 & 2

8 Techniques in many-core programming
Scatter to gather transformation
Privatization
Granularity coarsening
Data tiling/reuse
Data layout and traversal ordering
Input data binning
Bin compaction
Regularization

11. Eight Techniques in Programming
#1: Scatter to Gather: prefer gather for computation
Example: Jump Flooding Algorithm (video)

#2: Privatization: first do your own, then combine
Example: Histogram (7th PP, Session #6)

#3: Granularity coarsening: fewer threads, so that each thread can re-use computation
Example: Potential energy calculation

#4: Data Tiling/reuse: using shared memory
Example: matrix multiplication

#5: Data Layout & Traversal ordering
Memory optimization examples

#6: Input Data Binning
Example: gHull

#7: Bin Compaction

#8: Regularization
Example: graph algorithm (dynamic amount of work)

12a. Amdahl’s Law
Amdahl’s Law: used to find the maximum expected improvement to an overall system when only part of the system is improved.

Assumption: The problem size is fixed (or the proportion of the sequential portion does not change with problem size)

S is the possible speedup
p is the fraction of the program that is parallelizable
N is the number of parallel processors

S = 1 / [ (1 – p) + p/N ], which is
S = 1 / (1 – p) when N goes to infinity
E.g. if p is 90%, then S is at most 10x speedup

http://en.wikipedia.org/wiki/Amdahl's_law
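A short worked example, using the 90% figure from the slide, shows how quickly Amdahl's speedup saturates as processors are added. The processor counts 10, 100 and 1000 are chosen here purely for illustration.

```latex
\begin{align*}
S(N) &= \frac{1}{(1-p) + p/N}, \qquad p = 0.9 \\
S(10) &= \frac{1}{0.1 + 0.09} \approx 5.26 \\
S(100) &= \frac{1}{0.1 + 0.009} \approx 9.17 \\
S(1000) &= \frac{1}{0.1 + 0.0009} \approx 9.91 \\
\lim_{N \to \infty} S(N) &= \frac{1}{0.1} = 10
\end{align*}
```

Going from 100 to 1000 processors buys less than one extra unit of speedup, because the 10% sequential part dominates the running time.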


12a. Gustafson’s Law
Gustafson’s Law: says that problems with large, repetitive data sets can be efficiently parallelized.

Assumption: The problem size can grow so big that the sequential portion becomes insignificant

S is the possible speedup: S = SequentialTime / ParallelTime
Time taken on a parallel machine of N processors is normalized to 1

There are two components, sequential-part (a) + parallel-part (b):
a + b = 1

The above time converted to a sequential machine is then:
a + N b = a + N (1 – a)

So, the speedup is:
S = [ a + N (1 – a) ] / 1 = a + N (1 – a), which is
S = N when the sequential-part is insignificant

http://en.wikipedia.org/wiki/Gustafson%27s_Law
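For comparison with Amdahl's Law, the same kind of worked example can be done with Gustafson's formula. The sequential fraction a = 0.1 and the processor counts are illustrative choices, not values from the slides.

```latex
\begin{align*}
S(N) &= a + N(1 - a), \qquad a = 0.1 \\
S(10) &= 0.1 + 10 \times 0.9 = 9.1 \\
S(100) &= 0.1 + 100 \times 0.9 = 90.1 \\
S(1000) &= 0.1 + 1000 \times 0.9 = 900.1
\end{align*}
```

The speedup keeps growing almost linearly with N because the problem size is assumed to grow with the machine, so the sequential part stays a fixed 10% of the parallel run time rather than of the total work.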

12b. Amdahl’s vs Gustafson’s
Amdahl’s Law
“Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve 90 mph average before reaching the second city. Since it has already taken you 1 hour and you only have a distance of 60 miles total; going infinitely fast you would only achieve 60 mph.”

Gustafson’s Law
“Suppose a car has already been traveling for some time at less than 90mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90mph, no matter how long or how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour, and so on.”

http://en.wikipedia.org/wiki/Gustafson%27s_Law

12b. Amdahl’s vs Gustafson’s vs Nvidia
Gustafson's Law argues that even using massively parallel computer systems does not influence the serial part, and regards this part as a constant one.

Amdahl’s Law results from the idea that the influence of the serial part grows with the number of processes.

Nvidia refers to what we have in practice…
The above laws neglect other potential bottlenecks such as memory bandwidth and I/O bandwidth (which generally do not scale very well with the number of processors)… recall bank conflicts, coalescing, etc.
Take the laws with a grain of salt: we do not always parallelize a sequential version of the algorithm

So, what is the best that we can do? Next slide: more theory…

12c. NC
The class NC (for “Nick’s Class”) is the set of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors

A problem is in NC if there exist constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors
Stephen Cook coined the name NC after Prof Nick Pippenger

NC is a subset of P
It is unknown whether NC = P, but most researchers suspect this to be false

Parallel computer means “parallel random-access machine” (PRAM). It can be CRCW, CREW, EREW.

Many developments in the 1970s and 1980s on PRAM models and results…

Why polylogarithmic time? Logarithmic is the best in parallel, just as linear is the best in sequential

http://en.wikipedia.org/wiki/NC_(complexity)

12d. Work Efficient
Besides the (poly)logarithmic time, we should look for a work-efficient algorithm (which also means power efficient)

Work = Total-Time x Work-Per-Time-Step

E.g. Jump Flooding Algorithm (video)
Time is O(log n) where there are n x n pixels
Is this work efficient? Work = O(log n) x n x n = n^2 log n
What is the consequence?

Parallel Banding Algorithm (PBA)

Other example: Work Efficient Parallel Prefix

13. Roundup Parallel Computational Thinking
Computational thinking is the thought process of formulating domain problems in terms of computation steps/algorithms.

Skill set to be an effective computational thinker (from Kirk & Hwu):
Computer architecture – memory organization; caching & locality, memory bandwidth; SIMT, SPMD, SIMD; floating point precision vs accuracy
Programming models – types of available memories, concepts to think through the data organization and loop structures
Algorithm techniques – understanding tiling, cutoff, binning, etc. as a toolbox, and implications of scalability, efficiency and memory bandwidth
Domain knowledge of the problems – deviation from the sequential computation may be possible

13. Roundup Parallel Computational Thinking
(Table columns: Challenge | Issue | Attempt | Coverage | Remark)

Challenge: Correctness – always produce the correct answer each time it runs

  Issue (1): Indexing mechanism
    Attempt: * divide logically the global memory into logical parts to be processed by blocks of threads
    Coverage: Session #1: vector addition (simple); Session #2: image processing (simple); Session #5: matrix transpose (complex)
    Remark: * error-prone in programming * wishing for a visual programming tool

  Issue (2): Race condition
    Attempt: * use of atomic functions
    Coverage: Session #6: atomic examples
    Remark: * evolving with technology * wishing for help by compiler

Challenge: Performance – always run much faster than a sequential solution within the available computing resources

  Issue (1): Algorithm optimization – regularize work, balance work
    Attempt: * high level redesign of (work efficient) algorithm * low level adjustment to thread assignment to part of global memory
    Coverage: gHull, pba by G3 Lab; Session #4: parallel prefix (ver 3); Session #7: using primitive routine
    Remark: * limited by our imagination

  Issue (2): Data structure / memory optimization – regularize & localize data, no conflict in update
    Attempt: * use of intermediate memory (texture, shared, etc.) * avoid (shared memory) bank conflicts * plan for coalesced (global) memory transactions
    Coverage: Session #3: maxdist, Voronoi diagram; Session #5: matrix transpose
    Remark: * wishing for a smart compiler

  Issue (3): Limits from Theory
    Attempt: * Amdahl’s Law * Gustafson’s Law * NC
    Coverage: Session #8: general theory
    Remark: * any other theory?

Challenge: Scalability – always can handle growing amounts of work in a graceful manner

  Issue: Data too big to fit into GPU
    Attempt: * related to efficiency * batch processing * out-of-core processing
    Remark: * no particular research here

14. Outlook
US Stock: if you buy, check out the trend of nVidia, and the (opposite?) trend of Intel ☺
More parallel computers were sold in the recent year than in the past 30 years. GPU takes root.

Software
Software companies are “re-architecting” software to take advantage of GPUs

Professionals
The upcoming generation of software engineers should be thinking in parallel to be really competitive

Technology
Convergence of GPU with CPU
OpenCL vs CUDA (vs BrookGPU, etc.)
nVidia vs AMD/ATI, nVidia vs Intel

Research
Better compiler, visual development toolkits?
Better understanding of parallel problem solving strategies
Discrete (GPU) vs Continuous (real) problem understanding

15. Q & A

References
NVIDIA CUDA Programming Guide, Version 3.1
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf

NVIDIA CUDA C Programming Best Practices Guide, Version 3.1
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_BestPracticesGuide_3.1.pdf

Mark Harris, Optimizing Parallel Reduction in CUDA
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Can also be found in CUDA SDK projects\reduction\doc folder

GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

GPU Gems, Chapter 37: A Toolkit for Computation on GPUs
http://http.developer.nvidia.com/GPUGems/gpugems_ch37.html

GPU Gems 2, Chapter 32: Taking the Plunge into GPU Computing, Section 32.3
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html

GPU Gems 2, Chapter 36: Stream Reduction Operations for GPGPU Applications, Section 36.1
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter36.html
