GPU Programming Workshop (NUS Extension)
18-19 November 2010
Solving problems with thousands of “CPU”s
Instructor: Dr Tan Tiow Seng ([email protected]) http://www.comp.nus.edu.sg/~tants
Assistants: Cao Thanh Tung, Tang Ke
Session (1 of 8)
Day 1, Morning, 9:00am – 10:30am – Introduction
1. Why do we need GPU (thousands of “CPUs”)?
2. Demo of GPU Examples
3. From Sequential to Parallel Computational Thinking
4. How to work with GPU?
- CPU “talks” to GPUs through global memory
- GPUs “perform” the work with kernel (CUDA - an extension of C/C++)
- CUDA: 1st PP – Vector Addition
5. CUDA “Concepts” of threads, blocks and grid
Break: 10:30am – 10:50am
1. Why do we need GPU?
CPU vs GPU Performance Animation from 2000 till 2009 http://www.youtube.com/watch?v=eH96JE-CnHw
Fermi architecture discussion (video) http://www.nvidia.com/object/fermi_architecture.html
1. Why do we need GPU?
CPU: Moore’s Law
GPU: Moore’s Law Cubed (?)
Schematic difference between CPU and GPU
Interview with David Kirk: http://www.gamespot.com/news/2836326.html
Analogy: Developed Country Style / Developing Country Style
Who is winning (country) currently? http://gpgpu.org/2010/10/28/nvidia-tesla-gpus-power-worlds-fastest-supercomputer#more-2931 : Tianhe-1A, 2.507 petaflops
It is all about Performance & Scalability
Scalability is also key to power efficiency
1. Why do we need GPU?
“Development” (Programming) is to be done differently
Key Techniques:
  To regularize work & data for massively parallel execution
  To localize data for conserving memory bandwidth
Key Hurdles
  Serialization due to conflicting update
  Over subscription of resources
  Load imbalance among threads
http://www.gpucomputing.net/?q=node/2376
2. Demo of GPU Examples
Graphics with GPU
  Shadow computation
  Real-time rendering
  Maya©, Photoshop©, etc.
General Purpose GPU www.gpgpu.org
  Matlab©
  Geometric Computation
  Wen-mei Hwu’s talk…
http://www.gpucomputing.net/?q=node/2376
David Kirk & Wen-mei Hwu: More parallel computers were sold in the past year than in the previous 30 years.
2. Demo of GPU Examples
Examples in CUDA SDK
  Particle system
  Image denoising
  Smoke particle
G3 Research Lab
  Shadow computation video
  Jump Flooding Algorithm (JFA): Voronoi Diagram video
  Parallel Banding Algorithm (PBA) video
  2D Delaunay Triangulation (gpuDT) video
  3D gHull video
  3D gpu3DT
http://www.comp.nus.edu.sg/~tants
3. From Sequential to Parallel Computational Thinking
Direct translation from sequential to parallel threads
  Example problem: known input to known output (location) … need gather rather than scatter (Vector addition)
  Example problem: known input to unknown output (location) … need privatization (Histogram)
More than the above … one possible approach
  (1) Sequential solution
  (2) Sequential solution with the right data structure
  (3) Parallel solution
  (4) Parallel solution with work (algorithmic) optimization
  (5) Parallel solution with hardware consideration
  Possibility of a different sequential thinking
Any algorithmic paradigms? Will revisit this point at the end
http://www.gpucomputing.net/?q=node/2376
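The histogram case (known input, unknown output location) is commonly handled by privatization: each block accumulates into its own shared-memory copy of the bins and merges that copy into the global result once at the end. The following is a minimal sketch of that idea, not the workshop's own code. It assumes 256 bins over byte-valued input and a device with shared-memory atomics (compute capability 1.2 or higher); the names `histogramPrivatized` and `histogramMismatches` are made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 256

// Each block builds a private histogram in shared memory, then merges it
// into the global bins, so most conflicting updates stay on-chip.
__global__ void histogramPrivatized(const unsigned char *data, int n,
                                    unsigned int *bins)
{
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;                          // clear the private copy
    __syncthreads();

    int stride = blockDim.x * gridDim.x;       // grid-stride loop over input
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[data[i]], 1u);        // scatter into private bins
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);         // merge into global bins
}

// Hypothetical driver: returns the number of bins holding a wrong count.
int histogramMismatches(int n)
{
    unsigned char *h_data = (unsigned char *)malloc(n);
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % NUM_BINS);

    unsigned char *d_data;
    unsigned int *d_bins;
    cudaMalloc((void **)&d_data, n);
    cudaMalloc((void **)&d_bins, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));

    histogramPrivatized<<<64, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[NUM_BINS];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int b = 0; b < NUM_BINS; ++b) {
        unsigned int expected = n / NUM_BINS + (b < n % NUM_BINS ? 1 : 0);
        if (h_bins[b] != expected) ++bad;
    }
    cudaFree(d_data);
    cudaFree(d_bins);
    free(h_data);
    return bad;
}

int main()
{
    int bad = histogramMismatches(1 << 20);
    printf("bins with wrong count: %d\n", bad);
    return bad != 0;
}
```

Vector addition needs none of this: every thread knows its output location in advance and simply gathers its two inputs.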
3. Differences in Computational Thinking
Sequential Computation Thinking
  Do not care much about the “machine” (in general)
  Avoid any wastage of computing at all times
  Just do it; no one to cooperate with – work to be done has to be done still. Nothing to complain about – just one person on the project
  Optimization should be invariant with time…
Multi-core Parallel Computational Thinking
  Still ok to ignore the underlying architecture
  Redundant work still undesirable
  Cooperation between “workers” becomes a problem: who does what, avoiding overlapping/conflicting of interest – work balancing
Parallel Computation Thinking
  Understanding every bit of the “machine” is important. Resources are shared by many, and one needs to know the underlying “rules” for best utilization.
  Redundant computations are acceptable for overall performance of the “team”
  How to divide the work becomes a significant problem (work balancing); cooperation in a team with thousands of people is not easy. Communication adds another layer of complexity. More people does not necessarily mean faster.
  Optimization may not be true “tomorrow”
3. Differences in Computational Thinking
Difficulties in Massively Multi-threading Programming
1. Indexing mechanism – mapping of data to “computer” (learn throughout the course)
  Total “global store”
  “global store” in “tile form”
  # of threads in a block to handle “global store”
  # of threads in a block formed a tile of a different tile size as the global store
2. Memory (hardware) consideration (Session #5)
  Shared memory with no bank conflict
  Global memory access in a coalesced manner
3. Program correctness without race condition (Session #6)
4a. CPU (HOST) Talking to GPU (DEVICE)
Pass Input
  Write to Global Memory
    Ask GPU where to put the data
    Put the data there
  (more to come)…
Get Output
  Copy from Global Memory
    Know where to pick up the data
  Leave it within device for subsequent computation
Ask the work (outsourcing) to be done
  Launching Kernel
4a. What is CUDA
“Compute Unified Device Architecture”
NVIDIA’s solution to parallel computation (for G80 series onward)
  Parallel programming model
  Massively multithreaded hardware
GPU = dedicated many-core co-processor
General purpose programming model
  User launches batches of threads on the GPU
  Fully general load/store memory model (concurrent read concurrent write)
  Simple extension to standard C/C++
  Mature(d) software stack (high-level and low-level access)
4a. What is CUDA
At its core are three key abstractions exposed to programmers:
  A hierarchy of thread groups
  Shared memories, and
  Barrier synchronization
The abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition:
  the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and
  each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability.
Difficulty to us: NOT ALL BLOCKS ARE IN THE COMPUTATION AT THE SAME TIME.
4a. Talking through Global Memory
GPU memory allocation / release
  cudaMalloc(void **pointer, size_t nbytes);
  cudaMemset(void *pointer, int value, size_t count);
  cudaFree(void *pointer);
Data copy
  cudaMemcpy(void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
    direction specifies locations (host or device) of src and dst
    Blocking (there is a corresponding non-blocking version)
    Doesn’t start copying until previous CUDA calls complete
  enum cudaMemcpyKind
    cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
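As an illustration of these calls, the sketch below sends an array from host to device, copies it device-to-device, and brings it back, checking every return code. The `CHECK` macro and the function `roundTripMismatches` are assumptions made for this example and are not part of the workshop code; the CUDA runtime functions themselves are the ones listed on the slide.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Host -> device -> device -> host; returns the number of wrong elements.
int roundTripMismatches(int n)
{
    size_t nbytes = n * sizeof(int);
    int *h_src = (int *)malloc(nbytes);
    int *h_dst = (int *)malloc(nbytes);
    for (int i = 0; i < n; ++i) h_src[i] = i;

    int *d_a = NULL, *d_b = NULL;
    CHECK(cudaMalloc((void **)&d_a, nbytes));   // ask GPU where to put data
    CHECK(cudaMalloc((void **)&d_b, nbytes));
    CHECK(cudaMemset(d_b, 0, nbytes));          // byte-wise fill on device

    CHECK(cudaMemcpy(d_a, h_src, nbytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, d_a, nbytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaMemcpy(h_dst, d_b, nbytes, cudaMemcpyDeviceToHost));

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_dst[i] != h_src[i]) ++bad;

    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    free(h_src);
    free(h_dst);
    return bad;
}

int main()
{
    int bad = roundTripMismatches(1 << 20);
    printf("mismatched elements: %d\n", bad);
    return bad != 0;
}
```

Because `cudaMemcpy` is blocking, no explicit synchronization is needed before the host reads `h_dst`.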
4b. Launching Kernels
Modified C/C++ function call syntax
  kernel<<<dim3 grid, dim3 block, size_t smem, cudaStream_t stream>>>(…)
Execution Configuration (“<<< >>>”)
  grid: grid dimensions x and y
  block: thread-block dimensions x, y, and z
  smem: dynamic shared memory size, measured in bytes per block
    optional, 0 by default
  stream: stream ID
    optional, 0 by default
4c. CUDA : 1st PP – Vector Addition (1 of 3)
CUDA C Program – main part

Device code (CUDA kernel)
__global__ void VecAdd( float *A, float *B, float *C )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < N ) C[i] = A[i] + B[i];
}

Host code
int main()
{
    ......
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C );
}
4c. CUDA : 1st PP – Vector Addition (2 of 3)
#include <cutil_inline.h>
#define N 204800

// Device code
__global__ void VecAdd( float *A, float *B, float *C )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < N ) C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Allocate vectors in device memory
    size_t size = N * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**)&d_A, size );
    cudaMalloc( (void**)&d_B, size );
    cudaMalloc( (void**)&d_C, size );

4c. CUDA : 1st PP – Vector Addition (3 of 3)
    float* h_A = (float *) malloc( size );
    float* h_B = (float *) malloc( size );
    float* h_C = (float *) malloc( size );
    // ... add code to initialize h_A and h_B

    // Copy vectors from host memory to device memory
    cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>( d_A, d_B, d_C );

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost );

    // Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}

Demo: Study the MS Visual Studio project on Vector Addition: different block sizes, timing with/without copying, etc.
4c. CUDA : Matrix Addition (2D case)
CUDA C Program – main part (matrix N x N)

Device code (CUDA kernel)
__global__ void addMatrixG( float *a, float *b, float *c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

Host code
int main()
{
    ......
    dim3 dimBlk( 16, 16 );
    dim3 dimGrd( (N-1)/dimBlk.x + 1, (N-1)/dimBlk.y + 1 );
    addMatrixG<<<dimGrd, dimBlk>>>( a, b, c, N );
}
5. CUDA : Thread Hierarchy
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by
  Sharing data through shared memory
  Synchronizing their execution
Threads from different blocks cannot synchronize (within a kernel)
  Some blocks may not yet have been created when others have already finished their computation
5. CUDA : Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
  Image processing
  Solving PDEs on volumes
  …
Session (2 of 8)
Day 1, Morning, 10:50am – 12:30pm – Get it going
5. CUDA basic concepts (organizing the resources)
- Threads, blocks and grid
6. Hands-on #1: 2nd PP – image processing
- Learn to deal with texture/display
7. Working with Microsoft Visual Studio
- Learn to create project to work from any computer
8. Hands-on #2: 3rd PP – Voronoi Diagram (part 1)
- Learn to create a (quick) solution from scratch
Lunch break: 12:30pm – 2:00pm
5. CUDA at work
[Figure: host and device at work. Labels: Host, Device, Global Memory, Shared memory, Texture, working space: register]
5. CUDA at work
For the host:
  How “best” to pass the data to the device
    Options of global memory, constant memory, function parameters (and CUDA array, texture)
  How “best” to get the work organized (in passes) to the device
    Options to do filtering before actual work
    Options to cut work into pieces to be done
    Option to do the work yourself instead of outsourcing
For the device:
  How “best” to deploy the resources to get the work done
    Options to organize into up to a 2D grid of blocks, each with up to 3D threads
    Options to cooperate among threads through shared memory
    Options to cooperate across blocks through global memory
    Options on when to move data around for computation
5. CUDA
Comes with
  Hardware architecture
  Programming model
  Programming interface
    C/C++ for CUDA (extension to the C/C++ language)
    CUDA Driver API
  Software development environment
    Tools (compiler, debugger, profiler), C/C++ runtime, libraries
Only works on NVIDIA G80 (GeForce 8000 series) and newer cards
Designed to scale well over time
Meant to be more general than just graphics cards
5. CUDA Programming Model
Some design goals
  Transparently scales to hundreds of cores and thousands of parallel threads
    Allows software to run efficiently across varying hardware
  Let programmers focus on parallel algorithms
    Not mechanics of a parallel programming language
    C/C++ for CUDA plus runtime API
  Enable heterogeneous systems (i.e. CPU + GPU)
    CPU and GPU are separate devices with separate DRAMs
5. CUDA : Kernels and Threads
Definitions
  Device = GPU
  Host = CPU
  Kernel = function that runs on the device
Parallel portions of an application are executed on the device as kernels
  One kernel is executed at a time (Fermi can run multiple kernels concurrently when they are launched in different streams other than stream 0)
  Many threads execute the same kernel (on different data)
CUDA threads are extremely lightweight
  Very little creation overhead
  Instant switching
CUDA uses thousands of threads to achieve efficiency
5. CUDA : Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
  All threads run the same code
  SPMD model – Single Program, Multiple Data
  Each thread has an ID to identify itself so that it knows what computation or control to work on (e.g. address computation)
threadID: 0 1 2 3 4 5 6 7
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
5. CUDA : Transparent Scalability
Hardware is free to schedule thread blocks to any processor
  A kernel scales across any number of parallel multiprocessors
Each block can execute in any order relative to other blocks.
[Figure: a kernel grid of Blocks 0 to 7 scheduled over time on two devices. One device runs two blocks at a time (Blocks 0-1, then 2-3, 4-5, 6-7); the other runs four blocks at a time (Blocks 0-3, then 4-7)]
5. CUDA : Thread Hierarchy (duplicate)
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by
  Sharing data through shared memory
  Synchronizing their execution
Threads from different blocks “cannot” cooperate
5. CUDA : Block IDs and Thread IDs (duplicate)
Each thread uses IDs to decide what data to work on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
  Image processing
  Solving PDEs on volumes
  …
5. CUDA : Execution Model
Kernels are launched in grids
  One kernel executes at a time (not true for Fermi)
A block executes on one multiprocessor
  Does not migrate
Several blocks can reside concurrently on one multiprocessor
  Control limitations (of G8X/G9X GPUs: 128 Scalar Processors grouped into 16 Streaming Multiprocessors, SMs)
    At most 8 concurrent blocks per SM
    At most 768 concurrent threads per SM
  Number is further limited by SM resources
    Register file is partitioned among all resident threads
    Shared memory is partitioned among all resident thread blocks
5. CUDA : Device Code
C/C++ functions with some restrictions
  Can only access GPU memory (except for Zero Copy feature)
  No variable number of arguments (“varargs”)
  No static variable
Function qualifiers
  __global__
    Function is a kernel
    Must have void return type
    A call to function must specify execution configuration
  __device__
    Only for the device to call
  __host__ (default)
    Only for the host to call
  __host__ and __device__ can be used together
5. CUDA : Built-In Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
  dim3 gridDim;
    Dimensions of the grid in blocks (gridDim.z must be 1)
  dim3 blockDim;
    Dimensions of the block in threads
  uint3 blockIdx;
    Block index within the grid
  uint3 threadIdx;
    Thread index within the block
5. CUDA : Variable Qualifiers (GPU Code)
__device__
  Stored in global memory (large, high latency, no cache)
__shared__
  Stored in on-chip shared memory (very low latency)
  Accessible concurrently by all threads in the same thread block
Unqualified variables
  Scalars and built-in vector types are stored in registers
  Arrays of more than 4 elements or run-time indices stored in local memory
    Limited per thread (16KB – 32KB for old architecture, 512KB for Fermi)
__constant__ (not in kernel; limited in size, about 64KB; global)
  Same as __device__, but cached and read-only by GPU
  Written by CPU via cudaMemcpyToSymbol()
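To make the `__constant__` qualifier concrete, the sketch below stores polynomial coefficients in constant memory, fills them from the host with `cudaMemcpyToSymbol()`, and reads them in a kernel. This is an illustrative example under assumptions, not code from the workshop: the names `coeffs`, `evalPoly` and `polyMismatches` are invented here, and the inputs are small integers so that float results compare exactly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_COEFFS 4

// Declared at file scope (not inside a kernel); read-only on the device.
__constant__ float coeffs[NUM_COEFFS];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
__global__ void evalPoly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;                 // unqualified scalar: a register
        for (int k = NUM_COEFFS - 1; k >= 0; --k)
            acc = acc * x[i] + coeffs[k]; // all threads read the same coeff
        y[i] = acc;
    }
}

// Hypothetical driver: returns the number of wrong outputs.
int polyMismatches(int n)
{
    const float h_coeffs[NUM_COEFFS] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)); // written by CPU

    size_t nbytes = n * sizeof(float);
    float *h_x = (float *)malloc(nbytes);
    float *h_y = (float *)malloc(nbytes);
    for (int i = 0; i < n; ++i) h_x[i] = (float)(i % 16);

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, nbytes);
    cudaMalloc((void **)&d_y, nbytes);
    cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);

    int threads = 256;
    evalPoly<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, n);
    cudaMemcpy(h_y, d_y, nbytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float x = h_x[i];
        float expected = ((4.0f * x + 3.0f) * x + 2.0f) * x + 1.0f;
        if (h_y[i] != expected) ++bad;
    }
    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return bad;
}

int main()
{
    int bad = polyMismatches(1024);
    printf("wrong outputs: %d\n", bad);
    return bad != 0;
}
```

Constant memory suits this access pattern because every thread of a warp reads the same coefficient at the same time.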
6. Hands-on #1: 2nd PP – Process Image
.cpp and .cu programs are given
  Briefing on reading image (as array of imgWidth * imgHeight * 4 GLbyte)
  Briefing on OpenGL texture, texture swapping for display
  Briefing on GLUT for keyboard input
Task 1: Get the application working (see also Section #7)
  Include the necessary CUDA build rules to get .cu program to be compiled
  Include cudart.lib to perform linking
  Include correct path for CUDA stuff, e.g. $(CUDA_LIB_PATH)
  Execute and run the program
Task 2: Perform, if you like, other filtering of the image
  For example, remove red, or average surrounding pixels, etc.
Option: 1 (original) Option: 2 (inverted)
6. Hands-on #1: 2nd PP – Process Image
OpenGL / GLUT Program
[Figure: call graph. CPU side: main, glutMainLoop, glut_mouse, glut_mouseMotion, glut_keyboard, glutSwapBuffers, invertImage, cudaInvertImage (allocate memory etc. to execute kernel). GPU side: KernelInvertImage]
7. Working with Microsoft VS
Download & install CUDA Toolkit (for run time) + SDK (for development) from Developer Zone (once)
  http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html
To create a new Win32 Console Application with Visual Studio:
  File > New > Project. Select Win32 Console Application
  Enter MyApp as the name of the project. Under Application Settings, check Empty Project
To make sure nvcc compiles your .cu file (before CUDA ver 3.2)
  Assume Cuda.rules is somewhere in your system (C:\Program Files\Microsoft Visual Studio 9.0\VC\VCProjectDefaults).
  Right click on MyApp in the Solution Explorer, choose Custom Build Rules, find existing Cuda.rules files to be included, and then check the Cuda.rules file.
To exclude (sometimes) some .cu files from build:
  Right click on it in the Solution Explorer, and select Properties
  Under General, set Excluded From Build to Yes
7. Working with Microsoft VS
To configure the project properties
  Select Project > MyApp Properties from the main menu
  Select All Configurations for Configuration
  Fill in Configuration Properties > Linker > General > Additional Library Directories with $(CUDA_LIB_PATH) (and any other paths, such as “$(NVSDKCOMPUTE_ROOT)/C/common/lib ; ..\lib”)
  Fill in Configuration Properties > Linker > Input > Additional Dependencies with cudart.lib (and other libraries if needed, for example, cutil32D.lib cutil32.lib glut32.lib)
  Fill in Configuration Properties > CUDA Build Rule vX.X.0 > General with “sm_11” (or higher) instead of “sm_10” in GPU Architecture (in case you use atomic operations)
7. CUDA Configuration
You ask and we try to answer
7. CUDA Programming Interfaces
CUDA driver API
  Low-level C/C++ API to load compiled kernels, inspect their parameters, and to launch them
  Kernels are written in C/C++ and compiled to CUDA binary or assembly code
  Requires more code, harder to program and debug
C/C++ for CUDA
  Minimal set of extensions to the C/C++ language
  Kernels defined as C/C++ functions embedded in application source
  Requires a runtime API (built on top of CUDA driver API)
7. CUDA Layers
[Figure: CUDA software layers. Label: cudart.lib]
7. Compiling a CUDA Program
[Figure: compilation flow of a CUDA program]
7. Compilation
Any source file containing CUDA language extensions must be compiled with NVCC
NVCC is a compiler driver
  Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
NVCC outputs
  C/C++ code (host CPU code)
    Must then be compiled with the rest of the application using another tool
  PTX (Parallel Thread Execution)
    Object code directly
    Or, PTX source, interpreted at runtime
8. Hands-on #2: 3rd PP – Voronoi Diagram
[Figure: example Input and Output]
8. Hands-on #2: 3rd PP – Voronoi Diagram
Input: A collection of numPoints colored 2D sites
Output: A map that colors the 2D (digital) space into various regions that “belong” to the sites
Processing:
for each point p in the 2D space (height x width)
    record the distance from p to site[0] as the min distance
    take note of index 0 achieving the min distance
    for each site from 1 to (numPoints-1)
        if (distance from p to this site is shortest so far)
            record this distance as the min distance
            take note of this index for p
        endIf
    endFor
    record the closest site for p
endFor
Remarks:
1. Toggle between CPU & GPU options to check for correctness
2. Toggle between CPU & GPU options to experience the speed
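The processing loop above maps naturally to one thread per point p. The following is a hedged sketch of such a naive kernel, with a CPU pass over the same function for the correctness check suggested in the remarks. It is not the workshop's kernelNaiveVoronoiDiagram: the names `closestSite`, `naiveVoronoi` and `voronoiMismatches` are invented, sites are assumed to be integer pixel coordinates, and squared distances are compared so that CPU and GPU results match exactly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Index of the site nearest to (x, y); usable from both host and device.
__host__ __device__ int closestSite(const int2 *sites, int numPoints,
                                    int x, int y)
{
    int best = 0;                              // start with site[0]
    int dx = x - sites[0].x, dy = y - sites[0].y;
    int bestDist = dx * dx + dy * dy;          // squared distance suffices
    for (int s = 1; s < numPoints; ++s) {
        dx = x - sites[s].x;
        dy = y - sites[s].y;
        int d = dx * dx + dy * dy;
        if (d < bestDist) { bestDist = d; best = s; }
    }
    return best;
}

// One thread per point of the width x height space.
__global__ void naiveVoronoi(const int2 *sites, int numPoints,
                             int *closest, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        closest[y * width + x] = closestSite(sites, numPoints, x, y);
}

// Hypothetical driver: returns how many points differ between CPU and GPU.
int voronoiMismatches(int width, int height, int numPoints)
{
    int2 *h_sites = (int2 *)malloc(numPoints * sizeof(int2));
    srand(1);
    for (int s = 0; s < numPoints; ++s)
        h_sites[s] = make_int2(rand() % width, rand() % height);

    size_t mapBytes = (size_t)width * height * sizeof(int);
    int *h_gpu = (int *)malloc(mapBytes);
    int2 *d_sites;
    int *d_closest;
    cudaMalloc((void **)&d_sites, numPoints * sizeof(int2));
    cudaMalloc((void **)&d_closest, mapBytes);
    cudaMemcpy(d_sites, h_sites, numPoints * sizeof(int2),
               cudaMemcpyHostToDevice);

    dim3 block(16, 16);                        // 16x16 tile per thread block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    naiveVoronoi<<<grid, block>>>(d_sites, numPoints, d_closest,
                                  width, height);
    cudaMemcpy(h_gpu, d_closest, mapBytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (h_gpu[y * width + x] != closestSite(h_sites, numPoints, x, y))
                ++bad;

    cudaFree(d_sites);
    cudaFree(d_closest);
    free(h_sites);
    free(h_gpu);
    return bad;
}

int main()
{
    int bad = voronoiMismatches(256, 192, 64);
    printf("points that differ from the CPU result: %d\n", bad);
    return bad != 0;
}
```

Every thread reads the whole site list from global memory here, which is what the shared-memory and texture variants in Part 2 aim to improve.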
8. Hands-on #2: 3rd PP – Voronoi Diagram
[Figure: call graph. CPU side: glut_keyboard, computeVoronoiDiagram, cpuVoronoiDiagram, cudaVoronoiDiagram. GPU side: kernelNaiveVoronoiDiagram, kernelSharedVoronoiDiagram, kernalTextureVoronoiDiagram]
Convert *result using colorMap, and generate texture from outputTexture
*output is broken into 16x16 tiles, each for a thread block, and there are 2D blocks in a grid
Session (3 of 8)
Day 1, Afternoon, 2:00pm – 3:30pm – Get it good
9. CUDA Hardware Implementation
- Product vs Architecture
- CUDA mode vs Graphics mode
- Execution Model
- Memory Model: Shared memory, Texture…
10. Optimization I
- Thread Cooperation (through [static] Shared Memory)
- Texture Fetch
- CUDA: 4th PP – Max Distance Computation
11. Hands-on #3: 3rd PP – Voronoi Diagram (Part 2)
- Modify to use shared memory, texture
Tea break: 3:30pm – 4:00pm
http://electronicdesign.com/article/embedded/simt-architecture-delivers-double-precision-terafl.aspx
9a. Tesla S1070 1U rack mount
- 4 Tesla cards in each system
- Each 1U connects to 2 PC systems
9a. Tesla C1060
Desktop version of Tesla
This is for computing, and thus needs a separate display card for the PC
240 CUDA Cores, 4GB memory
Newer version: C2050/C2070: 448 CUDA Cores, 3GB/6GB memory
9a. Fermi Architecture
[Figure: Fermi architecture] http://www.hwmania.org/forum/schede-video-thread-ufficiali/9776-nvidia-fermi-gtx-470-gtx-480-a.html
CUDA v2.0, v2.1
512 CUDA Cores
9a. Product vs Architecture
Product                       | Architecture G80/G90         | Architecture GT200           | Architecture Fermi
GeForce (consumer)            | e.g. 9800GTX                 | e.g. GTX280                  | e.g. GTX480
Quadro (professional)         | e.g. Quadro FX3700           | e.g. FX5800                  | e.g. Quadro 6000
Tesla (server/desktop server) |                              | e.g. C1060 & S1070 (4 cards) | e.g. C2050, C2070
CUDA Version                  | 1.0, 1.1, 1.2                | 1.3                          | 2.0, 2.1
Max CUDA Cores                | 128                          | 240                          | 512
Major feature                 | 1.1: atomic (global memory); | 1.3: double precision        | 2.0: L1/L2 cache
                              | 1.2: atomic (shared mem)     |                              |
9a. Specification of Different Architectures
[Figure: specification table] http://www.hwmania.org/forum/schede-video-thread-ufficiali/9776-nvidia-fermi-gtx-470-gtx-480-a.html
9b. G80 Device (In Graphics Mode)
For reference
[Figure: G80 block diagram in graphics mode. Labels: Host, Input Assembler, Setup / Rstr / ZCull, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Thread Processor, eight groups of SP pairs each with TF and L1 units, L2 caches and FB partitions]
9b. G80 Device (In CUDA Mode)
Thread Execution Manager issues threads
128 Scalar Processors grouped into 16 Streaming Multiprocessors (SMs)
Parallel Data Cache (Shared Memory) enables thread cooperation
[Figure: G80 block diagram in CUDA mode. Labels: Host, Input Assembler, Thread Execution Manager, eight processor groups each with a Parallel Data Cache and a Texture unit, Load/store units, Global Memory]
9c. Execution Model
Kernels are launched in grids
  One kernel executes at a time
A block executes on one multiprocessor
  Does not migrate
Several blocks can reside concurrently on one multiprocessor
  Control limitations (of G8X/G9X GPUs):
    At most 8 concurrent blocks per SM
    At most 768 concurrent threads per SM
  Number is further limited by SM resources
    Register file is partitioned among all resident threads
    Shared memory is partitioned among all resident thread blocks
9c. Execution Model (different CUDA ver)
Compute Capability                          | 1.0   | 1.1   | 1.2   | 1.3   | 2.0   | 2.1
SM Version                                  | sm_10 | sm_11 | sm_12 | sm_13 | sm_20 | sm_21
Threads / Warp                              | 32    | 32    | 32    | 32    | 32    | 32
Warps / Multiprocessor                      | 24    | 24    | 32    | 32    | 48    | 48
Threads / Multiprocessor                    | 768   | 768   | 1024  | 1024  | 1536  | 1536
Thread Blocks / Multiprocessor              | 8     | 8     | 8     | 8     | 8     | 8
Max Shared Memory / Multiprocessor (bytes)  | 16384 | 16384 | 16384 | 16384 | 49152 | 49152
Register File Size                          | 8192  | 8192  | 16384 | 16384 | 32768 | 32768
Register Allocation Unit Size               | 256   | 256   | 512   | 512   | 64    | 64
Allocation Granularity                      | block | block | block | block | warp  | warp
Shared Memory Allocation Unit Size          | 512   | 512   | 512   | 512   | 128   | 128
Warp allocation granularity (for registers) | 2     | 2     | 2     | 2     |       |
Max Thread Block Size                       | 512   | 512   | 512   | 512   | 1024  | 1024
Shared Memory Size Configurations (bytes)   | 16384 | 16384 | 16384 | 16384 | 49152 | 49152
[note: default at top of list]              |       |       |       |       | 16384 | 16384
9d. CUDA : Memory Hierarchy
Registers
L1 / Shared memory per block
L2 / Texture memory: cached, read-only
Global memory – Uncached: slow random/uncoalesced access
L1 : Shared memory (in Fermi) is configurable
  Either 16K:48K, or 48K:16K
9d. CUDA : Memory Model
Each thread can
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  RO per-grid constant memory
  RO per-grid texture memory
Host can R/W global, constant and texture memory
9d. CUDA : Memory Model
[Figure: memory model diagram, including the Host]
9d. CUDA Memory Spaces
Global and Shared Memory
  Most important, commonly used
Local, Constant, and Texture Memory for convenience / performance
  Local: automatic array variables allocated there by compiler
  Constant: useful for uniformly-accessed read-only data
    Cached
  Texture: useful for spatially coherent random-access read-only data
    Cached
    Provides filtering, address clamping and wrapping
10a. Thread Cooperation
Threads in a thread block can cooperate
  Sharing data through shared memory
  Synchronizing their execution (using __syncthreads())
Thread cooperation is valuable
  Share results to save computation
  Share memory accesses
    Drastic bandwidth reduction
10a. CUDA : Thread Synchronization Fn
void __syncthreads();
  Synchronizes all threads in a block
  Generates barrier synchronization instruction
  No thread can pass this barrier until all threads in the block reach it
  Used to avoid RAW (read after write) / WAR / WAW hazards when accessing shared memory
  To maintain the sequential order in concurrent environment
10a. Thread Blocks: Scalable Cooperation
Divide monolithic thread array into multiple blocks
  Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  Threads in different blocks “cannot” cooperate
Enables programs to transparently scale to any number of processors
Thread Block 0, Thread Block 1, …, Thread Block N - 1, each with
threadID: 0 1 2 3 4 5 6 7
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
10a. Improving Memory Bandwidth
Global memory has much higher latency (roughly 100x) and lower bandwidth than on-chip shared memory
  Should minimize global memory access
  Can stage data in shared memory before it is operated on
Typical programming pattern using shared memory. For each thread of a block
  (1) Load data from global memory to shared memory
  (2) Synchronize so that it is safe to read shared memory locations written by other threads
  (3) Process data in shared memory
  (4) Synchronize again if necessary to make sure shared memory has been updated with results
  (5) Write results to global memory
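The pattern can be illustrated with an adjacent-difference kernel, where each output needs its own input and its left neighbour's. This is a minimal sketch under assumptions rather than workshop code: the names `adjacentDiff` and `adjDiffMismatches` and the block size of 256 are chosen for the example, and step (4) is not needed because shared memory is only read after the barrier.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256

// out[i] = in[i] - in[i-1] (with out[0] = in[0]), staged via shared memory.
__global__ void adjacentDiff(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) tile[threadIdx.x] = in[i];   // (1) global -> shared
    __syncthreads();                        // (2) all loads are now visible

    if (i < n) {
        int prev = 0;
        if (threadIdx.x > 0)
            prev = tile[threadIdx.x - 1];   // (3) neighbour from shared
        else if (i > 0)
            prev = in[i - 1];               // block boundary: global read
        out[i] = tile[threadIdx.x] - prev;  // (5) result -> global
    }
}

// Hypothetical driver: returns the number of wrong outputs.
int adjDiffMismatches(int n)
{
    size_t nbytes = n * sizeof(int);
    int *h_in = (int *)malloc(nbytes);
    int *h_out = (int *)malloc(nbytes);
    for (int i = 0; i < n; ++i) h_in[i] = i * i;

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, nbytes);
    cudaMalloc((void **)&d_out, nbytes);
    cudaMemcpy(d_in, h_in, nbytes, cudaMemcpyHostToDevice);

    adjacentDiff<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expected = (i == 0) ? h_in[0] : h_in[i] - h_in[i - 1];
        if (h_out[i] != expected) ++bad;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return bad;
}

int main()
{
    int bad = adjDiffMismatches(1000);
    printf("wrong outputs: %d\n", bad);
    return bad != 0;
}
```

Note that `__syncthreads()` sits outside the `if (i < n)` tests, so every thread of the block reaches the barrier even in the last, partially filled block.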
10a. Using Shared Memory
Static shared memory
  __shared__ int2 tile[TILE_SIZE];
Dynamic shared memory
  Pass the amount through kernel invocation
    Kernelname<<<gridsize, blocksize, sMemByte, …>>>
  Inside the kernel, an unsized extern array is already set to the start of that memory
    extern __shared__ int2 smem_ptr[]; // already initialized
    float *smem_pt2 = (float *) (smem_ptr + 100); // offset
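A small sketch of dynamic shared memory in use: each block reverses its own segment of an array, with the buffer size supplied as the third launch parameter. The kernel name `reverseInBlock` and the driver `reverseMismatches` are assumptions for this example; the declaration uses the unsized `extern __shared__` array form that CUDA requires for dynamically sized shared memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Reverses the elements handled by each block, in place.
__global__ void reverseInBlock(int *data, int n)
{
    extern __shared__ int buf[];            // sized at launch time
    int base = blockIdx.x * blockDim.x;
    int i = base + threadIdx.x;
    int count = min((int)blockDim.x, n - base);  // elements in this block

    if (i < n) buf[threadIdx.x] = data[i];
    __syncthreads();                        // whole segment is now staged
    if (i < n) data[i] = buf[count - 1 - threadIdx.x];
}

// Hypothetical driver: returns the number of wrong elements.
int reverseMismatches(int n, int threads)
{
    size_t nbytes = n * sizeof(int);
    int *h_in = (int *)malloc(nbytes);
    int *h_out = (int *)malloc(nbytes);
    for (int i = 0; i < n; ++i) h_in[i] = 3 * i + 1;

    int *d_data;
    cudaMalloc((void **)&d_data, nbytes);
    cudaMemcpy(d_data, h_in, nbytes, cudaMemcpyHostToDevice);

    int blocks = (n + threads - 1) / threads;
    size_t smemBytes = threads * sizeof(int);   // dynamic shared memory size
    reverseInBlock<<<blocks, threads, smemBytes>>>(d_data, n);
    cudaMemcpy(h_out, d_data, nbytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / threads) * threads;
        int count = (n - base < threads) ? n - base : threads;
        if (h_out[i] != h_in[base + count - 1 - (i - base)]) ++bad;
    }
    cudaFree(d_data);
    free(h_in);
    free(h_out);
    return bad;
}

int main()
{
    int bad = reverseMismatches(1000, 256);
    printf("wrong elements: %d\n", bad);
    return bad != 0;
}
```

Because the size is a launch parameter, the same kernel works unchanged for other block sizes, for example `reverseMismatches(1000, 128)`.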
35
10a. Thread Cooperation from Diff Blocks ?10a. Thread Cooperation from Diff Blocks ?Threads in different thread blocks “can” cooperate but cannot synchronize
Cooperate using atomic operations (using atomic*())Cooperate using atomic operations (using atomic ())An atomic function performs a read-modify-write atomic operation, without interference from other threads
Cannot synchronize ALL threads within kernelMay cause deadlockNeed a lot of hardware resources to store the states of all
69
Need a lot of hardware resources to store the states of all unfinished threads
All threads of a kernel are synchronized at the end kernel execution or between kernel launches
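As a sketch of cooperation across blocks, the kernel below lets every thread of every block fold its element into one global maximum with `atomicMax()`. It assumes a device of compute capability 1.1 or higher (global-memory atomics); the names `globalMax` and `maxMismatch` are illustrative and this is not the workshop's Max Distance program.

```cuda
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// All threads, from all blocks, update a single result in global memory.
__global__ void globalMax(const int *data, int n, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(result, data[i]);   // read-modify-write, no interference
}

// Hypothetical driver: returns 0 if the GPU maximum equals the CPU maximum.
int maxMismatch(int n)
{
    size_t nbytes = n * sizeof(int);
    int *h_data = (int *)malloc(nbytes);
    int cpuMax = INT_MIN;
    for (int i = 0; i < n; ++i) {
        h_data[i] = (i * 7919) % 10007;
        if (h_data[i] > cpuMax) cpuMax = h_data[i];
    }

    int *d_data, *d_result;
    int init = INT_MIN;
    cudaMalloc((void **)&d_data, nbytes);
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemcpy(d_data, h_data, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &init, sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    globalMax<<<(n + threads - 1) / threads, threads>>>(d_data, n, d_result);

    // The blocking copy returns only after the kernel has finished, which
    // is the point at which all blocks are known to be done.
    int gpuMax = 0;
    cudaMemcpy(&gpuMax, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    cudaFree(d_result);
    free(h_data);
    return gpuMax == cpuMax ? 0 : 1;
}

int main()
{
    int bad = maxMismatch(1 << 16);
    printf("%s\n", bad ? "maximum differs" : "maximum matches");
    return bad;
}
```

One atomic per element serializes heavily on a single address; a common refinement is to reduce within each block in shared memory first and issue one atomic per block.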
10b. Texture Memory
CUDA supports the use (reading) of texture memory (for graphics purposes)
  Using texture memory enjoys caching benefit. Note that in Fermi, this becomes less significant as global memory also allows caching
A texture can be any region of linear memory or a CUDA array
Texture can be 1D, 2D, 3D and composed of elements, each of which has 1, 2, or 4 components that may be signed or unsigned 8-, 16- or 32-bit integers, 16-bit floats, or 32-bit floats.
10b. CUDA Texture Types
Bound to linear memory
- Global memory address is bound to a texture
- Only 1D
- Integer addressing
- No filtering, no addressing modes (clamping, repeat)
Bound to CUDA arrays
- Block linear CUDA array is bound to a texture (in texture memory space)
- 1D, 2D, or 3D
- Float addressing (size-based or normalized)
- Filtering
- Addressing modes (clamping, repeat)
Bound to pitch linear
- Global memory address is bound to a texture
- 2D
- Float/integer addressing, filtering, and clamp/repeat addressing modes similar to CUDA arrays
10b. Memory Optimization Using Textures
Texture fetches vs. global memory reads
- Texture fetches are cached
- Optimized for 2D spatial locality
- Useful when threads of a warp do random reads from locations that are close together in 2D
Textures can be used to avoid uncoalesced loads from global memory
Also useful when it is awkward to organize data in shared memory
10b. Using Texture Memory
Host (CPU) code
- Allocate/obtain memory (global linear/pitch linear, or CUDA array)
- Create a texture reference object
  - Currently must be at file scope
- Bind the texture reference to memory/array
- When done, unbind the texture reference, free resources
Device (kernel) code
- Fetch using texture reference
- Linear memory textures: tex1Dfetch()
- Array textures: tex1D() or tex2D() or tex3D()
- Pitch linear textures: tex2D()
10b. Using Texture Memory
Declare a global variable at the beginning of the .cu file
    texture<type> globalTexRef;
Before invoking the kernel, bind allocatedData (cudaMalloc memory) to globalTexRef
    cudaBindTexture(0, globalTexRef, allocatedData);
Inside the kernel, read data as:
    Point = tex1Dfetch(globalTexRef, locationNeeded);
On return from the kernel
    cudaUnbindTexture(globalTexRef);
10b. Example
Unoptimized data shifts
    __global__ void shiftCopy(float *odata, float *idata, int shift)
    {
        int xid = blockIdx.x * blockDim.x + threadIdx.x;
        odata[xid] = idata[xid + shift];
    }
Optimized using texture fetches
    __global__ void textureShiftCopy(float *odata, float *idata, int shift)
    {
        int xid = blockIdx.x * blockDim.x + threadIdx.x;
        odata[xid] = tex1Dfetch(texRef, xid + shift);   // texRef: file-scope texture bound to idata
    }
The gain is significant on devices before compute capability 1.2
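The texture reference calls used in these slides (`texture<type>`, `cudaBindTexture`, `cudaUnbindTexture`) were removed from the toolkit in CUDA 12. A rough equivalent of the shift copy with the texture object API might look as follows; this is a sketch for newer toolkits and devices of compute capability 3.0 or higher, not the workshop's code, and `shiftCopyTexObj` and `shiftCopyWithTexture` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Shift copy reading through a texture object passed as a kernel argument.
__global__ void shiftCopyTexObj(float *odata, cudaTextureObject_t tex, int n, int shift)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    if (xid < n)
        odata[xid] = tex1Dfetch<float>(tex, xid + shift);
}

// Host helper (hypothetical): d_in holds n + shift floats in linear device memory.
void shiftCopyWithTexture(float *d_out, float *d_in, int n, int shift)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = (size_t)(n + shift) * sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);   // takes the place of binding
    int threads = 128, blocks = (n + threads - 1) / threads;
    shiftCopyTexObj<<<blocks, threads>>>(d_out, tex, n, shift);
    cudaDeviceSynchronize();
    cudaDestroyTextureObject(tex);                             // takes the place of unbinding
}
```

Unlike a texture reference, the object is an ordinary value, so it does not have to live at file scope.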
10b. Performance Comparison
[Figure: performance comparison, panels labeled Ver 1.0 and Ver 1.3]
10c. 4th PP – Distance Computation (1 of 4)
CPU version (inside maxDist.cu)
    static void CPU_MaxDist( int2 *pointArray, int numPoints, float *maxDistArray )
    {
        for ( int i = 0; i < numPoints; i++ )
        {
            float maxSqrDist = 0.0f;
            const int2 *pi = &pointArray[i];
            for ( int j = 0; j < numPoints; j++ )
            {
                const int2 *pj = &pointArray[j];
                float sqrDist = sqrf( pj->x - pi->x ) + sqrf( pj->y - pi->y );
                if ( sqrDist > maxSqrDist ) maxSqrDist = sqrDist;
            }
            maxDistArray[i] = sqrt( maxSqrDist );
        }
    }
10c. 4th PP – Distance Computation (2 of 4)
GPU version without shared memory (inside maxDist.cu)
    __global__ void GPU_MaxDist1( int2 *pointArray, int numPoints, float *maxDistArray )
    {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        float max = 0.0, temp = 0.0;
        int2 point = pointArray[tx];
        for ( int i = 0; i < numPoints; i++ )
        {
            int2 point1 = pointArray[i];
            temp = sqrf( point1.x - point.x ) + sqrf( point1.y - point.y );
            if ( temp > max )
                max = temp;
        }
        maxDistArray[tx] = sqrt( max );
    }
10c. 4th PP – Distance Computation (3 of 4)
GPU version with shared memory (inside maxDist.cu)
    __global__ void GPU_MaxDist2( int2 *pointArray, int numPoints, float *maxDistArray )
    {
        __shared__ int2 tile[ TILE_SIZE ];
        int2 point = pointArray[blockIdx.x * blockDim.x + threadIdx.x];
        float temp = 0.0, max = 0.0;
        for ( int a = 0; a < numPoints; a += TILE_SIZE )
        {
            tile[threadIdx.x] = pointArray[a + threadIdx.x];   // each thread stages one point
            __syncthreads();
            for ( int k = 0; k < TILE_SIZE; k++ )
            {
                int2 point1 = tile[k];
                temp = sqrf( point1.x - point.x ) + sqrf( point1.y - point.y );
                if ( a + k < numPoints && temp > max ) max = temp;
            }
            __syncthreads();   // the tile must not be overwritten while others still read it
        }
        maxDistArray[blockIdx.x * blockDim.x + threadIdx.x] = sqrt( max );
    }
10c. 4th PP – Distance Computation (4 of 4)
GPU version with texture memory (inside maxDist.cu)
    texture<int2> texPoints;
    // binding needed before coming into the kernel
    __global__ void GPU_MaxDist3( int numPoints, float *maxDistArray )
    {
        int tx = blockIdx.x * blockDim.x + threadIdx.x;
        int2 myPoint, point;
        float temp = 0.0, max = 0.0;
        myPoint = tex1Dfetch( texPoints, tx );
        for ( int i = 0; i < numPoints; i++ )
        {
            point = tex1Dfetch( texPoints, i );
            temp = sqrf( myPoint.x - point.x ) + sqrf( myPoint.y - point.y );
            if ( temp > max ) max = temp;
        }
        maxDistArray[blockIdx.x * blockDim.x + threadIdx.x] = sqrt( max );
    }
11. Hands-on #3: 3rd PP – VD (Part 2)
Do the Voronoi Diagram exercise with shared memory (& texture if time permits)
• Assume numPoints is divisible by TILE size
• One shared memory solution with 16x16 tile size
• The image is divided into 16x16 thread blocks
• Threads in a block are to cooperate in reading in sites info in batches of 16x16 items for processing
• Specifically, within a thread block
  • We have point(tx, ty) with reference to the image
  • We have tid with reference to the position in a tile
  • Go thru all the sites in 16x16 batches as follows:
    • thread with tid reads site i+tid into shared memory location tid, where i is in increments of 16x16
    • __syncthreads()
    • find closest site among these 16x16 sites for thread tid
  • Record found site to result[ty*width + tx]
Toggle between modes to check for correctness & efficiency
11. Hands-on #3: 3rd PP – VD (Part 2)
Do the Voronoi Diagram exercise with shared memory, & texture
[Figure: a thread block in the image (example thread at tx = 1, ty = 2, with position tid in the tile) reading the sites array in batches: batch 0, batch 1, ...]
Session (4 of 8)
Day 1, Afternoon, 4:00pm – 5:30pm – Get it better
12. Hands-on #4: 5th PP – Parallel Prefix Operation
13. Optimization II
a) Avoiding divergence
b) Avoiding idle threads
c) "Algorithmic" optimization (& no bank conflict)
d) Avoiding idle computation
e) "auto" synchronization within a warp (without __syncthreads)
14. CUDA Occupancy Calculator
- No. of threads per block, amount of shared memory, amount of registers
Home-sweet-home
12. Parallel Prefix Operation
The operation takes a binary associative operator ⊕ and an array of n elements
    [a0, a1, ..., an–1]
and returns the array
    [a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an–1)]
Example:
    Input  = [3 1 7 0 4 1 6 3]
    Output = [3 4 11 11 15 16 22 25]
Good example of a computation that seems inherently sequential, but in fact has a (work-)efficient parallel algorithm
12. Naive Parallel Scan
[Figure: naive parallel scan]
12. Naive Parallel Scan
An "in-place" naive parallel scan (incorrect: within a round, x[k − 2^d] may already have been overwritten by another thread before it is read)
    for d := 0 to log2(n) – 1 do
        forall k in parallel do
            if k ≥ 2^d then x[k] := x[k − 2^d] + x[k]
A double-buffered version (correct)
    for d := 0 to log2(n) – 1 do
        forall k in parallel do
            if k ≥ 2^d then
                x[out][k] := x[in][k − 2^d] + x[in][k]
            else
                x[out][k] := x[in][k]
        swap(in, out)
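The double-buffered pseudocode maps to CUDA roughly as below, for one block and at most `TILE_SIZE` items. This is a sketch under those assumptions rather than the workshop's solution; `scanTile`, `scanOnGpu` and the value of `TILE_SIZE` are hypothetical.

```cuda
#include <cuda_runtime.h>

#define TILE_SIZE 256   // hypothetical: one block scans at most TILE_SIZE items

// Inclusive scan of n <= TILE_SIZE ints by a single block, double buffered.
__global__ void scanTile(const int *in, int *out, int n)
{
    __shared__ int x[2][TILE_SIZE];
    int k = threadIdx.x;
    int bufIn = 0, bufOut = 1;

    x[bufIn][k] = (k < n) ? in[k] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {      // offset plays the role of 2^d
        if (k >= offset)
            x[bufOut][k] = x[bufIn][k - offset] + x[bufIn][k];
        else
            x[bufOut][k] = x[bufIn][k];
        __syncthreads();                                 // this round is complete
        int t = bufIn; bufIn = bufOut; bufOut = t;       // swap(in, out)
    }
    if (k < n) out[k] = x[bufIn][k];
}

// Host helper (hypothetical): 1 <= n <= TILE_SIZE.
void scanOnGpu(const int *h_in, int *h_out, int n)
{
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    scanTile<<<1, TILE_SIZE>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

One barrier per round is enough because each round only reads the buffer written in the previous round and only writes the other one.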
12. Algorithm Complexity
Step complexity, S(n) = O(log n)
Work complexity, W(n) = O(n log n)
- Number of addition operations = $\sum_{d=1}^{\log_2 n} (n - 2^{d-1}) = O(n \log_2 n)$, summing over rounds d = 1, ..., log2 n, where round d uses offset 2^(d−1)
- Sequential scan performs O(n) additions
- Naive parallel scan is not work-efficient
Work-efficient parallel scan algorithms exist
12. Hands-on #4: 5th PP – Parallel Prefix
Input: X[0], X[1], X[2], …, X[n–1]
Output: Y[0], Y[1], Y[2], …, Y[n–1] s.t. Y[i] = X[0] + X[1] + … + X[i]
E.g. Input: 5, 3, 2, 6, 7, 1, 1, 2
     Output: 5, 8, 10, 16, 23, 24, 25, 27
CPU Program
    Y[ 0 ] = X[ 0 ];
    for (int i = 1; i < n; i++)
    {
        Y[ i ] = Y[ i - 1 ] + X[ i ];
    }
12. Hands-on #4: 5th PP – Parallel Prefix
Write a (naïve) GPU version
"2" possibilities ??
- Multiple rounds: call kernel with a different step length in each round
- One round: call kernel to do different rounds in one go, use synchronization (??)
Questions: How big can the array be? What else can improve?
If time permits, write a GPU version (2 kernels) using shared memory
- Handle input as tiles of 256 items
- Declare __shared__ int tile[2][TILE_SIZE] for double buffering
- Cooperate to read in a tile of items
- Do parallel prefix for the tile with multiple rounds inside the kernel
- Write results to global memory, and also write (for thread 0) the result of the last item to another array in global memory
- Do (recursively) the next higher level of 256 items (which is tile level)
- Add results of each tile back to the lower level
12. Parallel Scan of Large Arrays
[Figure: parallel scan of large arrays]
13. Optimization II: Parallel Reduction
Reduce
    out = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[n–1]
    where ⊕ is associative
Parallel Prefix / Scan
    out[i] = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[i]
    where ⊕ is associative
    E.g. all-prefix-sums
13. Optimization II: Parallel Reduction
Common and important data parallel primitive
- Easy to implement in CUDA
- Naïve version uses global memory as in the parallel prefix
- Hard to get it fast
Serves as a great optimization example
- We'll walk step-by-step through 5 different versions
- Demonstrates several important optimization strategies
Based on Mark Harris' "Optimizing Parallel Reduction in CUDA" tutorial
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
13. Optimization II: Parallel Reduction
Tree-based approach used within each thread block using shared memory
Need to be able to use multiple thread blocks
- To process very large arrays
- To keep all multiprocessors on the GPU busy
- Each thread block reduces a portion of the array
But how do we communicate partial results between thread blocks?
13. Solution: Kernel Decomposition
Solution: decompose into multiple kernels
- Kernel launch serves as a global synchronization point
- Kernel launch has negligible HW overhead, low SW overhead
In the case of reductions, code for all levels is the same
- Recursive kernel invocation
Therefore, we focus on reduction within a single thread block
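A minimal sketch of kernel decomposition: the host relaunches the same per-block reduction on the partial results until one value remains. The names `reduceBlock`, `reduceOnGpu` and the block size `BLOCK` are hypothetical, and the kernel uses the simple sequential-addressing loop rather than a tuned version.

```cuda
#include <cuda_runtime.h>

#define BLOCK 128   // hypothetical block size (a power of two)

// Each block sums up to BLOCK elements of g_idata into one partial result.
__global__ void reduceBlock(const int *g_idata, int *g_odata, unsigned int n)
{
    __shared__ int sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0;     // pad the last block with zeros
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Host helper (hypothetical): sums n >= 1 ints with repeated kernel launches.
int reduceOnGpu(const int *h_in, unsigned int n)
{
    unsigned int firstBlocks = (n + BLOCK - 1) / BLOCK;
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, firstBlocks * sizeof(int));
    cudaMemcpy(d_a, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    int *src = d_a, *dst = d_b;
    unsigned int remaining = n;
    while (remaining > 1) {
        unsigned int blocks = (remaining + BLOCK - 1) / BLOCK;
        // Launches on the same stream run in order, so each launch sees the
        // complete output of the previous level.
        reduceBlock<<<blocks, BLOCK>>>(src, dst, remaining);
        remaining = blocks;
        int *t = src; src = dst; dst = t;
    }
    int result = 0;
    cudaMemcpy(&result, src, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return result;
}
```

No explicit synchronization call appears between levels: the boundary between two launches is the global synchronization point.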
13. Optimization Goal and Metric
We should strive to reach GPU peak performance
Choose the right metric
- GFLOP/s: for compute-bound kernels
- Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
- 1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth
Will use G80 GPU for this example
- 384-bit memory interface, 900 MHz DDR
- 384 * 1800 / 8 = 86.4 GB/s
13a. Reduction #1: Interleaved Addressing
    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            if (tid % (2*s) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
Solution to hands-on #4 with shared memory
13a. Reduction #1: Interleaved Addressing
[Figure: reduction #1, interleaved addressing]
13a. Reduction #1: Interleaved Addressing
In the reduction loop of reduce0:
    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        if (tid % (2*s) == 0)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
Problem: highly divergent warps are very inefficient, and the % operator is very slow
13a. Performance for 4M Element Reduction
[Table: results]
Note: Block Size = 128 threads for all tests
Divergent Threads (in the same warp)
- Possibility #1: Two threads doing different things
- Possibility #2: Some threads do the same thing, and the others do nothing … our example (not very serious)
13b. Reduction #2: Interleaved Addressing
    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=1; s < blockDim.x; s *= 2) {
            int index = 2 * s * tid;
            if (index < blockDim.x)
                sdata[index] += sdata[index + s];
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
13b. Reduction #2: Interleaved Addressing
[Figure: reduction #2, interleaved addressing]
New Problem: Shared Memory Bank Conflicts (Session #5)
* The difficulty of indexing / addressing of elements appears here.
13b. Reduction #2: Interleaved Addressing
s   tid   index   effect
1   0     0       sdata[0] += sdata[1]
    1     2       sdata[2] += sdata[3]
    2     4       sdata[4] += sdata[5]
    3     6       sdata[6] += sdata[7]
    …     …       …
2   0     0       sdata[0] += sdata[2]
    1     4       sdata[4] += sdata[6]
    2     8       sdata[8] += sdata[10]
    3     12      sdata[12] += sdata[14]
    …     …       …
4   0     0       sdata[0] += sdata[4]
    1     8       sdata[8] += sdata[12]
    2     16      sdata[16] += sdata[20]
    3     24      sdata[24] += sdata[28]
    …     …       …
8   0     0       sdata[0] += sdata[8]
    1     16      sdata[16] += sdata[24]
    2     32      sdata[32] += sdata[40]
    3     48      sdata[48] += sdata[56]
    …     …       …
13b. Performance for 4M Element Reduction
[Table: results]
13c. Reduction #3: Sequential Addressing
[Figure: reduction #3, sequential addressing]
Sequential addressing is shared memory bank conflict free
13c. Reduction #3: Sequential Addressing
    __global__ void reduce0(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];

        // each thread loads one element from global to shared mem
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();

        // do reduction in shared mem
        for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // write result for this block to global mem
        if (tid == 0) g_odata[blockIdx.x] = sdata[0];
    }
13c. Performance for 4M Element Reduction
[Table: results]
13d. Problem: Idle Threads
    for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
Problem: half of the threads are idle on the first loop iteration
This is very wasteful
13d. Reduction #4: First Add During Load
Halve the number of blocks, and replace the single load
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();
with two loads and the first add of the reduction
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();
This in effect handles 2 blocks of global memory with 1 block of threads, so the total number of thread blocks is half the number of global memory blocks.
Each thread reads the corresponding elements in the 2 blocks of global memory and adds them.
13d. Performance for 4M Element Reduction
[Table: results]
13e. Instruction Bottleneck
At 17 GB/s, we're far from bandwidth bound
- And we know reduction has low arithmetic intensity
Therefore a likely bottleneck is instruction overhead
- Ancillary instructions that are not loads, stores, or arithmetic for the core computation
- In other words: address arithmetic and loop overhead
Strategy: unroll loops
13e. Reduction #5: Unroll The Last Warp
As reduction proceeds, the number of "active" threads decreases
- When s <= 32, we have only one warp left
Instructions are SIMD synchronous within a warp
That means when s <= 32
- We don't need to __syncthreads()
- We don't need "if (tid < s)" because it doesn't save any work, and it does not affect the correctness
Let's unroll the last 6 iterations of the inner loop
13e. Reduction #5: Unroll The Last Warp
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
This saves useless work in all warps, not just the first one!
Without unrolling, all warps execute every iteration of the for loop and if statement
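The unrolled tail relies on the threads of a warp running in lockstep. Newer GPUs (Volta onwards) schedule threads of a warp independently, so the sketch below adds `__syncwarp()` (available from CUDA 9) between each read and write. It is an assumed adaptation for newer hardware, not the 2010 workshop code; `warpReduce`, `reduceUnrolled`, `sumOnGpu` and `BLOCK` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128   // hypothetical block size: a power of two, at least 64

// Last six steps of the reduction, executed by the 32 threads of warp 0.
__device__ void warpReduce(int *sdata, unsigned int tid)
{
    int v = sdata[tid];
#pragma unroll
    for (unsigned int s = 32; s > 0; s >>= 1) {
        v += sdata[tid + s];    // every thread reads ...
        __syncwarp();           // ... before any thread overwrites
        sdata[tid] = v;
        __syncwarp();
    }
}

// Each block sums 2 * BLOCK consecutive elements (first add during load).
__global__ void reduceUnrolled(const int *g_idata, int *g_odata)
{
    __shared__ int sdata[BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid < 32) warpReduce(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Host helper (hypothetical): n must be a multiple of 2 * BLOCK.
int sumOnGpu(const int *h_in, int n)
{
    int blocks = n / (2 * BLOCK);
    int *d_in = nullptr, *d_part = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_part, blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    reduceUnrolled<<<blocks, BLOCK>>>(d_in, d_part);
    std::vector<int> part(blocks);
    cudaMemcpy(part.data(), d_part, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_part);
    int total = 0;
    for (int b = 0; b < blocks; ++b) total += part[b];   // final pass on the host
    return total;
}
```

Threads with `tid >= s` still execute each step and write values that are never used for `sdata[0]`, which mirrors the remark that dropping "if (tid < s)" does not affect correctness.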
13e. Performance for 4M Element Reduction
[Table: results]
13. Further Optimizations
Further optimizations can be applied
- E.g. complete loop unrolling, algorithm cascading
13. Parallel Reduction Algorithmic Complexity
Step complexity, S(n) = O(log n)
- Step complexity is the number of steps required to complete the algorithm given an infinite number of threads/processors
Work complexity, W(n) = O(n)
- Work complexity is the total number of operations the algorithm performs
- Our parallel reduction algorithm is work-efficient (i.e. same as optimal sequential algorithm)
Time complexity, T(n)
- p is the number of available processors
- When p → ∞, T(n) → S(n)
- When p = 1, T(n) = W(n)
14. CUDA Occupancy Calculator
Limits on:
- # of threads per block
- Amount of shared memory
- # of registers per thread
Occupancy: the ratio of active warps to the maximum number of warps supported on a multiprocessor (SM)
Aim for 100% occupancy?
- Actually, should aim for all CUDA cores to be busy
Demonstration on GPU Occupancy Calculator
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
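On newer toolkits (CUDA 6.5 onwards) the same ratio can also be queried at run time instead of from the spreadsheet. The sketch below assumes device 0 and uses a hypothetical kernel `dummyKernel` and helper `occupancyOf`; it illustrates the definition of occupancy given above and is not part of the workshop material.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel whose resource usage is being queried.
__global__ void dummyKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

// Returns active warps / maximum warps per multiprocessor for a launch shape.
float occupancyOf(int blockSize, size_t dynamicSmemBytes)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // assumption: device 0

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, dynamicSmemBytes);

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int activeWarps = blocksPerSM * warpsPerBlock;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return (float)activeWarps / (float)maxWarps;
}
```

Calling `occupancyOf(128, 0)` and `occupancyOf(512, 0)` for the same kernel shows how block size alone can change the ratio; registers and shared memory enter through the kernel itself and the second argument.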
9c. Execution Model (different CUDA ver)
Compute Capability                            1.0    1.1    1.2    1.3    2.0    2.1
SM Version                                    sm_10  sm_11  sm_12  sm_13  sm_20  sm_21
Threads / Warp                                32     32     32     32     32     32
Warps / Multiprocessor                        24     24     32     32     48     48
Threads / Multiprocessor                      768    768    1024   1024   1536   1536
Thread Blocks / Multiprocessor                8      8      8      8      8      8
Max Shared Memory / Multiprocessor (bytes)    16384  16384  16384  16384  49152  49152
Register File Size                            8192   8192   16384  16384  32768  32768
Register Allocation Unit Size                 256    256    512    512    64     64
Allocation Granularity                        block  block  block  block  warp   warp
Shared Memory Allocation Unit Size            512    512    512    512    128    128
Warp allocation granularity (for registers)   2      2      2      2
Max Thread Block Size                         512    512    512    512    1024   1024
Shared Memory Size Configurations (bytes)     16384  16384  16384  16384  49152  49152
[note: default at top of list]                                            16384  16384
14. CUDA Occupancy Calculator
[Figure: CUDA Occupancy Calculator]
14. CUDA Occupancy Calculator
The compiler option for the maximum number of registers is:
    -maxrregcount N
When there are not enough registers, variables go into "local" memory, which is actually global memory
Leave it to the compiler?
Session (5 of 8)
Day 2, Morning, 9:00am – 10:30am – More about Memory
1. Review of CUDA Hardware Implementation
[ Revisit #13 in Session #4 ]
[ Revisit #14 in Session #4 ]
2. Overall Optimization
3. Optimization III
- Coalescing Memory Access (Global memory)
- Avoiding (shared) memory bank conflict
4. CUDA: 6th PP – Matrix Transpose
Break: 10:30am – 11:00am
1. Streaming Multiprocessors
CUDA architecture consists of a scalable array of multi-threaded Streaming Multiprocessors (SM)
All threads of each thread block are executed concurrently on one multiprocessor
- But each multiprocessor can execute multiple blocks concurrently
Each multiprocessor has (in G80)
- 1 multi-threaded instruction unit
- 8 Scalar Processor (SP) cores
- 2 function units for transcendental functions (e.g. log, sin)
- On-chip memory: registers, shared memory, constant cache, texture cache
1. Streaming Multiprocessors
[Figure: streaming multiprocessors]
1. Thread Execution
SIMT (single-instruction, multiple-thread) execution model
Multiprocessor creates, manages, schedules and executes threads in warps
- A SIMT warp is a group of 32 parallel threads
When executing a (common) instruction in a warp, the multiprocessor maps threads to the (8) scalar processor cores in turn
- One thread to one scalar processor core
- A warp can be processed in 4 clock cycles
A block is always split into warps the same way
- Each warp contains threads of consecutive, increasing thread IDs (0–31, 32–63, …)
1. Flow Control and Branching
Individual threads in a warp are free to branch independently
- Can have diverging execution paths
A warp executes one common instruction at a time, so full efficiency is achieved when all threads in a warp agree on the same execution path
If threads in a warp diverge, the warp serially executes each branch path taken, disabling threads that are not on that path
1. Active Threads and Blocks
Numbers of threads and blocks that can be active at the same time in a multiprocessor are limited by the size of the register file and shared memory
- Each active thread is allocated some registers for the entire lifetime of the thread
- Each active block is allocated some shared memory for the entire lifetime of the block
1. Some Spec for Compute Capability 1.1
Maximum threads per block = 512
Maximum block size = 512 x 512 x 64
Maximum grid size = 65535 x 65535
Warp size = 32 threads
Registers per multiprocessor = 8192
Shared memory per multiprocessor = 16 KB (in 16 banks)
Constant memory = 64 KB
Local memory per thread = 16 KB
Constant cache = 8 KB per multiprocessor
Texture cache = 6 KB to 8 KB per multiprocessor
Maximum active blocks per multiprocessor = 8
Maximum active warps per multiprocessor = 24
Maximum active threads per multiprocessor = 768
Limit on kernel size = 2 million PTX instructions
Support atomic functions on 32-bit words in global memory (1.1)
2. Overall Optimization Strategies
(1) Maximizing parallel execution
- Restructure algorithm to achieve as much data parallelism as possible
- Map to hardware by carefully choosing the execution configuration of each kernel invocation
(2) Optimizing memory usage to achieve maximum memory bandwidth
- Different memory spaces and access patterns have vastly different performance
- We have a lot to say about this
(3) Optimizing instruction usage to achieve maximum instruction throughput
- Use high throughput arithmetic instructions
- Avoid different execution paths within the same warp
2. Execution Configuration
Number of thread blocks should be larger than the number of multiprocessors
- So that all multiprocessors have at least one block to execute
- Should have multiple blocks per multiprocessor so that it does not idle when a block synchronizes or accesses memory
Threads per block should be a multiple of warp size
Should have at least 192 (= 6 x 32) active threads per multiprocessor to hide register read-after-write latency
Block size is also limited by the amount of registers and shared memory used
3. Memory Optimizations
Distinguish between data intensive vs compute intensive
- Data intensive: consider memory access carefully
- Compute intensive: not so much, as there is always enough to work on
Possibilities:
(1) Minimize data transfer between host and device
(2) Global memory usage: ensure accesses are coalesced whenever possible (coming up)
(3) Minimize global memory accesses by using shared memory, constant memory, etc.
(4) Shared memory usage: minimize bank conflicts in accesses (coming up)
3. Data Transfer Between Host and Device
Peak bandwidth between device memory and GPU is much higher than that between host memory and device memory
Should minimize data transfer between host and device memory
- Even if that means running kernels on GPU that do not have any speed-up over CPU
Use page-locked or pinned memory transfer
Overlap asynchronous transfers with CPU computations
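A minimal sketch of a pinned-memory transfer issued asynchronously on a stream, so the CPU is free while the copies and the kernel run. The kernel `addOne` and the helper `roundTripPinned` are hypothetical names used only for this illustration.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Host helper (hypothetical): fills a pinned buffer with 0..n-1, adds one on
// the GPU and returns the last element after the round trip.
float roundTripPinned(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaMallocHost(&h_buf, bytes);             // page-locked (pinned) host memory
    cudaMalloc(&d_buf, bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int threads = 128, blocks = (n + threads - 1) / threads;
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    addOne<<<blocks, threads, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

    // Independent CPU work could run here; it must not touch h_buf yet.

    cudaStreamSynchronize(stream);             // transfers and kernel are done
    float last = h_buf[n - 1];

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return last;
}
```

The asynchronous copies only overlap with host work when the host buffer is page-locked; with ordinary pageable memory the calls fall back to staged, largely synchronous behaviour.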
3a. Coalesced Access to Global Memory
Simultaneous accesses to global memory by threads in a half-warp (16 threads) can be coalesced into as few as a single memory transaction of 32, 64 or 128 bytes
Requirements for coalescing are based on compute capability
(1) Compute capability 1.0 and 1.1
(2) Compute capability 1.2 and higher
If a half-warp fulfills the requirements, coalescing occurs even if the half-warp is divergent and some threads do not actually access memory
3a. Coalescing: Compute Capability 1.0 & 1.1
Global memory accesses by all threads in a half-warp are coalesced if threads access
(1) 4-byte words, resulting in one 64-byte memory transaction
(2) 8-byte words, resulting in one 128-byte memory transaction
(3) 16-byte words, resulting in two 128-byte memory transactions
All 16 words must lie in the same segment of size equal to the memory transaction size (i.e. 64 bytes, 128 bytes, 256 bytes)
The kth thread in the half-warp must access the kth word
- Not all threads need to participate
16 separate memory transactions are issued if the requirements are not fulfilled
Minimum memory transaction size is 32 bytes
3a. Examples of Coalesced Memory Access
For Compute Capability 1.0 and 1.1
Left: coalesced float memory access, resulting in a single 64-byte memory transaction
Right: coalesced float memory access (divergent warp), resulting in a single 64-byte memory transaction
3a. Examples of Non-Coalesced Memory Access
For Compute Capability 1.0 and 1.1
Left: non-sequential float memory access, resulting in 16 memory transactions
Right: access with a misaligned starting address, resulting in 16 memory transactions
3a. Coalescing: Compute Capability 1.2 & Higher
A global memory transaction for a half-warp is determined based on the following rules
- Find the memory segment that contains the address requested by the lowest numbered active thread
  - Segment size is 32 bytes for 1-byte data, 64 bytes for 2-byte data, 128 bytes for 4-, 8- and 16-byte data
- Find all other active threads whose requested address lies in the same segment
- Reduce the transaction size, if possible
  - If the transaction size is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes
  - If the transaction size is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes
- Carry out the transaction and mark the serviced threads as inactive
- Repeat until all threads in the half-warp are serviced
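The rules above can be traced on the host with a small counting routine. This is a sketch that follows the listed steps literally; it is not a model of real hardware, and the function name `countTransactions` and its interface are hypothetical.

```cuda
#include <algorithm>
#include <vector>

// Counts memory transactions for one half-warp under the rules listed above.
// addr[t] is the byte address requested by thread t; wordSize is 1, 2, 4, 8 or 16.
int countTransactions(const std::vector<unsigned int> &addr, unsigned int wordSize)
{
    unsigned int segSize = (wordSize == 1) ? 32u : (wordSize == 2) ? 64u : 128u;
    std::vector<bool> active(addr.size(), true);
    int transactions = 0;

    for (size_t t = 0; t < addr.size(); ++t) {
        if (!active[t]) continue;                      // lowest numbered active thread
        unsigned int lo = addr[t] / segSize * segSize; // start of its segment
        unsigned int minB = 0xFFFFFFFFu, maxB = 0u;    // byte range actually used

        // Collect all active threads whose address lies in the same segment.
        for (size_t u = t; u < addr.size(); ++u) {
            if (active[u] && addr[u] / segSize * segSize == lo) {
                minB = std::min(minB, addr[u]);
                maxB = std::max(maxB, addr[u] + wordSize - 1);
                active[u] = false;                     // serviced by this transaction
            }
        }

        // Shrink 128 -> 64 -> 32 bytes while only one half is used.
        unsigned int size = segSize;
        while (size > 32) {
            unsigned int half = size / 2;
            if (maxB < lo + half)        { size = half; }             // lower half only
            else if (minB >= lo + half)  { lo += half; size = half; } // upper half only
            else break;
        }
        ++transactions;
    }
    return transactions;
}
```

For 16 consecutive floats starting at byte 0 the routine reports one transaction, and for the same access starting at byte 96 it reports two, because the half-warp then straddles a 128-byte segment boundary.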
3a. Eg of Global Memory Access
For Compute Capability 1.2 and higher
Left: random float memory access within a 64B segment, resulting in one memory transaction
Center: misaligned float memory access, resulting in one transaction
Right: misaligned float memory access, resulting in two transactions
3b. Shared Memory and Memory Banks
Very fast on-chip memory
- Can be used to avoid non-coalesced global memory accesses
- Can be used to reduce global memory accesses
Shared memory is organized into 16 banks in G80 (32 banks in Fermi), where successive 4-byte words are assigned to successive banks
Memory load or store of n addresses by a half-warp that span n distinct memory banks can be serviced simultaneously
If multiple addresses map to the same memory bank, the accesses are serialized
If there are multiple reads of the same memory address, a broadcast occurs
3b. Eg Without Bank Conflicts
Left: Sequential access
Right: Random permutation
3b. Eg With Bank Conflicts
Left: 2-way bank conflicts
Right: 8-way bank conflicts
3b. Eg With Broadcast
Left: No conflict
Right: No conflict if the word in Bank 5 is chosen for broadcast first, or 2-way conflicts otherwise
3b. A Common Technique (Trick)
Example: The following causes a 16-way bank conflict if TILE_DIM is a multiple of 16
    ...
    __shared__ float tile[ TILE_DIM ][ TILE_DIM ];
    ...
    float a = tile[ threadIdx.y ][ column ];
The bank conflict can be removed entirely by padding the shared memory array with an extra column
    ...
    __shared__ float tile[ TILE_DIM ][ TILE_DIM + 1 ];
    ...
    float a = tile[ threadIdx.y ][ column ];
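The effect of the extra column can be checked with a little host-side arithmetic. The sketch below assumes 4-byte elements and that the 16 threads of a half-warp read 16 different rows of the same column; `maxBankConflict` is a hypothetical helper, not part of the course code.

```cuda
#include <vector>

// Largest number of half-warp threads that land on one bank when thread r
// reads tile[r][column] from a float tile whose rows are rowPitch words long.
int maxBankConflict(int rowPitch, int column, int numBanks)
{
    std::vector<int> hits(numBanks, 0);
    int worst = 0;
    for (int r = 0; r < 16; ++r) {                 // 16 threads of a half-warp
        int wordIndex = r * rowPitch + column;     // position of tile[r][column]
        int bank = wordIndex % numBanks;           // successive words -> successive banks
        hits[bank] += 1;
        if (hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}
```

With 16 banks, a row pitch of 16 or 32 words puts all 16 accesses on one bank, while a pitch of 17 or 33 spreads them over 16 distinct banks.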
3b. Shared Memory Strategy
Better coalescing means less time waiting for data, which means less latency
In general, you should break up the problem so you can do coalesced copies into shared memory
Then do processing in shared memory without bank conflicts
Then do a coalesced copy back into global memory
4. CUDA: 6th PP – Matrix Transpose
Simple direct copy (not yet transposed)
TILE_DIM = 32
BLOCK_ROWS = 8
Matrix is of dimension width x height
    __global__ void copy(float *odata, float *idata, int width, int height)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index = xIndex + width * yIndex;

        for (int i=0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index + i*width] = idata[index + i*width];
    }
4. CUDA: 6th PP – Matrix Transpose
Direct copy using shared memory (not yet transposed)
    __global__ void copySharedMem(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index = xIndex + width*yIndex;
        for (int i=0; i < TILE_DIM; i += BLOCK_ROWS)
            tile[threadIdx.y+i][threadIdx.x] = idata[index + i*width];
        __syncthreads();
        for (int i=0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index + i*width] = tile[threadIdx.y+i][threadIdx.x];
    }
4. CUDA: 6th PP – Matrix Transpose
Naïve transpose
    __global__ void transposeNaive(float *odata, float *idata, int width, int height)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

        int index_in = xIndex + width * yIndex;
        int index_out = yIndex + height * xIndex;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index_out + i] = idata[index_in + i*width];
    }
4. CUDA: 6th PP – Matrix Transpose
Transpose with coalescing (and with bank conflicts)
    __global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + (yIndex)*width;

        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + (yIndex)*height;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*width];

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
4. CUDA: 6th PP – Matrix Transpose
Transpose without bank conflicts
    __global__ void transposeNoBankConflicts(float *odata, float *idata, int width, int height)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + (yIndex)*width;

        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + (yIndex)*height;

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            tile[threadIdx.y+i][threadIdx.x] = idata[index_in + i*width];

        __syncthreads();

        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index_out + i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
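To see how such a kernel is launched and checked, here is a minimal host-side sketch. It assumes width and height are multiples of TILE_DIM and launches one block of TILE_DIM x BLOCK_ROWS threads per tile. The helper countTransposeMismatches is a hypothetical name introduced for this illustration, not part of the workshop code.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE_DIM   32
#define BLOCK_ROWS 8

// Tiled transpose through a padded shared-memory tile (same scheme as above).
__global__ void transposeNoBankConflicts(float *odata, const float *idata,
                                         int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}

// Hypothetical helper: transposes a matrix with `height` rows and `width`
// columns on the GPU and returns how many elements differ from a CPU transpose.
// Assumes width and height are multiples of TILE_DIM.
int countTransposeMismatches(int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < width * height; i++) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 grid(width / TILE_DIM, height / TILE_DIM);
    dim3 threads(TILE_DIM, BLOCK_ROWS);
    transposeNoBankConflicts<<<grid, threads>>>(d_out, d_in, width, height);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // The output has `width` rows and `height` columns.
    int mismatches = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (h_out[x * height + y] != h_in[y * width + x]) mismatches++;

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return mismatches;
}
```

Calling countTransposeMismatches(64, 96) from main exercises a non-square matrix, which catches a swapped width/height in the index computation.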
Session (6 of 8)
Day 2, Morning, 11:00am – 12:30pm – More on Correctness
5. Review on thread cooperation
- CUDA 7th PP – Histogram (Demo)
- Cooperation within & across block
- Atomic functions
6. Hands-on #5: 8th PP – Centroidal Voronoi Diagram
7. Nvidia Profiler Demo (by Cao Thanh Tung)
Lunch: 12:30pm – 2:00pm
5a. CUDA: 7th PP – Histogram (1 of 2)
    __global__ void kernelHistogram64(int *d_Histogram, uchar *d_Data, int n)
    {
        __shared__ uchar s_hist[BLOCK][HISTOGRAM_SIZE + 1];
        __shared__ int s_sum[HISTOGRAM_SIZE];

        // Each thread initializes its histogram
        for (int i = 0; i < HISTOGRAM_SIZE; i++)
            s_hist[threadIdx.x][i] = 0;

        int totalThreads = gridDim.x * blockDim.x;
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread works on its own set of data
        for (int i = tid; i < n; i += totalThreads)
            s_hist[threadIdx.x][d_Data[i]]++;

        __syncthreads();
5a. CUDA: 7th PP – Histogram (2 of 2)
        // Merge the sub-histograms
        s_sum[threadIdx.x] = s_hist[0][threadIdx.x];
        for (int i = 1; i < blockDim.x; i++)
            s_sum[threadIdx.x] += s_hist[i][threadIdx.x];

        __syncthreads();

        // Merge with the global histogram
        //
        d_Histogram[threadIdx.x] += s_sum[threadIdx.x];
    }
5a. CUDA: 7th PP – Histogram (2 of 2 corrected)
        // Merge the sub-histograms
        s_sum[threadIdx.x] = s_hist[0][threadIdx.x];
        for (int i = 1; i < blockDim.x; i++)
            s_sum[threadIdx.x] += s_hist[i][threadIdx.x];

        __syncthreads();

        // Merge with the global histogram
        // Without using atomic --> Wrong
        // d_Histogram[threadIdx.x] += s_sum[threadIdx.x];

        // With atomic --> Correct
        atomicAdd(&d_Histogram[threadIdx.x], s_sum[threadIdx.x]);
    }
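For comparison, here is a compact, self-contained variant that can be compiled and checked against a CPU histogram. Note that it swaps in a different technique from the slides: one shared histogram per block updated with shared-memory atomicAdd (which needs hardware that supports shared-memory atomics), rather than one sub-histogram per thread. The names histogram64Atomic and histogramMatchesCpu are hypothetical.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define HISTOGRAM_SIZE 64

// Each block builds a private histogram in shared memory, then merges it
// into the global histogram with atomicAdd.
__global__ void histogram64Atomic(int *d_Histogram,
                                  const unsigned char *d_Data, int n)
{
    __shared__ int s_hist[HISTOGRAM_SIZE];
    for (int b = threadIdx.x; b < HISTOGRAM_SIZE; b += blockDim.x)
        s_hist[b] = 0;
    __syncthreads();

    int totalThreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = tid; i < n; i += totalThreads)
        atomicAdd(&s_hist[d_Data[i] % HISTOGRAM_SIZE], 1);  // % guards the bin range
    __syncthreads();

    // Several blocks update the same global bins, so this must be atomic too.
    for (int b = threadIdx.x; b < HISTOGRAM_SIZE; b += blockDim.x)
        atomicAdd(&d_Histogram[b], s_hist[b]);
}

// Hypothetical helper: returns true if the GPU histogram of n pseudo-random
// values in [0, 64) equals a sequential CPU histogram of the same data.
bool histogramMatchesCpu(int n)
{
    unsigned char *h_data = (unsigned char *)malloc(n);
    int h_ref[HISTOGRAM_SIZE] = {0};
    int h_gpu[HISTOGRAM_SIZE];
    for (int i = 0; i < n; i++) {
        h_data[i] = (unsigned char)(rand() % HISTOGRAM_SIZE);
        h_ref[h_data[i]]++;
    }

    unsigned char *d_data;
    int *d_hist;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_hist, sizeof(h_gpu));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, sizeof(h_gpu));
    histogram64Atomic<<<16, 128>>>(d_hist, d_data, n);
    cudaMemcpy(h_gpu, d_hist, sizeof(h_gpu), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int b = 0; b < HISTOGRAM_SIZE; b++)
        ok = ok && (h_gpu[b] == h_ref[b]);

    cudaFree(d_data); cudaFree(d_hist);
    free(h_data);
    return ok;
}
```

Replacing the final atomicAdd with a plain `+=` makes the check fail intermittently once more than one block is launched, which is the error the corrected slide addresses.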
5b. Cooperation within & across block
Cooperation within a block: use shared memory
Cooperation across block difficulties
- Situation #1: Some blocks may be dead (completed) before some others are born (started)
- Situation #2: More than one block wants to access the “same” information – race condition
Cooperation across block: use global memory that is common to all blocks
5b. Race Condition
E.g. two threads try to increase a global counter
Thread 1: R1 = counter
          R1 = R1 + 1
          counter = R1
Thread 2: R2 = counter
          R2 = R2 + 1
          counter = R2
Incorrect result! If both threads read counter before either one writes it back, counter increases by 1 instead of 2.
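The lost update can be observed directly. The sketch below, with hypothetical names incrementRacy, incrementAtomic and runCounter, lets every thread add 1 to a global counter, once with the unprotected read/add/write sequence above and once with atomicAdd.

```cuda
#include <cuda_runtime.h>

__device__ unsigned int counter;

// Unprotected read, add, write: updates from other threads can be lost.
__global__ void incrementRacy()
{
    unsigned int r = counter;
    r = r + 1;
    counter = r;
}

// One indivisible read-modify-write per thread.
__global__ void incrementAtomic()
{
    atomicAdd(&counter, 1u);
}

// Hypothetical helper: resets the counter, launches blocks x threads
// increments, and returns the final counter value.
unsigned int runCounter(bool useAtomic, int blocks, int threads)
{
    unsigned int value = 0;
    cudaMemcpyToSymbol(counter, &value, sizeof(value));
    if (useAtomic)
        incrementAtomic<<<blocks, threads>>>();
    else
        incrementRacy<<<blocks, threads>>>();
    cudaMemcpyFromSymbol(&value, counter, sizeof(value));
    return value;
}
```

With 64 blocks of 256 threads, the atomic version returns 16384. The racy version returns some smaller or equal value that can differ from run to run.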
5c. Atomic Functions
An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory.
atomicAdd( address, val)  … addition
atomicSub( address, val)  … subtraction
atomicExch( address, val) … exchange old with val
atomicMin( address, val)  … min of old and val
atomicMax( address, val)  … max of old and val
atomicInc( address, val)  … (old >= val) ? 0 : (old + 1)
atomicDec( address, val)  … ((old == 0) | (old > val)) ? val : (old – 1)
atomicCAS( address, compare, val) … compare and swap: (old == compare) ? val : old
bitwise atomicAnd( address, val) … ( old & val )
bitwise atomicOr( address, val)  … ( old | val )
bitwise atomicXor( address, val) … ( old ^ val )
5c. Atomic Functions
atomicAdd( address, val) … addition
  * read address for old
  * add to old the val
  * store result to address
  --- all instructions done in one transaction
atomicCAS( address, compare, val) … compare and store: old==compare ? val : old
  * read address for old
  * check old with compare
  * if same, store val, else keep the old value
  * return old
  --- all instructions done in one transaction
A way of implementing a semaphore: expecting compare, else don’t do anything
NOTE: no locking mechanism available…
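The "expect compare, else do nothing" behaviour of atomicCAS is usually wrapped in a retry loop to build atomic operations that are not provided directly. As an illustration, the sketch below builds an atomic maximum for float values; atomicMaxFloat, maxKernel and gpuMax are hypothetical names, and the test data is made up for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Atomic max for float, built from atomicCAS on the value's bit pattern.
// Returns the last value observed at address (the value replaced, if a
// replacement took place).
__device__ float atomicMaxFloat(float *address, float val)
{
    int *addressAsInt = (int *)address;
    int old = *addressAsInt;
    int assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= val) break;   // nothing to update
        // Store val only if the word still holds `assumed`; otherwise another
        // thread got in first, so retry with the value it left behind.
        old = atomicCAS(addressAsInt, assumed, __float_as_int(val));
    } while (old != assumed);
    return __int_as_float(old);
}

__global__ void maxKernel(const float *data, int n, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicMaxFloat(result, data[i]);
}

// Hypothetical helper: returns the GPU maximum of data[i] = (37 * i) % 1000.
float gpuMax(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; i++) h[i] = (float)((37 * i) % 1000);

    float value = -1.0f;   // smaller than every test value
    float *d_data, *d_result;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, &value, sizeof(float), cudaMemcpyHostToDevice);
    maxKernel<<<(n + 255) / 256, 256>>>(d_data, n, d_result);
    cudaMemcpy(&value, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_result);
    return value;
}
```

The loop never blocks: a thread whose atomicCAS fails simply re-reads the current value and tries again, which fits the note that no locking mechanism is available.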
6. Hands-on #5: 8th PP – Centroidal VD
Input: A collection of numPoints colored 2D sites
Output: A map that colors the 2D (digital) space into various regions that “belong” to the sites, with each site at the centroid of its region.
Processing:
repeat until each site is at the centroid of its region
  1. compute Voronoi diagram for sites (previous hands-on)
  2. for each pixel (i,j) do
     compute the sum to the centroid for its region as
       2.1. int k = voronoi[id]; // id is a running number
       2.2. sumVectors[k].x += i;
       2.3. sumVectors[k].y += j;
       2.4. weights[k]++;
     (in kernels.cu)
  3. execute movePoints() to move each site to its centroid (in main.cpp)
endRepeat
6. Hands-on #5: 8th PP – Centroidal VD
Tasks:
(1) Complete kernelCentroidsNoAtomic
(2) Complete kernelCentroidAtomic
[Figure: call graph. CPU: glut_idle, computeVoronoiDiagram, computeCentroids, cpuComputeCentroids, cudaComputeCentroids, movePoints, glutPostRedisplay. GPU: kernelSharedVoronoiDiagram, KernelCentroidsNoAtomic, KernelCentroidsAtomic.]
7. Demo: Nvidia Profiler
Helps measure and find potential performance problems
GPU and CPU timing for all kernel invocations and memcpys
It has access to hardware performance counters
Demo by Cao Thanh Tung
Session (7 of 8)
Day 2, Afternoon, 2:00pm – 3:30pm – PP Primitive
8. CUDA: 9th PP – Sum of Array
- threadfence ( )
9. Optimization IV: Basic Parallel Primitives
- Reduction
- Parallel Prefix
- Compaction
- Sorting
- Working with CUDA Libraries (Visual Studio)
- CUDA: 10th PP – sorting (demo) w/o code
10. Hands-on #6: 10th PP – Sorting an image with CUDPP
Tea break: 3:30pm – 4:00pm
8. CUDA: 9th PP – Sum of Array (1 of 3)
Want just a single kernel to compute the sum of an array of N numbers (not a practical one, but just for the sake of discussion)
Each block first sums a subset of the array and stores the result in global memory
When all blocks are done, the last block done reads these partial sums from global memory and sums them to obtain the final result
To determine which block is last, each block atomically increments a counter
The last block is the one that receives a counter value equal to gridDim.x – 1
A memory fence ( __threadfence( ) ) is used to ensure the partial sum is stored before the counter is incremented
Unlike __syncthreads(), __threadfence() is not a barrier: it does not wait for other threads, it only orders the calling thread's memory writes as seen by the other threads of the device
From CUDA C Programming Guide Version 3.2, Appendix B.
8. CUDA: 9th PP – Sum of Array (2 of 3)
    __device__ unsigned int count = 0;
    __shared__ bool isLastBlockDone;

    __global__ void sum(const float* array, unsigned int N, float* result)
    {
        // Each block sums a subset of the input array
        float partialSum = calculatePartialSum(array, N);

        if (threadIdx.x == 0) {
            // Thread 0 of each block stores partial sum to global memory
            result[blockIdx.x] = partialSum;
            // Thread 0 makes sure its result is visible to
            // all other threads
            __threadfence();
            // Thread 0 of each block signals that it is done
            unsigned int value = atomicInc(&count, gridDim.x);
            // Thread 0 of each block determines if its block is
            // the last block to be done
            isLastBlockDone = (value == (gridDim.x - 1));
        }
8. CUDA: 9th PP – Sum of Array (3 of 3)
        // Synchronize to make sure that each thread reads
        // the correct value of isLastBlockDone
        __syncthreads();

        if (isLastBlockDone) {
            // The last block sums the partial sums
            // stored in result[0 .. gridDim.x-1]
            float totalSum = calculateTotalSum(result);

            if (threadIdx.x == 0) {
                // Thread 0 of last block stores total sum
                // to global memory and resets count so that
                // next kernel call works properly
                result[0] = totalSum;
                count = 0;
            }
        }
    }
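The slides leave calculatePartialSum and calculateTotalSum undefined. The sketch below fills them in with one possible choice, a shared-memory tree reduction called blockReduce (a hypothetical name), so that the whole kernel can be compiled and run. It assumes every block has exactly THREADS threads, declares result as volatile so the last block re-reads the partial sums from memory, and keeps isLastBlockDone inside the kernel.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // assumed block size (a power of two)

__device__ unsigned int count = 0;

// Block-wide tree reduction in shared memory; every thread gets the total.
__device__ float blockReduce(float v)
{
    __shared__ float s[THREADS];
    s[threadIdx.x] = v;
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    float total = s[0];
    __syncthreads();   // s may be reused by a later call
    return total;
}

__global__ void sum(const float *array, unsigned int N, volatile float *result)
{
    __shared__ bool isLastBlockDone;

    // Each block sums a grid-strided subset of the input array.
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += gridDim.x * blockDim.x)
        v += array[i];
    float partialSum = blockReduce(v);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        __threadfence();   // partial sum is visible before count changes
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == gridDim.x - 1);
    }
    __syncthreads();

    if (isLastBlockDone) {
        // The last block to finish sums result[0 .. gridDim.x-1].
        float t = 0.0f;
        for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            t += result[i];
        float totalSum = blockReduce(t);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;     // so that the next kernel call works properly
        }
    }
}

// Hypothetical helper: sums n ones on the GPU and returns the total.
float gpuSum(unsigned int n)
{
    const int blocks = 32;
    std::vector<float> h(n, 1.0f);
    float *d_array, *d_result, total = 0.0f;
    cudaMalloc(&d_array, n * sizeof(float));
    cudaMalloc(&d_result, blocks * sizeof(float));
    cudaMemcpy(d_array, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sum<<<blocks, THREADS>>>(d_array, n, d_result);
    cudaMemcpy(&total, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_array); cudaFree(d_result);
    return total;
}
```

Summing ones keeps every intermediate value an exactly representable integer, so gpuSum(100000) can be compared with 100000.0f without a tolerance. Calling it twice also checks that count is reset correctly.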
9. Optimization IV: Basic Parallel Primitives
We have previously covered parallel prefix (Session #4)
Examples of Parallel Primitives:
- Reduction
- Parallel Prefix
- Compaction
- Sorting
CUDPP is the CUDA Data Parallel Primitives Library.
CUDPP is a library of data-parallel algorithm primitives
Primitives such as these are important building blocks for a wide variety of data-parallel algorithms
9. Definitions of Basic Operations
Reduce
    out = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[n–1]
    where ⊕ is associative
Parallel Prefix / Scan
    out[i] = in[0] ⊕ in[1] ⊕ in[2] ⊕ ... ⊕ in[i]
    where ⊕ is associative
    E.g. all-prefix-sums
(Stream) Compaction
    out[0 .. m–1] = subset of in[0 .. n–1] that satisfy some condition
    where m ≤ n
Sort
    Non-decreasing, non-increasing, stable, non-stable, in-place
    Radix sort
9. Compaction
Stream compaction is a filtering operation
Takes an input vector vi and a predicate p, and outputs only those elements in vi for which p(vi) is true, preserving the ordering of the input elements
Parallel stream compaction can be implemented (by our own – non-optimized) using:
- Scan and scatter
9. Compaction: Using Scan and Scatter
[Figure: compaction by scan and scatter (this is a pre-scan)]
No conflict here because only elements “C” and “G” can write to output array
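A minimal, non-optimized sketch of scan and scatter is given below. It assumes a single block of BLOCK_N threads with one element per thread, uses "value is even" as a stand-in predicate, and computes the pre-scan (exclusive scan) by subtracting each flag from a simple inclusive scan. The names compactEven, compactEvenOnGpu and BLOCK_N are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_N 256   // single block, one thread per element (sketch only)

// Keeps the even values of in[0 .. BLOCK_N-1], preserving their order.
// out receives the survivors; *outCount receives how many there are.
__global__ void compactEven(const int *in, int *out, int *outCount)
{
    __shared__ int scan[BLOCK_N];
    int i = threadIdx.x;
    int flag = (in[i] % 2 == 0) ? 1 : 0;    // predicate p(v)

    // Inclusive scan of the flags (simple version, not work efficient).
    scan[i] = flag;
    __syncthreads();
    for (int offset = 1; offset < BLOCK_N; offset *= 2) {
        int add = (i >= offset) ? scan[i - offset] : 0;
        __syncthreads();
        scan[i] += add;
        __syncthreads();
    }

    // Pre-scan value = output slot. Only flagged elements write, and each
    // gets a distinct slot, so the scatter has no conflicts.
    int slot = scan[i] - flag;
    if (flag) out[slot] = in[i];
    if (i == BLOCK_N - 1) *outCount = scan[i];
}

// Hypothetical helper: compacts a host vector of exactly BLOCK_N ints.
std::vector<int> compactEvenOnGpu(const std::vector<int> &in)
{
    int *d_in, *d_out, *d_count, count = 0;
    cudaMalloc(&d_in, BLOCK_N * sizeof(int));
    cudaMalloc(&d_out, BLOCK_N * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, in.data(), BLOCK_N * sizeof(int), cudaMemcpyHostToDevice);
    compactEven<<<1, BLOCK_N>>>(d_in, d_out, d_count);
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    std::vector<int> out(count);
    if (count > 0)
        cudaMemcpy(out.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return out;
}
```

For larger inputs the single-block scan would be replaced by a multi-block scan, for example the one provided by CUDPP.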
9. Working with CUDPP (1 of 3)
CUDPP documentation: http://cudpp.googlecode.com/svn/tags/1.1.1/cudpp/doc/html/index.html
Example: http://cudpp.googlecode.com/svn/tags/1.1.1/cudpp/doc/html/example_simplecudpp.html
9. Working with CUDPP (2 of 3)
Easy stuff: include cudpp32.lib in your system for compilation, include cudpp32_xxxx.dll from CUDA SDK directory
Steps to getting it working (e.g. scan):
1. Configure cudpp to work with a plan
9. Working with CUDPP (3 of 3)
Steps to getting it working (e.g. scan):
2. Once okay, run the plan to do the scan
3. Next is to read the result back
4. Finally, clean up the memory used
10. Hands-on #6: 10th PP – Sorting Colors
Demo of sorting of image colors (rgba) for each row of the image in increasing order
Perform the above sorting steps using the image processing program
[Figure: results of the 4 options. Option 1 (original), Option 2 (inverted), Option 3 (sorted by color), Option 4 (sorted by color in each row)]
10. Hands-on #6: 10th PP – Sorting Colors
[Figure: call graph. CPU: glut_keyboard, invertImage, cudaSortImage (the usual setting up, prepare sorting plan), sortImage. GPU: kernelInvertImage, initializeRows, cudppSort (applied twice).]
Each pixel is given a row number to facilitate 2 rounds of sorting
Session (8 of 8)
Day 2, Afternoon, 4:00pm – 5:30pm – Roundup
11. Roundup Optimization
- Lecture by Prof Hwu Wen-mei
12. Roundup Theory of Parallel Computation
- Amdahl’s Law
- Gustafson’s Law
- NC
- Work Efficiency
13. Roundup Parallel Computational Thinking
14. Outlook
Keep-in-touch!
11. What we did not cover?
Here are some samples of topics not covered:
C/C++ Runtime for CUDA vs CUDA Driver API
Memory optimizations
- E.g. data transfer between host and device
- Pinned memory & zero copy
- Asynchronous transfers (both CPU & GPU are busy)
- Global memory bank conflicts
- CUDA array (for textures)
CUDA and OpenGL/DirectX interoperability
- CUDA with OpenGL – only a special type of texture (RGBA, 32 bits x 4) that can be processed w/o reading and writing from one to another
Multi-GPUs
Windows OS limits computation to 5 sec; it uses 100-200MB for display purposes.
11. Roundup Optimization
Parallel Problem Solving: (1) Sequential solution (2) Sequential solution with the right data structure (3) Parallel solution (4) Parallel solution with work (algorithmic) optimization (5) Parallel solution with hardware consideration
Lecture by Prof Hwu Wen-mei: http://www.gpucomputing.net/?q=node/2376, also: Books: GPU Computing Gems, vol. 1 & 2
8 Techniques in many-core programming
- Scatter to gather transformation
- Privatization
- Granularity coarsening
- Data tiling/reuse
- Data layout and traversal ordering
- Input data binning
- Bin compaction
- Regularization
11. Eight Techniques in Programming
#1: Scatter to Gather: prefer gather for computation
Example: Jump Flooding Algorithm (video)
#2: Privatization: first do your own, then combine
Example: Histogram (session #7)
#3: Granularity coarsening: fewer threads to re-use computation
Example: Potential energy calculation
#4: Data Tiling/reuse: using shared memory
Example: matrix multiplication
#5: Data Layout & Traversal ordering
Memory optimization examples
#6: Input Data Binning
Example: gHull
#7: Bin Compaction
#8: Regularization
Example: graph algorithm (dynamic amount of work)
12a. Amdahl’s Law
Amdahl’s Law: used to find the max expected improvement to an overall system when only part of the system is improved.
Assumption: Fixed problem size (or assume the proportion of the sequential portion does not change with problem size)
S is the possible speedup
p is the fraction of the program that is parallelizable
N is the number of parallel processors
S = 1 / [ (1 – p) + p/N ], which is S = 1 / (1 – p) when N goes to infinity
E.g. if p is 90%, then S is at most 10x speedup
http://en.wikipedia.org/wiki/Amdahl's_law
12a. Gustafson’s Law
Gustafson’s Law: says that problems with large, repetitive data sets can be efficiently parallelized.
Assumption: Problem size can grow so big that the sequential portion becomes insignificant
S is the possible speedup: S = SequentialTime / ParallelTime
Time taken on a parallel machine of N processors is normalized to 1
There are two components, sequential-part (a) + parallel-part (b):
    a + b = 1
The above time converted to a sequential machine is then:
    a + N b = a + N (1 – a)
So, speedup is:
    S = [ a + N (1 – a) ] / 1 = a + N (1 – a), which is S = N when sequential-part is insignificant
http://en.wikipedia.org/wiki/Gustafson%27s_Law
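A small worked example may help to compare the two formulas. The numbers are chosen only for illustration: a 10% sequential part and N = 10 processors in both cases.

```latex
\begin{align*}
\text{Amdahl } (p = 0.9,\ N = 10):\quad S &= \frac{1}{(1 - p) + p/N} = \frac{1}{0.1 + 0.09} \approx 5.26 \\
\text{Gustafson } (a = 0.1,\ N = 10):\quad S &= a + N(1 - a) = 0.1 + 10 \times 0.9 = 9.1
\end{align*}
```

The two results differ because the 10% is measured differently: in Amdahl's Law it is a fraction of the sequential running time, while in Gustafson's Law it is a fraction of the running time on the parallel machine.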
12b. Amdahl’s vs Gustafson’s
Amdahl’s Law
“Suppose a car is traveling between two cities 60 miles apart, and has already spent one hour traveling half the distance at 30 mph. No matter how fast you drive the last half, it is impossible to achieve 90 mph average before reaching the second city. Since it has already taken you 1 hour and you only have a distance of 60 miles total; going infinitely fast you would only achieve 60 mph.”
Gustafson’s Law
“Suppose a car has already been traveling for some time at less than 90mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90mph, no matter how long or how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for an hour, and so on.”
http://en.wikipedia.org/wiki/Gustafson%27s_Law
12b. Amdahl’s vs Gustafson’s vs Nvidia
Gustafson's Law argues that even using massively parallel computer systems does not influence the serial part, and regards this part as a constant one.
Amdahl’s Law results from the idea that the influence of the serial part grows with the number of processes.
Nvidia refers to what we practically have…
- The above laws neglect other potential bottlenecks such as memory bandwidth and I/O bandwidth (which generally do not scale very well with the number of processors)… recall bank conflict, coalescing etc.
- Take the laws with a grain of salt: we do not always parallelize a sequential version of the algorithm
So, what is the best that we can do? Next slide: more theory…
12c. NC
The class NC (for “Nick’s Class”) is the set of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial # of processors
A problem is in NC if there exist constants c and k such that it can be solved in time O(log^c n) using O(n^k) parallel processors
Stephen Cook coined NC after Prof Nick Pippenger
NC is a subset of P
It is unknown whether NC = P, but most researchers suspect this to be false
Parallel computer means “parallel, random-access machine” (PRAM). It can be CRCW, CREW, EREW.
Many developments in the 1970s and 1980s on PRAM models and results…
Why polylogarithmic time? Logarithmic is best in parallel vs linear is best in sequential
http://en.wikipedia.org/wiki/NC_(complexity)
12d. Work Efficient
Besides the (poly)logarithmic time, we should look for work-efficient algorithms (which also means power efficient)
Work = Total-Time x Work-Per-Time-Step
E.g. Jump Flooding Algorithm (video)
Time is O(log n) where there are n x n pixels
Is this work efficient? Work = O(log n) x n x n = O(n^2 log n)
What is the consequence?
Parallel Banding Algorithm (PBA)
Other example: Work Efficient Parallel Prefix
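To make the work count for jump flooding concrete, here is the arithmetic for one image size. It assumes one unit of work per pixel per step (constant factors are ignored) and n = 1024, chosen only for illustration.

```latex
\begin{align*}
W_{\text{JFA}} &= \underbrace{\log_2 n}_{\text{steps}} \times \underbrace{n^2}_{\text{pixels per step}} = n^2 \log_2 n \\
n = 1024:\quad W_{\text{JFA}} &= 10 \times 1024^2 = 10\,485\,760 \quad \text{versus} \quad n^2 = 1\,048\,576 \text{ pixels}
\end{align*}
```

Under these assumptions the parallel algorithm performs about 10 times as many pixel updates as there are pixels, and that factor keeps growing with log n.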
13. Roundup Parallel Computational Thinking
Computational thinking is the thought process of formulating domain problems in terms of computation steps/algorithms.
Skill set to be an effective computational thinker (from Kirk & Hwu):
- Computer architecture – memory organization; caching & locality, memory bandwidth; SIMT, SPMD, SIMD; floating point precision vs accuracy
- Programming models – types of available memories, concepts to think through the data organization and loop structures
- Algorithm techniques – understanding tiling, cutoff, binning, etc. as toolbox, and implications of scalability, efficiency and memory bandwidth
- Domain knowledge of the problems – deviation from the sequential computation may be possible
13. Roundup Parallel Computational Thinking
Table columns: Challenge, Issue, Attempt, Coverage, Remark

Challenge: Correctness – always produce the correct answer each time it runs
  Issue (1): Indexing mechanism
    Attempt: divide logically the global memory into logical parts to be processed by blocks of threads
    Coverage: Session #1: vector addition (simple); Session #2: image processing (simple); Session #5: matrix transpose (complex)
    Remark: error-prone in programming; wishing for a visual programming tool
  Issue (2): Race condition
    Attempt: use of atomic functions
    Coverage: Session #6: atomic examples
    Remark: evolving with technology; wishing for help by compiler

Challenge: Performance – always run much faster than a sequential solution within the available computing resources
  Issue (1): Algorithm optimization (regularize work, balance work)
    Attempt: high level redesign of (work efficient) algorithm; low level adjustment to thread assignment to part of global memory
    Coverage: gHull, pba by G3 Lab; Session #4: parallel prefix (ver 3); Session #7: using primitive routine
    Remark: limited by our imagination
  Issue (2): Data structure / memory optimization (regularize & localize data, no conflict in update)
    Attempt: use of intermediate memory (texture, shared, etc.); avoid (shared memory) bank conflicts; plan for coalesced (global) memory transactions
    Coverage: Session #3: maxdist, Voronoi diagram; Session #5: matrix transpose
    Remark: wishing for a smart compiler
  Issue (3): Limits from Theory
    Attempt: Amdahl’s Law; Gustafson’s Law; NC
    Coverage: Session #8: general theory
    Remark: any other theory?

Challenge: Scalability – always can handle growing amounts of work in a graceful manner
  Issue: Data too big to fit into GPU
    Attempt: related to efficiency; batch processing; out-of-core processing
    Remark: no particular research here
14. Outlook
US Stock: if you buy, check out the trend of nVidia, and the (opposite?) trend of Intel ☺ More parallel computers were sold in the recent year than in the past 30 years. GPU takes root.
Software
- Software companies are “re-architecting” software to take advantage of GPUs
Professionals
- Upcoming generation of software engineers should be thinking in parallel to be really competitive
Technology
- Convergence of GPU with CPU
- OpenCL vs CUDA (vs BrookGPU, etc.)
- nVidia vs AMD/ATI, nVidia vs Intel
Research
- Better compiler, visual development toolkits?
- Better understanding of parallel problem solving strategies
- Discrete (GPU) vs Continuous (real) problem understanding
15. Q & A
References
NVIDIA CUDA Programming Guide, Version 3.1
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf
NVIDIA CUDA C Programming Best Practices Guide, Version 3.1
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_BestPracticesGuide_3.1.pdf
Mark Harris, Optimizing Parallel Reduction in CUDA
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Can also be found in CUDA SDK projects\reduction\doc folder
GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
GPU Gems, Chapter 37: A Toolkit for Computation on GPUs
http://http.developer.nvidia.com/GPUGems/gpugems_ch37.html
GPU Gems 2, Chapter 32: Taking the Plunge into GPU Computing, Section 32.3
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html
GPU Gems 2, Chapter 36: Stream Reduction Operations for GPGPU Applications, Section 36.1
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter36.html