Getting Started with CUDA C/C++
Michael Wang
UniMelb/NVIDIA
Thank You to Our Sponsors
GPU CPU
GPGPU Co-Processing
ONCE UPON A TIME…
Past Massively Parallel Supercomputers
Thinking Machine
MasPar
Cray 2
Goodyear MPP
Today: 1.31 TFLOPS on DGEMM
Made Possible by Graphics
[Timeline: 1992, 1993, 1996, 1997, 1999, 2004, 2011, 2013]
Leveraging The Power of Graphics
1999: Fixed Function Pipeline
2013: Fully Programmable Pipeline
GENERAL PURPOSE
CPU GPU
Low Latency vs High Throughput
CPU
Optimized for low-latency access to cached data sets
Control logic for out-of-order and speculative execution
GPU
Optimized for data-parallel, throughput computation
Architecture tolerant of memory latency
More transistors dedicated to computation
Low Latency or High Throughput?
CPU architecture must minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
CPU core – Low Latency Processor
GPU Streaming Multiprocessor – High Throughput Processor
[Figure: execution timelines for CPU threads T1 to T4 and GPU warps W1 to W4 (computation thread/warp Tn). Legend: processing, waiting for data, ready to be processed, context switch.]
K20X: 3x Faster Than Fermi
DGEMM TFlops
Xeon E5-2687W (8 core, 3.1 GHz): 0.17
Tesla M2090 (Fermi): 0.43
Tesla K20X: 1.22
GPUs: Two Year Heart Beat
[Chart: GFLOPS per Watt (axis from 2 to 16) against year (2008, 2010, 2012, 2014) for the Tesla, Fermi, Kepler and Maxwell architectures]
CUDA Parallel Computing Platform
Programming Approaches: Libraries (“Drop-in” Acceleration), OpenACC Directives (Easily Accelerate Apps), Programming Languages (Maximum Flexibility)
Development Environment: Nsight IDE (Linux, Mac and Windows), GPU Debugging and Profiling, CUDA-GDB debugger, NVIDIA Visual Profiler
Open Compiler Tool Chain: Enables compiling new languages to CUDA platform, and CUDA languages to other architectures
Hardware Capabilities: GPUDirect, SMX, Dynamic Parallelism, HyperQ
Getting Started with CUDA
GPU Accelerated Science Applications
More than 110 accelerated science apps in our catalog. Just a few:
www.nvidia.com/teslaapps
AMBER GROMACS LAMMPS
Tsunami RTM NWChem
GPU Accelerated Workstation Applications
Fifty accelerated workstation apps in our catalog. Just a few:
www.nvidia.com/object/gpu-accelerated-applications.html
3 Ways to Accelerate Applications
Libraries: “Drop-in” Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
Why should you use libraries?
No need to reinvent the wheel
Implement complex algorithms
Deal with details of the platform
High Performance
Layers of optimizations
In depth knowledge of architecture
Low Maintenance
Rigorous testing/quality-assurance
Have someone to file bugs against
GPU Accelerated Libraries
“Drop-in” Acceleration for your Applications
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA cuSPARSE
NVIDIA NPP
NVIDIA cuFFT
Vector Signal Image Processing
GPU Accelerated Linear Algebra
Matrix Algebra on GPU and Multicore
C++ STL Features for CUDA
Sparse Linear Algebra
IMSL Library
Building-block Algorithms for CUDA
ArrayFire Matrix Computations
Explore the CUDA (Libraries) Ecosystem
CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem
Watch GTC On Demand Talks: www.gputechconf.com/gtcnew/on-demand-gtc.php
OpenACC Directives
Simple Compiler hints
Compiler Parallelizes code
Works on many-core GPUs & multicore CPUs
Your original Fortran or C code, with an OpenACC compiler hint:
Program myscience
  ... serial code ...
  !$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience
[Diagram: code split between CPU and GPU]
Come to Paul’s talk after lunch
OpenACC Specification and Website
Full OpenACC 1.0 Specification available online
openacc.org
OpenACC 2.0 Specification recently announced
Available for public comment
Implementations available from PGI, Cray, and CAPS
GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: OpenACC, Thrust, CUDA C++
Python: PyCUDA, Copperhead, NumbaPro, …Anaconda
F#: Alea.cuBase
Numerical analytics: MATLAB, Mathematica, LabVIEW
Come to Amr’s talk after lunch
Programming a CUDA Language
CUDA C/C++
Based on industry-standard C/C++
Small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory, etc.
We run a half-day to full-day CUDA EASY Workshop that introduces CUDA C/C++ and OpenACC (for C)
Fully interactive, with hands-on demos
More details on this at the end!
Prerequisites
You (probably) need experience with C or C++
Our CUDA EASY Workshops assume ZERO actual C/C++ experience, but a working understanding of programming (e.g. MATLAB, R, Mathematica)
You don’t need GPU experience
You don’t need parallel programming experience
You don’t need graphics experience
CUDA 5 Toolkit and SDK www.nvidia.com/getcuda
Heterogeneous Computing
Terminology:
Host: The CPU and its memory (host memory)
Device: The GPU and its memory (device memory)
[Diagram: Host and Device connected by PCIe]
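Because host memory and device memory are separate, data has to be copied between them explicitly. The following is a minimal sketch of that round trip using the CUDA runtime calls `cudaMalloc`, `cudaMemcpy` and `cudaFree`; the helper name `roundTripMismatches` and the buffer names are illustrative assumptions, not something from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies n floats host -> device -> host and returns
// how many elements differ afterwards (0 means the round trip worked),
// or -1 if the device allocation failed.
int roundTripMismatches(int n) {
    std::vector<float> hostIn(n), hostOut(n, -1.0f);
    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<float>(i);

    const size_t bytes = n * sizeof(float);
    float *devBuf = nullptr;  // points into device (GPU) memory
    if (cudaMalloc((void **)&devBuf, bytes) != cudaSuccess) return -1;

    // Host memory -> device memory, then device memory -> host memory.
    cudaMemcpy(devBuf, hostIn.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(hostOut.data(), devBuf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devBuf);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hostOut[i] != hostIn[i]) ++mismatches;
    return mismatches;
}

int main() {
    int bad = roundTripMismatches(1024);
    std::printf("%s\n", bad == 0 ? "round trip ok" : "round trip failed");
    return bad == 0 ? 0 : 1;
}
```

A device pointer such as `devBuf` must not be dereferenced in host code; it is only meaningful to the runtime API and to kernels.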
Vector Addition – CUDA C
Standard C Code
void vecAddCPU (int n, float *a, float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
...
// Perform vecAddCPU on 1M elements
vecAddCPU (4096*256, a, b, c);
CUDA C Code
__global__
void vecAddGPU (int n, float *a, float *b, float *c)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
...
// Perform vecAddGPU on 1M elements
vecAddGPU<<<4096,256>>>(n, a, b, c);
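The CUDA C snippet on this slide elides allocation and data transfer behind "...". One way the complete program might look is sketched below. The kernel body and the `<<<4096,256>>>` launch follow the slide; the helper name `vecAddMismatches`, the `h_`/`d_` naming and the test data are assumptions added for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel from the slide (const added to the read-only inputs).
__global__ void vecAddGPU(int n, const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical driver: adds two vectors of 4096*256 elements on the GPU and
// returns the number of wrong results, or -1 on a CUDA error.
int vecAddMismatches() {
    const int n = 4096 * 256;  // 1M elements, as on the slide
    const size_t bytes = n * sizeof(float);

    std::vector<float> h_a(n), h_b(n), h_c(n, 0.0f);
    for (int i = 0; i < n; ++i) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = 2.0f * static_cast<float>(i);
    }

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc((void **)&d_c, bytes) != cudaSuccess) { cudaFree(d_a); cudaFree(d_b); return -1; }

    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    // 4096 blocks of 256 threads: one thread per element.
    vecAddGPU<<<4096, 256>>>(n, d_a, d_b, d_c);
    cudaError_t launchErr = cudaGetLastError();

    // This copy waits for the kernel to finish before reading d_c.
    cudaError_t copyErr = cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_c[i] != h_a[i] + h_b[i]) ++mismatches;
    return mismatches;
}

int main() {
    int bad = vecAddMismatches();
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The pointers passed to the kernel are device pointers; passing the host arrays directly would not work in this explicit-copy style.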
Parallelism on a GPU – CUDA Blocks
A function which runs on a GPU is called a “kernel”
Each parallel invocation of a function running on the GPU is called a “block”
A block can identify itself by reading blockIdx.x
In this example, gridDim.x = N
Grid 0: blockIdx.x = 0, blockIdx.x = 1, blockIdx.x = 2, …, blockIdx.x = N-1
Parallelism on a GPU – CUDA Threads
Each block is then broken up into “threads”
A thread can identify itself by reading threadIdx.x
The total number of threads per block can be read with blockDim.x
In this example, blockDim.x = M
Block 0: threadIdx.x = 0, threadIdx.x = 1, threadIdx.x = 2, …, threadIdx.x = M - 1
Why threads and blocks?
Threads within a block can:
Communicate very quickly (share memory)
Synchronize (wait for all threads to catch up)
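Both capabilities can be seen in a small sketch. Here each block stages its own tile of an array in `__shared__` memory, calls `__syncthreads()` so every thread waits until the tile is fully loaded, and then reads an element that a different thread wrote. The kernel `reverseTiles`, the `BLOCK_SIZE` constant and the helper name are illustrative assumptions; the array length is assumed to be a multiple of the block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // threads per block (assumed; must match the launch)

// Each block reverses its own BLOCK_SIZE-element tile of the array in place.
__global__ void reverseTiles(float *data) {
    __shared__ float tile[BLOCK_SIZE];   // shared by all threads of this block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;  // first element owned by this block

    tile[t] = data[base + t];            // each thread stages one element
    __syncthreads();                     // wait until the whole tile is loaded
    data[base + t] = tile[blockDim.x - 1 - t];  // read another thread's element
}

// Hypothetical driver: returns the number of wrong elements, or -1 on error.
int reverseTilesMismatches(int numBlocks) {
    const int n = numBlocks * BLOCK_SIZE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float *dev = nullptr;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    reverseTiles<<<numBlocks, BLOCK_SIZE>>>(dev);
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / BLOCK_SIZE) * BLOCK_SIZE;
        int t = i % BLOCK_SIZE;
        float expected = static_cast<float>(base + BLOCK_SIZE - 1 - t);
        if (host[i] != expected) ++mismatches;
    }
    return mismatches;
}

int main() {
    int bad = reverseTilesMismatches(4);
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Without the `__syncthreads()` call a thread could read a tile slot before the thread responsible for it has written it. Threads in different blocks cannot synchronize this way.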
Why threads and blocks?
Why break up into blocks?
A block cannot be broken up among multiple SMs (streaming multiprocessors), and you want to keep all SMs busy.
Allows the HW to scale the number of blocks running in parallel based on GPU capability
[Diagram: Blocks 1 to 8 of the same kernel scheduled over time on two GPUs, GPU X and GPU Y, which run different numbers of blocks in parallel]
Controlling Parallelism in Kernel
vecAddGPU<<<1,1024>>>(n, a, b, c);
vecAddGPU<<<1024,1024>>>(n, a, b, c);
vecAddGPU<<<2048,512>>>(n, a, b, c);
vecAddGPU<<<4096,256>>>(n, a, b, c);
Triple chevron contains the “kernel launch parameters”
1st parameter defines number of blocks (hamster cages) to use
2nd parameter defines number of threads per block (hamsters per cage) to populate
There is a hardware limit to both, which depends on the GPU architecture
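The limits can be queried at run time rather than memorized. The sketch below reads them with `cudaGetDeviceProperties` and computes how many blocks are needed to cover `n` elements by rounding up. The helper `blocksFor` and the choice of device 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= n (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // device 0 assumed
        std::printf("no CUDA device found\n");
        return 1;
    }
    std::printf("Device: %s\n", prop.name);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Max grid size in x: %d\n", prop.maxGridSize[0]);

    const int n = 1000000;
    const int threadsPerBlock = 256;
    const int blocks = blocksFor(n, threadsPerBlock);
    std::printf("%d elements -> %d blocks of %d threads\n",
                n, blocks, threadsPerBlock);
    return 0;
}
```

When `n` is not a multiple of the block size, the last block has spare threads, which is why the kernel guards its array accesses with `if (i < n)`.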
Kepler GK110 Block Diagram
VecAdd CPU Function
void vecAddCPU (int n, float *a, float *b, float *c) {
for (int i = 0; i < n; ++i)
c[i] = a[i] + b[i];
}
VecAdd GPU Kernel
__global__
void vecAddGPU (int n, float *a, float *b, float *c) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
Block: i = 0, i = 1, i = 2, …, i > n-1 (threads with i > n-1 fail the i < n check and do no work)
blockIdx.x: Our block ID
blockDim.x: Number of threads per block
threadIdx.x: Our thread ID
i is now an index into our input and output arrays
VecAdd GPU Kernel – Data Accesses
Let’s work with 30 data elements
Broken into 3 blocks, with 10 threads per block
So, blockDim.x = 10
For blockIdx.x = 0: i = 0 * 10 + threadIdx.x = {0,1,2,3,4,5,6,7,8,9}
For blockIdx.x = 1: i = 1 * 10 + threadIdx.x = {10,11,12,13,14,15,16,17,18,19}
For blockIdx.x = 2: i = 2 * 10 + threadIdx.x = {20,21,22,23,24,25,26,27,28,29}
Each block has 10 threads (hamsters), each with a different i
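The 30-element example can be checked directly. In the sketch below, each thread computes `i` exactly as in the kernel and records which block and thread produced it; the host then compares the records with `i / 10` and `i % 10`. The kernel `recordOwner` and the helper name are assumptions introduced for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread writes its block ID and thread ID at its global index i.
__global__ void recordOwner(int n, int *blockOf, int *threadOf) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        blockOf[i] = blockIdx.x;
        threadOf[i] = threadIdx.x;
    }
}

// Hypothetical driver: 3 blocks of 10 threads over 30 elements.
// Returns the number of elements with an unexpected owner, or -1 on error.
int indexMismatches() {
    const int n = 30, blocks = 3, threadsPerBlock = 10;
    const size_t bytes = n * sizeof(int);
    std::vector<int> blockOf(n, -1), threadOf(n, -1);

    int *d_block = nullptr, *d_thread = nullptr;
    if (cudaMalloc((void **)&d_block, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_thread, bytes) != cudaSuccess) {
        cudaFree(d_block);
        return -1;
    }

    recordOwner<<<blocks, threadsPerBlock>>>(n, d_block, d_thread);
    cudaError_t e1 = cudaMemcpy(blockOf.data(), d_block, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(threadOf.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block);
    cudaFree(d_thread);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        std::printf("i = %2d  blockIdx.x = %d  threadIdx.x = %d\n",
                    i, blockOf[i], threadOf[i]);
        if (blockOf[i] != i / threadsPerBlock || threadOf[i] != i % threadsPerBlock)
            ++mismatches;
    }
    return mismatches;
}

int main() {
    return indexMismatches() == 0 ? 0 : 1;
}
```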
VecAdd More Efficiently… with Thrust
High-level
Parallel analogue to STL
Performance-portable
Productive
Come to Luke’s talk after lunch
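The Thrust code from this slide did not survive in the transcript. As a rough sketch of what a Thrust vector addition might look like, the example below uses `thrust::device_vector` for storage and `thrust::transform` with `thrust::plus<float>` in place of a hand-written kernel; the helper name and the test data are assumptions.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Hypothetical helper: computes c = a + b on the GPU with Thrust and
// returns the number of wrong results.
int thrustVecAddMismatches(int n) {
    thrust::device_vector<float> a(n), b(n), c(n);  // live in device memory
    thrust::sequence(a.begin(), a.end());           // a = 0, 1, 2, ...
    thrust::fill(b.begin(), b.end(), 2.0f);         // b = 2, 2, 2, ...

    // Element-wise c[i] = a[i] + b[i]; Thrust picks the launch configuration.
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());

    thrust::host_vector<float> h_c = c;  // copies device -> host
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_c[i] != static_cast<float>(i) + 2.0f) ++mismatches;
    return mismatches;
}

int main() {
    int bad = thrustVecAddMismatches(1 << 20);
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Compared with the explicit kernel, there is no `cudaMalloc`, no `cudaMemcpy` and no triple-chevron launch in user code; the containers and algorithms handle those details.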
How to get started?
Profile your existing CPU code
Look for the time hogs!
Is it a good candidate?
Nested loops, lots of parallelism, data independence, ...
Don’t have to port the whole code at once!
Port in steps
Analyze, Parallelize, Optimize, Deploy!
NVIDIA® Nsight™ Eclipse Edition for Linux and MacOS
CUDA-Aware Editor
Automated CPU to GPU code refactoring
Semantic highlighting of CUDA code
Integrated code samples & docs
Nsight Debugger
Simultaneously debug CPU and GPU
Inspect variables across CUDA threads
Use breakpoints & single-step debugging
Nsight Profiler
Quickly identifies performance issues
Integrated expert system
Source line correlation
developer.nvidia.com/nsight
NVIDIA® Nsight™ Visual Studio Ed.
System Trace
Review CUDA activities across CPU and GPU
Perform deep kernel analysis to detect factors limiting maximum performance
CUDA Profiler
Advanced experiments to measure memory utilization, instruction throughput and stalls
CUDA Debugger
Debug CUDA kernels directly on GPU hardware
Examine thousands of threads executing in parallel
Use on-target conditional breakpoints to locate errors
CUDA Memory Checker
Enables precise error detection
NVIDIA Visual Profiler
Links to get started
Get CUDA: www.nvidia.com/getcuda
Nsight IDE: www.nvidia.com/nsight
Programming Guide/Best Practices…: docs.nvidia.com
Questions:
NVIDIA Developer forums: devtalk.nvidia.com
Search or ask on www.stackoverflow.com/tags/cuda
General: www.nvidia.com/cudazone
Local Meetup Group
Monthly get-togethers
Talks / Guest Speakers
Prizes
www.meetup.com/Melbourne-GPU-Users/
Last of all, and BEST of all…
Sign up for the next CUDA EASY Workshop, happening at:
Swinburne University
Monash University
University of Melbourne
…your Uni…
…your company…
FREE!