![Page 1: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/1.jpg)
The “New” Moore’s Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution
Enter the GPU
• Massive economies of scale
• Massively parallel
Graphical processors
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
  - Programmability
  - Precision
  - Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation
Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  - Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programming model scales transparently
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Pictured: GeForce 8800, Tesla S870, Tesla D870]
Computational Power
• GPUs are fast…
  - 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
  - NVIDIA GeForceFX 7800: 165 GFLOPS
  - 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
  - ATI Radeon X850 XT Platinum Edition: 37.8 GB/s
• GPUs are getting faster, faster
  - CPUs: 1.4× annual growth
  - GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
CPU vs GPU
Flexible and Precise
• Modern GPUs are deeply programmable
  - Programmable pixel, vertex, video engines
  - Solidifying high-level language support
• Modern GPUs support high precision
  - 32-bit floating point throughout the pipeline
  - High enough for many (not all) applications
GPU for graphics
• GPUs designed for & driven by video games
  - Programming model unusual
  - Programming idioms tied to computer graphics
  - Programming environment tightly constrained
• Underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret
General purpose GPUs
• The power and flexibility of GPUs make them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Previous GPGPU Constraints
• Dealing with graphics API
  - Working with the corner cases of the graphics API
• Addressing modes
  - Limited texture size/dimension
• Shader capabilities
  - Limited outputs
• Instruction sets
  - Lack of integer & bit ops
• Communication limited
  - Between pixels
[Figure: a fragment program with input registers, output registers, constants, texture, temp registers, and FB memory; a legend marks resources as per thread, per shader, or per context.]
Enter CUDA
• Scalable parallel programming model
• Minimal extensions to familiar C/C++ environment
• Heterogeneous serial-parallel computing
Sound Bite
GPUs + CUDA =
The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology
MOTIVATION
• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ionic placement for molecular dynamics simulation on GPU
• 19X: Transcoding HD video stream to H.264
• 17X: Fluid mechanics in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: an M-script API for GPU linear algebra
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences
[Chart: run time, CPU only vs. heterogeneous with Tesla GPU, for Cell Phone RF Simulation, Computational Chemistry, Neurological Modeling, and 3D CT Ultrasound. The paired bars read 4.6 days vs. 27 minutes, 2.7 days vs. 30 minutes, 8 hours vs. 13 minutes, and 3 hours vs. 16 minutes.]
GPUs: Turning Point in Supercomputing
• Desktop beats cluster: 4 GPUs vs 256 CPUs
• Tesla Personal Supercomputer: $10,000
• CalcUA: $5 million
Source: University of Antwerp, Belgium
CUDA: ‘C’ FOR PARALLELISM
Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
Parallel C Code
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
So far today...
• GPU – powerful coprocessors
• CUDA – programming model for GPU
• Easier to parallelize on GPUs
• CUDA extends GPU to general purpose computing
• Now we shall look at thread programming and the memory structure on the GPU
Hierarchy of concurrent threads
• Parallel kernels composed of many threads
  - all threads execute the same sequential program
• Threads are grouped into thread blocks
  - threads in the same block can cooperate
• Threads/blocks have unique IDs
[Figure: a thread t; a block b made of threads t0, t1, …, tB; a kernel foo() made of many blocks.]
Hierarchical organization
• Thread: per-thread local memory
• Block: per-block shared memory, local barrier
• Kernel 0, Kernel 1, …: per-device global memory, global barrier between kernels
Heterogeneous Programming
• CUDA = serial program with parallel kernels, all in C
  - Serial C code executes in a CPU thread
  - Parallel kernel C code executes in thread blocks across multiple processing elements
Serial Code
. . .
Parallel Kernel: foo<<<nBlk, nTid>>>(args);
Serial Code
. . .
Parallel Kernel: bar<<<nBlk, nTid>>>(args);
What is a thread?
• Independent thread of execution
  - has its own PC, variables (registers), processor state, etc.
  - no implication about how threads are scheduled
• CUDA threads might be physical threads
  - as on NVIDIA GPUs
• CUDA threads might be virtual threads
  - might pick 1 block = 1 physical thread on multicore CPU
What is a thread block?
• Thread block = virtualized multiprocessor
  - freely choose processors to fit data
  - freely customize for each kernel launch
• Thread block = a (data) parallel task
  - all blocks in kernel have the same entry point
  - but may execute any code they want
• Thread blocks of kernel must be independent tasks
  - program valid for any interleaving of block executions
Blocks must be independent
• Any possible interleaving of blocks should be valid
  - presumed to run to completion without pre-emption
  - can run in any order
  - can run concurrently OR sequentially
• Blocks may coordinate but not synchronize
  - shared queue pointer: OK
  - shared lock: BAD … can easily deadlock
• Independence requirement gives scalability
Levels of parallelism
• Thread parallelism
  - each thread is an independent thread of execution
• Data parallelism
  - across threads in a block
  - across blocks in a kernel
• Task parallelism
  - different blocks are independent
  - independent kernels
Block = virtualized multiprocessor
• Provides programmer flexibility
  - freely choose processors to fit data
  - freely customize for each kernel launch
• Thread block = a (data) parallel task
  - all blocks in kernel have the same entry point
  - but may execute any code they want
• Thread blocks of kernel must be independent tasks
  - program valid for any interleaving of block executions
Scalable Execution Model
• Kernel launched by host
• Blocks run on multiprocessors
[Figure: a row of multiprocessors, each with an MT IU, SPs, and its own shared memory, all sitting above a common device memory.]
Synchronization & Cooperation
• Threads within block may synchronize with barriers
  … Step 1 …
  __syncthreads();
  … Step 2 …
• Blocks coordinate via atomic memory operations
  - e.g., increment shared queue pointer with atomicInc()
• Implicit barrier between dependent kernels
  vec_minus<<<nblocks, blksize>>>(a, b, c);
  vec_dot<<<nblocks, blksize>>>(c, c);
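A minimal sketch of how the barrier and atomic mechanisms on this slide can fit together, assuming a device that supports global-memory atomics. The kernel name `block_sum`, the fixed 256-thread block size, and the use of `atomicAdd()` (rather than the `atomicInc()` queue pointer the slide mentions) are illustrative choices, not taken from the deck.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block sums its slice in shared memory; one thread per block then
// adds the block's partial sum to a global total with an atomic.
// Assumption: launched with exactly 256 threads per block.
__global__ void block_sum(const int *in, int n, int *total)
{
    __shared__ int partial[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0;   // Step 1: every thread loads one value
    __syncthreads();                      // barrier: all loads are now visible

    // Step 2: tree reduction inside the block, one barrier per level
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Blocks coordinate through an atomic, never through a lock or barrier
    if (tid == 0) atomicAdd(total, partial[0]);
}

int main()
{
    const int n = 1000;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in, *d_total, h_total = 0;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    int nblocks = (n + 255) / 256;
    block_sum<<<nblocks, 256>>>(d_in, n, d_total);

    // This copy waits for the kernel to finish before reading the result
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d (n = %d)\n", h_total, n);

    cudaFree(d_in);
    cudaFree(d_total);
    return 0;
}
```

Because each block only ever adds to the total, the result does not depend on the order in which blocks run, which is the independence property the earlier slides ask for.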
CUDA Memories
G80 Implementation of CUDA Memories
• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read only per-grid constant memory
• The host can R/W global, constant, and texture memories
[Figure: a grid with Block (0, 0) and Block (1, 0); each block has its own shared memory and threads (0, 0) and (1, 0) with private registers. Global memory and constant memory serve the whole grid and are reachable from the host.]
[Figure: memory visible at each level. Thread: local memory. Block: shared memory. Grid 0, Grid 1, … (sequential grids in time): global memory.]
Memory model
[Figure: Device 0 memory, Device 1 memory, and host memory as separate memory spaces.]
A Common Programming Strategy
• Global memory resides in device memory (DRAM): much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory to global memory
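The tiling steps above can be sketched with a matrix multiply, a common illustration of the pattern. This is a sketch under stated assumptions, not code from the deck: the kernel name `matmul_tiled`, the tile size of 16, and the requirement that `n` be a multiple of the tile size are all choices made here for brevity. One deviation from the slide: each thread accumulates its result in a register and writes it straight to global memory, so there is no separate shared-to-global copy.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // tile edge; two 16x16 float tiles use 2 KB of shared memory

// C = A * B for square n x n row-major matrices.
// Assumption for brevity: n is a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Load: each thread copies one element of each tile from global memory
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tile fully loaded before anyone reads it

        // Compute: each loaded element is reused TILE times from shared memory
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone is done before the tile is overwritten
    }

    C[row * n + col] = acc;  // store: one global write per thread
}

int main()
{
    const int n = 64;
    const size_t bytes = n * n * sizeof(float);

    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    // With A all ones and B all twos, every entry of C should equal 2*n
    printf("C[0] = %.1f, expected %.1f\n", h_C[0], 2.0f * n);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

Without tiling, each thread would read 2n values from global memory; with it, each value is fetched from global memory once per block and then reused from shared memory.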
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM): much slower access than shared memory
  - But… cached!
  - Highly efficient access for read-only data
• Carefully divide data according to access patterns
  - Read-only data: constant memory (very fast if in cache)
  - R/W data shared within a block: shared memory (very fast)
  - R/W data private to each thread: registers (very fast)
  - R/W inputs/results: global memory (very slow)
Is that all??
• No!!
• Memory coalescing
• Bank conflicts
Memory Coalescing
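• When accessing global memory, peak bandwidth utilization occurs when the threads that execute together (a half-warp on G80-class GPUs) access contiguous memory locations.
[Figure: matrices Md and Nd of size WIDTH × WIDTH, with the access patterns of Thread 1 and Thread 2; one pattern is labelled "Not coalesced", the other "coalesced".]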
Memory Layout of a Matrix in C
[Figure: a 4 × 4 matrix M stored linearly in memory as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3. Arrows show the access direction in kernel code and which elements threads T1 to T4 touch in Time Period 1 and Time Period 2.]
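To make the coalescing rule concrete, here is a hedged sketch of two kernels that copy the same row-major matrix. In the first, neighbouring threads touch neighbouring addresses; in the second, neighbouring threads are a full row apart. The kernel names, matrix size, and the event-based timing harness are assumptions introduced for illustration; the measured times depend on the GPU, so none are quoted here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Row-major w x h matrix. Consecutive threads read consecutive addresses.
__global__ void copy_coalesced(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column follows threadIdx.x
    int y = blockIdx.y;
    if (x < w && y < h) out[y * w + x] = in[y * w + x];
}

// Same copy, but consecutive threads are w floats apart in memory.
__global__ void copy_strided(const float *in, float *out, int w, int h)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;  // row follows threadIdx.x
    int x = blockIdx.y;
    if (x < w && y < h) out[y * w + x] = in[y * w + x];
}

int main()
{
    const int w = 1024, h = 1024, threads = 256;
    const size_t bytes = (size_t)w * h * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Coalesced version: one grid row per matrix row
    dim3 gridC((w + threads - 1) / threads, h);
    cudaEventRecord(start, 0);
    copy_coalesced<<<gridC, threads>>>(d_in, d_out, w, h);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("coalesced copy: %.3f ms\n", ms);

    // Strided version: one grid row per matrix column
    dim3 gridS((h + threads - 1) / threads, w);
    cudaEventRecord(start, 0);
    copy_strided<<<gridS, threads>>>(d_in, d_out, w, h);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("strided copy:   %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Both kernels move exactly the same data; only the mapping from thread index to address differs, which is the property the figures on these slides contrast.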
Parallel Memory Architecture for Shared memory
• In a parallel machine, many threads access memory
  - Therefore, memory is divided into banks
  - Essential to achieve high bandwidth
• Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
[Figure: shared memory drawn as Bank 0 through Bank 15.]
Bank Addressing Examples
• No bank conflicts: linear addressing, stride == 1
• No bank conflicts: random 1:1 permutation
[Figures: Threads 0 to 15 mapped onto Banks 0 to 15, one thread per bank in both cases.]
Bank Addressing Examples
• 2-way bank conflicts: linear addressing, stride == 2
• 8-way bank conflicts: linear addressing, stride == 8
[Figures: with stride 2, pairs of threads land on the same bank; with stride 8, the threads land on two banks only, each marked x8.]
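The conflict degrees in these examples can be checked with a few lines of arithmetic. The sketch below assumes 16 banks of 32-bit words and 16 threads issuing together, matching the Bank 0 to Bank 15 figures; the function name `conflict_degree` is hypothetical. It runs entirely on the host and only models the bank mapping, so it compiles as ordinary C++ or with nvcc.

```cuda
#include <cstdio>

#define NUM_BANKS 16  // assumption: 16 banks, as drawn in the slides

// Worst-case number of threads hitting the same bank when thread t of a
// 16-thread group reads 32-bit word (t * stride) from shared memory.
int conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < NUM_BANKS; ++t) {
        int bank = (t * stride) % NUM_BANKS;  // successive words map to successive banks
        if (++hits[bank] > worst) worst = hits[bank];
    }
    return worst;
}

int main()
{
    const int strides[] = {1, 2, 3, 8};
    for (int i = 0; i < 4; ++i)
        printf("stride %d -> %d-way access\n", strides[i], conflict_degree(strides[i]));
    // prints:
    // stride 1 -> 1-way access
    // stride 2 -> 2-way access
    // stride 3 -> 1-way access
    // stride 8 -> 8-way access
    return 0;
}
```

A result of 1 means no conflict. Odd strides such as 3 are conflict-free because they share no common factor with the bank count, which is why padding a shared array by one element is a frequent remedy.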
Summary of CUDA programming tips
• Divide the overall task between concurrent non-communicating threads
• Design for coalesced access to global memory
• Avoid bank conflicts when accessing shared memory
Programming on CUDA
Basic steps
• Transfer data from CPU to GPU
• Explicitly call the GPU kernel you designed
  - CUDA will implicitly assign threads to each multiprocessor, and assign resources for computations
• Transfer results back from GPU to CPU
CPU vs GPU
• CPU – operation intensive
  - Goal: reduce the number of operations performed at the expense of additional memory accesses
• GPU – memory intensive
  - Goal: reduce the number of memory accesses at the expense of additional operations
Memory model
[Figure: host memory, Device 0 memory, and Device 1 memory; data moves between them with cudaMemcpy().]
CUDA: Minimal extensions to C/C++
• Declaration specifiers to indicate where things live
  __global__ void KernelFunc(...);  // kernel callable from host
  __device__ void DeviceFunc(...);  // function callable on device
  __device__ int GlobalVar;         // variable in device memory
  __shared__ int SharedVar;         // in per-block shared memory
• Extend function invocation syntax for parallel kernel launch
  KernelFunc<<<500, 128>>>(...);    // 500 blocks, 128 threads each
• Special variables for thread identification in kernels
  dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;
• Intrinsics that expose specific operations in kernel code
  __syncthreads();                  // barrier synchronization
CUDA: Features available on GPU
• Standard mathematical functions
  - sinf, powf, atanf, ceil, min, sqrtf, expf, erfc
  - And many more standard mathematical functions
CUDA: Runtime support
• Explicit memory allocation returns pointers to GPU memory
  cudaMalloc(), cudaFree()
• Explicit memory copy for host ↔ device, device ↔ device
  cudaMemcpy(), cudaMemcpy2D(), ...
• Texture management
  cudaBindTexture(), cudaBindTextureToArray(), ...
• OpenGL & DirectX interoperability
  cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];   // no bounds check: assumes N is a multiple of 256
}
int main()
{
    // Run N/256 blocks of 256 threads each
    // (integer division: all N elements are covered only if N is a multiple of 256)
    vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
Example: Host code for vecAdd
// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;
// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));
// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
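The host code on the slide stops at the kernel launch. A complete version, following the three basic steps listed earlier (copy in, launch, copy back), might look like the sketch below. Several things here are additions rather than the deck's code: the extra `n` parameter with a bounds check, the rounded-up block count, the `check()` helper (hypothetical), and the choice of `N = 1000` to show a size that is not a multiple of 256.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread performs one pair-wise addition; the guard lets the last
// block run with some idle threads when n is not a multiple of the block size.
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main()
{
    const int N = 1000;  // deliberately not a multiple of 256
    const size_t bytes = N * sizeof(float);

    // allocate and initialize host (CPU) memory
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    check(cudaMalloc((void **)&d_A, bytes), "cudaMalloc d_A");
    check(cudaMalloc((void **)&d_B, bytes), "cudaMalloc d_B");
    check(cudaMalloc((void **)&d_C, bytes), "cudaMalloc d_C");

    // step 1: copy host memory to device
    check(cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice), "copy B");

    // step 2: launch enough 256-thread blocks to cover all N elements
    int nblocks = (N + 255) / 256;
    vecAdd<<<nblocks, 256>>>(d_A, d_B, d_C, N);
    check(cudaGetLastError(), "kernel launch");

    // step 3: copy the result back (waits for the kernel to finish)
    check(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost), "copy C");

    // verify on the host: C[i] should equal 3*i
    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != 3.0f * i) ++errors;
    printf("%d mismatches out of %d elements\n", errors, N);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return errors != 0;
}
```

Rounding the block count up and guarding the index is the same pattern the SAXPY slide uses with `(n + 255) / 256` and `if (i < n)`.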
CUDA libraries – CUFFT & CUBLAS
CUBLAS and CUFFT
• Standard libraries with development kit
• CUBLAS
  - CUDA version of BLAS
  - Available in single and double precision, for real and complex numbers
  - Double-precision version is slower
• CUFFT
  - FFT & IFFT on CUDA
  - Faster than the fastest CPU algorithm
CUBLAS
• 3 classes
  1. Vector operations: vector addition, norm, dot product, etc.
  2. Matrix-vector operations: matrix-vector product for symmetric and general matrices, etc.
  3. Matrix-matrix operations: matrix multiplication, etc.
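As a small illustration of the first class (vector operations), the sketch below calls cuBLAS for a SAXPY and a dot product. It assumes the handle-based `cublas_v2.h` interface, which is newer than the interface that shipped when these slides were written, and it is built with something like `nvcc example.cu -lcublas`. The data values are made up for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 4;
    float h_x[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_y[n] = {10.0f, 20.0f, 30.0f, 40.0f};
    const float alpha = 2.0f;

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cublasCreate failed\n");
        return 1;
    }

    // Copy the vectors to the device (stride 1 on both sides)
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    // Level-1 BLAS: y = alpha * x + y, computed on the GPU
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    // Level-1 BLAS: dot product of x and the updated y, returned to the host
    float dot = 0.0f;
    cublasSdot(handle, n, d_x, 1, d_y, 1, &dot);

    // Copy the updated y back and print it
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
    for (int i = 0; i < n; ++i)
        printf("y[%d] = %.1f\n", i, h_y[i]);
    printf("x . y = %.1f\n", dot);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

No kernel is written by hand here: the library supplies the tuned implementation, and the program only manages device memory and the handle.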
Advantages
• Highly optimized design
• Usable as standard C/C++/Fortran libraries
• Caters to the needs of many scientific computing tasks
OpenCL
CUDA: An Architecture for Massively Parallel Computing
ATI’s Compute “Solution”
OpenCL vs. C for CUDA
• C for CUDA: entry point for developers who prefer high-level C
• OpenCL: entry point for developers who want low-level API
• Shared back-end compiler & optimization technology: both paths go through PTX to the GPU
![Page 60: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/60.jpg)
![Page 61: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/61.jpg)
Recall: GPU and CUDA
• GPU – developed for accelerating graphics
• CUDA – developed to harness the power of GPUs for general purpose applications
  - Like C in syntax
• GPU – not a panacea
  - Used in a master-slave scenario with CPU (host) as master
![Page 62: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/62.jpg)
Recall: GPU memories
• Each thread can:
  - Read/write per-thread registers
  - Read/write per-thread local memory
  - Read/write per-block shared memory
  - Read/write per-grid global memory
  - Read only per-grid constant memory
• Divide data according to access patterns
  - Read-only: constant memory (very fast if in cache)
  - R/W shared within a block: shared memory (very fast)
  - R/W within each thread: registers (very fast)
  - R/W inputs/results: global memory (very slow)
[Figure: a grid with global and constant memory reachable from the host; blocks (0, 0) and (1, 0) each with shared memory; threads (0, 0) and (1, 0) in each block with their own registers]
![Page 63: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/63.jpg)
Recall: Thread organization
[Figure: each thread has its own local memory; each block has shared memory; grids that run sequentially in time (Grid 0, Grid 1, ...) share global memory]
![Page 64: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/64.jpg)
Recall: Heterogeneous programming
// CPU code
cudaMalloc()                    // allocate memory on the device
cudaMemcpy()                    // transfer input data to the device
Kernel<<<blocks, threads>>>()   // launch a CUDA kernel
// a kernel is a function that every launched thread executes
cudaMemcpy()                    // transfer results from the device
Keywords: __global__, __shared__, __device__
Special math functions: sinf, expf, min, etc.
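The allocate, copy, launch, copy-back sequence above can be sketched end to end with a vector addition. The names `vecAddKernel` and `vecAdd` are hypothetical, the block size of 256 is an arbitrary choice, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Kernel: every thread runs this function and handles one element.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard the last, partially filled block
}

// Host wrapper (hypothetical helper); assumes a and b are non-empty and equal in size.
std::vector<float> vecAdd(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n);

    // Allocate memory on the device
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Transfer input data to the device
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch the kernel with enough blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAddKernel<<<blocks, threads>>>(dA, dB, dC, n);

    // Transfer results back; this copy waits for the kernel to finish
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return c;
}
```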
![Page 65: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/65.jpg)
Case Study: Matrix Multiplication
![Page 66: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/66.jpg)
Matrix Multiplication Kernel using Multiple Blocks
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;
  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
  Pd[Row*Width+Col] = Pvalue;
}
![Page 67: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/67.jpg)
How about performance on G80?
• All threads access global memory for their input matrix elements
  - Two memory accesses (8 bytes) per floating point multiply-add
  - 4 B/s of memory bandwidth per FLOPS
  - 4*346.5 = 1386 GB/s required to achieve peak FLOP rating
  - 86.4 GB/s limits the code at 21.6 GFLOPS
• The actual code runs at about 15 GFLOPS
• Need to drastically cut down memory accesses to get closer to the peak 346.5 GFLOPS
[Figure: a grid with global and constant memory reachable from the host; blocks (0, 0) and (1, 0) each with shared memory; threads (0, 0) and (1, 0) in each block with their own registers]
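The bandwidth bound for the untiled kernel can be checked step by step. The only assumption added here is that one multiply-add counts as 2 floating-point operations, which is what the slide's 4 bytes per FLOP figure implies.

```latex
\begin{align*}
\text{traffic per multiply-add} &= 2 \times 4\,\text{B} = 8\,\text{B} \\
\text{bytes per FLOP} &= \frac{8\,\text{B}}{2\,\text{FLOP}} = 4\,\text{B/FLOP} \\
\text{bandwidth needed at peak} &= 4\,\text{B/FLOP} \times 346.5\,\text{GFLOPS} = 1386\,\text{GB/s} \\
\text{rate allowed by } 86.4\,\text{GB/s} &= \frac{86.4\,\text{GB/s}}{4\,\text{B/FLOP}} = 21.6\,\text{GFLOPS}
\end{align*}
```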
![Page 68: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/68.jpg)
Use Shared Memory to reuse global memory data
• Each input element is read by Width threads.
• Load each element into Shared Memory and have several threads use the local version to reduce the memory bandwidth
  - Tiled algorithms
[Figure: matrices M, N, and P, each WIDTH x WIDTH; the thread with indices (tx, ty) computes one element of P]
![Page 69: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/69.jpg)
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes the sub-matrix Pdsub, and thread (tx, ty), with indices 0 to TILE_WIDTH-1, computes one of its elements]
![Page 70: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/70.jpg)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE498AL University of Illinois
A Small Example
[Figure: a 4 x 4 product Pd with elements Pd0,0 to Pd3,3, shown with the two rows of Md (Md0,0 to Md3,1) and the two columns of Nd (Nd0,0 to Nd1,3) used for its upper-left 2 x 2 tile]
![Page 71: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/71.jpg)
Every Md and Nd Element is used exactly twice in generating a 2X2 tile of P

| P0,0 (thread0,0) | P1,0 (thread1,0) | P0,1 (thread0,1) | P1,1 (thread1,1) |
|---|---|---|---|
| M0,0 * N0,0 | M0,0 * N1,0 | M0,1 * N0,0 | M0,1 * N1,0 |
| M1,0 * N0,1 | M1,0 * N1,1 | M1,1 * N0,1 | M1,1 * N1,1 |
| M2,0 * N0,2 | M2,0 * N1,2 | M2,1 * N0,2 | M2,1 * N1,2 |
| M3,0 * N0,3 | M3,0 * N1,3 | M3,1 * N0,3 | M3,1 * N1,3 |

Access order: top row first, bottom row last.
![Page 72: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/72.jpg)
Breaking Md and Nd into Tiles
[Figure: the same 4 x 4 example, with Md, Nd, and Pd partitioned into 2 x 2 tiles]
![Page 73: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/73.jpg)
First-order Size Considerations in G80
• Each thread block should have many threads
  - TILE_WIDTH of 16 gives 16*16 = 256 threads
• There should be many thread blocks
  - A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor
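The load and operation counts above are per tile phase for TILE_WIDTH = 16. Written out, they show a 16-fold improvement in arithmetic per global load; the untiled figure of one operation per load comes from the untiled kernel's two loads per multiply-add.

```latex
\begin{align*}
\text{loads per block per phase} &= 2 \times 16 \times 16 = 512 \\
\text{operations per block per phase} &= 256 \times (2 \times 16) = 8192 \\
\text{operations per load (tiled)} &= \frac{8192}{512} = 16 \\
\text{operations per load (untiled)} &= \frac{2}{2} = 1
\end{align*}
```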
![Page 74: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/74.jpg)
CUDA Code – Kernel Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
![Page 75: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/75.jpg)
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x; int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;
  // Identify the row and column of the Pd element to work on
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;
  // Loop over the Md and Nd tiles required to compute the Pd element
  // (Width is assumed to be a multiple of TILE_WIDTH)
  for (int m = 0; m < Width/TILE_WIDTH; ++m) {
    // Collaborative loading of Md and Nd tiles into shared memory
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
    Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
    __syncthreads();  // wait until the whole tile has been loaded
    for (int k = 0; k < TILE_WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
    __syncthreads();  // wait before the tile is overwritten in the next phase
  }
  Pd[Row*Width+Col] = Pvalue;
}
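A minimal sketch of how a tiled kernel might be driven from the host follows. It is a variant of the slide's kernel rather than a copy: the bounds checks, the zero padding of partial tiles, and the rounded-up grid size are additions so that Width need not be a multiple of TILE_WIDTH. The names `MatrixMulTiled` and `matMul` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE_WIDTH 16

// Tiled kernel variant: out-of-range tile entries are loaded as zeros.
__global__ void MatrixMulTiled(const float* Md, const float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0.0f;
    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;  // round up
    for (int m = 0; m < numTiles; ++m) {
        int mCol = m * TILE_WIDTH + tx;  // column of Md read by this thread
        int nRow = m * TILE_WIDTH + ty;  // row of Nd read by this thread
        Mds[ty][tx] = (Row < Width && mCol < Width) ? Md[Row * Width + mCol] : 0.0f;
        Nds[ty][tx] = (nRow < Width && Col < Width) ? Nd[nRow * Width + Col] : 0.0f;
        __syncthreads();  // tile fully loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // everyone done before the next load
    }
    if (Row < Width && Col < Width) Pd[Row * Width + Col] = Pvalue;
}

// Host wrapper (hypothetical helper): M and N are Width x Width, row-major.
std::vector<float> matMul(const std::vector<float>& M, const std::vector<float>& N, int Width) {
    size_t bytes = (size_t)Width * Width * sizeof(float);
    std::vector<float> P((size_t)Width * Width);
    float *Md, *Nd, *Pd;
    cudaMalloc(&Md, bytes);
    cudaMalloc(&Nd, bytes);
    cudaMalloc(&Pd, bytes);
    cudaMemcpy(Md, M.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, N.data(), bytes, cudaMemcpyHostToDevice);

    // Execution configuration: one thread per Pd element, grid rounded up
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    int tiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 dimGrid(tiles, tiles);
    MatrixMulTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaMemcpy(P.data(), Pd, bytes, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    return P;
}
```

Multiplying by the identity matrix is a convenient first check, since the result must reproduce the other operand exactly.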
![Page 76: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/76.jpg)
Tiled Multiply
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH
• Each thread computes one element of Pdsub
[Figure: Md, Nd, and Pd, each WIDTH x WIDTH, divided into TILE_WIDTH x TILE_WIDTH tiles; block (bx, by) computes Pdsub, m indexes the tile phases along Md and Nd, and k indexes the elements within a tile]
![Page 77: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/77.jpg)
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
  - SM size is implementation dependent!
  - For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE_WIDTH 32 would lead to 2*32*32*4B = 8KB shared memory usage per thread block, allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
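The shared-memory figures on this slide follow from simple arithmetic, written out below. This counts shared memory only; other per-SM limits, such as the maximum number of resident threads, can lower the number of active blocks further.

```latex
\begin{align*}
\text{shared memory per block } (16 \times 16) &= 2 \times 16 \times 16 \times 4\,\text{B} = 2048\,\text{B} = 2\,\text{KB} \\
\text{blocks per SM} &\le \frac{16\,\text{KB}}{2\,\text{KB}} = 8 \\
\text{shared memory per block } (32 \times 32) &= 2 \times 32 \times 32 \times 4\,\text{B} = 8\,\text{KB}
  \;\Rightarrow\; \text{blocks per SM} \le 2 \\
\text{rate supported by } 86.4\,\text{GB/s} &= \frac{86.4}{4} \times 16 = 21.6 \times 16 = 345.6\,\text{GFLOPS}
\end{align*}
```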
![Page 78: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/78.jpg)
Tiling Size Effects
[Figure: bar chart of GFLOPS (axis from 0 to 100) for not tiled, 4x4 tiles, 8x8 tiles, 12x12 tiles, and 16x16 tiles, each shown as "tiled only" and "tiled & unrolled"]
![Page 79: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/79.jpg)
Typical Structure of a CUDA Program
• Global variables declaration
  - __host__
  - __device__... __global__, __constant__, __texture__
• Function prototypes
  - __global__ void kernelOne(…)
  - float handyFunction(…)
• Main ()
  - allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes )
  - transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  - execution configuration setup
  - kernel call – kernelOne<<<execution configuration>>>( args… );
  - transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
  - optional: compare against golden (host computed) solution
  - (repeat as needed)
• Kernel – void kernelOne(type args,…)
  - variables declaration - __local__, __shared__
    - automatic variables transparently assigned to registers or local memory
  - __syncthreads()…
• Other functions
  - float handyFunction(int inVar…);
![Page 80: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/80.jpg)
GPU for Machine learning
![Page 81: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/81.jpg)
Machine learning
• With improved sensors, the amount of data available has increased severalfold over the past decade.
• Also, more robust and sophisticated learning algorithms have been developed to extract meaningful information from the data
• This has resulted in the application of these algorithms in many areas:
  - Geostatistics, astronomical predictions, weather data assimilation, computational finance.
![Page 82: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/82.jpg)
Extracting information from the data
• “Extracting information from the data” means converting the raw data to an interpretable version
  - For example, given a face image, it would be desirable to extract the identity of the person, the face pose, etc.
• Information extraction categories
  - Regression – [fitting a continuous function]
  - Classification – [classify into one of the predefined classes]
  - Density estimation – [evaluating the class membership]
  - Ranking – [preference relationships between classes]
• Bottom-line: Infer the relationships based on the data
  - Build the relationship model from the data
![Page 83: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/83.jpg)
Relationship modeling
• There are two primary categories of models
  - Parametric
  - Non-parametric
• Parametric model:
  - Assumes a known parametric form of the “relationship”
  - Estimates the parameters of this “form” from the data
• Non-parametric model
  - Does not make any assumptions on the form of the underlying function.
  - “Letting the data speak for itself”
![Page 84: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/84.jpg)
Kernel methods
• A class of robust non-parametric learning methods
• Projects the data into a higher dimensional space
• Formulates the problem such that only the inner products of the higher dimensional features are required
• The inner products are given by the kernel functions
• For example the Gaussian kernel is given by:
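The formula itself is not preserved in this text. One form that is consistent with the Gaussian kernel code later in these slides (squared differences scaled by a per-dimension bandwidth h_k, then a negative exponential) is sketched below. The symbols d, for the dimension, and h_k are assumptions taken from that code, and other normalizations, such as 2h^2 in the denominator, are also common.

```latex
\[
K(x, y) \;=\; \exp\!\left( -\sum_{k=1}^{d} \frac{(x_k - y_k)^2}{h_k^2} \right)
\]
```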
![Page 85: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/85.jpg)
Scalable learning methods
• Most of these kernel-based learning approaches scale as O(N^2) or O(N^3) in time with respect to the number of data points N
• Many of them also have an O(N^2) memory requirement
• This is undesirable for very large datasets
• We would like to develop a parallelized version on GPU
![Page 86: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/86.jpg)
Kernel methods on GPUs
• There are problems where summations of kernel functions need to be evaluated
  - Algorithm must map summation to multiple threads
• Some problems require the solution of a linear system involving kernel matrices
  - Possibly use the kernel summation from the previous case with popular iterative approaches like conjugate gradient
• There also exist problems where popular matrix decompositions like LU need to be performed for kernel matrices
  - A number of approaches already exist on GPUs
![Page 87: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/87.jpg)
Solving Ky=b
• Can use iterative methods
  - Conjugate Gradient
• In each iteration, need to evaluate Kx
• We will discuss the matrix-vector product now
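For reference, the standard conjugate gradient recurrences for Ky = b are sketched below, assuming K is symmetric positive definite. The iteration index t and the vectors r (residual) and p (search direction) are notation introduced here. The only place K enters each iteration is the product K p_t, which is the matrix-vector product referred to above.

```latex
\begin{align*}
r_0 &= b - K y_0, \qquad p_0 = r_0 \\
\alpha_t &= \frac{r_t^{\top} r_t}{p_t^{\top} K p_t} \\
y_{t+1} &= y_t + \alpha_t\, p_t \\
r_{t+1} &= r_t - \alpha_t\, K p_t \\
\beta_t &= \frac{r_{t+1}^{\top} r_{t+1}}{r_t^{\top} r_t} \\
p_{t+1} &= r_{t+1} + \beta_t\, p_t
\end{align*}
```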
![Page 88: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/88.jpg)
Kernel matrix – special structure
• O(N) dependence
  - N x N matrix depends on an O(N)-length vector
• Need only O(N) space
• Need to exploit this to minimize space requirements
![Page 89: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/89.jpg)
Kernel summation on GPU
• Data:
  - Source points x_i, i=1,…,N
  - Evaluation points y_j, j=1,…,M
• Each thread evaluates the sum corresponding to one evaluation point:
• Algorithm:
1. Load the evaluation point corresponding to the current thread into a local register.
2. Load the first chunk of source data into shared memory.
3. Evaluate the part of the kernel sum corresponding to the source data in shared memory.
4. Store the result in a local register.
5. If not all source points have been processed yet, load the next chunk and go to Step 3.
6. Write the sum in the local register to global memory.
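The sum each thread evaluates is not preserved in this text. A form consistent with the steps above and with the code on the next slide, where q holds one weight per source point and f holds one result per evaluation point, would be the following; the symbols q_i and f are taken from that code, and K denotes the kernel function.

```latex
\[
f(y_j) \;=\; \sum_{i=1}^{N} q_i \, K(x_i, y_j), \qquad j = 1, \dots, M
\]
```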
![Page 90: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/90.jpg)
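Gaussian kernel on GPU
// The signature, the load steps, and the row-major layout of X are reconstructed assumptions.
// DIM and BLOCK_SIZE are compile-time constants; launch with BLOCK_SIZE threads per block.
__global__ void GaussianSum(const float* X, const float* y, const float* q, const float* h, float* f, int N, int M)
{
  float sum=0.0f;
  __shared__ float hs[DIM];
  __shared__ float qs[BLOCK_SIZE];
  __shared__ float Xs[BLOCK_SIZE][DIM];
  float yr[DIM];
  int j=blockIdx.x*BLOCK_SIZE + threadIdx.x;   // evaluation point of this thread
  // load this thread's evaluation point into registers
  for (int k=0;k<DIM;k++)
    yr[k]=(j<M) ? y[k+j*DIM] : 0.0f;
  // load 'h' (one bandwidth per dimension)
  for (int k=threadIdx.x;k<DIM;k+=BLOCK_SIZE) hs[k]=h[k];
  for (int b=0;b<N;b+=BLOCK_SIZE){
    // load X & q: each thread loads one source point of the current chunk
    if ((b+threadIdx.x)<N){
      qs[threadIdx.x]=q[b+threadIdx.x];
      for (int k=0;k<DIM;k++) Xs[threadIdx.x][k]=X[k+(b+threadIdx.x)*DIM];
    }
    __syncthreads();   // chunk (and 'h') fully loaded
    for (int i=0;i<BLOCK_SIZE;i++){
      float dist=0.0f;
      if ((b+i)<N){
        for (int k=0;k<DIM;k++){
          float tempDiff=yr[k]-Xs[i][k];
          dist+=(tempDiff*tempDiff)/(hs[k]*hs[k]);
        }
        sum+=__expf(-dist)*qs[i];
      }
    }
    __syncthreads();   // finish with this chunk before it is overwritten
  }
  if (j<M)
    f[j] = sum;
}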
![Page 91: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/91.jpg)
Kernels tested:
Gaussian
Matern
Periodic
Epanechnikov
![Page 92: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/92.jpg)
Raw speedups across dimension
![Page 93: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/93.jpg)
Applications
• Kernel density estimation
• Gaussian process regression
• Meanshift clustering
• Ranking
• And many more…
![Page 94: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/94.jpg)
Kernel Density Estimation
• Non-parametric way of estimating the probability density function of a random variable
• Two popular kernels: Gaussian and Epanechnikov
• Accelerated with GPU based algorithm: Speed up ~ 450X
![Page 95: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/95.jpg)
Results on standard distributions
• Performed KDE on 15 normal mixture densities from [1]
[1] J. S. Marron and M. P. Wand, 'Exact Mean Integrated Squared Error', The Annals of Statistics, 1992, Vol. 20, No. 2, 712-736
![Page 96: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/96.jpg)
Gaussian Process Regression
• Non-parametric regression
  - Kernel regression
  - Robust in non-linear modeling
• For ; given y and x, need to model ‘f’
• Given
  - Data D = {x_i, y_i}, i=1..N
  - Test point x*, need to find f(x*) or f*
• GPR model: $f_* = k_*(x)\,(K + \sigma^2 I)^{-1}\,y$
  - K = kernel matrix of training data
  - k* = kernel vector of test point w.r.t. all training data
![Page 97: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/97.jpg)
Gaussian Process Regression
• GPR model: $f_* = k_*(x)\,(K + \sigma^2 I)^{-1}\,y$
• Complexity – O(N^3) due to the inversion of the kernel matrix (can be made O(N^2) using Conjugate Gradient)
  - A popular iterative Krylov algorithm
• Popular kernels that are used in GPR are:
  - Gaussian
  - Matern
  - Periodic
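The reduction from O(N^3) to O(N^2) comes from never forming the inverse explicitly. A sketch of the reasoning, with the intermediate vector α introduced here for illustration:

```latex
\begin{align*}
(K + \sigma^2 I)\,\alpha &= y
  && \text{solve iteratively, e.g. with conjugate gradient} \\
f_* &= k_*(x)\,\alpha
  && \text{one inner product per test point}
\end{align*}
```

Each conjugate gradient iteration needs one product with K + σ²I, which costs O(N^2), so the total stays O(N^2) as long as the number of iterations is small compared to N.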
![Page 98: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/98.jpg)
Gaussian Process Regression
[Figure: log-log plot "Performance of Gaussian Process Regression with Gaussian kernel"; x-axis: size of data (10^2 to 10^5); y-axis: time taken (10^-2 to 10^2); series: CPU and GPU]
![Page 99: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/99.jpg)
GPR on standard datasets
![Page 100: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/100.jpg)
GPU based kernel summation
• Still O(N^2)!
• A linear approximation algorithm can beat this beyond “some” N
• FMM-based Gaussian kernel (FIGTREE) vs GPU version
![Page 101: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/101.jpg)
FIGTREE vs GPU 1
![Page 102: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/102.jpg)
FIGTREE vs GPU 2
![Page 103: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/103.jpg)
FIGTREE vs GPU 3
![Page 104: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/104.jpg)
Further :
• More interesting: FMM on GPU
• Issues on data structures
• Need to consider many factors.
• Will be discussed next class by Dr. Nail Gumerov
![Page 105: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/105.jpg)
Quotes
![Page 106: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/106.jpg)
GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems.
Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.
Jack Dongarra
Professor, University of Tennessee
Author of Linpack
![Page 107: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/107.jpg)
We’ve all heard ‘desktop supercomputer’ claims in the past, but this time it’s for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace.
Heterogeneous computing is what makes such a breakthrough possible.
Burton Smith
Technical Fellow, Microsoft
Formerly, Chief Scientist at Cray
![Page 108: The “New” Moore’s Laramani/cmsc662/GPU_November_10.pdf · • You must re-think your algorithms to be parallel ! • Data-parallel computing is most scalable solution. ... on](https://reader033.vdocuments.us/reader033/viewer/2022050111/5f4845e767a88c4dc539a4e9/html5/thumbnails/108.jpg)