![Page 1: ACACES 2018 Summer School GPU Architectures: …adwaitjog.github.io/teach/acaces2018/acaces-2018-slides...GPU Instruction Set Architecture (ISA) qNVIDIA defines a virtual ISA, called](https://reader030.vdocuments.us/reader030/viewer/2022040915/5e8c6921126f011b1a710a47/html5/thumbnails/1.jpg)
ACACES 2018 Summer School
GPU Architectures: Basic to Advanced Concepts
Adwait Jog, Assistant Professor
College of William & Mary (http://adwaitjog.github.io/)

Course Outline
- Lectures 1 and 2: Basic Concepts
  - Basics of GPU Programming
  - Basics of GPU Architecture
- Lecture 3: GPU Performance Bottlenecks
  - Memory Bottlenecks
  - Compute Bottlenecks
  - Possible Software and Hardware Solutions
- Lecture 4: GPU Security Concerns
  - Timing channels
  - Possible Software and Hardware Solutions

Streaming Multi-Processor (SM)
[Figure: GPU-1 (Device 1) contains several SMs; each SM has control logic, a scratchpad, a register file, and an array of processing elements (PEs)]
- Threads are assigned to an SM at block granularity
- The SM maintains thread/block index numbers
- The SM manages and schedules thread execution
- Multiple blocks can be allocated to one SM
  - The number depends on the available resources (shared memory, register file, etc.)
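
The resource-based block allocation above can be sketched as a small host-side calculation. This is only an illustrative model: the struct names, the field names, and the limits used in the usage note are hypothetical and do not describe any specific GPU.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits; real values depend on the GPU generation.
struct SmLimits {
    int sharedMemBytes;  // scratchpad capacity per SM
    int registers;       // register file entries per SM
    int maxThreads;      // resident-thread limit per SM
    int maxBlocks;       // resident-block limit per SM
};

// Resources requested by one thread block of a kernel.
struct BlockDemand {
    int threads;
    int registersPerThread;
    int sharedMemBytes;
};

// Number of blocks that fit on one SM at the same time: the tightest resource wins.
int residentBlocksPerSm(const SmLimits& sm, const BlockDemand& b) {
    assert(b.threads > 0 && b.registersPerThread > 0);
    int byThreads = sm.maxThreads / b.threads;
    int byRegs = sm.registers / (b.registersPerThread * b.threads);
    int bySmem = b.sharedMemBytes > 0 ? sm.sharedMemBytes / b.sharedMemBytes
                                      : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, bySmem});
}
```

With the assumed limits `SmLimits{49152, 65536, 2048, 16}`, a block of 256 threads using 32 registers per thread and 16 KB of shared memory is limited by shared memory to 3 resident blocks.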

GPU Execution Model
- Blocks assigned to each SM are scheduled on the associated SIMD hardware (i.e., on the Processing Elements (PEs), or CUDA cores).
- The SM bundles threads (from various blocks) into warps (wavefronts) and runs them in lockstep across the PEs.
- An NVIDIA warp groups 32 consecutive threads together (AMD wavefronts group 64 threads together).
- Warps are:
  - Scheduling units in the SM
  - Scheduled in a multiplexed and pipelined manner on the SM

Tolerating Long Latencies
- Execution in an SM
[Figure: timeline of warps W1, W2, and W3, each alternating between computation and waiting for data from GPU memory]
The GPU attempts to hide long memory latency with computation from other warps.
How are SMs able to context switch between warps so quickly?

Key Points so Far
- Programmers organize threads into “blocks” (up to 1024 threads per block)
- Motivation: write parallel software once and run it on future hardware
- Hardware spawns more threads/warps than the GPU can run at once (some may wait)
- Warps associated with blocks help tolerate long latencies
- GPUs support large register files (for fast context switching) and high-bandwidth memories (for providing data to a large number of concurrent threads)

GPU Architecture Overview
[Figure: Single-Instruction, Multiple-Threads GPU; SM clusters (each with multiple SMs) connect through an interconnection network to memory partitions, each attached to off-chip GDDR5/HBM DRAM]

GPU Microarchitecture
- Few details about GPU microarchitecture are publicly available.
- The model described next, embodied in GPGPU-Sim, was developed from white papers, programming manuals, IEEE Micro articles, and patents.

GPGPU-Sim from UBC – A Cycle-level Simulator
[Figure: HW vs. GPGPU-Sim comparison; scatter plot of GPGPU-Sim IPC (y-axis) against Quadro FX5800 IPC (x-axis), both axes from 0 to 250; correlation ~0.976]

GPU Instruction Set Architecture (ISA)
- NVIDIA defines a virtual ISA called “PTX” (Parallel Thread eXecution).
- More recently, the Heterogeneous System Architecture (HSA) Foundation (AMD, ARM, Imagination, Mediatek, Samsung, Qualcomm, TI) defined the HSAIL virtual ISA.
- PTX is a reduced instruction set architecture (e.g., a load/store architecture).
- Virtual: an infinite set of registers (much like a compiler intermediate representation).
- PTX is translated to the hardware ISA by a backend compiler (“ptxas”), either at compile time (nvcc) or at runtime (GPU driver).

Some Example PTX Syntax
- Registers are declared with a type:
    .reg .pred p, q, r;
    .reg .u16 r1, r2;
    .reg .f64 f1, f2;
- ALU operations:
    add.u32 x, y, z;          // x = y + z
    mad.lo.s32 d, a, b, c;    // d = a*b + c
- Memory operations:
    ld.global.f32 f, [a];
    ld.shared.u32 g, [b];
    st.local.f64 [c], h;
- Compare and branch operations:
    setp.eq.f32 p, y, 0;      // is y equal to zero?
    @p bra L1;                // branch to L1 if y is equal to zero

Inside an SM (1)
- Fine-grained multithreading
  - Interleave warp execution to hide latency
  - Register values of all threads stay in the core
[Figure: SM pipeline; a SIMT front end (fetch, decode, schedule, branch) feeds a SIMD datapath with a register file and a memory subsystem (shared memory, L1 data cache, texture cache, constant cache) connected to the interconnection network; a "Done (Warp ID)" signal returns to the front end]

Inside an SM (2)
[Figure: SM pipeline stages: schedule + fetch, decode, register read, execute, memory, writeback. The SIMT front end (fetch, I-Cache, decode, I-Buffer, scoreboard, issue, SIMT stack, schedulers 1 to 3) feeds the SIMD datapath (operand collector, ALUs, MEM). Signals shown: Valid[1:N], branch target PC, predicate, active mask, Done (WID)]
- Three decoupled warp schedulers
- Scoreboard
- Large register file
- Multiple SIMD functional units

Fetch + Decode
- Arbitrate the I-cache among warps
  - A cache miss is handled by fetching again later
- The fetched instruction is decoded and then stored in the I-Buffer
  - 1 or more entries per warp
  - Only warps with vacant entries are considered in fetch
[Figure: fetch and decode; per-warp PCs (PC1, PC2, PC3) pass through an arbiter that selects one for the I-Cache; decoded instructions fill per-warp I-Buffer entries (Inst. W1, W2, W3), each with v and r bits; the scoreboard and an issue arbiter read the I-Buffer; Valid[1:N] feeds back to fetch]

Instruction Issue
- Select a warp and issue an instruction from its I-Buffer for execution
  - Scheduling: Greedy-Then-Oldest (GTO)
  - GT200 and later (Fermi/Kepler): allow dual issue (superscalar)
  - Fermi: odd/even scheduler
  - To avoid stalling the pipeline, the instruction might be kept in the I-Buffer until it is known that it can complete (replay)
[Figure: I-Buffer entries (Inst. W1, W2, W3) with v and r bits, fed by decode, checked by the scoreboard, and selected by the issue arbiter]
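
A Greedy-Then-Oldest pick can be sketched in a few lines of host-side C++. This is a simplified model, not GPGPU-Sim's code: the `WarpState` struct, its fields, and the function name are assumptions made for illustration, and "ready" stands for "has a valid, hazard-free instruction in the I-Buffer".

```cpp
#include <cassert>
#include <vector>

struct WarpState {
    int id;
    unsigned long long age;  // smaller value = assigned to the SM earlier (older)
    bool ready;              // has an issuable instruction in its I-Buffer
};

// Returns the index of the warp to issue from, or -1 if no warp is ready.
// lastIssued is the index chosen in the previous cycle (-1 if none).
int pickGreedyThenOldest(const std::vector<WarpState>& warps, int lastIssued) {
    assert(lastIssued < static_cast<int>(warps.size()));
    // Greedy: stay on the same warp while it can still issue.
    if (lastIssued >= 0 && warps[lastIssued].ready) return lastIssued;
    // Otherwise fall back to the oldest ready warp.
    int best = -1;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i) {
        if (!warps[i].ready) continue;
        if (best < 0 || warps[i].age < warps[best].age) best = i;
    }
    return best;
}
```

The greedy step keeps one warp running as long as possible, which tends to preserve that warp's cache locality; the oldest-first fallback only applies once it stalls.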

Scoreboard
- Checks for RAW and WAW dependency hazards
  - Flags instructions with hazards as not ready in the I-Buffer (masking them out from the scheduler)
- Instructions reserve registers at issue
- Registers are released at writeback
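
The reserve-at-issue, release-at-writeback protocol can be modeled per warp as a set of pending destination registers. The sketch below is an assumption-laden illustration (class and method names are made up here); a real scoreboard would be a fixed-size hardware structure rather than a `std::set`.

```cpp
#include <cassert>
#include <set>
#include <vector>

// One warp's scoreboard: destination registers with writes still in flight.
class WarpScoreboard {
public:
    // An instruction has a hazard if any source register (RAW) or its
    // destination register (WAW) is still pending.
    bool hasHazard(const std::vector<int>& srcRegs, int dstReg) const {
        for (int r : srcRegs)
            if (pending_.count(r)) return true;  // RAW
        return pending_.count(dstReg) > 0;       // WAW
    }
    // Called at issue: reserve the destination register.
    void reserve(int dstReg) {
        assert(!pending_.count(dstReg));
        pending_.insert(dstReg);
    }
    // Called at writeback: the register is available again.
    void release(int dstReg) { pending_.erase(dstReg); }

private:
    std::set<int> pending_;
};
```

An instruction for which `hasHazard` returns true would be marked not ready in the I-Buffer and skipped by the scheduler until the pending write is released.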

Operand Collector
[Figure: operand collector; input from the instruction issue stage, output to dispatch]

ALU Pipelines
- SIMD execution unit
- Fully pipelined
- Each pipe may execute a subset of instructions
- Configurable bandwidth and latency (depending on the instruction)
- Default: SP + SFU pipes

Memory Unit
- Models timing for memory instructions
- Supports half-warp (16 threads)
  - Double-clock the unit
  - Each cycle services half the warp
- Has a private writeback path
[Figure: memory unit; an address generation unit (AGU) and access coalescer feed the shared memory (with bank-conflict handling), constant cache, texture cache, and data cache; the MSHR and memory port connect to the rest of the memory system]

Writeback
- Each pipeline has a result bus for writeback
- Exception:
  - The SP and SFU pipes share a result bus
  - Time slots on the shared bus are pre-allocated

SM Cluster
- Collection of SIMT cores

Interconnection Network Model
- Intersim (Booksim), a flit-level simulator
  - Topologies (Mesh, Torus, Butterfly, …)
  - Routing (Dimension Order, Adaptive, etc.)
  - Flow Control (Virtual Channels, Credits)
- Two separate networks
  - From SIMT cores to memory partitions: read requests, write requests
  - From memory partitions to SIMT cores: read replies, write acks

Topology Examples

Memory Partition
[Figure: memory partition between the interconnection network and the off-chip DRAM channel; contains an L2 cache bank, atomic operation execution, a DRAM access scheduler, and a DRAM timing model; queues shown: ROP queue, ICNT→L2 queue, L2→DRAM queue, DRAM latency queue, DRAM→L2 queue, L2→ICNT queue]

DRAM
[Figure: a memory controller connected to DRAM banks, each with a memory array, row decoder, row buffer, and column decoder]
DRAM Access
- Row access
  - Activate a row (page) of a DRAM bank
  - Load it into the row buffer
- Column access
  - Select and return a block of data from the row buffer
- Precharge
  - Write the opened row back into the DRAM array
  - Otherwise its contents will be lost!

DRAM Row Access Locality
- tRC = row cycle time
- tRP = row precharge time
- tRCD = row activate time
[Figure: timing diagram of one DRAM bank with its rows and row buffer; the bank is precharged (tRP), row A is activated (tRCD) and accessed repeatedly (RA), then the bank is precharged again and row B is activated and accessed (RB); tRC spans one full row cycle]

DRAM Bank-level Parallelism
- To increase DRAM performance and utilization
  - Multiple banks per DRAM chip
- To increase bus width
  - Multiple chips per memory controller

Scheduling DRAM Requests
- Scheduling policies supported
  - First In, First Out (FIFO)
    - In-order scheduling
  - First-Ready, First-Come First-Served (FR-FCFS)
    - Out-of-order scheduling
    - Requires associative search
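
The difference between the two policies can be sketched as two selection functions over a request queue. This is a minimal model under stated assumptions: the `DramRequest` struct, the `openRow` vector, and both function names are hypothetical, and "ready" is reduced to "hits the currently open row of its bank", ignoring all other timing constraints.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DramRequest {
    int bank;
    int row;
};

// FIFO always services the head of the queue (oldest request first).
int pickFifo(const std::vector<DramRequest>& queue) {
    return queue.empty() ? -1 : 0;
}

// queue is ordered oldest first; openRow[b] is the row held in bank b's row
// buffer (-1 if the bank is precharged). Returns the index of the request to
// service next, or -1 if the queue is empty.
int pickFrFcfs(const std::vector<DramRequest>& queue,
               const std::vector<int>& openRow) {
    // First ready: the oldest request that hits an already-open row.
    // This scan over the whole queue is the "associative search".
    for (std::size_t i = 0; i < queue.size(); ++i) {
        assert(queue[i].bank >= 0 &&
               queue[i].bank < static_cast<int>(openRow.size()));
        if (openRow[queue[i].bank] == queue[i].row) return static_cast<int>(i);
    }
    // No row hit: fall back to first come, first served.
    return queue.empty() ? -1 : 0;
}
```

For a queue `{{0,7},{1,3},{0,5}}` with row 5 open in bank 0 and row 9 open in bank 1, FIFO picks index 0 (a row miss) while FR-FCFS picks index 2 (a row hit), avoiding a precharge and activate.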

Key GPU Performance Concerns
- I) Data transfers between the CPU and the GPU are one of the major performance bottlenecks.
- II) Data transfers between SMs and global memory are costly. Can on-chip memory help?
[Figure: CPU (control, ALUs, cache) with CPU memory, connected to the GPU (device) with GPU global memory; the CPU-GPU link is marked "Bottleneck!". Inside the GPU, SMs with scratchpad, registers, and local memory connect to GPU global memory; that path is marked "Bottleneck!??"]

CUDA Streams
- CUDA (and OpenCL) can overlap computation on the GPU with memory transfers using “streams” (command queues).
- A stream orders a sequence of kernels and memory copy “operations”.
- Operations in one stream can overlap with operations in a different stream.

How Can Streams Help?
[Figure: timelines. Serial: cudaMemcpy(H2D), kernel<<<>>>, and cudaMemcpy(D2H) run back to back, so the GPU is idle during both copies and busy only during the kernel. Streams: after cudaMemcpy(H2D), kernels K0, K1, K2 and device-to-host copies DH0, DH1, DH2 overlap across streams, so the total time is shorter (savings)]
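
A rough feel for the savings comes from treating the three phases as pipeline stages. The model below is an idealization and not something the slides state: it assumes the work splits into equal chunks, that host-to-device copy, kernel, and device-to-host copy can each overlap with the other two, and that there is no per-chunk overhead. Function and parameter names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Total time when copy-in, kernel, and copy-out run back to back.
double serialTime(double h2d, double kernel, double d2h) {
    return h2d + kernel + d2h;
}

// Idealized time when the same work is split into `chunks` equal pieces, one
// per stream, and the three stages overlap like a pipeline.
double streamedTime(double h2d, double kernel, double d2h, int chunks) {
    assert(chunks >= 1);
    // The first chunk passes through all three stages.
    double fill = (h2d + kernel + d2h) / chunks;
    // Every later chunk finishes one bottleneck-stage interval after the previous one.
    double slowest = std::max({h2d, kernel, d2h}) / chunks;
    return fill + (chunks - 1) * slowest;
}
```

With assumed times of 3, 6, and 3 units for the copy-in, kernel, and copy-out, the serial schedule takes 12 units and three streams take 8 units in this model. Real savings depend on the number of copy engines and on launch overheads.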

CUDA Streams
    cudaStream_t streams[3];
    for (int i = 0; i < 3; i++)
        cudaStreamCreate(&streams[i]);  // initialize streams
    // pD/pH: device/host buffers; size: bytes per chunk, so the offsets i*size
    // assume byte-sized (e.g., char*) pointers. pH must be page-locked
    // (e.g., from cudaMallocHost) for the copies to overlap with kernels.
    for (int i = 0; i < 3; i++) {
        cudaMemcpyAsync(pD + i*size, pH + i*size, size,
                        cudaMemcpyHostToDevice, streams[i]);          // H2D
        MyKernel<<<grid, block, 0, streams[i]>>>(pD + i*size, size);  // compute
        cudaMemcpyAsync(pH + i*size, pD + i*size, size,
                        cudaMemcpyDeviceToHost, streams[i]);          // D2H
    }

Manual CPU ↔ GPU Data Movement
- Problem #1: The programmer needs to identify the data needed in a kernel and insert calls to move it to the GPU.
- Problem #2: A pointer on the CPU does not work on the GPU, since the address spaces are different.
- Problem #3: The bandwidth connecting CPU and GPU is an order of magnitude smaller than the GPU's off-chip memory bandwidth.
- Problem #4: The latency to transfer data from CPU to GPU is an order of magnitude higher than the GPU's off-chip memory latency.
- Problem #5: GPU DRAM is much smaller than CPU main memory.

Additional Features in CUDA
- Dynamic Parallelism (CUDA 5 onwards): launch kernels from within a kernel. Reduces work for, e.g., adaptive mesh refinement.
- Unified Memory (CUDA 6 onwards): avoids the need for explicit memory copies between CPU and GPU.
  http://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/
  See also Gelado et al., ASPLOS 2010.

Let’s consider some software approaches first, before moving on to hardware approaches.

Background: GPU Memory Address Spaces
- The GPU has three address spaces that support increasing visibility of data between threads: local, shared, and global.
- In addition, there are two more (read-only) address spaces: constant and texture.

Partial Overview of CUDA Memories
- Device code can:
  - R/W per-thread registers
  - R/W all-shared global memory
- Host code can:
  - Transfer data to/from per-grid global memory
[Figure: the host connected to GPU global memory; inside the SM, Block (0, 0) and Block (0, 1) each contain Thread (0, 0) and Thread (0, 1), each thread with its own registers]

CUDA Device Memory Management API functions
- cudaMalloc()
  - Allocates an object in the device global memory
  - Two parameters
    - Address of a pointer to the allocated object
    - Size of the allocated object in bytes
- cudaFree()
  - Frees an object from device global memory
  - One parameter
    - Pointer to the freed object

Host-Device Data Transfer API functions
- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type/direction of transfer
Relatively new features:
- Transfers to the device can be asynchronous
- Explicit memcpy calls can be avoided by using the newer CUDA Unified Memory
  https://devblogs.nvidia.com/unified-memory-cuda-beginners/

Local Address Space
Each thread has its own “local memory”.
Example: the location at address 100 for thread 0 is different from the location at address 100 for thread 1.
Contains local variables private to a thread.

Global Address Space
[Figure: thread block X and thread block Y both accessing the same location in global memory]
Each thread in any thread block (even from different kernels) can access a region called “global memory”.
Commonly in GPGPU workloads, each thread writes its own portion of global memory. This avoids the need for synchronization, which is slow and is complicated by unpredictable thread block scheduling.

Blocks are partitioned after linearization
- Linearized thread blocks are partitioned
  - Thread indices within a warp are consecutive and increasing
  - Warp 0 starts with Thread 0
- The partitioning scheme is consistent across devices
  - Thus you can use this knowledge in control flow
  - However, the exact size of warps may change from generation to generation
- DO NOT rely on any ordering within or between warps
  - If there are any dependencies between threads, you must call __syncthreads() to get correct results.

Warps in Multi-dimensional Thread Blocks
- The thread blocks are first linearized into 1D in row-major order
  - x-dimension first, y-dimension next, and z-dimension last
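
The linearization and warp partitioning described above can be written out as plain index arithmetic. The sketch below runs on the host; `Dim3` is a stand-in for CUDA's `dim3`/`uint3`, and the warp size of 32 is the NVIDIA value quoted earlier in the slides, which may differ on other hardware.

```cpp
#include <cassert>

struct Dim3 { int x, y, z; };  // stand-in for CUDA's built-in index types

const int kWarpSize = 32;      // assumed warp size (NVIDIA value from the slides)

// Row-major linearization: x varies fastest, then y, then z.
int linearThreadId(Dim3 threadIdx, Dim3 blockDim) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y &&
           threadIdx.z < blockDim.z);
    return threadIdx.x + threadIdx.y * blockDim.x +
           threadIdx.z * blockDim.x * blockDim.y;
}

// Warp that a thread belongs to within its block.
int warpId(Dim3 threadIdx, Dim3 blockDim) {
    return linearThreadId(threadIdx, blockDim) / kWarpSize;
}

// Position of the thread inside its warp.
int laneId(Dim3 threadIdx, Dim3 blockDim) {
    return linearThreadId(threadIdx, blockDim) % kWarpSize;
}
```

In a 16x16 block, threads with `threadIdx.y` equal to 0 and 1 form warp 0, so thread (0, 2, 0) has linear index 32 and is lane 0 of warp 1.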

Reminder: Kernel, Blocks, Threads

“Coalescing” global accesses
- Aligned accesses request a single 128B cache block
[Figure: ld.global r1,0(r2) issued by a warp whose threads all access addresses 128 to 255, i.e., one cache block]
- Memory divergence:
[Figure: ld.global r1,0(r2) issued by a warp whose threads access addresses 128, 256, 1024, and 1152, i.e., several cache blocks]
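
One way to reason about coalescing is to count how many distinct 128-byte cache blocks a single warp-wide load touches. The following host-side sketch does exactly that; the function names are illustrative, and the model ignores caching and sector-level details of real hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct cache blocks touched by one warp-wide memory instruction.
int blocksTouched(const std::vector<std::uint64_t>& byteAddrs,
                  std::uint64_t blockBytes = 128) {
    assert(blockBytes > 0);
    std::set<std::uint64_t> blocks;
    for (std::uint64_t a : byteAddrs) blocks.insert(a / blockBytes);
    return static_cast<int>(blocks.size());
}

// Byte addresses generated by a warp where thread t accesses base + t * stride.
std::vector<std::uint64_t> warpAddresses(std::uint64_t base,
                                         std::uint64_t strideBytes,
                                         int threads = 32) {
    std::vector<std::uint64_t> addrs;
    for (int t = 0; t < threads; ++t) addrs.push_back(base + t * strideBytes);
    return addrs;
}
```

A warp reading consecutive 4-byte words starting at address 128 covers bytes 128 to 255 and touches one block, while the divergent addresses 128, 256, 1024, and 1152 touch four.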

Example: Transpose (CUDA SDK)
    __global__ void transposeNaive(float *odata, float *idata, int width)
    {
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;  // TILE_DIM = 16
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in  = xIndex + width * yIndex;
        int index_out = yIndex + width * xIndex;
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {   // BLOCK_ROWS = 16
            odata[index_out + i] = idata[index_in + i * width];
        }
    }
NOTE: “xIndex”, “yIndex”, “index_in”, “index_out”, and “i” are in local memory (local variables are register allocated, but the stack lives in local memory).
“odata” and “idata” are pointers to global memory (both allocated using calls to cudaMalloc, not shown above).
[Figure: the 2x2 matrix (1 2; 3 4) and its transpose (1 3; 2 4)]
The write to global memory (the store to odata in the loop) is not “coalesced”.
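
To see why the read coalesces and the write does not, the index arithmetic of the naive kernel can be replayed on the host for the 16 threads that share one `threadIdx.y`. This is a sketch under assumptions: both arrays are taken to start at a 128-byte-aligned address, and the 128-byte block size comes from the coalescing slide; the function name is made up here.

```cpp
#include <cassert>
#include <set>

const int kTileDim = 16;      // TILE_DIM from the slide
const int kBlockBytes = 128;  // cache block size from the coalescing slide

// Distinct 128-byte blocks touched by threads threadIdx.x = 0..15 (fixed
// threadIdx.y) for the read of idata or the write of odata at i = 0.
int blocksTouchedByRow(int width, int blockIdxX, int blockIdxY, int threadIdxY,
                       bool isWrite) {
    assert(width % kTileDim == 0);
    std::set<long long> blocks;
    for (int tx = 0; tx < kTileDim; ++tx) {
        long long xIndex = blockIdxX * kTileDim + tx;
        long long yIndex = blockIdxY * kTileDim + threadIdxY;
        long long index = isWrite ? yIndex + width * xIndex   // index_out
                                  : xIndex + width * yIndex;  // index_in
        blocks.insert(index * static_cast<long long>(sizeof(float)) / kBlockBytes);
    }
    return static_cast<int>(blocks.size());
}
```

For a matrix width of 1024 floats, the 16 reads fall into a single block, while the 16 writes are 4096 bytes apart and each lands in its own block.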

Scratchpad Memory
Each thread in the same thread block (work group) can access a memory region called the scratchpad (or shared memory).
The shared memory address space is limited in size (16 to 48 KB).
It is used as a software-managed “cache” to avoid off-chip memory accesses.
Threads in a thread block synchronize using __syncthreads().

Optimizing Transpose for Coalescing
Step 1: Read a block of data into shared memory.
Step 2: Copy from shared memory into global memory using coalesced writes.
[Figure: a tile (1 2; 3 4) of idata is read into shared memory, then written transposed (1 3; 2 4) to odata]

Use of Scratchpad
    __global__ void transposescratchpad(float *odata, float *idata, int width)
    {
        __shared__ float tile[TILE_DIM][TILE_DIM];
        // Read position: element (xIndex, yIndex) of idata.
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + yIndex * width;
        // Write position: block indices are swapped so the tile lands transposed.
        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + yIndex * width;
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
        }
        __syncthreads();  // the whole tile must be loaded before any thread reads it
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i * width] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
GOOD: Coalesced write. BAD: Shared memory bank conflicts.

Bank Conflicts
- To increase bandwidth, it is common to organize memory into multiple banks.
- Independent accesses to different banks can proceed in parallel.
[Figure: two banks; Bank 0 holds addresses 0, 2, 4, 6 and Bank 1 holds addresses 1, 3, 5, 7]
Example 1: Read 0, Read 1 (can proceed in parallel)
Example 2: Read 0, Read 3 (can proceed in parallel)
Example 3: Read 0, Read 2 (bank conflict)

Shared Memory Bank Conflicts
    __shared__ int A[BSIZE];
    …
    A[threadIdx.x] = …  // no conflicts
[Figure: 32 banks; bank 0 holds words 0, 32, 64, 96; bank 1 holds 1, 33, 65, 97; bank 2 holds 2, 34, 66, 98; …; bank 31 holds 31, 63, 95, 127]

Shared Memory Bank Conflicts
    __shared__ int A[BSIZE];
    …
    A[2*threadIdx.x] = …  // 2-way conflict
[Figure: the same 32-bank layout; with a stride of 2, two threads of a warp map to each accessed bank]
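
The conflict degree for a strided access such as `A[stride * threadIdx.x]` can be computed by counting how many threads of a warp land in each bank. The sketch assumes 32 banks of 4-byte words and a 32-thread warp, as in the figure; the function name is illustrative, and the broadcast case where all threads read the same word is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worst-case number of threads in a warp that hit the same bank when thread t
// accesses 4-byte word index stride * t.
int conflictDegree(int stride, int numBanks = 32, int warpSize = 32) {
    assert(stride >= 1 && numBanks > 0 && warpSize > 0);
    std::vector<int> perBank(numBanks, 0);
    for (int t = 0; t < warpSize; ++t) perBank[(stride * t) % numBanks]++;
    return *std::max_element(perBank.begin(), perBank.end());
}
```

A stride of 1 gives a degree of 1 (no conflict) and a stride of 2 gives a 2-way conflict, matching the two slides. Odd strides are conflict-free because they share no common factor with the bank count.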

Optimizing Transpose for Coalescing
Step 1: Read a block of data into shared memory.
Step 2: Copy from shared memory into global memory using coalesced writes.
Problem: Step 2 accesses two locations in the same shared memory bank.

Eliminate Bank Conflicts
    __global__ void transposeNoBankConflicts(float *odata, float *idata, int width)
    {
        // One extra column of padding shifts each row by one bank.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];
        int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
        int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
        int index_in = xIndex + yIndex * width;
        xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
        yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
        int index_out = xIndex + yIndex * width;
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];
        }
        __syncthreads();
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
            odata[index_out + i * width] = tile[threadIdx.x][threadIdx.y + i];
        }
    }
https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
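
The effect of the extra padding column can be checked with a small host-side calculation. The sketch assumes that the threads reading one column of the tile are serviced together and that the number of banks equals the tile dimension (for example 16 threads and 16 banks, or 32 and 32); these are simplifying assumptions, and the function name is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Conflict degree when tileDim threads read one column of a shared tile:
// thread t reads tile[t][col], stored at word offset t * pitch + col,
// where pitch is the number of words per row including any padding.
int columnReadConflictDegree(int tileDim, int pitch, int numBanks, int col = 0) {
    assert(tileDim > 0 && pitch >= tileDim && numBanks > 0);
    std::vector<int> perBank(numBanks, 0);
    for (int t = 0; t < tileDim; ++t) perBank[(t * pitch + col) % numBanks]++;
    return *std::max_element(perBank.begin(), perBank.end());
}
```

With a 16x16 tile and 16 banks, an unpadded row pitch of 16 puts every element of a column in the same bank (16-way conflict), while a pitch of 17 spreads the column over 16 different banks.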

Optimizing Transpose for Coalescing
Step 1: Read a block of data into shared memory.
Step 2: Copy from shared memory into global memory using coalesced writes.
[Figure: with the padded tile, the elements read in Step 2 fall in different banks (Bank 0 and Bank 1)]

Reading Material
- NVIDIA Blogs:
  - https://devblogs.nvidia.com/how-access-global-memory-efficiently-cuda-c-kernels/
  - https://devblogs.nvidia.com/efficient-matrix-transpose-cuda-cc/
- GPGPU-Sim Manual and Tutorial Slides
  - http://www.gpgpu-sim.org/manual
  - http://www.gpgpu-sim.org/micro2012-tutorial/
- More background material: Jog et al., OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, ASPLOS’13