© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Cache Conscious Wavefront Scheduling
T. Rogers, M. O'Connor, and T. Aamodt
MICRO 2012
(2)
Goal
• Understand the relationship between schedulers (warp/wavefront) and locality behaviors
  ◦ Distinguish between inter-wavefront and intra-wavefront locality
• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
  ◦ The working set of the scheduled wavefronts fits in the cache
  ◦ Emphasis on intra-wavefront locality
(3)
Reference Locality
[Figure: taxonomy of reference locality: intra-thread and inter-thread locality, which aggregate into intra-wavefront and inter-wavefront locality.]
• Scheduler decisions can affect locality
• Need non-oblivious (to locality) schedulers
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(4)
Key Idea
[Figure: executing wavefronts vs. ready wavefronts: should a new wavefront issue?]
• Impact of issuing new wavefronts on the intra-wavefront locality of executing wavefronts
  ◦ Footprint in the cache
  ◦ When will a new wavefront cause thrashing?
  ◦ Consider the intra-wavefront footprints already in the cache
(5)
Concurrency vs. Cache Behavior
[Figure: with a round-robin scheduler, performance peaks at less than maximum concurrency.]
Adding more wavefronts to hide latency is traded off against creating more long-latency memory references due to cache thrashing.
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(6)
Additions to the GPU Microarchitecture
[Figure: SIMT core pipeline: I-Fetch, Decode, I-Buffer (pending warps), Issue, register files, scalar pipelines, D-Cache ("all hit?"), Writeback. Feedback from the D-Cache controls the issue of wavefronts and keeps track of locality behavior on a per-wavefront basis.]
(7)
Intra-Wavefront Locality
• Scalar thread traverses edges
• Edges stored in successive memory locations
• One thread vs. many
• Intra-thread locality leads to intra-wavefront locality
• The right number of wavefronts is determined by their combined reference footprint and the cache size
• The scheduler attempts to find that number
(8)
Static Wavefront Limiting
[Figure: SIMT core pipeline as above.]
• Limit the number of wavefronts based on the working set and the cache size
• Based on profiling?
• Current schemes allocate wavefronts based on resource consumption, not effectiveness of utilization
• Seek to shorten the re-reference interval
(9)
Cache Conscious Wavefront Scheduling
• Basic steps
  ◦ Keep track of the lost locality of each wavefront
  ◦ Control the number of wavefronts that can be issued
• Approach
  ◦ Maintain a list of wavefronts sorted by lost-locality score
  ◦ Stop issuing the wavefronts with the least lost locality
[Figure: SIMT core pipeline as above.]
(10)
Keeping Track of Locality
[Figure: a victim tag array indexed by warp ID (WID) keeps track of the cache lines associated with each wavefront; hits on victim entries indicate "lost locality" and update the wavefront's lost-locality score.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(11)
Updating Estimates of Locality
[Figure: wavefronts are kept sorted by lost-locality score; a wavefront's score is updated on a victim-tag hit, and the set of wavefronts allowed to issue changes with the number of wavefronts and their scores.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(12)
Estimating the Feedback Gain
• No VTA hits: drop the scores
• VTA hit: increase the score
• Wavefronts past the cutoff are prevented from issuing

LLDS = (VTAHits_Total / InstIssued_Total) × K_Throttle × CumLLSCutoff

The gain is scaled based on the percentage of the cumulative lost-locality score cutoff (a sketch follows the figure credit).
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
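To make the feedback loop concrete, here is a minimal C++ sketch of this style of lost-locality scoring and issue throttling. It follows the reconstruction above, not the paper's exact hardware: the gain constants and all names (CCWSThrottle, onVTAHit, cumCutoff) are illustrative.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Minimal sketch of CCWS-style lost-locality scoring (illustrative names).
struct WarpScore {
    int id;
    double score;  // lost-locality score (LLS)
};

struct CCWSThrottle {
    std::vector<WarpScore> warps;
    long vtaHitsTotal = 0;     // cumulative victim-tag-array hits
    long instIssuedTotal = 1;  // cumulative instructions issued
    double kThrottle = 8.0;    // empirically tuned gain (illustrative)
    double cumCutoff = 10.0;   // cumulative lost-locality score cutoff

    // A miss that hits in the victim tag array means the line was recently
    // evicted: this warp "lost locality", so raise its score by the gain.
    void onVTAHit(int warpId) {
        ++vtaHitsTotal;
        double llds = (double)vtaHitsTotal / (double)instIssuedTotal * kThrottle;
        for (auto& w : warps)
            if (w.id == warpId) w.score += llds;
    }

    void onInstIssued() { ++instIssuedTotal; }

    // Sort by score (most lost locality first) and stop issuing the
    // wavefronts at the bottom of the list, i.e., those with the least
    // lost locality, once the cumulative score crosses the cutoff.
    std::vector<int> issuableWarps() const {
        std::vector<WarpScore> s = warps;
        std::sort(s.begin(), s.end(),
                  [](const WarpScore& a, const WarpScore& b) {
                      return a.score > b.score;
                  });
        std::vector<int> allowed;
        double cum = 0.0;
        for (const auto& w : s) {
            cum += w.score;
            if (cum > cumCutoff && !allowed.empty()) break;  // throttle tail
            allowed.push_back(w.id);
        }
        return allowed;
    }
};

int main() {
    CCWSThrottle t;
    t.warps = {{0, 0.0}, {1, 0.0}, {2, 0.0}, {3, 0.0}};
    for (int i = 0; i < 100; ++i) t.onInstIssued();
    t.onVTAHit(2); t.onVTAHit(2); t.onVTAHit(0);
    for (int id : t.issuableWarps()) std::printf("warp %d may issue\n", id);
}
```

With small scores nothing is throttled; as VTA hits accumulate, the gain grows and the low-scoring tail of the sorted list is de-scheduled first.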
(13)
Schedulers
• Loose Round Robin (LRR)
• Greedy-Then-Oldest (GTO)
  ◦ Execute one wavefront until it stalls, then switch to the oldest (sketched below)
• 2LVL-GTO
  ◦ Two-level scheduler with GTO instead of RR
• Best SWL (static wavefront limiting)
• CCWS
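Since CCWS and 2LVL-GTO both build on GTO, a minimal sketch of a greedy-then-oldest pick may help; the Warp fields (age, ready) and pickGTO are illustrative names, not a specific simulator's API.

```cpp
#include <cstdint>
#include <vector>

// Greedy-then-oldest (GTO): keep issuing from the current wavefront until
// it stalls, then fall back to the oldest ready wavefront.
struct Warp {
    int id;
    uint64_t age;  // lower value = older warp (e.g., dispatch order)
    bool ready;    // has an issuable instruction this cycle
};

int pickGTO(const std::vector<Warp>& warps, int current) {
    // Greedy: stick with the current warp while it can issue.
    for (const auto& w : warps)
        if (w.id == current && w.ready) return current;
    // Stalled: switch to the oldest ready warp.
    int best = -1;
    uint64_t bestAge = UINT64_MAX;
    for (const auto& w : warps)
        if (w.ready && w.age < bestAge) { bestAge = w.age; best = w.id; }
    return best;  // -1 means nothing can issue this cycle
}
```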
(14)
Performance
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(15)
Some General Observations
• Performance is largely determined by
  ◦ Emphasis on the oldest wavefronts
  ◦ Distribution of references across cache lines, i.e., the extent of the lack of coalescing
• GTO works well for this reason → it prioritizes older wavefronts
• LRR touches too much data (too little reuse) to fit in the cache
(16)
Locality Behaviors
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Cache Conscious Wavefront Scheduling," MICRO 2012
(17)
Summary
• Dynamic tracking of relationship between wavefronts and working sets in the cache
• Modify scheduling decisions to minimize interference in the cache
• Tunables: need profile information to create stable operation of the feedback control
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
Divergence-Aware Warp Scheduling
T. Rogers, M. O'Connor, and T. Aamodt
MICRO 2013
(19)
Goal
• Understand the relationship between schedulers (warp/wavefront) and locality behaviors
  ◦ Distinguish between inter-wavefront and intra-wavefront locality
• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size
  ◦ The working set of the scheduled wavefronts fits in the cache
  ◦ Emphasis on intra-wavefront locality
• Differs from CCWS in being proactive
  ◦ Deeper look at what happens inside loops
(20)
Key Idea
• Manage the relationship between control divergence, memory divergence and scheduling
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(21)
Key Idea (2)
[Figure: two warps execute while (i < C[tid+1]), a divergent branch in a loop with intra-thread locality. Warp 0's four references fill the cache, so warp 1 is delayed; once there is room in the cache, warp 1 is scheduled. Warp 0's behavior is used to predict the interference warp 1 will cause.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
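For context, the loop on the slide is the inner loop of a CSR-style traversal; a minimal sketch (the array name C follows the slide, while edgeData and traverse are illustrative):

```cpp
// CSR-style traversal sketch: C[] holds per-thread edge-list offsets (as
// on the slide); each scalar thread walks its own contiguous range, so
// successive references fall in successive memory locations, which is the
// intra-thread locality the scheduler wants to preserve.
float traverse(const int* C, const float* edgeData, int tid) {
    float acc = 0.0f;
    for (int i = C[tid]; i < C[tid + 1]; ++i)  // the slide's while-loop
        acc += edgeData[i];
    return acc;
}
```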
(22)
Goal
[Figure: a simpler, portable version of the code vs. a GPU-optimized version: make their performance equivalent.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(23)
Additions to the GPU Microarchitecture
[Figure: SIMT core pipeline as before, with the I-Buffer extended with a scoreboard and a memory coalescer in front of the D-Cache; feedback again controls the issue of wavefronts.]
(24)
Observation
• The bulk of the accesses in a loop come from a few static load instructions
• The bulk of the locality in (these) applications is intra-loop
(25)
Distribution of Locality
The bulk of the locality comes from a few static loads in loops.
Hint: can we keep the data from the last iteration?
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(26)
A Solution
• Prediction mechanisms for locality across iterations of a loop
• Schedule such that data fetched in one iteration is still present at next iteration
• Combine with control flow divergence (how much of the footprint needs to be in the cache?)
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(27)
Classification of Dynamic Loads
• Group static loads into equivalence classes → loads that reference the same cache line
• Identify these groups by a repetition ID
• Prediction for each load by compiler or hardware
[Figure: loads classified by control-flow divergence: converged, somewhat diverged, diverged.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(28)
Predicting a Warp's Cache Footprint
[Figure: a warp entering a loop body creates a footprint prediction; as some threads exit the loop (taken/not-taken paths), the predicted footprint drops; on loop exit the prediction is reinitialized.]
• Predict the locality usage of static loads
  ◦ Not all loads increase the footprint
• Combine with control divergence to predict the footprint
• Use the footprint to throttle (or not throttle) warp issue
(29)
Principles of Operation
• A prefix sum of each warp's cache footprint is used to select the warps that can be issued (see the sketch below)

EffCacheSize = kAssocFactor × TotalNumLines

• Scaling back from a fully associative cache; kAssocFactor is empirically determined
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
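A minimal sketch of that selection rule, assuming per-warp footprint predictions are already available; the names and the priority order are illustrative:

```cpp
#include <vector>

// Walk the warps in scheduler priority order and admit them while the
// prefix sum of their predicted footprints fits in the effective cache.
struct WarpPred {
    int id;
    int footprintLines;  // predicted cache footprint, in cache lines
};

std::vector<int> selectIssuable(const std::vector<WarpPred>& byPriority,
                                int totalNumLines, double kAssocFactor) {
    // EffCacheSize = kAssocFactor x TotalNumLines: scale back from a
    // fully associative cache (kAssocFactor empirically determined).
    const double effCacheSize = kAssocFactor * totalNumLines;
    std::vector<int> allowed;
    double prefixSum = 0.0;
    for (const auto& w : byPriority) {
        prefixSum += w.footprintLines;
        if (prefixSum > effCacheSize && !allowed.empty()) break;
        allowed.push_back(w.id);  // this warp's data should still fit
    }
    return allowed;
}
```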
(30)
Principles of Operation (2)
• Profile static load instructions
  ◦ Are they divergent?
  ◦ Loop repetition ID: assume all loads with the same base address and an offset within the same cache line are repeated each iteration
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(31)
Prediction Mechanisms
• Profiled Divergence-Aware Scheduling (DAWS)
  ◦ Uses offline profile results to dynamically determine de-scheduling decisions
• Detected Divergence-Aware Scheduling (DAWS)
  ◦ Behaviors derived at run time drive de-scheduling decisions
    - Loops that exhibit intra-warp locality
    - Static loads are characterized as divergent or convergent
(32)
Extensions for DAWS
[Figure: SIMT core pipeline as before, with adders that accumulate per-warp footprint predictions feeding the issue stage.]
(33)
Operation: Tracking
[Figure: a prediction table holds one entry per warp issue slot, filled with profile-based information and scaled by the number of active lanes; entries are created and removed at loop begin/end, and the predictions are the basis for throttling.]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(34)
Operation: Prediction
Sum the results from the static loads in this loop:
  ◦ Add #active-lanes cache lines for divergent loads
  ◦ Add 2 for converged loads
  ◦ Count loads in the same equivalence class only once (unless divergent)
• Generally only consider de-scheduling warps in loops, since most of the activity is there
• Can be extended to non-loop regions by associating non-loop code with the next loop (a sketch follows below)
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
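A minimal sketch of the footprint rule just listed (add #active-lanes for divergent loads, 2 for converged loads, each equivalence class counted once); the data structures are illustrative:

```cpp
#include <set>
#include <vector>

// Per-warp footprint prediction from the static loads in the current loop.
struct StaticLoad {
    int repetitionId;  // equivalence class: loads that hit the same line
    bool divergent;    // characterized as memory-divergent?
};

int predictFootprint(const std::vector<StaticLoad>& loopLoads,
                     int activeLanes) {
    int footprintLines = 0;
    std::set<int> counted;  // equivalence classes already counted
    for (const auto& ld : loopLoads) {
        if (ld.divergent) {
            // Divergent load: up to one cache line per active lane.
            footprintLines += activeLanes;
        } else if (counted.insert(ld.repetitionId).second) {
            // Converged load: add 2, counting each class only once.
            footprintLines += 2;
        }
    }
    return footprintLines;
}
```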
(35)
Operation: Nested Loops
for (i = 0; i < limitx; i++) {
  ..
  for (j = 0; j < limity; j++) {
    ..
  }
  ..
}
• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction
• De-scheduling of warps is determined by inner-loop behavior!
• Predictions are re-used from the inner-most loops, which is where most of the data re-use is found
(36)
Detected DAWS: Prediction
[Figure: a sampling warp for the loop (>2 active threads) characterizes behavior; on loop entry, loads with detected locality fill the PC-load entries.]
• Detect both memory divergence and intra-loop repetition at run time
• Fill PC-load entries based on run-time information
• Use profile information to start
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(37)
Detected DAWS: Classification
Increment or decrement a counter depending on the number of memory accesses generated by a load (a sketch follows below).
Create equivalence classes of loads (by checking their PCs).
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
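A minimal sketch of this run-time classification, assuming a saturating counter per static load PC; the threshold and counter width are illustrative:

```cpp
#include <cstdint>
#include <unordered_map>

// Run-time load classification for Detected DAWS: a saturating counter
// per static load PC tracks memory divergence.
struct LoadInfo {
    int divCounter = 0;  // saturating divergence counter, 0..7
};

struct LoadClassifier {
    std::unordered_map<uint64_t, LoadInfo> byPC;

    // Called after coalescing: one warp-level load that needed several
    // memory transactions is behaving divergently.
    void onLoad(uint64_t pc, int memAccesses) {
        LoadInfo& li = byPC[pc];
        if (memAccesses > 1) { if (li.divCounter < 7) ++li.divCounter; }
        else                 { if (li.divCounter > 0) --li.divCounter; }
    }

    bool isDivergent(uint64_t pc) { return byPC[pc].divCounter >= 4; }
};
```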
(38)
Performance
Little to no degradation
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(39)
Performance
Significant intra-warp locality.
SPMV-scalar is normalized to the best SPMV-vector.
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(40)
Summary
• If we can characterize warp level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints
• Proactive scheme outperforms reactive management
• Understand interactions between memory divergence and control divergence
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al., ASPLOS 2013
(42)
Goal
• Understand memory effects of scheduling from deeper within the memory hierarchy
• Minimize idle cycles induced by stalling warps waiting on memory references
(43)
Off-chip Bandwidth is Critical!
[Figure: percentage of total execution cycles wasted waiting for data to come back from DRAM, across GPGPU applications. The Type-1 applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP) average 55%; the average across all applications is 32%.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(44)
Source of Idle Cycles
• Warps stalled waiting on memory references
  ◦ Cache miss
  ◦ Service at the memory controller
  ◦ Row buffer miss in DRAM
  ◦ Latency in the network (not addressed in this paper)
• The last warp effect
• The last CTA effect
• Lack of multiprogrammed execution
  ◦ One (small) kernel at a time
(45)
Impact of Idle Cycles
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
High-Level View of a GPU
[Figure: SIMT cores, each with a scheduler, ALUs, and L1 caches, run warps (W) grouped into Cooperative Thread Arrays (CTAs); the cores connect through an interconnect to a shared L2 cache and DRAM.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(47)
CTA-Assignment Policy (Example)
[Figure: a multi-threaded CUDA kernel launches CTA-1 through CTA-4, which are distributed across SIMT Core-1 and SIMT Core-2; each core has its own warp scheduler, ALUs, and L1 caches.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(48)
Organizing CTAs Into Groups
• Set the minimum number of warps per group equal to the number of pipeline stages, as sketched below
  ◦ Same philosophy as the two-level warp scheduler
• Use the same CTA grouping/numbering across SMs?
[Figure: CTAs distributed across SIMT Core-1 and SIMT Core-2, each with a warp scheduler, ALUs, and L1 caches.]
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
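A minimal sketch of that group-sizing rule (whole CTAs per group; parameters are illustrative):

```cpp
// CTA-aware grouping, sized so each fetch group has at least as many
// warps as pipeline stages, rounded up to whole CTAs.
int warpsPerGroup(int pipelineStages, int warpsPerCTA) {
    int ctas = (pipelineStages + warpsPerCTA - 1) / warpsPerCTA;  // ceil
    return ctas * warpsPerCTA;
}
// Example: 8 pipeline stages, 3 warps per CTA -> 3 CTAs, 9 warps/group.
```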
(49)
Warp Scheduling Policy
• All launched warps on a SIMT core have equal priority
  ◦ Round-robin execution
• Problem: many warps stall at long-latency operations at roughly the same time
[Figure: timeline: all warps compute with equal priority, all send memory requests at about the same time, and the SIMT core stalls.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(50)
Solution
• Form warp groups (Narasiman, MICRO 2011)
• CTA-aware grouping
• Group switch is round-robin
[Figure: timeline: warp groups send memory requests at staggered times, so one group's computation overlaps another group's stalls, saving cycles.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(51)
Two-Level Round-Robin Scheduler
[Figure: sixteen CTAs organized into four groups (Group 0 through Group 3), with round-robin (RR) selection within each group and round-robin switching across groups. A sketch follows below.]
The scheme is agnostic to when pending misses are satisfied.
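A minimal sketch of the two-level structure, assuming the group switch is triggered when the active group stalls on long-latency operations; the types are illustrative:

```cpp
#include <vector>

// Two-level round-robin: round-robin among the warps of the active fetch
// group; when the whole group stalls, switch groups, also round-robin.
// The grouping itself is CTA-aware.
struct TwoLevelRR {
    std::vector<std::vector<int>> groups;  // warp IDs per group
    size_t group = 0, pos = 0;

    // Round-robin within the active group.
    int nextWarp() {
        int id = groups[group][pos];
        pos = (pos + 1) % groups[group].size();
        return id;
    }

    // All warps of the active group are waiting on memory: switch.
    void onGroupStall() {
        group = (group + 1) % groups.size();
        pos = 0;
    }
};
```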
(52)
Key Idea
• What is the relationship between executing warps (EW), stalled warps (SW), and cache-resident warps (CW)?
• When EW ≠ CW, we get interference
• Principle: optimize reuse by seeking to make EW = CW
(53)
Objective 1: Improve Cache Hit Rates
[Figure: without switching, when the data for CTA-1 arrives, the scheduler keeps rotating over 4 CTAs in time T; with switching to CTA-1 on data arrival, only 3 CTAs access the cache in time T. Fewer CTAs accessing the cache concurrently → less cache contention.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(54)
Reduction in L1 Miss Rates
[Figure: L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and their average, normalized to round-robin, under CTA-Grouping and CTA-Grouping-Prioritization: average reductions of 8% and 18%, respectively.]
• Limited benefits for cache-insensitive applications
• What is happening deeper in the memory system?
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(55)
The Off-Chip Memory Path
[Figure: compute units CU 0 through CU 15 and CU 16 through CU 31 reach off-chip GDDR5 channels through multiple memory controllers (MCs).]
• Access patterns?
• Ordering and buffering?
(56)
Inter-CTA Locality
[Figure: two SIMT cores, each with a warp scheduler, ALUs, and L1 caches, run CTA-1 through CTA-4 and share multiple DRAM channels.]
How do CTAs interact at the memory controller and in DRAM?
(57)
Impact of the Memory Controller
• Memory scheduling policies
  ◦ Optimize bandwidth vs. memory latency
• Impact of row buffer access locality
• Cache lines?
(58)
Row Buffer Locality
(59)
CTA Data Layout (A Simple Example)
[Figure: a data matrix partitioned among CTA 1 through CTA 4 is stored row-major in DRAM, so the matrix rows A(0,*), A(1,*), A(2,*), and A(3,*) map to Banks 1, 2, 3, and 4 respectively, and the data of consecutive CTAs falls in the same DRAM rows. A toy mapping example follows.]
Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
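A toy calculation of why this happens: under a row-major layout, the consecutive tiles owned by consecutive CTAs fall into the same DRAM row. The 2KB rows, 4 banks, 512B tiles, and the address mapping are illustrative, not the paper's:

```cpp
#include <cstdio>

// Toy address-to-DRAM mapping: 2KB rows striped across 4 banks.
constexpr long kRowBytes = 2048;
constexpr long kBanks = 4;

long bankOf(long addr) { return (addr / kRowBytes) % kBanks; }
long rowOf(long addr)  { return addr / (kRowBytes * kBanks); }

int main() {
    // Four consecutive CTAs each own a consecutive 512B tile of a
    // row-major matrix row.
    for (int cta = 0; cta < 4; ++cta) {
        long addr = cta * 512L;  // first byte of this CTA's tile
        std::printf("CTA-%d -> bank %ld, row %ld\n",
                    cta + 1, bankOf(addr), rowOf(addr));
    }
    // All four tiles land in bank 0, row 0: prioritizing consecutive
    // CTAs on all SMs concentrates traffic on one row buffer while the
    // other banks idle.
    return 0;
}
```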
(60)
Implications of High CTA-Row Sharing
[Figure: SIMT Core-1 and Core-2 use the same CTA prioritization order, so the warps of CTA-1 through CTA-4 all target Row-1 in Bank-1 behind the L2 cache while the other banks sit idle.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
[Figure: with high row locality but low bank-level parallelism, all requests queue on Row-1 of Bank-1 and Row-2 of Bank-2; spreading the requests lowers row locality but raises bank-level parallelism.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(62)
Some Additional Details
• Spread references from multiple CTAs (on multiple SMs) across row buffers in distinct banks
• Do not use the same CTA group prioritization across SMs
  ◦ Play the odds
• What happens with applications with unstructured, irregular memory access patterns?
Objective 2: Improving Bank-Level Parallelism
[Figure: SIMT Core-1 and Core-2 use different CTA prioritization orders, spreading their warps' requests across Banks 1 through 4 behind the L2 cache.]
• 11% increase in bank-level parallelism
• 14% decrease in row buffer locality
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Objective 3: Recovering Row Locality
[Figure: memory-side prefetching into the L2 cache recovers the lost row locality as L2 hits for SIMT Core-1 and Core-2.]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Memory-Side Prefetching
• Prefetch the so-far-unfetched cache lines of an already open row into the L2 cache, just before the row is closed
• What to prefetch?
  ◦ Sequentially prefetch the cache lines that were not accessed by demand requests
  ◦ Sophisticated schemes are left as future work
• When to prefetch? (see the sketch below)
  ◦ Opportunistic in nature
  ◦ Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical)
  ◦ Option 2: give prefetching more time; make demands wait if there are not many (demands are NOT always critical)
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
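A minimal sketch of the Option 1 policy, with illustrative bookkeeping for the open row and the demand queue:

```cpp
#include <deque>
#include <set>

// Opportunistic memory-side prefetching, Option 1 (demands are always
// critical). The bookkeeping structures are illustrative.
struct OpenRow {
    long rowId = -1;
    int linesPerRow = 32;
    std::set<int> fetched;  // lines already brought in by demand requests
};

struct MemoryController {
    std::deque<long> demandRows;  // row IDs of pending demand requests
    OpenRow open;

    // Called on an otherwise-idle cycle while a row is still open:
    // prefetch one not-yet-fetched line of that row into the L2.
    bool tryPrefetch() {
        // Option 1: stop as soon as a demand targets another row.
        if (!demandRows.empty() && demandRows.front() != open.rowId)
            return false;
        for (int line = 0; line < open.linesPerRow; ++line) {
            if (open.fetched.insert(line).second) {
                // ... issue the prefetch of (open.rowId, line) here ...
                return true;
            }
        }
        return false;  // the whole row has been fetched already
    }
};
```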
IPC Results (Normalized to Round-Robin)
[Figure: normalized IPC for the Type-1 applications (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP) and their average under Objective 1, Objectives 1+2, Objectives 1+2+3, and a Perfect L2: average improvements of 25%, 31%, and 33%, against 44% for the Perfect L2.]
• Within 11% of a Perfect L2
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(67)
Summary
• Coordinated scheduling across SMs, CTAs, and warps
• Consideration of effects deeper in the memory system
• Coordinating warp residence in the core with the presence of corresponding lines in the cache
© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)
CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads
S.-Y. Lee, A. Arunkumar, and C.-J. Wu
ISCA 2015
(69)
Goal
• Reduce warp divergence and hence increase throughput
• The key is the identification of critical (lagging) warps
• Manage resources and scheduling decisions to speed up the execution of critical warps thereby reducing divergence
(70)
Review: Resource Limits on Occupancy
[Figure: the kernel distributor and SM scheduler dispatch thread blocks (TBs) to SMs, which share DRAM. Within an SM, the warp schedulers and warp contexts limit the number of threads; the register file limits both the number of threads and the number of thread blocks; L1/shared memory limits the number of thread blocks. Locality effects follow from these limits. SM = streaming multiprocessor, SP = streaming processor.]
(71)
Evolution of Warps in TB
• Coupled lifetimes of warps in a TB
  ◦ Start at the same time
  ◦ Synchronization barriers
  ◦ Kernel exit (an implicit synchronization barrier)
[Figure: as warps complete, the remaining region is one where latency hiding is less effective.]
Figure from P. Xiang et al., "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation"
(72)
Warp Criticality Problem
[Figure: GPU organization: kernel management unit, kernel distributor, SMX scheduler, and SMXs with warp schedulers, warp contexts, registers, and L1 cache/shared memory. While the last warp of a TB executes: available registers are spatial underutilization; registers still allocated to completed warps in the TB are temporal underutilization; completed warp contexts sit idle. Manage resources and schedules around critical warps.]
(73)
Warp Execution Time Disparity
• Sources: branch divergence, interference in the memory system, scheduling policy, and workload imbalance
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(74)
Branch Divergence
• Warp-to-warp variation in dynamic instruction count (even without branch divergence)
• Intra-warp branch divergence
• Example: traversal over constant node-degree graphs
(75)
Impact of the Memory System
[Figure: executing warps leave intra-wavefront footprints in the cache; lines that could have been re-used are evicted because the re-reference comes too slowly.]
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(76)
Impact of Warp Scheduler
• Amplifies the critical warp effect
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(77)
Criticality Predictor Logic
[Figure: divergent paths of m and n instructions.]
• Non-divergent branches can also generate large differences in dynamic instruction counts across warps (e.g., m >> n)
• Update the CPL counter on a branch: estimate the dynamic instruction count
• Update the CPL counter on instruction commit
(78)
CPL Calculation
[Figure: divergent paths of m and n instructions.]

nCriticality = nInstr × CPIavg + nStall

where nInstr is the disparity in dynamic instruction count between warps, CPIavg is the warp's average CPI, and nStall is the accumulated inter-instruction memory stall cycles.
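A minimal sketch of the criticality estimate and the criticality-first pick used by the scheduling policy on the next slide; the field names are illustrative:

```cpp
#include <vector>

// Per-warp criticality prediction, per the formula above.
struct WarpState {
    int id;
    long nInstr;    // estimated dynamic-instruction disparity
    double cpiAvg;  // this warp's average CPI
    long nStall;    // accumulated inter-instruction memory stall cycles
    bool ready;
};

double criticality(const WarpState& w) {
    // nCriticality = nInstr x CPIavg + nStall
    return (double)w.nInstr * w.cpiAvg + (double)w.nStall;
}

// The most critical ready warp gets priority (and, in CAWA, a larger
// time slice).
int pickMostCritical(const std::vector<WarpState>& warps) {
    int best = -1;
    double bestScore = -1.0;
    for (const auto& w : warps) {
        if (!w.ready) continue;
        double c = criticality(w);
        if (c > bestScore) { bestScore = c; best = w.id; }
    }
    return best;  // -1 if no warp is ready
}
```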
(79)
Scheduling Policy
• Select a warp based on criticality
• Execute until no more instructions are available
  ◦ A form of GTO
• Critical warps get higher priority and a larger time slice
(80)
Behavior of Critical Warp References
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(81)
Criticality Aware Cache Prioritization
• Prediction: critical warp
• Prediction: re-reference interval
• Both are used to manage the cache footprint
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(82)
Integration into GPU Microarchitecture
[Figure: criticality prediction and cache management blocks integrated into the GPU pipeline.]
Figure from S.-Y. Lee et al., "CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads," ISCA 2015
(83)
Performance
[Figure annotations: periodic computation of prediction accuracy for critical warps; gains are due in part to low miss rates.]
(84)
Summary
• Warp divergence leads to some lagging warps → critical warps
• Expose the performance impact of critical warps → throughput reduction
• Coordinate scheduler and cache management to reduce warp divergence