Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
TRANSCRIPT
[Slide 1]
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling
Veynu Narasiman, The University of Texas at Austin
Michael Shebanow, NVIDIA
Chang Joo Lee, Intel
Rustam Miftakhutdinov, The University of Texas at Austin
Onur Mutlu, Carnegie Mellon University
Yale N. Patt, The University of Texas at Austin
MICRO-44, December 6th, 2011, Porto Alegre, Brazil
[Slide 2]
Rise of GPU Computing
GPUs have become a popular platform for general purpose applications
New programming models: CUDA, ATI Stream Technology, OpenCL
Order of magnitude speedup over single-threaded CPU
[Slide 3]
How GPUs Exploit Parallelism
Multiple GPU cores (i.e., Streaming Multiprocessors)
Focus on a single GPU core
Exploit parallelism in 2 major ways:
Threads grouped into warps: single PC per warp; warps executed in SIMD fashion
Multiple warps executed concurrently: round-robin scheduling helps hide long latencies (a minimal sketch of the warp model follows)
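To make the execution model concrete, here is a minimal sketch of a warp in Python; the `Warp` class, `step` helper, and `lane_state` parameter are illustrative names of mine, not hardware from the paper:

```python
from dataclasses import dataclass

@dataclass
class Warp:
    pc: int            # single program counter shared by all threads in the warp
    active_mask: list  # one enable bit per thread, i.e., per SIMD lane

def step(warp, program, lane_state):
    """Issue one instruction in SIMD fashion: every active lane executes
    the same operation from the shared PC, each on its own data."""
    op = program[warp.pc]
    for lane, active in enumerate(warp.active_mask):
        if active:
            op(lane_state[lane])
    warp.pc += 1
```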
[Slide 4]
The Problem
Despite these techniques, computational resources can still be underutilized
Two reasons for this:
Branch divergence
Long latency operations
[Slide 5]
Branch Divergence
[Diagram: a warp with active mask 1111 executes block A, then diverges at a branch. The taken threads (mask 1001) execute block B while the not-taken threads (mask 0110) execute block C; both paths reconverge at block D with the full mask 1111. The hardware divergence stack tracks this with Reconverge PC, Active Mask, and Execute PC fields alongside the warp's current PC and current active mask.]
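A minimal sketch of the divergence-stack mechanics, building on the Warp model above. The entry layout (Reconverge PC, Active Mask, Execute PC) comes from the slide; the path ordering (not-taken first) is an assumption of this sketch:

```python
def on_divergent_branch(stack, warp, taken_mask, taken_pc, fallthru_pc, reconv_pc):
    full_mask = warp.active_mask
    not_taken_mask = [a and not t for a, t in zip(full_mask, taken_mask)]
    # Reconvergence entry: restores the full mask once both paths reach reconv_pc.
    stack.append((reconv_pc, full_mask, reconv_pc))
    # Taken-path entry: executed after the not-taken path reaches reconv_pc.
    stack.append((reconv_pc, taken_mask, taken_pc))
    # Execute the not-taken path first (ordering assumed).
    warp.active_mask = not_taken_mask
    warp.pc = fallthru_pc

def on_reach_reconverge_pc(stack, warp):
    # Pop the top entry and resume with its saved mask and execute PC.
    _, mask, exec_pc = stack.pop()
    warp.active_mask = mask
    warp.pc = exec_pc
```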
[Slide 6]
Long Latency Operations
[Timeline: with round-robin scheduling over 16 warps, all warps compute together, then all issue their memory requests (Req Warp 0 through Req Warp 15) at nearly the same time, leaving the core idle until the data returns.]
[Slide 7]
Computational Resource Utilization
[Chart: per-benchmark breakdown of cycles by number of active SIMD lanes (0, 1 to 7, 8 to 15, 16 to 23, 24 to 31, 32), with 32 warps, 32 threads per warp, SIMD width = 32, and round-robin scheduling. Cycles with all 32 lanes active are good; cycles with 0 active lanes are bad.]
[Slide 8]
Large Warp Microarchitecture (LWM)
Alleviates branch divergence
Fewer, but larger warps: warp size much greater than SIMD width
Total thread count and SIMD width stay the same
Dynamically breaks down each large warp into sub-warps that can be executed on the existing SIMD pipeline
Rearrange the active mask as a 2D structure: number of columns = SIMD width; search each column for an active thread to create a new sub-warp (a packing sketch follows this list)
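A minimal sketch of the column-wise packing, assuming the 2D active mask is given as a list of rows; the helper name `create_subwarps` is mine, not the paper's:

```python
def create_subwarps(mask_2d):
    """Greedily pack a large warp's 2D active mask (rows x SIMD-width columns)
    into sub-warps: each pass picks at most one active thread per column."""
    rows, cols = len(mask_2d), len(mask_2d[0])
    remaining = [row[:] for row in mask_2d]
    subwarps = []
    while any(any(row) for row in remaining):
        lanes = [None] * cols              # lane -> row of the chosen thread
        for c in range(cols):
            for r in range(rows):
                if remaining[r][c]:
                    lanes[c] = r           # thread in row r fills SIMD lane c
                    remaining[r][c] = 0
                    break
        subwarps.append(lanes)
    return subwarps
```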
[Slide 9]
Large Warp Microarchitecture Example
[Animation: in the decode stage, an 8-row by 4-column active mask
1 0 0 1
0 1 0 0
0 0 1 1
1 0 0 0
0 0 1 0
0 1 0 0
1 0 0 1
0 1 0 0
is packed column by column into three sub-warps, with masks 1111, 1111, and 1101.]
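Running the slide's 8-by-4 mask through the packing sketch above reproduces the three sub-warps (masks 1111, 1111, 1101):

```python
mask = [[1, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 1, 0, 0]]
for lanes in create_subwarps(mask):
    print(lanes)   # lane -> source row; None means the lane idles
# [0, 1, 2, 0]
# [3, 5, 4, 2]
# [6, 7, None, 6]
```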
[Slide 10]
More Large Warp Microarchitecture
Divergence stack still used; handled at the large-warp level
How large should we make the warps? More threads per warp means more potential for sub-warp creation, but too large a warp size can degrade performance
Re-fetch policy for conditional branches: must wait until the last sub-warp finishes (sketched below)
Optimization for unconditional branch instructions: don't create multiple sub-warps; sub-warping always completes in a single cycle
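A sketch of the re-fetch rule, using assumed bookkeeping fields (`pending_subwarps`, `current_inst_is_cond_branch` are illustrative names, not the paper's): for a conditional branch, the large warp's next fetch must wait until its last sub-warp completes, since only then is the full branch outcome known.

```python
def can_refetch(large_warp):
    # Conditional branch: the divergence outcome for the whole large warp
    # is known only after every sub-warp has executed the branch.
    if large_warp.current_inst_is_cond_branch and large_warp.pending_subwarps > 0:
        return False
    return True
```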
[Slide 11]
Two Level Round Robin Scheduling
Split warps into equal-sized fetch groups
Create an initial priority among the fetch groups
Round-robin scheduling among warps in the same fetch group
When all warps in the highest-priority fetch group are stalled, rotate fetch group priorities: the highest-priority fetch group becomes the least
Warps arrive at a stalling point at slightly different points in time, giving better overlap of computation and memory latency (a scheduler sketch follows below)
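A minimal sketch of the policy, assuming a per-warp readiness predicate; the class name and the `is_ready` callback are my stand-ins, not the paper's interface:

```python
class TwoLevelScheduler:
    """Two-level round robin: round robin within the highest-priority fetch
    group; rotate group priorities only when that whole group is stalled."""

    def __init__(self, warps, fetch_group_size):
        self.groups = [warps[i:i + fetch_group_size]
                       for i in range(0, len(warps), fetch_group_size)]

    def pick_warp(self, is_ready):
        # All warps in the highest-priority group stalled: rotate priorities,
        # so that group becomes the lowest-priority group.
        if not any(is_ready(w) for w in self.groups[0]):
            self.groups.append(self.groups.pop(0))
        for group in self.groups:
            for w in list(group):
                if is_ready(w):
                    group.remove(w)
                    group.append(w)   # round robin within the group
                    return w
        return None                   # everything is stalled
```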
[Slide 12]
Round Robin vs Two Level Round Robin
[Timelines: with round-robin scheduling (16 warps), all warps compute, then all issue their memory requests (Req Warp 0 through 15) together, idling the core until data returns. With two-level scheduling (2 fetch groups, 8 warps each), group 0 computes and issues Req Warp 0 through 7 while group 1 computes and issues Req Warp 8 through 15; each group's next compute phase overlaps the other group's memory latency, saving cycles.]
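A back-of-envelope model of the saved cycles; every number below is an assumption for illustration, not a figure from the paper:

```python
c, m = 10, 300   # assumed: compute cycles per warp phase, memory latency
n, g = 16, 8     # 16 warps, fetch groups of 8

# Round robin: all warps compute, all wait on memory together, all compute again.
rr_total = n * c + m + n * c

# Two level: group 1 computes while group 0's requests are in flight, so each
# group's second compute phase starts as soon as its own data returns.
g0_compute_done = g * c
g1_compute_done = 2 * g * c
g0_second_done = max(g1_compute_done, g0_compute_done + m) + g * c
g1_second_done = max(g0_second_done, g1_compute_done + m) + g * c

print(rr_total, g1_second_done)   # 620 vs 540: 80 cycles saved
```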
[Slide 13]
More on Two-Level Scheduling
What should the fetch group size be? Enough warps to keep the pipeline busy in the absence of long latency stalls.
Too small: uneven progression of warps in the same fetch group destroys data locality among warps
Too large: reduces the benefits of two-level scheduling; more warps stall at the same time
Not just for hiding memory latency: for complex instructions (e.g., sine, cosine, sqrt), two-level scheduling also allows warps to arrive at such instructions at slightly different points in time
[Slide 14]
Combining LWM and Two Level Scheduling
4 large warps, 256 threads each; fetch group size = 1 large warp
Problematic for applications with few long latency stalls: no stalls means no fetch group priority changes, so a single large warp is starved, and the branch re-fetch policy for large warps creates bubbles in the pipeline
Timeout-invoked fetch group priority change: a 32K instruction timeout period alleviates starvation (see the sketch below)
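A hypothetical extension of the scheduler sketch from the two-level slide, adding the 32K-instruction timeout; the class and counter names are mine:

```python
TIMEOUT_INSTS = 32 * 1024   # 32K instruction timeout period

class TimeoutTwoLevelScheduler(TwoLevelScheduler):
    def __init__(self, warps, fetch_group_size):
        super().__init__(warps, fetch_group_size)
        self.insts_since_rotation = 0

    def pick_warp(self, is_ready):
        # Force a priority rotation even without stalls, so a single large
        # warp cannot monopolize fetch and starve the others.
        if self.insts_since_rotation >= TIMEOUT_INSTS:
            self.groups.append(self.groups.pop(0))
            self.insts_since_rotation = 0
        warp = super().pick_warp(is_ready)
        if warp is not None:
            self.insts_since_rotation += 1
        return warp
```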
[Slide 15]
Methodology
| Component | Configuration |
| --- | --- |
| Scalar Front End | 1-wide fetch and decode; 4KB single-ported I-cache; round-robin scheduling |
| SIMD Back End | In order, 5 stages, 32 parallel SIMD lanes |
| Register File and On-Chip Memories | 64KB register file; 128KB, 4-way D-cache with 128B line size; 128KB, 32-banked private memory |
| Memory System | Open-row, first-come first-serve scheduling; 8 banks, 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth |

Simulated: a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each
[Slide 16]
Overall IPC Results
[Charts: IPC per benchmark (blackjack, sort, viterbi, kmeans, decrypt, blackscholes, needleman, hotspot, matrix_mult, reduction, histogram, bfs) plus the geometric mean (gmean), comparing Baseline, TBC, LWM, 2Lev, and LWM+2Lev.]
LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC
[Slide 17]
IPC and Computational Resource Utilization
[Charts: IPC and computational resource utilization (cycles with 0, 1 to 7, 8 to 15, 16 to 23, 24 to 31, or 32 active lanes) for blackjack and for histogram, comparing baseline, LWM, 2LEV, and LWM+2LEV.]
[Slide 18]
Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized
Branch divergence and long latency operations cause them to be underutilized or unused
We proposed two mechanisms to alleviate this: the Large Warp Microarchitecture for branch divergence, and two-level scheduling for long latency operations
Improves performance by 19.1% over traditional GPU cores and increases the scope of applications that can run efficiently on a GPU
Questions