Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation
Ping Xiang, Yi Yang, Huiyang Zhou
The 20th IEEE International Symposium on High Performance Computer Architecture, Orlando, Florida, USA
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
Overview of GPU Architecture
[Figure: a streaming multiprocessor (SM) contains a register file, shared memory, a cache, control logic, and an array of ALUs; threads are grouped into warps, warps are grouped into thread blocks (TBs), and the SM connects to DRAM]
Motivation:
• Typically large TB size (e.g., 512 threads)
– More efficient data sharing/communication within a TB
– Limits the total number of concurrent TBs
[Figure: a register file holding two TBs; the leftover registers are too few for another full TB, so they sit unused — resource fragmentation]
Motivation: Warp-Level Divergence
[Figure: a TB with Warp1–Warp4; some warps have finished while the others are still running]
• Warps within the same TB do not finish at the same time
• Resources cannot be released promptly and sit unused
Outline
• Background
• Motivation
– Characterization
• Mitigation: WarpMan
• Experiments
• Conclusions
Characterization:
[Figure: two forms of resource underutilization. Spatial: after allocating whole TBs, the leftover registers in the register file go unused. Temporal: warps that have finished keep holding their resources while sibling warps in the same TB are still running]
Spatial Resource Underutilization
• Register resource as an example
[Figure: register file usage broken down by concurrent TB (TB1–TB8) for RS(2), HS(3), RAY(2), MM(5), NN(5), CT(7), MC(4), HG(3), ST(1), and GM (y-axis 0%–100%); annotated values of 17%, 28%, and up to 46% mark the spatially underutilized fraction]
Temporal Resource Underutilization
• Case study: Ray Tracing
– 6 warps per TB
– Study TB0 as an example
[Figure: "Warp Level Divergence for RAY" — number of live warps (0–5) in TB0 over cycles 0–25000; the warps finish at widely different times]
RTRU = 49.7%
RTRU: ratio of temporal resource underutilization
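The slides report RTRU only as a measured ratio; a hedged formalization (my reading of the metric, not the paper's verbatim definition): for a TB with N warps, where warp i finishes at time t_i after TB dispatch and the whole TB retires at t_TB = max_i t_i,

    RTRU = \frac{\sum_{i=1}^{N} (t_{TB} - t_i)}{N \cdot t_{TB}}

i.e., the fraction of the TB's warp-slot cycles held by already-finished warps. For RAY's TB0 above, roughly half of those slot cycles are wasted, consistent with RTRU = 49.7%.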
Why Is There Temporal Resource Underutilization?
• Input-dependent workload imbalance
– Same code, different input: e.g., "if (a < 123)"
• Program-dependent workload imbalance
– Code like "if (tid < 32)" (see the micro-benchmark kernel later)
• Memory divergence
– Some warps experience more cache hits than others
• Warp scheduling policy
– The scheduler prioritizes certain warps over others
Characterization: RTRU
[Figure: RTRU for CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM under the round-robin scheduling policy (y-axis 0%–90%)]
Outline
• Background
• Motivation
– Characterization
– Micro-benchmarking
• Mitigation: WarpMan
• Experiments
• Conclusions
Micro-benchmark
• Code runs on both a GTX 480 and a GTX 680

__global__ void TB_resource_kernel(..., bool call = false) {
    if (call) bloatOccupancy(start, size);
    ...  // declarations of tid, index, clock_offset, clock_count elided on the slide
    clock_t start_clock = clock();
    if (tid < 32) {                       // tid is the thread id within a TB
        clock_offset = 0;
        while (clock_offset < clock_count) {
            clock_offset = clock() - start_clock;
        }
    }
    clock_t end_clock = clock();
    d_o[index] = start_clock;             // index is the global thread id
    d_e[index] = end_clock;
}
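For context, a minimal host-side harness sketch showing how the per-thread timestamps could be collected and printed as on the next slide. The launch shape, buffer sizes, and the assumption that index = blockIdx.x * blockDim.x + threadIdx.x are mine, not the paper's exact harness:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const int numTBs = 512, tbSize = 192;   // assumed launch shape
    const int n = numTBs * tbSize;
    clock_t *d_o, *d_e;                     // per-thread start/end timestamps
    cudaMalloc(&d_o, n * sizeof(clock_t));
    cudaMalloc(&d_e, n * sizeof(clock_t));
    // TB_resource_kernel<<<numTBs, tbSize>>>(...);  // kernel from above, with
    // its elided parameters filled in
    cudaDeviceSynchronize();
    std::vector<clock_t> h_o(n), h_e(n);
    cudaMemcpy(h_o.data(), d_o, n * sizeof(clock_t), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_e.data(), d_e, n * sizeof(clock_t), cudaMemcpyDeviceToHost);
    for (int cta = 0; cta < numTBs; ++cta) {
        int first = cta * tbSize;           // first thread of warp 0 in this CTA
        printf("CTA %d Warp 0: start %lld, end %lld\n",
               cta, (long long)h_o[first], (long long)h_e[first]);
    }
    cudaFree(d_o); cudaFree(d_e);
    return 0;
}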
Micro-benchmarking
• Results:
> Using CUDA device [0]: GeForce GTX 480
> Detected Compute SM 3.0 hardware with 8 multi-processors.
…
CTA 250 Warp 0: start 80, end 81
CTA 269 Warp 0: start 80, end 81
CTA 272 Warp 0: start 80, end 81
CTA 283 Warp 0: start 80, end 81
CTA 322 Warp 0: start 80, end 81
CTA 329 Warp 0: start 80, end 81
…
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
WarpMan
[Figure: with TB-level resource management, an SM runs TB0 and TB1; Warp0–Warp2 of a TB have finished, but their resources stay allocated until the whole TB retires, so TB2 waits while resources sit unused]
Warp-Level Resource Management
[Figure: timeline comparison. Under TB-level resource management, TB2 cannot start on the SM until every warp of TB0 and TB1 has finished. Under WarpMan, each time a warp finishes and releases its resources, a warp from TB2 (Warp0, Warp1, then Warp2) is dispatched into the freed slot, so TB2's work completes earlier — the saved cycles]
WarpMan: Design
• Dispatch logic
– Traditional TB-level dispatching logic
– Added partial-TB dispatch logic (see the sketch after the dispatching slide)
• Workload buffer
– Stores dispatched but not yet running partial TBs
Dispatching
[Figure: dispatch flow. For the workload to be dispatched, a TB-level resource check (registers, shared memory, a TB entry, and warp entries for a full TB) dispatches a full TB; if that fails, a warp-level resource check (resources required for one warp) dispatches a partial TB]
The shared memory is still allocated at the TB level.
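A minimal C++-style sketch of that two-step check, following the flow on the slide. All names and the plain-counter resource accounting are hypothetical illustrations, not the paper's hardware design:

// Hypothetical sketch of WarpMan's two-level dispatch decision.
#include <algorithm>

enum class Dispatch { None, FullTB, PartialTB };

Dispatch tryDispatch(int freeRegs, int freeSmemBytes, int freeTBEntries,
                     int freeWarpEntries,                        // SM state
                     int tbWarps, int regsPerWarp, int tbSmemBytes,  // TB needs
                     int &warpsDispatched) {
    // TB-level check (traditional path): everything needed for a full TB.
    if (freeRegs >= regsPerWarp * tbWarps && freeSmemBytes >= tbSmemBytes &&
        freeTBEntries >= 1 && freeWarpEntries >= tbWarps) {
        warpsDispatched = tbWarps;
        return Dispatch::FullTB;
    }
    // Warp-level check (WarpMan): dispatch as many warps as currently fit.
    // Shared memory is still allocated at the TB level, so even a partial
    // TB must reserve the TB's full shared-memory allocation.
    int fit = std::min(freeWarpEntries,
                       regsPerWarp > 0 ? freeRegs / regsPerWarp : 0);
    fit = std::min(fit, tbWarps);
    if (fit >= 1 && freeTBEntries >= 1 && freeSmemBytes >= tbSmemBytes) {
        warpsDispatched = fit;   // remaining warps go to the workload buffer
        return Dispatch::PartialTB;
    }
    return Dispatch::None;
}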
Workload Buffer
• Stores dispatched but not yet running partial TBs; each entry holds:
– Hardware TB id (assigned by the hardware): 3 bits
– Software TB id (defined by the software): 26 bits
– Start warp id: 5 bits
– End warp id: 5 bits
– Valid bit: 1 bit
Total: 40 bits per entry
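As one concrete (assumed) encoding of such a 40-bit entry, a C-style bitfield sketch — only the field widths come from the slide; the ordering and packing are my illustration:

#include <cstdint>

// Hypothetical layout of one workload-buffer entry (40 bits of state).
struct WorkloadBufferEntry {
    uint64_t hwTBId    : 3;   // hardware TB id; 3 bits covers the max 8 TBs/SM
    uint64_t swTBId    : 26;  // software TB id (the CUDA block index)
    uint64_t startWarp : 5;   // first warp of the TB not yet dispatched
    uint64_t endWarp   : 5;   // last warp of the TB
    uint64_t valid     : 1;   // set while a partial TB is waiting to run
};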
Workload Buffer
• Stores the dispatched but not yet running TB
[Figure: example. The SM is running TB117 and TB118 with some unused resources; TB120 is dispatched as a partial TB. The workload buffer entry records TB number 120 together with its start warp id, end warp id, and valid bit (values shown: 120, 0, 2, 1). Warp0 and Warp1 of TB120 start in the freed slots, and Warp2 of TB120 follows once another warp finishes]
Outline
• Background
• Motivation
• Mitigation: WarpMan
• Experiments
• Conclusions
Methodology
• Use GPUWattch for both timing and energy evaluation
• Baseline architecture (GTX480):
– 15 SMs with a SIMD width of 32, running at 1.4 GHz
– Max 8 TBs per SM; max 1536 threads per SM
– Scheduling policy: round-robin / two-level
– 16KB L1 cache, 48KB shared memory, 128KB registers per SM
• Applications from:
– NVIDIA CUDA SDK
– Rodinia benchmark suite
– GPGPU-Sim
Performance Results
• temp: early-finished warps release their resources for new warps
• temp + spatial: resources are allocated/released at the warp level
• The performance improvement can be as high as 71% (temp) / 76% (temp + spatial)
• On average, 15.3% improvement
[Figure: normalized performance of temp and temp + spatial for CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM (y-axis 100%–150%); the tallest bars are labeled 171% and 176%]
Energy Results
[Figure: normalized energy consumption of temp and temp + spatial for CT, MC, RS, SN, HS, PF, SR, ST, RAY, MM, NN, BT, HG, and GM (y-axis 70%–100%)]
• The energy savings can be as high as over 20%, and 6% on average
A Software Alternative
• Change the software to use a smaller TB size
• Change the hardware to enable more concurrent TBs
• Drawbacks:
– Inefficient shared memory usage / synchronization
– Decreased data locality
– More as we proceed to the experimental results…
Comparing to the Software Alternative
• CT and ST: the software approach decreases L1 locality
• NN and BT: the software approach reduces the total number of threads
• On average: 25% improvement (temp + spatial) vs. 48% degradation (TBsize_32)
[Figure: performance of temp + spatial and TBsize_32 for CT, MC, ST, RAY, NN, BT, and GM (y-axis 0%–180%); the GM bars are 125% and 52%]
Related Work
• Resource underutilization due to branch divergence or thread-level divergence has been well studied.
• Yi Yang et al. [PACT-21] target shared memory resource management; their scheme is complementary to our proposed WarpMan.
• D. Tarjan et al. [US Patent, 2009] propose using a virtual register table to manage the physical register file and enable more concurrent TBs.
Conclusions
• We highlight the limitations of TB-level resource management
• We characterize warp-level divergence and reveal the fundamental reasons for such divergent behavior
• We propose WarpMan and show that it can be implemented with minor hardware changes
• We show that our proposed solution is highly effective, achieving significant performance improvements and energy savings

Questions?