The Performance Potential for Single Application Heterogeneous Systems

DESCRIPTION

Henry Wong* and Tor M. Aamodt§. *University of Toronto, §University of British Columbia. PowerPoint presentation transcript.

TRANSCRIPT
The Performance Potential for Single Application Heterogeneous Systems
Henry Wong* and Tor M. Aamodt§
*University of Toronto, §University of British Columbia
Intuition suggests integrating parallel and sequential cores on a single chip should provide performance benefits by lowering communication overheads.
This work: a limit study of heterogeneous architecture performance when running a single general-purpose application.
Two main results:
• Single-thread performance (read-after-write latency) of GPUs must improve for GPUs to accelerate a wider set of non-graphics workloads.
• Putting the CPU and accelerator on a single chip does not seem to improve performance "much" versus a separate CPU and accelerator.
Outline
• Introduction
• Background:
  - GPU Computing / Heterogeneous
  - Barrel processing (relevant to GPUs)
• Limit Study Model
  - Sequential and Parallel Models
  - Dynamic programming algorithm
  - Modeling Bandwidth
• Results
Graphics Processing Unit (GPU)
(Diagram: polygons, textures, and lights as inputs to the GPU.)
Programmable GPU
• Rendering pipeline
• Polygons go in
• Pixels come out
• DX10 has 3 programmable stages
GPU/Stream Computing
• Use shader processors without rendering pipeline
• C-like high-level language for convenience
Separate GPU + CPU
• Off-chip latency
• Copy data between memory spaces
Single-Chip
• Lower latency
• Single memory address space: Share data, don't copy
Sequential Performance of Parallel Processor
• Contemporary GPUs have slow single thread performance.
• “Designed for cache miss” => use “barrel processing” to hide off-chip latency.
• This impacts minimum read-to-write latency for a single thread.
• Not an issue if you have 10⁶ pixels, each requiring a 100-instruction thread.
Sequential Performance of Parallel Processor
• GPUs can do many operations per clock cycle
• Nvidia G80 needs 3072 independent instructions every 24 clocks to keep pipelines filled
• Can model G80 as executing up to 3072 independent scalar instructions every 24 clocks
• For a single thread, a CPU produces results ~100x faster:
• 2 IPC * 2x clock speed * 24-clock instruction latency ≈ 96x
• Parallel Instruction Latency = ratio of read-to-write latency of dependent instructions on parallel processor (measured in CPU clock cycles) to CPU CPI.
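The slide's back-of-envelope estimate can be written out explicitly. This is a sketch with the assumed numbers from the talk (24-clock G80 read-after-write latency, a CPU at 2 IPC and twice the clock), not measured data; the function name is illustrative:

```python
def parallel_instruction_latency(gpu_raw_latency_gpu_clocks,
                                 cpu_ipc, cpu_clock_ratio, cpu_cpi=1.0):
    """Read-after-write latency of the parallel processor, expressed as a
    multiple of the sequential core's time per instruction (its CPI)."""
    # Convert the GPU's dependent-instruction latency into CPU clocks...
    latency_cpu_clocks = gpu_raw_latency_gpu_clocks * cpu_clock_ratio
    # ...then into CPU instruction slots (how many results the CPU retires
    # in that time).
    return latency_cpu_clocks * cpu_ipc / cpu_cpi

# G80: 24 GPU clocks; CPU assumed 2 IPC at 2x the GPU clock.
ratio = parallel_instruction_latency(24, cpu_ipc=2, cpu_clock_ratio=2)
print(ratio)  # 96.0, i.e. the slide's "~100x"
```

The product 2 × 2 × 24 = 96 is where the "~100x" figure on the slide comes from.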
Limit Study
• Optimistic abstract model of GPU and CPU
• “ILP limit study”-type trace analysis with optimistic assumptions.
• Assume constant CPI (=1.0) for sequential core.
• Parallel processor is an ideal dataflow processor, but with a read-after-write latency that is some multiple of the sequential core's cycle time.
• Parallel processor has unlimited parallelism
• Optimally schedule instructions on cores using dynamic programming algorithm.
Trace Analysis Assumptions
• Perfect branch prediction
• Perfect memory disambiguation
• Remove stack-pointer dependencies
• Remove induction variable dependencies by removing all instructions that depend (dynamically) only on compile time constants.
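The last assumption can be sketched as a single forward pass over the trace. This is a simplification with illustrative names (the study works on x86-64 micro-ops, not this toy register form):

```python
def strip_constant_dependent(trace):
    """trace: list of (dest, sources) per dynamic instruction.
    Drops every instruction whose sources are all (transitively derived
    from) compile-time constants, e.g. induction-variable updates."""
    constant = set()  # registers currently holding constant-derived values
    kept = []
    for dest, sources in trace:
        if all(s in constant for s in sources):
            constant.add(dest)       # result is constant-derived: remove it
        else:
            constant.discard(dest)   # overwritten with a data-dependent value
            kept.append((dest, sources))
    return kept

# i = 0 and i = i + 1 are removed; the data-dependent load survives.
trace = [("i", []), ("i", ["i"]), ("r1", ["i", "mem"])]
print(strip_constant_dependent(trace))  # [('r1', ['i', 'mem'])]
```

Note how the constant set grows transitively: once `i` is known constant-derived, the increment that reads only `i` is constant-derived too.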
Scheduling a Trace
Dynamic Programming
• Switching between processors takes time.
• Find the optimal schedule by decomposing the problem: the optimal solution to a subproblem is used to build the optimal solution to the larger problem.
• Input: trace of N instructions.
• Output: optimum (minimum) number of cycles required to execute on the abstract heterogeneous processor model.
(Diagram: the trace partitioned into alternating serial and parallel segments of instructions.)
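The slide's dynamic program can be sketched as a two-state recurrence. This is a simplification: per-instruction costs are given up front here, whereas the paper's model derives them from dataflow dependencies and read-after-write latency; all names are illustrative:

```python
def min_cycles(trace, switch_cost):
    """trace: list of (cost_on_sequential, cost_on_parallel) per instruction.
    Returns the minimum cycles over all ways of partitioning the trace into
    alternating sequential/parallel segments, paying switch_cost per switch."""
    # best[m]: fewest cycles to execute the prefix so far, ending in mode m
    best = {"seq": 0, "par": 0}
    for cost_seq, cost_par in trace:
        best = {
            # stay in the mode, or pay switch_cost to arrive from the other one
            "seq": min(best["seq"], best["par"] + switch_cost) + cost_seq,
            "par": min(best["par"], best["seq"] + switch_cost) + cost_par,
        }
    return min(best.values())

# A parallel-friendly tail makes switching worthwhile only if switches are cheap.
trace = [(1, 3), (1, 3), (5, 1)]
print(min_cycles(trace, switch_cost=2))    # 5: run the tail on the parallel core
print(min_cycles(trace, switch_cost=100))  # 7: switching never pays off
```

Each instruction extends the optimal prefix schedule, which is exactly the subproblem decomposition the slide describes.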
Bandwidth
• Latency of a mode switch depends on the amount of data produced on the old processor and consumed on the new one.
• Transfers are scheduled earliest-deadline-first.
• Simple bandwidth model, e.g., max 32 bits every 8 cycles; computation may overlap communication.
• Iterative model: use the average mode-switch latency from the last iteration as the fixed mode-switch latency for the next iteration. Results are based on the actual implied latency of the last iteration.
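The bandwidth budget can be read as a windowed link. A minimal sketch with the slide's example numbers (32 bits every 8 cycles); the paper's scheduler additionally overlaps these cycles with computation, which this sketch ignores:

```python
import math

def transfer_cycles(bits, bits_per_window=32, cycles_per_window=8):
    """Cycles to move `bits` across the CPU/accelerator link under a
    budget of at most 32 bits every 8 cycles. A partially used window
    still costs a full window of cycles."""
    windows = math.ceil(bits / bits_per_window)
    return windows * cycles_per_window

print(transfer_cycles(64))  # 16: two full 32-bit windows
print(transfer_cycles(33))  # 16: one bit spills into a second window
```

Under the iterative model, the average of these transfer latencies observed in one scheduling pass becomes the fixed mode-switch latency assumed by the next pass.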
Experiment Setup
• PTLSim (x86-64): micro-op traces
• SimPoint (phase classification): ~12 x 10M-instruction segments
• Benchmarks: SPEC 2000, PhysicsBench, SimpleScalar (used as a benchmark), microbenchmarks
Average Parallelism
As in prior ILP limit studies: lots of parallelism.
Instructions Scheduled on Parallel Cores
As the parallel processor's sequential performance gets worse, more instructions are scheduled on the sequential core.
Parallelism on Parallel Processor
As the parallel processor's sequential performance gets worse, the work scheduled on the parallel core needs to be more parallel.
Speedup over Sequential Core
Applications exist with enough parallelism to fully utilize GPU function units.
Speedup over Sequential Core
“General Purpose” Workloads: Performance limited by sequential performance (read-after-write latency) of parallel cores.
Slowdown of infinite communication cost (NoSwitch)
Up to 5x performance improvement versus infinite cost. Communication cost matters most at GPU-like parallel instruction latency. So, put them on the same chip?
Slowdown due to 100,000 cycles of mode-switch latency
Can achieve 85% of the performance of single-chip with large (but not infinite) mode switch latency.
Mode Switches
Number of mode switches decreases with increasing mode switch cost.
More mode switches occur at intermediate values of parallel instruction latency.
(Curves shown for mode-switch costs of zero, 10, and 1000 cycles.)
PCI Express-like Bandwidth (and Latency)
1.07x to 1.48x performance improvement if latency is reduced to zero and bandwidth made infinite. Less improvement if parallel instruction latency is reduced, e.g., with a better accelerator architecture.
Conclusions & Caveats
• GPUs could tackle more general-purpose applications if single-thread performance were better.
• Performance improvement due to integrating CPU and accelerator on single chip (versus separate CPU and accelerator) does not appear staggering. Bandwidth has greater impact than latency.
• Caveats:
• It’s a limit study.
• Heterogeneous may still make sense for other reasons, e.g., if it is cheaper to add parallel cores than another chip (sockets, power, etc.).
Future Work
• Control dependence analysis
• Model interesting design points in more detail
Bandwidth sensitivity for GPU-like parallel instruction latency
Proportion of instructions on parallel processor
Slowdown of infinite communication
Twophase shows strong sensitivity to communication latency across widely varying parallel instruction latencies.