Sandeep Navada © 2013
A Unified View of Non-monotonic Core Selection and
Application Steering in Heterogeneous Chip
Multiprocessors
Sandeep Navada, Niket K. Choudhary,
Salil Wadhavkar, Eric Rotenberg
Department of Electrical and Computer Engineering
North Carolina State University
Single-ISA HCMP
• Same ISA
• Different microarchitectures
– Superscalar width
– Structure sizes
– Frequency
• Cores have different performance and power
• New run-time optimization lever
Monotonic HCMP
• Cores can be ranked independent of application
• Core 1 faster than Core 2 for any application
[Figure: performance of Core 1 and Core 2 across applications A, B, C, D]
Monotonic HCMP example
HCMP literature
• Focus
– Monotonic cores
– Cores are preordained
– Scheduling
• Single thread
– Minimize energy for given performance degradation threshold w.r.t. highest ranked core
• Multiple threads
– Maximize throughput/Watt/mm²
Going beyond monotonic HCMP
• Cores can’t be ranked independent of application
• Cores designed from ground-up, not pre-existing
[Figure: performance of Core 1 and Core 2 across applications A, B, C, D]
Non-monotonic HCMP
High-contention scenario (Optimize throughput)
Kumar, et al., Core Architecture Optimization for Single-ISA Heterogeneous Multiprocessors
Low-contention scenario (Optimize latency)
Our work
Optimize latency
Performance = IPC × frequency
Complexity↑ => IPC↑, frequency↓
[Figure: IPC, frequency, and performance plotted against complexity, one panel for App A and one for App B]
This tradeoff plays out differently for different apps and is dependent on the ILP characteristics of the app
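A minimal sketch of how the IPC/frequency tradeoff can rank cores differently per application. The two cores and all IPC values are hypothetical, chosen only to illustrate the effect; they are not taken from the slides.

```python
# Hypothetical cores: "complex" has a longer clock period but can extract more IPC.
# All numbers below are illustrative assumptions, not measured data.
CORES = {
    "simple":  {"period_ns": 0.5, "ipc": {"App A": 1.0, "App B": 1.0}},
    "complex": {"period_ns": 0.8, "ipc": {"App A": 2.4, "App B": 1.2}},
}


def bips(ipc, period_ns):
    """Performance = IPC x frequency. With the period in ns, 1/period is GHz,
    so the result is in billions of instructions per second (BIPS)."""
    return ipc / period_ns


def best_core(app):
    """Return the core name with the highest BIPS for the given application."""
    return max(CORES, key=lambda c: bips(CORES[c]["ipc"][app], CORES[c]["period_ns"]))


for app in ("App A", "App B"):
    print(app, "->", best_core(app))
```

With these assumed numbers App A (high ILP) prefers the complex core while App B prefers the simple one, so neither core can be ranked above the other independent of application.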
Non-monotonic HCMP challenges
Core Selection
How to pick the core types comprising the heterogeneous design?
Application Steering
How to steer the applications to the best core?
CORE SELECTION
Core design space
Parameter | Value range | Number of values
Front end width | 2, 3, 4, 5, 6, 7, 8 | 7
Issue width | 2, 3, 4, 5, 6, 7, 8 | 7
Physical register file size | 64, 128, 192, 256, 384, 512 | 6
Issue queue size | 16, 24, 32, 48, 64, 96, 128 | 7
Load queue/Store queue size | 8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64 | 8
L1 I$ size | 8, 16, 32, 64, 128 KB | 5
L1 D$ size | 8, 16, 32, 64, 128 KB | 5
L2$ size | 2 MB | 1
Clock period | 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns | 8
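As a quick check on the size of this design space, the sketch below multiplies the number of values per parameter. The dictionary keys are names chosen here for illustration; the value lists are copied from the table.

```python
from math import prod

# Parameter value lists copied from the design-space table.
# Key names are illustrative, not identifiers from the authors' tools.
DESIGN_SPACE = {
    "front_end_width": [2, 3, 4, 5, 6, 7, 8],
    "issue_width": [2, 3, 4, 5, 6, 7, 8],
    "phys_reg_file": [64, 128, 192, 256, 384, 512],
    "issue_queue": [16, 24, 32, 48, 64, 96, 128],
    "lq_sq": [8, 16, 24, 32, 40, 48, 56, 64],   # load and store queues scale together
    "l1_icache_kb": [8, 16, 32, 64, 128],
    "l1_dcache_kb": [8, 16, 32, 64, 128],
    "l2_cache_mb": [2],
    "clock_period_ns": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
}

# Unconstrained cross product of all parameter choices.
total_points = prod(len(values) for values in DESIGN_SPACE.values())
print(total_points)
```

The unconstrained cross product is about 3.3 million design points, which suggests why the space has to be reduced before every phase can be evaluated on every design point.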
Core selection
[Figure: core selection flow]
• Core design space → Pruning script → Pruned design space
• SPEC bench → SimPoint tool → 39 10M phases
• FabScalar toolset → IPC, freq, power → Performance of every phase on every design point
• Search N=1, N=2, N=3, N=4 HCMP → Optimal 1-, 2-, 3-, 4-core-type HCMP
N: Number of core types
Search for Optimal 4-core-type HCMP

BIPS of each phase on each core type
Phase | A | B | C | D | E | F | G | H
1 | 1.5 | 3.2 | 1.3 | 2.2 | 1.6 | 1.7 | 1.3 | 2.0
2 | 0.5 | 2.3 | 2.5 | 1.9 | 3.1 | 1.8 | 2.0 | 1.2

Core 1 | Core 2 | Core 3 | Core 4 | Performance
A | B | C | D | HMEAN(3.2, 2.5) = 2.81
E | B | C | D | HMEAN(3.2, 3.1) = 3.15
A | F | C | D | HMEAN(2.2, 2.5) = 2.34
E | F | C | D | HMEAN(2.2, 3.1) = 2.57
E | F | G | H | HMEAN(2.0, 3.1) = 2.43
…
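The scoring in this example can be reproduced with a short sketch: each phase is credited with the best BIPS among the chosen core types, and the HCMP score is the harmonic mean over phases. The exhaustive enumeration over combinations is an assumption made here for illustration; the slides do not state how the search is implemented.

```python
from itertools import combinations
from statistics import harmonic_mean

CORE_TYPES = "ABCDEFGH"

# BIPS of each phase on each core type, copied from the example table.
BIPS = {
    1: dict(zip(CORE_TYPES, [1.5, 3.2, 1.3, 2.2, 1.6, 1.7, 1.3, 2.0])),
    2: dict(zip(CORE_TYPES, [0.5, 2.3, 2.5, 1.9, 3.1, 1.8, 2.0, 1.2])),
}


def hcmp_score(chosen):
    """Harmonic mean over phases of the best BIPS any chosen core type achieves."""
    return harmonic_mean([max(row[c] for c in chosen) for row in BIPS.values()])


def search(n):
    """Exhaustively score every n-core-type combination (illustrative assumption)."""
    return max(combinations(CORE_TYPES, n), key=hcmp_score)


for combo in ("ABCD", "EBCD", "AFCD", "EFCD", "EFGH"):
    print(combo, round(hcmp_score(combo), 2))
print("best 4-core-type HCMP:", search(4))
```

In this two-phase example any combination containing both B and E ties for the best score, because B is the fastest core type for phase 1 and E for phase 2.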
Kiviat diagram
• Visualize core parameters
[Figure: Kiviat diagram with three axes. Frequency: higher frequency. Window: larger structures. Width: increase superscalar width.]
Optimal 1-core-type HCMP
[Figure: Kiviat diagram with Frequency, Window, and Width axes, showing core A]
“A” core is an average core which strikes a good balance between IPC and frequency.
Optimal 2-core-type HCMP
[Figure: Kiviat diagram with Frequency, Window, and Width axes, showing cores A and LW]
“A” core is still selected!
“LW” core targets window and width bottlenecks in “A” core.
LW: LARGER, WIDER
Optimal 3-core-type HCMP
[Figure: Kiviat diagram with Frequency, Window, and Width axes, showing cores A, LW, and N]
“A” core is still selected!!
“LW” core is still selected.
“N” core targets frequency bottleneck.
Optimal 4-core-type HCMP
[Figure: Kiviat diagram with Frequency, Window, and Width axes, showing cores A, L, W, and N]
“A” and “N” are selected, again.
“LW” got split into “L” and “W”, addressing each bottleneck better!
LW split
[Figure: Kiviat diagram with Frequency, Window, and Width axes, showing core A, the LW core, and the L and W cores]
Optimal HCMP
The optimal HCMP consists of
1. Average core which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

Core Type | Clock Period (ns) | ILP-extracting buffers | Widths | Caches (KB)
A | 0.6 | 32, 128, 128 | 3, 4 | 64, 64
N | 0.5 | 32, 64, 64 | 2, 2 | 16, 16
L | 0.7 | 48, 128, 384 | 4, 4 | 128, 128
W | 0.7 | 32, 128, 128 | 6, 6 | 128, 32
APPLICATION STEERING
Bottleneck-driven steering
• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to a different core when bottlenecks change
– To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
– To the average core if no accelerator meets this condition, or if there are no bottlenecks
Bottleneck-driven steering
Track performance counters
Diagnose bottlenecks
Steer phase
Track performance counters
Counter | Description
Width_ctr | Ready instruction not issued due to limited issue width.
Window_ctr | Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr | Instruction stalled due to instruction cache miss.
D$_ctr | Load instruction stalled due to data cache miss.
Misp_ctr | Mispredicted branch.
L2_ctr | Instruction stalled due to L2 cache miss.
Cycle_ctr | Number of cycles.
Diagnose bottlenecks
• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck
Diagnose bottlenecks
Bottleneck | Expression
Width | bool Width = (Width_ctr > Width_thresh)
Window | bool Window = (Window_ctr > Window_thresh)
Frequency | bool Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
I$ | bool I$ = (I$_ctr > I$_thresh)
D$ | bool D$ = (D$_ctr > D$_thresh)
Thresholds are determined empirically using a training process
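A minimal sketch of the diagnosis step: counters are normalized by the cycle count and compared against thresholds, following the expressions above. The threshold values and the sample counter readings are hypothetical placeholders, since the slides say only that thresholds come from a training process.

```python
# Hypothetical thresholds (fraction of cycles); real values come from training.
THRESHOLDS = {"Width": 0.30, "Window": 0.25, "Misp": 0.01,
              "L2": 0.05, "I$": 0.05, "D$": 0.10}


def diagnose(counters):
    """Return a bottleneck -> bool map for one 10K-instruction interval.

    `counters` maps the counter names from the slide (e.g. "Width_ctr")
    to raw counts and must include "Cycle_ctr".
    """
    cycles = counters["Cycle_ctr"]
    # Normalize each counter with respect to the cycle count, then threshold.
    over = {name: counters[f"{name}_ctr"] / cycles > limit
            for name, limit in THRESHOLDS.items()}
    return {
        "Width": over["Width"],
        "Window": over["Window"],
        # Frequency is flagged by branch mispredictions or L2 misses.
        "Frequency": over["Misp"] or over["L2"],
        "I$": over["I$"],
        "D$": over["D$"],
    }


# Made-up counter readings for one interval.
sample = {"Width_ctr": 4000, "Window_ctr": 500, "I$_ctr": 100, "D$_ctr": 200,
          "Misp_ctr": 20, "L2_ctr": 50, "Cycle_ctr": 8000}
print(diagnose(sample))
```

For the made-up sample, only the width counter exceeds its threshold (4000/8000 = 0.5 > 0.30), so Width is the only diagnosed bottleneck.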
Steer phase
Core | Bottlenecks relieved | Bottlenecks worsened | Steering logic
W | Width | Frequency | if (Width && !Frequency) W
L | Window | Frequency | else if (Window && !Frequency) L
N | Frequency | Width, Window | else if (Frequency && !(Width || Window)) N
A | n/a | n/a | else A
Paper shows full steering logic with I$ and D$ bottlenecks included.
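The steering logic can be read as an instance of the general rule "pick an accelerator that relieves a diagnosed bottleneck and worsens none, else the average core". The sketch below encodes both forms and checks that they agree for every combination of the three bottlenecks shown here. It omits I$ and D$, and the function names are chosen for illustration.

```python
from itertools import product

# (core, bottlenecks relieved, bottlenecks worsened), in the slide's order.
ACCELERATORS = [
    ("W", {"Width"}, {"Frequency"}),
    ("L", {"Window"}, {"Frequency"}),
    ("N", {"Frequency"}, {"Width", "Window"}),
]


def steer_by_rule(diagnosed):
    """General rule: first accelerator that relieves a diagnosed bottleneck
    and does not worsen any diagnosed bottleneck; otherwise the average core."""
    for core, relieved, worsened in ACCELERATORS:
        if relieved & diagnosed and not (worsened & diagnosed):
            return core
    return "A"


def steer_by_slide(width, window, frequency):
    """The if/else chain exactly as written on the slide."""
    if width and not frequency:
        return "W"
    elif window and not frequency:
        return "L"
    elif frequency and not (width or window):
        return "N"
    else:
        return "A"


# Cross-check the two formulations over all 8 bottleneck combinations.
for width, window, frequency in product([False, True], repeat=3):
    flags = {"Width": width, "Window": window, "Frequency": frequency}
    diagnosed = {name for name, flag in flags.items() if flag}
    assert steer_by_rule(diagnosed) == steer_by_slide(width, window, frequency)
```

Whenever Frequency is diagnosed together with Width or Window, every accelerator would worsen some diagnosed bottleneck, so both forms fall back to the A core.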
RESULTS
Methodology
• Benchmarks: SPEC 2000
– Simulate first 4 billion instructions
• Metrics
– Performance: BIPS
– Efficiency: BIPS³/Watt
• Migration overhead
– Default: 100 cycles
– Sensitivity study: 1K, 10K cycles
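A small sketch of the efficiency metric, showing that BIPS³/Watt is the reciprocal of a per-instruction energy × delay² product. The input numbers are hypothetical and the function names are chosen here for illustration.

```python
def bips3_per_watt(bips, watts):
    """Efficiency metric from the methodology slide: BIPS cubed per watt."""
    return bips ** 3 / watts


def energy_delay_squared(bips, watts):
    """Per-instruction energy (nJ) times per-instruction delay (ns) squared."""
    energy_nj = watts / bips   # W / (1e9 instr/s) = nJ per instruction
    delay_ns = 1.0 / bips      # ns per instruction
    return energy_nj * delay_ns ** 2


# Hypothetical operating point: 2 BIPS at 4 W.
print(bips3_per_watt(2.0, 4.0), energy_delay_squared(2.0, 4.0))

# A core that is 10% faster but draws 20% more power still scores higher.
print(bips3_per_watt(1.1, 1.2) > bips3_per_watt(1.0, 1.0))
```

Because performance enters cubed, the metric favors designs that buy speed with a moderate increase in power.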
Steering algorithms
Algorithm | Description
Baseline | Run the entire 4B instructions on the average core
Sampling | Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck | Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal | Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle | Run every 10K instruction segment on the best core type
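To make the difference between Baseline, Optimal, and Oracle concrete, here is a sketch over a tiny per-segment BIPS trace. The trace values are hypothetical, migration overhead is ignored, and starting Optimal on the A core for the first segment is an assumption made here.

```python
SEGMENT_INSTRUCTIONS = 10_000

# Hypothetical BIPS of each 10K-instruction segment on each core type.
TRACE = [
    {"A": 2.0, "N": 2.4, "L": 1.8, "W": 1.9},
    {"A": 2.0, "N": 2.5, "L": 1.7, "W": 1.9},
    {"A": 2.1, "N": 1.6, "L": 2.6, "W": 2.2},
    {"A": 2.1, "N": 1.5, "L": 2.7, "W": 2.3},
]


def best(segment):
    """Core type with the highest BIPS for one segment."""
    return max(segment, key=segment.get)


def schedule(policy):
    """Return the core type used for each segment under the given policy."""
    cores = []
    for i, segment in enumerate(TRACE):
        if policy == "baseline":
            cores.append("A")
        elif policy == "optimal":
            # Best core of the prior segment; assume A for the first segment.
            cores.append(best(TRACE[i - 1]) if i > 0 else "A")
        elif policy == "oracle":
            cores.append(best(segment))
        else:
            raise ValueError(f"unknown policy: {policy}")
    return cores


def time_ns(policy):
    """Total execution time; instructions / BIPS gives nanoseconds."""
    return sum(SEGMENT_INSTRUCTIONS / segment[core]
               for segment, core in zip(TRACE, schedule(policy)))


for policy in ("baseline", "optimal", "oracle"):
    print(policy, schedule(policy), round(time_ns(policy)))
```

In this made-up trace, Optimal loses time at the point where the best core changes from N to L, because it reacts one segment late, while Oracle switches immediately.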
4-core-type HCMP
• 4-core-type HCMP outperforms homogeneous CMP by up to 76%, and by 15% on average
• Our steering algorithm is able to capture most of this gain
Sampling vs. bottleneck steering
Sampling performs 8.9% better than the average core
Bottleneck steering performs 12% better than the average core
Occupancy
Occupancy pattern varies dramatically across different applications
Efficiency
Sampling performs 25% better than the average core
Bottleneck steering performs 33% better than the average core
SUMMARY
Summary
• First proposal to architect and orchestrate multiple core types for latency reduction.
• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.
• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.
Future work
• HCMPs open up a whole new direction of microarchitecture research.
• Many microarchitecture optimizations don’t provide universal benefits.
• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.