
Sandeep Navada © 2013

A Unified View of Non-monotonic Core Selection and Application Steering in Heterogeneous Chip Multiprocessors

Sandeep Navada, Niket K. Choudhary, Salil Wadhavkar, Eric Rotenberg

Department of Electrical and Computer Engineering

North Carolina State University

Single-ISA HCMP

• Same ISA
• Different microarchitectures
  – Superscalar width
  – Structure sizes
  – Frequency
• Cores have different performance and power
• New run-time optimization lever

Monotonic HCMP

• Cores can be ranked independent of application
• Core 1 faster than Core 2 for any application

Monotonic HCMP example

[Figure: Performance of Core 1 and Core 2 across applications A, B, C, and D]

HCMP literature

• Focus
  – Monotonic cores
  – Cores are preordained
  – Scheduling
• Single thread
  – Minimize energy for given performance degradation threshold w.r.t. highest ranked core
• Multiple threads
  – Maximize throughput/Watt/mm^2

Going beyond monotonic HCMP

• Cores can’t be ranked independent of application
• Cores designed from ground-up, not pre-existing

Non-monotonic HCMP

[Figure: Performance of Core 1 and Core 2 across applications A, B, C, and D]

• High-contention scenario (optimize throughput): Kumar, et al., Core Architecture Optimization for Single-ISA Heterogeneous Multiprocessors
• Low-contention scenario (optimize latency): our work

Optimize latency

• Performance = IPC × frequency
• Complexity↑ => IPC↑, frequency↓

[Figure: IPC, frequency, and performance versus complexity, shown for App A and App B]

This tradeoff plays out differently for different apps and is dependent on the ILP characteristics of the app.
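The tradeoff can be illustrated with a minimal sketch. All numbers below are hypothetical and chosen only to show the effect: the three frequencies correspond to clock periods of 0.5, 0.625, and 0.8 ns, and the IPC values for the two made-up applications are not measurements from this work.

```python
# Hypothetical core frequencies in GHz: more complex cores clock slower.
FREQ_GHZ = {"simple": 2.0, "medium": 1.6, "complex": 1.25}

# Hypothetical IPC per app and core: IPC rises with complexity, but how much
# it rises depends on the ILP available in the app.
IPC = {
    "high_ilp_app": {"simple": 0.8, "medium": 1.4, "complex": 2.0},
    "low_ilp_app": {"simple": 0.7, "medium": 0.8, "complex": 0.85},
}


def bips(app, core):
    """Performance = IPC x frequency (billions of instructions per second)."""
    return IPC[app][core] * FREQ_GHZ[core]


def best_core(app):
    """Return the core with the highest performance for this app."""
    return max(FREQ_GHZ, key=lambda core: bips(app, core))


for app in IPC:
    print(app, {core: round(bips(app, core), 2) for core in FREQ_GHZ}, best_core(app))
```

With these made-up numbers the high-ILP app gains enough IPC to prefer the complex core, while the low-ILP app gains too little and prefers the simple, fast core. This is the non-monotonic behavior described above.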

Non-monotonic HCMP challenges

• Core Selection: How to pick the core types comprising the heterogeneous design?
• Application Steering: How to steer the applications to the best core?

CORE SELECTION


Core design space


Parameter                      Value Range                                              Number
Front end width                2, 3, 4, 5, 6, 7, 8                                      7
Issue width                    2, 3, 4, 5, 6, 7, 8                                      7
Physical register file size    64, 128, 192, 256, 384, 512                              6
Issue queue size               16, 24, 32, 48, 64, 96, 128                              7
Load queue/Store queue size    8/8, 16/16, 24/24, 32/32, 40/40, 48/48, 56/56, 64/64     8
L1 I$ size                     8, 16, 32, 64, 128 KB                                    5
L1 D$ size                     8, 16, 32, 64, 128 KB                                    5
L2$ size                       2 MB                                                     1
Clock period                   0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2 ns                8
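As a rough sense of scale, the unpruned design space is the product of the per-parameter counts in the table. The sketch below just multiplies them out; the dictionary keys are illustrative names, not identifiers from the toolset.

```python
from math import prod

# Value ranges copied from the design space table (key names are illustrative).
DESIGN_SPACE = {
    "front_end_width": [2, 3, 4, 5, 6, 7, 8],
    "issue_width": [2, 3, 4, 5, 6, 7, 8],
    "physical_register_file": [64, 128, 192, 256, 384, 512],
    "issue_queue": [16, 24, 32, 48, 64, 96, 128],
    "load_store_queue": [8, 16, 24, 32, 40, 48, 56, 64],
    "l1_icache_kb": [8, 16, 32, 64, 128],
    "l1_dcache_kb": [8, 16, 32, 64, 128],
    "l2_cache_mb": [2],
    "clock_period_ns": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2],
}

# Number of distinct design points before any pruning.
total_points = prod(len(values) for values in DESIGN_SPACE.values())
print(total_points)  # 3292800
```

A space of roughly 3.3 million points is why a pruning step precedes the per-phase evaluation.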

Core selection


[Figure: Core selection flow]

• Core design space → Pruning script → Pruned design space
• SPEC bench → SimPoint tool → 39 10M phases
• Pruned design space and phases → FabScalar toolset (IPC, freq, power) → Performance of every phase on every design point
• Search N=1 HCMP → Optimal 1-core-type HCMP
• Search N=2 HCMP → Optimal 2-core-type HCMP
• Search N=3 HCMP → Optimal 3-core-type HCMP
• Search N=4 HCMP → Optimal 4-core-type HCMP

N: Number of core types


BIPS of each phase on each core type

Phases   A     B     C     D     E     F     G     H
1        1.5   3.2   1.3   2.2   1.6   1.7   1.3   2.0
2        0.5   2.3   2.5   1.9   3.1   1.8   2.0   1.2

Search for Optimal 4-core-type HCMP

Core 1   Core 2   Core 3   Core 4   Performance
A        B        C        D        HMEAN(3.2, 2.5) = 2.81
E        B        C        D        HMEAN(3.2, 3.1) = 3.15
A        F        C        D        HMEAN(2.2, 2.5) = 2.34
E        F        C        D        HMEAN(2.2, 3.1) = 2.57
E        F        G        H        HMEAN(2.0, 3.1) = 2.43
…
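The search in this example can be sketched as a brute-force enumeration. The sketch assumes, consistent with the rows above, that each phase runs on the best of the chosen core types and that an HCMP is scored by the harmonic mean of those per-phase BIPS values. The function names are illustrative, and the real search runs over the pruned design space and all 39 phases rather than this 8-by-2 toy table.

```python
from itertools import combinations
from statistics import harmonic_mean

# BIPS of each phase on each core type, copied from the example table.
BIPS = {
    1: {"A": 1.5, "B": 3.2, "C": 1.3, "D": 2.2, "E": 1.6, "F": 1.7, "G": 1.3, "H": 2.0},
    2: {"A": 0.5, "B": 2.3, "C": 2.5, "D": 1.9, "E": 3.1, "F": 1.8, "G": 2.0, "H": 1.2},
}
CORE_TYPES = "ABCDEFGH"


def hcmp_performance(cores):
    """Harmonic mean over phases of the best BIPS among the chosen cores."""
    return harmonic_mean([max(BIPS[phase][c] for c in cores) for phase in BIPS])


def search(n):
    """Exhaustively pick the n core types with the highest HCMP performance."""
    return max(combinations(CORE_TYPES, n), key=hcmp_performance)


print(round(hcmp_performance("ABCD"), 2))   # 2.81, matching the first row
print(search(2), round(hcmp_performance(search(2)), 2))
```

In this toy table a combination scores 3.15 only if it contains both B (best on phase 1) and E (best on phase 2), so adding more core types beyond those two does not help here.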

Kiviat diagram

• Visualize core parameters

[Figure: Kiviat diagram with three axes: Frequency (higher frequency), Window (larger structures), and Width (increase superscalar width)]

Optimal 1-core-type HCMP


[Figure: Kiviat diagram of core A on the Frequency, Window, and Width axes]

• “A” core is an average core which strikes a good balance between IPC and frequency.

[Figure: Kiviat diagram of the two selected core types, A and LW]

• “A” core is still selected!
• “LW” core (LARGER, WIDER) targets window and width bottlenecks in “A” core.

[Figure: Kiviat diagram of the three selected core types, A, LW, and N]

• “A” core is still selected!
• “LW” core is still selected.
• “N” core targets frequency bottleneck.



[Figure: Kiviat diagram of the four selected core types, A, L, W, and N]

• “A” and “N” are selected, again.
• “LW” got split into “L” and “W”, addressing each bottleneck better!

Optimal HCMP


The optimal HCMP consists of
1. Average core which is the best homogeneous core
2. Accelerator cores that relieve distinct bottlenecks in the average core

Core Type   Clock Period (ns)   ILP-extracting buffers   Widths   Caches (KB)
A           0.6                 32, 128, 128             3, 4     64, 64
N           0.5                 32, 64, 64               2, 2     16, 16
L           0.7                 48, 128, 384             4, 4     128, 128
W           0.7                 32, 128, 128             6, 6     128, 32


APPLICATION STEERING


Bottleneck-driven steering

• Application is continuously diagnosed for bottlenecks on the current core using performance counters
• Migrate to different core when bottlenecks change
  – To an accelerator core that relieves any diagnosed bottleneck and doesn’t worsen any diagnosed bottleneck
  – To the average core if no accelerator meets this condition, or if no bottlenecks

Bottleneck-driven steering


[Flow: Track performance counters → Diagnose bottlenecks → Steer phase]

Track performance counters


Counter       Description
Width_ctr     Ready instruction not issued due to limited issue width.
Window_ctr    Instruction not dispatched due to issue queue or reorder buffer full.
I$_ctr        Instruction stalled due to instruction cache miss.
D$_ctr        Load instruction stalled due to data cache miss.
Misp_ctr      Mispredicted branch.
L2_ctr        Instruction stalled due to L2 cache miss.
Cycle_ctr     Number of cycles.

Diagnose bottlenecks

• Every 10K instructions, evaluate bottlenecks using performance counters and thresholds
• Performance counters are normalized with respect to the cycle count
• If the normalized performance counter value is above threshold, then the corresponding resource is a bottleneck

Diagnose bottlenecks


Bottleneck        Expression
bool Width        Width = (Width_ctr > Width_thresh)
bool Window       Window = (Window_ctr > Window_thresh)
bool Frequency    Frequency = (Misp_ctr > Misp_thresh) || (L2_ctr > L2_thresh)
bool I$           I$ = (I$_ctr > I$_thresh)
bool D$           D$ = (D$_ctr > D$_thresh)

Thresholds are determined empirically using a training process.
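The diagnosis step can be sketched as follows. The counter names and the five expressions come from the slides. The function name, the dictionary layout, and every threshold and counter value are placeholders for illustration, since the trained thresholds are not given here.

```python
def diagnose(counters, thresholds):
    """Return the five bottleneck flags for one 10K-instruction interval."""
    cycles = counters["Cycle_ctr"]
    # Normalize every event counter with respect to the cycle count.
    normalized = {
        name: value / cycles
        for name, value in counters.items()
        if name != "Cycle_ctr"
    }
    # A resource is flagged when its normalized counter exceeds its threshold.
    over = {name: normalized[name] > thresholds[name] for name in normalized}
    return {
        "Width": over["Width_ctr"],
        "Window": over["Window_ctr"],
        "Frequency": over["Misp_ctr"] or over["L2_ctr"],
        "I$": over["I$_ctr"],
        "D$": over["D$_ctr"],
    }


# Placeholder thresholds and counter values (not the trained values).
THRESHOLDS = {"Width_ctr": 0.3, "Window_ctr": 0.2, "I$_ctr": 0.05,
              "D$_ctr": 0.1, "Misp_ctr": 0.01, "L2_ctr": 0.02}
sample = {"Width_ctr": 4000, "Window_ctr": 500, "I$_ctr": 100,
          "D$_ctr": 300, "Misp_ctr": 20, "L2_ctr": 10, "Cycle_ctr": 10000}
print(diagnose(sample, THRESHOLDS))
```

For this sample only the Width flag is set, because 4000 / 10000 = 0.4 exceeds the placeholder width threshold of 0.3 and every other normalized counter stays below its threshold.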

Steer phase


Core   Bottlenecks relieved   Bottlenecks worsened   Steering logic
W      Width                  Frequency              if (Width && !Frequency) W
L      Window                 Frequency              else if (Window && !Frequency) L
N      Frequency              Width, Window          else if (Frequency && !(Width || Window)) N
A      n/a                    n/a                    else A

Paper shows full steering logic with I$ and D$ bottlenecks included.
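The simplified steering logic in the table translates directly into a chain of conditionals. This sketch covers only the Width, Window, and Frequency bottlenecks shown on the slide and omits the I$ and D$ cases from the full logic. The function name is illustrative.

```python
def steer(width, window, frequency):
    """Pick the core type for the next interval from three bottleneck flags."""
    if width and not frequency:
        return "W"   # wide core relieves Width but would worsen Frequency
    elif window and not frequency:
        return "L"   # large core relieves Window but would worsen Frequency
    elif frequency and not (width or window):
        return "N"   # narrow core relieves Frequency, worsens Width and Window
    else:
        return "A"   # no bottleneck, or no accelerator meets the condition


print(steer(width=True, window=False, frequency=False))   # W
print(steer(width=True, window=False, frequency=True))    # A
```

The second call shows the fallback: when both Width and Frequency are diagnosed, each accelerator that relieves one would worsen the other, so the application goes to the average core.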


RESULTS


Methodology

• Benchmarks: SPEC 2000
  – Simulate first 4 billion instructions
• Metrics
  – Performance: BIPS
  – Efficiency: BIPS^3/Watt
• Migration overhead
  – Default: 100 cycles
  – Sensitivity study: 1K, 10K cycles

Steering algorithms

Algorithm     Description
Baseline      Run the entire 4B instructions on the average core
Sampling      Run on each core type for the sampling interval and then on the best core type for the switching interval
Bottleneck    Run current 10K instruction segment based on the bottlenecks of the prior 10K segment
Optimal       Run every 10K instruction segment on the best core type of the prior 10K segment
Oracle        Run every 10K instruction segment on the best core type
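The difference between the Baseline, Optimal, and Oracle rows can be sketched on a toy trace. The per-segment BIPS values are hypothetical, migration overhead is ignored, and starting the first segment on the average core is an assumption of this sketch. Because segments have equal instruction counts, overall BIPS is the harmonic mean of the per-segment BIPS.

```python
from statistics import harmonic_mean

# Hypothetical BIPS of four consecutive 10K-instruction segments per core type.
SEGMENTS = [
    {"A": 2.0, "N": 2.4, "L": 1.8, "W": 1.9},
    {"A": 2.0, "N": 2.5, "L": 1.7, "W": 1.9},
    {"A": 2.1, "N": 1.6, "L": 2.8, "W": 2.2},
    {"A": 2.0, "N": 1.5, "L": 2.9, "W": 2.3},
]


def best(segment):
    """Core type with the highest BIPS on this segment."""
    return max(segment, key=segment.get)


def baseline(segments):
    """Run every segment on the average core."""
    return ["A"] * len(segments)


def optimal(segments, start="A"):
    """Run each segment on the best core type of the prior segment."""
    return [start] + [best(prev) for prev in segments[:-1]]


def oracle(segments):
    """Run each segment on its own best core type."""
    return [best(segment) for segment in segments]


def overall_bips(segments, schedule):
    """Harmonic mean of per-segment BIPS under the given schedule."""
    return harmonic_mean([seg[core] for seg, core in zip(segments, schedule)])


for policy in (baseline, optimal, oracle):
    schedule = policy(SEGMENTS)
    print(policy.__name__, schedule, round(overall_bips(SEGMENTS, schedule), 2))
```

On this trace the prior-segment policy loses to the oracle only in the first segment and in the segment where the phase changes, which is the cost of predicting from the previous interval.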



4-core-type HCMP


• The 4-core-type HCMP outperforms the homogeneous CMP by up to 76%, and by 15% on average

• Our steering algorithm is able to capture most of this gain

Sampling vs. bottleneck steering


• Sampling performs 8.9% better than the average core
• Bottleneck steering performs 12% better than the average core

Occupancy



Occupancy pattern varies dramatically across different applications

Efficiency


• Sampling performs 25% better than the average core
• Bottleneck steering performs 33% better than the average core

SUMMARY



Summary

• First proposal to architect and orchestrate multiple core types for latency reduction.

• With N core types, the optimal HCMP consists of an average core type coupled with N-1 accelerator core types.

• In the complementary steering algorithm, the application is continuously diagnosed for bottlenecks and is migrated to the core type which relieves the bottlenecks.


Future work

• HCMPs open up a whole new direction of microarchitecture research.

• Many microarchitecture optimizations don’t provide universal benefits.

• As each core-type targets a narrow workload space, HCMP provides a great platform to reconsider these optimizations.

