Illusionist: Transforming Lightweight Cores into Aggressive Cores on Demand

HPCA-19, February 27, 2013

Amin Ansari (1), Shuguang Feng (2), Shantanu Gupta (3), Josep Torrellas (1), and Scott Mahlke (4)

(1) University of Illinois, Urbana-Champaign
(2) Northrop Grumman Corp.
(3) Intel Corp.
(4) University of Michigan, Ann Arbor



Adapting to Application Demands

Number of threads to execute is not constant

o Many threads available: a system with many lightweight cores achieves better throughput
o Few threads available: a system with aggressive cores achieves better throughput
o Single-thread performance is always better with aggressive cores

Asymmetric Chip Multiprocessors (ACMPs):
o Adapt to the variability in the number of threads
o Limited in that there is no dynamic adaptation

To provide dynamic adaptation:
o We use core coupling

[Figure: Core1 and Core2; axis label: Performance]

Core Coupling

Typically configured as leader/follower cores, where the leader runs ahead and attempts to accelerate the follower

o Slipstream
o Master/Slave Speculation
o Flea Flicker
o Dual-core Execution
o Paceline
o DIVA

The leader runs ahead by executing a “pruned” version of the application

The leader speculates on long-latency operations

The leader is aggressively frequency scaled (reduced safety margins)

A smaller follower core simplifies the design/verification of the leader core

Extending Core Coupling

[Figure: A 9 Core ACMP System, with one aggressive core (AC), eight lightweight cores (LWCs), and hints. Accompanying chart: throughput by configuration (9-core ACMP; 7 LWCs + coupled cores; Illusionist).]

Illusionist vs Prior Work

[Figure: one aggressive core providing hints to eight lightweight cores]

Higher single-thread performance for all LWCs
o By using a single aggressive core
o Giving the appearance of 8 semi-aggressive cores

Illusionist vs Prior Work

[Figure: Master Slave Parallelization [Zilles’02], showing a master and three slaves (Slave1, Slave2, Slave3) with tasks labeled A’, A, B’, B, C’, C; label: Hints]

Providing Hints for Many Cores

Original IPC of the aggressive core is ~2X that of a LWC

We want an AC to keep up with a large number of LWCs
o We need to substantially reduce the amount of work that the aggressive core needs to do for each thread running on a LWC

We need to run fewer instructions for each thread
o We distill the program that the aggressive core needs to run
o We limit the execution of the program to only the most fruitful parts

The main challenge here is to
o Preserve the effectiveness of the hints while removing instructions
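As a back-of-envelope illustration of why distillation is needed, the sketch below assumes a simple throughput model that is not taken from the slides: one AC whose IPC is `ipc_ratio` times that of a LWC shares its instruction bandwidth evenly among the coupled threads. The function names are hypothetical, and the model ignores stalls, resynchronization, and hint quality.

```c
#include <assert.h>

/* Hypothetical model (an assumption, not from the slides): the AC
 * time-shares its instruction bandwidth evenly among coupled threads.
 * ipc_ratio = IPC(AC) / IPC(LWC); the slides quote roughly 2. */

/* Largest fraction of each thread's instructions the AC can execute
 * while keeping pace with n_lwcs lightweight cores. */
double max_kept_fraction(double ipc_ratio, int n_lwcs)
{
    assert(ipc_ratio > 0.0 && n_lwcs > 0);
    double f = ipc_ratio / (double)n_lwcs;
    return f > 1.0 ? 1.0 : f;   /* cannot keep more than the whole program */
}

/* Number of LWCs one AC can serve if a fraction `removed` of the
 * instructions is distilled away (0 <= removed < 1). */
double lwcs_served(double ipc_ratio, double removed)
{
    assert(ipc_ratio > 0.0 && removed >= 0.0 && removed < 1.0);
    return ipc_ratio / (1.0 - removed);
}
```

Under this idealized model, `max_kept_fraction(2.0, 8)` is 0.25, so serving eight LWCs would require removing at least 75% of the instructions each thread presents to the AC.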

Program Distillation

Objective: reduce the size of the program while preserving the effectiveness of the original hints (branch prediction and cache hits)

Distillation techniques
o Aggressive instruction removal (on average, 77%)
    - Remove instructions which do not contribute to hint generation
    - Remove highly biased branches and their back slice
    - Remove memory inst. accessing the same cache line
o Select the most promising program phases
    - Predictor that uses performance counters
    - Regression model based on IPC, cache ($) miss rate, and branch predictor (BP) miss rate
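The slides do not describe how same-line memory instructions are identified, so the following is only a trace-level sketch of the idea: given a sequence of accessed addresses, keep an access only when it touches a different cache line from the most recently kept one. The 64-byte line size and the function name `filter_same_line` are assumptions for illustration; real distillation would act on static instructions rather than on a dynamic trace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u  /* assumed cache line size; not given in the slides */

/* Copies into `kept` only the addresses whose cache line differs from the
 * line of the most recently kept address. Returns the number kept.
 * `kept` must have room for n entries. */
size_t filter_same_line(const uint64_t *addrs, size_t n, uint64_t *kept)
{
    assert(addrs != NULL && kept != NULL);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = addrs[i] / LINE_BYTES;
        if (out == 0 || kept[out - 1] / LINE_BYTES != line)
            kept[out++] = addrs[i];
    }
    return out;
}
```

For a sequential walk over 32 eight-byte elements starting at a line-aligned address, this keeps 4 of the 32 accesses, one per 64-byte line, which is enough to warm the cache for the follower.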

Example of Instruction Removal

Original code (179.art)

    if (high<=low) return;

    srand(10);
    for (i=low;i<high;i++) {
      for (j=0;j<numf1s;j++) {
        if (i%low) {
          tds[j][i] = tds[j][0];
          tds[j][i] = bus[j][0];
        } else {
          tds[j][i] = tds[j][1];
          tds[j][i] = bus[j][1];
        }
      }
    }

    for (i=low;i<high;i++) {
      for (j=0;j<numf1s;j++) {
        noise1 = (double)(rand()&0xffff);
        noise2 = noise1/(double)0xffff;
        tds[j][i] += noise2;
        bus[j][i] += noise2;
      }
    }

Second loop, unrolled by 4

    for (i=low;i<high;i=i+4) {
      for (j=0;j<numf1s;j++) {
        noise1 = (double)(rand()&0xffff);
        noise2 = noise1/(double)0xffff;
        tds[j][i] += noise2;
        bus[j][i] += noise2;
      }
      for (j=0;j<numf1s;j++) {
        noise1 = (double)(rand()&0xffff);
        noise2 = noise1/(double)0xffff;
        tds[j][i+1] += noise2;
        bus[j][i+1] += noise2;
      }
      for (j=0;j<numf1s;j++) {
        noise1 = (double)(rand()&0xffff);
        noise2 = noise1/(double)0xffff;
        tds[j][i+2] += noise2;
        bus[j][i+2] += noise2;
      }
      for (j=0;j<numf1s;j++) {
        noise1 = (double)(rand()&0xffff);
        noise2 = noise1/(double)0xffff;
        tds[j][i+3] += noise2;
        bus[j][i+3] += noise2;
      }
    }

Distilled code

    srand(10);
    for (i=low;i<high;i=i+4) {
      for (j=0;j<numf1s;j++) {
        tds[j][i] = tds[j][1];
        tds[j][i] = bus[j][1];
      }
    }

    for (i=low;i<high;i=i+4) {
      for (j=0;j<numf1s;j++) {
        tds[j][i] = noise2;
        bus[j][i] = noise2;
      }
    }

Hint Phases

If we can predict these phases without actually running the program on both lightweight and aggressive cores, we can limit the dual-core execution to only the most useful phases

[Figure: Performance(accelerated LWC) / Performance(original LWC), plotted over groups of 10K instr]
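To make the plotted ratio concrete, here is a minimal sketch of how useful hint phases could be labeled offline, assuming per-group cycle counts are available from running the thread both with and without acceleration. This is exactly the double execution that prediction is meant to avoid, so it serves only as a reference labeling. The function name and the `threshold` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* For each group of instructions, computes
 *   speedup = Performance(accelerated LWC) / Performance(original LWC)
 * which, for a fixed instruction count per group, equals
 *   cycles_original / cycles_accelerated.
 * Sets useful[i] = 1 when the speedup reaches `threshold` (a hypothetical
 * tuning knob) and returns how many groups were marked useful. */
size_t mark_hint_phases(const unsigned long *cycles_original,
                        const unsigned long *cycles_accelerated,
                        size_t n_groups, double threshold,
                        unsigned char *useful)
{
    assert(cycles_original && cycles_accelerated && useful);
    size_t count = 0;
    for (size_t i = 0; i < n_groups; i++) {
        assert(cycles_accelerated[i] > 0);
        double speedup = (double)cycles_original[i] /
                         (double)cycles_accelerated[i];
        useful[i] = speedup >= threshold;
        count += useful[i];
    }
    return count;
}
```

Groups marked useful are the ones where coupling pays off; the remaining groups could run on the LWC alone, freeing the aggressive core for other threads.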

Phase Prediction

Phase predictor:
o Does a decent job predicting the IPC trend
o Can sit in either the hypervisor or the operating system and reads the performance counters while the threads are running

Aggressive core runs the thread that will benefit the most
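A minimal sketch of such a predictor follows, assuming a linear regression over the three inputs named earlier (IPC, cache miss rate, and branch predictor miss rate). The slides give no coefficients, so the `model` weights are placeholders, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Performance-counter sample for one thread over the last interval. */
struct counters {
    double ipc;         /* instructions per cycle on the LWC */
    double cache_miss;  /* cache miss rate, 0..1 */
    double bp_miss;     /* branch predictor miss rate, 0..1 */
};

/* Linear regression weights. The slides name the inputs but give no
 * coefficients, so any values passed in are placeholders. */
struct model {
    double bias, w_ipc, w_cache, w_bp;
};

/* Predicted benefit of coupling this thread to the aggressive core. */
double predict_benefit(const struct model *m, const struct counters *c)
{
    return m->bias + m->w_ipc * c->ipc
         + m->w_cache * c->cache_miss + m->w_bp * c->bp_miss;
}

/* Index of the thread with the highest predicted benefit. */
size_t pick_thread(const struct model *m, const struct counters *t, size_t n)
{
    assert(m != NULL && t != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (predict_benefit(m, &t[i]) > predict_benefit(m, &t[best]))
            best = i;
    return best;
}
```

A hypervisor or OS routine could call `pick_thread` once per interval with fresh counter samples and schedule the returned thread on the aggressive core.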

Illusionist: Core Coupling Architecture

[Figure: Illusionist core coupling architecture. Labels: aggressive core (pipeline stages FET, DEC, REN, DIS, EXE, MEM, COM; L1-Inst; L1-Data), lightweight core (pipeline stages FE, DE, RE, DI, EX, ME, CO; L1-Inst; L1-Data), shared L2 cache (read-only), memory hierarchy, hint gathering, queue (head, tail), hint distribution, cache fingerprint, hint disabling, and a resynchronization signal with hint disabling information]
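The queue with head and tail pointers between the two cores can be modeled in software as a bounded ring buffer. The sketch below is an assumption-laden illustration: the `hint` record layout, the capacity, and the choice to flush the queue on resynchronization are all hypothetical, since the slides name the components without specifying them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u   /* hypothetical capacity */

/* Hypothetical hint record; the slides mention branch prediction and
 * cache hints but do not give a format. */
struct hint {
    uint64_t pc;      /* instruction the hint refers to */
    uint64_t value;   /* branch outcome or address to prefetch */
    bool is_branch;   /* true: branch hint, false: cache hint */
};

struct hint_queue {
    struct hint slot[QUEUE_SLOTS];
    size_t head;      /* next slot the lightweight core reads */
    size_t tail;      /* next slot the aggressive core writes */
    size_t count;     /* occupied slots */
};

/* Aggressive-core side: returns false when the queue is full. */
bool hint_push(struct hint_queue *q, struct hint h)
{
    assert(q != NULL);
    if (q->count == QUEUE_SLOTS)
        return false;
    q->slot[q->tail] = h;
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    q->count++;
    return true;
}

/* Lightweight-core side: returns false when no hint is available. */
bool hint_pop(struct hint_queue *q, struct hint *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0)
        return false;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return true;
}

/* Assumed resynchronization behavior: discard all pending hints. */
void hint_flush(struct hint_queue *q)
{
    assert(q != NULL);
    q->head = q->tail = q->count = 0;
}
```

A full queue means the leader has run as far ahead as the buffer allows; an empty queue means the follower has caught up and proceeds without hints.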

Illusionist System

[Figure: the Illusionist system, with one aggressive core with hint gathering, ten lightweight cores each with a queue, a data switch, and L2 cache banks]

Experimental Methodology

Performance: heavily modified SimAlpha
o Instruction removal and phase-based program pruning
o SPEC-CPU-2K with SimPoint

Power: Wattch, HotLeakage, and CACTI

Area: Synopsys toolchain + 90nm TSMC

Performance After Acceleration

On average, 43% speedup compared to a LWC


Instruction Type Breakdown

In most benchmarks, the breakdowns are similar.

[Figure legend: b = before distillation, a = after distillation]

Area-Neutral Comparison of Alternatives

[Figure: bar chart normalized to all aggressive cores (y-axis 0 to 2) showing system throughput, power, average single-thread performance, and total energy. Configuration labels, ordered toward more lightweight cores: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). Annotations: 34% and 2X.]

Conclusion

On-demand acceleration of lightweight cores
o Using a few aggressive cores

Aggressive core keeps up with many LWCs by
o Aggressive inst. removal with a minimal impact on the hints
o Phase-based program pruning based on hint effectiveness

Illusionist provides an interesting design point
o Compared to a CMP with only lightweight cores: 35% better single-thread performance
o Compared to a CMP with only aggressive cores: 2X better system throughput

Comparison with Alternatives

[Figure: bar chart normalized to all aggressive cores (y-axis 0 to 2) showing system throughput, power, average single-thread performance, and total energy. Configuration labels: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). The "More Lightweight Cores" axis is marked 1, 6, 10.]

Number of available threads = 60% of the number of lightweight cores

Comparison with Alternatives

[Figure: bar chart normalized to all aggressive cores (y-axis 0 to 2) showing system throughput, power, average single-thread performance, and total energy. Configuration labels: All Aggressive Cores (ACs), 1 AC + 1 LWC, After Instruction Removal, After Phase-Based Pruning, All Lightweight Cores (LWCs). The "More Lightweight Cores" axis is marked 1, 6, 10.]

Number of available threads = 30% of the number of lightweight cores

Percentage of Instruction Removed


Hint Accuracy after Instruction Removal
