axcis: accelerating architectural exploration using canonical instruction segments

33
1 AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments Rose Liu & Krste Asanovi Computer Architecture Group MIT CSAIL

Upload: lazaro

Post on 12-Feb-2016

28 views

Category:

Documents


0 download

DESCRIPTION

AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments. Rose Liu & Krste Asanovi ć Computer Architecture Group MIT CSAIL . Pareto-optimal designs on curve. CPI. Area. Simulation for Large Design Space Exploration. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

1

AXCIS: Accelerating Architectural Exploration usingCanonical Instruction Segments

Rose Liu & Krste AsanovicComputer Architecture GroupMIT CSAIL

Page 2: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

2 of 31

Large design space studies explore thousands of processor designs Identify those that minimize costs and maximize performance

Speed vs. Accuracy tradeoff Maximize simulation speedup while maintaining sufficient

accuracy to identify interesting design points for later detailed simulation

Simulation for Large Design Space Exploration

Pareto-optimaldesigns on curve

Area

CPI

Page 3: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

3 of 31

Reduce Simulated Instructions: Sampling Perform detailed microarchitectural simulation during

sample points & functional warming between sample points SimPoints [ASPLOS, 2002], SMARTS [ISCA, 2003]

Use efficient checkpoint techniques to reduce simulation time to minutes TurboSMARTS [SIGMETRICS, 2005], Biesbrouck [HiPEAC, 2005]

Sample points – simulate in detail

Page 4: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

4 of 31

Generate a short synthetic trace (with statistical properties similar to original workload) for simulation

Eeckhout [ISCA, 2004], Oskin [ISCA, 2000] Nussbaum [PACT, 2001]

Reduce Simulated Instructions: Statistical Simulation

Execution Driven

ProfilingStatisticalImage

ProgramSynthetic

TraceGeneration

SyntheticTrace

Simulation IPCConfig

Stage 1 Stage 2

Stage 3

Page 5: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

5 of 31

AXCIS Framework

Dynamic TraceCompressor

Program&

Inputs

IPC1IPC2IPC3

AXCISPerformanceModel

CISTCanonicalInstructionSegment

Table

ConfigsIn-order superscalars:• Issue width• # of functional units• # of cache primary- miss tags• Latencies• Branch penalty

• Machine independent except for branch predictor and cache organizations

• Stores all information needed for performance analysis

Stage 1 (performed once)

Stage 2

Page 6: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

6 of 31

In-Order Superscalar Machine Model(size & penalty)

(latency). . .

Branch Pred.

FPU LSUALU

(issue width)

Fetch

Issue

Completion

BlockingIcache

Non-blockingDcache

(# primary miss tags)

Memory

(number of units)

(organization & latency)

(org. & latency)

(latency)

( ) Parameters

Page 7: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

7 of 31

Stage 1: Dynamic Trace Compression

Dynamic TraceCompressor

Program&

Inputs

IPC1IPC2IPC3

AXCISPerformanceModel

CISTCanonicalInstructionSegment

Table

Configs

Stage 1 (performed once)

Stage 2

Page 8: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

8 of 31

Instruction Segments

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

instruction segment

defining instruction

Events: (dcache, icache, bpred)

An instruction segment captures all performance-critical information associated with a dynamic instruction

Page 9: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

9 of 31

Instruction Segments

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

instruction segment

defining instruction

Events: (dcache, icache, bpred)

An instruction segment captures all performance-critical information associated with a dynamic instruction

Page 10: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

10 of 31

Dynamic Trace Compression Program behavior repeats due to loops, and

repeated function calls Multiple different dynamic instruction segments can

have the same behavior (canonically equivalent) regardless of the machine configuration

Compress the dynamic trace by storing in a table: 1 copy of each type of segment How often we see it in the dynamic trace

Page 11: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

11 of 31

Canonical Instruction Segment Table

Freq Segment

Int_ALU1Int_ALU

CIST

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Page 12: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

12 of 31

Canonical Instruction Segment Table

Freq Segment

Int_ALU1

CIST

Load_Miss

Int_ALU

Int_ALU

Load_Miss1

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Page 13: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

13 of 31

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Canonical Instruction Segment Table

Freq Segment

Int_ALU1

CIST

Int_ALU

Load_Miss1

Load_Miss

Int_ALULoad_Miss

Int_ALU1

Page 14: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

14 of 31

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Canonical Instruction Segment Table

Freq Segment

Int_ALU1

CIST

Int_ALU

Load_Miss1

Load_Miss

Int_ALU1Load_Miss

Int_ALU2

Page 15: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

15 of 31

Canonical Instruction Segment Table

Freq Segment

Int_ALU1

CIST

Int_ALU

Load_Miss1

Load_Miss

Int_ALU1 2Load_Miss

Int_ALU

2addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Page 16: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

16 of 31

addq (--, hit, correct)

ldq (miss, hit, correct)

subq (--, hit, correct)

stq (miss, hit, correct)

ldq (miss, hit, correct)

addq (--, hit, correct)

Canonical Instruction Segment Table

Freq Segment

Int_ALU1

CIST

Int_ALU

Load_Miss1

Load_Miss

Int_ALU1

Int_ALU

Load_Miss

Store_Miss

Load_Miss

1Store_Miss

Int_ALU

2

2

Total ins: 6

Page 17: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

17 of 31

Stage 2: AXCIS Performance Model

Dynamic TraceCompressor

Program&

Inputs

IPCAXCISPerformanceModel

CISTCanonicalInstructionSegment

Table

Config

Stage 1 (performed once)

Stage 2

Page 18: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

18 of 31

AXCIS Performance Model Calculates IPC using a single linear dynamic

programming pass over the CIST entries Total work is proportional to the # of CIST entries

Stalls Effective Total Ins TotalIns Total

Cycles TotalIns Total

IPC

Size CIST

1)ningIns(i)talls(DefiEffectiveS * Freq(i)

iStalls Effective Total

EffectiveStalls = MAX ( stalls(DataHazards), stalls(StructuralHazards), stalls(ControlFlowHazards) )

Page 19: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

19 of 31

Performance Model Calculations

Int_ALU

Freq Segment

Int_ALU

Load_Miss

Load_Miss

Int_ALU

1

2

2

Total ins: 6Look up in previous segmentCalculate

For each defininginstruction: Calculate its effective stalls & its corresponding microarchitecture state snapshot Follow dependencies to look up the effective stalls & state of other instructions in previous entries

1Load_Miss

Store_Miss

Int_ALU

Stalls

0

2

99

99

???

State

???

Page 20: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

20 of 31

Stall Cycles From Data Hazards

1Load_Miss

Store_Miss

Int_ALU 99

Input configuration:

100Load_Miss 3Int_ALU

Latency (cycles)Ins Type

Freq

Use data dependencies (e.g. RAW) to detect data hazards Stalls(DataHazards)

= MAX ( -1, Latency( producer = Load_Miss )

– DepDist – EffectiveStalls( IntermediateIns = Int_ALU ) )

= MAX (-1, (100 – 2 – 99) ) = -1 stalls (can issue with previous instruction)

???

Stalls…

State

???

Page 21: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

21 of 31

Stall Cycles from Structural Hazards

CISTs record special dependencies to capture all possible structural hazards across all configurations

The AXCIS performance model follows these special dependencies to find the necessary microarchitectural states to:

1. Determine if a structural hazard exists & the number of stall cycles until it is resolved

2. Derive the microarchitectural state after issuing the current defining instruction

1Load_Miss

Store_Miss

Int_ALU

Freq Microarchitectural State

???

Stalls…

???

99

Page 22: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

22 of 31

Stall Cycles From Control Flow Hazards

Control flow events directly map to stall cycles

1Load_Miss

Store_Miss

Int_ALU

Freq Icache Branch Pred.

… …

hit correct & not taken

… …

Icache Bpred StallsHit Incorrect & taken/not taken

Correct & takenCorrect & not taken

Mispred penalty0-1

Miss Incorrect & taken/not takenCorrect & takenCorrect & not taken

Memory latency + mispred penaltyMemory latencyMemory latency - 1

Page 23: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

23 of 31

Lossless Compression Scheme Lossless Compression Scheme: (perfect accuracy)

Compress two segments if they always experience the same stall cycles regardless of the machine configuration

Impractical to implement within the Dynamic Trace Compressor

addq (--, hit, correct)

ldiq (--, hit, correct)

stq (miss, hit, correct)

ldiq always Issues with addq

addq (--, hit, correct)

stq (miss, hit, correct)

Page 24: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

24 of 31

Three Compression Schemes Instruction Characteristics Based Compression:

Compress segments that “look” alike (i.e. have the same length, instruction types, dependence distances, branch and cache behaviors)

Limit Configurations Based Compression: Compress segments whose defining instructions have the same

instruction types, stalls and microarchitectural state under the 2 configurations simulated during trace compression

Relaxed Limit Configurations Based Compression: Relaxed version of the limit-based scheme – does not compare

microarchitectural state Improves compression at the cost of accuracy

Page 25: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

25 of 31

Experimental Setup Evaluated AXCIS against a baseline cycle accurate

simulator on 24 SPEC2K benchmarks Evaluated AXCIS for:

Accuracy:

Speed: # of CIST entries, time in seconds For each benchmark, simulated a wide range of designs:

Issue width: {1, 4, 8}, # of functional units: {1, 2, 4, 8}, Memory latency: {10, 200 cycles}, # of primary miss tags in non-blocking data cache: {1, 8}

For each benchmark, selected the compression scheme that provides the best compression given a set accuracy range

Absolute IPC Error =| AXCIS – Baseline |

Baseline* 100

Page 26: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

26 of 31

0%

10%

20%

30%

Abs

olut

e IP

C E

rror

P_25P_MINP_50P_MAXP_75ave. IPC error

Limit-based Scheme Relaxed Limit-based Scheme

Characteristics-based Scheme

Results: AccuracyDistribution of IPC Error in quartiles

High Absolute Accuracy: Average Absolute IPC Error = 2.6 % Small Error Range: Average Error Range = 4.4%

Page 27: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

27 of 31

0

0.2

0.4

0.6

0.8

1

1 2 3 4 5 6 7 8 9 10 11 12Configuration

Ave

rage

IPC

Ave IPC - BaselineAve IPC - AXCIS

Results: Relative AccuracyAverage IPC of Baseline and AXCIS

High Relative Accuracy: AXCIS and Baseline provide the same ranking of configurations

Page 28: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

28 of 31

0

500

1000

1500

2000

2500

3000

thou

sand

s#

of C

IST

Entr

ies

in

2.26 sec

0.3 sec

0.07sec

0.09sec

0.15 sec

0.05 sec

0.08 sec 0.02

sec

0.88sec

0.55 sec 0.25

sec

2.74 sec

1.09sec

3.1sec

1.32sec 0.69

sec0.06sec

0.72sec

0.07sec

5.56sec

4.5sec

7.32sec

17.4sec

17.55sec

Limit-based Scheme Relaxed Limit-based Scheme

Characteristics-based Scheme

Results: Speed# of CIST entries & modeling time AXCIS is over 4

orders of magnitude faster than detailed simulation

CISTs are 5 orders of magnitude smaller than the original dynamic trace, on average

Modeling time ranged from 0.02 – 18 secondsfor billions of dynamic instructions

Page 29: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

29 of 31

Discussion Trade the generality of CISTs for higher accuracy

and/or speed E.g. fix the issue width to 4 and explore near this design point

Tailor the tradeoff made between speed/compression and accuracy for different workloads Floating point benchmarks (repetitive & compress well)

More sensitive to any error made during compression Require compression schemes with a stricter segment

equality definition Integer benchmarks: (less repetitive & harder to compress)

Require compression schemes that have a more relaxed equality definition

Page 30: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

30 of 31

Future Work Compression Schemes:

How to quickly identify the best compression scheme for a benchmark?

Is there a general compression scheme that works well for all benchmarks?

Extensions to support Out-of-Order Machines: Main ideas still apply (instruction segments, CIST, compression

schemes) Modify performance model to represent dispatch, issue, and

commit stages within the microarchitectural state so that given some initial state & an instruction, it can calculate the next state

Page 31: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

31 of 31

AXCIS is a promising technique for exploring large design spaces

High absolute and relative accuracy across a broad range of designs

Fast: 4 orders of magnitude faster than detailed simulation Simulates billions of dynamic instructions within seconds

Flexible: Performance modeling is independent of the compression

scheme used for CIST generation Vary the compression scheme to select a different tradeoff

between speed/compression and accuracy Trade the generality of the CIST for increased speed and/or

accuracy

Conclusion

Page 32: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

32

Backup Slides

Page 33: AXCIS: Accelerating Architectural Exploration using Canonical Instruction Segments

33 of 31

Results: Relative Accuracy

0

0.05

0.1

0.15

0.2

0.25

1 2 3 4 5 6

Configuration

Ave

rage

IPC

Ave IPC - Baseline

Ave IPC - AXCIS

Average IPC of Baseline and AXCIS over all benchmarks