Combining Statistical and Symbolic Simulation
Mark Oskin
Fred Chong and Matthew Farrens
Dept. of Computer Science
University of California at Davis
Overview
• HLS is a hybrid performance simulator
  – Statistical + Symbolic
• Fast
• Accurate
• Flexible
Motivation
[Figure: IPC vs. branch prediction accuracy (0.74–0.94), with curves for I-cache hit rate, I-cache miss penalty, branch mis-predict penalty, basic block size, and dispatch bandwidth]
Motivation
• Fast simulation
  – seconds instead of hours or days
  – ideally interactive
• Abstract simulation
  – simulate the performance of unknown designs
  – application characteristics, not applications
Outline
• Simulation technologies and HLS
• From applications to profiles
• Validation
• Examples
• Issues
• Conclusion
Design Flow with HLS
[Diagram: Cycle-by-Cycle Simulation produces a Profile; HLS uses the Profile to Estimate Performance, iterating quickly over Design Issues and Possible Solutions]
Traditional Simulation Techniques
• Cycle-by-cycle (SimpleScalar, SimOS, etc.)
+ accurate
– slow
• Native emulation/basic block models (Atom, Pixie)
+ fast, complex applications
– useful to a point (no low-level modifications)
Statistical / Symbolic Execution
• HLS
  + fast (near interactive)
  + accurate / – only within regions
  + permits variation of low-level parameters
  + arbitrary design points / – use carefully
HLS: A Superscalar Statistical and Symbolic Simulator
[Diagram: statistical components (L2 cache, L1 I-cache, L1 D-cache, main memory, branch predictor) feeding symbolic components (fetch unit, out-of-order dispatch unit, out-of-order execution core, out-of-order completion unit)]
Workflow

[Diagram: source code is compiled to a binary; sim-stat produces the app profile and sim-outorder (or the R10k) the machine profile; the app profile yields a Stat-binary, which HLS executes under a given machine-configuration]
Machine Configurations
• Number of functional units (I, F, [L,S], B)
• Functional-unit pipeline depths
• Fetch, dispatch, and completion bandwidths
• Memory access latencies
• Mis-speculation penalties
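These parameters fit in a small configuration record. A hypothetical sketch in Python (field names and numbers are illustrative assumptions, not HLS's actual input format):

```python
# Hypothetical machine-configuration record for an HLS-style simulator.
# All field names and values are made up for illustration.
machine_config = {
    "functional_units": {"int": 4, "fp": 2, "ldst": 2, "branch": 1},
    "fu_pipeline_depths": {"int": 1, "fp": 3, "ldst": 2, "branch": 1},
    "fetch_bandwidth": 4,          # instructions fetched per cycle
    "dispatch_bandwidth": 4,
    "completion_bandwidth": 4,
    "mem_latency": {"l1": 1, "l2": 6, "main": 34},  # cycles
    "misspeculation_penalty": 3,   # cycles lost per branch mispredict
}
```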
Profiles

• Machine profile:
  – cache hit rates
  – branch prediction accuracy
• Application profile:
  – basic block size
  – instruction mix (% of I, F, L, S, B)
  – dynamic instruction distance (histogram)
[Figure: percent of total dynamic instructions by type (integer, floating point, load, store, branch), broken down by dependence distance (none, 1–19, 20–100)]
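The two profiles can be sketched as plain records; a minimal illustration (names and numbers are made-up assumptions, not measured data):

```python
# Illustrative machine and application profiles for an HLS-style simulator.
machine_profile = {
    "l1_icache_hit": 0.95, "l2_icache_hit": 0.99,
    "l1_dcache_hit": 0.90, "l2_dcache_hit": 0.98,
    "branch_pred_accuracy": 0.88,
}
app_profile = {
    "basic_block_size": 5.2,   # mean dynamic basic block size
    "instruction_mix": {"int": 0.45, "fp": 0.05, "load": 0.25,
                        "store": 0.10, "branch": 0.15},
    # dynamic instruction distance histogram, bucketed as on the slide
    "dependence_distance": {"none": 0.30, "1-19": 0.55, "20-100": 0.15},
}
```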
Statistical Binary
• 100 basic blocks
• Correlated:
  – random instruction mix
  – random assignment of dynamic instruction distance
  – random distribution of cache and branch behaviors
Statistical Binary
load    (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
branch  (l1 i-cache, l2 i-cache, branch-predictor accr., dep 0, dep 1)
store   (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dep 0, dep 1)
load    (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)

Each annotation captures: core functional-unit requirements, cache behavior during I-fetch, cache behavior during data access, dynamic instruction distance, and branch predictor behavior.
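The construction above can be sketched as a small generator: draw each block's size and instruction mix from the application profile, and annotate each symbolic instruction with cache/branch outcomes drawn from the machine profile. This is a simplified sketch, not HLS's actual sim-stat code; all names and distributions are assumptions.

```python
import random

def make_stat_binary(app_profile, machine_profile, n_blocks=100, seed=0):
    """Sketch: build symbolic basic blocks whose instruction types,
    dependence distances, and cache/branch annotations follow the profiles."""
    rng = random.Random(seed)
    mix = app_profile["instruction_mix"]
    types, weights = list(mix), list(mix.values())
    blocks = []
    for _ in range(n_blocks):
        # block size drawn around the profiled mean (variance is a guess)
        size = max(1, round(rng.gauss(app_profile["basic_block_size"], 1.0)))
        block = []
        for _ in range(size):
            op = rng.choices(types, weights)[0]
            inst = {
                "op": op,
                # I-fetch cache behavior: one Bernoulli draw
                "l1_i_hit": rng.random() < machine_profile["l1_icache_hit"],
                # dynamic instruction distance, bucketed as in the profile
                "dep_dist": rng.choice([0, rng.randint(1, 19),
                                        rng.randint(20, 100)]),
            }
            if op in ("load", "store"):
                inst["l1_d_hit"] = rng.random() < machine_profile["l1_dcache_hit"]
            if op == "branch":
                inst["predicted"] = rng.random() < machine_profile["branch_pred_accuracy"]
            block.append(inst)
        blocks.append(block)
    return blocks
```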
HLS Instruction Fetch Stage
integer (...)
branch (...)
store (...)
load (...)
integer (...)
branch (...)
load (...)
integer (...)
Similar to conventional instruction fetch:
- has a PC
- has a fetch window
- interacts with caches
- utilizes the branch predictor
- passes instructions to dispatch

Differences:
- caches and branch predictor are statistical models
Fetches symbolic instructions and interacts with a statisticalmemory system and branch predictor model.
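A toy version of such a fetch stage, assuming the statistical-binary instruction format sketched earlier (the field and penalty names are hypothetical, not HLS's real interface):

```python
def fetch_stage(stream, machine, n_cycles=1000):
    """Toy statistical fetch: a PC walks the symbolic instruction stream;
    I-cache misses and branch mispredicts close the fetch window and stall."""
    pc, fetched, stall = 0, [], 0
    for _ in range(n_cycles):
        if stall > 0:                        # waiting out a miss/mispredict
            stall -= 1
            continue
        for _ in range(machine["fetch_bandwidth"]):
            inst = stream[pc % len(stream)]  # statistical binaries loop
            pc += 1
            fetched.append(inst)             # would be handed to dispatch
            if not inst["l1_i_hit"]:
                stall = machine["l1_miss_penalty"]
            if inst["op"] == "branch" and not inst.get("predicted", True):
                stall = machine["misspeculation_penalty"]
            if stall:
                break                        # fetch window closes this cycle
    return fetched
```

For example, a three-instruction stream with one mispredicted branch and one I-cache miss fetches far fewer than `n_cycles * fetch_bandwidth` instructions, since most cycles are spent stalled.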
Validation - SimpleScalar vs. HLS
Benchmark   SimpleScalar IPC   HLS IPC   Error
perl        1.27               1.32      4.20%
compress    1.18               1.25      5.50%
gcc         0.92               0.96      3.90%
go          0.94               1.01      6.80%
ijpeg       1.67               1.73      3.90%
li          1.62               1.50      7.20%
m88ksim     1.16               1.14      1.50%
vortex      0.87               0.83      5.10%
Validation - R10k vs. HLS
Benchmark   R10K IPC   HLS IPC   Error
perl        1.01       1.09      7.00%
compress    0.70       0.69      2.60%
gcc         0.93       0.96      3.80%
go          0.90       0.98      0.90%
ijpeg       1.45       1.40      4.00%
li          0.85       0.90      6.00%
m88ksim     1.15       1.15      0.10%
vortex      0.83       0.82      1.00%
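The Error columns appear to be relative IPC error, |HLS − reference| / reference. Recomputing from the rounded IPCs shown in the tables gives values close to, but not exactly matching, the reported percentages, presumably because the slides used unrounded IPCs:

```python
def rel_error_pct(ref_ipc, hls_ipc):
    """Relative IPC error, as the validation tables appear to define it."""
    return abs(hls_ipc - ref_ipc) / ref_ipc * 100

# perl vs. SimpleScalar: table reports 4.20%; the rounded IPCs give ~3.9%
print(round(rel_error_pct(1.27, 1.32), 1))  # → 3.9
```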
HLS Multi-value Validation with SimpleScalar

[Figure: HLS vs. SimpleScalar iso-IPC contours over branch prediction accuracy (0.80–1.00) and L1 instruction cache hit rate (0.80–1.00)]
(Perl)
HLS Multi-Value Validation with SimpleScalar

[Figure: HLS vs. SimpleScalar iso-IPC contours over L1 instruction cache hit rate (0.80–1.00) and L1 instruction cache miss penalty (2–20 cycles)]
(Xlisp)
Example use of HLS
[Figure: iso-IPC contours over branch prediction accuracy (0.80–1.00) and basic block size (10–50)]

An intuitive result: branch prediction accuracy becomes less important (crosses fewer iso-IPC contour lines) as basic block size increases.
(Perl)
Example use of HLS
[Figure: iso-IPC contours over basic block size (2–20) and dynamic instruction distance (2–20)]

Another intuitive result: gains in IPC due to basic block size are front-loaded.
(Perl)

Trade-off between front-end (fetch/dispatch) and back-end (ILP) processor performance.
Example use of HLS
[Figure: iso-IPC contours (1.1–1.2) over % value-predicted instructions (0–1) and dynamic instruction distance (2–20)]

This space intentionally left blank.
(Perl)
Related work
• R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. PAID Workshop, June 1998.
• N. Jouppi. The non-uniform distribution of instruction-level and machine parallelism and its effect on performance. IEEE Trans., 1989.
• D. Noonburg and J. Shen. Theoretical modeling of superscalar processor performance. MICRO-27, November 1994.
Questions & Future Directions
• How important are different well-performing benchmarks anyway?
  – easily summarized
  – summaries are not precise, yet precise enough
• Will the statistical + symbolic technique work for poorly behaved applications?
• Will it extend to deeper pipelines and more real processors (e.g., Alpha, P6 architecture)?
Conclusion
• HLS: Statistical + Symbolic Execution
  – Intuitive design space exploration
  – Fast
  – Accurate
  – Flexible
• Validated against cycle-by-cycle simulation and the R10k
• Future work: deeper pipelines, more hardware validations, additional domains
• Source code at: http://arch.cs.ucdavis.edu/~oskin