Combining Statistical and Symbolic Simulation
Mark Oskin
Fred Chong and Matthew Farrens
Dept. of Computer Science
University of California at Davis
Overview
• HLS is a hybrid performance simulator
  – Statistical + Symbolic
• Fast
• Accurate
• Flexible
Motivation
[Figure: IPC vs. branch prediction accuracy (0.74–0.94), with curves for I-cache hit rate, I-cache miss penalty, branch mis-predict penalty, basic block size, and dispatch bandwidth]
Motivation
• Fast simulation
  – seconds instead of hours or days
  – ideally interactive
• Abstract simulation
  – simulate the performance of unknown designs
  – application characteristics, not applications
Outline
• Simulation technologies and HLS
• From applications to profiles
• Validation
• Examples
• Issues
• Conclusion
Design Flow with HLS
[Diagram: Cycle-by-Cycle Simulation produces a Profile; HLS uses the Profile to Estimate Performance, iterating quickly over Design Issues and Possible Solutions]
Traditional Simulation Techniques
• Cycle-by-cycle (SimpleScalar, SimOS, etc.)
+ accurate
– slow
• Native emulation/basic block models (Atom, Pixie)
+ fast, complex applications
– useful to a point (no low-level modifications)
Statistical / Symbolic Execution
• HLS
  + fast (near interactive)
  + accurate / – only within regions
  + permits variation of low-level parameters
  + arbitrary design points / – use carefully
HLS: A Superscalar Statistical and Symbolic Simulator
[Diagram: statistical components (L2 cache, L1 I-cache, L1 D-cache, main memory, branch predictor) feeding symbolic components (fetch unit, out-of-order dispatch unit, out-of-order execution core, out-of-order completion unit)]
Workflow

[Diagram: source code is compiled to a binary; sim-stat produces the app profile and sim-outorder (or the R10k) the machine profile; the app profile yields a Stat-binary, which HLS executes under a given machine-configuration]
Machine Configurations
• Number of functional units (I, F, [L,S], B)
• Functional-unit pipeline depths
• Fetch, dispatch, and completion bandwidths
• Memory access latencies
• Mis-speculation penalties
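These parameters fit in a small configuration record. A hypothetical sketch in Python (field names and numbers are illustrative assumptions, not HLS's actual input format):

```python
# Hypothetical machine-configuration record for an HLS-style simulator.
# All field names and values are made up for illustration.
machine_config = {
    "functional_units": {"int": 4, "fp": 2, "ldst": 2, "branch": 1},
    "fu_pipeline_depths": {"int": 1, "fp": 3, "ldst": 2, "branch": 1},
    "fetch_bandwidth": 4,          # instructions fetched per cycle
    "dispatch_bandwidth": 4,
    "completion_bandwidth": 4,
    "mem_latency": {"l1": 1, "l2": 6, "main": 34},  # cycles
    "misspeculation_penalty": 3,   # cycles lost per branch mispredict
}
```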
Profiles

• Machine profile:
  – cache hit rates
  – branch prediction accuracy
• Application profile:
  – basic block size
  – instruction mix (% of I, F, L, S, B)
  – dynamic instruction distance (histogram)
[Figure: percent of total dynamic instructions by type (integer, floating point, load, store, branch), broken down by dependence distance (none, 1–19, 20–100)]
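The two profiles can be sketched as plain records; a minimal illustration (names and numbers are made-up assumptions, not measured data):

```python
# Illustrative machine and application profiles for an HLS-style simulator.
machine_profile = {
    "l1_icache_hit": 0.95, "l2_icache_hit": 0.99,
    "l1_dcache_hit": 0.90, "l2_dcache_hit": 0.98,
    "branch_pred_accuracy": 0.88,
}
app_profile = {
    "basic_block_size": 5.2,   # mean dynamic basic block size
    "instruction_mix": {"int": 0.45, "fp": 0.05, "load": 0.25,
                        "store": 0.10, "branch": 0.15},
    # dynamic instruction distance histogram, bucketed as on the slide
    "dependence_distance": {"none": 0.30, "1-19": 0.55, "20-100": 0.15},
}
```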
Statistical Binary
• 100 basic blocks
• Correlated:
  – random instruction mix
  – random assignment of dynamic instruction distance
  – random distribution of cache and branch behaviors
Statistical Binary
load    (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
integer (l1 i-cache, l2 i-cache, dependence 0, dependence 1)
branch  (l1 i-cache, l2 i-cache, branch-predictor accr., dep 0, dep 1)
store   (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dep 0, dep 1)
load    (l1 i-cache, l2 i-cache, l1 d-cache, l2 d-cache, dependence 0)

Each annotation captures: core functional-unit requirements, cache behavior during I-fetch, cache behavior during data access, dynamic instruction distance, and branch predictor behavior.
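The construction above can be sketched as a small generator: draw each block's size and instruction mix from the application profile, and annotate each symbolic instruction with cache/branch outcomes drawn from the machine profile. This is a simplified sketch, not HLS's actual sim-stat code; all names and distributions are assumptions.

```python
import random

def make_stat_binary(app_profile, machine_profile, n_blocks=100, seed=0):
    """Sketch: build symbolic basic blocks whose instruction types,
    dependence distances, and cache/branch annotations follow the profiles."""
    rng = random.Random(seed)
    mix = app_profile["instruction_mix"]
    types, weights = list(mix), list(mix.values())
    blocks = []
    for _ in range(n_blocks):
        # block size drawn around the profiled mean (variance is a guess)
        size = max(1, round(rng.gauss(app_profile["basic_block_size"], 1.0)))
        block = []
        for _ in range(size):
            op = rng.choices(types, weights)[0]
            inst = {
                "op": op,
                # I-fetch cache behavior: one Bernoulli draw
                "l1_i_hit": rng.random() < machine_profile["l1_icache_hit"],
                # dynamic instruction distance, bucketed as in the profile
                "dep_dist": rng.choice([0, rng.randint(1, 19),
                                        rng.randint(20, 100)]),
            }
            if op in ("load", "store"):
                inst["l1_d_hit"] = rng.random() < machine_profile["l1_dcache_hit"]
            if op == "branch":
                inst["predicted"] = rng.random() < machine_profile["branch_pred_accuracy"]
            block.append(inst)
        blocks.append(block)
    return blocks
```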
HLS Instruction Fetch Stage
integer (...)
branch (...)
store (...)
load (...)
integer (...)
branch (...)
load (...)
integer (...)
Similar to conventional instruction fetch:
- has a PC
- has a fetch window
- interacts with caches
- utilizes the branch predictor
- passes instructions to dispatch

Differences:
- caches and branch predictor are statistical models
Fetches symbolic instructions and interacts with a statisticalmemory system and branch predictor model.
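A toy version of such a fetch stage, assuming the statistical-binary instruction format sketched earlier (the field and penalty names are hypothetical, not HLS's real interface):

```python
def fetch_stage(stream, machine, n_cycles=1000):
    """Toy statistical fetch: a PC walks the symbolic instruction stream;
    I-cache misses and branch mispredicts close the fetch window and stall."""
    pc, fetched, stall = 0, [], 0
    for _ in range(n_cycles):
        if stall > 0:                        # waiting out a miss/mispredict
            stall -= 1
            continue
        for _ in range(machine["fetch_bandwidth"]):
            inst = stream[pc % len(stream)]  # statistical binaries loop
            pc += 1
            fetched.append(inst)             # would be handed to dispatch
            if not inst["l1_i_hit"]:
                stall = machine["l1_miss_penalty"]
            if inst["op"] == "branch" and not inst.get("predicted", True):
                stall = machine["misspeculation_penalty"]
            if stall:
                break                        # fetch window closes this cycle
    return fetched
```

For example, a three-instruction stream with one mispredicted branch and one I-cache miss fetches far fewer than `n_cycles * fetch_bandwidth` instructions, since most cycles are spent stalled.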
Validation - SimpleScalar vs. HLS
Benchmark   SimpleScalar IPC   HLS IPC   Error
perl        1.27               1.32      4.20%
compress    1.18               1.25      5.50%
gcc         0.92               0.96      3.90%
go          0.94               1.01      6.80%
ijpeg       1.67               1.73      3.90%
li          1.62               1.50      7.20%
m88ksim     1.16               1.14      1.50%
vortex      0.87               0.83      5.10%
Validation - R10k vs. HLS
Benchmark   R10K IPC   HLS IPC   Error
perl        1.01       1.09      7.00%
compress    0.70       0.69      2.60%
gcc         0.93       0.96      3.80%
go          0.90       0.98      0.90%
ijpeg       1.45       1.40      4.00%
li          0.85       0.90      6.00%
m88ksim     1.15       1.15      0.10%
vortex      0.83       0.82      1.00%
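The Error columns appear to be relative IPC error, |HLS − reference| / reference. Recomputing from the rounded IPCs shown in the tables gives values close to, but not exactly matching, the reported percentages, presumably because the slides used unrounded IPCs:

```python
def rel_error_pct(ref_ipc, hls_ipc):
    """Relative IPC error, as the validation tables appear to define it."""
    return abs(hls_ipc - ref_ipc) / ref_ipc * 100

# perl vs. SimpleScalar: table reports 4.20%; the rounded IPCs give ~3.9%
print(round(rel_error_pct(1.27, 1.32), 1))  # → 3.9
```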
HLS Multi-value Validation with SimpleScalar

[Figure: HLS vs. SimpleScalar iso-IPC contours over branch prediction accuracy (0.80–1.00) and L1 instruction cache hit rate (0.80–1.00)]
(Perl)
HLS Multi-Value Validation with SimpleScalar

[Figure: HLS vs. SimpleScalar iso-IPC contours over L1 instruction cache hit rate (0.80–1.00) and L1 instruction cache miss penalty (2–20 cycles)]
(Xlisp)
Example use of HLS
[Figure: iso-IPC contours over branch prediction accuracy (0.80–1.00) and basic block size (10–50)]

An intuitive result: branch prediction accuracy becomes less important (crosses fewer iso-IPC contour lines) as basic block size increases.
(Perl)
Example use of HLS
[Figure: iso-IPC contours over basic block size (2–20) and dynamic instruction distance (2–20)]

Another intuitive result: gains in IPC due to basic block size are front-loaded.
(Perl)

Trade-off between front-end (fetch/dispatch) and back-end (ILP) processor performance.
Example use of HLS
[Figure: iso-IPC contours (1.1–1.2) over % value-predicted instructions (0–1) and dynamic instruction distance (2–20)]

This space intentionally left blank.
(Perl)
Related work
• R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. PAID Workshop, June 1998.
• N. Jouppi. The non-uniform distribution of instruction-level and machine parallelism and its effect on performance. IEEE Trans., 1989.
• D. Noonburg and J. Shen. Theoretical modeling of superscalar processor performance. MICRO-27, November 1994.
Questions & Future Directions
• How important are different well-performing benchmarks anyway?
  – easily summarized
  – summaries are not precise, yet precise enough
• Will the statistical + symbolic technique work for poorly behaved applications?
• Will it extend to deeper pipelines and more real processors (e.g., Alpha, P6 architecture)?
Conclusion
• HLS: Statistical + Symbolic Execution
  – Intuitive design space exploration
  – Fast
  – Accurate
  – Flexible
• Validated against cycle-by-cycle simulation and the R10k
• Future work: deeper pipelines, more hardware validations, additional domains
• Source code at: http://arch.cs.ucdavis.edu/~oskin