TRANSCRIPT
A Mechanistic Model for Superscalar Processors
J. E. Smith, University of Wisconsin-Madison
Lieven Eeckhout, Stijn Eyerman, Ghent University
Tejas Karkhanis, AMD
Superscalar Modeling © J. E. Smith, 2006 2
Interval Analysis
Superscalar execution can be divided into intervals separated by miss events
• Branch mispredictions
• I-cache misses
• Long D-cache misses
• TLB misses, etc.
Provides more insight than simulation
• You can see the forest and the trees
• Supplements simulation, not a replacement
[Sketch: IPC vs. time; branch mispredicts, an I-cache miss, and a long D-cache miss divide execution into intervals 0-3]
Outline
Development of Interval Analysis
• Modeling ILP
• Modeling miss events
Balanced Superscalar Processors
• Performance components
• Optimal pipeline configurations
Performance Counter Architecture
• Accurate CPI stacks
Superscalar Processors
[Block diagram: branch predictor (mispredict rate) and I-cache (miss rate) feed a fetch buffer and decode pipeline (F D D I); issue buffer feeds execution units (# and type of units, unit latencies); reorder buffer (window, W entries) with physical register file(s); load queue, store queue, L1 data cache (# ports, miss rate), MSHRs, L2 cache (miss rate), and main memory latency; annotations for # entries, pipeline depth, and the instruction-delivery algorithm]
Superscalar Processors
Ifetch
• Adequate fetch resources to sustain decode/dispatch width D
• F > D, plus a fetch buffer to smooth flow
Decode
• Assume decode pipe and dispatch bandwidth D
Window
• Window, size W, holds in-flight instructions (equivalent to ROB)
• Issue buffer holds a subset of the window (as an optimization)
• Assume a unified issue buffer, but the model can support partitioned buffers
Issue
• Width may be more or less than dispatch and commit widths
Retire
• Retire width R typically equals dispatch width
Superscalar Processor Performance
Maximum IPC under ideal conditions
• No cache misses or branch mispredictions
Miss events disrupt the smooth flow
• In a balanced design, performance is all about the transients
[Sketch: IPC vs. time, with branch mispredict, I-cache miss, and long D-cache miss transients]
Modeling ILP
Relationship between maximum window size W and achieved issue width i
Depends on program dependence structure
Has a long history…
Riseman and Foster (1972)
Basic relationship between window size and IPC
• Classic study
• Approximately quadratic relationship under ideal conditions
[Plot: issue width i vs. window size W]
Wall (1991)
Limits of ILP
• Another classic study
• Approximately quadratic relationship under "perfect" conditions
Michaud, Seznec, Jourdan
More recent study
Key result (Michaud, Seznec, Jourdan):
• Approximately quadratic relationship
Our Experiment
Ideal caches and predictor
Efficient I-fetch keeps the window full
Graph issue rate i as a function of window size W
• Approximately quadratic relationship
Modeling IW Characteristic
Clearly a function of program dependence structure
Simple, single-level dependence models don't work very well
• Need to consider dependence chains
Slide a window over the dynamic instruction stream and compute the average critical path k(W)
For unit latency, i = W/k(W)
[Diagram: a window sliding over the dynamic instruction stream]
Average Critical Path
For our benchmarks, 1.3 ≤ β ≤ 1.9
• Quadratic when β = 2
Power-law fit to the average critical path: k(W) = (1/α)·W^(1/β)
Unit latency: avg. IPC i = W/k(W) = α·W^(1 − 1/β)
Average latency ℓ: avg. IPC i = (α/ℓ)·W^(1 − 1/β), i.e. W = (i·ℓ/α)^(β/(β − 1))
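A minimal sketch of this power-law I-W model, assuming fitted parameters α and β and an average latency ℓ; the specific values below are invented:

```python
# Sketch of the power-law I-W model: critical path k(W) = (1/alpha) * W**(1/beta),
# achieved IPC i = W / (l * k(W)) = (alpha/l) * W**(1 - 1/beta).
# alpha, beta, and avg_latency are program-dependent fit parameters;
# the default values here are invented for illustration.

def achieved_ipc(W, alpha=1.0, beta=1.6, avg_latency=1.0):
    """Average issue rate sustained by a window of W instructions."""
    k = (1.0 / alpha) * W ** (1.0 / beta)   # average critical path length
    return W / (avg_latency * k)            # = (alpha / l) * W**(1 - 1/beta)

def window_for_ipc(i, alpha=1.0, beta=1.6, avg_latency=1.0):
    """Window size needed to sustain issue rate i (inverse of achieved_ipc)."""
    return (i * avg_latency / alpha) ** (beta / (beta - 1.0))

# Sustaining twice the IPC costs much more than twice the window:
assert window_for_ipc(4.0) / window_for_ipc(2.0) > 2.0
```

With β = 2, `window_for_ipc` grows as i², matching the "quadratic" characterization: doubling IPC requires a 4× larger window.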
Generic Interval
All intervals follow same basic profile
[Sketch: instructions per cycle vs. time for one interval: ramp-up as instructions enter the window, a transient whose length depends on the type of miss event, and ramp-down as the window drains]
I-Cache Miss Interval
total time = n/D + ciL1
n = no. of instructions in interval
D = decode/dispatch width
ciL1 = miss delay cycles
Predicts that the performance loss is independent of pipe length
[Sketch: useful time = n/D; the window drains during the miss delay; the pipeline refills afterward]
Independence from Pipe Length
16K I-cache; ideal D-cache and predictor
Two different pipeline lengths (4 and 8 cycles); I-cache miss delay 8 cycles
Penalty is independent of pipe length and similar across benchmarks
[Bar chart: penalty in cycles per benchmark (bzip … vpr), 4 vs. 8 front-end stages]
Branch Misprediction Interval
Total time = n/D + cdr(D) + cfe
n = no. of instructions in interval
D = decode/dispatch width
cdr(D) = drain cycles; a function of width (and ILP)
cfe = front-end pipeline length
[Sketch: useful time = n/D; window drain time = branch latency; refill time = front-end pipeline length]
Branch Resolution Time
Assumes mispredicted branch is one of the last instructions to issue
[Stacked percentage chart across benchmarks (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr); legend categories: 0, 1, 2, 3, 4, 5, >5]
Branch Misprediction Penalty
Branch penalty is dependent on interval length
The penalty can be 2+ times pipeline length
Penalty is less for short intervals; more for long intervals
See ISPASS ’06 paper for more details
Long D-cache Miss Interval
[Sketch: load enters window; load issues; ROB fills; issue window empties of issuable insns; data returns from memory after the miss latency; instructions enter the window again and issue ramps up to steady state; useful time = n/D; load resolution time and ROB fill time overlap the miss latency]
Long D-cache Miss Interval
For an isolated miss: total time = n/D − W/D + cLr(D) + cL2
n = no. of instructions in interval
D = decode/dispatch width
W = window (ROB) size
cLr(D) = load resolution time; a function of width
cL2 = L2 miss delay
Miss Event Overlaps
Branch misprediction and I-cache miss effects "serialize"
• i.e., penalties add linearly
Long D-cache misses may overlap with I-cache and branch-predict misses (and with each other)
• Overlap with other long D-cache misses is more important
• Overlaps with branch mispredictions and I-cache misses are insignificant
[Diagram: timeline with branch mispredicts, I-cache misses, and long D-cache misses]
Overlapping Long D-cache Misses
s/D reflects the amount of overlap
The total penalty is independent of s/D
[Sketch: 1st load enters the window and issues; 2nd load issues s/D cycles later; ROB fills; issue window empties of issuable insns; load 1 data returns from memory after the miss latency; useful time = n/D]
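A minimal sketch of this overlap rule, treating long misses within one ROB's worth of instructions of each other as a single penalty; all positions and latencies below are invented:

```python
# Sketch of the long D-cache miss overlap rule: a long miss whose load is
# within W instructions (one ROB's worth) of a preceding long miss is hidden
# under that miss's latency, so only the first miss of each cluster pays the
# full penalty. Miss positions and the penalty value are invented.

def long_miss_penalties(miss_positions, W, penalty):
    """miss_positions: instruction indices of long D-cache miss loads."""
    total = 0
    cluster_start = None
    for pos in sorted(miss_positions):
        if cluster_start is None or pos - cluster_start >= W:
            total += penalty        # first miss of a new cluster: full penalty
            cluster_start = pos     # later misses within W overlap with it
    return total

# Two misses 32 instructions apart fit in a 128-entry ROB: one penalty.
assert long_miss_penalties([1000, 1032], W=128, penalty=200) == 200
# Misses farther apart than the ROB serialize: two penalties.
assert long_miss_penalties([1000, 1500], W=128, penalty=200) == 400
```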
Experimental Results
For each long miss, collect stats on other misses within a “ROB distance”
• This is a trace statistic• Assume W/D = cLr
[Bar chart: cycles per benchmark (bzip … vpr), Simulation vs. Analytical Model]
Overall Performance
Sum over all intervals
I-cache miss interval: n/D + ciL1
Branch mispredict: n/D + cdr + cfe
Long D-cache miss: n/D − W/D + cLr + cL2 (non-overlapping)
Collect the n/D terms: Ntotal/D
Account for "ceiling inefficiency": ((D−1)/2D)·(miL1 + mbr + mL2)
Overall Performance
Total Cycles = Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)
 + miL1 · ciL1
 + mbr · (cdr + cfe)
 + mL2 · (−W/D + cLr + cL2)
TLB misses are treated like L2 misses
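The cycle model can be transcribed directly; every numeric input below (machine parameters, miss counts, delays) is hypothetical:

```python
# Direct transcription of the overall cycle model. m_* are miss-event counts,
# c_* are per-event delay cycles; all the numbers plugged in are hypothetical.

def total_cycles(N, D, W, m_il1, m_br, m_l2, c_il1, c_dr, c_fe, c_lr, c_l2):
    base = N / D                                          # dispatch cycles
    ceiling = (D - 1) / (2 * D) * (m_il1 + m_br + m_l2)   # ceiling inefficiency
    icache = m_il1 * c_il1                                # I-cache miss delays
    branch = m_br * (c_dr + c_fe)                         # drain + front-end refill
    dcache = m_l2 * (-W / D + c_lr + c_l2)                # ROB fill credited
    return base + ceiling + icache + branch + dcache

# Hypothetical 4-wide machine, 128-entry ROB, 100M instructions:
cycles = total_cycles(N=100e6, D=4, W=128,
                      m_il1=200_000, m_br=500_000, m_l2=300_000,
                      c_il1=10, c_dr=4, c_fe=8, c_lr=8, c_l2=200)
ipc = 100e6 / cycles   # about 1.16 for these inputs
```

Note how the long-miss term dominates for these inputs, which is consistent with the deck's emphasis on memory-level parallelism.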
Accuracy
Decode width D = 4: average error 4.2%; max 8.6%
D = 2: error 1.8%; D = 6: error 5.6%; D = 8: error 5.6%
Decode Efficiency
Compare with simulation (D = 4)
mcf is dominated by intervals of length 5 and 13
• Less efficient than the model would predict
This is an inherent inefficiency due to intervals
• Strongly correlates with interval lengths
Convert From Cycles to Time
Important if pipeline depth is to be modeled
• Latch overheads become important
Start with a baseline 5-stage front-end
• pb = # pipeline stages in baseline
Allow an arbitrary number of stages
• p = # pipeline stages
• Increase all latencies in proportion to relative depth: multiply cycles by p/pb
Convert total cycles to total time
• tp = total pipeline latency; to = latch overhead
• cycle time = tp/p + to
Convert to Absolute Time
Total Time = [Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)] · (tp/p + to)
 + miL1 · ciL1 · (p/pb) · (tp/p + to)
 + mbr · (cdr(p,D) + cfe) · (p/pb) · (tp/p + to)
 + mL2 · (−W/Dp + cLr(p,D) + cL2) · (p/pb) · (tp/p + to)
TPI = Total Time / Ntotal
Now, consider some of the terms in isolation
Base TPI + One Linear Miss Event
[Plot: component TPI vs. pipeline stages (5-35) for widths 2, 4, 6, 8, with one miss event]
Total Time = [Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)] · (tp/p + to)
 + miL1 · ciL1 · (p/pb) · (tp/p + to)
 + mbr · (cdr(p,D) + cfe) · (p/pb) · (tp/p + to)
 + mL2 · (−W/Dp + cLr(p,D) + cL2) · (p/pb) · (tp/p + to)
TPI = Total Time / Ntotal
[Plot: total TPI vs. pipeline stages (5-35) for widths 2, 4, 6, 8, with one miss event]
Pipelining of Miss Events
[Plot: TPI vs. pipeline stages (5-35) for 1-4 miss events through a fully pipelined unit]
Not all paths are fully pipelined
• e.g., cache misses may not be fully pipelined
• A pipeline factor (0 ≤ f ≤ 1) can be added to a term
• Example: the I-cache miss term becomes miL1 · ciL1 · (p/pb) · (tp/p + fiL1·to)
[Plot: TPI vs. pipeline stages (5-35) for pipeline factors 1, 0.5, 0.25, and non-pipelined]
Fetch Inefficiency
Inherent fetch inefficiency
• Due to the presence of misses (as opposed to structural inefficiency)
• More important for wider pipelines
[Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)] · (tp/p + to)
[Plot: TPI vs. pipeline stages for widths 2, 4, 6, 8, with and without the inherent-inefficiency/overhead terms]
Miss Events Dependent on ROB Size
Miss events are dependent on ROB size
• And therefore dependent on depth/width for balanced designs
Branch mispredicts go up due to late update of the predictor
L2 miss behavior may be better or worse depending on overlaps
• Deeper pipeline → longer miss penalty
• Longer ROB → more MLP
Balanced Superscalar Processor Design
Definition: at the I-W balance point:
• Under ideal conditions, achieved issue width i = I; decreasing W diminishes achieved issue width
• For practical issue widths, there is enough ILP that balance can be achieved (see earlier work)
• Balance does not imply overall width/depth optimality
Provide adequate numbers of other resources
• Issue buffer, load/store buffers, rename regs., functional units, etc.
• Reducing resources below the adequate level reduces performance
Balanced Superscalar Processor Design
Choose width/depth; optimize the other elements based on width/depth
[Diagram: issue width (achieved width) sets ROB size through the beta (~quadratic) relationship; ROB size sets issue buffer size, # rename registers, load/store buffer sizes, and numbers of functional units through linear relationships; I-fetch resources and commit width scale linearly with issue width; pipeline depth also relates to ROB size through the beta (~quadratic) relationship; width and depth are inversely related: at the optimal point, wider issue implies a shallower pipeline]
Optimize Pipeline Depth
Start with a baseline 5-stage front-end
• pb = # pipeline stages in baseline
Evaluate 1×, 2×, 3×, 4×, 5× depths
• Increase all latencies in proportion to depth; multiply by p/pb
Convert total cycles to total time
• cycle time = tp/p + to
• p = # stages; tp = total pipeline latency; to = latch overhead
Total Time = [Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)] · (tp/p + to)
 + miL1 · ciL1 · (p/pb) · (tp/p + to)
 + mbr · (cdr(p,D) + cfe) · (p/pb) · (tp/p + to)
 + mL2 · (−W/Dp + cLr(p,D) + cL2) · (p/pb) · (tp/p + to)
TPI = Total Time / Ntotal
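The depth sweep can be sketched numerically. This sketch assumes miss delays scale with p/pb against a pb = 5 baseline, keeps the window-fill credit at a fixed W/D cycles, and uses tp/to = 55 as in the deck's Hartstein and Puzak setting; every machine parameter and miss count below is hypothetical:

```python
# Numerical sweep of pipeline depth p for the time model. Miss delays scale
# with p/pb relative to a pb = 5 stage baseline; cycle time is tp/p + to with
# tp/to = 55. All machine parameters and miss counts are hypothetical, and the
# window-fill credit is simplified to a fixed W/D cycles.

def tpi(p, N=100e6, D=4, W=128, pb=5, tp=55.0, to=1.0,
        m_il1=200_000, m_br=500_000, m_l2=300_000,
        c_il1=10, c_dr=4, c_fe=8, c_lr=8, c_l2=200):
    cycle_time = tp / p + to                 # time per cycle
    scale = p / pb                           # deeper pipe -> more delay cycles
    base = (N / D + (D - 1) / (2 * D) * (m_il1 + m_br + m_l2)) * cycle_time
    icache = m_il1 * c_il1 * scale * cycle_time
    branch = m_br * (c_dr + c_fe) * scale * cycle_time
    dcache = m_l2 * (-W / D + (c_lr + c_l2) * scale) * cycle_time
    return (base + icache + branch + dcache) / N   # time per instruction

best_p = min(range(5, 36), key=tpi)          # depth that minimizes TPI
```

The sweep exposes the basic trade-off: shallower pipes waste time in latch overhead per unit of logic, deeper pipes inflate miss-event penalties, and the minimum-TPI depth sits in between.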
Pipeline Depth Results
Use tp/to = 55, as in Hartstein and Puzak
• Also illustrates the accuracy of the model
• Consider four typical benchmarks
Pipeline Depth Results
On average, 2× the baseline pipeline depth is optimal; consistent with H&P
Optimize Pipeline Width
In general, wider means higher performance (up to 8-wide)
Optimal depth becomes shallower as width grows
Diminishing returns with wider pipelines
• 4 vs. 2: 13.3%; 6 vs. 4: 7.1%; 8 vs. 6: 2.9%
Short Interval Effects
With short intervals, the peak issue rate may never be reached
Example: assume 1 mispredict every 96 instructions
• E.g., SPEC benchmark crafty with a 4K gshare predictor
• Max issue rate is never reached for D = 6, 8
Yet there is a benefit from wider pipelines
[Plot: IPC vs. cycle (0-60) within a short interval for D = 2, 4, 6, 8]
Benefit Does Not Come From Issue Width
Benefit comes from wider decode/dispatch width
• Get to the next I-cache miss sooner
• Resolve branch mispredicts sooner
• Benefit comes from faster ramp-up
• D = 8 is faster than D = 6
• D = 8, I = 6 gives the same performance as D = 8, I = 8
[Plot: IPC vs. cycle (0-60) within a short interval for D = 2, 4, 6, 8]
Potential High Perf Processor
Widen fetch, decode, retire
• Keep relatively narrow issue
Lengthen ROB
• And related structures
[Block diagram: the same superscalar pipeline as shown earlier]
Issue Buffer Sizing
[Plot: issue buffer size vs. reorder buffer size; linear fit y = 0.3115x]
Similar to ROB sizing
• Use the average path rather than the average critical path (see Tejas' thesis)
Processor      ROB Size   Issue Buffer   Ratio
Intel Core        96          32          .3
Power4           100          36          .4
MIPS R10K         64          20          .3
Pentium Pro       40          20          .5
Alpha 21264       80          20          .25
Opteron           72          24          .3
AMD K5            16           4          .25
Function Unit Demand Variation
[Plot (example: gcc): IALU demand over 100M instructions; curves for mean, mean + 1 stdev, mean + 2 stdev, and actual]
Function Unit Resources
Demand proportional to instruction mix; dependent on program and phases
• Collect phase-based data
Number of functional units of type k (must be an integer):
• Fk = ⌈(μ(Dk) + 2·σ(Dk)) · Lk⌉
• Dk = demand for unit k
• Lk = issue latency for unit k
• Gk = fraction of instructions using unit k
Use a similar approach for other hardware resources
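A sketch of one plausible reading of this sizing rule: provision for mean per-cycle demand plus two standard deviations (the "Mean + 2 stdev" curve on the previous slide), scaled by the unit's issue latency and rounded up. The demand trace below is invented:

```python
# Functional-unit sizing sketch: units of type k = ceil((mean + 2*stdev of
# per-cycle demand) * issue latency Lk). The demand trace is invented.
import math
import statistics

def units_needed(demand_per_cycle, issue_latency):
    """demand_per_cycle: per-phase issue demand (ops/cycle) for this unit type."""
    mu = statistics.mean(demand_per_cycle)
    sigma = statistics.stdev(demand_per_cycle)   # sample standard deviation
    return math.ceil((mu + 2 * sigma) * issue_latency)

# Hypothetical FP-multiply demand over six program phases, issue latency 4:
demand = [0.2, 0.35, 0.3, 0.5, 0.25, 0.4]
fp_mul_units = units_needed(demand, issue_latency=4)
```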
Comparison With H&P
H&P:
Total Time = Ntotal/α · (tp/p + to)
 + γ·NH · (to·p + tp)
Empirical: fit to detailed simulation data to determine α and γ
• Requires re-simulation if caches, predictor, pipeline factor, etc. change
Interval Model:
Total Time = [Ntotal/D + ((D−1)/2D)·(miL1 + mbr + mL2)] · (tp/p + to)
 + miL1 · ciL1 · (1/pb) · (fiL1·to·p + tp)
 + mbr · (cdr(p,D)/p + cfe) · (1/pb) · (to·p + tp)
 + mL2 · (−W/Dp + cLr(p,D)/p + cL2) · (1/pb) · (fL2·to·p + tp)
Mechanistic: bottom-up; no need to perform detailed simulation
• Not all hazard terms are linear in p
• Not all hazard terms are independent of D
Application: Performance Architecture
Construct performance counters based on the interval model
Total cycle counter + one counter per miss-event type
Front-end miss events
• Front-end Miss Event Table (FMT)
Back-end miss events
• Begin counting when a full ROB stalls
• Increment the appropriate counter depending on the instruction at the ROB head: D-TLB miss, L2 D-cache miss, L1 D-cache miss, long functional unit (divide)
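These counters feed a CPI stack: a base component plus one component per miss-event type. A minimal sketch of assembling one, with hypothetical counter values:

```python
# Sketch of turning interval-model counters into a CPI stack: each miss-event
# counter contributes cycles/instruction, and the remainder is the base
# component. All counter values below are hypothetical.

def cpi_stack(total_cycles, n_insts, miss_cycles):
    """miss_cycles: dict mapping miss-event type -> cycles charged to it."""
    stack = {event: cycles / n_insts for event, cycles in miss_cycles.items()}
    stack["base"] = (total_cycles - sum(miss_cycles.values())) / n_insts
    return stack

stack = cpi_stack(
    total_cycles=90_000_000, n_insts=100_000_000,
    miss_cycles={"branch": 6_000_000, "i-cache": 2_000_000,
                 "L2 d-cache": 50_000_000},
)
# The components sum to the measured CPI by construction.
```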
Performance Architecture: FMT
One entry per outstanding branch
Tracks pre-window instructions
• Between fetch and the dispatch tail
Tracks in-flight instructions
• Between ROB tail and ROB head
Table increments
• For an I1, I2, or I-TLB miss, increment the counter pointed to by fetch
• Branch penalty counters between head and tail increment every cycle
Counter updates
• When a correctly predicted branch retires, update the I1, I2, I-TLB counters
• When a mispredicted branch retires, update the branch mispredict counter (and continue counting until the first instruction is dispatched)
Simplified FMT
Shared I1, I2, I-TLB entry
Instructions in the ROB are marked with an I-cache miss or I-TLB miss
When a miss instruction retires:
• The shared entry is copied to the counters
• ROB tag bits are cleared
When a mispredicted branch retires:
• Add to the branch mispredict counter
• Clear the shared entries
Evaluation
Compare:
• Simulation – add miss events one at a time and measure the difference
• Simulation-rev – same as above, but reverse the order of miss events
• Naïve – count miss events, multiply by a fixed penalty
• Naïve non-spec – similar to above, but wrong-path events not counted
• Power5 – IBM Power5 method
• FMT
• sFMT
Evaluation
Comparison
FMT and sFMT are the most accurate
• Naïve is the worst
FMT and sFMT are similar
• The simplified version is adequate
Power5 underestimates front-end miss events
Interval Model Development
• Michaud, Seznec, Jourdan – issue transient
• Tejas' gap model – all transients
• Taha and Wills – interval (macro block) model
• Hartstein and Puzak – optimal pipelines
Conclusions
Intervals yield a divide-and-conquer approach
Supports intuition (adds confidence to intuition)
It's all about the transients
• The only things that count are cache misses and branch mispredictions
Applications: automated design, performance monitoring, very fast simulation, optimizing compiler analysis, etc.
Analysis of pipeline limits
• Reinforces conventional wisdom
• We are close to the practical limits for depth and width
Extends to energy modeling (Tejas' PhD)