Implementing Optimizations at Decode Time
Ilhyun Kim, Mikko H. Lipasti
PHARM Team, University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm
May 28, 2002 Ilhyun Kim and Mikko Lipasti--PHARM Team, UW-Madison ISCA-29 2
What this talk is about
It's not about new optimizations
  Memory reference combining
  Silent store squashing
It's not about decode
  How to build an instruction decoder
It is about implementation
  A way to implement dynamic optimizations in a pipeline w/ speculative scheduling
"Implementing Optimizations at Decode Time"
Outline
Speculative Scheduling: why it causes problems with dynamic optimizations
Speculative Decode: enables dynamic optimizations in the processor core
Case Study: Memory Reference Combining
Case Study: Silent Store Squashing
Conclusions
Where do you want to put optimizations?
Optimization trade-offs
[Diagram: a spectrum of optimization points, from most global to most dynamic: compiler; binary translation / optimization (virtual machine to host machine); decode / trace cache fill (instr cache, trace cache); fetch and decode; execution core]
Can we achieve fully dynamic optimizations?
Dynamic events affect execution for the very next clock cycle
Speculative Scheduling
Atomic wakeup/select:
  Fetch -> Decode -> Issue/Exe -> Writeback -> Commit
Non-atomic (speculative) wakeup/select:
  Fetch -> Decode -> Schedule -> Dispatch -> RF -> Exe -> Writeback/Recover -> Commit
[Animation: with speculative wakeup/select, instructions between Schedule and Exe are "speculatively issued." When a producer's latency is mispredicted ("Latency changed!!"), a dependent reaches execution with an invalid input value, and a recovery path from Writeback/Recover re-schedules the speculatively issued instructions.]
Overview
Unlike the original Tomasulo's algorithm:
  Instructions are scheduled based on pre-determined latency
  Resources are allocated at schedule time
  Once instructions leave the scheduler, it is impractical to change resource/execution scheduling
The pipeline CANNOT react to observed events immediately
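The discipline above can be sketched as a toy model. Everything here (the Instr class, the schedule helper, the latency values) is an illustrative assumption, not code from the paper; it only shows that a dependent issued off a predicted latency must be replayed when the real latency differs:

```python
# Toy model of speculative scheduling: a consumer is woken up from the
# producer's *predicted* latency, and replayed if the real latency differs.

class Instr:
    def __init__(self, name, predicted_lat, actual_lat):
        self.name = name
        self.predicted_lat = predicted_lat
        self.actual_lat = actual_lat

def schedule(producer, consumer, issue_cycle=0):
    """Return (consumer issue cycle, replayed?) under speculative wakeup."""
    # The consumer is speculatively selected assuming the predicted latency.
    spec_issue = issue_cycle + producer.predicted_lat
    if producer.actual_lat == producer.predicted_lat:
        return spec_issue, False            # speculation was right
    # Latency mispredicted: the consumer is squashed and re-scheduled
    # once the real completion time is known.
    return issue_cycle + producer.actual_lat, True

load = Instr("lw r1, 4(r29)", predicted_lat=2, actual_lat=3)  # e.g. L1 miss
add = Instr("add r2, r1, 1", 1, 1)
cycle, replayed = schedule(load, add)
```

Note that the replay path, not the wakeup itself, is what makes late-arriving information useless to the in-flight schedule.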
What becomes harder?
Example: lw r1, 4(r29); add r2, r1 + 1
Optimization: avoid the cache access if the value is available in the RF (load and store reuse using register file contents, ICS 2001)
[Animation: the add wakes up assuming load latency 2 and issues. The load's value is then found in the RF: "Load lat 1!!" There is a bubble ahead, so the value could move up and the cache access be cancelled, but the add still executes at its originally scheduled slot, so the reduced load latency yields NO BENEFIT.]
Fully dynamic optimization in the execution stage is hard
Speculative scheduling breaks fully dynamic optimizations
Optimizing a parent instruction is not enough
  Benefits come from dependent (data, resource) instructions that execute sooner
  Instructions cannot react immediately under speculative scheduling
Some techniques become less efficient, or even unavailable, if they depend on:
  Instant re-execution
  Variable execution latency
  Instant resource allocation/deallocation
The scheduler should know what will happen in advance
  Not fully dynamic: a predictor is required
  How to communicate with the scheduler?
Our Solution
Example: lw r1, 4(r29); add r2, r1 + 1
Optimization: avoid the cache access if the value is available in the RF (load and store reuse using register file contents, ICS 2001)
[Animation: a predictor in the decode stage transforms the optimization target, assuming the optimization will apply (e.g., load -> move):
  lw r1, 4(r29) -> move r1, p1
  add r2, r1 + 1
The optimization is invisible to the scheduler (since it is a 'move' instruction). The scheduler wakes up the dependent instruction with the 'move' latency of 1, so the load appears to be optimized at execution and the benefit of the reduced latency is realized.]
Speculative Decode (SD)
Decoding instructions into an optimistic sequence rather than one that works correctly in all cases (unsafe)
  Reaps the benefits of fully dynamic optimization when correctly predicted
  Requires verification code for correctness
  Flushes the pipeline when mispredicted
Example: load value prediction
  lw r1, 36(r29)          lw r1, 36(r29)
  add r3, r1, r5    ->    p2 = predicted value
                          bne r1, p2, softtrap
                          add r3, p2, r5
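The expansion above can be sketched as a decode-time rewrite. The helper function, the physical register name p2, and the predicted value are illustrative assumptions; only the shape of the emitted sequence (real load, predicted value, verifying branch) follows the slide:

```python
# Sketch of the load-value-prediction expansion: one load becomes an
# optimistic I-ISA sequence whose dependents can consume the predicted
# value immediately, with a compare-and-trap for verification.

def speculative_decode_lvp(load_instr, dest, predicted_value, phys="p2"):
    """Expand one load into an optimistic (unsafe) I-ISA sequence."""
    return [
        load_instr,                        # still performs the real load
        f"{phys} = {predicted_value}",     # predicted value, available at once
        f"bne {dest}, {phys}, softtrap",   # verify; mispredict -> flush/refetch
    ]

seq = speculative_decode_lvp("lw r1, 36(r29)", "r1", 42)
```

Dependents such as the add would be renamed to read p2 instead of r1, which is why they can issue before the load completes.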
Benefits of SD
Pre-schedules optimizations outside the OoO core
  Enables dynamic optimizations and eliminates resource contention more effectively than even fully dynamic optimizations, leading to better performance
Implements optimizations using existing I-ISA primitives
  Implements microarchitectural ideas with minimal core change
  Reuses the existing data/control paths in the core
  Minimizes negative effects on the scheduler (invisible to the scheduler)
[Diagram: Fetch -> Decode (guided by a predictor) -> OoO execution core -> Commit; original instructions enter decode, transformed instructions enter the core]
When mispredicted: squashing & refetching (same as branch mispredictions)
Translation Layer for SD
Many decoders already have a translation layer between user-ISA and implementation-ISA, because direct implementation of complex instructions is difficult: P6, Pentium 4, Power4, K7, ...
Functionality required for SD:
  One-to-multiple instruction expansion (x86 decoders)
  Dynamically variable mapping between U-ISA and I-ISA (experimental S/390)
Reducing the decode overhead:
  Trace cache / decoded instruction cache (Pentium 4)
  Instruction-path coprocessors (Chou and Shen, ISCA 2000)
  The performance drop is not drastic w/ extra decode stages (sensitivity study in the paper)
Outline
Speculative Scheduling
Speculative Decode
Case Study: Memory Reference Combining
Case Study: Silent Store Squashing
Conclusions
Case Study: Memory Reference Combining
Discussed extensively in the literature
  Wilson et al.: "Increasing cache port efficiency for dynamic superscalar microprocessors," ISCA 1996
[Diagram: the LSQ holds LW 100, LW 404, LW 104, ..., LW 400, LW 104; a combinable load issues one access over the 64-bit datapath to the cache/memory (words 100/104 or 400/404); a 64-bit data buffer with byte selection then marks multiple loads completed]
One cache access satisfies multiple loads ("load all" scheme): cache port / latency benefits
BUT the speculative scheduler should know if they can be combined; otherwise it fails to achieve both benefits
Reference Combining via SD
Wide datapaths in support of instruction set extensions
  AMD Hammer project: x86-64
  64-bit PowerPC implementations
  Multimedia extensions (SSE, MMX, AltiVec, ...)
Many programs are still written in 32-bit mode for backward compatibility
SD enables existing binaries to benefit from wider datapaths w/o recompilation
Wider (128-bit) combining leads to more benefits (performance data in the paper)
Reference Combining via SD
Detecting combinable pairs statically
  Same base register with word-size offset
  Two adjacent word memory instructions in program order
Predict the alignment of references

Original:
  lw r1, 0(r10)
  lw r2, 4(r10)
  lw r3, 8(r10)
  lw r4, 12(r10)
Doubleword-aligned:
  dlw r1, 0(r10); exthi r2, r1
  dlw r3, 8(r10); exthi r4, r3
Word-aligned only:
  lw r1, 0(r10)
  dlw r2, 4(r10); exthi r3, r2
  lw r4, 12(r10)
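As a sketch, the static pair detection plus doubleword rewrite above might look like the following. The dlw/exthi mnemonics come from the slide; the regex parsing, the helper name, and the alignment flag are illustrative assumptions:

```python
# Minimal sketch of the decode-time combining rewrite: two adjacent word
# loads off the same base, with offsets 4 apart, become one doubleword
# load plus an upper-half extract when predicted doubleword-aligned.
import re

LW = re.compile(r"lw (\w+), (\d+)\((\w+)\)")

def combine_pair(i1, i2, dword_aligned=True):
    """Return the combined sequence, or None if the pair is not combinable."""
    m1, m2 = LW.match(i1), LW.match(i2)
    if not (m1 and m2):
        return None
    r1, off1, base1 = m1.groups()
    r2, off2, base2 = m2.groups()
    # Combinable: same base register, word-adjacent offsets, predicted aligned.
    if base1 != base2 or int(off2) - int(off1) != 4 or not dword_aligned:
        return None
    return [f"dlw {r1}, {off1}({base1})", f"exthi {r2}, {r1}"]

combined = combine_pair("lw r1, 0(r10)", "lw r2, 4(r10)")
```

With dword_aligned=False the same pair would be left as two word loads, matching the "word-aligned only" case.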
Reference Combining via SD
Pipeline front-end:
[Diagram: Fetch -> Decode -> to execution core; a sequence detector spots adjacent loads/stores, and a predictor keeps an alignment history of loads/stores; when combining is predicted, lw+lw -> dlw + extract and sw+sw -> merge + dsw]
When misaligned:
  The memory system detects it (same as the base case)
  After the pipeline is drained, the original instructions are fetched again and decoded without transformation
Microarchitectural Assumptions
SimpleScalar PISA w/ speculative scheduling
  4-wide, 8-stage pipeline
  Hybrid branch predictor (gshare + bimodal)
  64-entry RUU; 32-entry load / 16-entry store schedulers
  64KB I/DL1, 512KB unified L2
  2 load / 1 store ports (mutually exclusive)
  2 store buffers outside the OoO core
HW memory reference combining (HWC)
  Magic scheduler w/ perfect combining knowledge + store merging in the store buffer (for store combining)
HWC vs. SDC – cache access reductions
[Chart: normalized DL1 accesses (y-axis 0.4 to 1.0) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; series: HWC w/ oracle scheduler, SDC]
HWC reduces more cache accesses than SDC
HWC vs. SDC – LSQ contention reductions
[Chart: load/store scheduler full rate (y-axis 0 to 1) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; series: Base, HWC w/ oracle scheduler, SDC]
SDC reduces LSQ contention more (fewer memory instructions)
HWC vs. SDC – Speedups
[Chart: speedups (%) (y-axis 0 to 12) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; series: HWC w/ oracle scheduler, SDC]
SD can reap many of the benefits of pure hardware implementations
Outline
Speculative Scheduling
Speculative Decode
Case Study: Memory Reference Combining
Case Study: Silent Store Squashing
Conclusions
Case Study: Silent Store Squashing (SSS)
Eliminates stores that do not change architectural state
  Reduces core and memory system contention
A store is converted implicitly into 3 operations: (1) access memory, (2) compare values, (3) nullify the store when silent
Separate load/store schedulers imply replication and more contention -> do explicit conversion
[Diagram: the store sits in the store scheduler while its converted load sits in the load scheduler; both access memory, the loaded value is compared against the store value, and the store is nullified when silent]
Silent Store Squashing via SD
Explicitly removes predicted-silent stores
  Reduces store scheduler contention
  add r1, 1, r1           add r1, 1, r1
  lw r5, 4(r10)     ->    lw r5, 4(r10)
  sw r1, 16(r29)          lw p1, 16(r29)
                          bne p1, r1, trap
Load + trap for store verify: the pipeline is drained when not silent
No store, no aliasing: silent stores do not change the value, so there is no RAW, allowing later loads to bypass earlier unresolved stores even with true dependences
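The store-to-verify rewrite above can be sketched as follows. The mnemonics (lw, bne ... trap) and the physical register p1 follow the slide; the helper function and the tuple encoding of a store are illustrative assumptions:

```python
# Sketch of the decode-time rewrite for a store predicted silent: the
# store is replaced by a verifying load plus a branch that traps (drains
# the pipeline and refetches) if the stored value actually differs.

def squash_silent_store(store, predicted_silent, phys="p1"):
    """store: (opcode, value_reg, addr) -> list of decoded ops."""
    op, reg, addr = store
    if not predicted_silent:
        return [f"{op} {reg}, {addr}"]     # leave the store alone
    return [
        f"lw {phys}, {addr}",              # read the current memory value
        f"bne {phys}, {reg}, trap",        # not silent -> trap & refetch
    ]                                      # note: no store op is issued at all

ops = squash_silent_store(("sw", "r1", "16(r29)"), predicted_silent=True)
```

Because no store op is emitted, later loads never see an unresolved store to alias against, which is the "no RAW" point above.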
HWSSS vs. SDSSS – memory disambiguation
[Chart: average clock cycles a load issue is blocked by stores (y-axis 0.0 to 0.8) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser; series: Base, SDSSS]
Better memory disambiguation achieved: no store means fewer store-to-load block cycles
HWSSS vs. SDSSS – Speedups
[Chart: speedups (%) (y-axis -2 to 23) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; series: HWSSS, SDSSS]
SDSSS outperforms HWSSS
  Better memory disambiguation
  HWSSS does not reduce contention in the store scheduler
Conclusions
Speculative scheduling makes optimizations in the execution stage impractical
Pre-schedule optimizations by transforming instructions
Advantages of SD-based implementations
  Enables execution-stage optimizations
  Reuses existing data/control paths
  No negative effect on instruction scheduling
  Reduces contention inside the core better
Two case studies show that SD can reap many benefits of pure hardware implementations
  Memory reference combining: less queue contention
  Silent store squashing: less queue contention, better memory disambiguation
Backup slides
Is HWC easy to integrate?
Schedule: detecting piggyback loads to be issued at the same clock cycle
  Actual values are not involved in scheduling; can combinable loads be detected without effective addresses?
Register file & result bus: more loads satisfied at the same time means more result bus bandwidth and more RF write ports
[Diagram: baseline datapath (Schedule -> LSQ / addr gen -> load unit, alongside RF -> Exe1 -> Exe2), with the result bus writing back to the register file, vs. the same datapath driven by a "magic" schedule]
SD Combining Prediction
[Chart: percentage of all memory references (y-axis 0 to 50) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; categories: Mispredicted, Combinable but not predicted, Combined]
Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
Miss rates: ~0.1% of all references
Silence Prediction
[Chart: percentage of all stores (y-axis 0 to 100) for comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr; categories: Not silent / predicted not silent, Not silent / predicted silent, Checked, Silent / predicted not silent, Silent / predicted silent]
45% of silent stores detected (1024 entries, ~2.5KB)
Low miss rate (~1%)
Silence Predictor
Each PC-indexed entry holds the last store value (lower n bits), a confidence counter, and a threshold counter
The current store value (lower 8 bits) is compared against the last value: +1 (same) / -1 (different) to the confidence counter
The store verify result updates the threshold counter: -1 (silent) / +4 (not silent)
[Worked example; predictor state is (value, confid, thres):
 (a) PC x: Store 100, [A] -> decoded as Store 100 to [A]; state (100, 3, 4); memory A=100, B=100
 (b) PC x: Store 100, [B] -> decoded as Load from [B], compare, Store 100 to [B]; state (100, 4, 4); memory A=100, B=100
 (c) PC y: Store 50, [B]; memory A=100, B=50
 (d) PC x: Store 100, [B] -> decoded as Load, compare; state (100, 5, 3); memory A=100, B=50
 (e) verify fails (not silent) -> Store 100 to [B]; state (100, 0, 7); memory A=100, B=100]
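A toy model of one predictor entry, under stated assumptions: the counter updates (+1/-1 on value match, -1/+4 on verify outcome) come from the slide, while the predict rule (confidence >= threshold), the initial counter values, and the class shape are guesses made for illustration:

```python
# Toy model of one PC-indexed silence-predictor entry: low bits of the
# last store value, a confidence counter trained by value matches, and a
# threshold counter trained by store-verify outcomes.

class SilenceEntry:
    def __init__(self):
        self.last_value = None
        self.confidence = 0          # +1 on matching value, -1 on mismatch
        self.threshold = 4           # -1 on silent verify, +4 on non-silent

    def train_value(self, value):
        v = value & 0xFF             # only the lower 8 bits are kept
        self.confidence += 1 if v == self.last_value else -1
        self.last_value = v

    def train_verify(self, was_silent):
        self.threshold += -1 if was_silent else 4

    def predict_silent(self):
        # Assumed rule: predict silent once confidence reaches the threshold.
        return self.confidence >= self.threshold

e = SilenceEntry()
for _ in range(6):                   # the same value recurs at this PC
    e.train_value(100)
```

A non-silent verify then raises the threshold sharply, so the entry backs off quickly after a misprediction.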
Silence Squashing via SD
Explicit store verify depending on predictor state:
  store -> load+compare+store, or
  store -> load+branch
Eliminates negative effects on scheduling logic
  Explicit load issue for the verify: SSS is virtually invisible to scheduling logic
  Explicit compare/branch operation: the existing branch unit maintains correct machine state
[Diagram: Fetch -> Decode (with predictor) -> to execution core; when a silent store is predicted, the decoder emits load+comp+store or load+branch depending on whether the last store value is predicted silent]
SD Combining Prediction Predictor
[Diagram: the PC indexes a table of (tag, +4/-4 bit, next target register, 4-bit alignment history); the "is aligned?" outcome of each reference is shifted into the history, and combining is predicted when the history equals 1111 or 1010]
Combining is predicted when:
  1111: aligned 4 times in a row
  1010: base is increasing by 4
Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
Miss rates: ~0.1% of all references
Captures up to 26% of all memory references
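A toy model of one combining-predictor entry: the 4-bit history and the 1111/1010 patterns come from the slide, while the class shape and the doubleword-alignment test are illustrative assumptions:

```python
# Toy model of one entry's 4-bit alignment history: each observed
# reference shifts in one "is doubleword-aligned?" bit, and combining is
# predicted for history 1111 (aligned 4x in a row) or 1010 (base +4 each
# instance, so alignment alternates).

class CombiningEntry:
    def __init__(self):
        self.history = 0             # 4-bit alignment history

    def observe(self, addr):
        aligned = 1 if addr % 8 == 0 else 0
        self.history = ((self.history << 1) | aligned) & 0xF

    def predict_combinable(self):
        return self.history in (0b1111, 0b1010)

e = CombiningEntry()
for addr in (0, 8, 16, 24):          # doubleword-aligned run
    e.observe(addr)
```

A word-aligned-only stream stepping by 4 (0, 4, 8, 12) alternates the bit and lands on 1010, the second predicted-combinable pattern.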
Silence Prediction
Avoids unnecessary store verifies (load issues)
How do we train the predictor without the silence outcome? The silence outcome is only available when we do a store verify, so the last-value information is correlated for training
[State diagram: "No squash / no SD" (not silent) moves to "Check / load+compare+store" when the same value recurs for several instances, and on to "Squash / load+trap" when silent; different values or a non-silent outcome move back toward not silent]
45% of silent stores detected (1024 entries, ~2.5KB)
Low miss rate (~1%)
Future Work
Spectrum of power/performance design points attainable by speculative decode
  Single core, multiple marketing targets
Exposing complex control paths to the I-ISA
  Improving controllability of the processor core, achieving more benefit from SD
  Developing an I-ISA for complexity-effective core design
And more...