Dynamically Collapsing Dependencies for IPC and Frequency Gain
Peter G. Sassone, D. Scott Wills
Georgia Tech, Electrical and Computer Engineering
{ sassone, scott.wills } @ ece.gatech.edu
MICRO-37 Sassone & Wills / Georgia Tech / Dynamic Strands 2
Motivation
• Outside of the pipeline, global communication dominates
• The memory wall is well studied
• Inside the pipeline, traditionally computation or logic dominated
[Figure: pipeline stages (fetch, decode, rename, issue, exec, commit) with I cache, D cache, L2 cache, and memory]
Motivation
• Now dominated by local communication paths:
  – issue window
  – reorder buffer
  – register file
  – bypass network
• Bottlenecks both IPC and frequency
[Figure: issue queue, issue logic, three ALUs, and register file]
Motivation
• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality
[Figure: issue queue, issue logic, three ALUs, and register file]
Contributions
• Dynamic Strands:
  – collapse dependence chains without fan-out
  – exploit properties for simple value precomputation
  – increase efficiency of critical resources
  – preserve binary compatibility
• IPC improvements:
  – 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
  – 37% fewer in-flight instructions
  – reduced dependence on dependencies
Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion
Dyadic Dilemma
Performing any operation on more than two sources requires temporary values

int sum( int a, int b, int c, int d )
{
    return a + b + c + d;
}

. . .
add R1 ← R1, R2
add R1 ← R1, R3
add R9 ← R1, R4
. . .

[Figure: dataflow of the three adds: R1 + R2 gives R1’, R1’ + R3 gives R1’’, R1’’ + R4 gives R9]
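The dyadic dilemma can be illustrated with a minimal Python sketch. It assumes a hypothetical helper, `lower_sum` (not from the slides), that lowers an n-input sum into two-source adds and reports the temporaries this creates. The temporary names `t1`, `t2`, ... are illustrative labels only.

```python
def lower_sum(sources, dest):
    """Lower dest = sum(sources) into two-source adds.

    Returns (instructions, temporaries). Each instruction is a
    (dest, src1, src2) tuple. Temporary names (t1, t2, ...) are
    hypothetical labels for the intermediate values.
    """
    if len(sources) < 2:
        raise ValueError("need at least two sources")
    instructions = []
    temporaries = []
    acc = sources[0]
    for i, src in enumerate(sources[1:], start=1):
        last = i == len(sources) - 1
        target = dest if last else f"t{i}"
        if not last:
            temporaries.append(target)
        instructions.append((target, acc, src))
        acc = target  # the next add depends on this result
    return instructions, temporaries


insts, temps = lower_sum(["R1", "R2", "R3", "R4"], "R9")
for d, a, b in insts:
    print(f"add {d} <- {a}, {b}")
print("temporaries:", temps)
```

With n sources, this lowering needs n - 1 adds and n - 2 temporaries. The slide's code reuses R1 for the temporaries instead of fresh names.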
Transient Operands
• We term these temporary values transient operands:
  – values produced by an ALU inst
  – values consumed only once, and only by an ALU inst
• Common in modern integer workloads…
[Figure: percent of dynamic operands that are transient (y-axis 0% to 60%) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode) and Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr)]
On average, about 40% of all dynamic operands are transient
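As a rough illustration of the definition above, the following Python sketch counts transient values in a small dynamic trace. The `Inst` record, the `count_transients` function, and the trace are hypothetical and not taken from the slides. The sketch also assumes that a value's lifetime ends when its architectural register is overwritten.

```python
from collections import namedtuple

# Hypothetical trace record: kind is "alu", "load", "store", or "branch".
Inst = namedtuple("Inst", "kind dest srcs")


def count_transients(trace):
    """Return (transient_values, total_values) for a dynamic trace.

    A value is counted as transient when it is produced by an ALU
    instruction and read exactly once, by an ALU instruction, before
    its register is overwritten. Values still live at the end of the
    trace are conservatively counted as non-transient. Register
    values that exist before the trace starts are not counted.
    """
    live = {}  # register -> [producer kind, list of consumer kinds]
    transient = total = 0
    for inst in trace:
        # Sources are read before the destination is overwritten.
        for reg in inst.srcs:
            if reg in live:
                live[reg][1].append(inst.kind)
        if inst.dest is not None:
            if inst.dest in live:
                producer, consumers = live[inst.dest]
                total += 1
                if producer == "alu" and consumers == ["alu"]:
                    transient += 1
            live[inst.dest] = [inst.kind, []]
    total += len(live)  # values never overwritten
    return transient, total


trace = [
    Inst("alu", "R1", ("R1", "R2")),   # produces R1'
    Inst("alu", "R1", ("R1", "R3")),   # consumes R1', produces R1''
    Inst("alu", "R9", ("R1", "R4")),   # consumes R1''
    Inst("load", "R1", ("R9",)),       # overwrites R1''
]
print(count_transients(trace))
```

In this trace the two intermediate values R1' and R1'' are transient, while R9 (read by a load) and the loaded value are not.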
Strands
• Strands:
  – linear chains of instructions joined by transient operands
  – non-consecutive
  – span basic blocks
  – three instructions
  – only the final output needs to be committed
• Strands are common
  – dyadic temporaries
  – compiler strategies
  – language semantics
[Figure: chain of three additions combining a, b, c, and d]
Hardware Overview
[Figure: pipeline block diagram with fetch, decode, rename, issue queue, reg file, closed-loop ALUs, and commit, plus the strand cache fill unit, strand cache, and dispatch engine, which are marked "off the critical path"; arrows are labeled instructions, transients, and strands]
Algorithm Example
[Figure: the hardware overview diagram (fetch, decode, rename, issue queue, reg file, closed-loop ALUs, commit, strand cache fill unit, strand cache, dispatch engine) annotated with numbered steps]
Strand Cache Fill Unit
• Based around the operand table
• Detects conditions of transients
• When found…
  – append to existing strand
  – begin new strand

Instruction stream (current instruction: PC 1416):
  1404: R5 ← R0 + 0
  1408: . . .
  1412: R1 ← R5 + 0
  1416: R5 ← R0 + 0

Operand table:
  arch reg   last producer instruction   last consumer instruction   consumer count
  R4
  R5         PC 1404                     PC 1412                     1
  R6
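The operand table bookkeeping can be sketched in Python as follows. This is a simplified model under stated assumptions, not the authors' hardware: the class and method names (`OperandEntry`, `FillUnit`, `commit`, `_link`) are hypothetical, a value is checked for transience when its register is overwritten, and a new link is only ever appended to the tail of an existing strand.

```python
from dataclasses import dataclass

MAX_STRAND_LEN = 3  # the slides describe three-instruction strands


@dataclass
class OperandEntry:
    producer_pc: int
    producer_is_alu: bool
    consumer_pc: int = -1
    consumer_is_alu: bool = False
    consumer_count: int = 0


class FillUnit:
    """Hypothetical software model of the strand cache fill unit."""

    def __init__(self):
        self.operand_table = {}  # arch reg -> OperandEntry
        self.strands = []        # each strand is a list of PCs

    def commit(self, pc, is_alu, dest, srcs):
        # Record this instruction as a consumer of each source value.
        for reg in set(srcs):
            entry = self.operand_table.get(reg)
            if entry is not None:
                entry.consumer_pc = pc
                entry.consumer_is_alu = is_alu
                entry.consumer_count += 1
        if dest is None:
            return
        # The old value of dest dies here; check whether it was transient.
        old = self.operand_table.get(dest)
        if (old is not None and old.producer_is_alu
                and old.consumer_count == 1 and old.consumer_is_alu):
            self._link(old.producer_pc, old.consumer_pc)
        self.operand_table[dest] = OperandEntry(pc, is_alu)

    def _link(self, producer_pc, consumer_pc):
        for strand in self.strands:
            if strand[-1] == producer_pc and len(strand) < MAX_STRAND_LEN:
                strand.append(consumer_pc)  # append to existing strand
                return
        self.strands.append([producer_pc, consumer_pc])  # begin new strand


# The slide's example: R5 is produced at 1404, read once at 1412,
# and overwritten at 1416.
unit = FillUnit()
unit.commit(1404, True, "R5", ["R0"])
unit.commit(1412, True, "R1", ["R5"])
unit.commit(1416, True, "R5", ["R0"])
print(unit.strands)
```

When the instruction at 1416 overwrites R5, the table shows one ALU producer (1404) and a single ALU consumer (1412), so the sketch begins a new strand joining the two.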
Strand Cache
[Figure: strand cache lines (strand 1, strand 2, strand 3). Labeled fields: status bits; previous reader info; instructions i1, i2, i3; pc, ready, value; and seen, pc, inst entries for this instruction, source 1, and source 2]
About 175 bytes per line, though very few lines are needed for effect
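The storage cost implied by the per-line figure is simple arithmetic, sketched below in Python. The function name `strand_cache_bytes` is hypothetical, and the sketch assumes one strand per line with no additional overhead. The entry counts are the cache sizes evaluated in the results.

```python
BYTES_PER_LINE = 175  # "about 175 bytes per line" from the slide


def strand_cache_bytes(entries, bytes_per_line=BYTES_PER_LINE):
    """Approximate storage for a strand cache with the given entry count."""
    return entries * bytes_per_line


for entries in (16, 64, 256, 1024):
    size = strand_cache_bytes(entries)
    print(f"{entries:5d} entries: {size:7d} bytes ({size / 1024:.1f} KB)")
```

Under these assumptions a 16-entry cache needs about 2,800 bytes, which is in line with the 3KB strand cache cited in the conclusion.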
Dispatch Engine
• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table
[Figure: dispatch engine alongside decode and rename, with the strand cache and dirty table; labels: pre-renamed instructions; strands, recovery strands, kill signals]
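The first three bullets can be sketched as a stream rewrite in Python. This is a loose software analogy, not the hardware design. The `dispatch` function is hypothetical, every cached strand is assumed to be ready, lookup is assumed to be keyed on the PC of the strand's first instruction, and dirty-table checking is not modeled.

```python
def dispatch(stream, strand_cache):
    """Replace strand components in a PC stream with whole strands.

    stream: iterable of instruction PCs in fetch order.
    strand_cache: dict mapping a strand's first PC to the tuple of all
        component PCs (an assumed, simplified lookup scheme).
    Returns the output stream; strands appear as ("strand", pcs) entries.
    """
    out = []
    pending = []  # component PCs still to be removed when seen
    for pc in stream:
        if pc in pending:
            pending.remove(pc)  # component already covered by a strand
            continue
        strand = strand_cache.get(pc)
        if strand is not None:
            out.append(("strand", strand))  # insert the strand eagerly
            pending.extend(strand[1:])
        else:
            out.append(pc)
    return out


cache = {1404: (1404, 1412)}
result = dispatch([1400, 1404, 1408, 1412, 1416], cache)
print(result)
```

The strand is issued where its first component appears, and the later component at 1412 is dropped from the stream when it arrives.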
Closed-Loop ALUs
• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
  – two dependent ALU operations in a single cycle
  – intermediate values (the transients) are discarded!
  – final result still takes ½ cycle for full bypass
[Figure: ALU (½ cycle) with a "free" local bypass and a mode switch, feeding the full bypass network (½ cycle)]
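The timing argument can be worked through with a small Python model. It is a sketch under simplifying assumptions rather than a circuit-level claim: the ALU and the full bypass network each take half a cycle, the local bypass is treated as free, and the function names are hypothetical.

```python
from fractions import Fraction

ALU = Fraction(1, 2)          # ALU evaluation, in cycles
FULL_BYPASS = Fraction(1, 2)  # full bypass network, in cycles
LOCAL_BYPASS = Fraction(0)    # "free" local bypass in closed-loop mode


def normal_latency(n_ops):
    """Dependent chain where every result crosses the full bypass."""
    return n_ops * (ALU + FULL_BYPASS)


def closed_loop_latency(n_ops):
    """Chain run in closed-loop mode; only the final result is bypassed."""
    return n_ops * ALU + (n_ops - 1) * LOCAL_BYPASS + FULL_BYPASS


for n in (1, 2, 3):
    print(n, normal_latency(n), closed_loop_latency(n))
```

In this model two dependent operations occupy the ALU for a single cycle, and the final result pays the half-cycle full bypass once, instead of once per operation.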
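Oops… Dirty Read
[Figure: the strand R1 + R2 gives R1’, R1’ + R3 gives R1’’, R1’’ + R4 gives R9, followed by the instruction "load 16 [ R1 ]"; a second dataflow shows the sub-strand R1 + R2 gives R1’, R1’ + R3 gives R1’’]
• R1 is dirty!
• Insert recovery sub-strand to recover R1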
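A dirty read and its recovery can be sketched in Python. The slide names only a dirty table and a recovery sub-strand, so everything else here is an assumption: the function names are hypothetical, the dirty table is modeled as a dict from register to the strand prefix that recomputes it, and the strand's live inputs are assumed to be unchanged when recovery runs.

```python
def run_chain(regs, chain):
    """Evaluate dependent adds, forwarding values locally; return the last value."""
    local = {}
    value = None
    for dest, a, b in chain:
        value = local.get(a, regs[a]) + local.get(b, regs[b])
        local[dest] = value
    return value


def execute_strand(regs, strand, dirty):
    """Run a strand, committing only its final output.

    Destinations of earlier components are never written back, so they
    are marked dirty along with the prefix that recomputes them.
    """
    final_dest = strand[-1][0]
    result = run_chain(regs, strand)
    for i, (dest, _, _) in enumerate(strand[:-1]):
        if dest != final_dest:
            dirty[dest] = strand[: i + 1]  # later writers replace earlier ones
    regs[final_dest] = result
    dirty.pop(final_dest, None)


def read_reg(regs, reg, dirty):
    """Read a register, inserting a recovery sub-strand if it is dirty."""
    if reg in dirty:
        regs[reg] = run_chain(regs, dirty.pop(reg))
    return regs[reg]


regs = {"R1": 1, "R2": 2, "R3": 3, "R4": 4, "R9": 0}
dirty = {}
strand = [("R1", "R1", "R2"), ("R1", "R1", "R3"), ("R9", "R1", "R4")]
execute_strand(regs, strand, dirty)
stale = regs["R1"]                      # register file still holds the old R1
recovered = read_reg(regs, "R1", dirty)  # models the read by "load 16 [ R1 ]"
print(regs["R9"], stale, recovered)
```

After the strand runs, only R9 is updated. The read of R1 finds it dirty and recomputes R1’’ from the first two adds before the load can use it.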
Oops… Anti-Dependence Violation
[Figure: the strand R1 + R2 gives R1’, R1’ + R3 gives R1’’, R1’’ + R4 gives R9, with the instruction "load 32 [ R9 ]"]
• R9 has already been replaced
• Insert load immediate of previous value (R9 ← previous value)
• Renaming is not sufficient: this is outside the reorder buffer safety net
Instruction Coverage
[Figure: ALU instruction coverage (y-axis 0% to 60%) with various strand cache sizes (16, 64, 256, and 1024 cache entries) for a Mediabench sample (jpeg encode, epic encode, g721 encode, mpeg2 encode, pegwit encode, adpcm encode), a Spec2000int sample (gcc, gzip, parser, vpr), and the average]
• High coverage rates, but only with a big strand cache
• Less than a 15% replacement rate, regardless of cache size
• Average ALU inst coverage: 12% with 16 entries, 27% with 1024 entries
IPC Improvements
[Figure: 4-wide IPC speedup with 16-entry strand cache (y-axis 1.0 to 2.0) for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode), Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), and the harmonic mean]
• Average IPC speedup: 17% on 4-wide, 20% on 8-wide
• Some benchmarks almost double in IPC
• Some see almost no speedup at all
Resource Occupancy
• CISCification of instructions reduces traffic
  – reorder buffer occupancy is reduced by up to 37%
  – issue queue occupancy is reduced by up to 34%
  – traffic reduction coverage
• Reduced dependence on dependencies
  – opportunity for pipelined bypass
  – opportunity for pipelined issue
Resource Occupancy
• Caveat emptor
  – more worst case issue CAMs
  – more worst case register ports
• Prior work applicable
  – only 1.2 live inputs / strand
[Figure: strands drawn as chains of + operations]
Conclusion
• Key points:
  – eagerly executing macro-instructions value precomputation
  – limiting focus to transient operands
  – all new hardware off critical path
• Results:
  – IPC speedup of 18-20% with 3KB strand cache
  – potential for frequency gains
  – full binary compatibility
• Lots of current and future research:
  – relaxed constraint of ALU instructions
  – quantified frequency improvements
  – static detection of strands
Questions?
Backup Slides
Sensitivity to Dispatch Delay
[Figure: 4-wide IPC speedup with 16-entry strand cache (y-axis 1.0 to 2.0) at 0 cycle delay and 3 cycle delay for Mediabench (jpeg, epic, g721, mpeg2, pegwit, adpcm; encode and decode), Spec2000int (bzip, gcc, gzip, mcf, parser, vortex, vpr), and the harmonic mean]
• On average, speedup only drops 1% with three cycles of delay
• Most benchmarks lose a small amount of speedup
• Some actually get faster due to fewer errant strands