ICS’02 1
Low-Complexity Reorder Buffer Architecture*
*supported in part by DARPA through the PAC-C program and NSF
Gurhan Kucuk, Dmitry Ponomarev, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
16th Annual ACM International Conference on Supercomputing (ICS’02), June 24th 2002
Outline
ROB complexities
Motivation for the low-complexity ROB
Low-complexity ROB design
Results
Concluding remarks
What This Work is All About
Complex, richly-ported ROBs are common in modern superscalar datapaths
The number of ports increases further when results are held within ROB slots (example: Pentium III)
ROB complexity reduction is important for reducing power and improving performance
ROB dissipates a non-trivial fraction of the total chip power
ROB accesses stretch over several cycles
Goal of this work: Reduce the complexity and power dissipation of the ROB without sacrificing performance
Pentium III-like Superscalar Datapath
[Figure: block diagram with Fetch (F1, F2), Decode/Dispatch (D1, D2), instruction dispatch, IQ, instruction issue, function units FU1, FU2, ..., FUm (EX), LSQ, D-cache, ROB, Architectural Register File (ARF), and result/status forwarding buses.]
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment
ROB Port Requirements for a W-way CPU
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands
Decode/Dispatch: 1 W-wide write port to set up entries
Commit: 1 W-wide read port for instruction commitment
Where are the Source Values Coming From?
[Figure: the datapath block diagram with the three possible sources of operand values marked 1, 2, and 3.]
Source operand locations (96-entry ROB, 4-way processor, SPEC2K benchmarks):
Forwarding: 62%
ARF: 32%
ROB: 6%
How Efficiently are the Ports Used?
Writeback: W write ports to write results
Dispatch/Issue: 2W read ports to read the source operands (the ROB supplies only 6% of the source operands)
Decode/Dispatch: W write ports to set up entries
Commit: W read ports for instruction commitment
Approaches to Reducing ROB Complexity
Reduce the number of read ports for reading out the source operand values
More radical (and better): Completely eliminate the read ports for reading source operand values!
Reducing the Number of Read Ports
[Figure: bar chart of performance drop (%) per SPEC2000 benchmark, integer (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.) and floating point (applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.), with 1 read port and with 2 read ports.]
Average IPC drop: 3.5% with 1 read port, 1.0% with 2 read ports
Problems with Retaining Fewer Source Read Ports on the ROB
Need arbitration for the small number of ports
Additional logic needed to block the instructions which could not get the port.
Need a switching network to route the operands to correct destinations
Multi-cycle access still remains in the critical path of Dispatch/Issue logic
Our Solution: Elimination of Read Ports
[Figure: the datapath block diagram shown in three animation steps. The first two steps mark operand sources 1, 2, and 3; the last step marks only sources 1 and 3.]
Comparison of ROB Bitcells (0.18µ, TSMC)
Layout of a 32-ported SRAM bitcell
Layout of a 16-ported SRAM bitcell
Area Reduction – 71%
Shorter bit and wordlines
Our Solution: Elimination of Read Ports
[Figure: the datapath block diagram, annotated with the ROB area reduction.]
Area Reduction: 45%
Eliminating/Reducing the Number of Read Ports: Effects on Power Dissipation
Power is reduced because of:
shorter bitlines and wordlines
lower capacitive loading
fewer decoders
fewer drivers and sense amps
Completely Eliminating the Source Read Ports on the ROB
The Problem: Issue of instructions that require a value stored in the ROB will stall
Solutions:
Forward the value to the waiting instruction at the time of committing the value: LATE FORWARDING
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath block diagram in two animation steps, highlighting the result/status forwarding buses.]
Optimizing Late Forwarding
PROBLEM: If Late Forwarding is done for every result that is committed, additional forwarding buses are needed in order not to degrade the performance
SOLUTION: Selective Late Forwarding (SLF)
SLF requires an additional bit in the ROB
That bit is set by the dispatched instructions that require Late Forwarding
No additional forwarding buses are needed, since SLF traffic is very small
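The idea of Selective Late Forwarding can be illustrated with a minimal Python sketch. It is not taken from the slides: the names `RobEntry`, `slf`, `request_late_forwarding`, and `commit` are hypothetical, and the model ignores timing and bus arbitration. It only shows that a committed value is put on a forwarding bus when, and only when, some dispatched consumer has set the extra bit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RobEntry:
    value: Optional[int] = None  # result value, filled in at writeback
    slf: bool = False            # hypothetical name for the extra SLF bit


def request_late_forwarding(rob: List[RobEntry], slot: int) -> None:
    """Called at dispatch by a consumer that could not obtain its operand."""
    rob[slot].slf = True


def commit(rob: List[RobEntry], slot: int) -> Optional[Tuple[int, Optional[int]]]:
    """Commit one entry; return (slot, value) only if a late forward is needed."""
    entry = rob[slot]
    broadcast = (slot, entry.value) if entry.slf else None
    rob[slot] = RobEntry()  # free the slot and clear its SLF bit
    return broadcast
```

In this sketch an entry whose bit was never set commits silently, which is why the extra traffic on the forwarding buses can stay small.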
Late Forwarding: Use the Normal Forwarding Buses!
[Figure: the datapath block diagram, highlighting the result/status forwarding buses.]
Only 3.5% of the traffic is from SELECTIVE LATE FORWARDING
Performance Drop of Simplified ROB
[Figure: bar chart of performance drop (%) per SPEC2000 benchmark, integer (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.) and floating point (applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.), for no ROB read ports with SLF, 1 read port, and 2 read ports. Two bars carry the labels 37% and 17%.]
Average IPC drop: 9.6% with no ROB read ports (with SLF), 3.5% with 1 read port, 1.0% with 2 read ports
IPC Penalty: Source Value Not Accessible within the ROB
[Figure: timeline of the lifetime of a result value. Result generation and forwarding come first, then the value resides within the ROB, then Late Forwarding/commitment occurs, after which the value resides within the ARF.]
Improving IPC with No Read Ports
Cache recently generated values in a set of RETENTION LATCHES (RL)
Retention Latches are SMALL and FAST
Only 8 to 16 latches needed in the set
Entire set has 1 or 2 read ports
Datapath with the Retention Latches
[Figure: the datapath block diagram in two animation steps; the second step adds a block labeled RETENTION LATCHES.]
The Structure of the Retention Latch Set
[Figure: the retention latch set, 8 or 16 latches]
Inputs: L ROB slot addresses (L=1 or 2)
L-ported CAM field (key = ROB_slot_id)
W write ports for writing up to W results in parallel
Fields: Status, Result Values
Outputs: L recently-written results (L=1 or 2 works great)
Retention Latch Management Strategies
FIFO
8 entry RL: 42% hit rate
16 entry RL: 55% hit rate
LRU
8 entry RL: 56% hit rate
16 entry RL: 62% hit rate
Random Replacement
Worse performance than FIFO
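The difference between the FIFO and LRU strategies can be sketched in a few lines of Python. This is an illustrative model only, not the authors' hardware: the class name `RetentionLatches` and its methods are hypothetical, and an `OrderedDict` stands in for the small CAM keyed by ROB slot id.

```python
from collections import OrderedDict


class RetentionLatches:
    """Toy model of a small set of latches keyed by ROB slot id."""

    def __init__(self, size=8, policy="FIFO"):
        if policy not in ("FIFO", "LRU"):
            raise ValueError("policy must be 'FIFO' or 'LRU'")
        self.size = size
        self.policy = policy
        self.entries = OrderedDict()  # ROB slot id -> value, next victim first

    def write(self, rob_slot, value):
        """Store a newly produced result, evicting the head entry if full."""
        self.entries.pop(rob_slot, None)      # drop any stale entry for this slot
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict the entry at the head
        self.entries[rob_slot] = value

    def read(self, rob_slot):
        """Return the cached value on a hit, or None on a miss."""
        if rob_slot not in self.entries:
            return None
        if self.policy == "LRU":
            self.entries.move_to_end(rob_slot)  # a hit refreshes the entry
        return self.entries[rob_slot]
```

With FIFO, a read never changes the eviction order, so the oldest written result is always the next victim. With LRU, a recently read result survives longer, which is consistent with the higher hit rates listed above.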
Hit Ratios to Retention Latches
[Figure: bar chart of hit ratios (%) per SPEC2000 benchmark, integer (bzip2, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, vpr, Int Avg.) and floating point (applu, apsi, art, equake, mesa, mgrid, swim, wupwise, FP Avg.), for the configurations FIFO 8 2, FIFO 16 2, LRU 8 2, and LRU 16 2.]
Average hit ratio: FIFO 8 2: 42%, FIFO 16 2: 55%, LRU 8 2: 56%, LRU 16 2: 62%
Accessing Retention Latch Entries
ROB index is used as a unique key in the Retention Latches to search the result values
Need to maintain unique keys even when we have:
Reuse of a ROB slot:
Not a problem for FIFO
For LRU, simply flush the RL entry at commit time
Branch mispredictions
Handling Branch Mispredictions
Selective RL Flushing: Retention latch entries that are in the mispredicted path are flushed
Uses branch tags
Complicated implementation
Complete RL Flushing: All retention latch entries are flushed
Very simple implementation
Performance drop is only 1.5% compared to selective flushing
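The two flushing strategies can be contrasted with a short Python sketch. The data layout is an assumption made for illustration: each retention latch entry is modeled as `rob_slot -> (branch_tag, value)`, and the set of squashed branch tags is assumed to be supplied by the misprediction recovery logic. The slides do not specify how the tags are assigned or compared.

```python
def selective_flush(latches, squashed_tags):
    """Keep only entries whose (hypothetical) branch tag was not squashed."""
    return {
        slot: (tag, value)
        for slot, (tag, value) in latches.items()
        if tag not in squashed_tags
    }


def complete_flush(latches):
    """Discard every entry, with no per-entry tag comparison."""
    return {}
```

The sketch shows why complete flushing is simpler: it needs neither a tag field per entry nor a comparison per entry, at the cost of also discarding values from the correct path.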
Misprediction Handling: Performance
[Figure: bar chart of IPC per SPEC2000 benchmark (bzip, gap, gcc, gzip, mcf, pars, perl, twol, vort, vpr, appl, apsi, art, equ, mesa, mgrid, swim, wupw) with Int., FP, and overall averages, for selective flushing and complete flushing.]
Average IPC drop: 1.5%
Scenario 1: Traditional Design
[Figure: animation over eight slides showing how the source operands of one instruction are read at dispatch. The steps are summarized below.]
Instruction: ADD R1, R2, R3
Simplified IDB entry #1: Instruction = ADD, ROB index = 5, Src1 reg. = 2, Src2 reg. = 3; the valid and value fields of Src1 and Src2 are initially unknown (?)
Rename Table (Arch. -> ROB#/Phys., location bit with ROB=0, ARF=1): register 2 -> 12 with bit 0 (value in the ROB); register 3 -> 3 with bit 1 (value in the ARF)
Src1, case A: ROB entry 12 has Phys. valid = 1 and Phys. value = 7, so Src1 valid = 1 and Src1 value = 7
Src1, case B: ROB entry 12 has Phys. valid = 0 and Phys. value = ?, so Src1 valid = 0 and Src1 value = ?
Src2: ARF register 3 has Arch. value = 43, so Src2 valid = 1 and Src2 value = 43
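The operand read of Scenario 1 can be written as a small Python function. This is a sketch under stated assumptions: the function name and the dictionary-based tables are hypothetical stand-ins for the rename table, ROB, and ARF, populated with the example values from the slides (register 2 renamed to ROB slot 12, register 3 in the ARF with value 43).

```python
def read_source_traditional(arch_reg, rename_table, rob, arf):
    """Return (valid, value) for one source operand in the traditional design."""
    index, in_arf = rename_table[arch_reg]
    if in_arf:
        return True, arf[index]      # committed value, always valid
    valid, value = rob[index]        # this read needs a ROB read port
    return (True, value) if valid else (False, None)


# Example state for ADD R1, R2, R3
RENAME_TABLE = {2: (12, False), 3: (3, True)}  # arch reg -> (index, in ARF?)
ARF = {3: 43}
ROB_READY = {12: (True, 7)}         # case A: result already written back
ROB_PENDING = {12: (False, None)}   # case B: result not yet produced
```

Calling the function for register 2 with `ROB_READY` corresponds to case A, with `ROB_PENDING` to case B, and calling it for register 3 corresponds to the ARF read.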
Scenario 2: Simplified ROB with RLs
[Figure: animation over nine slides showing how the source operands of one instruction are read at dispatch when the ROB has no source read ports. The steps are summarized below.]
Instruction: ADD R1, R2, R3
Simplified IDB entry #1: Instruction = ADD, ROB index = 5, Src1 reg. = 2, Src2 reg. = 3; the valid and value fields of Src1 and Src2 are initially unknown (?)
Rename Table (Arch. -> ROB#/Phys., location bit with ROB=0, ARF=1): register 2 -> 12 with bit 0 (value in the ROB); register 3 -> 3 with bit 1 (value in the ARF)
Src1, case A (hit): the Retention Latches hold ROB#/Phys. = 12 with Phys. value = 7, so Src1 valid = 1 and Src1 value = 7
Src1, case B (MISS): the Retention Latches have no entry for 12, so Src1 valid = 0 and Src1 value = ?; the Phys. valid and Phys. value fields of ROB entry 12 are not read (X: Don’t Care), and the SLF bit of ROB entry 12 is changed from 0 to 1
Src2: ARF register 3 has Arch. value = 43, so Src2 valid = 1 and Src2 value = 43
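The operand read of Scenario 2 can be sketched in Python as well. The function name and the dictionary-based tables are hypothetical stand-ins for the hardware structures, populated with the example values from the slides. The point of the sketch is that the ROB is never read: a retention latch miss only sets the SLF bit of the producer's ROB entry.

```python
def read_source_simplified(arch_reg, rename_table, retention_latches, slf_bits, arf):
    """Return (valid, value) for one source operand without reading the ROB."""
    index, in_arf = rename_table[arch_reg]
    if in_arf:
        return True, arf[index]                # committed value, always valid
    if index in retention_latches:             # associative lookup, key = ROB slot
        return True, retention_latches[index]
    slf_bits[index] = True                     # miss: request late forwarding
    return False, None


# Example state for ADD R1, R2, R3
RENAME_TABLE = {2: (12, False), 3: (3, True)}  # arch reg -> (index, in ARF?)
ARF = {3: 43}
```

On a miss the instruction waits with Src1 valid = 0; in the scheme described earlier in the talk, the value would then arrive over a forwarding bus when the producer commits or, if still in flight, when it completes.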
Experimental Setup: the AccuPower (DATE’02)
[Figure: tool flow diagram with the following blocks and labels: compiled SPEC benchmarks and datapath specs; a microarchitectural simulator (rooted in SimpleScalar); performance stats; transition counts and context information; VLSI layout data, SPICE deck, and SPICE; SPICE measures of energy per transition; an energy/power estimator; power/energy stats.]
Configuration of the Simulated System
Machine width: 4-way
Issue Queue: 32 entries
Reorder Buffer: 96 entries
Load/Store Queue: 32 entries
Simulated the execution of SPEC2000 benchmarks
Assumed Timings
Timing of the baseline model (stages D1, D2, D3):
D1: Rename Table lookup for ROB index
D2, D3: Source operand read from the ROB
Timing of the simplified ROB (stages D1, D2):
D1: Rename Table lookup for ROB index
D2: Associative lookup of operand from retention latches using ROB index as a key (smaller delay: few latches)
Experimental Results: Effect on Performance
[Figure: bar chart of performance drop (%) per SPEC2000 benchmark, integer and floating point, for 8 2-ported FIFO, 8 2-ported LRU, 16 2-ported FIFO, and 16 2-ported LRU retention latches; negative values are IPC gains.]
Avg. IPC drop: 8 2-ported FIFO: 0.1%, 8 2-ported LRU: -1.6%, 16 2-ported FIFO: -1.0%, 16 2-ported LRU: -2.3%
Experimental Results: Effect on Performance
[Figure: second bar chart of performance drop (%) per SPEC2000 benchmark for the same four retention latch configurations.]
Avg. IPC drop: 8 2-ported FIFO: 3.3%, 8 2-ported LRU: 1.7%, 16 2-ported FIFO: 2.3%, 16 2-ported LRU: 1.0%
Experimental Results: Effect on Power
[Figure: bar chart of power savings (%) per SPEC2000 benchmark, integer and floating point, for no ROB ports, 8 2-ported FIFO, 8 2-ported LRU, 16 2-ported FIFO, and 16 2-ported LRU.]
Avg. savings: No ROB ports: 30%, 8 2-ported FIFO: 23.4%, 8 2-ported LRU: 22.2%, 16 2-ported FIFO: 21%, 16 2-ported LRU: 20.2%
Summary of Results
Significantly reduced ROB complexity and power dissipation
45% area reduction
20% to 30% power reduction across SPEC 2000 benchmarks
Actual IPC improvements:
1.6% to 2.3% gain across SPEC benchmarks
IPC gains come from 1 cycle access to RL (vs. 2 cycles that would be needed for ROB access)
Related Work
Value-Aging Buffer (Hu & Martonosi, PACS 2000)
Forwarding Buffer and Clustered Register Cache (Borch et al., HPCA’02)
Multiple Register Banks (Cruz et al., ISCA’00 & Balasubramonian et al., MICRO’01)
See paper for discussions
Conclusions
Typical source operand location statistics can be successfully exploited to reduce ROB complexity
Significant reduction in ROB area and power – no ROB ports needed for reading source operands
IPC gains are possible because a small, low-ported set of Retention Latches supplies cached operand values in a single cycle