1
Reducing Datapath Energy Through the Isolation of Short-Lived Operands
Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
2
Outline
– Introduction
– Motivations
– Contributions
    Basic idea: isolate short-lived operands in a small dedicated register file and avoid their writes to the ROB and the ARF
    Resources impacted: ROB, ARF
    Power savings: 21% with a 32-entry additional RF
– Results
– Conclusions
– Future work
3
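A P6-like Superscalar Datapath
[Figure: datapath diagram with fetch stages (F1, F2), decode/dispatch stages (D1, D2), instruction dispatch into the IQ, instruction issue to the function units FU1, FU2, ..., FUm (EX), the LSQ and D-cache, the ROB, the Architectural Register File (ARF), and the result/status forwarding buses.]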
4
Out-of-Order Execution and In-Order Retirement
[Figure: pipeline with stages F, R, D (in-order front end), Inst. Queue and Ex (out-of-order core), and the ROB and ARF (in-order retirement).]
5
Energy-dissipating Events
[Figure: the same pipeline (F, R, D, Inst. Queue, Ex, ROB, ARF) with the energy-dissipating events marked: Write, Write, Read.]
6
The Idea : Isolating Short-Lived Values
[Figure: the same pipeline with an SRF added next to the ROB; events marked: Write, Write, Read.]
Write short-lived values into a small dedicated RF (SRF)
7
Register Renaming
– Used to avoid false data dependencies.
– A new physical register is allocated for EVERY new result
– P6 style: ROB slots serve as physical registers
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, P2, 100
    SUB P32, P31, P3
    ADD P33, P32, P4
8
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           1            1
2           2            1
3           3            1
4           4            1
5           5            1
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
9
Register Renaming: the Implementation
– Register Alias Table (RAT) maintains the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           5            1
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
10
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
11
Register Renaming: the Implementation
– Rename Table (RT) is used to maintain the mappings between logical and physical registers
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
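The renaming steps shown on these slides can be replayed with a small Python sketch. It is only an illustration under stated assumptions: the function name `rename`, the tuple encoding of instructions, and the choice of 31 as the first free ROB slot are hypothetical, picked to match the example on the slides.

```python
# Hypothetical model of P6-style renaming: every new result takes the next ROB slot.
ROB, ARF = 0, 1  # location codes, as in the slide's table (0-ROB, 1-ARF)

def rename(program, first_slot=31, num_arch_regs=6):
    # Initially every architectural register maps to itself in the ARF.
    rat = {r: (r, ARF) for r in range(num_arch_regs)}
    next_slot = first_slot
    renamed = []
    for op, dest, src1, src2 in program:
        srcs = []
        for s in (src1, src2):
            if isinstance(s, str):              # register operand such as "R2"
                phys, loc = rat[int(s[1:])]
                srcs.append(f"P{phys}" if loc == ROB else f"R{phys}")
            else:                               # immediate operand
                srcs.append(str(s))
        # Sources are looked up first, then the destination gets a new ROB slot.
        rat[int(dest[1:])] = (next_slot, ROB)
        renamed.append(f"{op} P{next_slot}, {srcs[0]}, {srcs[1]}")
        next_slot += 1
    return renamed, rat

program = [("LOAD", "R1", "R2", 100),
           ("SUB", "R5", "R1", "R3"),
           ("ADD", "R1", "R5", "R4")]
renamed, rat = rename(program)
for line in renamed:
    print(line)
```

After the three instructions, `rat` maps R1 to ROB slot 33 and R5 to ROB slot 32, which is the final table state on the slide.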
12
Short-Lived Values
– Our definition: a value is short-lived if its destination register has already been renamed by the time the result is generated.
– Identified one cycle before the result writeback
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4    <- RENAMER
13
The Good News : 80%+ of the Values are Short-Lived
[Figure: chart of the percentage of short-lived values (y-axis 0 to 100); 96-entry ROB, 4-way processor.]
As rename-to-writeback latency increases in future datapaths, the percentage of short-lived values will also go up
14
The Idea : Isolating Short-Lived Values
[Figure: the pipeline (F, R, D, Inst. Queue, Ex, ROB, ARF) with an SRF added next to the ROB; events marked: Write, Write, Read.]
Write short-lived values into a small dedicated RF (SRF)
Example code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
15
Why do we need the SRF ?
Need to hang on to the short-lived values to:
    Recover from branch mispredictions
    Reconstruct precise state
Example code:
    LOAD R1, R2, 100
    BEQ R5, R1, #100
    ADD R1, R5, R4
16
Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
Renamed bit-vector: Renamed[31] = 1
17
Identifying Short-Lived Values
– Maintain the bit-vector Renamed
– Set by the Renamer at the time of renaming
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
Renamed bit-vector: Renamed[31] = 1
18
Identifying Short-Lived Values
– Renamed bit is checked one cycle before writeback
– Value produced by LOAD is short-lived because Renamed[31]=1
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
Renamed bit-vector: Renamed[31] = 1
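A minimal Python sketch of the Renamed bit-vector, assuming one bit per ROB slot and the 96-entry ROB quoted earlier in the talk. The class and method names (`RenameTracker`, `rename_dest`, `is_short_lived`) are hypothetical; the talk does not specify an implementation.

```python
ROB_SIZE = 96  # assumption: the 96-entry ROB configuration quoted in the talk

class RenameTracker:
    def __init__(self):
        self.rat = {}                     # arch reg -> ROB slot of its latest in-flight producer
        self.renamed = [0] * ROB_SIZE     # Renamed bit-vector, one bit per ROB slot

    def rename_dest(self, arch_reg, new_slot):
        """Called by the renamer when an instruction's destination is renamed."""
        old_slot = self.rat.get(arch_reg)
        if old_slot is not None:
            self.renamed[old_slot] = 1    # the older value has been redefined
        self.renamed[new_slot] = 0        # a freshly allocated slot starts as not renamed
        self.rat[arch_reg] = new_slot

    def is_short_lived(self, slot):
        """Checked one cycle before the result of `slot` is written back."""
        return self.renamed[slot] == 1

tracker = RenameTracker()
tracker.rename_dest(1, 31)   # LOAD R1 -> P31
tracker.rename_dest(5, 32)   # SUB  R5 -> P32
tracker.rename_dest(1, 33)   # ADD  R1 -> P33, so Renamed[31] becomes 1
```

If the LOAD reaches writeback after the ADD has been renamed, `tracker.is_short_lived(31)` is true and the value qualifies for the SRF.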
19
Managing the SRF: the Issues
– When do we write short-lived values into the SRF?
– When and how are the short-lived values removed from the SRF?
– What happens on a branch misprediction?
– How do we reconstruct a precise state?
20
Format of an SRF entry
Fields: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. Arch. Reg.
Branch Tag 1: branch identifier for the renamer; used to remove this entry if the renamer gets squashed
Branch Tag 2: branch identifier for this instruction; used to remove this entry if this instruction gets squashed
Branch identifier of an instruction = id/tag of the immediately preceding conditional branch
21
Writing to the SRF: the Conditions
– An instruction writes a short-lived result value into the SRF if:
    A free entry exists in the SRF
    No SRF entry keyed with the same ROB slot is already established
– Bit-vector Allocated_in_SRF is maintained
– One bit for each ROB entry
– Set at the time of writeback if value is written into the SRF
– Reset at the time of removing the value from the SRF
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest. reg
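The two write conditions and the Allocated_in_SRF bookkeeping can be sketched as follows. This is an illustrative model, not the authors' hardware: the class names, the linear search for a free entry, and the default sizes (a 32-entry SRF and a 96-entry ROB, both taken from figures quoted in the talk) are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SRFEntry:
    valid: bool = False
    rob_idx: int = 0
    data: int = 0
    bt1: int = 0    # Branch Tag 1: branch preceding the renamer
    bt2: int = 0    # Branch Tag 2: branch preceding the producing instruction
    dest: int = 0   # destination architectural register

class SRF:
    def __init__(self, size=32, rob_size=96):
        self.entries = [SRFEntry() for _ in range(size)]
        self.allocated_in_srf = [0] * rob_size   # one bit per ROB entry

    def try_write(self, rob_idx, data, bt1, bt2, dest):
        """Return True if the short-lived value was placed in the SRF."""
        if self.allocated_in_srf[rob_idx]:       # an entry keyed by this ROB slot exists
            return False
        for e in self.entries:
            if not e.valid:                      # a free entry exists
                e.valid, e.rob_idx, e.data = True, rob_idx, data
                e.bt1, e.bt2, e.dest = bt1, bt2, dest
                self.allocated_in_srf[rob_idx] = 1   # set at writeback
                return True
        return False                             # SRF is full

    def remove(self, rob_idx):
        """Invalidate the entry keyed by rob_idx and reset its bit."""
        for e in self.entries:
            if e.valid and e.rob_idx == rob_idx:
                e.valid = False
                self.allocated_in_srf[rob_idx] = 0   # reset at removal
                return e.data
        return None
```

When `try_write` returns False, the value would presumably be written to the ROB as in the baseline datapath.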
22
Scenarios for Removing the Values from the SRF
Scenario 1 : Normal Commitment of Renamer
Scenario 2 : Renamer gets squashed
Scenario 3 : The instruction generating the short-lived value itself gets squashed
23
Removing the Values from the SRF : Scenario 1
– Values are removed by the Renamer
– 2-step process:
    Mark the instruction whose value is to be removed from the SRF (done at the time of renaming)
    Remove the marked value from the SRF IF NEED BE (done at the time of commitment)
– When ADD commits, it removes the value written by LOAD
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4    <- Renamer
24
Marking the Values for Removal
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
[Figure: ROB slots 31, 32 and 33; slot 31 holds LOAD and slot 32 holds SUB.]
25
Marking the Values for Removal
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
[Figure: ROB slots 31, 32 and 33 hold LOAD, SUB and ADD; the FS (Flush SRF) field of the ROB entry for ADD contains 31.]
26
Removing the Values (B is the renamer for A)
– FS field of B must match the ROB index field of an SRF entry
– This SRF entry must belong to A
Code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
[Figure: ROB slots 31, 32 and 33 hold LOAD (A), SUB and ADD (B); the FS field of B contains 31. The SRF holds a valid entry for the load: Valid = 1, ROB idx = 31, Dest = 1.]
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
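The two-step removal (mark at rename, remove at commit) can be sketched in a few lines of Python. This is a deliberately simplified model: the SRF is a dictionary keyed by ROB index, the function names are hypothetical, and the sketch ignores the ROB-slot reuse hazard that the talk handles with an extra bit-vector.

```python
def rename_dest(rat, fs, arch_dest, new_slot):
    """Step 1 (at rename): the renamer's FS field records the ROB slot
    of the previous in-flight producer of the same architectural register."""
    fs[new_slot] = rat.get(arch_dest)   # None if the current value lives in the ARF
    rat[arch_dest] = new_slot

def commit(slot, fs, srf):
    """Step 2 (at commit): remove the SRF entry whose ROB index matches FS."""
    target = fs.pop(slot, None)
    if target is not None and target in srf:
        del srf[target]
        return target
    return None

rat, fs = {}, {}
for dest, slot in [(1, 31), (5, 32), (1, 33)]:   # LOAD R1, SUB R5, ADD R1
    rename_dest(rat, fs, dest, slot)

srf = {31: "value produced by LOAD"}             # LOAD's result went to the SRF
removed = [commit(s, fs, srf) for s in (31, 32, 33)]
```

Only the commit of slot 33 (the ADD) removes anything, matching the slide's statement that the ADD removes the value written by the LOAD.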
27
Another Example (LOAD could not write to SRF)
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           32           0
Original code:
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
SRF was full!
Renamed bit-vector: Renamed[31] = 1
28
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           2            1
3           3            1
4           4            1
5           5            1
Original code (two instructions are marked Committed):
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
    …
    MUL R2, R3, R4
    DIV R2, R2, R5
Renamed code (two instructions are marked Committed):
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
Renamed bit-vector: Renamed[31] = 0
29
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           31           0
3           3            1
4           4            1
5           5            1
Original code (two instructions are marked Committed):
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
    …
    MUL R2, R3, R4
    DIV R2, R2, R5
Renamed code (two instructions are marked Committed):
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
    …
    MUL P31, R3, R4
Renamed bit-vector: Renamed[31] = 0
30
Another Example
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           33           0
2           32           0
3           3            1
4           4            1
5           5            1
Original code (two instructions are marked Committed):
    LOAD R1, R2, 100
    SUB R5, R1, R3
    ADD R1, R5, R4
    …
    MUL R2, R3, R4
    DIV R2, R2, R5
Renamed code (two instructions are marked Committed):
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
    …
    MUL P31, R3, R4
    DIV P32, P31, R5
Renamed bit-vector: Renamed[31] = 1
31
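Another Example (A’s ROB slot is assigned for C)
[Figure: ROB slots 31, 32 and 33 hold LOAD (A), SUB and ADD (B); the FS field of B contains 31. The SRF entry shown is not valid (Valid = 0).]
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4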
32
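Another Example (A’s ROB slot is assigned for C)
[Figure: ROB slots 31, 32 and 33 now hold MUL (C), DIV (D) and ADD (B); the FS field of B still contains 31. The SRF holds a valid entry for the mul: Valid = 1, ROB idx = 31, Dest = 2.]
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
Renamed code:
    LOAD P31, R2, 100
    SUB P32, P31, R3
    ADD P33, P32, R4
    …
    MUL P31, R3, R4
    DIV P32, P31, R5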
33
Ensuring that the right values are removed
– Bit-vector Uncommitted_Write is maintained
    One bit for each ROB entry
    Set at the time of establishing SRF entry
    Reset at the time of commitment
– Instruction B removes the value written by A (allocated to ROB slot i) if:
    Allocated_in_SRF[i]=1, and
    Uncommitted_Write[i]=0
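A short Python sketch of why the second bit matters, replaying the two cases from the preceding example. The helper names are hypothetical and the model tracks only the two bit-vectors, assuming a 96-entry ROB; it is meant as an illustration of the removal condition, not of the full mechanism.

```python
ROB_SIZE = 96  # assumption: the 96-entry ROB configuration quoted in the talk
allocated_in_srf = [0] * ROB_SIZE
uncommitted_write = [0] * ROB_SIZE

def write_to_srf(i):
    # Both bits are set when the instruction in ROB slot i establishes an SRF entry.
    allocated_in_srf[i] = 1
    uncommitted_write[i] = 1

def commit_producer(i):
    # Uncommitted_Write is reset when the instruction in ROB slot i commits.
    uncommitted_write[i] = 0

def renamer_may_remove(i):
    # Condition from the slide: Allocated_in_SRF[i]=1 and Uncommitted_Write[i]=0.
    return allocated_in_srf[i] == 1 and uncommitted_write[i] == 0

# Case 1: A (slot 31) wrote its value to the SRF and commits before its renamer B.
write_to_srf(31)
commit_producer(31)
case1 = renamer_may_remove(31)      # the entry really belongs to A

# Case 2: A could not write to the SRF; its slot is reused by C, which does.
allocated_in_srf[31] = 0            # start again with no SRF entry for slot 31
commit_producer(31)                 # A commits without an SRF entry
write_to_srf(31)                    # C now owns slot 31 and an SRF entry
case2 = renamer_may_remove(31)      # C has not committed, so B must not remove it
```

In the second case the entry keyed by slot 31 belongs to C, and the Uncommitted_Write bit is what stops B from removing it.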
34
Avoiding Unnecessary Commitments
– When an instruction allocated to ROB slot i commits and Allocated_in_SRF[i]=1, the data is not copied to the ARF.
[Figure: the pipeline (F, R, D, Inst. Queue, Ex, ROB, ARF) with the SRF; events marked: Write, Read, Write.]
35
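Handling Branch Mispredictions : Scenario 2
– Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
[Figure: ROB slots 31 to 34 hold LOAD, BR, SUB and ADD; the FS field of ADD contains 31. The SRF holds a valid entry for the load: Valid = 1, ROB idx = 31, Dest = 1.]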
36
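Handling Branch Mispredictions
– Problem: Renamer can get squashed -> stale entries remain in the SRF if nothing is done
– Example:
[Figure: after the squash, ROB slots 31 and 32 hold LOAD and BR while slots 33 and 34 are empty. The SRF still holds the entry for the load: Valid = 1, ROB idx = 31, Dest = 1.]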
37
Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the renamer (BT1).
    When the renamer is squashed, the value is removed from the SRF and is written to either the ROB or the ARF (based on the value of the Uncommitted_Write bit)
    Multiplex the ports to reduce complexity
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
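A hedged Python sketch of the BT1 recovery step. Two points are assumptions rather than statements from the slide: the value is sent to the ROB when Uncommitted_Write is still set (the producer has not committed) and to the ARF otherwise, and a misprediction is matched by simple tag equality. The function name and the dictionary layout of an SRF entry are hypothetical.

```python
def squash_renamers(srf, mispredicted_tag, uncommitted_write, rob, arf):
    """Handle SRF entries whose renamer follows the mispredicted branch.

    srf: list of dicts with keys rob_idx, data, bt1, dest.
    Returns the entries that stay in the SRF.
    """
    kept = []
    for e in srf:
        if e["bt1"] != mispredicted_tag:
            kept.append(e)                    # renamer is not affected
        elif uncommitted_write[e["rob_idx"]]:
            rob[e["rob_idx"]] = e["data"]     # assumption: producer in flight, value goes to its ROB slot
        else:
            arf[e["dest"]] = e["data"]        # assumption: producer committed, value goes to the ARF
    return kept

srf = [
    {"rob_idx": 31, "data": 5, "bt1": 7, "dest": 1},   # producer still in the ROB
    {"rob_idx": 40, "data": 9, "bt1": 7, "dest": 3},   # producer already committed
    {"rob_idx": 50, "data": 2, "bt1": 8, "dest": 4},   # renamer under another branch
]
uncommitted_write = {31: 1, 40: 0, 50: 1}
rob, arf = {}, {}
kept = squash_renamers(srf, 7, uncommitted_write, rob, arf)
```

With the renamer gone, each affected value is no longer short-lived, which is why it has to move back to a conventional location.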
38
Obtaining Branch Tag BT1
– Maintain the array Branch_Tags
– One entry for each ROB slot
Arch. Reg   Phys. Reg.   Location (0-ROB, 1-ARF)
0           0            1
1           31           0
2           2            1
3           3            1
4           4            1
5           33           0
Original code:
    LOAD R1, R2, 100
    BEQ R6, R7, 200
    SUB R5, R1, R3
    ADD R1, R5, R4
Renamed code:
    LOAD P31, P2, 100
    BEQ P6, P7, 200
    SUB P33, P31, P3
    ADD P34, P33, P4
[Figure: the Branch_Tags array, shown with the labels 31 and 7.]
39
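Handling Branch Mispredictions : Scenario 3
– Problem: The instruction whose value was inserted into the SRF can itself be squashed
– Example:
[Figure: ROB slots 30 to 33 hold BR, LOAD, SUB and ADD; the FS field of ADD contains 31. The SRF holds a valid entry for the load: Valid = 1, ROB idx = 31, Dest = 1.]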
40
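Handling Branch Mispredictions
– Problem: The instruction whose value was inserted into the SRF can itself be squashed
– Example:
[Figure: after the squash, only BR remains in ROB slot 30; slots 31, 32 and 33 are empty. The SRF still holds the entry for the load: Valid = 1, ROB idx = 31, Dest = 1.]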
41
Handling Branch Mispredictions
– Solution: Tag each entry in the SRF with the id of the branch preceding the instruction itself (BT2).
    Simply remove the value from the SRF if such a branch is mispredicted
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
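The BT2 case is simpler, since a squashed producer's value is just discarded. A minimal sketch, assuming the same hypothetical dictionary layout for SRF entries and tag matching by equality; resetting Allocated_in_SRF on removal follows the rule stated earlier for that bit-vector.

```python
def squash_producers(srf, allocated_in_srf, mispredicted_tag):
    """Drop SRF entries whose producing instruction follows the mispredicted branch."""
    kept = []
    for e in srf:
        if e["bt2"] == mispredicted_tag:
            allocated_in_srf[e["rob_idx"]] = 0   # bit is reset when the value is removed
        else:
            kept.append(e)
    return kept

srf = [
    {"rob_idx": 31, "data": 5, "bt2": 30},   # LOAD fetched after the branch with tag 30
    {"rob_idx": 12, "data": 8, "bt2": 4},    # older instruction, unaffected
]
allocated_in_srf = {31: 1, 12: 1}
kept = squash_producers(srf, allocated_in_srf, mispredicted_tag=30)
```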
42
Supporting Precise Interrupts
– Allow all instructions preceding the faulting instruction to commit
– Squash all instructions following the faulting instruction
– Copy the values of ALL valid SRF entries to the ARF.
SRF format: Valid | ROB idx | Data | Branch Tag 1 | Branch Tag 2 | Dest
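The last step above, copying every valid SRF entry into the ARF, might look like the following sketch. The function name and entry layout are hypothetical, and two things are assumptions not stated on the slide: entries are invalidated after the copy, and at most one valid entry per destination register remains at this point.

```python
def drain_srf_to_arf(srf, arf):
    """Copy the values of all valid SRF entries to the ARF to rebuild precise state."""
    for e in srf:
        if e["valid"]:
            arf[e["dest"]] = e["data"]   # the Dest field names the architectural register
            e["valid"] = False           # assumption: the entry is freed after the copy
    return arf

srf = [
    {"valid": True, "dest": 1, "data": 42},
    {"valid": False, "dest": 2, "data": 99},   # stale entry, must be ignored
    {"valid": True, "dest": 5, "data": 7},
]
arf = {r: 0 for r in range(6)}
drain_srf_to_arf(srf, arf)
```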
43
Experimental Setup
[Figure: tool-flow diagram. Boxes and labels: compiled SPEC benchmarks; datapath specs; microarchitectural simulator; performance stats; transition counts and context information; inter-thread buffers; data analyzer / intra-stream analysis; energy/power estimator; power/energy stats; VLSI layout data; SPICE decks; SPICE; SPICE measures of energy per transition. Note in figure: two separate threads.]
44
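Results: Percentage of Values Written into the SRF
[Figure: per-benchmark chart, y-axis in % (0 to 100). Benchmarks: bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw. Series: 8 entries, 16 entries, 32 entries, 48 entries, % of short-lived results.]
Values shown, in series order: 40.5% (8 entries), 60.1% (16 entries), 77.5% (32 entries), 82.3% (48 entries), 86.7% (% of short-lived results)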
45
Results: Average Time Spent by a Value in the SRF
[Figure: per-benchmark chart, y-axis in cycles (0 to 60). Benchmarks: bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw. Series: 8 entries, 16 entries, 32 entries, 48 entries.]
Average: 12-15 cycles
46
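Results: Percentage of Values not copied into the ARF
[Figure: per-benchmark chart, y-axis in % (0 to 100). Benchmarks: bzip2, gap, gcc, gzip, mcf, pars, perl, twolf, vort, vpr, applu, apsi, art, eq, mesa, mgrid, swim, wupw. Series: 8 entries, 16 entries, 32 entries, 48 entries, % of short-lived results.]
Values shown, in series order: 42.2% (8 entries), 61.9% (16 entries), 79.3% (32 entries), 84.1% (48 entries), 86.7% (% of short-lived results)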
47
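Results: Net Energy Reduction
[Figure: energy in pJ (y-axis 0 to 800) for Baseline and for 8, 16, 32 and 48 SRF entries, broken down into ROB + additional logic, ARF, and SRF.]
Net energy reduction shown: 9% (8 entries), 16% (16 entries), 21% (32 entries), 23% (48 entries)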
48
Related Work
– Register Traffic Analysis (Franklin and Sohi, MICRO’92).
    Studied the useful lifetime of register instances
    Delaying the writes until 30 more instructions are dispatched can eliminate 80% of the writes (if perfect knowledge of the last use is available)
    Buffering the 30 most recently generated results avoids 80% of writebacks
– Lozano and Gao (MICRO’95)
    90% of all result values are short-lived (consumed while in the ROB)
    A mechanism is proposed to avoid commitment of these values and also avoid register allocation for them
    ROB slots are exposed to the compiler in the form of symbolic registers
– Lazy Retirement (Savransky, Ronen, Gonzalez, WCED’02)
    Hardware-based scheme to avoid unnecessary commitments
    Copying from the ROB to the ARF is delayed until the ROB slot is reused. In many cases, the register is invalidated by the newer instruction
    Additional rename table is needed. About 75% of commits are avoided.
49
Conclusions
– Significant power savings & negligible impact on performance
– Sources of power savings:
    The majority of generated results are written into a small, lightly-ported SRF
    Unnecessary commitments are avoided
    Additional logic/storage needed to do this is simple
– For a 32-entry SRF, more than 77% of writebacks and more than 79% of commitments can be avoided
– This results in energy savings of 21% on the ROB and the ARF
50
THANK YOU !
This work was supported in part by DARPA through the PAC-C program and NSF
LOW POWER RESEARCH GROUP
Department of Computer Science
State University of New York
Binghamton, NY 13902-6000
http://www.cs.binghamton.edu/~lowpower
Parallel Architectures and Compilation Techniques (PACT’03)
October 1st 2003
51
Complexity of the Solution
– SRF
– Three bit vectors (same size as the ROB)
    Renamed
    Allocated_in_SRF
    Uncommitted_Write
– 4-bit array Branch_Tags (same size as the ROB)
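As a rough worked example of the bookkeeping storage listed above (excluding the SRF itself), the sketch below counts bits for the 96-entry ROB quoted earlier in the talk. It assumes that "4-bit array" means four bits per ROB entry; the function name is hypothetical.

```python
def bookkeeping_bits(rob_entries, num_bit_vectors=3, tag_bits=4):
    """Bits of extra state: three 1-bit vectors plus the Branch_Tags array."""
    bit_vectors = num_bit_vectors * rob_entries   # Renamed, Allocated_in_SRF, Uncommitted_Write
    branch_tags = tag_bits * rob_entries          # assumption: 4 bits per ROB entry
    return bit_vectors + branch_tags

total = bookkeeping_bits(96)   # 3*96 + 4*96 = 672 bits, i.e. 84 bytes
print(total)
```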