Dynamic Removal of Redundant Computations
Carlos Molina, Antonio González and Jordi Tubella
Universitat Politècnica de Catalunya - Barcelona
{cmolina,antonio,jordit}@ac.upc.es
ICS´99, Rhodes (Greece) - June 20-25, 1999

Motivation

Quasi-invariant:
  for (i=0; i<N; i++)
    A[i] = B[i]+C[i];

Quasi-common subexpression:
  . . . . .
  R = S / T ;
  . . . . .
  X = S / U ;
  . . . . .

Outline
Instruction Reuse
Related Work
Redundant Computation Buffer
Performance Results
Conclusions

Instruction Reuse
[Figure: pipeline with stages Fetch, Decode & Rename, OOO Execution and Commit, plus a Reuse Mechanism block labelled "index".]

Related Work
Instruction Reuse
– Value Cache for the Tree Machine (Harbison 82)
– Result Cache (Richardson 92, Oberman et al. 95)
– Reuse Buffer (Sodani and Sohi 97)
– Physical Register Reuse (Jourdan et al. 98)
Trace Reuse
– Basic blocks (Huang and Lilja 99)
– General traces (González et al. 99)

Related Work
Result Cache (Richardson 92, Oberman & Flynn 95)
– Special purpose (long latency operations)
– Indexed by operand values
– No reuse chaining
– Can reuse dynamic instances of other static instructions
Reuse Buffer (Sodani & Sohi 97)
– General purpose
– Indexed by PC
– Reuse chaining
– Only reuses dynamic instances of the same static instruction
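The contrast between the two indexing policies can be sketched in C. This is a minimal functional model under stated assumptions, not the hardware design of either cited scheme: the table size `ENTRIES`, the hash in `index_by_operands`, the direct-mapped organization and all function names are hypothetical, and the caller supplies the value that execution would produce. Reuse chaining is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 64u            /* hypothetical table size */
enum { OP_DIV = 1 };           /* hypothetical opcode encoding */

/* One remembered computation: operand values and the result they produced. */
typedef struct {
    bool    valid;
    uint8_t opcode;
    int64_t opnd1, opnd2;
    int64_t result;
} Entry;

typedef struct {
    Entry e[ENTRIES];
} Table;

/* Result-cache style: the index depends only on opcode and operand values. */
static unsigned index_by_operands(uint8_t opcode, int64_t a, int64_t b) {
    uint64_t h = (uint64_t)a * 31u + (uint64_t)b * 17u + opcode;
    return (unsigned)(h % ENTRIES);
}

/* Reuse-buffer style: the index depends only on the instruction address. */
static unsigned index_by_pc(uint64_t pc) {
    return (unsigned)((pc >> 2) % ENTRIES);
}

/* Reuse test on one entry. On a hit, returns true and the stored result.
 * On a miss, records the computed result in the entry and returns false. */
static bool access_entry(Entry *en, uint8_t opcode, int64_t a, int64_t b,
                         int64_t computed, int64_t *out) {
    assert(out != NULL);
    if (en->valid && en->opcode == opcode && en->opnd1 == a && en->opnd2 == b) {
        *out = en->result;
        return true;
    }
    en->valid = true;
    en->opcode = opcode;
    en->opnd1 = a;
    en->opnd2 = b;
    en->result = computed;
    *out = computed;
    return false;
}

/* Any static instruction with the same opcode and operand values can hit. */
bool result_cache_access(Table *t, uint8_t opcode, int64_t a, int64_t b,
                         int64_t computed, int64_t *out) {
    return access_entry(&t->e[index_by_operands(opcode, a, b)],
                        opcode, a, b, computed, out);
}

/* Only an earlier instance of the instruction at the same PC can hit. */
bool reuse_buffer_access(Table *t, uint64_t pc, uint8_t opcode,
                         int64_t a, int64_t b,
                         int64_t computed, int64_t *out) {
    return access_entry(&t->e[index_by_pc(pc)], opcode, a, b, computed, out);
}
```

In this sketch, a division 8 / 2 recorded by one static instruction is found again by a different static instruction under operand indexing, whereas under PC indexing the second instruction misses because it maps to its own entry.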

Redundant Computation Buffer
[Figure: RCB organization with three tables, a reuse test, and two outputs.]
Vtable entry: Atable pointer
Atable entry: opcode, result/address, opnd1, opnd2, pointer
Mtable entry: address tag, result
Outputs of the reuse test: reused value, reused memory value

RCB (Working Example)
  while (cond) { r = s / t ; ...... x = s / u ; }
[Figure sequence over four slides showing Vtable and Atable contents for the loop above.]
Step 1. I1: 8 / 2 = 4. Vtable entry 10; Atable entry "div 8 2 4 nil".
Steps 2 and 3. I2: 8 / 2 = 4. Vtable entries 10 and 20; Atable entry "div 8 2 4 nil".
Step 4. I1: 9 / 3 = 3, then I2: 9 / 3 = 3. Vtable entries 10 and 20; Atable entries "div 8 2 4 nil" and "div 9 3 3 nil".
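The trace in the working example can be replayed with a small C model. It is a simplified sketch, not the authors' implementation. It assumes that 10 and 20 are the PCs of I1 and I2, that the Vtable is indexed by PC, and that each Vtable entry points into an Atable addressed by a hash of the operand values. The table sizes, the hash, the replacement policy and every identifier are made up for illustration, and the Atable pointer field and the Mtable are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define VTABLE_SIZE 32   /* hypothetical sizes */
#define ATABLE_SIZE 16
#define NIL (-1)

typedef enum { OP_DIV } Opcode;

/* Atable entry: an operation, its operand values and its result. */
typedef struct {
    bool   valid;
    Opcode opcode;
    long   opnd1, opnd2;
    long   result;
} AEntry;

typedef struct {
    int    vtable[VTABLE_SIZE];  /* per-PC index into atable, or NIL */
    AEntry atable[ATABLE_SIZE];
} Rcb;

void rcb_init(Rcb *r) {
    for (int i = 0; i < VTABLE_SIZE; i++) r->vtable[i] = NIL;
    for (int i = 0; i < ATABLE_SIZE; i++) r->atable[i].valid = false;
}

/* Reuse test: same operation on the same operand values. */
static bool matches(const AEntry *e, Opcode op, long a, long b) {
    return e->valid && e->opcode == op && e->opnd1 == a && e->opnd2 == b;
}

/* Assumed hash of the operand values (any reasonable hash would do). */
static int hash_operands(Opcode op, long a, long b) {
    unsigned long h = (unsigned long)a * 7ul + (unsigned long)b * 13ul
                    + (unsigned long)op;
    return (int)(h % ATABLE_SIZE);
}

/* Executes the division a / b for the instruction at pc (pc >= 0).
 * Returns true if the result was reused, false if it was computed. */
bool rcb_div(Rcb *r, int pc, long a, long b, long *result) {
    assert(pc >= 0 && b != 0 && result != 0);
    int v = pc % VTABLE_SIZE;

    /* 1st Atable lookup: the entry this static instruction used last. */
    int p = r->vtable[v];
    if (p != NIL && matches(&r->atable[p], OP_DIV, a, b)) {
        *result = r->atable[p].result;
        return true;
    }

    /* 2nd Atable lookup: an entry left by any static instruction. */
    int h = hash_operands(OP_DIV, a, b);
    r->vtable[v] = h;
    if (matches(&r->atable[h], OP_DIV, a, b)) {
        *result = r->atable[h].result;
        return true;
    }

    /* Miss: compute the result and record it for later instances. */
    r->atable[h] = (AEntry){ true, OP_DIV, a, b, a / b };
    *result = a / b;
    return false;
}
```

Replaying the slide's trace with `rcb_init` followed by `rcb_div(&r, 10, 8, 2, &v)`, `rcb_div(&r, 20, 8, 2, &v)`, `rcb_div(&r, 10, 9, 3, &v)` and `rcb_div(&r, 20, 9, 3, &v)`, the first and third calls compute the quotient and the second and fourth reuse it. That is the quasi-common subexpression case: in this model I2 benefits from work already done by I1.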

Enhancements to Other Schemes
Enhanced Result Cache: Atable (opcode, result/address, opnd1, opnd2) indexed by operands, plus Mtable (address tag, result)
Enhanced Reuse Buffer: Atable (opcode, result/address, opnd1, opnd2) indexed by PC, plus Mtable (address tag, result)

Timing Considerations
Pipeline stages: fetch, decode & rename, opnd read & dispatch, issue, execute, write back, commit
Latency of the Reuse Buffer: Atable lookup, reuse test
Latency of the RCB: 1st Atable lookup, reuse test, 2nd Atable lookup
Latency of the Result Cache: Atable lookup, reuse test

Experimental Framework
Simulator: Alpha version of the SimpleScalar Toolset
Benchmarks: SPEC95
Maximum optimization level: DEC C & F77 compilers with -non_shared -O5
Statistics: collected for 125 million instructions, skipping initializations

Basic Reuse Statistics
We evaluate different schemes
- Enhanced Result Cache (ERC)
- Enhanced Reuse Buffer (ERB)
- Redundant Computation Buffer (RCB)
We find the best configuration for each scheme
- Number of entries
- History depth
Best configurations will be evaluated
- Percentage of reuse
- Speedup

Quasi-Common Subexpressions
[Figure: percentage of reuse (y-axis, 0 to 45) for ERB and RCB at 32 KB.]

Study of Reuse (ERB)
[Figure: percentage of reuse (y-axis, 10 to 55) versus size in Kbytes (x-axis, 8 to 4096), one curve per table size: 128, 256, 512, 1K, 2K, 4K, 8K and 16K entries.]

Study of Reuse (RCB)
[Figure: percentage of reuse (y-axis, 15 to 60) versus size in Kbytes (x-axis, 8 to 4096), one curve per table size: 128, 256, 512, 1K, 2K, 4K, 8K and 16K entries.]

Study of Reuse (Comparative)
[Figure: percentage of reuse (y-axis, 10 to 70) versus size in Kbytes (x-axis, 8 to 4096) for ERB, RCB and ERC.]

Performance Evaluation
Two different capacities are evaluated
- 32 KB
- 200 KB
The best configuration has been chosen for each reuse scheme
We present a performance evaluation for a superscalar processor
- Speedup
- Percentage of reuse

Base Microarchitecture
Instruction fetch: 4 instructions per cycle
Branch predictor: 2048-entry bimodal predictor
Data cache: 16 KB, 2-way set associative, 32-byte block, 6-cycle miss latency
Instruction cache: 16 KB, direct mapped, 32-byte cache line, 6-cycle miss latency
Instruction issue/commit: out-of-order issue, 4 instructions committed per cycle, 32-entry reorder buffer, loads execute if preceding stores are known, store-load forwarding
Architected registers: 32 integer and 32 FP
Functional units: 4 integer ALUs, 2 load/store units, 4 FP adders, 1 integer mult/div, 1 FP mult/div
FU latency/repeat time: integer ALU 1/1, load/store 1/1, integer mult 3/, integer div 20/19, FP adder 2/1, FP mult 4/1, FP div 12/12

Speedup (32 KB)
[Figure: speedup (y-axis, 1.00 to 1.20) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean.]
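The H_Mean bar presumably reports the harmonic mean of the per-benchmark speedups; this is an assumption based on the label, since the slide does not define it. For n benchmarks with speedups s_i, the harmonic mean is:

```latex
\[
  H = \frac{n}{\sum_{i=1}^{n} \frac{1}{s_i}}
\]
% Hypothetical illustration (values not taken from the slides):
% two benchmarks with speedups 1.05 and 1.20 give
\[
  H = \frac{2}{\frac{1}{1.05} + \frac{1}{1.20}}
    = \frac{2}{\frac{75}{42}} = 1.12
\]
```

The arithmetic mean of the same two hypothetical values is 1.125. The harmonic mean is never larger than the arithmetic mean, so it gives the more conservative summary of the two.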

Speedup (200 KB)
[Figure: speedup (y-axis, 1.00 to 1.25) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean.]

Reuse (32 KB)
[Figure: two panels showing percentage of reuse (y-axis, 0 to 70) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean; one panel is labelled "Ops ready".]

Reuse (200 KB)
[Figure: two panels showing percentage of reuse (y-axis, 0 to 80) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean; one panel is labelled "Ops ready".]

Reuse by Instruction Category
[Figure: four panels (Load Value, Memory Address, Arithmetic, Cond Branch) showing percentage of reuse (y-axis, 0 to 100) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean.]

Hybrid Scheme
[Figure: hybrid organization with Atables whose entries hold opcode, result/address, opnd1, opnd2 and a pointer, accessed both by PC and by operands (Opnds).]

Speedup (Hybrid Scheme)
[Figure: speedup (y-axis, 1.00 to 1.20) of RCB and Hybrid for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean.]

Reuse (Hybrid Scheme)
[Figure: percentage of reuse (y-axis, 0 to 70) of RCB and Hybrid for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean.]

Speedup (Perfect Reuse Engine)
[Figure: speedup (y-axis, 1.00 to 2.20) for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean.]

Conclusions
Redundant Computation Buffer
– Quasi-invariants
– Quasi-common subexpressions
High reuse coverage and low latency
– 30% reuse
– 10% speedup
– Outperforms previous schemes