
Dynamic Removal of Redundant Computations


Carlos Molina, Antonio González and Jordi Tubella

Universitat Politècnica de Catalunya - Barcelona

{cmolina,antonio,jordit}@ac.upc.es

ICS´99, Rhodes (Greece) - June 20-25, 1999

Motivation

Quasi-invariant:

    for (i=0; i<N; i++)
        A[i] = B[i]+C[i];

Quasi-common subexpression:

    R = S / T;
    . . . . .
    X = S / U;

Outline

Instruction Reuse

Related Work

Redundant Computation Buffer

Performance Results

Conclusions

Instruction Reuse

[Diagram: pipeline with Fetch, Decode & Rename, OOO Execution and Commit stages; an index feeds a Reuse Mechanism alongside the pipeline]

Related Work

Instruction Reuse
– Value Cache for the Tree Machine (Harbison 82)
– Result Cache (Richardson 92, Oberman et al. 95)
– Reuse Buffer (Sodani and Sohi 97)
– Physical Register Reuse (Jourdan et al. 98)

Trace Reuse
– Basic blocks (Huang and Lilja 99)
– General traces (González et al. 99)

Related Work

Result Cache (Richardson 92, Oberman & Flynn 95)
– Special purpose (long latency operations)
– Indexed by operand values
– No reuse chaining
– Can reuse dynamic instances of other static instructions

Reuse Buffer (Sodani & Sohi 97)
– General purpose
– Indexed by PC
– Reuse chaining
– Only reuse dynamic instances of same static instructions
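
The difference between the two indexing policies can be made concrete with a small Python sketch. It is not taken from the cited papers: both tables are modelled as unbounded dictionaries, the PC-indexed table keeps a single entry per instruction, reuse chaining is not modelled, and the function names and the four-instruction trace (two divisions at hypothetical addresses 10 and 20, each executed twice) are assumptions chosen for illustration.

```python
def result_cache_hits(trace):
    """Count reuses when the table is indexed by opcode and operand values."""
    table = {}  # (opcode, opnd1, opnd2) -> result
    hits = 0
    for _pc, opcode, a, b in trace:
        key = (opcode, a, b)
        if key in table:
            hits += 1            # any earlier instruction with these values matches
        else:
            table[key] = a // b  # only integer division is modelled here
    return hits


def reuse_buffer_hits(trace):
    """Count reuses when the table is indexed by PC (one entry per instruction)."""
    table = {}  # pc -> (opcode, opnd1, opnd2, result)
    hits = 0
    for pc, opcode, a, b in trace:
        entry = table.get(pc)
        if entry is not None and entry[:3] == (opcode, a, b):
            hits += 1            # same static instruction, same operand values
        else:
            table[pc] = (opcode, a, b, a // b)
    return hits


# Two static divisions (PC 10 and PC 20) that see the same operand values.
trace = [(10, "div", 8, 2), (20, "div", 8, 2),
         (10, "div", 9, 3), (20, "div", 9, 3)]

print(result_cache_hits(trace))  # 2
print(reuse_buffer_hits(trace))  # 0
```

On this trace the value-indexed table reuses the second instance of each division, while the PC-indexed table finds nothing to reuse because each static instruction sees new operand values on its second execution.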

Redundant Computation Buffer

Vtable: Atable pointer
Atable: opcode | result/address | opnd1 | opnd2 | pointer
Mtable: address tag | result

Outputs: Reuse Test, Reused Value, Reused Memory Value

RCB (Working Example)

    while (cond) { r = s / t; ...... x = s / u; }

Step 1. I1: 8 / 2 = 4
    10: div 8 2 4 nil

Step 2. I2: 8 / 2 = 4
    10: div 8 2 4 nil
    20: div 8 2 4 nil

Step 3. I1: 9 / 3 = 3
    10: div 9 3 3 nil (replaces div 8 2 4 nil)
    20: div 8 2 4 nil

Step 4. I2: 9 / 3 = 3

[Diagram: Vtable and Atable contents at each step, with rows labelled 10 and 20; a standalone value (4, then 3) is shown next to the tables]
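
The arrows of the diagram do not survive in the text, so the following Python sketch is only one possible reading of the example, not the authors' design. It assumes that the Atable holds one entry per instruction address with a pointer to another instruction's entry, that the Vtable maps operand values to the instruction that first computed them, and that a failed first lookup is followed by a second lookup through the pointer. The class name `ToyRCB`, the method names and the unbounded dictionaries are all hypothetical.

```python
class ToyRCB:
    """One possible reading of the Vtable/Atable diagram (an assumption)."""

    def __init__(self):
        # PC -> (opcode, opnd1, opnd2, result, pointer); pointer is a PC or None
        self.atable = {}
        # (opcode, opnd1, opnd2) -> PC that first computed those values
        self.vtable = {}

    def _matches(self, pc, key):
        entry = self.atable.get(pc)
        return entry is not None and entry[:3] == key

    def access(self, pc, opcode, a, b, execute):
        """Return (result, reused) for one dynamic instruction."""
        key = (opcode, a, b)
        own = self.atable.get(pc)
        pointer = own[4] if own is not None else None
        if self._matches(pc, key):  # 1st lookup: this instruction's own entry
            return own[3], True
        if pointer is not None and self._matches(pointer, key):
            # 2nd lookup: the entry of the linked instruction
            result, reused = self.atable[pointer][3], True
        else:
            result, reused = execute(a, b), False
            producer = self.vtable.setdefault(key, pc)
            if producer != pc:
                pointer = producer  # link to the instruction seen with these values
        self.atable[pc] = (opcode, a, b, result, pointer)
        return result, reused


rcb = ToyRCB()
trace = [(10, 8, 2), (20, 8, 2), (10, 9, 3), (20, 9, 3)]  # I1, I2, I1, I2
outcomes = [rcb.access(pc, "div", a, b, lambda x, y: x // y)
            for pc, a, b in trace]
print(outcomes)  # [(4, False), (4, False), (3, False), (3, True)]
```

Under these assumptions the first instance of I2 only establishes the link to I1's entry, and the second instance of I2 reuses the value 3 through that link even though I2 itself has never seen the operands 9 and 3.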


Enhancements to Other Schemes

Enhanced Result Cache (indexed by Operands)
    Atable: opcode | result/address | opnd1 | opnd2
    Mtable: address tag | result

Enhanced Reuse Buffer (indexed by PC)
    Atable: opcode | result/address | opnd1 | opnd2
    Mtable: address tag | result

Timing Considerations

Pipeline Stages: fetch, decode & rename, opnd read & dispatch, issue, execute, write back, commit

Latency of the Reuse Buffer: Atable lookup, reuse test
Latency of the RCB: 1st Atable lookup, reuse test, 2nd Atable lookup
Latency of the Result Cache: Atable lookup, reuse test

[Diagram: timeline of each scheme's lookup steps drawn under the pipeline stages]

Experimental Framework

Simulator: Alpha version of the SimpleScalar Toolset
Benchmarks: SPEC95
Maximum Optimization Level: DEC C & F77 compilers with -non_shared -O5
Statistics: collected for 125 million instructions, skipping initializations

Basic Reuse Statistics

We evaluate different schemes
- Enhanced Result Cache (ERC)
- Enhanced Reuse Buffer (ERB)
- Redundant Computation Buffer (RCB)

We find the best configuration for each scheme
- Number of entries
- History depth

The best configurations will be evaluated
- Percentage of reuse
- Speedup

Quasi-Common Subexpressions

[Bar chart: percentage of reuse (axis 0 to 45) for ERB and RCB at 32 KB]

Study of Reuse (ERB)

[Chart: percentage of reuse (axis 10 to 55) versus size in Kbytes (8 to 4096); one curve per table size: 128, 256, 512, 1K, 2K, 4K, 8K and 16K entries]

Study of Reuse (RCB)

[Chart: percentage of reuse (axis 15 to 60) versus size in Kbytes (8 to 4096); one curve per table size: 128, 256, 512, 1K, 2K, 4K, 8K and 16K entries]

Study of Reuse (Comparative)

[Chart: percentage of reuse (axis 10 to 70) versus size in Kbytes (8 to 4096) for ERB, RCB and ERC]

Performance Evaluation

Two different capacities are evaluated
- 32 KB
- 200 KB

The best configuration has been chosen for each reuse scheme

We present a performance evaluation for a superscalar processor
- Speedup
- Percentage of reuse
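
The slides do not define the two metrics, so the sketch below assumes the usual definitions: speedup is the baseline cycle count divided by the cycle count with reuse, and the percentage of reuse is the share of dynamic instructions whose result was reused. The speedup charts label their summary bar H_Mean and the reuse charts label theirs A_Mean, which is read here as harmonic and arithmetic mean respectively. Function names and all numbers are hypothetical.

```python
from statistics import harmonic_mean, mean


def speedup(base_cycles, reuse_cycles):
    """Speedup of the machine with reuse over the baseline (same program)."""
    return base_cycles / reuse_cycles


def reuse_percentage(reused_instructions, total_instructions):
    """Share of dynamic instructions whose result came from the reuse table."""
    return 100.0 * reused_instructions / total_instructions


# Hypothetical per-benchmark numbers, for illustration only.
speedups = [speedup(1000, 800), speedup(1000, 1000), speedup(1000, 1000)]
reuse = [reuse_percentage(30, 100), reuse_percentage(10, 100)]

print(harmonic_mean(speedups))  # H_Mean-style summary of the speedups
print(mean(reuse))              # A_Mean-style summary of the reuse percentages
```

The harmonic mean is the conventional choice for averaging ratios such as speedups, since it is dominated by the benchmarks that improve least.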

Base Microarchitecture

Instruction fetch: 4 instructions per cycle
Branch predictor: 2048-entry bimodal predictor
Data cache: 16 KB, 2-way set associative, 32-byte block, 6-cycle miss latency
Instruction cache: 16 KB, direct mapped, 32-byte cache line, 6-cycle miss latency
Instruction issue/commit: out-of-order issue, 4 instructions committed per cycle, 32-entry reorder buffer, loads execute if preceding stores are known, store-load forwarding
Architected registers: 32 integer and 32 FP
Functional units: 4 integer ALUs, 2 load/store units, 4 FP adders, 1 integer mult/div, 1 FP mult/div
FU latency/repeat time: integer ALU 1/1, load/store 1/1, integer mult 3/, integer div 20/19, FP adder 2/1, FP mult 4/1, FP div 12/12

Speedup (32 KB)

[Bar chart: speedup (axis 1.00 to 1.20) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean]

Speedup (200 KB)

[Bar chart: speedup (axis 1.00 to 1.25) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean]

Reuse (32 KB)

[Two bar charts: percentage of reuse (axis 0 to 70) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean; one panel is labelled "Ops ready"]

Reuse (200 KB)

[Two bar charts: percentage of reuse (axis 0 to 80) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean; one panel is labelled "Ops ready"]

Reuse by Instruction Category

[Four bar charts, one per category (Load Value, Memory Address, Arithmetic, Cond Branch): percentage of reuse (axis 0 to 100) of ERB, RCB and ERC for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean]

Hybrid Scheme

[Diagram: Atable entries of the form opcode | result/address | opnd1 | opnd2 | pointer, with one Atable indexed by PC and one indexed by operands (Opnds); the operand-indexed entry is shown with a nil pointer]

Speedup (Hybrid Scheme)

[Bar chart: speedup (axis 1.00 to 1.20) of RCB and Hybrid for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean]

Reuse (Hybrid Scheme)

[Bar chart: percentage of reuse (axis 0 to 70) of RCB and Hybrid for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and A_Mean]

Speedup (Perfect Reuse Engine)

[Bar chart: speedup (axis 1.00 to 2.20) for Applu, Compress, Gcc, Go, Li, M88ksim, Mgrid, Perl, Swim, Turb3d, Vortex and H_Mean]

Conclusions

Redundant Computation Buffer
– Quasi-invariants
– Quasi-common subexpressions

High reuse coverage and low latency
– 30% reuse
– 10% speedup
– Outperforms previous schemes
