link-time path-sensitive memory redundancy elimination

Link-Time Path-Sensitive Memory Redundancy Elimination Manel Fernández and Roger Espasa {mfernand,roger}@ac.upc.es Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain


Post on 21-Jan-2016


Page 1: Link-Time Path-Sensitive Memory Redundancy Elimination

Link-Time Path-Sensitive Memory Redundancy Elimination

Manel Fernández and Roger Espasa

{mfernand,roger}@ac.upc.es

Computer Architecture Department

Universitat Politècnica de Catalunya

Barcelona, Spain

Page 2: Link-Time Path-Sensitive Memory Redundancy Elimination

Motivation

The memory “gap”
  Processor speed increases faster than memory speed
  L1-cache latency continues to increase
  Memory operations remain a significant bottleneck

Memory redundancy
  Instructions that repeatedly access the same location
  Lots of memory operations are redundant
  Hardware designers exploit memory redundancy
    E.g., caches take advantage of temporal reuse

The compiler must be very aggressive in memory optimizations

Page 3: Link-Time Path-Sensitive Memory Redundancy Elimination

Memory redundancy

Memory instructions that repeatedly access the same location
  Lots of memory operations are redundant

Sources of redundancy
  Source code structure
    Programmers introduce redundancy
  Traditional compilation
    Separate compilation units
    Limitations in the compilation model
    Code generation introduces redundancy

What percentage of memory operations are redundant at run time?

Example:
  … = *p;          /* redundancy source */
  if ( … ) {
    *q = …         /* intervening store */
    … = *p;        /* redundant load */
  }

Page 4: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic memory redundancy

[Figure: dynamic load/store redundancy (%), 0 to 100, versus redundancy window size (2 to 1024 entries); one curve per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) plus Average; curve groups labelled "Load redundancy" and "Store redundancy".]
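A measurement of this kind could be approximated as in the sketch below. The slides do not define the redundancy window precisely, so this assumes it is a table of the N most recently accessed addresses with LRU replacement, and that an access is redundant when the window already holds the same value for that address. The trace format and the function name `dynamic_redundancy` are hypothetical.

```python
from collections import OrderedDict


def dynamic_redundancy(trace, window_size):
    """Return (load_redundancy, store_redundancy) as fractions in [0, 1].

    trace: iterable of (op, address, value), op being "load" or "store".
    Assumed definition: an access is redundant if the window already
    records the same value for the same address.
    """
    window = OrderedDict()  # address -> last known value, in LRU order
    counts = {"load": 0, "store": 0}
    redundant = {"load": 0, "store": 0}
    for op, addr, value in trace:
        counts[op] += 1
        if addr in window and window[addr] == value:
            redundant[op] += 1
        window[addr] = value
        window.move_to_end(addr)        # mark as most recently used
        if len(window) > window_size:
            window.popitem(last=False)  # evict the least recently used entry

    def ratio(op):
        return redundant[op] / counts[op] if counts[op] else 0.0

    return ratio("load"), ratio("store")
```

Running it over the same trace with growing `window_size` values gives one point per window size, which is the shape of the curve plotted on the slide.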

Page 5: Link-Time Path-Sensitive Memory Redundancy Elimination

Talk outline

Motivation

Memory redundancy elimination (MRE)

Evaluation

Summary

Page 6: Link-Time Path-Sensitive Memory Redundancy Elimination

Memory redundancy elimination (MRE)

Removal of memory instructions that repeatedly access the same location

Targeted at redundancy type
  Load redundancy elimination (LRE) in a path-sensitive fashion
    Based on path-sensitive memory disambiguation
  Store redundancy elimination (SRE)

Targeted at redundancy distance
  Eliminating close/distant redundancy

In the context of a binary optimizer
  Overcome limitations of traditional compilers
  Need to deal with “executable code” problems

Page 7: Link-Time Path-Sensitive Memory Redundancy Elimination

Load redundancy elimination (LRE)

Fundamental problems
  Alias analysis for disambiguation
  Liveness analysis for register bypassing
  Cost-benefit analysis for applying LRE
    Profile information is needed

Eliminating close redundancy
  Within extended basic blocks (EBBs)

Eliminating distant redundancy
  Intraprocedural dataflow analysis [HorspoolHo97]
  For fully/partially-redundant loads
    Redundancy on all/some paths
  Partial-LRE requires insertion of speculative loads

Example (hot path highlighted on the slide):
  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)

[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis. CSSE’97.
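The close-redundancy case can be illustrated on a toy straight-line block. The sketch below is a simplification and not the authors' algorithm: it assumes a three-form toy instruction set, performs no alias analysis (any store invalidates everything), and bypasses through the register that already holds the value instead of copying into a free register r0 found by liveness analysis. All names are hypothetical.

```python
def eliminate_redundant_loads(block):
    """Rewrite redundant loads in a straight-line block as register moves.

    Toy instruction forms (an assumption of this sketch):
      ("load", addr_reg, dst)    dst <- mem[addr_reg]
      ("store", src, addr_reg)   mem[addr_reg] <- src
      ("op", dst)                any other instruction that writes dst
    """
    available = {}  # addr_reg -> register currently holding mem[addr_reg]

    def kill(reg):
        # Writing reg invalidates entries addressed through it or cached in it.
        for addr in [a for a, r in available.items() if reg in (a, r)]:
            del available[addr]

    out = []
    for ins in block:
        if ins[0] == "load":
            _, addr, dst = ins
            src = available.get(addr)
            if src is not None and src == dst:
                continue                  # dst already holds the value
            out.append(("move", src, dst) if src is not None else ins)
            kill(dst)
            if src is None and dst != addr:
                available[addr] = dst     # this load becomes a redundancy source
        elif ins[0] == "store":
            _, src, addr = ins
            out.append(ins)
            available.clear()             # conservative: the store may alias anything
            available[addr] = src         # the stored value can be bypassed later
        else:
            out.append(ins)
            kill(ins[1])
    return out
```

A real binary optimizer would replace the conservative `available.clear()` with the alias analysis listed above, and would weigh each rewrite with profile data before applying it.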

Page 8: Link-Time Path-Sensitive Memory Redundancy Elimination

Memory disambiguation

Register use-def chains
  Symbolic descriptors for every use
  Disambiguation by instruction inspection

Fails on path-sensitive redundancies
  Need to deal with path-sensitive information
  Partial-LRE is not sufficient either

Example:
  I0  def p0
      ...
  I1  load (p0), r1
      ...
  I3  add p0, 8, p0
      ...
  Iφ  φ-def p0
      ...
  I2  load (p0), r2

Symbolic descriptors:
  S_{I1}(p0) = p0_{I0}
  S_{I2}(p0) = \phi(p0_{I0}, p0_{I3}) = \phi(p0_{I0}, p0_{I0} + 8)
  S_{I1}(p0) = S_{I2}(p0) ?
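Why plain descriptor comparison fails here can be shown with a small model. This is a sketch under assumptions: descriptors are modelled as either a base definition plus a constant offset or a φ over incoming descriptors, and two uses are disambiguated only when their descriptors are structurally identical. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Base:
    definition: str   # label of the defining instruction, e.g. "I0"
    offset: int = 0   # constant added since that definition


@dataclass(frozen=True)
class Phi:
    args: Tuple["Desc", ...]   # one descriptor per incoming path


Desc = Union[Base, Phi]


def add_const(d: Desc, k: int) -> Desc:
    """Descriptor of a register after 'add reg, k, reg'."""
    if isinstance(d, Base):
        return Base(d.definition, d.offset + k)
    return Phi(tuple(add_const(a, k) for a in d.args))


def same_location(a: Desc, b: Desc) -> bool:
    """Path-insensitive check: descriptors must be structurally identical."""
    return a == b


s_i1 = Base("I0")              # p0 as seen by I1
s_i3 = add_const(s_i1, 8)      # p0 after I3: add p0, 8, p0
s_i2 = Phi((s_i1, s_i3))       # p0 as seen by I2, after the phi-def
print(same_location(s_i1, s_i2))
```

The comparison fails because the φ merges a path on which the address is unchanged with one on which it was advanced by 8, even though the load is redundant along the first of those paths.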

Page 9: Link-Time Path-Sensitive Memory Redundancy Elimination

Path-sensitive memory disambiguation
  Established for only a subset of all the possible paths
  Subsumes generic disambiguation

Path-sensitive LRE
  Partial-LRE is now adapted for dealing with path-sensitive redundancies
  Availability on edge (AVEDG_{ij})

Path-sensitive redundancy:
  I0  def p0
      ...
  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I3  add p0, 8, p0
      load (p0), r0      (inserted)
      ...
  Iφ  φ-def p0
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)

Symbolic descriptors:
  S_{I1}(p0) = p0_{I0},  S_{I2}(p0) = \phi(p0_{I0}, p0_{I0} + 8)
  Path ps1:  S^{ps1}_{I1}(p0) = p0_{I0},  S^{ps1}_{I2}(p0) = p0_{I0}        (same location)
  Path ps2:  S^{ps2}_{I1}(p0) = p0_{I0},  S^{ps2}_{I2}(p0) = p0_{I0} + 8    (x: different location)
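The per-path comparison can be sketched as follows. This assumes the same hypothetical descriptor model as a base definition plus offset or a φ over incoming paths, treats the source descriptor as φ-free, and models availability on an edge as a simple boolean per incoming path; none of these details are spelled out on the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Base:
    definition: str   # label of the defining instruction, e.g. "I0"
    offset: int = 0   # constant added since that definition


@dataclass(frozen=True)
class Phi:
    args: Tuple[Union["Base", "Phi"], ...]   # one descriptor per incoming path


def flatten(d) -> List[Base]:
    """List the phi-free descriptors that can reach a use, one per path."""
    if isinstance(d, Phi):
        return [leaf for arg in d.args for leaf in flatten(arg)]
    return [d]


def availability_on_edges(src: Base, red) -> List[bool]:
    """For each path into the candidate load, does it address src's location?"""
    return [leaf == src for leaf in flatten(red)]


s_i1 = Base("I0")                          # address used by I1
s_i2 = Phi((Base("I0"), Base("I0", 8)))    # address used by I2
edges = availability_on_edges(s_i1, s_i2)
fully_redundant = all(edges)
path_sensitive_redundant = any(edges) and not fully_redundant
```

On edges where the check fails, the transformation shown on the slide inserts a load into the bypass register, so the final move at I2 is valid on every path.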

Page 10: Link-Time Path-Sensitive Memory Redundancy Elimination

Store redundancy elimination (SRE)

Similar approach to LRE
  SRE on EBBs
  Full- and Partial-SRE
    New formulation of the analysis
  No path-sensitive elimination!

Elimination of dead stores
  Other optimizations produce a lot of dead stores
  Form of dead code elimination
  Based on heuristics
    Includes a basic analysis for useless stack locations

Examples (on the slide, a strike-through marks the eliminated store):
  I1  store r1, (p0)
      ...
  I2  store r2, (p0)

  I1  load (p0), r0
      ...
  I2  store r0, (p0)
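The two store patterns on this slide can be illustrated on a toy straight-line block. The sketch assumes common definitions that the transcript does not confirm: a store is removed when it writes back a value known to be in memory already, and an earlier store is removed when a later store to the same address register overwrites it with no intervening load. Without alias analysis, any load is treated as a possible reader of every pending store. Instruction forms and names are hypothetical.

```python
def eliminate_redundant_stores(block):
    """Drop redundant stores from a straight-line block.

    Toy instruction forms (an assumption of this sketch):
      ("load", addr_reg, dst)    dst <- mem[addr_reg]
      ("store", src, addr_reg)   mem[addr_reg] <- src
      ("op", dst)                any other instruction that writes dst
    """
    keep = [True] * len(block)
    pending = {}  # addr_reg -> index of a store no load has observed yet
    holds = {}    # addr_reg -> register known to equal mem[addr_reg]

    def kill(reg):
        # reg is overwritten: forget facts that mention it.
        pending.pop(reg, None)
        for addr in [a for a, r in holds.items() if reg in (a, r)]:
            del holds[addr]

    for i, ins in enumerate(block):
        if ins[0] == "load":
            _, addr, dst = ins
            pending.clear()                   # the load may read any pending store
            kill(dst)
            if dst != addr:
                holds[addr] = dst
        elif ins[0] == "store":
            _, src, addr = ins
            if holds.get(addr) == src:
                keep[i] = False               # memory already contains this value
                continue
            if addr in pending:
                keep[pending[addr]] = False   # earlier store overwritten before use
            holds.clear()                     # conservative: may alias anything
            holds[addr] = src
            pending[addr] = i
        else:
            kill(ins[1])
    return [ins for ins, k in zip(block, keep) if k]
```

Stores still pending at the end of the block are kept, since code after the block may read them.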

Page 11: Link-Time Path-Sensitive Memory Redundancy Elimination

Talk outline

Motivation

Memory redundancy elimination (MRE)

Evaluation

Summary

Page 12: Link-Time Path-Sensitive Memory Redundancy Elimination

Methodology

Benchmark suite
  SPECint95
    Compiled on an AlphaServer with full optimizations
    Instrumented using Pixie to get profiling information
    Aggressively re-optimized using Alto

Experimental framework
  Alto executable optimizer

Evaluation
  Dynamic number of loads/stores
  Actual execution time
    AlphaServer GS-140, Alpha EV6-21264

Page 13: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic number of loads/stores

[Figure: two bar charts, "Dynamic number of loads" and "Dynamic number of stores", per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean); y-axis 60% to 100%; series: Basic, Full, Partial, Complete.]

Page 14: Link-Time Path-Sensitive Memory Redundancy Elimination

Execution time

[Figure: bar chart of execution time per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean); y-axis 60% to 100%; series: Basic, Full, Partial, Complete.]

Relative execution time on an AlphaServer GS-140, Alpha EV6-21264 525MHz

Page 15: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic replay traps

[Figure: bar chart of dynamic Alpha 21264 replay traps per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean); y-axis 0% to 100%; series: Basic, Full, Partial, Complete.]

Relative number of replay traps on the sim-alpha simulator, modeling an Alpha EV6-21264

Page 16: Link-Time Path-Sensitive Memory Redundancy Elimination

Talk outline

Motivation

Memory redundancy elimination (MRE)

Evaluation

Summary

Page 17: Link-Time Path-Sensitive Memory Redundancy Elimination

Summary

A high percentage of memory operations are redundant

Memory redundancy elimination (MRE)
  Removal of redundant memory operations
    Load redundancy elimination (LRE) in a path-sensitive fashion
      Based on path-sensitive memory disambiguation
    Store redundancy elimination (SRE)
      Including elimination of dead stores
  For executable code or at link time
    Overcome limitations of traditional compilers
  Valuable results on real execution time

Future directions
  Explore better alias analysis mechanisms
  Additional techniques for MRE

Page 18: Link-Time Path-Sensitive Memory Redundancy Elimination

Backup slides

Page 19: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic memory redundancy

[Figure: two line charts versus redundancy window size (2 to 1024 entries): "Dynamic load redundancy (%)" with y-axis 0 to 100, and "Dynamic store redundancy (%)" with y-axis 0 to 40; one curve per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) plus Average.]

Page 20: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic load redundancy

[Figure: dynamic load redundancy (%), 0 to 100, versus redundancy window size (2 to 1024 entries); one curve per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) plus Average.]

Page 21: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic store redundancy

[Figure: dynamic store redundancy (%), 0 to 40, versus redundancy window size (2 to 1024 entries); one curve per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) plus Average.]

Page 22: Link-Time Path-Sensitive Memory Redundancy Elimination

Load redundancy elimination (LRE)

I1 loads a value from memory into r1
I2 loads from the same location into r2
Location (p0) is not modified between I1 and I2
r1 can be safely bypassed to r2
I2 can be removed!

  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)

Page 23: Link-Time Path-Sensitive Memory Redundancy Elimination

LRE on executable code

Is (p1) at I1 the same memory location as (p2) at I2?
  Alias analysis!

Is there any available register between I1 and I2 that can be used to bypass r1 to r2?
  Register liveness analysis!

  I1  load (p1), r1
      move r1, r0        (inserted)
      ...
  I2  load (p2), r2      (removed)
      move r0, r2        (inserted in place of I2)

Page 24: Link-Time Path-Sensitive Memory Redundancy Elimination

LRE: Eliminating close redundancy

For extended basic blocks (EBBs)
  Alias analysis: for disambiguation
  Register liveness analysis: for bypassing

Profile-guided LRE
  There is not always a benefit in removing a redundant load
  Need to evaluate cost-benefit of applying LRE!

  C = lat_{move} \times (freq_{BB_1} + freq_{BB_2})
  B = lat_{load} \times freq_{BB_2}
  B > C \Rightarrow LRE

Example (hot path highlighted on the slide):
  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)
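The cost-benefit decision can be made concrete with profile counts. The sketch below assumes the benefit is the load latency weighted by the execution frequency of the block containing I2, and the cost is the move latency weighted by the frequencies of both blocks, since each receives one move. The latency values in the example call are made-up placeholders, not Alpha 21264 figures, and the function name is hypothetical.

```python
def lre_is_profitable(freq_bb1, freq_bb2, lat_load, lat_move):
    """Decide whether replacing the load in BB2 by a move pays off.

    freq_bb1, freq_bb2: profiled execution counts of the blocks holding
                        I1 (redundancy source) and I2 (redundant load).
    lat_load, lat_move: assumed instruction latencies in cycles.
    """
    cost = lat_move * (freq_bb1 + freq_bb2)   # one move added in each block
    benefit = lat_load * freq_bb2             # the load removed from BB2
    return benefit > cost


# I2 sits on the hot path: the removed load outweighs the two moves.
hot = lre_is_profitable(freq_bb1=100, freq_bb2=90, lat_load=3, lat_move=1)

# I1 runs far more often than I2: the extra move after I1 dominates.
cold = lre_is_profitable(freq_bb1=1000, freq_bb2=10, lat_load=3, lat_move=1)
```

The second call shows why the slide says there is not always a benefit: the bypass move is paid on every execution of I1, including those that never reach I2.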

Page 25: Link-Time Path-Sensitive Memory Redundancy Elimination

LRE: Eliminating distant redundancy

For eliminating fully- and partially-redundant loads
  Requires insertion of speculative loads

Dataflow analysis [HorspoolHo97]
  Extended cost equation
  Complex search for available registers

  C = C_{bypass} + C_{insert}
  C_{bypass} = lat_{move} \times (freq_{BB_{red}} + \sum_{i=1}^{n} freq_{BB_{src_i}})
  C_{insert} = lat_{load} \times \sum_{i=1}^{m} freq_{EDG_{src_i}}

Example:
  One incoming path:
    I1  store r1, (p0)
        move r1, r0        (bypass)
  Other incoming path:
        load (p0), r0      (insert)
  After the join:
    I2  load (p0), r1      (removed)
        move r0, r1        (inserted in place of I2)

[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis. CSSE’97.
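The extended cost can be evaluated numerically as in the sketch below. It assumes the total cost is a bypass term (one move in the redundant block and one in each of the n source blocks) plus an insert term (one speculative load on each of the m edges where the value is not available), each weighted by profiled frequency. The benefit used for comparison, the load latency times the frequency of the redundant block, is also an assumption, as are the names and the latency values.

```python
def partial_lre_cost(freq_red, freq_srcs, freq_insert_edges, lat_move, lat_load):
    """Total cost of partial LRE for one redundant load.

    freq_red:          execution count of the block holding the redundant load
    freq_srcs:         execution counts of the n redundancy-source blocks
    freq_insert_edges: execution counts of the m edges receiving a load
    """
    c_bypass = lat_move * (freq_red + sum(freq_srcs))
    c_insert = lat_load * sum(freq_insert_edges)
    return c_bypass + c_insert


def partial_lre_is_profitable(freq_red, freq_srcs, freq_insert_edges,
                              lat_move, lat_load):
    benefit = lat_load * freq_red   # assumed: the load removed at the join
    cost = partial_lre_cost(freq_red, freq_srcs, freq_insert_edges,
                            lat_move, lat_load)
    return benefit > cost


# Join executed 100 times: 60 via the source block, 40 via the other edge.
cost = partial_lre_cost(100, [60], [40], lat_move=1, lat_load=3)
worth_it = partial_lre_is_profitable(100, [60], [40], lat_move=1, lat_load=3)
```

As the frequency of the edges needing an inserted load grows, the insert term approaches the benefit and the transformation stops paying off.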

Page 26: Link-Time Path-Sensitive Memory Redundancy Elimination

Load redundancy elimination (LRE)

Fundamental problems
  Alias analysis for disambiguation
  Liveness analysis for register bypassing
  Cost-benefit analysis for applying LRE
    Profile information is needed

Eliminating close redundancy
  Within extended basic blocks (EBBs)

Eliminating distant redundancy
  Intraprocedural dataflow analysis [HorspoolHo97]
  For fully/partially-redundant loads
  Partial-LRE requires insertion of speculative loads

Example (hot path highlighted on the slide):
  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)

[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis. CSSE’97.

Page 27: Link-Time Path-Sensitive Memory Redundancy Elimination

Path-sensitive LRE

Path-sensitive redundancy
  Redundancy occurs only on some execution paths
  Partial-LRE is not sufficient

Memory disambiguation
  Using register use-def chains
  Symbolic descriptors for every use

Path-sensitive memory disambiguation is needed!

Example:
  I0  def p0
      ...
  I1  load (p0), r1
      ...
  I3  add p0, 8, p0
      ...
  Iφ  φ-def p0
      ...
  I2  load (p0), r2

Symbolic descriptors:
  S_{I1}(p0) = p0_{I0}
  S_{I2}(p0) = \phi(p0_{I0}, p0_{I3}) = \phi(p0_{I0}, p0_{I0} + 8)

Page 28: Link-Time Path-Sensitive Memory Redundancy Elimination

Path-sensitive memory disambiguation

Path-sensitive information
  Disambiguation is established for only a subset of all the possible paths
  For detecting path-sensitive exact memory dependencies

Partial-LRE
  Algorithm is now adapted for dealing with path-sensitive redundancies
  Availability on edge (AVEDG_{ij})

Example:
  I0  def p0
      ...
  I1  load (p0), r1
      move r1, r0        (inserted)
      ...
  I3  add p0, 8, p0
      load (p0), r0      (inserted)
      ...
  Iφ  φ-def p0
      ...
  I2  load (p0), r2      (removed)
      move r0, r2        (inserted in place of I2)

Symbolic descriptors:
  S_{I1}(p0) = p0_{I0},  S_{I2}(p0) = \phi(p0_{I0}, p0_{I0} + 8)
  Path ps1:  S^{ps1}_{I1}(p0) = p0_{I0},  S^{ps1}_{I2}(p0) = p0_{I0}        (same location)
  Path ps2:  S^{ps2}_{I1}(p0) = p0_{I0},  S^{ps2}_{I2}(p0) = p0_{I0} + 8    (x: different location)

Page 29: Link-Time Path-Sensitive Memory Redundancy Elimination

A combined algorithm

Short-distance MRE
  Basic: MRE within EBBs

Long-distance MRE
  Full: Full-MRE
  Partial: Partial-MRE
  Complete: Path-sensitive LRE, Partial SRE, Dead store elimination

Optimization phases shown on the slide:
  Easy optimizations (including Basic-MRE), appearing three times
  Function inlining
  Long-distance MRE (Full/Partial/Complete)

Page 30: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic number of loads

[Figure: bar chart of the dynamic number of loads per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean); y-axis 60% to 100%; series: Basic, Full, Partial, Complete.]

Page 31: Link-Time Path-Sensitive Memory Redundancy Elimination

Dynamic number of stores

[Figure: bar chart of the dynamic number of stores per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean); y-axis 60% to 100%; series: Basic, Full, Partial, Complete.]

Page 32: Link-Time Path-Sensitive Memory Redundancy Elimination

Alpha 21264 results

[Figure: two bar charts per benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, Gmean): "Execution time" with y-axis 60% to 100%, and "Dynamic number of replay traps" with y-axis 0% to 100%; series: Basic, Full, Partial, Complete.]