Compiler Managed Partitioned Data Caches for Low Power


Page 1: Compiler Managed Partitioned Data Caches for Low Power

University of Michigan, Electrical Engineering and Computer Science

Compiler Managed Partitioned Data Caches for Low Power

Rajiv Ravindran*, Michael Chu, and Scott Mahlke

Advanced Computer Architecture LabDepartment of Electrical Engineering and Computer Science

University of Michigan, Ann Arbor

* Currently with the Java, Compilers, and Tools Lab, Hewlett Packard, Cupertino, California

Page 2: Compiler Managed Partitioned Data Caches for Low Power


Introduction: Memory Power

• On-chip memories are a major contributor to system energy
• Data caches: ~16% in StrongARM [Unsal et al., '01]

Hardware: banking, dynamic voltage/frequency scaling, dynamic resizing
+ Transparent to the user
+ Handle arbitrary instruction/data accesses
– Limited program information
– Reactive

Software: software-controlled scratch-pads, data/code reorganization
+ Whole-program information
+ Proactive
– No dynamic adaptability
– Conservative

Page 3: Compiler Managed Partitioned Data Caches for Low Power


Reducing Data Memory Power: Compiler Managed, Hardware Assisted


Combining both approaches gives:
• Global program knowledge
• Proactive optimizations
• Dynamic adaptability
• Efficient execution
• Aggressive software optimizations

Page 4: Compiler Managed Partitioned Data Caches for Low Power


Data Caches: Tradeoffs

Advantages:
+ Capture spatial/temporal locality
+ Transparent to the programmer
+ More general than software scratch-pads
+ Efficient lookups

Disadvantages:
– Fixed replacement policy
– Set index carries no program locality
– Set-associativity has high overhead
– Multiple data/tag arrays activated per access

Page 5: Compiler Managed Partitioned Data Caches for Low Power


[Figure: traditional 4-way set-associative cache. The address splits into tag/set/offset; all four tag/data/LRU ways are read and compared (=?) in parallel, a 4:1 mux selects the hit, and replacement chooses among all ways.]

• Lookup: activate all ways on every access
• Replacement: choose among all the ways

Traditional Cache Architecture

Page 6: Compiler Managed Partitioned Data Caches for Low Power


Partitioned Cache Architecture

[Figure: cache split into four partitions P0..P3 with the same tag/set/offset datapath as the traditional cache, but only the partitions named by the load/store are activated.]

Extended load/store format: Ld/St Reg [Addr] [k-bitvector] [R/U]

• Lookup: restricted to the partitions specified in the bit-vector if 'R', else defaults to all partitions
• Replacement: restricted to the partitions specified in the bit-vector

• Advantages:
– Improve performance by controlling replacement
– Reduce cache access power by restricting the number of partitions accessed

Page 7: Compiler Managed Partitioned Data Caches for Low Power


Partitioned Caches: Example

[Figure: three-way partitioned cache. ld1/st1/ld2/st2 share one way, ld3/ld4 another, and ld5/ld6 the third. Annotated accesses: ld1 [100] R, ld5 [010] R, ld3 [001] R.]

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < N2; j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < N3; k++)
    y[i + k] += *w2++ + x[i + k];
}

Memory operations: ld1/st1 and ld2/st2 access y; ld3 and ld4 access w1/w2; ld5 and ld6 access x.

• Reduces the number of tag checks per iteration from 12 to 4!

Page 8: Compiler Managed Partitioned Data Caches for Low Power


Compiler Controlled Data Partitioning

• Goal: place loads/stores into cache partitions
• Analyze the application's memory characteristics
– Cache requirements: number of partitions per ld/st
– Predict conflicts
• Place loads/stores into different partitions
– Satisfy each instruction's caching needs
– Avoid conflicts; overlap if possible

Page 9: Compiler Managed Partitioned Data Caches for Low Power


Cache Analysis: Estimating Number of Partitions

Access trace: X W1 Y Y | X W1 Y Y (j-loop) | X W2 Y Y | X W2 Y Y (k-loop)

• M has working-set size = 1
• Find the minimal number of partitions that avoids conflict/capacity misses
• Probabilistic hit-rate estimate
• Use the working-set size to compute the number of partitions

Page 10: Compiler Managed Partitioned Data Caches for Low Power


Cache Analysis: Estimating Number of Partitions

[Charts: estimated hit rate vs. number of partitions (1 to 4, i.e. 8 to 32 cache blocks) for reuse distances D = 0, D = 1, and D = 2. For D = 0 every configuration hits with probability 1; for D = 1 and D = 2 the estimated hit rate (e.g. .76, .87, .98) grows with the number of blocks.]

• Avoid conflict/capacity misses for an instruction
• Estimate the hit rate based on reuse distance (D), total number of cache blocks (B), and associativity (A) (Brehob et al., '99)
• In practice, compute energy matrices and pick the most energy-efficient configuration per instruction

Page 11: Compiler Managed Partitioned Data Caches for Low Power


Cache Analysis: Computing Interferences

• Avoid conflicts among temporally co-located references
• Model conflicts using an interference graph

Interference graph nodes: M1, M2, M3, M4, each with D = 1

Access trace: X W1 Y Y | X W1 Y Y | X W2 Y Y | X W2 Y Y
Corresponding instructions: M4 M2 M1 M1 | M4 M2 M1 M1 | M4 M3 M1 M1 | M4 M3 M1 M1

Page 12: Compiler Managed Partitioned Data Caches for Low Power


Partition Assignment

• Placement phase can overlap references
• Compute the combined working-set
• Use the graph-theoretic notion of a clique
• For each clique, new D = Σ of the D of each node
• Combined D for all overlaps = max over all cliques

[Figure: interference graph over M1..M4 (each D = 1) with Clique 1 = {M1, M2, M4} and Clique 2 = {M1, M3, M4}.]

Clique 1: M1, M2, M4, new reuse distance D = 3
Clique 2: M1, M3, M4, new reuse distance D = 3
Combined reuse distance = max(3, 3) = 3

Page 13: Compiler Managed Partitioned Data Caches for Low Power


Experimental Setup

• Trimaran compiler and simulator infrastructure
• ARM9 processor model
• Cache configurations:
– 1-KB to 32-KB
– 32-byte block size
– 2, 4, 8 partitions vs. 2-, 4-, 8-way set-associative caches
• Mediabench suite
• CACTI for cache energy modeling

Page 14: Compiler Managed Partitioned Data Caches for Low Power


Reduction in Tag & Data-Array Checks

[Chart: average way accesses (0 to 8) vs. cache size (1-KB to 32-KB, plus the average) for 8-, 4-, and 2-partition caches.]

• 36% reduction in way accesses on an 8-partition cache

Page 15: Compiler Managed Partitioned Data Caches for Low Power


Improvement in Fetch Energy

[Chart: percentage energy improvement (0 to 60%) on a 16-KB cache for each Mediabench program (rawcaudio, rawdaudio, g721encode, g721decode, mpeg2dec, mpeg2enc, pegwitenc, pegwitdec, pgpencode, pgpdecode, gsmencode, gsmdecode, epic, unepic, cjpeg, djpeg) and the average, comparing 2-part vs. 2-way, 4-part vs. 4-way, and 8-part vs. 8-way.]

Page 16: Compiler Managed Partitioned Data Caches for Low Power


Summary

• Maintains the advantages of a hardware cache
• Exposes placement and lookup decisions to the compiler
– Avoid conflicts, eliminate redundancies
• 24% energy savings for a 4-KB cache with 4 partitions
• Extensions:
– Hybrid scratch-pads and caches
– Disable selected tags to convert partitions into scratch-pads
– 35% additional savings in a 4-KB cache with 1 partition as scratch-pad

Page 17: Compiler Managed Partitioned Data Caches for Low Power


Thank You & Questions

Page 18: Compiler Managed Partitioned Data Caches for Low Power


Cache Analysis Step 1: Instruction Fusioning

• Combine loads/stores that access the same set of objects
• Avoids coherence and duplication problems
• Uses points-to analysis

Fused instructions: M1, M2

for (i = 0; i < N1; i++) {
  ...
  for (j = 0; j < readInput1(); j++)
    y[i + j] += *w1++ + x[i + j];
  for (k = 0; k < readInput2(); k++)
    y[i + k] += *w2++ + x[i + k];
}

Memory operations: ld1/st1, ld2/st2, ld3, ld4, ld5, ld6

Page 19: Compiler Managed Partitioned Data Caches for Low Power


Partition Assignment

• Greedily place instructions based on their cache estimates
• Overlap instructions if required
• Compute the number of partitions for overlapped instructions
– Enumerate cliques within the interference graph
– Compute the combined working-set over all cliques

• Assign the R/U bit to control lookup

[Figure: interference graph over M1..M4 (each D = 1) with Clique 1 and Clique 2, as on slide 12.]

Page 20: Compiler Managed Partitioned Data Caches for Low Power


Related Work

• Direct-addressed and cool caches [Unsal '01, Asanovic '01]
– Tags maintained in registers that are addressed within loads/stores
• Split temporal/spatial cache [Rivers '96]
– Hardware managed, two partitions
• Column partitioning [Devadas '00]
– Individual ways can be configured as a scratch-pad
– No load/store-based partitioning
• Region-based caching [Tyson '02]
– Heap, stack, globals
– Our scheme allows finer-grained control and management
• Pseudo set-associative caches [Calder '96, Inoue '99, Albonesi '99]
– Reduce tag-check power
– Compromise cycle time
– Orthogonal to our technique

Page 21: Compiler Managed Partitioned Data Caches for Low Power


Code Size Overhead

[Chart: percentage of instructions added (0 to 12%) for each Mediabench program and the average, split into annotated LD/STs and extra MOV instructions; the largest overheads reach 15% and 16%.]