

Page 1: Evaluating the Reuse Cache for mobile processorspages.cs.wisc.edu/~swapnilh/resources/752_presn.pdf · Evaluating the Reuse Cache for mobile processors Lokesh Jindal, Urmish Thakker,

Evaluating the Reuse Cache for mobile processors

Lokesh Jindal, Urmish Thakker, Swapnil Haria
CS-752 Fall 2014

University of Wisconsin-Madison


Executive Summary

Problem: Mobile SoCs – Area is money!
• Cache area = 20-30% of SoC die area

Solutions?
• Technology scaling
  – Expensive, diminishing returns
• Reuse caches
  – Reduced cache size, comparable performance

Results
• 50% area reduction for 5% performance loss


Outline

• Motivation
• Implementation
• Methodology
• Performance Evaluation
• Future Directions and Conclusion


Caches in mobile SOCs


Area-Performance Tradeoffs*

[Figure: performance (y-axis) vs. cache size (x-axis); curve: Conventional Cache]

* Representative graph


Area-Performance Tradeoffs*

[Figure: performance (y-axis) vs. cache size (x-axis); curves: Conventional Cache, Reuse Cache]

* Representative graph


Motivation

Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería. 2013. The reuse cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).


Reuse Cache : Original idea

Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería. 2013. The reuse cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).


Jorge Albericio, Pablo Ibáñez, Víctor Viñals, and José M. Llabería. 2013. The reuse cache: downsizing the shared last-level cache. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).


Reuse Cache : good idea for desktop processors (seemingly)

but what about for mobile processors?


Revisiting Reuse Caches for the Mobile Environment

• Questions
  – Memory characteristics of mobile workloads?
  – Spatial/Temporal/Reuse locality at L2 level?
  – Reuse Cache performance for smaller, simpler caches?


Low temporal locality

% Dead Lines (on Fetch) = 1 - (# Reused Lines / # Fetched Lines)

% Dead Lines (on Eviction) = # Lines Evicted Unused / # Evicted Lines

% of dead lines per AsimBench benchmark:

Benchmark        Dead Lines (On Load)   Dead Lines (On Eviction)
baidumap         90.38                  85.83
bbench           90.89                  86.45
frozenbubble     90.15                  85.58
k9mail           90.34                  85.78
kingsoftoffice   90.25                  85.72
netease          90.77                  86.26


85-90% dead lines!
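The two dead-line metrics on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the example counter values are hypothetical, and the pairing of the first plotted series (90.x) with "Dead Lines (On Load)" is an assumption based on the legend order.

```python
from statistics import mean


def dead_on_fetch_pct(fetched: int, reused: int) -> float:
    """Percent of fetched lines that were never reused: (1 - reused/fetched) * 100."""
    return (1.0 - reused / fetched) * 100.0


def dead_on_eviction_pct(evicted: int, evicted_unused: int) -> float:
    """Percent of evicted lines that left the cache without ever being used."""
    return evicted_unused / evicted * 100.0


# Hypothetical counters, chosen only to exercise the formulas.
print(dead_on_fetch_pct(fetched=1000, reused=96))
print(dead_on_eviction_pct(evicted=500, evicted_unused=429))

# Chart values from the slide (assumed to be the "On Load" series), in the order
# baidumap, bbench, frozenbubble, k9mail, kingsoftoffice, netease.
ON_LOAD = [90.38, 90.89, 90.15, 90.34, 90.25, 90.77]
print(round(mean(ON_LOAD), 2))          # 90.46
print(round(100.0 - mean(ON_LOAD), 2))  # 9.54
```

The last value matches the 9.54% live-line figure quoted for the conventional cache in the Live-ness Analysis slide.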



Organization

• Tag Array
  – Inclusive (aids coherence)
  – Set associative
  – (Forward) pointer to data array entry
• Data Array
  – Set associative
  – Stores reused lines
  – (Reverse) pointer to tag array entry

[Figure: tag array and data array, each labeled "8 ways"]
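A minimal sketch of the decoupled tag and data arrays with forward and reverse pointers follows. It is not the authors' implementation: the class and method names are hypothetical, and both arrays are modeled as flat lists rather than set-associative structures to keep the pointer bookkeeping visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TagEntry:
    tag: int
    data_ptr: Optional[int] = None  # forward pointer; None means tag-only


@dataclass
class DataEntry:
    line: bytes
    tag_ptr: int  # reverse pointer to the owning tag entry


class ReuseCacheSketch:
    """Flat (not set-associative) model of the two arrays; names are hypothetical."""

    def __init__(self, num_tags: int, num_data: int) -> None:
        self.tags: List[Optional[TagEntry]] = [None] * num_tags
        self.data: List[Optional[DataEntry]] = [None] * num_data

    def insert_tag(self, tag_idx: int, tag: int) -> None:
        # Replacing a tag that owns data also frees its data slot.
        old = self.tags[tag_idx]
        if old is not None and old.data_ptr is not None:
            self.data[old.data_ptr] = None
        self.tags[tag_idx] = TagEntry(tag)

    def attach_data(self, tag_idx: int, data_idx: int, line: bytes) -> None:
        entry = self.tags[tag_idx]
        if entry is None:
            raise ValueError("no tag allocated at this index")
        # If this tag already owns a data slot, release it first.
        if entry.data_ptr is not None:
            self.data[entry.data_ptr] = None
        # The reverse pointer lets the evicted line's tag drop to tag-only.
        old = self.data[data_idx]
        if old is not None:
            self.tags[old.tag_ptr].data_ptr = None
        self.data[data_idx] = DataEntry(line, tag_idx)
        entry.data_ptr = data_idx
```

The reverse pointer is what makes data eviction cheap here: the data array can find and update the owning tag without searching the tag array.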


Coherence

• TO-MOESI protocol
  – Small extension to the MOESI protocol
• New Tag-Only state
• Tag+Data states: Modified, Shared, Owned, Exclusive
• Transitions triggered by data insertions/evictions
• Read the paper!
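The state split above can be pictured with a small enum. This is a sketch under assumptions, not the protocol from the paper: the Invalid state is taken from standard MOESI, the function names are hypothetical, and only the one transition the slide implies (a data eviction leaves the tag behind) is modeled. Writeback of dirty data and the state chosen on data insertion are deliberately left out.

```python
from enum import Enum


class State(Enum):
    I = "Invalid"      # assumed, from standard MOESI
    TO = "Tag-Only"    # new state: tag present, no data
    S = "Shared"
    E = "Exclusive"
    O = "Owned"
    M = "Modified"


# States in which the line has an entry in both the tag and the data array.
TAG_AND_DATA = {State.M, State.O, State.E, State.S}


def has_data(state: State) -> bool:
    return state in TAG_AND_DATA


def on_data_eviction(state: State) -> State:
    """The data entry is evicted but the tag stays, so the line becomes Tag-Only."""
    if not has_data(state):
        raise ValueError(f"{state.name} holds no data to evict")
    return State.TO
```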


Replacement

• Independent replacement policies for Tag and Data arrays

• Tag Array
  – Not Recently Reused (NRR)

• Data Array
  – Not Recently Used (NRU) + Shadow directory
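As an illustration of the data-array policy, here is a minimal sketch of textbook NRU victim selection for one set. The class name and the reset rule (clear all other bits once every way is marked) are one common formulation and are assumptions here; the shadow directory is not modeled. NRR for the tag array could plausibly be sketched the same way with the bit set on reuse rather than on any access, but that reading is also an assumption.

```python
from typing import List


class NRUSet:
    """One cache set with a 'recently used' bit per way (hypothetical sketch)."""

    def __init__(self, ways: int) -> None:
        self.used: List[bool] = [False] * ways

    def touch(self, way: int) -> None:
        """Mark a way as recently used on a hit or fill."""
        self.used[way] = True
        if all(self.used):
            # Every way looks recent, so keep only the newest mark.
            self.used = [False] * len(self.used)
            self.used[way] = True

    def victim(self) -> int:
        """Return the first way whose bit is clear."""
        for way, used in enumerate(self.used):
            if not used:
                return way
        return 0  # only reachable for a single-way set


s = NRUSet(ways=4)
s.touch(0)
s.touch(1)
print(s.victim())  # 2
```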


Shadow Directory

• Pomerene et al. proposed shadow directory
  – Transient lines that must be flushed quickly
  – Lines that become active after long periods of inactivity

Prefetching mechanism for a high speed buffer store, James Herbert Pomerene, Thomas Roberts Puzak, Rudolph Nathan Rechtschaffen, Frank John Sparacio, 1984, Patent EP 0157175 A2

[Figure: diagram with the labels No Tag, Tag, and Tag+Data and steps numbered 1 to 4]


Shadow Directory

• Pomerene et al. proposed shadow directory
  – Transient lines that must be flushed quickly
  – Lines that become active after long periods of inactivity
• Shadow bit added in reuse cache to break inefficient cycles

[Figure: diagram with the labels No Tag, Tag, and Tag+Data and steps numbered 1 to 4, annotated "Shadow bit set here" and "Least priority for eviction"]


Methodology

Baseline
• 32KB/4-way L1 I$ + 32KB/4-way L1 D$
• 4MB/8-way L2$ (Shared LLC)
• 8-wide OOO pipeline
• 1.4 GHz | 32nm | Low power device (based on ARM A15 specifications)

Workloads
• AsimBench
• SPEC CPU 2006
• Self-written microbenchmarks (functional verification)
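To make the baseline geometry concrete, the sketch below derives line and set counts from the sizes and associativities listed above. The 64-byte line size is an assumption (the slides do not state it), and the function name is hypothetical.

```python
from typing import Tuple

KB = 1024
MB = 1024 * KB


def cache_geometry(size_bytes: int, ways: int, line_bytes: int = 64) -> Tuple[int, int]:
    """Return (number of lines, number of sets) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return lines, sets


# Baseline from the slide, with an assumed 64-byte line.
print(cache_geometry(32 * KB, ways=4))  # (512, 128)    L1 I$ or D$
print(cache_geometry(4 * MB, ways=8))   # (65536, 8192) shared L2
```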


Workload – AsimBench*

• Popular Android applications
• 11 diverse applications
• 6 most memory-intensive apps selected for performance evaluation
  – BBench (Web Browser): Load web pages
  – K9Mail (Email): Load/Show emails
  – NeteaseNews (News): Check and load news
  – KingsoftOffice (Document Process): Open doc/xls/ppt file
  – BaiduMap (Map): Load map information of a specific area
  – FrozenBubble (Game): Load game

* Yongbing Huang, Zhongbin Zha, Mingyu Chen, Lixin Zhang. AsimBench: A Mobile Benchmark Suite for Architectural Simulators. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA, March 2014.


Performance - AsimBench

Simulation length of 5 Billion instructions after boot in Full System mode

[Figure: normalized IPC (axis 0-100) for baidumap, bbench, frozenbubble, k9mail, kingsoftoffice, netease; series: Reuse Cache (4M+1M), Conventional Cache (4M+4M)]

Average performance loss of 4.16% (not bad!)


Performance – SPEC CPU2006

(Simulation length of 10B instructions in System Emulation mode)

[Figure: normalized IPC (axis 0-100) per SPEC CPU2006 benchmark; series: Reuse Cache (4M+1M), Conventional Cache (4M+4M)]

Average performance drop of 7.17 %


Configuration Exploration (WIP)

[Figure: normalized IPC (axis 0.00-1.60); data labels: 1.10, 1.00, 1.05, 0.98, 1.49; legend: Reuse 4M + 2M (listed twice in the source)]


Multi-core Performance

[Figure: IPC (axis 0-3); series: Reuse Cache (4M + 1M), Conventional Cache (4M + 4M)]

SPEC benchmarks simulated on 4 cores, running 10 billion instructions each
(Full system + multi-core + AsimBench == too slow)

~15% Degradation


Area Improvements

                Conventional (4M + 4M)   Reuse Cache (4M + 1M)   Reuse Cache (4M + 2M)
Area (in mm²)   2.37                     1.18                    1.5

Area savings of 50% for the 4+1 reuse cache, 37% for the 4+2 cache
(Data generated using Cacti v6.5)
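The quoted savings follow directly from the three area figures. A short check of the arithmetic (the function name is hypothetical):

```python
def area_saving_pct(baseline_mm2: float, new_mm2: float) -> float:
    """Percent area saved relative to the conventional cache."""
    return (baseline_mm2 - new_mm2) / baseline_mm2 * 100.0


CONVENTIONAL = 2.37  # mm^2, conventional 4M + 4M
print(round(area_saving_pct(CONVENTIONAL, 1.18), 1))  # 50.2 (4M + 1M reuse cache)
print(round(area_saving_pct(CONVENTIONAL, 1.5), 1))   # 36.7 (4M + 2M reuse cache)
```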


% Improvement in Power

~10% improvement

[Figure: % improvement in power (axis 0-14)]

+ Reduced leakage power
- Increased DRAM power


Live-ness Analysis - AsimBench

• % Live lines increased from 9.54% to 47.95%

[Figure: % live lines (axis 0-100); series: Conventional Cache (4M+4M), Reuse Cache (4M+1M)]


Pictures!

• AsimBench apps running on gem5 …


Future Directions

• Replacement policies (NRR + Shadow) for fully associative data array
• Explore allotment policies
• Better uses for extra tag entries
• Longer/better simulations using SimPoints
• Multi-core simulations using AsimBench to validate coherence protocols
• Comprehensive energy analysis


Summary

For mobile workloads, reuse cache promises
+ Significant area reduction (~50%)
+ Decent power savings (~10%)
- Marginal performance loss (~6%)


Questions?

Thank You!!


BACKUP


Configuration Exploration

[Figure: increased power in mW (y-axis 0-250) for configurations 1 to 5]


L2 => 1 MB Cache
Animate this to cross out 1 MB and say 50% of data area

L2 => 4 MB Cache
Cross out and say Performance of 16 MB Cache


Results

• Mapping exploration -> IPC, ASIM + SPEC
• 4+4 vs (4+1): SPEC, AsimBench
• Shadow vs NRR
• Boot times
• Bigger is better
• CACTI graph
• Energy graph
• Dead lines


Baseline

• 32KB/4-way L1 I$ + 32KB/4-way L1 D$
• 4MB/8-way L2$ (Shared LLC)
• 8-wide OOO pipeline
• 1.4 GHz | 32nm | Low power device
• Based on ARM A15 specifications


Challenges

• ARM aarch32 incompatible with gem5’s Ruby memory model

• Simulation infrastructure for ARM in full system mode


Contributions