density tradeoffs of non-volatile memory as a replacement ... · subramoney, tanay karnik, steven...

Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache

Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang

• Next-generation NVM has potential for bigger and more energy-efficient LLCs
  o Potential ~3x capacity gain over state-of-the-art SRAM, with a logic-compatible process and non-volatility
  o Write error rate (WER) target for industry LLC adoption increases write latency in practice

• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

Novel Non-volatile Memory based Last Level Cache (NVMLLC) Architecture

Agenda

• Motivation & problem

• Current solutions

• Our proposals

• Results

NVMs offer capacity advantages over SRAMs for LLC

● NVMs promise high density
  ● Spin Torque Transfer (STT) RAM, Spin Hall Effect (SHE) MRAM, etc.
● Can build large LLCs
● Significant power/density benefits over SRAM LLC

Advantage of increasing LLC capacity

[Figure: performance normalized to 4MB SRAM LLC. SRAM 4MB: 1.00; SRAM 8MB: 1.15; SRAM 16MB: 1.23]

But high write latency negates the capacity gains

[Figure: performance normalized to 4MB SRAM LLC. SRAM 4MB: 1.00; STTRAM 8MB WR +0ns: 1.15; STTRAM 8MB WR +5ns: 1.13; STTRAM 8MB WR +10ns: 1.05; STTRAM 8MB WR +20ns: 0.85]

None of the current techniques reduce the write latency enough

• Architectural Techniques
  • Dead block predictor for bypassing
  • LAP
  • Hybrid Cache
• Circuit and Device Techniques
  • Increase bit-cell transistor size, trade off latency against retention/higher WER, new devices, etc.

Our Proposal:

1. Reduce Write Interference
2. Eliminate Redundant Writes

Reduce Write Interference

• Many programs exhibit long phases of high read/write activity

• The usual Dead Block Predictor based bypassing is not sufficient

• More aggressive write bypassing is needed to reduce write interference

[Figure: number of requests arrived (num_writes and num_reads) per interval of 10k cycles, for gcc.200]

Write Congestion Aware Bypassing (WCAB)

Bypass decision (flowchart):

1. Is the request queue full && pending writes > write_th? If NO, don't bypass.
2. Get the average write occupancy calculated in intervals (int_write_occ).
3. Refer to the Lookup Table to find the bypass score threshold (byp_score_th) for int_write_occ.
4. Find the pending write with the lowest live score (min_score).
5. Is min_score <= byp_score_th? If NO, don't bypass; otherwise bypass the write with min_score.

Request issue: if any read is ready, send the read; if NO, send a write.

Lookup Table (Tuned)

Interval write occupancy (int_write_occ)    Bypass score threshold (byp_score_th)
1/4th of request queue                      20%
Half of request queue                       50%
3/4th of request queue                      70%
Equal to request queue                      100%

write_th: 75% of request queue
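The WCAB bypass decision can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PendingWrite`, `bypass_score_threshold` and `select_bypass` are hypothetical, live scores are assumed to be normalized to the range 0..1 so they can be compared against the percentage thresholds, and the interval write occupancy is assumed to be bucketed to the first lookup-table row that covers it.

```python
from dataclasses import dataclass
from typing import List, Optional

# (occupancy bound as a fraction of the request queue, byp_score_th), from the tuned lookup table.
LOOKUP_TABLE = [(0.25, 0.20), (0.50, 0.50), (0.75, 0.70), (1.00, 1.00)]
WRITE_TH_FRACTION = 0.75  # write_th: 75% of the request queue


@dataclass
class PendingWrite:
    addr: int
    live_score: float  # assumption: normalized 0..1, lower means less likely to be reused


def bypass_score_threshold(int_write_occ: float, queue_size: int) -> float:
    """Map interval write occupancy to byp_score_th (bucketing rule is an assumption)."""
    frac = int_write_occ / queue_size
    for bound, threshold in LOOKUP_TABLE:
        if frac <= bound:
            return threshold
    return LOOKUP_TABLE[-1][1]


def select_bypass(pending_writes: List[PendingWrite], pending_total: int,
                  queue_size: int, int_write_occ: float) -> Optional[PendingWrite]:
    """Return the write to bypass, or None when the flowchart says 'don't bypass'."""
    queue_full = pending_total >= queue_size
    congested = len(pending_writes) > WRITE_TH_FRACTION * queue_size
    if not (queue_full and congested):
        return None
    threshold = bypass_score_threshold(int_write_occ, queue_size)
    victim = min(pending_writes, key=lambda w: w.live_score)
    return victim if victim.live_score <= threshold else None
```

With a 16-entry queue holding 13 writes and 3 reads, an interval occupancy of 8 entries selects the 50% threshold, so a pending write with a live score of 0.3 would be bypassed; at an occupancy of 4 entries the threshold drops to 20% and the same write is kept.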

Eliminate Redundant Writes

• A significant percentage of LLC fills are frequent clean and dirty fills
• Dirty fills generate writes in both Exclusive and Inclusive LLCs
• Clean fills create writes in an Exclusive LLC

[Figure: percentage of frequent clean and dirty fills in LLC (0% to 100%); series: frequent clean fills, frequent dirty fills, one time fills]

Virtual Hybrid Cache (VHC)

• Write Merging in L2
  • Frequent dirty lines stay in L2 for longer
  • Used an existing technique to classify frequent dirty lines
  • Many writes merge in L2, reducing fills in LLC
• Relaxed Exclusivity (duplicate lines between L2 and LLC)
  • Enhancement over LAP for Exclusive Cache
  • Retain the duplicate lines near LRU to reduce hit rate loss
  • Dirty lines (whenever found) not duplicated in LLC
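The relaxed-exclusivity idea can be illustrated with a toy model. The sketch below is one possible reading of the bullets, not the VHC design itself: the class `ToyLLC` and its methods are hypothetical, the cache is fully associative with no capacity limit, and "near LRU" is simplified to "at the LRU position".

```python
from collections import OrderedDict


class ToyLLC:
    """Toy LLC: OrderedDict order runs from LRU (first) to MRU (last)."""

    def __init__(self):
        self.lines = OrderedDict()  # addr -> True
        self.writes = 0             # number of writes into the NVM array

    def on_hit(self, addr, dirty):
        """A line hits in the LLC and is moved up to L2."""
        if addr not in self.lines:
            return
        if dirty:
            # Dirty lines are not duplicated: drop the LLC copy.
            del self.lines[addr]
        else:
            # Relaxed exclusivity: keep the clean duplicate, demoted to the LRU end.
            self.lines.move_to_end(addr, last=False)

    def on_l2_evict(self, addr, dirty):
        """L2 evicts a line; return True if an LLC write is required."""
        if not dirty and addr in self.lines:
            return False  # clean duplicate already present: redundant write avoided
        self.lines[addr] = True
        self.lines.move_to_end(addr)  # fill at the MRU position
        self.writes += 1
        return True
```

In this model a clean line that bounces between L2 and the LLC costs one NVM write on its first fill and none afterwards, whereas a strictly exclusive LLC would write it on every L2 eviction.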

Simulation Methodology & Results

Simulation Methodology

• Used a modified version of Multi2Sim simulating 4 x86 cores
• Core parameters similar to Intel Skylake
• SRAM baseline: 4MB, 4 banks, 16 ways, with a round-trip delay of 20 cycles
• STTRAM baseline: 8MB, 8 banks, additional write latency of 20ns
• Workloads:
  • Selected 20 workloads from SPEC 2006 and HPCG
  • With high L2 MPKI and a range of LLC MPKIs (Table 1 in the paper)
  • 20 homogeneous and 44 heterogeneous mixes (the latter by randomly mixing the 20 workloads)
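The workload mixes could be generated along the following lines. This is a sketch under assumptions the slides do not confirm: one workload per core, heterogeneous mixes drawn without repetition inside a mix, and a fixed seed for reproducibility. The workload names are placeholders; the actual list is in Table 1 of the paper.

```python
import random


def build_mixes(workloads, num_cores=4, num_hetero=44, seed=0):
    """Return (homogeneous, heterogeneous) lists of per-core workload assignments."""
    rng = random.Random(seed)
    # Homogeneous: the same workload runs on every core.
    homogeneous = [[w] * num_cores for w in workloads]
    # Heterogeneous: pick num_cores distinct workloads at random for each mix.
    heterogeneous = [rng.sample(workloads, num_cores) for _ in range(num_hetero)]
    return homogeneous, heterogeneous


# Placeholder names standing in for the 20 selected SPEC 2006 and HPCG workloads.
WORKLOADS = [f"wl{i:02d}" for i in range(20)]
homogeneous, heterogeneous = build_mixes(WORKLOADS)
```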


Performance vs STTRAM LLC Baseline

[Figure: per-workload performance normalized to the STTRAM 8MB baseline; series: WCAB, WCAB+VHC]

Our proposals provide 26% performance gain over the baseline


Performance vs Similar Area SRAM LLC

[Figure: performance normalized to SRAM 4MB]

Configuration    STT - baseline    STT - Proposed Architecture
5 ns, 7 MB       1.10              1.12
10 ns, 12 MB     1.07              1.18
20 ns, 16 MB     0.87              1.12
30 ns, 20 MB     0.71              1.03

Our proposals provide up to 18% performance gain over the SRAM of same area

Performance vs Prior Art

[Figure: performance normalized to the 8MB STTRAM baseline]

Configuration                         Homogeneous    Heterogeneous    Geomean
Hybrid LLC - 2MB SRAM, 4MB STTRAM     1.10           1.09             1.09
Hybrid LLC - 1MB SRAM, 6MB STTRAM     1.05           1.03             1.04
STTRAM LLC - LAP                      1.07           1.13             1.11
STTRAM LLC - Proposed Architecture    1.18           1.30             1.26

Our proposals perform significantly better than the prior art
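As a worked check on how the overall figure relates to the per-group numbers, the sketch below combines the homogeneous and heterogeneous values for the proposed architecture (read from the chart labels as 1.18 and 1.30) into a geometric mean weighted by the 20 homogeneous and 44 heterogeneous runs. It assumes each group value is itself a geometric mean over its workloads; the function name `weighted_geomean` is illustrative.

```python
import math


def weighted_geomean(values_and_counts):
    """Geometric mean of group means, each weighted by its number of workloads."""
    total = sum(count for _, count in values_and_counts)
    log_sum = sum(count * math.log(value) for value, count in values_and_counts)
    return math.exp(log_sum / total)


# Proposed architecture: 20 homogeneous mixes at 1.18, 44 heterogeneous mixes at 1.30.
proposed = weighted_geomean([(1.18, 20), (1.30, 44)])
print(round(proposed, 2))  # prints 1.26
```

Under that assumption the result agrees with the 26% overall gain reported for the proposed architecture.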


Conclusions

• Next-generation NVM has potential for bigger and more energy-efficient LLCs

• Architectural solutions are required to absorb the high write latency and obtain the capacity benefits

• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in the LLC

THANK YOU!!
