Density Tradeoffs of Non-Volatile Memory as a Replacement for SRAM based Last Level Cache
Kunal Korgaonkar, Ishwar Bhati, Huichu Liu, Jayesh Gaur, Sasikanth Manipatruni, Sreenivas
Subramoney, Tanay Karnik, Steven Swanson, Ian Young and Hong Wang
• Next-generation NVM has potential for bigger and more energy-efficient LLC
o Potential ~3x capacity gain over state-of-the-art SRAM with logic-compatible process, non-volatility
o Write error rate (WER) target for industry LLC adoption increases write latency in practice
• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in LLC
Novel Non-volatile Memory based Last Level Cache (NVMLLC) Architecture
Agenda
• Motivation & problem
• Current solutions
• Our proposals
• Results
NVMs offer capacity advantages over SRAMs for LLC
• NVMs promise high density
o Spin Torque Transfer (STT) RAM, Spin Hall Effect (SHE) MRAM, etc.
• Can build large LLCs
o Significant power/density benefits over SRAM LLC
Advantage of increasing LLC capacity
[Figure: bar chart of performance normalized to 4MB SRAM LLC. SRAM 4MB: 1.00; SRAM 8MB: 1.15; SRAM 16MB: 1.23]
But, high write latency negates the capacity gains
[Figure: bar chart of performance normalized to 4MB SRAM LLC. SRAM 4MB: 1.00; STTRAM 8MB WR +0ns: 1.15; STTRAM 8MB WR +5ns: 1.13; STTRAM 8MB WR +10ns: 1.05; STTRAM 8MB WR +20ns: 0.85]
None of the current techniques reduce the write latency enough
• Architectural Techniques
o Dead block predictor for bypassing
o LAP
o Hybrid Cache
• Circuit and Device Techniques
o Increase bit-cell transistor size, trade off latency against retention/higher WER, new devices, etc.
Our Proposal:
1. Reduce Write Interference
2. Eliminate Redundant Writes
Reduce Write Interference
• Many programs exhibit long phases of high read or write traffic
• Usual Dead Block Predictor based bypassing is not sufficient
• Need more aggressive write bypassing to reduce write interference
[Figure: number of requests arrived per interval of 10k cycles for gcc.200; series: num_writes, num_reads]
Write Congestion Aware Bypassing (WCAB)
• Bypass decision
o Request queue is full && pending writes > write_th? If NO: Don’t bypass
o Get average write occupancy calculated in intervals (int_write_occ)
o Refer Lookup Table to find bypass score threshold (byp_score_th) for int_write_occ
o Find pending write with lowest live score (min_score)
o min_score <= byp_score_th? If NO: Don’t bypass. If YES: Bypass write with min_score
• Request scheduling
o If any read ready: Send read. If NO: Send write
• Lookup Table (Tuned)
  Interval write occupancy (int_write_occ)    Bypass score threshold (byp_score_th)
  1/4th of request queue                      20%
  Half of request queue                       50%
  3/4th of request queue                      70%
  Equal to request queue                      100%
• write_th: 75% of request queue
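The WCAB flow and lookup table above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The names `PendingWrite`, `bypass_score_threshold` and `select_bypass` are hypothetical. Scores and occupancies are assumed to be normalized to [0, 1], and the rule for mapping an occupancy to a table row (first row that covers it) is an assumption, since the slide only gives the four tuned points.

```python
from dataclasses import dataclass

# Lookup table from the slide: (interval write occupancy as a fraction of the
# request queue, bypass score threshold as a fraction of the maximum score).
LOOKUP_TABLE = [(0.25, 0.20), (0.50, 0.50), (0.75, 0.70), (1.00, 1.00)]
WRITE_TH = 0.75  # write_th: 75% of the request queue


@dataclass
class PendingWrite:
    addr: int
    live_score: float  # assumed normalized to [0, 1]; lower = less likely to be reused


def bypass_score_threshold(int_write_occ: float) -> float:
    """Assumption: use the first table row whose occupancy covers int_write_occ."""
    for occupancy, threshold in LOOKUP_TABLE:
        if int_write_occ <= occupancy:
            return threshold
    return LOOKUP_TABLE[-1][1]


def select_bypass(queue_size, num_reads, writes, int_write_occ):
    """Return the pending write to bypass, or None for "don't bypass"."""
    queue_full = num_reads + len(writes) >= queue_size
    if not (queue_full and len(writes) > WRITE_TH * queue_size):
        return None  # no write congestion
    victim = min(writes, key=lambda w: w.live_score)  # min_score
    if victim.live_score <= bypass_score_threshold(int_write_occ):
        return victim
    return None
```

With a 16-entry queue holding 13 writes and 3 reads, an interval occupancy of half the queue selects the 50% threshold, so a write with live score 0.3 would be bypassed while one with live score 0.8 would not.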
Eliminate Redundant Writes
• A significant percentage of fills in the LLC are frequent clean and dirty fills
• Dirty fills generate writes in both Exclusive and Inclusive LLC
• Clean fills create writes in Exclusive LLC
[Figure: percentage of frequent clean and dirty fills in LLC (0% to 100%); series: frequent clean fills, frequent dirty fills, one time fills]
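The fill rules on the Eliminate Redundant Writes slide can be written as a small predicate. This is a sketch that models only the slide's statements about fills arriving at the LLC. The function names and the `"exclusive"` / `"inclusive"` policy strings are hypothetical, and other sources of LLC writes are not modelled.

```python
def fill_writes_llc(line_dirty: bool, llc_policy: str) -> bool:
    """Return True if filling this line into the LLC costs an LLC write."""
    if llc_policy not in ("exclusive", "inclusive"):
        raise ValueError(f"unknown LLC policy: {llc_policy}")
    if line_dirty:
        return True  # dirty fills generate writes in both policies
    return llc_policy == "exclusive"  # clean fills write only in an exclusive LLC


def count_llc_writes(fills, llc_policy: str) -> int:
    """fills is an iterable of booleans, True meaning a dirty fill."""
    return sum(fill_writes_llc(dirty, llc_policy) for dirty in fills)
```

For a sequence of two dirty and two clean fills, this counts four LLC writes under the exclusive policy and two under the inclusive one.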
Virtual Hybrid Cache (VHC)
• Write Merging in L2
o Frequent dirty lines stay in L2 for longer
o Used existing technique to classify frequent dirty lines
o Many writes merge in L2, reducing fills in LLC
• Relaxed Exclusivity (duplicate lines between L2 and LLC)
o Enhancement over LAP for Exclusive Cache
o Retain the duplicate lines near LRU to reduce hit rate loss
o Dirty lines (whenever found) not duplicated in LLC
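The effect of write merging in L2 can be illustrated with a toy count of dirty fills reaching the LLC. This is a simplified sketch, not the VHC mechanism itself. The set of frequent dirty lines is taken as an input because the slide only says an existing technique classifies them, and the model assumes a retained line absorbs all of its writes in L2 and reaches the LLC exactly once. The name `llc_dirty_fills` is hypothetical.

```python
from collections import Counter


def llc_dirty_fills(writebacks, frequent_dirty=frozenset()):
    """Count dirty fills reaching the LLC for a sequence of L2 writeback addresses.

    Assumption: lines in frequent_dirty stay in L2, so their repeated writes
    merge there and produce a single LLC fill; every other writeback fills
    the LLC each time it occurs.
    """
    fills = 0
    for addr, occurrences in Counter(writebacks).items():
        fills += 1 if addr in frequent_dirty else occurrences
    return fills
```

For the sequence A, B, A, A, C, A this gives six LLC fills without merging and three when line A is treated as frequent dirty.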
Simulation Methodology & Results
Simulation Methodology
• Used a modified version of Multi2Sim simulating 4 x86 cores
• Core parameters similar to Intel Skylake
• SRAM baseline: 4MB, 4 banks, 16 ways with round-trip delay of 20 cycles
• STTRAM baseline: 8MB, 8 banks, additional write latency of 20ns
• Workloads:
o Selected 20 workloads from SPEC 2006 and HPCG
o With high L2 MPKI and a range of LLC MPKIs (Table 1 in the paper)
o 20 homogeneous and 44 heterogeneous mixes (the latter by randomly mixing the 20 workloads)
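One way such workload mixes could be generated for a 4-core simulation is sketched below. This is only an illustration of the idea. The workload names are placeholders, and the sampling scheme (four distinct workloads per heterogeneous mix, fixed seed) is an assumption, since the slide does not say how the random mixing was done.

```python
import random

NUM_CORES = 4  # the slide simulates 4 x86 cores


def build_mixes(workloads, num_hetero=44, seed=0):
    """Return (homogeneous, heterogeneous) lists of per-core workload assignments."""
    rng = random.Random(seed)  # fixed seed so the mixes are reproducible
    # Homogeneous: every core runs a copy of the same workload.
    homogeneous = [[w] * NUM_CORES for w in workloads]
    # Heterogeneous (assumed scheme): pick NUM_CORES distinct workloads at random.
    heterogeneous = [rng.sample(workloads, NUM_CORES) for _ in range(num_hetero)]
    return homogeneous, heterogeneous


# Placeholder names standing in for the 20 selected workloads.
workloads = [f"wl{i:02d}" for i in range(20)]
homogeneous, heterogeneous = build_mixes(workloads)
```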
Performance vs STTRAM LLC Baseline
[Figure: performance normalized to STTRAM 8MB baseline; series: WCAB, WCAB+VHC]
Our proposals provide a 26% performance gain over the baseline
Performance vs Similar Area SRAM LLC
[Figure: performance normalized to SRAM 4MB]
  Configuration    STT - baseline    STT - Proposed Architecture
  5 ns, 7 MB       1.10              1.12
  10 ns, 12 MB     1.07              1.18
  20 ns, 16 MB     0.87              1.12
  30 ns, 20 MB     0.71              1.03
Our proposals provide up to an 18% performance gain over the SRAM of the same area
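The "up to 18%" figure can be checked against the chart's data labels with a few lines of Python. This sketch assumes the eight labels map to the two series in legend order (baseline first, then proposed architecture), which is an inference from the slide layout. The helper names are hypothetical.

```python
CONFIGS = ["5 ns, 7 MB", "10 ns, 12 MB", "20 ns, 16 MB", "30 ns, 20 MB"]
# Performance normalized to SRAM 4MB, read from the chart labels
# (assumed series assignment: baseline first, proposed second).
STT_BASELINE = [1.10, 1.07, 0.87, 0.71]
STT_PROPOSED = [1.12, 1.18, 1.12, 1.03]


def best_config(configs, perf):
    """Return the (configuration, performance) pair with the highest performance."""
    i = max(range(len(perf)), key=perf.__getitem__)
    return configs[i], perf[i]


def gain_over_baseline(baseline, proposed):
    """Per-configuration speedup of the proposed architecture over the STT baseline."""
    return [p / b for b, p in zip(baseline, proposed)]


config, perf = best_config(CONFIGS, STT_PROPOSED)
print(f"best: {config}, {round((perf - 1) * 100)}% over SRAM 4MB")
```

Under the same assumption, `gain_over_baseline` shows the proposals helping most at the highest write latency, where the baseline falls furthest below SRAM.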
Performance vs Prior Art
[Figure: performance normalized to 8MB STTRAM baseline]
  Configuration                           Homogeneous    Heterogeneous    Geomean
  Hybrid LLC - 2MB SRAM, 4MB STTRAM       1.10           1.09             1.09
  Hybrid LLC - 1MB SRAM, 6MB STTRAM       1.05           1.03             1.04
  STTRAM LLC - LAP                        1.07           1.13             1.11
  STTRAM LLC - Proposed Architecture      1.18           1.30             1.26
Our proposals perform significantly better than the prior art
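The Geomean bars in this chart appear consistent with a geometric mean over all 64 workload mixes (20 homogeneous, 44 heterogeneous). The sketch below checks that reading. It assumes the chart's data labels group by configuration in legend order and that the Homogeneous and Heterogeneous bars are themselves geometric means of their groups. Neither assumption is stated on the slide, and `weighted_geomean` is a hypothetical helper.

```python
import math


def weighted_geomean(values, weights):
    """Geometric mean of values, each counted weights[i] times."""
    total = sum(weights)
    return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)


# (homogeneous, heterogeneous, reported geomean), as read from the chart labels.
SERIES = {
    "Hybrid LLC - 2MB SRAM, 4MB STTRAM": (1.10, 1.09, 1.09),
    "Hybrid LLC - 1MB SRAM, 6MB STTRAM": (1.05, 1.03, 1.04),
    "STTRAM LLC - LAP": (1.07, 1.13, 1.11),
    "STTRAM LLC - Proposed Architecture": (1.18, 1.30, 1.26),
}

for name, (homog, hetero, reported) in SERIES.items():
    estimate = weighted_geomean((homog, hetero), (20, 44))
    print(f"{name}: estimated {estimate:.3f}, reported {reported:.2f}")
```

Each estimate lands within rounding distance of the reported value, including the 1.26 for the proposed architecture that matches the 26% gain quoted earlier.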
Conclusions
• Next-generation NVM has potential for bigger and more energy-efficient LLC
• Architectural solutions are required to absorb the high write latency and obtain capacity benefits
• Our proposed solutions show good performance gains and can help make NVM a viable replacement for SRAM in LLC
THANK YOU!!