ractical nonvolatile multilevel ell phase change … · •proposal: 3lc pcm –simple, genuinely...

Post on 15-Aug-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

PRACTICAL NONVOLATILE MULTILEVEL-CELL

PHASE CHANGE MEMORY

Jichuan Chang,

Robert S. Schreiber,

Norman P. Jouppi

Hewlett-Packard Labs

Doe Hyun Yoon

IBM T. J. Watson Research Center

MEMORY CAPACITY CHALLENGE IN HPC

• DRAM as main memory

– Scaling is slowing down

• Hard to meet ever-increasing capacity demand

• Byte-addressable nonvolatile memory

– Phase change memory (PCM), memristor, …

– Scales better than DRAM

– Multilevel-cell (MLC) capability

– Nonvolatility

• Checkpoint, in-situ post processing

• High-performance file system

• NV MLC PCM for continued capacity scaling 2

MAJOR CHALLENGE: RESISTANCE DRIFT • Conventional 4-Level-Cell (4LC) Designs

– Naïve 4LC is useless

– Optimized 4LC is only barely usable

– Still need refresh -- it’s volatile memory

• Observation: Most errors in 4LC occur in one cell state

• Proposal: 3-Level-Cell (3LC) PCM – Simple, genuinely nonvolatile (>10 years retention)

– 3-ON-2 and mark-and-spare • Low-cost wearout tolerance for 3LC

– 1.41 bits/cell (vs. 1.52 in 4LC) • Only 7% lower capacity than (volatile) 4LC

3

PCM AND RESISTANCE DRIFT

4

PHASE CHANGE MEMORY • Best of DRAM and Flash

– Higher capacity, better scaling (vs. DRAM)

– Faster, byte-addressable NVM (vs. Flash)

• MLC (Multilevel-Cell) capability

– Store more than 1 bits per cell

• Ex) 2 bits per cell

• Caveats:

– Slow, low-bandwidth write

– Finite write endurance

– Resistance drift 5

Common problems

in both SLC and MLC

RESISTANCE DRIFT • PCM Cell resistance increases over time

– R(t), cell resistance at time t (t >0)

• A cell is programmed at t =0

• Sensed as R0 at time t0 (>0)

• : drift rate (0<<1)

• Drift errors

– Negligible in SLC PCM

– Major reliability problem in MLC PCM

6

0

0)(t

tRtR

DRIFT ERRORS IN 4LC PCM • 4 cell states: S1, S2, S3, S4

– PDF is truncated Gaussian

• ±2.75 around mean values

• Mean resistance values: 1, 2, 3, 4

– Threshold between states: 1, 2, 3

• Drift rate () increases with cell resistance

7 log10R

3.5 3 2.5 6.5 4 4.5 5 5.5 6

S1 S2 S3 S4

1 2 3

1 2 3 4

1E-10

1E-09

1E-08

1E-07

1E-06

1E-05

1E-04

1E-03

1E-02

1E-01

1E+00

2s 32s 17min 9hour 12day 1year 34year 1089year 34865year

DRIFT ERROR RATES • Monte-Carlo simulation

• Errors only in S2 and S3

8

S2

S3

Time

Fra

ctio

n o

f c

ells

with

an

err

or

REFRESH • Refresh before cells loose their data

– Consume already limited PCM write BW

– Too frequent refresh will make PCM unavailable

to users

• What PCM refresh interval is acceptable?

– At least 50% of write BW should be

available to users

– Refresh interval >17 minutes

• Caveat: PCM w/ refresh is no longer nonvolatile

9

CELL ERROR RATE • What cell error rate is tolerable?

– Goal: 10-year device MTBF

• Fewer than 1 erroneous 64B block

in a 16GB device for 10 years

– CER >1e-2

• Impossible to achieve the goal

even with unrealistically strong ECC

– CER ~1e-3 @ 17min refresh

• Barely meets the goal with BCH-10

• More analysis in the paper

10

BASELINE 4LC PCM

11

NAÏVE DESIGN: 4LCN

• Equal probability for all 4 states

• 17min refresh caps CER at ~1e-2

12

1E-10

1E-09

1E-08

1E-07

1E-06

1E-05

1E-04

1E-03

1E-02

1E-01

1E+00

2s 32s 17min 9hour 12day 1year 34year 1089year 34865year

Fra

ctio

n o

f c

ells

with

an

err

or

Refresh Interval

CER~1e-2

Unacceptable

Refresh interval > 17 min

4LCn

13

OPTIMAL STATE MAPPING • Drift only increases cell resistance

• Optimize 2, 3, 1, 2, 3 to minimize CER

– minimize CER(2, 3, 1, 2, 3)

– subject to i+2.75+<i< i+1-2.75-

– for i=1,2,3

0

0.5

1

1.5

2

2.5

2.5 2.75 3 3.25 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5

pd

f o

f ce

ll re

sis

tan

ce

S1 S4 S2 S3

Simple

mapping Optimal

mapping

1 2 3

minimum

spacing

1 2 3 4

OPTIMAL STATE MAPPING: 4LCO

• CER ~1e-3 @ 17-min refresh

• With BCH-10, it meets the goal

14

1E-10

1E-09

1E-08

1E-07

1E-06

1E-05

1E-04

1E-03

1E-02

1E-01

1E+00

2s 32s 17min 9hour 12day 1year 34year 1089year 34865year

Fra

ctio

n o

f c

ells

with

an

err

or

Refresh Interval

4LCo

CER~1e-3, barely usable

with 10-bit correcting ECC

4LCn

Refresh interval > 17 min

PROPOSAL:

3LC PCM

15

0

0.5

1

1.5

2

2.5

2.5 2.75 3 3.25 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5

pd

f o

f ce

ll re

sis

tan

ce

S3

3 0

0.5

1

1.5

2

2.5

2.5 2.75 3 3.25 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5

pd

f o

f ce

ll re

sis

tan

ce

PROPOSAL: 3LC PCM

16

Wide margin

S1 S2 S4

1 2

Simple

mapping Optimal

mapping

• Observation:

– Most errors occur in one state (S3)

• DO NOT USE IT

– Wide Margin for S2

• Simple and optimal mapping (3LCn & 3LCo)

1 2 4

3LC DESIGNS (3LCN AND 3LCO) • Reliable for >10 years w/o ECC & refresh

• Genuinely nonvolatile

17

1E-10

1E-09

1E-08

1E-07

1E-06

1E-05

1E-04

1E-03

1E-02

1E-01

1E+00

2s 32s 17min 9hour 12day 1year 34year 1089year 34865year

Fra

ctio

n o

f c

ells

with

an

err

or

Refresh Interval

4LCo

3LCn

3LCo

CER~1e-9 at 16 years

No ECC, No refresh

4LCn

3LC PCM DESIGN ISSUES • How to store information?

– Binary information in ternary cells

• What about wearout failures?

• How to compensate for

the reduced cell density?

– 3LC’s ideal capacity is 1.58 bits/cell (log23)

– vs. 2 bits/cell in 4LC

18

HOW TO STORE BINARY INFO IN TERNARY CELLS?

• 3-ON-2

– Store three bits in two ternary cells

– 64B (512-bit) data block in 342 cells

• 9 states in 2 ternary cells

• 8 states for 3-bit data

• INVALID state

– (S4, S4)

– Use this for tolerating

wearout failures

19

First

cell

Second

cell

3-bit

data

S1 S1 000

S1 S2 001

S1 S4 010

S2 S1 011

S2 S2 100

S2 S4 101

S4 S1 110

S4 S2 111

S4 S4 INVALID

TOLERATING WEAROUT FAILURES IN 3LC

• PCM has only finite write endurance

– ~108 writes per cell

• Mark-and-spare

– A low-cost wearout failure tolerance for 3LC

– Use 3LC’s INVALID state for marking a cell pair with

a failure

– No need to store failed-cell location

– 2 spare cells per failure

• c.f. ECP [Schechter+ ISCA’10 ]

– Need a pointer and a spare cell for a failure

– 5 cells per failure with 512-bit data block and 4LC 20

• Use INVALID (S4, S4) to mark

a cell pair w/ failure

– A stuck-at cell stuck can be revived by

applying reverse current [Goux+ IEEE TED’09]

• Need a spare pair for tolerating a failure

A pair w/

failure

MARK-AND-SPARE EXAMPLE

21

Wearout

failure

A ternary cell A cell pair

for 3 bits

D0 D1 D2 D3 D4 D5 D6 D7 S0 S1

HOW TO CORRECT WEAROUT FAILURES?

22

342

256 31

10

50

0 50 100 150 200 250 300 350 400

3LC

4LC

342

256

12

31

10

50

0 50 100 150 200 250 300 350 400

3LC

4LC

CAPACITY: 3LC VS. 4LC

23

• 64B (512-bit) block

• 3LC needs fewer bits than 4LC for error correction

– 6 wearout failures:

Mark-and-spare (2cells/failure) vs. ECP (5cells/failure)

– Drift errors: BCH-1 vs. BCH-10

• 3LC: 1.41 bits/cell, 4LC: 1.52 bits/cell

• Besides, 3LC is nonvolatile

7%

Data Wearout failure correction Drift error correction

# cells

CAPACITY VS. # WEAROUT FAILURES • MLC has worse endurance than that of SLC

• May need to tolerate more than

6 wearout failures

24

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

bits/c

ell

# Wearout failures tolerated

4LC

3LC

COMPARISON TO TRI-LEVEL-CELL PCM

25

• Recent work on MLC drift errors [ISCA’13] – Same observation

• Most errors occur in the S3 state

– Same solution • Use 3 levels instead of 4 levels

• TLC paper does not address – Wearout failures

– Optimal resistance/threshold mapping • Baseline 4LC is overly pessimistic – not usable at all

• Unique feature in TLC paper – Bandwidth-Enhanced writes

MLC PCM FOR CONTINUED CAPACITY SCALING

• Major challenge: resistance drift

• Conventional 4LC PCM is not practical – Strong ECC and frequent refresh:

• Performance/power penalty

• Loose nonvolatility

• Proposal: 3LC PCM – Simple, genuinely nonvolatile

– 3-ON-2 & Mark-and-spare • Low-cost wearout tolerance mechanism for 3LC

– Only 7% lower capacity than (volatile) 4LC

• Generalized non-power-of-two level cells – 5LC, 6LC, …

26

top related