Design Tradeoffs for SSD Reliability

Bryan S. Kim, Jongmoo Choi, Sang Lyul Min

Seoul National University, Dankook University


High-level objectives

• Understand the SSD-internal mechanisms behind fail-slow symptoms

• Examine SSD-internal reliability enhancement techniques

• Think about system- and device-level approaches for handling errors

• H. Gunawi et al, “Fail-slow at scale: evidence of hardware performance faults in large production systems”, FAST 2018

How bad is it?

[Figure: published error rate measurements (log scale, 1E-8 to 1E-1) versus year published (2008 to 2020)]

• L. Grupp et al, “Characterizing flash memory: anomalies, observations, and applications”, MICRO 2009

• H. Sun et al, “Quantifying reliability of solid-state storage from multiple aspects”, SNAPI 2011

• Y. Cai et al, “Error patterns in MLC NAND flash memory: measurement, characterization, and analysis”, DATE 2012

• Y. Cai et al, “Data retention in MLC NAND flash memory: characterization, optimization, and recovery”, HPCA 2015

• Data from an industry partner, 2018

SSD’s reliability issue

Error-prone memory: RBER of $10^{-4}$ to $10^{-2}$

Reliable SSD: UBER below $10^{-15}$

• How to make SSD reliable?

• Performance overhead?

• Across different chips and wear states?

Flash memory errors

• Wear-out (CG at Vprog)

• Retention loss (CG at 0V)

• Disturbance (CG at Vpass)

[Figure: three floating-gate cell diagrams, one per error type, each showing a control gate (CG) above a floating gate (FG)]

Flash memory error modeling

$$\mathrm{RBER}(\mathit{cycles}, \mathit{time}, \mathit{reads}) = \varepsilon + \alpha \cdot \mathit{cycles}^{k} + \beta \cdot \mathit{cycles}^{m} \cdot \mathit{time}^{n} + \gamma \cdot \mathit{cycles}^{p} \cdot \mathit{reads}^{q}$$

• N. Mielke et al, “Reliability of solid-state drives based on NAND flash memory”, Proceedings of the IEEE, 2017
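
The additive RBER model on this slide can be evaluated directly: one baseline term plus separate wear, retention, and disturbance terms. The following is a minimal sketch. The class name, the coefficient values, and the choice of days as the time unit are assumptions made for illustration; they are not fitted parameters from any of the cited measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RberModel:
    """Coefficients of the additive RBER model (all values are placeholders)."""
    epsilon: float  # baseline error rate
    alpha: float    # wear coefficient
    k: float        # wear exponent on cycles
    beta: float     # retention coefficient
    m: float        # retention exponent on cycles
    n: float        # retention exponent on time
    gamma: float    # disturbance coefficient
    p: float        # disturbance exponent on cycles
    q: float        # disturbance exponent on reads

    def rber(self, cycles: float, time: float, reads: float) -> float:
        wear = self.alpha * cycles ** self.k
        retention = self.beta * cycles ** self.m * time ** self.n
        disturbance = self.gamma * cycles ** self.p * reads ** self.q
        return self.epsilon + wear + retention + disturbance


# Hypothetical coefficients, NOT fitted to any real chip; time is assumed to be in days.
example = RberModel(epsilon=1e-6, alpha=1e-9, k=1.5,
                    beta=1e-8, m=1.0, n=0.5,
                    gamma=1e-10, p=1.0, q=1.0)

fresh = example.rber(cycles=0, time=0, reads=0)
worn = example.rber(cycles=5000, time=365, reads=10000)
```

With no wear, no elapsed time, and no reads, only the baseline term remains; each of the other three terms grows with its own stress factor, which is why the later slides plot wear, retention, and disturbance separately.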


From measurements to model

• H. Sun et al, “Quantifying reliability of solid-state storage from multiple aspects”, SNAPI 2011

• Y. Cai et al, “Data retention in MLC NAND flash memory: characterization, optimization, and recovery”, HPCA 2015

• Y. Cai et al, “Read disturb errors in MLC NAND flash memory: characterization, mitigation, and recovery”, DSN 2015

• Data from an industry partner, 2018

Measurement (data) Model

• 3x-nm MLC (2011)

• 2y-nm MLC (2015)

• 3D TLC (2018)

Error model: 3x-nm MLC (2011)

[Figure: raw bit error rate (log scale, 1E-7 to 1E0) broken down into wear, retention, and disturbance. Wear: up to 10K P/E cycles. Retention and disturbance: 10K P/E cycles + up to 10K reads or up to 1 year.]

Error model: 2y-nm MLC (2015)

[Figure: raw bit error rate (log scale, 1E-7 to 1E0) broken down into wear, retention, and disturbance. Wear: up to 10K P/E cycles. Retention and disturbance: 10K P/E cycles + up to 10K reads or up to 1 year.]

Error model: 3D TLC (2018)

[Figure: raw bit error rate (log scale, 1E-7 to 1E0) broken down into wear, retention, and disturbance. Wear: up to 10K P/E cycles. Retention and disturbance: 10K P/E cycles + up to 10K reads or up to 1 year.]

SSD reliability enhancements

• Error correction code

• Data re-reads

• Intra-SSD redundancy

• Background relocation

Error correction code

[Figure: block diagram with an ECC encoder and an ECC decoder around flash memory. Data enters the ECC encoder and is stored as Data + P; Data + P is read back through the ECC decoder, which outputs Data.]

Data re-reads

[Figure: block diagram with the ECC encoder, ECC decoder, and flash memory. Data + P is read from flash memory into the ECC decoder; numbered steps 1 and 2 mark successive reads of the same data before the decoder outputs Data.]

Summary: ECC and data re-reads

• Error correction code

– Predictable performance

– Is fixed at design-time

• Data re-read

– Is much more powerful than ECC

– Increases latency for correcting errors
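
The tradeoff summarized above (fixed ECC strength, extra latency from re-reads) can be sketched as a retry loop around the decoder. This is a minimal sketch, not the controller logic of any real SSD. The function name, the `sense_page` hook, and the stub in which each re-read senses fewer raw bit errors are all assumptions made for illustration; the ECC decoder is abstracted to "succeeds if the raw bit errors do not exceed the correction strength".

```python
from typing import Callable, Tuple


def read_with_rereads(
    sense_page: Callable[[int], int],
    correction_strength: int,
    max_rereads: int,
) -> Tuple[bool, int]:
    """Return (success, number_of_flash_reads).

    sense_page(attempt) is a hypothetical hook that performs one flash read
    and returns the number of raw bit errors in the sensed codeword.
    """
    for attempt in range(max_rereads + 1):
        raw_bit_errors = sense_page(attempt)
        # Abstract decoder: correctable only up to the fixed ECC strength.
        if raw_bit_errors <= correction_strength:
            return True, attempt + 1
    return False, max_rereads + 1


def stub_sense(attempt: int) -> int:
    # Illustrative assumption: 120 errors on the first read, 40 fewer per re-read.
    return max(0, 120 - 40 * attempt)


ok, flash_reads = read_with_rereads(stub_sense, correction_strength=75, max_rereads=3)
```

In this stub the 75-bit decoder fails on the first two reads and succeeds on the third, so the request costs three flash reads instead of one. That multiplication of flash reads is the latency penalty the summary refers to.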

Evaluation: data re-read

For the 3D TLC (2018)

[Figure: normalized average response time (Norm. avg. RT, 0 to 3.5) versus ECC correction strength (25-bit, 50-bit, 75-bit, 100-bit) at 1K, 3K, and 5K cycles]

Why is data re-read bad?

For the 3D TLC (2018)

[Figure: cumulative probability (0.5 to 1) versus number of raw bit errors (0 to 175) for inf-bit, 100-bit, 75-bit, 50-bit, and 25-bit ECC]

Observations

• Repeated data re-reads make it worse

– 75-bit: ~30% increased latency at end-of-life
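
One way to see how ECC strength relates to the frequency of re-reads is to estimate how often a codeword holds more raw bit errors than the decoder can correct. The sketch below assumes independent bit errors (a binomial model), a hypothetical 8192-bit codeword, and a placeholder RBER of 5E-3; none of these values are taken from the evaluation in these slides.

```python
import math


def log_binom_pmf(n: int, k: int, p: float) -> float:
    """Log of the binomial pmf, computed in log space to avoid overflow (0 < p < 1)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log1p(-p)


def prob_reread(n_bits: int, rber: float, strength: int) -> float:
    """P(more than `strength` raw bit errors), assuming independent bit errors."""
    p_correctable = sum(
        math.exp(log_binom_pmf(n_bits, k, rber)) for k in range(strength + 1)
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - p_correctable)


# Hypothetical codeword size and RBER, for illustration only.
strengths = (25, 50, 75, 100)
probs = [prob_reread(8192, 5e-3, t) for t in strengths]
```

Under these assumptions the expected number of raw bit errors is about 41 per codeword, so a 25-bit decoder would need a re-read on almost every read while a 100-bit decoder would almost never need one. Because `1.0 - p_correctable` loses precision for very small tails, this sketch is only suitable for comparing orders of magnitude.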

Intra-SSD redundancy

[Figure: data D0 to D6 and parity P striped across flash memory chips. A second panel shows D2 alongside P and the remaining data D0, D1, D3, D4, D5, and D6.]

Summary: intra-SSD redundancy

• Error correction code

– Is fixed at design-time

• Data re-read

– Increases latency for correcting errors

• Intra-SSD redundancy

– Protects against random and sporadic errors

– Increases write amplification

– Increases read amplification on errors
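
The write and read amplification of intra-SSD redundancy can be illustrated with a parity stripe. The slides show D0 to D6 with one parity P but do not state the parity function, so the sketch below assumes a single bytewise-XOR parity page per stripe and uses tiny 4-byte pages; the function names are hypothetical.

```python
from functools import reduce
from typing import List, Sequence


def xor_pages(pages: Sequence[bytes]) -> bytes:
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def make_parity(data_pages: Sequence[bytes]) -> bytes:
    """One extra page written per stripe: the source of write amplification."""
    return xor_pages(data_pages)


def reconstruct(stripe: List[bytes], parity: bytes, lost_index: int) -> bytes:
    """Rebuild one unreadable page from every other page in the stripe plus parity."""
    survivors = [page for i, page in enumerate(stripe) if i != lost_index]
    return xor_pages(survivors + [parity])


# Stripe of seven data pages (D0..D6), assumed XOR parity.
stripe = [bytes([i, i + 1, i + 2, i + 3]) for i in range(7)]
parity = make_parity(stripe)
rebuilt_d2 = reconstruct(stripe, parity, lost_index=2)
reads_for_recovery = (len(stripe) - 1) + 1  # six surviving data pages plus parity
```

Recovering the single page D2 requires reading the six other data pages and the parity page, which is the read amplification on errors listed above. The parity page itself is the extra write per stripe.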

Evaluation: redundancy

For the 3D TLC (2018) with 75-bit ECC

[Figure, first panel: normalized average response time (Norm. avg. RT, 0 to 2) versus stripe size (s = 15, s = 7) at 1K, 3K, and 5K cycles]

[Figure, second panel: normalized 3 nines QoS (0 to 4) versus stripe size (s = 15, s = 7) at 1K, 3K, and 5K cycles]

Observations

• Repeated data re-reads make it worse

• Overheads of redundancy outweigh its benefits

– +56% latency at end-of-life

• Scrubbing reduces error-induced latency, but increases internal traffic

– +25% latency at end-of-life

– Highly dependent on accuracy of error prediction

• We need to consider data characteristics and compositionally combine reliability enhancements

Holistic reliability management

• Cold data

– Need protection against retention errors

– Least write amplification with redundancy

– Likely to be identified by GC


Holistic reliability management

• Cold data

– Selective redundancy for GC-ed data

• Read-hot data

– Need protection against disturbance errors

– # of data re-reads can be used as proxy

– Likely to be identified by scrubber


Holistic reliability management

• Cold data

– Selective redundancy for GC-ed data

• Read-hot data

– Cost-benefit scrubbing

• Write-hot data

– No special attention required
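
The classification on these slides can be read as a small decision table: data relocated by GC is treated as cold and given selective redundancy, data with many re-reads is treated as read-hot and considered for scrubbing, and everything else is left alone. The sketch below is one possible reading, not the authors' implementation. The type names, the re-read threshold, and the order in which the two checks are applied are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SELECTIVE_REDUNDANCY = "add redundancy when relocated by GC"
    COST_BENEFIT_SCRUB = "consider for scrubbing"
    NONE = "no special attention"


@dataclass
class DataStats:
    relocated_by_gc: bool  # data survived garbage collection, so it is likely cold
    reread_count: int      # observed data re-reads, used as a proxy for read disturbance


REREAD_THRESHOLD = 4  # hypothetical tuning knob, not from the slides


def choose_action(stats: DataStats) -> Action:
    # Assumed priority: read-hot handling is checked before cold-data handling.
    if stats.reread_count >= REREAD_THRESHOLD:
        return Action.COST_BENEFIT_SCRUB
    if stats.relocated_by_gc:
        return Action.SELECTIVE_REDUNDANCY
    return Action.NONE


cold = choose_action(DataStats(relocated_by_gc=True, reread_count=0))
read_hot = choose_action(DataStats(relocated_by_gc=False, reread_count=9))
write_hot = choose_action(DataStats(relocated_by_gc=False, reread_count=0))
```

Write-hot data falls through to `Action.NONE` because it is overwritten before retention or disturbance errors can accumulate, which matches the "no special attention required" bullet.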

Evaluation

For the 3D TLC (2018) with 75-bit ECC @ end-of-life state

[Figure: normalized average response time (Norm. avg. RT, axis 0 to 4) for ECC + re-read, Oracle scrub, and HRM; data labels 9.0 and 5.4 appear on the chart]

• ECC + re-read: Rely on ECC and data re-reads

• Oracle scrub: Scrub based on oracle knowledge

• HRM: Holistic reliability management


The bright side of flash memory

• S. Lee, “Emerging Challenges in NAND Flash Technology”, Flash Summit 2011

The dark side of flash memory

[Figure: published error rate measurements (log scale, 1E-8 to 1E-1) versus year published (2008 to 2020)]