error recovery flows in nand flash ssds · 09/08/2018 · marvell semiconductor, inc. flash memory...
TRANSCRIPT
© 2018 Marvell Semiconductor, Inc. All rights reserved.
Error Recovery Flows in NAND Flash SSDs
Viet-Dzung Nguyen
Marvell Semiconductor, Inc.
Flash Memory Summit 2018
Santa Clara, CA
Outline
• Data Reliability in NAND Flash Memories
• Concept of an Error Recovery Flow (ERF)
• Components of an ERF
• Optimization of an ERF in an LDPC-Based Controller
• Comparison with a BCH-Based ERF
Data Reliability in NAND Flash
• NAND flash stores information as different charge levels in NAND flash cells
• In a fresh NAND device, the distributions of threshold voltages (Vth) around their mean values are very sharp
- Hence, the number of bits detected erroneously is very small
• In a NAND device stressed by program/erase cycling, retention, or read disturb, the distributions become wider
- Hence, the number of bits detected erroneously is larger
[Figure: distributions of charge level over Vth for states 1 and 0, shown for fresh NAND (narrow, well separated) and stressed NAND (wide, overlapping)]
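The effect of wider distributions on the raw bit error rate can be sketched numerically. This is a minimal illustration, assuming two equally likely states with Gaussian Vth distributions of equal width and a single read threshold; the Gaussian model, the function names, and all numeric values are assumptions for illustration and do not come from the slides.

```python
import math


def q_function(x: float) -> float:
    """Tail probability of a standard normal variable: P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def raw_bit_error_rate(mu_low, mu_high, sigma, read_vth):
    """Raw bit error rate for one read threshold (assumed Gaussian model)."""
    # A low-Vth cell (state 1) is misread when its Vth lands above the threshold.
    p_low_misread = q_function((read_vth - mu_low) / sigma)
    # A high-Vth cell (state 0) is misread when its Vth lands below the threshold.
    p_high_misread = q_function((mu_high - read_vth) / sigma)
    # Both states are assumed equally likely.
    return 0.5 * (p_low_misread + p_high_misread)


# Hypothetical numbers: same means, only the spread changes with stress.
fresh = raw_bit_error_rate(0.0, 2.0, sigma=0.2, read_vth=1.0)
stressed = raw_bit_error_rate(0.0, 2.0, sigma=0.5, read_vth=1.0)
print(f"fresh BER:    {fresh:.2e}")
print(f"stressed BER: {stressed:.2e}")
```

With the means held fixed, widening the distributions is enough to raise the error rate by several orders of magnitude in this model.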
Error Correction
• Error correction codes are employed to correct errors introduced by NAND flash
• Decoder classification:
- Hard decoders use the hard decision generated by one read from NAND
- Soft decoders use soft information generated by many reads from NAND
[Figure: data path. Write: host user data -> ECC encoder -> ECC codeword -> NAND. Read: NAND -> noisy ECC codeword -> ECC decoder -> user data -> host]
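The difference between the two decoder inputs can be sketched as follows. This is an illustrative model only: `hard_read`, `soft_read`, the thresholds, and the LLR values are hypothetical, and the sign convention (positive LLR means bit 0 is more likely) is an assumption. The mapping of low Vth to bit 1 follows the figure on the previous slide.

```python
from bisect import bisect_left


def hard_read(cell_vths, read_vth):
    """One read: a cell above the threshold reads as 0, otherwise as 1."""
    return [0 if v > read_vth else 1 for v in cell_vths]


def soft_read(cell_vths, read_vths, bin_llrs):
    """Several reads: place each cell in the bin between adjacent thresholds."""
    thresholds = sorted(read_vths)
    # bisect_left counts the thresholds strictly below the cell's Vth,
    # which is the index of the bin the cell falls into.
    return [bin_llrs[bisect_left(thresholds, v)] for v in cell_vths]


cells = [0.1, 0.9, 1.1, 1.9]           # hypothetical cell Vth values
print(hard_read(cells, read_vth=1.0))  # one bit per cell, no confidence

# Three reads give four bins; LLR values per bin are hypothetical.
llrs = soft_read(cells, read_vths=[0.8, 1.0, 1.2], bin_llrs=[-6, -1, 1, 6])
print(llrs)                            # sign = decision, magnitude = confidence
```

The two middle cells get the same hard decisions as their neighbors but much smaller LLR magnitudes, which is the extra information a soft decoder can exploit.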
Error Recovery Flow
[Flowchart: on-the-fly decode followed by the error recovery flow]
On-the-fly Decode
1. Issue a read to NAND with a Vth setting, decode with a hard decoder
2. Is decoding successful? Yes: data sent to host. No: enter ERF
Error Recovery Flow
1. Enter ERF, set i = 0
2. Run error recovery procedure (ERP) i
3. Is decoding successful? Yes: exit ERF with success
4. No: any more ERPs left? Yes: set i := i + 1 and go to step 2. No: exit ERF with failure
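The control flow in the chart can be sketched as a short loop. The interface is hypothetical: each step is modeled as a callable that returns the decoded data on success or `None` on failure, and `read_page` is an illustrative name, not firmware API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: a step returns decoded data, or None if decoding failed.
ReadAndDecode = Callable[[], Optional[bytes]]


def read_page(on_the_fly: ReadAndDecode,
              erps: Sequence[ReadAndDecode]) -> Optional[bytes]:
    """On-the-fly hard decode first, then the ERPs in order until one succeeds."""
    data = on_the_fly()
    if data is not None:
        return data          # data sent to host, ERF never entered
    i = 0                    # enter ERF
    while i < len(erps):     # any more ERPs left?
        data = erps[i]()     # run error recovery procedure i
        if data is not None:
            return data      # exit ERF with success
        i += 1
    return None              # exit ERF with failure


# Usage with stand-in procedures: the second ERP recovers the page.
calls = []

def make_step(name, result):
    def step():
        calls.append(name)
        return result
    return step

result = read_page(make_step("hard", None),
                   [make_step("erp0", None), make_step("erp1", b"ok"),
                    make_step("erp2", b"unused")])
print(result, calls)
```

The ordering of `erps` is the design decision: cheaper procedures go first so that most pages exit early, and expensive ones are only reached by the pages that need them.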
Components of an ERF
Hard Read Retry
• Retrying the hard-input decoder after performing a read with a new Vth setting
• Useful when sub-optimal Vth settings were used for the original read. The performance is not sensitive to the quality of soft information
Soft Read Retry
• Running the soft-input decoder using soft information collected from multiple reads
• Useful to recover pages with many error bits. Has higher average latency but very good performance
Vth Calibration
• Inferring the optimal Vth using a histogram of charge levels
• Useful when sub-optimal Vth settings are the reason for failure and a good setting could not be found easily
LLR Calibration
• Adjusting the LLR assignment based on the collected charge level histogram
• Useful when soft information is collected with sub-optimal Vth settings and LLRs were assigned assuming that the Vth settings were optimal
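The two calibration components can be sketched on a toy histogram. This is one possible reading, not the method used in any controller: Vth calibration is modeled as a search for the least-populated bin between two distribution peaks, and LLR calibration as a per-bin log ratio of counts. The function names, the histogram values, the availability of per-bit counts (for example from a page that has already been decoded), and the sign convention (positive LLR means bit 0) are all assumptions.

```python
import math


def calibrate_vth(bin_centers, counts, low_peak_idx, high_peak_idx):
    """Return the Vth at the least-populated bin between two distribution peaks."""
    valley = min(range(low_peak_idx, high_peak_idx + 1), key=lambda k: counts[k])
    return bin_centers[valley]


def calibrate_llrs(counts_bit0, counts_bit1):
    """Per-bin LLR = ln(P(bin | bit 0) / P(bin | bit 1)), with add-one smoothing."""
    n0 = sum(counts_bit0) + len(counts_bit0)
    n1 = sum(counts_bit1) + len(counts_bit1)
    return [math.log(((c0 + 1) / n0) / ((c1 + 1) / n1))
            for c0, c1 in zip(counts_bit0, counts_bit1)]


# Hypothetical charge-level histogram: peaks at bins 2 and 7, valley at bin 5.
bin_centers = [0.25 * k for k in range(9)]
counts = [5, 40, 90, 30, 8, 2, 12, 60, 20]
best_vth = calibrate_vth(bin_centers, counts, low_peak_idx=2, high_peak_idx=7)
print("calibrated Vth:", best_vth)

# Hypothetical per-bin counts for cells known to hold bit 1 (low Vth) and bit 0.
counts_bit1 = [900, 80, 15, 5]
counts_bit0 = [5, 15, 80, 900]
llrs = calibrate_llrs(counts_bit0, counts_bit1)
print("calibrated LLRs:", [round(x, 2) for x in llrs])
```

In this toy histogram the valley sits above the midpoint between the two peaks, so a read at the midpoint would be sub-optimal, and LLRs assigned under the assumption of a midpoint read would be mismatched to the actual bins.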
Components of an ERF (cont.)
Inter-Cell Interference Cancellation
• Cancelling the interference caused by adjacent cells by assigning LLRs based on the states of adjacent cells
• Useful to recover pages with very poor quality
Hard Error Mitigation
• Adjusting LLRs to take hard errors into account, detecting bad cells and bad bit-lines
• Useful to recover pages with a disproportionate number of hard errors
RAID
• Recovering one failed component in a RAID stripe by XOR-ing the remaining successful components in the stripe
• Useful to recover from die failures or failure mechanisms specific to 3D NAND
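The RAID step reduces to a byte-wise XOR. A minimal sketch, assuming a stripe whose parity component is the XOR of its data components and exactly one failed component; the function names and the sample data are illustrative.

```python
from functools import reduce


def xor_pages(pages):
    """Byte-wise XOR of equal-length pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def recover_failed_page(surviving_pages):
    """Rebuild the single failed component from all surviving ones, parity included."""
    return xor_pages(surviving_pages)


# Hypothetical three-page stripe plus parity.
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_pages(data)

# Suppose data[1] could not be decoded by any earlier recovery step.
rebuilt = recover_failed_page([data[0], data[2], parity])
print(rebuilt == data[1])
```

Recovery needs every other component of the stripe to be read and decoded successfully, which is why this step costs far more reads than the per-page procedures.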
ERF Optimization Goal
• Recover data with the highest probability of success and the lowest latency:
- Some systems have specific requirements on host timeout
- A good ERF must produce a good shape of the latency distribution
- A good ERF must not exhaust hardware and NAND resources that are needed to sustain other traffic
[Figure: distribution of ERF latency. x-axis: Error Recovery Flow Latency (s), 0 to 1200; y-axis: Percentage, 0 to 2.1]
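The shape of the latency distribution follows from the order, cost, and success rate of the steps. The sketch below assumes a strictly sequential flow in which each step has a fixed latency and a fixed probability of success given that it was reached; the function name, the step values, the timeout, and the use of arbitrary time units are all assumptions for illustration.

```python
def erf_latency_distribution(steps):
    """steps: list of (latency, p_success_given_reached).

    Returns a list of (cumulative_latency, probability, recovered) outcomes,
    one per step plus a final entry for the case where every step failed.
    """
    outcomes = []
    p_reach, elapsed = 1.0, 0.0
    for latency, p_success in steps:
        elapsed += latency
        outcomes.append((elapsed, p_reach * p_success, True))
        p_reach *= 1.0 - p_success
    outcomes.append((elapsed, p_reach, False))  # exit ERF with failure
    return outcomes


# Hypothetical flow: two cheap steps followed by one slow step that always works.
steps = [(100.0, 0.90), (300.0, 0.90), (2000.0, 1.0)]
outcomes = erf_latency_distribution(steps)

mean_latency = sum(t * p for t, p, _ in outcomes)
host_timeout = 1000.0
p_over_timeout = sum(p for t, p, _ in outcomes if t > host_timeout)
p_failure = sum(p for _, p, recovered in outcomes if not recovered)

print(f"mean latency:         {mean_latency:.1f}")
print(f"P(latency > timeout): {p_over_timeout:.4f}")
print(f"P(unrecovered):       {p_failure:.4f}")
```

In this example the mean looks acceptable while a small fraction of pages still exceeds the timeout, which is why the tail of the distribution, and not only the average, has to be designed.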
ERF Classification
Two classes of ERF: Simple/Typical/Static vs. Complex/Customized/Adaptive
Description
- Simple/Typical/Static:
1. Hard decode with Vth settings from the NAND vendor’s read retry table
2. Soft decode with a few hard-coded Vth settings
3. RAID
- Complex/Customized/Adaptive:
1. Start with a hard decode that uses a customized Vth setting (derived by the firmware’s media management algorithm)
2. Each subsequent step can be either a hard decode step or a soft decode step. Each read in the subsequent steps uses a Vth setting that is adaptively derived.
3. Detect defective NAND early and trigger RAID as soon as possible if RAID is deemed necessary
Vth Settings
- Simple/Typical/Static: Pre-determined
- Complex/Customized/Adaptive: Derived on-the-fly
Reliability
- Simple/Typical/Static: Typically meets low-end requirements if the NAND stays within or below the NAND vendor’s stressing boundary
- Complex/Customized/Adaptive: Meets high-end requirements even if the NAND is stressed beyond the NAND vendor’s stressing boundary
Latency
- Simple/Typical/Static: Acceptable average latency, very bad worst-case latency
- Complex/Customized/Adaptive: Optimal latency distribution
Optimization complexity
- Simple/Typical/Static: Simple, requires little NAND characterization
- Complex/Customized/Adaptive: Complex, requires extensive NAND characterization
ERF Optimization Strategy
• NAND characterization
[Figure: three log-scale plots of Frequency vs. Number of Error Bits / AU measured on NAND, with one curve per P/E cycle count (PE = 3 to 3506). Panel titles: "Default read, RD0, RT0, read repeat0" (two panels) and "Normal Mode, best read, RD0, RT0, read repeat0"]
• ERF Optimization
[Diagram: optimization loop. Blocks: NAND error bit distributions at various NAND stressing conditions and Vth settings; DSP techniques; statistical analysis tools; error recovery flow; hardware simulator (Marvell internal, also available for selected customers). The resulting performance estimations are used to refine the error recovery flow]
Performance estimations:
- Retry rate at different drive conditions
- Throughput at different drive conditions
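One of the listed estimates, the retry rate, can be approximated directly from an error-bit histogram. The sketch below is a crude stand-in for the hardware simulator, which the slides do not describe: it assumes the first hard decode succeeds whenever the raw error count of an allocation unit (AU) is at or below a fixed limit. The function name, the limit, and the histogram values are hypothetical.

```python
def retry_rate(error_bit_histogram, hard_decode_limit):
    """Fraction of AUs whose raw error count exceeds what the first hard decode handles.

    error_bit_histogram maps error bits per AU to the number of AUs observed.
    """
    total = sum(error_bit_histogram.values())
    failed = sum(n for errors, n in error_bit_histogram.items()
                 if errors > hard_decode_limit)
    return failed / total


# Hypothetical characterization data: P/E cycles -> {error bits per AU: AU count}.
histograms = {
    6: {5: 900, 20: 99, 60: 1},
    3506: {5: 300, 20: 500, 60: 200},
}
rates = {pe: retry_rate(hist, hard_decode_limit=40)
         for pe, hist in histograms.items()}
for pe, rate in rates.items():
    print(f"PE = {pe}: retry rate = {rate:.3f}")
```

A real LDPC decoder has no sharp pass/fail threshold on the error count, so a fixed limit only gives a first-order view of how the retry rate grows with P/E cycling.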
Comparison with BCH-Based ERF
• In BCH-based controllers, the aim of the ERF is to find a Vth setting at which the number of raw bit errors is within the correction capability of the BCH code
• Hence, the ERF in BCH controllers primarily consists of read retries, in which the failed page is read again with a different Vth setting each time
- If a particular read retry fails, it provides no feedback that can help in choosing the Vth for the next read retry
- This means that the long tail in the distribution of retry latency cannot be cut short
• Don’t use an LDPC-based ERF that looks like a BCH-based ERF
• Advanced tools exist (at least in Marvell’s controllers): We collaborate with customers on optimizing ERFs as well!
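The BCH read-retry behavior described above can be sketched as a blind walk through a fixed table. The sketch takes a bounded-distance view of BCH decoding (success exactly when the raw error count is at most the correction capability t); the function names, the error model, the retry table, and the value of t are hypothetical.

```python
def bch_decodes(raw_bit_errors: int, t: int) -> bool:
    """Bounded-distance view: decoding succeeds iff the errors fit within t."""
    return raw_bit_errors <= t


def bch_read_retry(errors_at_vth, retry_table, t):
    """Walk a fixed retry table; returns (successful Vth or None, reads used).

    A failed attempt yields only pass/fail, so the order of the remaining
    entries never adapts to what was observed.
    """
    for reads, vth in enumerate(retry_table, start=1):
        if bch_decodes(errors_at_vth(vth), t):
            return vth, reads
    return None, len(retry_table)


def errors_at_vth(vth_offset):
    """Hypothetical page: fewest raw errors at a Vth offset of -3."""
    return 20 + 15 * abs(vth_offset + 3)


retry_table = [0, 1, -1, 2, -2, 3, -3]   # hypothetical vendor-style ordering
outcome = bch_read_retry(errors_at_vth, retry_table, t=40)
print(outcome)
```

In this example the table spends reads on positive offsets even though the page needs a negative one, because nothing from a failed attempt indicates which direction to move. That wasted sequence is the long tail in the retry latency distribution.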
Questions?