ecc/dsp system architecture for enabling reliability scaling in

40
ECC/DSP System Architecture for Enabling Reliability Scaling in Sub-20nm NAND Eran Sharon, Idan Alrod , Avi Klein , Alon Eyal, Ofer Shapira Intelligent Memory Systems, Memory Division, SanDisk Corp. August 2013

Upload: others

Post on 11-Feb-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ECC/DSP System Architecture for Enabling Reliability Scaling in

ECC/DSP System Architecture for Enabling Reliability Scaling in Sub-20nm NAND

Eran Sharon, Idan Alrod, Avi Klein, Alon Eyal, Ofer Shapira Intelligent Memory Systems, Memory Division, SanDisk Corp. August 2013

Page 2: ECC/DSP System Architecture for Enabling Reliability Scaling in

Forward-Looking Statements

During our presentation today we may make forward-looking statements. Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to market growth, industry trends, future memory technology, technology transitions and future products. This presentation contains information from third parties which reflect their projections as of the date of issuance. Actual results may differ materially from those expressed in these forward looking statements due to factors detailed under the caption “Risk Factors” and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.

.

Disclaimer: This tutorial provides an overview of various techniques and concepts, some or all of which may not necessarily reflect what SanDisk is actually using in their products.

2

Page 3: ECC/DSP System Architecture for Enabling Reliability Scaling in

Outline

Gap Between Product Requirements and Technology Capability • Applications Requirements: Endurance, Performance, Power • Reliability Challenges with Scaling

ECC/DSP solutions • Tier 0: Adaptive NAND Parameters Optimization • Tier 1: Noise Reduction • Tier 2: Advanced Error Correction Coding (ECC) • Tier 3: Second level Error Correction (RAID) • Tier 4: Flash Management Algorithms • Tier 5: Host Data Manipulation

Summary

Disclaimer: This tutorial provides an overview of various techniques and concepts, some or all of which may not necessarily reflect what SanDisk is actually using in their products.

3

Page 4: ECC/DSP System Architecture for Enabling Reliability Scaling in

Increasing Product Requirements

Faster Web

Browsing

Smoother Multi-

tasking

Longer Battery

Life, Power Savings

Higher Resolution

Video Sharing,

Connectivity

Computational Photography

Faster Application Execution

4

Page 5: ECC/DSP System Architecture for Enabling Reliability Scaling in

iNAND™ EFD “Classic”

AFM 1.0 Performance Leadership

AFM 2.0 Application Awareness

AFM 3.0 Usage based Adaptive Flash Management™ Technology

Email, Music, & Video

Multitasking Apps

Computing

Gap Between Raw Memory Capability and Applications Requirements

Cost Reduction

5

Page 6: ECC/DSP System Architecture for Enabling Reliability Scaling in

AFM 4.0 Multi-dimensional Adaptive Flash Management™ Technology

iNAND™ EFD “Classic”

AFM 1.0 Performance Leadership

AFM 2.0 Application Awareness

Email, Music, & Video

Multitasking Apps

Computing

Cost Reduction

Gap Between Raw Memory Capability and Applications Requirements

6

Page 7: ECC/DSP System Architecture for Enabling Reliability Scaling in

Optimized Endurance for enhanced

video download & application caching

7

Page 8: ECC/DSP System Architecture for Enabling Reliability Scaling in

Optimized Performance for superior

gaming experience

8

Page 9: ECC/DSP System Architecture for Enabling Reliability Scaling in

Optimized Power Consumption

for longer web browsing

9

Page 10: ECC/DSP System Architecture for Enabling Reliability Scaling in

Reliability Challenges with Scaling

As an example we will describe the phenomena of Read Disturb

10

Page 11: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation

11

1. BL Pre-Charge

Page 12: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation

12

1. BL Pre-Charge 2. Gate Voltages

Opens “select gate” transistors Opens unselected cells – “Victims” Senses state of selected cell – “Target”

Target Cell 6v 6v 6v 6v 1v

Threshold Voltage numbers are nominal

Page 13: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation Target Cell is Erased – Read “1”

13

1. BL Pre-Charge 2. Gate Voltages 3. Sensing

Page 14: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation Target Cell is Erased – Read “1”

14

1. BL Pre-Charge 2. Gate Voltages 3. Sensing

Page 15: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation Target Cell is Erased – Read “1”

15

1. BL Pre-Charge 2. Gate Voltages 3. Sensing

Page 16: ECC/DSP System Architecture for Enabling Reliability Scaling in

16

1. BL Pre-Charge 2. Gate Voltages 3. Sensing

Current is Sensed Cell is Erased!

Read “1”

Read Operation Erased Cell – Read “1”

Page 17: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Operation Target cell is Programmed – Read “0”

17

1. BL Pre-Charge 2. Gate Voltages 3. Sensing

Current is NOT Sensed Cell is Programmed!

Read “0”

Page 18: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Disturb

18

Er A B C

Er → A

Page 19: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Disturb

19

Floating Gate

ONO Dielectric

Control Gate

Tunnel Oxide

P/E cycles leads to Tunnel Oxide (Tox) degradation that creates traps

Page 20: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Disturb

20

P/E cycles lead to Tunnel Oxide (Tox) degradation that creates traps

“Weak Programming” in unselected cells due to unintentional tunneling of electrons to the FG

S e l e c t e d C e l l ( Ta r g e t )

U n s e l e c t e d ( V i c t i m )

Page 21: ECC/DSP System Architecture for Enabling Reliability Scaling in

Read Disturb

21

P/E cycles lead to Tunnel Oxide (Tox) degradation that creates traps

“Weak Programming” in unselected cells due to unintentional tunneling of electrons to the FG

S e l e c t e d C e l l ( Ta r g e t )

U n s e l e c t e d ( V i c t i m )

Page 22: ECC/DSP System Architecture for Enabling Reliability Scaling in

22

NAND (Raw)

System

Noise Reduction

Advanced Error Correction Coding (ECC)

Second level Error Correction (RAID)

Flash Management Algorithms

Host Data Manipulation

0 1 2 3 4 5

Tier

s

Adaptive NAND Parameters Optimization

ECC/DSP Methods: from NAND to System

Page 23: ECC/DSP System Architecture for Enabling Reliability Scaling in

23

Errors 1e-1 Many Errors

1e-21 Few Errors

ECC Error Correction Coding (BCH)

Basic ECC sufficient to meet application requirements

Error Handling System Solutions Early Technologies

Page 24: ECC/DSP System Architecture for Enabling Reliability Scaling in

Error Handling System Solutions Advanced (Sub-20nm)

24

Tier 0: Adaptive

Parameters

Tier 1: Noise

Reduction

Tier 2: Advanced

ECC

Tier 3: Second Level

Error Correction

Sophisticated ECC and DSP techniques applied to mitigate the natural drift in reliability, and to meet the more demanding requirements of embedded application.

Errors

1e-1 Many Errors

1e-21 Few Errors

Tier 4: Flash Management

Algorithm

Tier 5: Host Data

Manipulation

Performance./Endurance/Power Enhancement Low High

Page 25: ECC/DSP System Architecture for Enabling Reliability Scaling in

Tier 0: Adaptive NAND Parameters Optimization

Adaptive NAND parameters optimization along the memory lifetime.

Parameter setting (“trimming”) of the Program, Erase and Read parameters.

System level feedback adapts the parameters to: – Memory wearing and error rates along the lifetime – Die to die, block to block, WL to WL variations within the

memory – Host data patterns

Once NAND level optimization has been exhausted, the residual noises and errors need to be handled at system level

25

Page 26: ECC/DSP System Architecture for Enabling Reliability Scaling in

1K W/E

V

Vr (i+2)Vr (i+1)Vr (i)

V

Vr (“final”) (i-1) Vr (“final”) (i) Vr (“final”) (i+1) Vr (“final”) (i+2)

Adaptive Read Thresholds – Example Problem: Cell Voltage Distribution is not fixed:

• Changes along the memory lifetime with W/E cycling and time (DR) • Variations within a die - changes from Block to Block, WL to WL,…

Using a fixed set default thresholds result in high BER and decoding failure Solution: Adaptive read thresholds

26

V

Dynamic Read Shift (i) Dynamic Read

Shift (i+1)Dynamic Read

Shift (i+2)

Page 27: ECC/DSP System Architecture for Enabling Reliability Scaling in

Tier 1: Noise Reduction System level residual NAND “noise” reduction via DSP and

coding techniques, aimed at reducing error rates to a bare minimum level • Tier 1 countermeasures may reduce raw NAND error rates from a

~1E-1 error level to ~1E-2 error level • Tier 1 countermeasures are aimed at:

– Ensuring that the next Tier 2 Error Correction Coding (ECC) is cost effective (i.e. less redundancy)

– Maximizing performance and reducing power consumption • Tier 1 countermeasures deal with non intrinsic “noises”, which

can be cancelled out, mitigated or compensated for: – Data dependent noises such as cross-coupling induced widening,

back pattern effects, Program and Read Disturbs, Over programming errors, etc.

27

Page 28: ECC/DSP System Architecture for Enabling Reliability Scaling in

NAND Scaling Challenges – Interferences

28 Source: Semiconductor Insights

Affected Cell

Neighbor WL Cell

WLn+1

WLn Neighbor

BL Cell

Neighbor BL Cell

Neighbor Diagonal

Cell

Neighbor Diagonal

Cell

WLn-1 Neighbor Diagonal

Cell

Neighbor Diagonal

Cell

Neighbor WL Cell

Page 29: ECC/DSP System Architecture for Enabling Reliability Scaling in

29

Cross Coupling Widening effect

Page 30: ECC/DSP System Architecture for Enabling Reliability Scaling in

Cell-to-Cell Coupling (CCC) Trend

With technology scaling, CCC increases dramatically Air Gap technology make the 19nm (AG) CCC equivalent to 24nm

(no AG)~ 27% reduction

Nor

mal

ized

CC

C R

atio

Air Gap

WL

Yan Li, ISSCC 2012

FG

FG

FG

FG

FG

FG

FG

FG

FG

160nm 90nm 70 nm 56nm 43nm 32nm 24nm 24nm(AG) 19nm 19nm(AG)

Diagonal WL-WL BL-BL

30

Page 31: ECC/DSP System Architecture for Enabling Reliability Scaling in

CVD of entire cell population

V

Level (i) Level (i+1)

...Example: Read

LSB page

V

Level (i) Level (i+1)

111

Vr

Without utilizing neighbor cell information:

LSBit = “0” with low reliability

110LSB

MSB ...

Digitally mitigating cross coupling and other data depended noises during read by taking into account the neighboring cell’s read state

31

CVD of entire cell population

V

Level (i) Level (i+1)

...

CVD of cells whose neighbor

is erased

CVD of entire cell population

V

Level (i) Level (i+1)

...

CVD of cells whose neighbor is

programmed to highest state

CVD of cells whose neighbor

is erased

Example: Read LSB page

V

Level (i) Level (i+1)

111

Vr

Without utilizing neighbor cell information:

LSBit = “0” with low reliability

110LSB

MSB ...111

110

With utilizing neighbor cell information:

LSBit = “1” with high reliability

Assume we know that the neighbor was programmed to highest state

31

Mitigating Data Dependent Noises

Page 32: ECC/DSP System Architecture for Enabling Reliability Scaling in

Tier 2: Advanced Error Correction Coding (ECC)

Advanced Error Correction Coding (ECC) is required in order to handle the residual errors of tier 1 • Tier 2 ECC can reduce the ~1E-2 residual error levels of tier 1 to

~1E-16 error level • State of the art iterative coding techniques, such as LDPC, are

replacing algebraic coding techniques, such as BCH codes • Advanced ECC techniques are essential for achieving an optimal

cost, endurance and performance tradeoff, as they allow operation near the theoretic limits (Shannon limit), providing maximal correction capability for a given amount of overprovisioning (“ECC redundancy”)

32

Page 33: ECC/DSP System Architecture for Enabling Reliability Scaling in

Actual computations are more complicated. Depend on: • Verify and read voltage levels • Data retention • P/E cycles • Temperature • Tuning voltage ambiguity

Flash Information Theory…

FlashProgramming voltage level

Read voltage level

2( ) ( ) ,

( | ) information bitsmax ( ; ) max ( ) ( | ) log ( ) ( | ) cellP X P X X Y

X

P Y XC I X Y P X P Y XP X P Y X

= =

∑ ∑

How can we compute the Flash capacity?: Information Theory (Shannon 1948)

Based on knowing probability to read a voltage level Y given that a voltage level X was programmed

For such a simplified model:

• Cross coupling • Back pattern • Program/Read disturb … 33

Page 34: ECC/DSP System Architecture for Enabling Reliability Scaling in

Approaching the Shannon Limit

Turbo Code

LDPC

Conv + RS

Source: Forward Insights

34

Page 35: ECC/DSP System Architecture for Enabling Reliability Scaling in

Tier 3: Second level Error Correction (RAID)

For enhanced reliability, especially required for SSD applications, a second level error correction, aimed to deal with complete NAND failures resulting in colossal errors, is required. RAID like techniques are used for that purpose. • Tier 3 level protection is used for both:

– Reducing tier 2 error rates from ~1E-16 to ~1E-24 or lower – Reducing dPPM levels due to gross NAND failure, such as WL

breaks, WL shorts, etc. • Tier 3 protection may require extra overprovisioning, or may only

maintain the overprovisioning temporarily in the controller until verifying data integrity.

35

Page 36: ECC/DSP System Architecture for Enabling Reliability Scaling in

RAID Example

36

Page 37: ECC/DSP System Architecture for Enabling Reliability Scaling in

Tier 4: Flash Management Algorithms

Back End Flash Management algorithms which manage how logical data is stored on the physical NAND level, in a way that will provide the best performance (both sequential, random or any other combined use case) and the best endurance

Examples of Flash management functions are: • Logical to physical address • Wear leveling • Garbage collection

37

Page 38: ECC/DSP System Architecture for Enabling Reliability Scaling in

Host data manipulation, leveraging the inherent “redundancy” in the host data for improving endurance, performance and power • Examination of host data produced by users or arising from various

operating and file system shows that a significant fraction of the data is of low entropy, having many repetitive data patterns

• Low entropy data from the host can be manipulated by the controller in various ways:

– Compression, Endurance coding, Deduplication

Tier 5: Host Data Manipulation

38

Page 39: ECC/DSP System Architecture for Enabling Reliability Scaling in

Optimization for Endurance

Tier 0: Adaptive Parameters

Tier 1: Noise Reduction

Tier 2: Advanced ECC

Tier 3: Second Level Error Correction

Tier 4: Flash Management Algorithm

Tier 5:

Host Data Manipulation Tier 0 Tier 1 Tier 2 Tier 3 Tier 4

Optimization for Power

Lower Power Consumption

Higher Endurance

Optimization: Optimization for different

tradeoffs

Tier 5 Optimization

For Performance

Better Performance

Summary

39

Page 40: ECC/DSP System Architecture for Enabling Reliability Scaling in

© 2013 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. iNAND and Adaptive Flash Management are trademarks of SanDisk Corporation. Other brand names mentioned herein are for identification purposes only and may be the trademarks of their respective

Thank you!