efficient scrub mechanisms for error-prone emerging memories

Post on 24-Feb-2016

84 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Efficient Scrub Mechanisms for Error-Prone Emerging Memories. Manu Awasthi ǂ , Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡ , Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research. Executive Summary. - PowerPoint PPT Presentation

TRANSCRIPT

Efficient Scrub Mechanisms for Error-Prone Emerging Memories

Manu Awasthiǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran‡, Viji

Srinivasan‡

ǂMicron, ⁺University of Utah, ‡IBM Research

2

Executive Summary

• Future memory hierarchies will likely include NVMs– Focus of this work: Phase Change Memory (PCM)

• Multi level cells in PCM appear imminent• A number of proposals exist to handle hard errors

and lifetime issues of PCM devices• Resistance Drift is a lesser explored phenomenon– Will become primary cause of “soft errors” -- Need to

explore holistic solutions to counter drift

3

Phase Change Memory - MLC

• Chalcogenide material can exist in crystalline or amorphous states

• The material can also be programmed into intermediate states– Leads to many intermediate states, paving way for Multi

Level Cells (MLCs)

Resistance AmorphousCrystalline

(11) (00)(10) (01)

(111) (000)(101) (011) (010) (001) (110) (100)

4

What is Resistance Drift?

Resistance

Time11 10 01 00

A

B

ERROR!!

T0

Tn

5

Resistance Drift - Issues

• Programmed resistance drifts according to power law equation -

• R0, α usually follow a Gaussian distribution• Time to drift (error) depends on – Programmed resistance (R0), and – Drift Coefficient (α)– Is highly unpredictable!!

Rdrift(t) = R0 х (t)α

6

Resistance Drift - How it happens

11 10 01 00

Median case cell• Typical R0• Typical α

Worst case cell• High R0• High α

Scrub rate will be dictated by the Worst Case R0 and Worst Case α

Drift Drift

R0 R0Rt Rt

ERROR!!

Number of Cells

7

Resistance Drift DataCell Type Drift Time at Room

temperature (secs)Median 11 cell 10499

Worst 11 Case cell 1015

Median 10 cell 1024

Worst Case 10 cell 5.94

Median 01 cell 108

Worst Case 01 cell 1.81

(11) (00)(10) (01)

8

Uncorrectable Errors

9

Problem Severity

10

Inadequacy of Device Level Solutions

• Precise writes –> write – compare – write– Can be effective, requires multiple iterations– Increases write time, total write energy, decreases

lifetime• Correcting values during read not effective• Coding techniques– Can help, but cannot eliminate the problem by

themselves

11

Naïve Solution : Scrubbing

• Drift resets with every cell reprogram (write)• Leverage existing error correction mechanisms,

e.g., ECC - has its own drawbacks• A Full Refresh/Scrub (read-compare-write) is

extremely costly in PCM– Each PCM write takes 100 - 1000ns– Writing to a 2-bit cell may consume as much as 1.6nJ– Requires 600 refreshes in parallel

Refresh should be reactionary NOT precautionary!

12

Solution Outline

• A three pronged approach

– Incorporate stronger ECC codes

– Incorporate low-cost error detection mechanisms

– Adapt scrub algorithms that are conservative and adaptive

13

Additional ECC support

• With stronger ECC support, scrub operations can be made infrequent– If E errors can be corrected and E+1 detected, then

scrub operation can be postponed till the Eth error– Can provide support for hard error tolerance as well

• We advocate ECC-8 (E-8)– Correct eight errors, detect nine– Leads to 14.25% storage overhead, for 512 bits

14

• Light Array Reads for Drift Detection– Support for E Error-correcting, E+1

error detecting codes assumed– Lines are read periodically and

checked for correctness– Only after the number of errors

reaches a threshold, scrubbing is performed

– Drift detection has to be simple, on-chip

LARDD

Read Line

Check for Errors

Errors < E

Scrub Line

True

False

After N cycles

15

Read Pipeline with LARDD & Parity

• Simple error detection mechanism to bypass complex BCH circuit – 256 cell line divided up into eight 32-cell fields– Count number of drift prone states, store as parity

information– For every LARDD, check if parity has changed– If true, invoke BCH circuitry

1515(11) (00)(10) (01)

16

Headroom

• Headroom-h scheme – scrub is triggered if E-h errors are detected

† Decreases probability of errors slipping through

– Increases frequency of full scrub and hence decreases life time

• Presents trade-off between Hard and Soft errors

Read Line

Check for Errors

Errors < E-h

Scrub Line

True

False

After N cycles

17

Headroom Gradual

• Start with a small LARDD frequency

• Increase frequency as drift-based errors increase– Decreases energy

overhead– Might cause slight

increase in error rate

Read Line

Check for Errors

Errors <= E-h-g True

False

After N cycles

Double LARDD rate

18

Adaptive Scrub

• Dynamic events can affect reliability– Temperature increases can increase α and decrease drift

time– Cell lifetime/wearout is also an issue– Soft error rate depends on prevalence of drift prone states– Hard errors show up with aging cells

• Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors

• Can increase LARDD frequency when total number of hard errors increases preset threshold

19

Results

Stronger ECC + LARDD support leads to decrease in unrecoverable errors

20

Results - Headroom Schemes

• Compared to ECC-1, for 512 s LARDD interval– ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4%

reduction in scrub energy– ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate,

37.8% reduction in scrub energy

21

Device Level Solution – Non Uniform Banding

Resistance

Before

After

Mean R0

11 10 01 00

22

Comparing with Device Level Solutions

Baselin

e ECC-1 (2

.75 σ)

ECC-1 +

Precise W

rite (1

.375 σ)

ECC-1 +

Non-uniform

banding

ECC-8 (2

.75 σ)

ECC-8-head

room-3 (2.75 σ)

0E+00

1E+05

2E+05

Number of Uncorrectable Errors

ECC-8-headroom-3 provides for least number of uncorrectable errors

23

Conclusions

• Resistance drift will exacerbate with MLC scaling• Naïve solutions based on ECC support are costly for

PCM– Increased write energy, decreased lifetimes

• Holistic solutions need to be explored to counter drift at device, architectural and system levels– 39% reduction in energy, 4x less errors, 102x increase in

lifetime

24

Thanks!!

www.cs.utah.edu/arch-research

25

Backup Slides

top related