Due to the economic downturn, Microsoft Research has eliminated all funding for title slides. We sincerely apologize for any impact these austerity measures may have on this presentation.

Upload: joan-harrell

Post on 23-Dec-2015



Charge-Based DRAM Cell

The End of Charge-Based Memory

Q0: 0, Q1: 1

Resistive Memory Cell

Metal (to bit line)

Metal (to sensor line)

R0 (Ω): 0, R1 (Ω): 1

Use ECP, not ECC, for Hard Failures in Resistive Memories

Stuart Schechter, Gabriel Loh, Karin Strauss, Doug Burger

and introducing (in the ape suit)


Phase-Change Memory Cell

Metal (to bit line)

heating element

chalcogenide (phase change material)

Metal (to sensor line)

Phase-Change Memory (PCM)


R_crystalline (Ω)

Phase-Change Memory vs. DRAM

• Nonvolatile
• Scalability not impacted by capacitance limits
• Cells may be written individually
• Slower, with more energy-intensive writes
  – 2x slower for reads, 10x slower for writes

Hard Failures in Resistive Memories

• The heating element loses resistivity, or
• Expansion/contraction causes detachment
  – Mean expected lifetime 10^8 writes (but varies)

"After 10^8 writes together, we're not connecting like we used to. I just don't feel the heat anymore."


Phase-Change Memory vs. DRAM

• DRAM cells encounter soft (transient) errors
  – May occur between write and future read
• PCM cells encounter hard (permanent) failures
  – Occur at write time (detectable by verifying read)
  – Increase in frequency over product lifetime
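The point that hard failures show up at write time can be illustrated with a minimal sketch of a write followed by a verifying read. The stuck-at fault model and the names `write_and_verify`, `cells` and `stuck` are assumptions made for illustration, not something the slides specify.

```python
def write_and_verify(cells, stuck, data):
    """Write `data` into `cells`, then read back and report mismatches.

    cells: list of 0/1 values currently stored (mutated in place)
    stuck: dict mapping cell index -> value that cell is permanently stuck at
    data:  list of 0/1 values to write
    """
    for i, bit in enumerate(data):
        # A failed cell ignores the write and keeps its stuck value.
        cells[i] = stuck.get(i, bit)
    # Verifying read: any position that differs from what we wrote has failed.
    return [i for i, bit in enumerate(data) if cells[i] != bit]


row = [0] * 8
failed = write_and_verify(row, stuck={2: 1, 5: 0}, data=[1, 0, 0, 1, 1, 1, 0, 0])
print(failed)  # [2, 5]
```

In this model a stuck cell is only noticed when the value written differs from the value it is stuck at, so a failure may stay hidden until the data changes.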


Living with Hard Failures in Memory Cells

Assume we're already doing these things

1. Accept failure of some fraction of pages
   – Map failed pages out of logical memory

2. Wear-level data pages/blocks, & within blocks
   – Shift/rotate data randomly (intervals/locations)

3. Differential writes
   – Write only cells with values that change

4. Correct errors when possible
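As a rough illustration of item 3, a differential write compares the stored row with the new data and touches only the cells whose values differ. The helper name `differential_write` is hypothetical.

```python
def differential_write(old_bits, new_bits):
    """Return the indices of cells that actually need to be written."""
    if len(old_bits) != len(new_bits):
        raise ValueError("rows must be the same length")
    return [i for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]


old = [0, 1, 1, 0, 1, 0, 0, 1]
new = [0, 1, 0, 0, 1, 1, 0, 1]
to_write = differential_write(old, new)
print(to_write)  # [2, 5]
```

Only two of the eight cells wear in this example; the other six are left alone.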

Error Correction Schemes

No correction, Pairing8, SEC64, ECP6, Wilkerson4, Perfect_Code9

A Page Must Be Retired When… (No correction)

The first cell within a page fails

SEC64

• 8 chips, 8 bits/chip: 64 bits
• SEC/SECDED: 7/8 bits, 10.9%/12.5% overhead

We use this 12.5% overhead limit for all schemes in our study
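The 7/8-bit and 10.9%/12.5% figures follow from the standard Hamming condition: r check bits can protect d data bits when 2^r >= d + r + 1, and SECDED adds one overall parity bit. A small sketch (the function name `sec_check_bits` is made up for illustration):

```python
def sec_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1 (Hamming SEC condition)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


data_bits = 64
sec = sec_check_bits(data_bits)   # single-error correction
secded = sec + 1                  # one extra overall parity bit for double-error detection
print(sec, f"{sec / data_bits:.1%}")        # 7 10.9%
print(secded, f"{secded / data_bits:.1%}")  # 8 12.5%
```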

A Page Must Be Retired When… (SEC64)

A block within the page suffers a second error

Error Correcting Pointers

• 8 chips, 8 bits/chip: 64 bits
• Correction chip

Error Correcting Pointers

[Figure: a row of 512 data cells (indexed 511 down to 0) followed by one correction entry, made of a 9-bit correction pointer (bits 8 down to 0) and a replacement cell (R), plus a "Full?" bit.]

ECP1

[Figure: the 512 data cells plus a single correction entry and its "Full?" bit.]

[Figure: the same row with correction entries indexed 5 down to 0; two entries are drawn in detail, each with its own correction pointer and replacement cell (R).]

A Page Must Be Retired When… (ECP6)

A row within the page suffers more errors than it has correction entries*

*What if a correction entry fails? There's a precedence rule for that.

[Figure: worked example for the precedence rule, showing two correction entries (correction pointer plus replacement cell R) in a row whose entries are indexed 5 down to 0.]
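The pointer-and-replacement-cell idea can be sketched as a toy model. This is a simplified illustration, not the authors' design: the class `EcpRow`, the stuck-at fault model, and in particular the specific precedence rule (a later entry overrides an earlier one for the same pointer) are assumptions, since the slides only state that a precedence rule exists.

```python
class EcpRow:
    """Toy model of one row protected by n correction entries."""

    def __init__(self, n_cells=512, n_entries=6):
        self.cells = [0] * n_cells
        self.stuck = {}      # data-cell index -> stuck value (simulated hard faults)
        self.entries = []    # list of [pointer, replacement_bit], oldest first
        self.n_entries = n_entries

    def write(self, data):
        """Write data; return False if the row is out of entries (retire the page)."""
        for i, bit in enumerate(data):
            self.cells[i] = self.stuck.get(i, bit)
        # Refresh the replacement cells of existing entries with the new data.
        for entry in self.entries:
            entry[1] = data[entry[0]]
        # Verifying read: allocate an entry for each newly detected failure.
        for i, bit in enumerate(data):
            if self.read_bit(i) != bit:
                if len(self.entries) == self.n_entries:  # the "Full?" bit is set
                    return False
                self.entries.append([i, bit])
        return True

    def read_bit(self, i):
        value = self.cells[i]
        # Assumed precedence rule: later entries override earlier ones.
        for pointer, replacement in self.entries:
            if pointer == i:
                value = replacement
        return value

    def read(self):
        return [self.read_bit(i) for i in range(len(self.cells))]


row = EcpRow(n_cells=8, n_entries=2)
data = [0, 1, 1, 0, 1, 0, 0, 1]
row.stuck = {3: 1}
print(row.write(data), row.read() == data)  # True True
row.stuck[6] = 1
print(row.write(data), row.read() == data)  # True True
row.stuck[0] = 1
print(row.write(data))                      # False

# Under the assumed rule, a wrong replacement bit is patched by a later entry.
patched = EcpRow(n_cells=4, n_entries=2)
patched.entries = [[1, 0], [1, 1]]
print(patched.read_bit(1))                  # 1
```

The third failure exhausts both entries, which matches the retirement condition on the slide: the row has suffered more errors than it has correction entries.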

Wilkerson4

Wilkerson, Gao, Alameldeen, Chishti, Khellah, & Lu, ISCA 2008

For fixing errors induced by running SRAM caches at low voltages

[Figure: a 512-bit row of data cells with correction entries; each entry is drawn with a pointer (bits 8 down to 1), two replacement cells (R1, R0) and SEC bits.]

A Page Must Be Retired When… (Wilkerson4)

A row within the page suffers more errors than it has correction entries*

Pairing8

Ipek, Condit, Nightingale, Burger, & Moscibroda, ASPLOS 2009

• 8 chips, 8 bits/chip: 64 bits
• Parity bits: 8 bits, 12.5% overhead

[Figure: a Page and its Paired Page, each drawn as bytes 0 to 7 with one parity bit per byte.]
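One way to read the pairing figure: each byte carries a parity bit, and a read falls back to the paired page whenever the primary copy of a byte fails its parity check. The sketch below follows that reading; it is an assumption about how the scheme behaves, not a description taken from the slides, and the names `parity` and `read_paired` are hypothetical.

```python
def parity(byte):
    """Even-parity bit for an 8-bit value."""
    return bin(byte).count("1") % 2


def read_paired(primary, backup):
    """primary/backup: lists of (byte, stored_parity) tuples, one per byte position."""
    out = []
    for (p_byte, p_par), (b_byte, b_par) in zip(primary, backup):
        if parity(p_byte) == p_par:
            out.append(p_byte)   # primary copy passes its parity check
        elif parity(b_byte) == b_par:
            out.append(b_byte)   # fall back to the paired page
        else:
            raise ValueError("both copies of this byte are faulty")
    return out


good = [0x0F, 0xA5, 0x3C]
primary = [(b, parity(b)) for b in good]
backup = [(b, parity(b)) for b in good]
primary[1] = (0xA4, parity(0xA5))  # a stuck cell flipped one bit of byte 1 in the primary
backup[2] = (0x3D, parity(0x3C))   # a different byte is faulty in the backup
print(read_paired(primary, backup) == good)  # True
```

The pair stays usable as long as no byte position is faulty in both pages, at the cost of half the capacity.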

A Page Must Be Retired When… (Pairing8)

160 errors occur within page (an average of 5 errors per 1024 bits)

Perfect_Code9

• 8 chips, 8 bits/chip
• 8 accesses x 64 bits = 512 bit block
• SEC/SECDED: 8 transfers x 8 bits = 64 bits
• Multiple error correction
  – 576 bits of storage
  – 512 data bits
  – At most 9 corrections possible (Hamming bound)
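The limit of 9 corrections can be checked directly from the Hamming bound: a code of n bits carrying k data bits can correct t errors only if the number of error patterns of weight at most t fits in 2^(n-k). A short sketch (the function name `max_correctable` is made up for illustration):

```python
from math import comb


def max_correctable(total_bits, data_bits):
    """Largest t with sum_{i<=t} C(total_bits, i) <= 2**(total_bits - data_bits)."""
    budget = 2 ** (total_bits - data_bits)
    patterns = 0
    t = -1
    while patterns + comb(total_bits, t + 1) <= budget:
        t += 1
        patterns += comb(total_bits, t)
    return t


print(max_correctable(576, 512))  # 9
```

This is an upper bound on any code with these dimensions; it does not say that a practical code reaching it exists.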

A Page Must Be Retired When… (Perfect_Code9)

A 576-bit block (holding 512 data bits) suffers its 10th error

Experimental Parameters

• 4 KByte (32 Kbit) page size
• 64 Byte (512 bit) row size
• 1 rank
• 8 chips per rank
• x8 bit lines per chip
• 10^8 mean writes until memory cell failure
• 0.25 coefficient of variance for cell lifetimes
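To get a feel for these parameters, the sketch below samples a lifetime for each cell of one 512-bit row and compares when the first cell fails with when the seventh fails. It assumes normally distributed lifetimes and that every write wears every cell; the slides give only the mean and the coefficient of variance, so both are assumptions, and `row_failure_times` is a hypothetical helper.

```python
import random


def row_failure_times(n_cells=512, mean=1e8, cov=0.25, seed=0):
    """Sample a lifetime (in writes) for every cell in a row and sort them."""
    rng = random.Random(seed)
    # Assumption: lifetimes are normally distributed around the mean.
    return sorted(max(0.0, rng.gauss(mean, cov * mean)) for _ in range(n_cells))


times = row_failure_times()
first_failure = times[0]     # a row with no correction is lost here
seventh_failure = times[6]   # a row with six correction entries is lost here
print(f"{first_failure:.3g} {seventh_failure:.3g}")
```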

Results

[Plots: results for No correction, Pairing8, SEC64, ECP6, Wilkerson4 and Perfect_Code9; coefficient of variance = 0.25]

Half capacity lost on first error (when pages are paired)

Results: Write Modification Width

[Figure: a 512 bit block drawn as 16 x 32 bit words]

Entire 512 bit region modified (each bit flips with p=0.5)

A Review of Hamming Codes

Flipping just one data bit changes, on average, half of the Hamming code bits
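A quick count supports the "about half" claim for a textbook Hamming layout, where the binary digits of a data bit's codeword position name the check bits that cover it. This sketch assumes that standard layout (check bits at power-of-two positions) with 64 data bits and 7 check bits; the function names are made up for illustration.

```python
def check_bits_touched(position):
    """The set bits of a codeword position name the check bits covering it."""
    return bin(position).count("1")


def average_touched(data_bits):
    """Average number of check bits that change when one data bit flips."""
    touched, placed, position = 0, 0, 0
    while placed < data_bits:
        position += 1
        if (position & (position - 1)) == 0:
            continue  # powers of two hold check bits, not data
        touched += check_bits_touched(position)
        placed += 1
    return touched / data_bits


print(average_touched(64))  # 3.203125
```

About 3.2 of the 7 check bits change per single-bit flip, so the check bits see far more writes than a typical data bit when only a few data bits change.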

Results: Write Modification Width

[Figure: the same 512 bit block (16 x 32 bit words) with a 256 bit modification region and a 128 bit modification region marked]

[Plot: Write Modification Width of 512 Bits; coefficient of variance = 0.25]

[Plot: Write Modification Width of 256 Bits; coefficient of variance = 0.25]

[Plot: Write Modification Width of 128 Bits; coefficient of variance = 0.25]

Fair Treatment of Perfect Codes

Perfect codes would always win if you leveled the wear of the correction bits and data bits

[Plot: Write Modification Width of 128 Bits, Internal Wear Leveling; coefficient of variance = 0.25]

[Plot: Write Modification Width of 256 Bits, Internal Wear Leveling; coefficient of variance = 0.25]

[Plot: Write Modification Width of 512 Bits, Internal Wear Leveling; coefficient of variance = 0.25]

Conclusion

• Correction pointers >> Pairing
• Correction pointers >> SEC over small blocks
  – Precedence rules >> SEC over correction entries
• Correction pointers >= MEC over large blocks
  – Lower computational cost
• For memory with both hard and transient errors
  – Use ECP below, on chip (for hard errors)
  – Use SEC above, off chip (for soft errors)

Backup Slide for Responding to Questions

You didn't expect we'd believe this… did you?

I'm sorry dear, but if that's the best talk we'll be capable of even after millions of years of evolution, I think it's best we not reproduce.