TRANSCRIPT
Due to the economic downturn, Microsoft Research has eliminated all
funding for title slides.
We sincerely apologize for any impact these austerity measures may have on
this presentation.
Resistive Memory Cell
Metal (to bit line)
Metal (to sensor line)
R0 (Ω) → 0, R1 (Ω) → 1
Use ECP, not ECC, for Hard Failures in Resistive Memories
Stuart Schechter
Gabriel Loh
Karin Strauss
Doug Burger
and introducing (in the ape suit)
Phase-Change Memory (PCM)
Phase-Change Memory Cell
Metal (to bit line)
heating element
chalcogenide (phase change material)
Metal (to sensor line)
R_crystalline (Ω)
Phase-Change Memory vs. DRAM
• Nonvolatile
• Scalability not impacted by capacitance limits
• Cells may be written individually
• Slower, with more energy-intensive writes
  – 2x slower for reads, 10x slower for writes
Hard Failures in Resistive Memories
• The heating element loses resistivity, or
• Expansion/contraction causes detachment
  – Mean expected lifetime 10^8 writes (but varies)
After 10^8 writes together, we’re not connecting like we used to. I just don’t feel the heat anymore.
Phase-Change Memory vs. DRAM
• DRAM cells encounter soft (transient) errors
  – May occur between write and future read
• PCM cells encounter hard (permanent) failures
  – Occur at write time (detectable by verifying read)
  – Increase in frequency over product lifetime
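The "detectable by verifying read" point can be sketched as a write followed by an immediate read-back. This is a minimal illustration only: the `Cell` class and its stuck-at failure model are assumptions made for the example, not something the slides specify.

```python
class Cell:
    """Toy memory cell (hypothetical). `stuck_at` is None while the cell is
    healthy; otherwise it is the value the failed cell permanently holds."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at
        self.value = 0 if stuck_at is None else stuck_at

    def write(self, bit):
        # A failed cell silently ignores writes.
        if self.stuck_at is None:
            self.value = bit

    def read(self):
        return self.value


def write_and_verify(cells, bits):
    """Write `bits`, read them back, and return the indices of cells whose
    stored value does not match what was just written."""
    for cell, bit in zip(cells, bits):
        cell.write(bit)
    return [i for i, (cell, bit) in enumerate(zip(cells, bits))
            if cell.read() != bit]


cells = [Cell(), Cell(stuck_at=0), Cell(), Cell(stuck_at=1)]
failed = write_and_verify(cells, [1, 1, 0, 1])
print(failed)  # [1]
```

Cell 3 is also broken, but it happens to be stuck at the value being written, so its failure stays hidden until a write asks it to hold the other value.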
Living with Hard Failures in Memory Cells
Assume we’re already doing these things
1. Accept failure of some fraction of pages
   – Map failed pages out of logical memory
2. Wear-level data pages/blocks, and within blocks
   – Shift/rotate data randomly (intervals/locations)
3. Differential writes
   – Write only cells with values that change
4. Correct errors when possible
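Item 3 above (differential writes) can be sketched as follows. The function name and the list-of-bits representation are assumptions chosen for the example; real hardware would do the comparison in the write circuitry.

```python
def differential_write(stored, new):
    """Update `stored` in place, writing only the cells whose value changes.
    Returns the indices of the cells that were actually written."""
    if len(stored) != len(new):
        raise ValueError("stored and new rows must have the same length")
    written = []
    for i, (old_bit, new_bit) in enumerate(zip(stored, new)):
        if old_bit != new_bit:
            stored[i] = new_bit
            written.append(i)
    return written


row = [0, 1, 1, 0, 1, 0, 0, 1]
touched = differential_write(row, [0, 1, 0, 0, 1, 1, 0, 1])
print(touched)  # [2, 5]
```

Only two of the eight cells are written, so only those two accumulate wear from this update.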
Error Correction Schemes
A Page Must Be Retired When…
The first cell within a page fails
No correction, Pairing8, SEC64, ECP6, Wilkerson4, Perfect_Code9
Error Correction Schemes
8 chips
8 bits/chip
SEC/SECDED
64 bits
7/8 bits, 10.9%/12.5% overhead
We use this 12.5% overhead limit for all schemes in our study
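The 7/8 bit and 10.9%/12.5% figures follow from the standard Hamming condition for single-error correction. The sketch below reproduces that arithmetic; the function name is made up for the example.

```python
def sec_check_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1, i.e. enough syndromes
    to name every single-bit error position plus the no-error case."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


data_bits = 64
sec = sec_check_bits(data_bits)   # single error correction
secded = sec + 1                  # one extra parity bit for double error detection

print(sec, f"{100 * sec / data_bits:.1f}%")        # 7 10.9%
print(secded, f"{100 * secded / data_bits:.1f}%")  # 8 12.5%
```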
Error Correction Schemes
A Page Must Be Retired When…
A block within the page suffers a second error
Error Correction Schemes
Error Correcting Pointers
8 chips
8 bits/chip
64 bits
Correction Chip
Error Correction Schemes
Error Correcting Pointers
[Figure: a row of data cells (indexed 511 to 0) with one correction entry, made up of a correction pointer (bits 8 to 0) and a replacement cell (R), plus a "Full?" bit]
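A rough sketch of how a single correction entry could be used on a read, and what it costs in storage. The layout assumed here (one pointer wide enough to address every data cell, one replacement cell per entry, and a single "Full?" bit per row) is read off the figure labels; the function names are hypothetical.

```python
def ecp_overhead_bits(data_bits, entries):
    """Storage cost under the assumed layout: each entry has a pointer plus
    one replacement cell, and the row has one "Full?" bit."""
    pointer_bits = (data_bits - 1).bit_length()  # 9 bits address 512 cells
    return entries * (pointer_bits + 1) + 1


def ecp1_read(data_cells, full, pointer, replacement):
    """Return the row contents, substituting the replacement cell for the
    data cell named by the pointer when the entry is in use."""
    bits = list(data_cells)
    if full:
        bits[pointer] = replacement
    return bits


# Data cell 4 has failed and reads 0, but the value written there was 1.
raw = [0, 1, 1, 0, 0, 0, 1, 0]
corrected = ecp1_read(raw, full=True, pointer=4, replacement=1)
print(corrected)  # [0, 1, 1, 0, 1, 0, 1, 0]

for n in (1, 6):
    bits = ecp_overhead_bits(512, n)
    print(n, bits, f"{100 * bits / 512:.1f}%")
```

Under this layout six entries for a 512-bit row come to 61 bits, which stays inside the 12.5% overhead limit used for all schemes in the study.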
Error Correction Schemes
ECP1
[Figure: data cells (indexed 511 to 0) followed by correction entries, each with a correction pointer (bits 8 to 0) and a replacement cell (R), plus a "Full?" bit]
Error Correction Schemes
[Figure: data cells (indexed 511 to 0) and correction entries 5 to 0, each entry holding a correction pointer (bits 8 to 0) and a replacement cell (R), plus a "Full?" bit]
Error Correction Schemes
A Page Must Be Retired When…
A row within the page suffers more errors than it has correction entries*
*What if correction entry fails?
Error Correction Schemes
There’s a precedence rule for that
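The slide does not spell the rule out, so the following is a sketch under a stated assumption: entries are applied in allocation order and a later entry overrides an earlier one. With that rule, a failed replacement cell is repaired by allocating a second entry that points at the same data cell.

```python
def ecp_read(data_cells, entries):
    """Apply correction entries in allocation order.

    `entries` is a list of (pointer, replacement_bit) pairs. Because later
    entries overwrite earlier ones, the most recently allocated entry for a
    given data cell wins (the assumed precedence rule)."""
    bits = list(data_cells)
    for pointer, replacement in entries:
        bits[pointer] = replacement
    return bits


# Data cell 4 failed first and was given entry 0. Later the replacement cell
# of entry 0 failed too (it now reads 0 although 1 was written), so entry 1
# was allocated with the same pointer.
raw = [0, 1, 1, 0, 0, 0, 1, 0]
entries = [(4, 0), (4, 1)]
print(ecp_read(raw, entries))  # [0, 1, 1, 0, 1, 0, 1, 0]
```

No separate error-correcting code over the correction entries is needed in this sketch; the ordering alone resolves the conflict.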
[Figure: data cells (indexed 511 to 0) and correction entries for a row with two correction entries in use]
Error Correction Schemes
Wilkerson, Gao, Alameldeen, Chishti, Khellah, & Lu, ISCA 2008
For fixing errors induced by running SRAM caches at low voltages
Error Correction Schemes
[Figure: data cells (indexed 511 to 0) and correction entries with replacement cells R1 and R0; the correction entries are protected by SEC]
A Page Must Be Retired When…
A row within the page suffers more errors than it has correction entries*
Error Correction Schemes
Ipek, Condit, Nightingale, Burger, & Moscibroda, ASPLOS 2009
8 chips
8 bits/chip
64 bits
Parity Bits
8 bits, 12.5% overhead
Error Correction Schemes
[Figure: a Page and its Paired Page, each drawn as bytes 0 to 7 with one parity bit per byte]
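One way to picture pairing, sketched under assumptions the slides do not state: two faulty pages may be paired only if no byte position has failed in both, and a read takes each byte from whichever page is healthy at that position. The function names and the set-of-faulty-byte-indices representation are invented for the example.

```python
def compatible(faulty_a, faulty_b):
    """Assumed pairing rule: pages can be paired when no byte position is
    faulty in both of them."""
    return not (set(faulty_a) & set(faulty_b))


def read_pair(primary, backup, faulty_primary):
    """Read each byte from the primary page unless that byte is known to be
    faulty there, in which case fall back to the paired page."""
    return [backup[i] if i in faulty_primary else primary[i]
            for i in range(len(primary))]


faulty_a = {1, 6}   # byte positions that have failed in page A
faulty_b = {4}      # byte positions that have failed in page B
print(compatible(faulty_a, faulty_b))  # True

page_a = [10, 0, 12, 13, 14, 15, 0, 17]    # bytes 1 and 6 hold garbage
page_b = [10, 11, 12, 13, 0, 15, 16, 17]   # byte 4 holds garbage
print(read_pair(page_a, page_b, faulty_a))  # [10, 11, 12, 13, 14, 15, 16, 17]
```

Two pages of cells are spent to store one page of data, which is why pairing halves capacity as soon as it is used.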
Error Correction Schemes
A Page Must Be Retired When…
160 errors occur within the page
(an average of 5 errors per 1024 bits)
Error Correction Schemes
8 chips
8 bits/chip
SEC/SECDED
8 transfers x 8 bits
64 bits
8 accesses x 64 bits = 512 bit block
• Multiple error correction
  – 576 bits of storage
  – 512 data bits
  – At most 9 corrections possible (Hamming bound)
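The "at most 9 corrections" figure can be checked directly from the Hamming bound: a code with n total bits and n − k check bits can correct t errors only if the number of error patterns of weight 0 to t fits in 2^(n − k) syndromes. A small sketch, with a made-up function name:

```python
from math import comb


def max_correctable(total_bits, data_bits):
    """Largest t with sum_{i=0..t} C(total_bits, i) <= 2**(total_bits - data_bits)."""
    budget = 2 ** (total_bits - data_bits)
    t = 0
    covered = 1  # the single weight-0 (no error) pattern
    while covered + comb(total_bits, t + 1) <= budget:
        t += 1
        covered += comb(total_bits, t)
    return t


print(max_correctable(576, 512))  # 9
```

The bound is an upper limit on what any code of this size could do, which is why the talk treats Perfect_Code9 as an idealized competitor.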
Error Correction Schemes
A Page Must Be Retired When…
A 576-bit block (holding 512 data bits) suffers its 10th error
Experimental Parameters
• 4 KByte (32 Kbit) page size
• 64 Byte (512 bit) row size
• 1 rank
• 8 chips per rank
• x8 bit lines per chip
• 10^8 mean writes until memory cell failure
• 0.25 coefficient of variance for cell lifetimes
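To give a feel for what a mean of 10^8 writes with a 0.25 coefficient of variance implies, the sketch below samples per-cell lifetimes. The normal distribution and the clamping at zero are assumptions made for the example; the slides give only the mean and the coefficient.

```python
import random

MEAN_WRITES = 1e8        # mean writes until cell failure (from the slide)
COEFF_VARIANCE = 0.25    # standard deviation as a fraction of the mean


def sample_cell_lifetimes(n_cells, rng):
    """Draw one lifetime (in writes) per cell from an assumed normal distribution."""
    sigma = COEFF_VARIANCE * MEAN_WRITES
    return [max(0.0, rng.gauss(MEAN_WRITES, sigma)) for _ in range(n_cells)]


rng = random.Random(0)
lifetimes = sample_cell_lifetimes(512, rng)   # one 512-bit row
first_failure = min(lifetimes)
average = sum(lifetimes) / len(lifetimes)
print(f"average lifetime: {average:.3g} writes")
print(f"first cell in the row fails after: {first_failure:.3g} writes")
```

The weakest cell in a row fails well before the mean, which is why a row with no correction is lost long before a typical cell wears out.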
Results
[Figure: results plot comparing No correction, Pairing8, SEC64, ECP6, Wilkerson4, and Perfect_Code9; coefficient of variance = 0.25]
Half capacity lost on first error (when pages are paired)
Results
Write Modification Width
[Figure: 512 bit block (16 x 32 bit words)]
Entire 512 bit region modified (each bit flips with p=0.5)
Error Correction Schemes
A Review of Hamming Codes
[Figure: bit positions 1 to 16 of a Hamming code word]
Flipping just one data bit changes, on average, half of the Hamming code bits
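The "half of the code bits" claim can be illustrated with a plain single-error-correcting Hamming code (used here only as a stand-in; the talk's Perfect_Code9 is a multi-error code). In the classic layout, check bits sit at power-of-two positions and a data bit at position p is covered by one check bit for every 1 in the binary form of p. The helper names are hypothetical.

```python
def data_positions(data_bits):
    """Positions (1-based) of data bits in a classic Hamming layout, where
    power-of-two positions are reserved for check bits."""
    positions, pos = [], 1
    while len(positions) < data_bits:
        if pos & (pos - 1):          # nonzero means pos is not a power of two
            positions.append(pos)
        pos += 1
    return positions


def check_bits_touched(position):
    """Number of check bits that cover (and so flip with) this data bit."""
    return bin(position).count("1")


positions = data_positions(64)
check_bits = positions[-1].bit_length()   # check bits needed for this length
average = sum(check_bits_touched(p) for p in positions) / len(positions)
print(f"{check_bits} check bits, {average:.2f} flipped per data-bit change "
      f"({100 * average / check_bits:.0f}% of them)")
```

Because a small data change rewrites a large share of the code bits, those cells wear faster than the data cells they protect unless wear is leveled across both.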
Results
Write Modification Width
[Figure: 512 bit block (16 x 32 bit words), with a 256 bit modification region and a 128 bit modification region marked]
[Figure: Write Modification Width of 512 Bits; coefficient of variance = 0.25]
[Figure: Write Modification Width of 256 Bits; coefficient of variance = 0.25]
[Figure: Write Modification Width of 128 Bits; coefficient of variance = 0.25]
Fair Treatment of Perfect Codes
Perfect codes would always win if you leveled the wear of the correction bits and data bits
[Figure: Write Modification Width of 128 Bits, with Internal Wear Leveling; coefficient of variance = 0.25]
[Figure: Write Modification Width of 256 Bits, with Internal Wear Leveling; coefficient of variance = 0.25]
[Figure: Write Modification Width of 512 Bits, with Internal Wear Leveling; coefficient of variance = 0.25]
Conclusion
• Correction pointers >> Pairing
• Correction pointers >> SEC over small blocks
  – Precedence rules >> SEC over correction entries
• Correction pointers >= MEC over large blocks
  – Lower computational cost
• For memory with both hard and transient errors
  – Use ECP below, on chip (for hard errors)
  – Use SEC above, off chip (for soft errors)