revisi&ng)checkpoin&ng)for)exascale6class) systems) · sandia national laboratories is a...
TRANSCRIPT
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP
Revisi&ng)Checkpoin&ng)for)Exascale6Class)Systems)!
!Kurt!B.!Ferreira!
1423!00!Scalable!System!So9ware!SNL$!
April!29,!2015!$
SAND!201503382!C!
CheckpoinHng!is!Dead!!!!!Ef
ficie
ncy
Nodes
0 %
20 %
40 %
60 %
80 %
100 %
1K 2K 4K 8K 16K32K
64K128K
256K512K
Node!MTBF:!25!years,!Checkpoint/Restart!DuraHon:!5!minutes,!!Checkpoint!Frequency:!OpHmal!Due!to!Young!!
Ferreira,$Salishan$2015$
!…!but,!there!are!flaws!in!the!last!plot!!
• Does$not$account$for$state9of9the9art$in$checkpoin<ng:$
• Storage!OpHmizaHons:$Burst9buffers,$mul<9level$checkpoin<ng$(SCR),$etc$• Checkpoint!opHmizaHons:$checkpoint$compression,$incremental$checkpoin<ng,$etc$• Decreased!fault!rates:$SEC9DED$Vs.$Chipkill$ECC$DRAM$protec<on$
• Based$on$reliability$analysis$that$is$flawed:$counts$errors,!not!faults![1]$
• A$single$fault$generates$an$arbitrary$number$of$errors$(0$9>$infinite)$• Errors$are$more$indica<ve$of$the$OS’s!polling!frequency!or$applicaHon!access!
paYern!than$of$hardware$health$$• Errors$overemphasize$the$effects$of$permanent!faults,$which$can$lead$to$thousands$
or$millions$of$errors$$
Ferreira,$Salishan$2015$
[1]$Sridharan$et$al,$Memory!Errors!in!Modern!Systems:!The!Good,!The!Bad,!and!The!Ugly$In:$20th$Interna,onal$Conference$on$Architectural$Support$for$Programming$Languages$and$Opera,ng$Systems$(ASPLOS)$ACM!
!Incorrect!Methodology!=>!Incorrect!Conclusions!on!Reliability!
• Example:$• Hopper’s$DRAM$error$rate$was$4x$greater$than$Cielo’s$!$Cielo$is$more$reliable$
• Reality:!Hopper’s$DRAM$fault$rate$was$37%$lower$than$Cielo’s$
Ferreira,$Salishan$2015$
!RevisiHng!Checkpoint/Restart:!Take!Aways!!!
Ferreira,$Salishan$2015$
• Fast$IO$a$requirement$(burst$buffers,$etc)$$• Checkpoint/restart$can$be$a$viable$op<on$if$per9device$(per9MB)$reliability$remains$
similar$to$observed$today.$
• Checkpoint/restart$efficiency$can$remain$above$80%$if$DRAM$reliably$drops$by$20X.$
• At$current$failure$rates,$improving$DRAM$reliability$has$greater$benefit$than$SRAM.$
• Improvements$to$both$DRAM$(double$chipkill)$and$SRAM$(SEC9DED$ECC)$reliability,$can$lead$to$checkpoint$efficiencies$similar$to$observed$today.$$
What!do!we!need!from!hardware!to!keep!checkpoint/restart!viable?!!
!!!!Sources!for!Failure!Data!
Ferreira,$Salishan$2015$
Cielo!at!Los!Alamos!NaHonal!Lab! Hopper!at!NERSC!/!Lawrence!Berkeley!NaHonal!Lab!
Two!ProducHon!systems!500M+!CPU!socket0hours!40B+!DRAM!device0hours!
Cielo! Hopper!
Nodes$ 8568$ 6000$
Sockets$/$Node$ 2$ 2$
Cores$/$Socket$ 8$ 12$
DIMMs$/$Socket$
4$ 4$
Loca<on$ Los$Alamos,$NM$ Oakland,$CA$
Al<tude$ 7,320$e$ 43$e$
DRAM$ECC$ Chipkill9correct$ Chipkill9detect$
!Exascale!Strawman!!!
Ferreira,$Salishan$2015$
Cielo! Exascale!!
Nodes$ 8000$ 50,000$9$100,000$
Cores$ 144,000$ 1$billion$
DRAM$Aggregate$Capacity$$ 280$TB$ 32$9$128$PB$
DRAM$Device$Capacity$ 8$Gbit$ 8,$16,$32$Gbit$
SRAM$Capacity$/$Node$ 18$MB$ 150$MB$
SRAM$Protec<on$ Parity$ Parity$or$SEC9DED$ECC$
DRAM$ECC$ Chipkill9correct$ Chipkill9correct$or$double$Chipkill$correct$
Checkpoint$Bandwidth$ 160$GB/sec$ 10920$TB/sec$
!Per0Device/Per!MB!Reliability!
Ferreira,$Salishan$2015$
Channe
l$0$
DRAM$0$
DRAM$1$
DRAM$2$
DRAM$17$
DRAM$0$
DRAM$1$
DRAM$2$
DRAM$17$
• Unless$otherwise$stated,$we$assume:$
• Per9DRAM$device$reliability$@$exascale$==$Cielo$per9device$reliability$
• SRAM$reliability$per$MB$also$remains$same$as$today$
• Therefore,$decreased$reliability$due$to$drama<c$component$count$increases$
• Historically,$vendors$have$increased$per9device$reliability$while$increasing$capacity$$
!Failure!Rates!
Ferreira,$Salishan$2015$
• Base,$non9scaling$FIT$rate$(from$Cielo)$
• FIT$rate$that$scales$up$• DRAM:$scales$with$number$of$DRAM$devices$(from$Cielo)$• SRAM:$scales$per$MB$of$SRAM$(Cielo$and$Hopper)$
• For$Cielo,$base$FIT$rate$dominates$
• At$scale,$DRAM$and$SRAM$dominate$depending$on$configura<on$$$$
FIToverall = FITbase +FITDRAM +FITSRAM
!AssumpHon:!SDC!not!a!major!concern!
Ferreira,$Salishan$2015$
• All$bets$off$with$checkpoin<ng$if$veracity$a$concern$
• Straight$forward$to$protect$regular$storage$arrays$(DRAM,$SRAM)$
• More$ECC$bits$• Bit$interleaving$
• Current$cost:$~20%$addi<onal$circuits,$~25%$addi<onal$power$
• Protec<ng$logic$(ALUs)$is$different$story,$and$ignored$in$this$talk$
Sridharan$et$al,$Memory!Errors!in!Modern!Systems:!The!Good,!The!Bad,!and!The!Ugly$In:$20th$Interna,onal$Conference$on$Architectural$Support$for$Programming$Languages$and$Opera,ng$Systems$(ASPLOS)$ACM$
!Faults!are!ExponenHally!Distributed!
Ferreira,$Salishan$2015$
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●
●
●
●
●
0 500 1000 1500 2000
0500
1000
2000
QQ−plot distr. Weibull
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●
0 500 1000 1500 2000
0200
400
600
800
QQ−plot distr. Normal
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●●
0 500 1000 1500 2000
0200
400
600
800
QQ−plot distr. Gamma
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●
●
●
●
0 500 1000 1500 2000
0100
200
300
400
QQ−plot distr. Exponetial
• Previous$work$suggested$Weibull$as$a$bener$fit$for$distribu<on$
• Though$a$few$significant$outliers$exist,$Exponen<al$appears$to$be$the$bener$fit$
• We$are$inves<ga<ng$this$difference$$
!Time!to!SoluHon!due!to!Daly[2]!
Ferreira,$Salishan$2015$
Tw:$Time$to$solu<on$with$failures$Ts:$Time$to$solu<on$without$failures!Msys:$System$mean$<me$between$failures$R:!Restart$<me$τ:!Checkpoint$frequency$δ:!Checkpoint$commit$<me!
Tw =Msys ∗eR
Msys eτ+δMsys −1
#
$%%
&
'((Tsτ
!Workload!CharacterisHcs!
Ferreira,$Salishan$2015$
• Time$to$solu<on$without$failures$(Ts):$168$hours$
• Per9process$checkpoint$sizes$• Applica<on9level$checkpoints$• Collected$from$five$DOE$produc<on$applica<ons$• Per9process$sizes$of$minimum$and$maximum$vary$by$order$
of$magnitude$
• Checkpoint$Bandwidths:$Cielo$(160GB/sec),$Trinity$(~1TB/sec),$and$an$Exascale$strawman(10920TB/sec)$
• Applica<on$Efficiency$$
Efficiency = TsTw*100
!!!!!!!
Ferreira,$Salishan$2015$
CheckpoinHng!at!Exascale!with!Current!FIT!Rates!
!Uncorrected!DRAM!Error!Rate!for!Strawman!!!!!!
Ferreira,$Salishan$2015$
Efficiency!@!32PB!>=!90%!!!!!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!Uncorrected!DRAM!Error!Rate!!!!!
Ferreira,$Salishan$2015$
Efficiency!@!128PB!>!80%!!!!!!!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!Comparison!of!Efficiency!@!32PB!and!128PB!!!!!!!!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
32PB128PB
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!!!!!!!
Ferreira,$Salishan$2015$
How!unreliable!can!DRAM!get!and!remain!80%!Efficient?!!!
!
20X!Decrease!in!DRAM!Reliability!Yields!LiYle!Cost!!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
DRAM 1XDRAM 10XDRAM 20X
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!!!!!!!
Ferreira,$Salishan$2015$
What!if!DRAM!reliability!increases?!!!!
!Greater!than!10X!Increase,!Yields!LiYle!Benefit!!!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
DRAM 1XDRAM +10X DRAM +50X
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!!!!!!!
Ferreira,$Salishan$2015$
Impact!of!SRAM!Reliability!!
!Uncorrected!SRAM!Rate!for!Strawman!!
Ferreira,$Salishan$2015$
8QFRUUHFWHG�(UURU�5DWH
�5HODWLYH�WR�&LHOR�
���
����
�����
6PDOO
/DUJH
6PDOO��6FDOHG�
/DUJH��6FDOHG�
6PDOO��(&&�
/DUJH��(&&�
�����
������
�����
�����
���� ����
• Node$Count$• Small:$50,000$• Large:$100,000$
• Scaled:$SRAM$FIT$rate$decreases$50%$over$a$number$of$genera<ons$[3]$
• ECC:$SEC9DED$protec<on$for$SRAM!
!Uncorrected!SRAM!Rate!for!Strawman!!
Ferreira,$Salishan$2015$
8QFRUUHFWHG�(UURU�5DWH
�5HODWLYH�WR�&LHOR�
���
����
�����
6PDOO
/DUJH
6PDOO��6FDOHG�
/DUJH��6FDOHG�
6PDOO��(&&�
/DUJH��(&&�
�����
������
�����
�����
���� ����
• Node$Count$• Small:$50,000$• Large:$100,000$
• Scaled:$SRAM$FIT$rate$decreases$50%$over$a$number$of$genera<ons$[3]$
• ECC:$SEC9DED$protec<on$for$SRAM!
!SRAM!Reliability!Has!LiYle!Impact!on!Efficiency!!!!
Ferreira,$Salishan$2015$
8QFRUUH
FWHG�(UUR
U�5DWH
�5HODWLYH
�WR�&LHOR�
���
����
�����
6PDOO
/DUJH
6PDOO��6FDOHG�
/DUJH��6FDOHG�
6PDOO��(&&�
/DUJH��(&&�
�����
������
�����
�����
���� ����
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
SRAM CieloSRAM -50%SRAM ECC
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
!!!!!!!
Ferreira,$Salishan$2015$
What!if!we!can!increase!reliability!of!both!DRAM!and!SRAM?!!
SRAM!ECC!+!DRAM!double!Chipkill!≈!Efficiency!Today!
Ferreira,$Salishan$2015$
Appl
icat
ion
Effic
ienc
y (%
)
Aggregate Checkpoint Bandwidth (B/sec)
DRAM +10XDRAM +10X, SRAM ECC
0
20
40
60
80
100
160GB 1TB 1.5TB 10TB 15TB 20TB
Cielo Trinity Exascale
Summary!
• Importance$of$using$the$current$state9of9the9art$in$evalua<ng$the$future$of$checkpoint/restart$
$• $Checkpoin<ng$may$have$a$future$at$exascale,$it$but$comes$at$
a$cost:$
• Fast$IO$a$requirement$(burst$buffers,$SCR,$etc)$• We$can$afford$a$20X$decrease$in$DRAM$reliability$and$s<ll$
remain$80%$efficient$• At$current$FIT$rates,$improved$DRAM$reliability$more$
beneficial$than$SRAM$• DRAM$double9chipkill$correct$and$SRAM$SEC9DED$ECC$
can$lead$to$checkpoint$efficiencies$similar$to$those$observed$on$current$systems.$
$$$Ferreira,$Salishan$2015$
Thanks!!!
Acknowledgements:!• SNL:$Scon$Levy,$Jon$Stearley,$and$Patrick$Widener$• UNM:!Dorian$Arnold$and$Patrick$Bridges$• LANL:!Nathan$DeBardeleben,$Sean$Blanchard,$and$Qiang$Guan$$• AMD:$$Vilas$Sridharan$and$Sudhanva$Gurumurthi$• NERSC:$John$Shalf$$
Ferreira,$Salishan$2015$
References!!
[1]$Vilas$Sridharan,$Nathan$DeBardeleben,$Sean$Blanchard,$Kurt$B$Ferreira,$$$$$$$$$Jon$Stearley,$John$Shalf,$Sudhanva$Gurumurth$(2015)$$Memory!Errors!in!Modern!!!!!!!!Systems:!The!Good,!The!Bad,!and!The!Ugly$In:$20th$Interna,onal$Conference$on$$$$$$$$Architectural$Support$for$Programming$Languages$and$Opera,ng$$$$$$$$Systems(ASPLOS)$ACM.$$$[2]$J.$T.$Daly,!!A!higher!order!esHmate!of!the!opHmum!checkpoint!interval!for!!!!!!!restart!dumps,$In.$Future$Genera<on$Computer$Systems,$2006.$$[3]$E.$Ibe,$H.$Taniguchi,$Y.$Yahagi,$K.$i.$Shimbo,$,$and$T.$Toba.!Impact!of!scaling!!!!!!!on!neutron0induced!so9!error!in!SRAMs!from!a!250!nm!to!a!22!nm!design!rule.$$$$$$$$In$IEEE$Transac,ons$on$Electron$Devices,$pages$1527–1538,$Jul.$2010.$$$
Ferreira,$Salishan$2015$