flash&reliability&in&produc4on:&...

38
Flash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability Bianca Schroeder University of Toronto (Currently on sabbatical at Microsoft Research Redmond)

Upload: others

Post on 24-Apr-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

Flash  Reliability  in  Produc4on:    

The  Importance  of  Measurement  and  Analysis  in  Improving  System  Reliability  

Bianca Schroeder  University of Toronto

(Currently on sabbatical at Microsoft Research Redmond)

Page 2: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

•  System  reliability  •  Why  and  how  do  systems  fail  in  the  wild?      

Page 3: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

Data  from  a  large  number  of  large-­‐scale  produc4on  systems  at  different  organiza4ons:  

Page 4: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

 ▪ Different  hardware  failure  events  

▪ Hardware  replacements  ▪ Correctable  and  uncorrectable  errors  in  DRAM  ▪ Server  outages  ▪ Hard  disk  drive  failures  ▪ Sector  errors  in  hard  disk  drives  ▪ Data  corrup4on  in  storage  systems  ▪ Failures/errors  in  solid  state  drives  

▪ Job  logs  • Google,  OpenCloud  (Hadoop  cluster  at  CMU),  Yahoo!  Hadoop  trace  

 

4

▪  Observa4ons  oTen  different  from  expecta4ons    ▪  Surprising  to  operators  as  well  as  manufacturers  

Page 5: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

5

¡  Why  flash  reliability?  §  More  and  more  data  is  living  on  flash            =>    data  reliability  depends  on  flash  reliability  §  Worry  about  flash  wear-­‐out  

¡  For  a  long  4me  only  lab  studies  using  accelerated  tes4ng  

¡  Recently,  some  field  studies:      ▪  Sigmetrics’15  (Facebook)  Meza  et  al.  ▪  FAST’16  (Google)  Schroeder  et  al.  ▪  Systor’17  (MicrosoT)  Narayanan  et  al.  

Page 6: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

6

Google fleet

10 drive models

4 chip vendors MLC, SLC, eMLC

6 years of data

Data on workload and variety of error types

Data on repairs, replacements, bad blocks & bad chips

•  Custom drives based on commodity chips (but custom firmware and FTL)

•  Drives are reporting counters many times per day

Page 7: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

7

¡  Percentage  of  drives  replaced  annually  due  to  suspected  hardware  problems  over  the  first  4  years  in  the  field:  

0  1  2  3  4  5  6  

MLC-­‐A  MLC-­‐B  MLC-­‐C  MLC-­‐D   SLC-­‐A   SLC-­‐B   SLC-­‐C   SLC-­‐D  

 Average  annual    replacement  

rates  for  hard  disks  (2-­‐20%)  

Percen

tage(%

)  

Consistent with [Narayanan’17]

§ Good  news:    §  Only  1-­‐2%  of  drives  replaced  annually  -­‐-­‐    much  lower  than  hard  disks!  §  Drives  benefiAed  from  ability  to  tolerate  chip  failure  

§  0.5-­‐1.5%  of  drives  developed  bad  chips  per  year  

Page 8: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

8

§ Bad  news:  Uncorrectable  errors  common  §  26-­‐60%  of  drives  see  uncorrectable  errors  in  their  life  (Google)  

§  2-­‐6  out  of  1,000  drive  days  experience  uncorrectable  errors  

§  0.2-­‐75%  of  drives  at  Facebook  [Meza  et  al.  2015]  §  Rates  at  MicrosoT  10X  higher  than  target  rate  [Narayanan  et  al.  2016]  

§ Much  worse  than  for  hard  disk  drives  (3.5%  experiencing  sector  errors)!  

§  These  errors  are  insideous  as  they  are  latent.  

Page 9: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

9

¡  Wear-­‐out  (limited  program  erase  cycles)  ¡  Technology  (MLC,  SLC)  ¡  Lithography  ¡  Age  ¡  Workload  ¡  Temperature  ¡  Other  factors?  

¡  What  reliability  metric  to  use?  §  Raw  bit  error  rate  (RBER)  ▪  Assump4on:  as  raw  bit  errors  accumulate  they  turn  uncorrectable  

§  Probability  of  uncorrectable  errors  ▪  Why  not  UBER  –  we  will  see  …  

Page 10: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

Common  expecta4on:  Exponen4al  increase  of  RBER  with  PE  cycles  …  or  maybe  polynomial    …  or  other  super-­‐linear?  

10

-­‐-­‐-­‐  Exponential    growth  

PE  cycles  

RBER

 

Page 11: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

11

§ Big  differences  across  models  (all  drives  use  same  ECC  &  FTL,  so  differences  are  not  due  to  ECC)  

§ Linear  increase  (for  range  of  PE  cycles  in  our  data)  § No  sudden  increase  after  PE  cycle  limit  

Page 12: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

12

Common  expecta4on:  Lower  error  rates  under  SLC  ($$$)  than  MLC  

Page 13: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

13

§ RBER  is  lower  for  SLC  drives  than  MLC  § Uncorrectable  errors  are  not  lower  for  SLC  drives  (all  drives  use  same  ECC,  FTL,  etc.  so  differences  are  not  due  to  ECC)  

§ SLC  drives  don’t  have  lower  rate  of  repairs  or  replacement  

Red  lines    are  SLC  drives  

Page 14: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

14

Common  expecta4on:  Higher  error  rates  for  smaller  feature  size  

Page 15: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

15

§ Smaller  lithography  =>  higher  RBER  § Lithography  has  less  impact  on  uncorrectable  errors  

43nm  versus  50nm  

34nm  versus  50nm  

Page 16: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

Common  expecta4on:  Exponen4al  increase  in  hardware  failures  with  temperature  (Arrhenius  equa4on)  

16

-­‐-­‐-­‐  Exponential    growth  

Temperature  

RBER

 

Page 17: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

17

§ Uncorrectable  errors  might  increase,  decrease  or  not  be  affected  by  temperature.  

§ Drive-­‐internal  mechanisms  protect  against  temperature,  e.g.  through  throttling.  

§ Other  effects  might  dominate  

Increase Decrease!

Little effect

Page 18: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

18

¡  Lab  studies  find  workload  induced  error  modes  §  Read  &  program  disturb  errors,  incomplete  erases  

¡  Field  data:  no  correla4on  between  read  /  write  /  erase  opera4ons  versus  errors  (Google,  Facebook)  §  Possibly  because  data  at  per-­‐drive  level  too  coarse  

¡  Consequence:  For  field  studies  UBER  is  not  a  meaningful  metric.  

                                           

UBER=  #Uncorrectable  bits  

           Total  #  bits  read  

Metrics  from  lab  studies  do  not  always  make  sense  for  field  data.  

Page 19: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

19

§ Different  RBER  for  same  model  in  different  clusters  § Other  factors  at  play  …  

Page 20: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

20

¡  The  main  purpose  of  RBER  is  as  a  metric  for  overall  drive  reliability  

¡  Allows  for  projec4ons  on  uncorrectable  errors  

[Mielke2008]

Page 21: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

21

§ Drive  models  with  higher  RBER  don’t  have  higher  frequency  of  uncorrectable  errors  

Page 22: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

22

§ Drives  (or  drive  days)  with  higher  RBER  don’t  have  higher  frequency  of  uncorrectable  errors  

§ RBER  is  not  a  good  predictor  of  field  reliability  § Uncorrectable  errors  caused  by  other  mechanisms  than  corr.  errors?  

Page 23: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

23

§ Prior  errors  highly  predictive  of  later  uncorrectable  errors  § Can  we  predict  uncorrectable  errors?  

 

Page 24: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

24

 

¡  Drives  report  many  opera4onal  sta4s4cs  e.g.  through  S.M.A.R.T:  §   Workload,  temperature,  power-­‐on-­‐hours,  prior  errors,  etc.    

Time

?

now  ¡  Based  on  data  from  interval  n,  will  there  be  uncorrectable  errors  in  interval  n+1?  

SMART1

SMART2

SMART254

Page 25: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

25

 

¡  Common  machine  learning  techniques  for  classifica4on  problems:  §  Classifica4on  and  regression  trees  §  Random  forests  §  Logis4c  regression  §  Support  vector  machines  §  Neural  networks  

¡  How  does  predic4on  accuracy  compare?  ¡  How  can  we  use  predic4ons  in  prac4ce?  

Increasing complexity

Page 26: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

                   

Betterr  

 

¡  Trained  models  for  Google  SSDs  &  Backblaze  HDD  data  ¡  Metrics:  

§  False  posi4ve  rate  (rate  of  false  alarms)  §  False  nega4ve  rate  (frac4on  of  errors  not  caught)  

Page 27: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

 

¡  A  simple  classifier  performs  best:  random  forests  ¡  Accuracy  is  very  good  

§  Catch  90%  of  errors  at  false  alarm  rate  of  2-­‐8%.  §  Catch  50%  of  errors  at  false  alarm  rate  of  1%.  

Catch 90% of errors at 2% false alarm rate.

Seagate HDS722020ALA330

Page 28: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

 

¡  Accuracy  lower  than  for  HDDs.  ¡  Could  be  due  to  differences  in  hardware  or  error  repor4ng.    ¡  Will  see  that  it’s  s4ll  good  enough  for  applica4ons  we  consider.  

Page 29: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

 

¡  What  if  only  lisle  data  is  available?  §  Accuracy  nearly  iden4cal.  

¡  What  if  only  data  for  a  different  model  available?  §  Accuracy  slightly  lower,  but  s4ll  very  good.  

Seagate error predictions using Hitachi-trained model

Page 30: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

30

   

¡  Use  case:  Op4mize  disk  scrubbing  ¡ Why  scrubbing?  

§  Periodic  disk  scan  to  detect  and  correct  errors  early.  §  Standard:  Scrub  con4nuously  at  a  fixed  speed  (e.g.  once  per  week).  

§  Expensive,  in  par4cular  with  increasing  capaci4es  and  impact  on  foreground  workload  

¡  Idea:    §  Increase  scrub  rate  when  error  is  predicted  =>  reduce  mean  4me  to  error  detec4on  (MTTD)  

Page 31: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

31

 

¡  Approach:  § When  error  is  predicted  for  next  interval,  increase  scrub  speed  by  factor  X  for  this  interval.  

¡  Benefits:  §  Errors  that  were  predicted  will  be  caught  X-­‐4mes  faster,  on  average.  

¡  Cost:  §  Every  4me  an  error  is  predicted,  X-­‐4mes  more  scrub  opera4ons  will  be  issued.  

¡  Cost  &  benefit  trade-­‐off  can  be  controlled  by  choosing  X  and  tuning  FPR  of  predictor.  

 

Page 32: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

32

Page 33: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

33

 

¡  Significant  reduc4on  in  mean  4me  to  error  detec4on  at  limited  overheads.  

¡  Effec4ve  even  for  the  SSD  model  with  the  weakest  predictor.  

Nearly 2X improvement for 2% time spent accelerated.

Nearly 1.5X improvement for 1% time spent accelerated.

Page 34: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

34

 

¡  Results  when  using  Hitachi  predictor  on  Seagate  drive:  

 ¡  Even  classifier  trained  on  different  model  is  useful.  

Page 35: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

35

 

¡  Dynamically  adapt  ECC  strength  of  SSDs.  ¡  Proac4vely  re4re  blocks  or  chips.  ¡  Tuning  the  cache  policy  of  SSD  caches,  switching  from  write-­‐back  to  write-­‐through  when  chance  of  errors  is  high.  

¡  Tuning  file  system  protec4on  mechanisms.  

Page 36: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

36

 

¡  Predic4ng  whole  SSD  failures  [Narayanan  16]  ¡  Problem:  false  posi4ves  much  more  expensive  (e.g.  unnecessary  drive  replacements)  

¡  False  posi4ve  rates  of  13%  too  high.  ¡ More  work  leT  to  be  done  …  

Page 37: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

37

¡  Many  aspects  different  from  expecta4ons  §  Linear  increase  with  PE  cycles  §  RBER  not  predic4ve  of  uncorrectable  errors  §  SLC  not  generally  more  reliable  than  MLC  §  UBER  not  a  great  metric  in  the  field  ⇒ Importance  of  collec4ng  field  data  

¡  Errors  are  oTen  predictable  using  simple  ML  techniques  §  Can  be  used  to  improve  storage  system  reliability  §  Need  to  think  about  what  data  to  report/collect  §  We  have  shown  some  use  cases,  probably  many  more…  

¡  S4ll  many  open  ques4ons  …  

Page 38: Flash&Reliability&in&Produc4on:& …nvmw.ucsd.edu/nvmw2018-program/unzip/current/nvmw2018... · 2019-01-25 · Differenthardware&failure&events& Hardware&replacements& Correctable&and&uncorrectable&errors&in&DRAM

38

¡  For  more  informa4on:  ▪  [Sigmetrics’15]  (Facebook)  “A  large-­‐scale  study  of  flash  memory  failures  in  the  field”,  J.  Meza,  Q.  Wu,  S.  Kumar,  O.  Mutlu.  

▪  [FAST’16]  (Google)  “Flash  reliability  in  produc<on:  The  expected  and  the  unexpected”,  B.  Schroeder,  R.  Lagisesy,  A.  Merchant.    

▪  [Systor’17]  (MicrosoT)  “SSD  failures  in  data  centers:  The  what,  when  and  why?”,  I.  Narayanan,  D.  Wang,  M.  Jeon,  B.  Sharma,  L.  Caulfield,  A.  Sivasubramaniam,  B.  Cutler,  J.  Liu,  B.  Khessib,  K.  Vaid.  

▪  [IEEE  Proceedings’17]  (Survey)  “Reliability  of  NAND-­‐based  SSD:  What  field  studies  tell  us”,  B.  Schroeder,  R.  Lagisesy,  A.  Merchant.