practicalhardening of(crashtolerant(systems( · 2020. 8. 22. · dependabilityin(data(centers 0...

64
Practical Hardening of CrashTolerant Systems Marco Sera)ini (Yahoo! Research BCN) Joint work with: Miguel Correia (ISTUTL / INESCID) Daniel Gómez Ferro and Flavio Junqueira (Yahoo! Research BCN)

Upload: others

Post on 20-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Practical  Hardening    of  Crash-­‐Tolerant  Systems  

Marco  Sera)ini  (Yahoo!  Research  BCN)  Joint  work  with:  Miguel  Correia  (IST-­‐UTL  /  INESC-­‐ID)  

Daniel  Gómez  Ferro  and  Flavio  Junqueira  (Yahoo!  Research  BCN)  

Page 2: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Dependability  in  data  centers  0 Crashes  are  commonplace…  0 …  but  scarier  faults  do  occur  

Page 3: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

A  horror  story  An  8-­‐hour  system-­‐wide  outage  due  to  a  single  hardware  fault  

Page 4: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

What  happened?  0 Quoted  from  the  Amazon  service  health  dashboard  0  “A  handful  of  messages  had  a  single  bit  corrupted”  0  “The  message  was  still  intelligible,  but  the  system  state  information  was  incorrect”  

0  “We  used  MD5  checksums  throughout  the  system  (but  not)  for  this  particular  internal  state  information”  

0  “(The  corruption)  spread  throughout  the  system  causing  the  symptoms  described  above”  

 

Page 5: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

Process  i  

Page 6: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

Process  i  

Page 7: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

min  

Process  i  

Page 8: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

mout  

Event  handling  

min  

Process  i  

Page 9: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

mout  

Event  handling  

min  

Process  i  

Page 10: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

mout  

Event  handling  

min  

min  

x  

y  

Process  i   Process  j  

Page 11: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  

u  

v  

mout  

Event  handling  

min  

min  

x  

y  

Event  handling  

Process  i   Process  j  

Page 12: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

A  new  approach  to    error  isolation  

u  

v  

mout  

Event    handling  

min  

min  

x  

y  

Event  handling  

Process  i   Process  j  

1.  General  model  of  process  behavior  2.  Arbitrary  State  Corruption  (ASC)  fault  model  3.  Guarantee  error  isolation  through  hardening  

Page 13: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

A  new  approach  to    error  isolation  

u  

v  

mout  

Event    handling  

min  

min  

x  

y  

Event  handling  

Process  i   Process  j  

1.  General  model  of  process  behavior  2.  Arbitrary  State  Corruption  (ASC)  fault  model  3.  Guarantee  error  isolation  through  hardening  

Page 14: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

A  new  approach  to    error  isolation  

u  

v  

mout  

Event    handling  

min  

min  

x  

y  

Event  handling  

Process  i   Process  j  

1.  General  model  of  process  behavior  2.  Arbitrary  State  Corruption  (ASC)  fault  model  3.  Guarantee  error  isolation  through  hardening  

Page 15: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

A  new  approach  to    error  isolation  

u  

v  

mout  

Event    handling  

min  

min  

x  

y  

Event  handling  

Process  i   Process  j  

1.  General  model  of  process  behavior  2.  Arbitrary  State  Corruption  (ASC)  fault  model  3.  Guarantee  error  isolation  through  hardening  

Page 16: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Data  corruptions  0 Commodity  disks  are  known  to  be  unreliable  0 Faulty  [irmware  is  the  [irst  reason  

0 RAM:  ECC  errors  are  frequent  0 Production  machines  only  see  detected  errors    Coverage  not  known  

0 Interconnects  and  CPUs  also  fail  0 Faulty  drivers  or  bit  [lips  

Page 17: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Existing  approaches  Background  

Page 18: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Common  practice  0 Manual  placement  of  error  detection  checks  0 Application  knowledge  0 Time  consuming  

0 Hard  to  structure  without  fault  model  

0 No  error  isolation  guarantee    

Page 19: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Byzantine  fault  model  0 Black-­‐box  model  of  faulty  processes:  adversarial  0 Hardening  for  error  isolation  [Nysiad  NSDI  2008]  0 Based  on  state  machine  replication  0 Replication  and  performance  costs  

Servers  

Client  

Page 20: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Byzantine  faults  0 Byzantine  hardening  covers  attacks  and  bugs…  0 …  assuming,  e.g.,  design  diversity  of  replicas  0 Unpractical  in  most  systems  

Attacks   Bugs   Data  corruptions  

Page 21: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Byzantine  faults  0 Byzantine  hardening  covers  attacks  and  bugs…  0 …  assuming,  e.g.,  design  diversity  of  replicas  0 Unpractical  in  most  systems  

Attacks  

Security  

Bugs  

V  &  V  

Data  corruptions  

ASC  Hardening  

Page 22: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Process  and  fault  models  De[ining  Arbitrary  State  Corruptions  

Page 23: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Process  model  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

min  

mout  

1)  Event  Dispatching  

2)  Event  Handling  

3)  Message  sending  

State  

Page 24: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

ASC  fault  model  0 An  Arbitrary  State  Corruption  can  make  a  process    0 Crash  0 Assign  an  arbitrary  value  to  any  variable  0  Start  the  execution  from  an  arbitrary  instruction  

v   5  

z   10  

PC   20  

v   12  

z   7  

PC   320  

Page 25: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Fault  frequency  0 One  fault  for  every  processed  input  message  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u= r + v;!!v = u;!!send <WRITE, v> to process p!

min  

mout  

1)  Event  Dispatching  

2)  Event  Handling  

3)  Message  sending  

State  

Page 26: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Fault  diversity  0 A  corrupted  variable  is    different  from  its  replica  

0 Only  holds  immediately  after  the  fault  0 Can  be  invalidated  if  instructions  modify  the  variable  

v   5  

z   10  

PC   20  

v   12  

z   7  

PC   320  

5  

10  

5  

41  

original   replica   original   replica  

Page 27: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  0 Fault  diversity  does  not  hold  0 Hardening  preserves  diversity  

u  

v  

Original   Replica  

Page 28: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  0 Fault  diversity  does  not  hold  0 Hardening  preserves  diversity  

u  

v  

Original   Replica  

Page 29: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  0 Fault  diversity  does  not  hold  0 Hardening  preserves  diversity  

u  

v    ?  

Original   Replica  Fault    diversity  

Page 30: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Error  propagation  0 Fault  diversity  does  not  hold  0 Hardening  preserves  diversity  

u  

v  

Original   Replica  Fault    diversity  

 !  

Page 31: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

ASC  hardening  From  ASC  faults  to  crashes  and  message  omissions  

Page 32: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

From  ASC  to  crashes  0 Transparent:  to  the  hardened  process  0 Local:  no  process  replication  on  multiple  machines  0 Untrusted:  can  have  faults  while  executing  hardening  

HARDENING  RUNTIME  

u  

v  

mout  

Event  handling  

min  

Page 33: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

Original   Replica  

Page 34: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

Fault    diversity  

Original   Replica  

Page 35: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

mout  

Event  handling  

Fault    diversity  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

Page 36: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

mout  

Event  handling  

Fault    diversity  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Page 37: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

mout  

Event  handling  

Fault    diversity  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Page 38: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Error propagation!

Page 39: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Simple  hardening  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Error propagation!

Page 40: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Protecting  computation  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Redundant  event  handling  

Page 41: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Protecting  computation  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Redundant  event  handling  

Page 42: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Checking  the  checkers  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Page 43: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Checking  the  checkers  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

Page 44: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Checking  the  checkers  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

Redundant  event  handling  

CHECK  

Page 45: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Checking  the  checkers  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Upon receive message <REQ, r> do!!if v > 5 then!! !u = r + v + 5;!!else!! !u = r + v;!!v = u;!!send <WRITE, v> to process p!

CHECK  

Redundant  event  handling  

Page 46: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Issue  with  redundant  checks  

u  

v  

min  

Original   Replica  

Page 47: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Issue  with  redundant  checks  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

Page 48: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Issue  with  redundant  checks  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

CHECK  

Redundant  event  handling  

Page 49: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Issue  with  redundant  checks  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

CHECK  

Redundant  event  handling  

Page 50: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Incremental  buffer  

u  

v  

mout  

Event  handling  

min  

Original   Replica  

CHECK  

Redundant  event  handling  

Page 51: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Control  [low  errors  0 A  control  [low  error  may  subvert  the  execution  0 An  event  handler  could  be  executed  twice  0 Event  handling  may  be  skipped  or  incomplete  

0 Requires  control  )low  checks  0 Use  [lags  to  control  the  control  [low  0 Very  lightweight  

Page 52: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

PASC  runtime  

EH1   EH2   EH3  

Process  state  

PASC  checks  

PASC  library  

User-­‐  de[ined  

Transparent  

Replica  state  

Page 53: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

PASC  runtime  

EH1   EH2   EH3  

Process  state  

PASC  checks  

PASC  library  

User-­‐  de[ined  

Transparent  

Replica  state  

Page 54: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

PASC  runtime  

EH1   EH2   EH3  

Process  state  

PASC  checks  

PASC  library  

User-­‐  de[ined  

Transparent  

github.com/yahoo/pasc  

Replica  state  

Page 55: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Evaluation  

Page 56: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Hardening  an  echo  server  

0 Little  computation,  network  bound,  no  overhead  0 PBFT  is  a  reference  (Nysiad  not  available)  

0

0.5

1

1.5

2

2.5

3

3.5

4

0 20 40 60 80 100 120 140 160

Late

ncy in m

s

Throughput in Kops/s

PBFTPASC Echo

Unprot. Echo

Page 57: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Hardening    State  Machine  Replication  

0

1

2

3

4

5

6

0 20 40 60 80 100 120 140

Late

ncy

in m

s

Throughput in Kops/s

PBFTPASC Paxos

Unprot. Paxos

Page 58: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Hardening    State  Machine  Replication  

+  70  %  -­‐  15  %  

0

1

2

3

4

5

6

0 20 40 60 80 100 120 140

Late

ncy

in m

s

Throughput in Kops/s

PBFTPASC Paxos

Unprot. Paxos

Page 59: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Zookeeper  

0

5

10

15

20

25

0 10 20 30 40 50 60 70

La

ten

cy in

ms

Throughput in Kops/s

PASC ZooKeeperUnprot. ZooKeeper

Page 60: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Memory  overhead  

0

200

400

600

800

1000

1200

1400

1600

1800

0 1K 4K

Mem

ory

usage in M

B

Request size in bytes

Unprot. PaxosPASC Paxos

Page 61: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Scalability  

0 SimpleKV:  eventually  consistent  store,  no  replication  0  Scales  similarly  with  hardening  0 No  server  “wasted”  for  replication  

0  

10  

20  

30  

40  

50  

60  

70  

80  

90  

100  

1   3   5   7  

Max

. th

roughput

(kops/

sec)

Number of servers

PASC sKV

Unprot. sKV

Page 62: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

PASC  fault  coverage  0 Injected  random  bit  [lips  in  Paxos  0 Code  corruptions:  bytecode  and  binary  code  0  State  corruptions:  pointers  and  primitive  values  

 Code  corruptions   State  corruptions  Unprot   PASC   Unprot   PASC  

Undet.   3   0   93   0  Det.   -­‐   1   -­‐   330  Crash   1640   1663   2301   2066  Total   2856   2856   5237   5237  

Page 63: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Conclusions  0 Hardware  data  corruptions  occur  in  data  centers  0 Model  as  Arbitrary  State  Corruptions  0 ASC-­‐hardening  algorithm  for  error  isolation    0 Local:  does  not  require  replication  

0 PASC:  ASC-­‐hardening  library  0 Ef[icient:  PASC-­‐Paxos  has  up  to  70%  more  throughput  than  PBFT  

0 High  fault  coverage  

Page 64: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur

Thank  you    

sera)ini@yahoo-­‐inc.com    

github.com/yahoo/pasc