![Page 1: PracticalHardening of(CrashTolerant(Systems( · 2020. 8. 22. · Dependabilityin(data(centers 0 Crashes(are(commonplace… 0 …but(scarier(faults(dooccur](https://reader033.vdocuments.us/reader033/viewer/2022060914/60a8440f48f3b5633d5d02cd/html5/thumbnails/1.jpg)
Practical Hardening of Crash-Tolerant Systems
Marco Serafini (Yahoo! Research BCN)
Joint work with: Miguel Correia (IST-UTL / INESC-ID), Daniel Gómez Ferro and Flavio Junqueira (Yahoo! Research BCN)
Dependability in data centers
- Crashes are commonplace…
- … but scarier faults do occur
A horror story
An 8-hour system-wide outage due to a single hardware fault
What happened?
- Quoted from the Amazon service health dashboard:
  - “A handful of messages had a single bit corrupted”
  - “The message was still intelligible, but the system state information was incorrect”
  - “We used MD5 checksums throughout the system (but not) for this particular internal state information”
  - “(The corruption) spread throughout the system causing the symptoms described above”
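The failure mode in these quotes can be reproduced in a few lines. This is a minimal sketch and not Amazon's code: the message layout and the helper names `seal`, `open_sealed` and `flip_bit` are hypothetical. It flips a single bit so that the message still parses but carries wrong state, then shows that a checksum over the whole message (MD5 here, only because the quote mentions it) exposes the corruption.

```python
import hashlib
import json


def seal(payload: dict) -> bytes:
    """Serialize a payload and prepend the 16-byte MD5 digest of its bytes."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.md5(body).digest() + body


def open_sealed(message: bytes) -> dict:
    """Verify the digest and return the payload, or raise ValueError."""
    digest, body = message[:16], message[16:]
    if hashlib.md5(body).digest() != digest:
        raise ValueError("checksum mismatch: message corrupted")
    return json.loads(body)


def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with a single bit inverted."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


sealed = seal({"server": "s1", "state": 4})

# Flip the lowest bit of the digit '4' in the body (skipping the digest):
# the result is still valid JSON, but the state value has changed.
index = sealed.index(b"4", 16)
damaged = flip_bit(sealed, index, 0)

print(json.loads(damaged[16:]))  # intelligible, but the state is incorrect
try:
    open_sealed(damaged)
except ValueError as err:
    print(err)
```

Without the digest check, the receiver would act on the corrupted state and could pass it on to other processes.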
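Error propagation
[Figure, built up over several slides: Process i (state u, v) handles an input message m_in and emits m_out; Process j (state x, y) receives it as its own m_in and handles it in turn.]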
A new approach to error isolation
[Figure: Process i (state u, v) and Process j (state x, y), each with an event handling step, connected by m_out and m_in.]
1. General model of process behavior
2. Arbitrary State Corruption (ASC) fault model
3. Guarantee error isolation through hardening
Data corruptions
- Commodity disks are known to be unreliable
  - Faulty firmware is the leading cause
- RAM: ECC errors are frequent
  - Production machines only see detected errors
  - Coverage is not known
- Interconnects and CPUs also fail
  - Faulty drivers or bit flips
Existing approaches
Background
Common practice
- Manual placement of error detection checks
  - Requires application knowledge
  - Time consuming
- Hard to structure without a fault model
- No error isolation guarantee
Byzantine fault model
- Black-box model of faulty processes: adversarial
- Hardening for error isolation [Nysiad NSDI 2008]
  - Based on state machine replication
  - Replication and performance costs
[Figure: a client and replicated servers.]
Byzantine faults
- Byzantine hardening covers attacks and bugs…
  - … assuming, e.g., design diversity of replicas
  - Impractical in most systems
[Figure: three fault classes and the approach that addresses each: attacks (security), bugs (V & V), data corruptions (ASC hardening).]
Process and fault models
Defining Arbitrary State Corruptions
Process model
Upon receive message <REQ, r> do
  if v > 5 then
    u = r + v + 5;
  else
    u = r + v;
  v = u;
  send <WRITE, v> to process p
[Figure: an input message m_in goes through 1) event dispatching, 2) event handling, which reads and updates the state, and 3) message sending, which emits m_out.]
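The handler above can be written as runnable code. This is a sketch of the slide's pseudocode, not code from PASC: the `State` class and the `dispatch` and `handle_req` names are introduced here for illustration, and messages are modeled as plain tuples.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Process state holding the two variables used on the slide."""
    u: int = 0
    v: int = 0


def handle_req(state: State, r: int) -> tuple:
    """2) Event handling for <REQ, r>: update the state, build the output."""
    if state.v > 5:
        state.u = r + state.v + 5
    else:
        state.u = r + state.v
    state.v = state.u
    return ("WRITE", state.v)


def dispatch(state: State, m_in: tuple) -> list:
    """1) Dispatch m_in to its handler and 3) return the messages to send."""
    kind, r = m_in
    if kind != "REQ":
        return []
    return [handle_req(state, r)]


state = State(v=5)
print(dispatch(state, ("REQ", 2)))  # v = 5 is not > 5, so u = 2 + 5 = 7
print(dispatch(state, ("REQ", 2)))  # v = 7 is > 5, so u = 2 + 7 + 5 = 14
```

Keeping the handler a function of the state and the input message is what makes it possible to run it again later on a second copy of the state.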
ASC fault model
- An Arbitrary State Corruption can make a process:
  - Crash
  - Assign an arbitrary value to any variable
  - Start the execution from an arbitrary instruction
[Figure: example of a corruption: the state (v = 5, z = 10, PC = 20) becomes (v = 12, z = 7, PC = 320).]
Fault frequency
- At most one fault for every processed input message
Fault diversity
- A corrupted variable is different from its replica
  - Only holds immediately after the fault
  - Can be invalidated if instructions modify the variable
[Figure: the original state (v = 5, z = 10, PC = 20) and the corrupted state (v = 12, z = 7, PC = 320), each shown as an original next to its replica; the replica columns read 5, 10 and 5, 41.]
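The two sub-bullets can be illustrated with a toy state. This is one possible reading of the slide, written as a sketch: the dictionaries, the `differing` helper and the instruction `v = z` are all hypothetical and are not taken from the talk.

```python
def differing(original: dict, replica: dict) -> list:
    """Names of the variables whose original and replica values differ."""
    return sorted(name for name in original if original[name] != replica[name])


original = {"v": 5, "z": 10}
replica = dict(original)

# Simulated ASC fault: an arbitrary value is assigned to v in the original.
original["v"] = 12
after_fault = differing(original, replica)
print(after_fault)  # diversity holds right after the fault

# A later instruction modifies the corrupted variable in both copies
# (here the hypothetical assignment v = z), so the copies agree again.
for state in (original, replica):
    state["v"] = state["z"]
after_write = differing(original, replica)
print(after_write)  # comparing the copies no longer reveals the fault
```

In this sketch the overwrite happens to be harmless, but any value computed from the corrupted `v` before the overwrite would carry the error forward without leaving a difference on `v` itself.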
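Error propagation
- Fault diversity does not hold
- Hardening preserves diversity
[Figure, built up over several slides: original and replica copies of the state variables u and v, with the fault diversity between them marked.]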
ASC hardening
From ASC faults to crashes and message omissions
From ASC to crashes
- Transparent: to the hardened process
- Local: no process replication on multiple machines
- Untrusted: can have faults while executing hardening
[Figure: the hardening runtime surrounds the process: state u and v, event handling, m_in and m_out.]
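Simple hardening
[Figure, built up over several slides: the state variables u and v are kept as an original and a replica; the event handler from the process model slide turns m_in into m_out, and a CHECK is applied. Earlier frames mark the fault diversity between the two copies; the final frames are labeled “Error propagation!”.]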
Protecting computation
[Figure: the event handler from the process model slide runs on the original state (u, v) to turn m_in into m_out; redundant event handling is added on the replica side, together with a CHECK.]
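A minimal sketch of redundant event handling, under stated assumptions: the handler is the one from the process model slide, state is a plain dictionary, and `hardened_handle` and `FaultDetected` are hypothetical names. This is not the PASC API, and the checks PASC actually performs are more involved than this single comparison.

```python
import copy


class FaultDetected(Exception):
    """Raised when the two executions disagree; the process should stop."""


def handle_req(state: dict, r: int) -> tuple:
    """Event handler for <REQ, r>, as on the process model slide."""
    if state["v"] > 5:
        state["u"] = r + state["v"] + 5
    else:
        state["u"] = r + state["v"]
    state["v"] = state["u"]
    return ("WRITE", state["v"])


def hardened_handle(original: dict, replica: dict, r: int) -> tuple:
    """Run the handler once per copy; release m_out only if both agree."""
    out_original = handle_req(original, r)
    out_replica = handle_req(replica, r)
    if out_original != out_replica or original != replica:
        raise FaultDetected("copies diverged: stop instead of sending m_out")
    return out_original


original = {"u": 0, "v": 5}
replica = copy.deepcopy(original)
print(hardened_handle(original, replica, 2))  # both executions agree

original["v"] = 12  # simulated state corruption in the original copy
try:
    hardened_handle(original, replica, 2)
except FaultDetected as err:
    print("stopped:", err)
```

Because the corrupted value is never sent, other processes see at worst a crash or a missing message, which is the behavior a crash-tolerant system is already built to handle.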
Checking the checkers
[Figure, built up over several slides: original and replica state (u, v), event handling from m_in to m_out using the handler from the process model slide, redundant event handling, and the CHECK step, which is itself subject to faults.]
Issue with redundant checks
[Figure, built up over several slides: original and replica state (u, v), event handling from m_in to m_out, redundant event handling, and CHECK.]
Incremental buffer
[Figure: original and replica state (u, v), event handling from m_in to m_out, redundant event handling, and CHECK.]
Control flow errors
- A control flow error may subvert the execution
  - An event handler could be executed twice
  - Event handling may be skipped or incomplete
- Requires control flow checks
  - Use flags to track the control flow
  - Very lightweight
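The flag idea can be sketched as follows. This is an illustration under assumptions, not PASC code: `GuardedHandler` and its method names are hypothetical, and a real control flow error is an arbitrary jump in the program, which this sketch only imitates by calling the methods in the wrong order.

```python
class ControlFlowError(Exception):
    """Raised when the flags show an impossible execution order."""


class GuardedHandler:
    """Wrap a handler with flags that must be set in the expected order."""

    def __init__(self, handler):
        self.handler = handler
        self.entered = False
        self.completed = False

    def start_event(self):
        """Reset the flags before an event is dispatched."""
        self.entered = False
        self.completed = False

    def run(self, *args):
        """Run the handler once, marking its entry and its completion."""
        if self.entered:
            raise ControlFlowError("handler executed twice for one event")
        self.entered = True
        result = self.handler(*args)
        self.completed = True
        return result

    def end_event(self):
        """Check, before sending output, that the handler ran to the end."""
        if not (self.entered and self.completed):
            raise ControlFlowError("handler skipped or incomplete")


guard = GuardedHandler(lambda r: r + 1)
guard.start_event()
print(guard.run(1))
guard.end_event()  # passes: the handler ran exactly once

guard.start_event()
try:
    guard.end_event()  # the handler was skipped for this event
except ControlFlowError as err:
    print(err)
```

The checks cost two boolean writes and a comparison per event, which is consistent with the slide's point that they are lightweight.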
PASC runtime
[Figure: the PASC runtime. Labels: event handlers EH1, EH2, EH3 and process state (user-defined); PASC library, PASC checks and replica state (transparent).]
github.com/yahoo/pasc
Evaluation
Hardening an echo server
- Little computation, network bound, no overhead
- PBFT is used as a reference (Nysiad is not available)
[Figure: latency in ms versus throughput in Kops/s for PBFT, PASC Echo, and unprotected Echo.]
Hardening State Machine Replication
[Figure: latency in ms versus throughput in Kops/s for PBFT, PASC Paxos, and unprotected Paxos, annotated with “+70 %” and “-15 %”.]
ZooKeeper
[Figure: latency in ms versus throughput in Kops/s for PASC ZooKeeper and unprotected ZooKeeper.]
Memory overhead
[Figure: memory usage in MB versus request size in bytes (0, 1K, 4K) for unprotected Paxos and PASC Paxos.]
Scalability
- SimpleKV: eventually consistent store, no replication
- Scales similarly with hardening
- No server “wasted” for replication
[Figure: maximum throughput in Kops/s versus number of servers (1, 3, 5, 7) for PASC sKV and unprotected sKV.]
PASC fault coverage
- Injected random bit flips in Paxos
  - Code corruptions: bytecode and binary code
  - State corruptions: pointers and primitive values

| | Code corruptions, Unprot. | Code corruptions, PASC | State corruptions, Unprot. | State corruptions, PASC |
|---|---|---|---|---|
| Undet. | 3 | 0 | 93 | 0 |
| Det. | - | 1 | - | 330 |
| Crash | 1640 | 1663 | 2301 | 2066 |
| Total | 2856 | 2856 | 5237 | 5237 |
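To put the undetected counts in proportion, the share of injected faults that went undetected can be computed directly from the table. This is plain arithmetic on the slide's numbers; the dictionary layout and the `undetected_rate` helper are introduced here for illustration only.

```python
# Undetected faults and total injections, copied from the table above.
results = {
    "code corruptions, unprotected": {"undetected": 3, "total": 2856},
    "code corruptions, PASC": {"undetected": 0, "total": 2856},
    "state corruptions, unprotected": {"undetected": 93, "total": 5237},
    "state corruptions, PASC": {"undetected": 0, "total": 5237},
}


def undetected_rate(row: dict) -> float:
    """Fraction of injected faults that were neither detected nor crashes."""
    return row["undetected"] / row["total"]


for name, row in results.items():
    print(f"{name}: {100 * undetected_rate(row):.2f}% undetected")
```

The listed rows do not add up to the totals, so the remaining injections presumably fall into an outcome the slide does not show; the sketch therefore reports only the undetected share.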
Conclusions
- Hardware data corruptions occur in data centers
  - Model as Arbitrary State Corruptions
- ASC-hardening algorithm for error isolation
  - Local: does not require replication
- PASC: ASC-hardening library
  - Efficient: PASC-Paxos has up to 70% more throughput than PBFT
  - High fault coverage
Thank you
serafini@yahoo-inc.com
github.com/yahoo/pasc