evolve or die - sigcommconferences.sigcomm.org/sigcomm/2016/files/program/...ramesh govindan, ina...
TRANSCRIPT
![Page 1: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/1.jpg)
Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure
Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat
… and a cast of hundreds at Google
Network availability is the biggest challenge facing large content and
cloud providers today
Why?
The push towards higher 9s of availability
At four 9s availability
❖ Outage budget is 4 mins per month
At five 9s availability
❖ Outage budget is 24 seconds per month
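The outage budgets quoted on this slide follow from simple arithmetic, sketched below. The function name `outage_budget_seconds` and the 30-day month are assumptions made for illustration. Under them, four 9s allows about 4.3 minutes per month and five 9s about 26 seconds, so the slide's "4 mins" and "24 seconds" appear to be rounded figures.

```python
def outage_budget_seconds(nines: int, days: int = 30) -> float:
    """Allowed downtime, in seconds, over `days` days at `nines` nines of availability."""
    unavailability = 10.0 ** (-nines)  # e.g. four 9s -> 0.0001 of the time may be down
    return days * 24 * 3600 * unavailability


for n in (3, 4, 5):
    budget = outage_budget_seconds(n)
    print(f"{n} nines: {budget / 60:.2f} min ({budget:.1f} s) per 30-day month")
```

Each additional 9 divides the budget by ten, which is why recovery has to be fast once a failure starts.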
How do providers achieve these levels?
By learning from failures
What has Google Learnt from Failures?
Paper’s Focus
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
Why is high network availability a challenge?
Velocity of Evolution
Scale
Management Complexity
Evolution
Network hardware evolves continuously
[Figure: Capacity vs. Time, with labels 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
Evolution
So does network software
[Figure: timeline from 2006 to 2014, with labels Google Global Cache, Watchtower, B4, BwE, Jupiter, gRPC, Freedome, QUIC, Andromeda]
Evolution
New hardware and software can
❖ Introduce bugs
❖ Disrupt existing software
Result: Failures!
Scale and Complexity
[Figure: B2, B4, Data centers, Other ISPs]
Scale and Complexity
Design Differences
B4 and Data Centers
❖ Use merchant silicon chips
❖ Centralized control planes
B2
❖ Vendor gear
❖ Decentralized control plane
Scale and Complexity
Design Differences
These differences increase management complexity and pose availability challenges
The Management Plane
Management Plane Software
Manages network evolution
Management Plane Operations
Connect a new data center to B2 and B4
Upgrade B4 or data center control plane software
Drain or undrain links, switches, routers, services
❖ Drain: temporarily remove from service
Many operations require multiple steps and can take hours or days
The Management Plane
Low-level abstractions for management operations
❖ Command-line interfaces to high-capacity routers
A small mistake by an operator can impact a large part of the network
Why is high network availability a challenge?
What are the characteristics of network availability failures?
Duration, Severity, Prevalence
Root-cause Categorization
Key Takeaway
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
We analyzed over 100 post-mortem reports written over a 2-year period
What is a Post-mortem?
Carefully curated description of a previously unseen failure that had significant availability impact
Helps learn from failures
Blame-free process
What a Post-Mortem Contains
Description of failure, with detailed timeline
Root cause(s) confirmed by reproducing the failure
Discussion of fixes, follow-up action items
Failure Examples and Impact
Examples
❖ Entire control plane fails
❖ Upgrade causes backbone traffic shift
❖ Multiple top-of-rack switches fail
Impact
❖ Data center goes offline
❖ WAN capacity falls below demand
❖ Several services fail concurrently
Key Quantitative Results
70% of failures occur when a management plane operation is in progress
❖ Evolution impacts availability
Failures are everywhere: all three networks and all three planes see comparable failure rates
❖ No silver bullet
80% of failure durations are between 10 and 100 minutes
❖ Need fast recovery
Root causes
Lessons learned from root causes motivate availability design principles
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
Re-Think Management Plane
Avoid and Mitigate Large Failures
Evolve or Die
Re-think the Management Plane
Availability Principle
Operator types wrong CLI command, runs wrong script
Backbone router fails
Minimize Operator Intervention
Availability Principle
To upgrade part of a large device…
❖ Line card, block of Clos fabric
… proceed while rest of device carries traffic
❖ Enables higher availability
Necessary for upgrade-in-place
Availability Principle
Ensure residual capacity > demand
Early risk assessments were manual
Risky!
High packet loss
Assess risk continuously
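The condition "residual capacity > demand" can be sketched as a check that is re-run while an operation is in progress, not only before it starts. Everything in this example is hypothetical: the `Link` type, the function names, the link names and the traffic numbers are illustrative assumptions, not Google's tooling.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Link:
    name: str
    capacity_gbps: float


def residual_capacity(links: List[Link], drained: Iterable[str]) -> float:
    """Capacity left once the links named in `drained` are out of service."""
    out = set(drained)
    return sum(link.capacity_gbps for link in links if link.name not in out)


def safe_to_proceed(links: List[Link], drained: Iterable[str], demand_gbps: float) -> bool:
    """The slide's condition: residual capacity must exceed demand."""
    return residual_capacity(links, drained) > demand_gbps


links = [Link(f"link{i}", 100.0) for i in range(4)]

# Before the operation: draining one of four links leaves 300 Gb/s for 250 Gb/s of demand.
print(safe_to_proceed(links, ["link0"], demand_gbps=250.0))
# Mid-operation, demand has grown to 320 Gb/s: the same drain is no longer safe.
print(safe_to_proceed(links, ["link0"], demand_gbps=320.0))
```

A one-time manual assessment would only have seen the first situation. Re-evaluating the check as demand and capacity change is what makes the assessment continuous.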
Re-think the Management Plane
“Intent”: I want to upgrade this router
Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation
Re-think the Management Plane
Management Operations, Device Configurations, Tests to Verify Operation
Management Plane Run-time
❖ Apply Configuration
❖ Perform management operation
❖ Verify operation
Assess Risk Continuously
Automated Risk Assessment
Minimize Operator Intervention
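One way to picture the run-time on this slide is a loop that checks risk before every step, performs the step, and verifies it before moving on. This is a minimal sketch under stated assumptions: `Step`, `run_plan` and the drain/upgrade/undrain plan are hypothetical names, and the slides do not describe the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    apply: Callable[[], None]    # apply configuration / perform the operation
    verify: Callable[[], bool]   # test that the operation took effect


def run_plan(steps: List[Step], risk_ok: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Run steps in order; halt at the first failed risk check or verification."""
    log: List[str] = []
    for step in steps:
        if not risk_ok():                      # assess risk before every step
            log.append(f"halt before {step.name}: risk check failed")
            return False, log
        step.apply()
        if not step.verify():
            log.append(f"halt after {step.name}: verification failed")
            return False, log
        log.append(f"done {step.name}")
    return True, log


# Hypothetical plan for the intent "I want to upgrade this router".
state = {"drained": False, "version": 1}
plan = [
    Step("drain", lambda: state.update(drained=True), lambda: state["drained"]),
    Step("upgrade", lambda: state.update(version=2), lambda: state["version"] == 2),
    Step("undrain", lambda: state.update(drained=False), lambda: not state["drained"]),
]
ok, log = run_plan(plan, risk_ok=lambda: True)
print(ok, log)
```

The operator supplies only the intent. The individual steps, their order and their checks come from the plan, which leaves fewer places for a mistyped command.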
Avoid and Mitigate Large Failures
Availability Principle
B4 and data centers have a dedicated control-plane network
❖ Failure of this can bring down the entire control plane
Fail open
Contain failure radius
Fail Open
[Diagram: Centralized Control Plane, Data center, Traffic]
Preserve forwarding state of all switches
❖ Fail-open the entire data center
Exceedingly tricky!
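The difference between failing open and failing closed can be shown with a toy switch model. The `Switch` class below is a hypothetical illustration, not a real switch API. It leaves out what makes this "exceedingly tricky" in practice, such as preserved state going stale while the controller is away.

```python
from typing import Dict, Optional


class Switch:
    """Toy switch whose forwarding table is programmed by a central controller."""

    def __init__(self, fail_open: bool = True) -> None:
        self.fail_open = fail_open
        self.table: Dict[str, int] = {}   # destination prefix -> output port
        self.controller_up = True

    def install(self, prefix: str, port: int) -> None:
        if not self.controller_up:
            raise RuntimeError("no controller: cannot program new state")
        self.table[prefix] = port

    def controller_lost(self) -> None:
        self.controller_up = False
        if not self.fail_open:
            self.table.clear()            # fail closed: forget everything
        # fail open: keep the last known forwarding state untouched

    def forward(self, prefix: str) -> Optional[int]:
        return self.table.get(prefix)     # None means the packet is dropped


open_sw, closed_sw = Switch(fail_open=True), Switch(fail_open=False)
for sw in (open_sw, closed_sw):
    sw.install("10.0.0.0/8", 3)
    sw.controller_lost()

print(open_sw.forward("10.0.0.0/8"), closed_sw.forward("10.0.0.0/8"))
```

In this model the fail-open switch keeps carrying traffic on its last known state while the control plane is down, and the fail-closed switch drops it.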
Availability Principle
A bug can cause state inconsistency between control plane components ➔ Capacity reduction in WAN or data center
Design fallback strategies
Design Fallback Strategies
A large section of the WAN fails, so demand exceeds capacity
[Diagram: B4]
Design Fallback Strategies
Fallback to B2!
Can shift large traffic volumes from many data centers
[Diagram: B4, B2]
Design Fallback Strategies
When centralized traffic engineering fails...
❖ … fall back to IP routing
Big Red Buttons
❖ For every new software upgrade, design controls so the operator can initiate fallback to a “safe” version
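The two fallbacks on this slide can be combined in one small decision function: an automatic fallback when traffic engineering is unhealthy, and an operator-controlled "big red button" that forces the safe mode regardless. The class and field names are hypothetical, and the sketch assumes that IP routing is the "safe" version.

```python
from dataclasses import dataclass


@dataclass
class ForwardingPolicy:
    te_healthy: bool = True        # is centralized traffic engineering working?
    big_red_button: bool = False   # operator-initiated fallback to the "safe" mode

    def mode(self) -> str:
        # The button wins over everything, so the fallback never depends on
        # the health signal of the component being bypassed.
        if self.big_red_button or not self.te_healthy:
            return "ip-routing"
        return "traffic-engineering"


policy = ForwardingPolicy()
print(policy.mode())            # normal operation
policy.te_healthy = False
print(policy.mode())            # automatic fallback
policy = ForwardingPolicy(te_healthy=True, big_red_button=True)
print(policy.mode())            # operator pressed the button during a bad rollout
```

The button is checked first because a new release may misbehave while still reporting itself healthy.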
Evolve or Die!
We cannot treat a change to the network as an exceptional
event
Evolve or Die
Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes
Conclusion
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
Older Slides
Popular root-cause categories
Cabling error, interface card failure, cable cut…
Popular root-cause categories
Operator types wrong CLI command, runs wrong script
Popular root-cause categories
Incorrect demand or capacity estimation for upgrade-in-place
Upgrade in place
Assessing Risk Correctly
Residual Capacity?
❖ Varies by interconnect
Demand?
❖ Can change dynamically
Popular root-cause categories
Hardware or link layer failures in control plane network
Popular root-cause categories
Two control plane components have inconsistent views of control plane state, caused by a bug
Popular root-cause categories
Running out of memory, CPU, OS resources (threads)...
Lessons from Failures
The role of evolution in failures
▸ Rethink the Management Plane
The prevalence of large, severe failures
▸ Prevent and mitigate large failures
Long failure durations
▸ Recover fast
High-level Management Plane Abstractions
“Intent”: I want to upgrade this router
Why is this difficult? Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines
High-level Management Plane Abstractions
“Intent”: I want to upgrade this router
Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation
Management Plane Automation
Management Plane Software
❖ Management Operations
❖ Device Configurations
❖ Tests to Verify Operation
Apply Configuration
Perform management operation
Verify operation
Assess Risk Continuously
Minimize Operator Intervention
Large Control Plane Failures
[Diagram: Centralized Control Plane]
Contain the blast radius
[Diagram: two Centralized Control Plane instances]
Smaller failure impact, but increased complexity
Fail-Open
[Diagram: Centralized Control Plane]
Preserve forwarding state of all switches
❖ Fail-open the entire fabric
Defensive Control-Plane Design
[Diagram: Gateway, Topology Modeler, TE Server, BwE]
One piece of this large update seems wrong!!
Trust but Verify
[Diagram: Gateway, Topology Modeler, TE Server, BwE]
Let me check the correctness of the update...
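"Trust but verify" can be sketched as a component that sanity-checks an update from an upstream component and keeps its last known-good state if the update looks wrong. The checks below (no negative capacities, no more than half the links disappearing at once) and all names are assumptions made for illustration. The slides do not say which checks the real components perform.

```python
from typing import Dict, List

Topology = Dict[str, float]   # link name -> capacity in Gb/s (hypothetical model)


def check_update(current: Topology, update: Topology, max_removed: float = 0.5) -> List[str]:
    """Return a list of problems found in `update`; an empty list means accept."""
    problems: List[str] = []
    for link, capacity in update.items():
        if capacity < 0:
            problems.append(f"{link}: negative capacity {capacity}")
    removed = set(current) - set(update)
    if current and len(removed) / len(current) > max_removed:
        problems.append(f"update removes {len(removed)} of {len(current)} links")
    return problems


def accept_or_keep(current: Topology, update: Topology) -> Topology:
    """Adopt the update only if it passes the checks; otherwise keep the old state."""
    return update if not check_update(current, update) else current


current = {"a": 100.0, "b": 100.0, "c": 100.0, "d": 100.0}
plausible = {"a": 100.0, "b": 100.0, "c": 100.0, "d": 40.0}
suspicious = {"a": 100.0}     # three of four links vanish in one update

print(accept_or_keep(current, plausible) is plausible)
print(accept_or_keep(current, suspicious) is current)
```

Rejecting a suspicious update trades freshness for safety, since a stale but sane view is usually the smaller risk.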
Fallback to B2
[Diagram: Gateway, Topology Modeler, TE Server, BwE, B2]
Mitigating Large Failures
Design Fallback Strategies
▸ B4 ➔ B2
▸ Tunneling ➔ IP routing
▸ Big Red Buttons
Continuously Monitor Invariants
❖ Must have one functional backup SDN controller
❖ Anycast route must have AS path length of 3
❖ Data center must peer with two B2 routers
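The three invariants on this slide can be written as predicates over a snapshot of network state and evaluated on every monitoring cycle. The `Snapshot` fields, router names and AS numbers are hypothetical. Reading "one backup" and "two B2 routers" as minimums is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Snapshot:
    backup_controllers_up: int     # functional backup SDN controllers
    anycast_as_path: List[int]     # AS path seen for the anycast route
    b2_peers: List[str]            # B2 routers this data center peers with


INVARIANTS: Dict[str, Callable[[Snapshot], bool]] = {
    "one functional backup SDN controller": lambda s: s.backup_controllers_up >= 1,
    "anycast route has AS path length of 3": lambda s: len(s.anycast_as_path) == 3,
    "data center peers with two B2 routers": lambda s: len(set(s.b2_peers)) >= 2,
}


def violated(snapshot: Snapshot) -> List[str]:
    """Names of the invariants that do not hold for this snapshot."""
    return [name for name, holds in INVARIANTS.items() if not holds(snapshot)]


# AS numbers below come from the range reserved for documentation.
healthy = Snapshot(1, [64500, 64501, 64502], ["b2-r1", "b2-r2"])
degraded = Snapshot(0, [64500, 64501, 64502, 64503], ["b2-r1"])

print(violated(healthy))
print(violated(degraded))
```

Checking invariants continuously can surface a lost backup or a missing peering before the next failure turns it into an outage.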
This Alone isn’t Enough...
We cannot treat a change to the network as an exceptional
event
Evolve or Die
Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes
Key Takeaway
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
Impact of Availability Failures
A Case Study: Google
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
The velocity of evolution is fueled by traffic growth...
71
![Page 72: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/72.jpg)
… and by an increase in product and service offerings
72
![Page 73: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/73.jpg)
Networks have very different designs
Different hardware
Different control planes
Different forwarding paradigms
These differences increase management and evolution complexity
73
![Page 74: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/74.jpg)
Data centers
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network
[Figure label: Control plane network]
74
SIGCOMM 2015
![Page 75: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/75.jpg)
B4
[Figure labels: Gateway, Topology Modeler, TE Server, BwE]
❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering
75
SIGCOMM 2015
SIGCOMM 2013
![Page 76: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/76.jpg)
B2
❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities
[Figure label: Other ISPs]
76
![Page 77: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/77.jpg)
The Management Plane
Low-level, per-device abstractions for management operations
77
![Page 78: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/78.jpg)
Where do failures happen?
No single network or plane dominates
78
![Page 79: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/79.jpg)
How long do the failures last?
Durations much longer than outage budgets
Shorter failures on B2
79
![Page 80: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/80.jpg)
What role does evolution play?
70% of failures happen when a management operation is in progress
80
![Page 81: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/81.jpg)
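Where do failures happen?
[Figure: numeric labels on a network diagram that includes the control plane network; the mapping from values to locations is not recoverable from the extraction]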
81
![Page 82: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/82.jpg)
Failures are everywhere
82
![Page 83: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/83.jpg)
Across networks
[Figure labels: "All three", repeated five times]
83
![Page 84: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/84.jpg)
Across planes
[Figure labels: Data, Management, Data, Data, Control, Management, Management]
84
![Page 85: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/85.jpg)
Root-Cause Categorization
What are the root causes for these failures?
85
![Page 86: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/86.jpg)
Rethink the Management Plane
Low-level network management cannot ensure high availability
86
![Page 87: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/87.jpg)
Rethink the Management Plane
“Intent”: I want to upgrade this router
Lots of complexity hidden below this statement
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files which run into thousands of lines
87
![Page 88: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/88.jpg)
Contain failure radius
Centralized Control Plane
88
Centralized Control Plane
Each partition managed by different control plane
Adds design complexity
Even if one partition fails, others can carry traffic
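The effect of partitioning on failure radius can be shown with a little arithmetic. This is a toy model rather than anything from the talk: the `Partition` class, the partition names, and the capacity figures are assumptions chosen for illustration, and it treats a partition whose control plane has failed as carrying no traffic.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str                    # hypothetical partition label
    capacity_gbps: float         # illustrative capacity, not a real figure
    controller_up: bool = True   # state of this partition's own control plane


def fraction_lost(partitions):
    """Share of total capacity lost to partitions whose control plane is down."""
    total = sum(p.capacity_gbps for p in partitions)
    surviving = sum(p.capacity_gbps for p in partitions if p.controller_up)
    return 1.0 - surviving / total


# One control plane for everything versus four independent partitions.
single = [Partition("all", 400.0)]
split = [Partition(f"p{i}", 100.0) for i in range(4)]

# Fail exactly one control plane in each design.
single[0].controller_up = False
split[0].controller_up = False

print(fraction_lost(single))  # 1.0: the whole network is affected
print(fraction_lost(split))   # 0.25: three partitions still carry traffic
```

The same single failure takes out everything in the unpartitioned design but only a quarter of the capacity in the partitioned one, which is the trade the slide describes: more design complexity in exchange for a smaller failure radius.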
![Page 89: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/89.jpg)
![Page 90: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/90.jpg)
By learning from failures
90
![Page 91: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/91.jpg)
What has Google Learnt from Failures?
Why is high network availability a challenge?
▸ Factors that impact availability
What are the characteristics of network failures?
▸ Severity, duration, prevalence
▸ Root-cause categorization
What design principles can achieve high availability?
▸ Lessons learned from root-causes
91
![Page 92: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/92.jpg)
In a global network
Failures are common
Configuration can change
These can impact network availability
[Figure: three data centers]
92
![Page 93: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/93.jpg)
How long does it take...
… to root-cause a failure: 10s of minutes to hours
… to upgrade part of the network: Hours to days
[Figure: three data centers]
93
![Page 94: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/94.jpg)
Outage budgets...
… for four 9s availability (99.99% uptime)? 4 minutes per month
… for five 9s availability (99.999% uptime)? 24 seconds per month
94
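The outage budgets follow directly from the availability target. The sketch below works the arithmetic under the assumption of a 30-day month; the function name `outage_budget_seconds` is illustrative. The slide's figures appear to be rounded, since the exact values for a 30-day month come out slightly higher.

```python
def outage_budget_seconds(nines: int, days_in_month: float = 30.0) -> float:
    """Allowed downtime per month, in seconds, for an availability of N nines."""
    unavailability = 10.0 ** (-nines)            # four 9s -> 0.0001
    seconds_in_month = days_in_month * 24 * 60 * 60
    return unavailability * seconds_in_month


for nines in (4, 5):
    budget = outage_budget_seconds(nines)
    print(f"{nines} nines: {budget:.2f} s per month ({budget / 60:.2f} min)")
```

For a 30-day month this gives about 259 seconds (4.32 minutes) at four 9s and about 26 seconds at five 9s, in line with the roughly 4 minutes and 24 seconds quoted on the slide.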
![Page 95: Evolve or Die - SIGCOMMconferences.sigcomm.org/sigcomm/2016/files/program/...Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat ... Session02-Paper01-Evolve-Ramesh-Slides](https://reader034.vdocuments.us/reader034/viewer/2022051511/601d5a208b73cc216576655b/html5/thumbnails/95.jpg)
To move towards higher availability targets, it is important to learn from failures
95