The Case for Chaos
![Page 1: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/1.jpg)
The Case for Chaos – AWS Pop-up Loft
Bruce Wong – Engineering Manager – Chaos Engineering, Netflix
![Page 2: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/2.jpg)
Who am I?
Bruce Wong
@bruce_m_wong
![Page 3: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/3.jpg)
![Page 4: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/4.jpg)
![Page 5: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/5.jpg)
Who am I?
Bruce Wong
Netflix since 2010
Computer Science
Builds Engineering Teams
5 different teams so far
![Page 6: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/6.jpg)
Agenda
Why?
Case Studies
How you can start chaos testing
Future chaos
![Page 7: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/7.jpg)
Failure is Unavoidable
Disks Fail
Power fails, and then your generator fails
Software bugs
Human Error
![Page 8: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/8.jpg)
What about the cloud?
![Page 9: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/9.jpg)
Cloud Case Study
XSA-108 Security Vulnerability
~10% of EC2 instances rebooted
Spread over 5 days
One Availability Zone at a time
![Page 10: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/10.jpg)
Chaos Validated + Public Cloud Validated
![Page 11: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/11.jpg)
Netflix & Micro-Services
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
![Page 12: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/12.jpg)
Netflix & Micro-Services
![Page 13: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/13.jpg)
![Page 14: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/14.jpg)
![Page 15: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/15.jpg)
![Page 16: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/16.jpg)
![Page 17: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/17.jpg)
Graceful Degradation
Product + Engineering Decision
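As a rough illustration of graceful degradation, the sketch below wraps a dependency call with a fallback. The function names (`with_fallback`, `personalized_rows`, `popular_rows`) and the returned data are hypothetical and are not taken from the talk or from any Netflix library; what the fallback returns is the product and engineering decision the slide refers to.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary() if it succeeds, otherwise the degraded fallback()."""
    try:
        return primary()
    except Exception:
        # A failing dependency degrades the experience
        # instead of failing the whole request.
        return fallback()


def personalized_rows() -> List[str]:
    # Hypothetical dependency call that is currently failing.
    raise TimeoutError("recommendation service unavailable")


def popular_rows() -> List[str]:
    # Hypothetical static content served when personalization is down.
    return ["Popular on Netflix"]


rows = with_fallback(personalized_rows, popular_rows)
```

Here `rows` holds the degraded but still useful result, so the page renders even though the personalization call failed.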
![Page 18: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/18.jpg)
Designing for Failure
Infrastructure Failure
• Instance terminations – single points of failure
• Latency
• Availability Zone
• Regional
Application Failure
• Graceful degradation
• Software Bugs
![Page 19: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/19.jpg)
Testing
Unit testing
Integration testing
Functional testing
Regression testing
Chaos Testing
Finding bugs earlier
![Page 20: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/20.jpg)
Resilience needs to be tested
Testing is hard
Large and growing data sets
Internet-scale traffic
Innovation and New features
Change is constant
![Page 21: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/21.jpg)
Resilience needs to be tested
Validate resilience design
Don’t wait for the next outage
• Uncontrolled
• Unpredictable
Hope is not a strategy
![Page 22: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/22.jpg)
Types of Chaos
Instances Fail
Lessons
• Be as stateless as possible
• Autoscaling groups are good
• Invest in automation to rebuild state when necessary
• Running Chaos Monkey on C* (Cassandra)
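The idea of terminating instances at random can be sketched in a few lines. This is not the actual Chaos Monkey code: `pick_victims`, the group names, and the instance IDs are hypothetical, and a real tool would call the cloud provider's API to terminate the chosen instances rather than just return them.

```python
import random
from typing import Dict, List


def pick_victims(groups: Dict[str, List[str]], rng: random.Random,
                 probability: float = 1.0) -> Dict[str, str]:
    """Choose at most one instance per group to terminate."""
    victims: Dict[str, str] = {}
    for name in sorted(groups):
        instances = groups[name]
        if not instances:
            continue  # nothing to terminate in an empty group
        # Each group is hit independently with the given probability.
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical group and instance names, for illustration only.
groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "web": ["i-web-1", "i-web-2"],
    "batch": [],
}
victims = pick_victims(groups, random.Random(42))
```

Passing the random generator in explicitly keeps a run reproducible, which helps when a termination has to be correlated with what happened afterwards.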
![Page 23: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/23.jpg)
Types of Chaos
Many Instances can Fail
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management tools can be a bottleneck
![Page 24: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/24.jpg)
Types of Chaos
Natural Disasters Happen
Lessons
• Cassandra works as expected
• Moving Traffic back to steady state is just as hard
• Infrastructure Management can be a bottleneck
• Smaller Blast-Radius Benefits
• Traffic + Capacity orchestration is hard
![Page 25: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/25.jpg)
Types of Chaos
Latency
Still Learning
• Functional fallbacks don’t account for system limitations
• Thread pools
• Connection pools
• Slow can be hard to find
• Slow can be hard to contain
• Unbounded Queues are BAD
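The first lesson above can be made concrete with a small sketch, assuming a dependency that is called through a bounded thread pool. The function `slow_dependency`, the pool size, and the timings are hypothetical. The caller gets its fallback on time, but the worker thread stays occupied by the slow call, which is the system limitation a purely functional fallback does not account for.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_dependency(injected_latency_s: float) -> str:
    # Hypothetical dependency with artificially injected latency.
    time.sleep(injected_latency_s)
    return "fresh"


pool = ThreadPoolExecutor(max_workers=2)
future = pool.submit(slow_dependency, 0.3)

try:
    result = future.result(timeout=0.05)
except FutureTimeout:
    # The functional fallback works from the caller's point of view...
    result = "fallback"

# ...but the worker is still tied up by the slow call.
still_busy = not future.done()

pool.shutdown(wait=True)
```

With enough slow calls in flight, every worker in the pool ends up in this state and new requests queue behind them, even though each individual caller received a fallback.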
![Page 26: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/26.jpg)
Unbounded Queues
Come in many forms, to name a few:
• Threads
• Memory
• Disk
Bounded by physical limitations
VERY difficult to find
Elastic is not Infinite
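One common alternative to an unbounded queue is to give the queue an explicit limit and reject work once it is full. The sketch below uses Python's standard `queue.Queue` with `maxsize`; the `BoundedInbox` class and the limit of 3 are hypothetical choices for illustration.

```python
import queue
from typing import List


class BoundedInbox:
    """Queue with an explicit limit that sheds load instead of growing."""

    def __init__(self, max_items: int) -> None:
        self._q: "queue.Queue[str]" = queue.Queue(maxsize=max_items)
        self.rejected = 0

    def offer(self, item: str) -> bool:
        """Try to enqueue without blocking; report whether it was accepted."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            # The limit is explicit, so overload is visible and countable.
            self.rejected += 1
            return False

    def size(self) -> int:
        return self._q.qsize()


inbox = BoundedInbox(max_items=3)
accepted: List[bool] = [inbox.offer(f"req-{i}") for i in range(5)]
```

The rejected count gives a signal to alert on, whereas an unbounded queue only fails when it reaches a physical limit such as memory or disk.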
![Page 27: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/27.jpg)
For Example: Memory and Data
Data is important
In-Memory Queue grows and shrinks
Failure Mode #1 – Out of memory
NOT A MEMORY LEAK!
![Page 28: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/28.jpg)
For Example: Memory and Data
Data is important
If the queue gets to size X
• Write to disk
• Flush later
Failure Mode #2
• Disk full
• File descriptors saturated
![Page 29: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/29.jpg)
For Example: Memory and Data
Data is important
…
But not as important as uptime
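If data matters less than uptime, one possible design is a fixed-size buffer that discards the oldest entries when it is full and counts what it dropped. This is a sketch under that assumption, not the approach described in the talk; `LossyBuffer` and its capacity are hypothetical.

```python
from collections import deque
from typing import Deque, List


class LossyBuffer:
    """Fixed-size buffer that drops the oldest event when full."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[int] = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, event: int) -> None:
        if len(self._items) == self._items.maxlen:
            # deque(maxlen=...) evicts the oldest item on append;
            # count it so the data loss stays visible.
            self.dropped += 1
        self._items.append(event)

    def drain(self) -> List[int]:
        """Return the buffered events, oldest first, and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items


buf = LossyBuffer(capacity=3)
for event_id in range(5):
    buf.append(event_id)
remaining = buf.drain()
```

Memory use is capped by `capacity`, so neither of the two failure modes above can occur; the cost is that the oldest events are lost under sustained load.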
![Page 30: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/30.jpg)
Starting Chaos
Start small, very small.
Start with simple, stateless systems
Start manually, in a coordinated way
Failure Injection Fridays
Build confidence
Outages are opportunities
![Page 31: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/31.jpg)
Chaos takes time
2010
2012
2014
![Page 32: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/32.jpg)
Aspirational Chaos
Increase Frequency & Intensity
• Reduces chance of drift
Infrastructure
• Continuous Latency injection
• Chaos Gorilla random AZ weekly
• Latency Gorilla
• CPU, Memory, Disk
Application
• Continuous Validation of fallbacks
• Startup dependency failure injection
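Latency injection at the application level can be sketched as a wrapper around a dependency call. The helper `inject_latency` and its parameters are hypothetical and do not represent an existing Netflix tool; the `sleep` argument is injectable so the behavior can be checked without actually waiting.

```python
import random
import time
from typing import Any, Callable, List


def inject_latency(call: Callable[..., Any], delay_s: float,
                   probability: float, rng: random.Random,
                   sleep: Callable[[float], None] = time.sleep
                   ) -> Callable[..., Any]:
    """Wrap call so that a fraction of invocations is delayed."""

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < probability:
            sleep(delay_s)  # the injected latency
        return call(*args, **kwargs)

    return wrapped


# Record the injected delays instead of sleeping, for illustration.
delays: List[float] = []
lookup = inject_latency(lambda key: key.upper(), delay_s=0.2,
                        probability=1.0, rng=random.Random(0),
                        sleep=delays.append)
value = lookup("title")
```

Lowering `probability` limits the blast radius of an experiment, in line with the advice to start small.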
![Page 33: The Case for Chaos](https://reader031.vdocuments.us/reader031/viewer/2022020123/55a6a86d1a28ab056b8b4574/html5/thumbnails/33.jpg)
Questions