The Case for Chaos

The Case for Chaos – AWS Pop-up Loft – Bruce Wong, Engineering Manager, Chaos Engineering, Netflix

Upload: bruce-wong

Post on 16-Jul-2015


TRANSCRIPT

Page 1: The Case for Chaos

The Case for Chaos – AWS Pop-up Loft

Bruce Wong – Engineering Manager – Chaos Engineering, Netflix


Page 5: The Case for Chaos

Who am I?

Bruce Wong

Netflix since 2010

Computer Science

Builds Engineering Teams

5 different teams so far

@bruce_m_wong

Page 6: The Case for Chaos

Agenda

Why?

Case Studies

How you can start chaos testing

Future chaos


Page 7: The Case for Chaos

Failure is Unavoidable

Disks Fail

Power outages. And your generator fails

Software bugs

Human Error


Page 8: The Case for Chaos

What about the cloud?


Page 9: The Case for Chaos

Cloud Case Study

XSA-108 Security Vulnerability

~10% of EC2 instances rebooted

Spread over 5 days

One Availability Zone at a time

Page 10: The Case for Chaos

Chaos Validated + Public Cloud Validated


Page 11: The Case for Chaos

Netflix & Micro-Services


http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html


Page 17: The Case for Chaos

Graceful Degradation

Product + Engineering Decision
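As an illustration of graceful degradation, here is a minimal sketch in which a failing dependency is replaced by a static fallback rather than failing the whole request. Every name in it (`with_fallback`, `POPULAR_TITLES`, the two recommendation functions) is hypothetical and does not come from the talk or from any Netflix library.

```python
from typing import Callable, List

# Hypothetical static fallback: a non-personalized list served when the
# personalization dependency is unavailable.
POPULAR_TITLES: List[str] = ["popular-1", "popular-2", "popular-3"]


def with_fallback(primary: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Return the primary result, or a copy of the fallback if the call raises."""
    try:
        return primary()
    except Exception:
        # Degrade instead of failing the whole request.
        return list(fallback)


def healthy_recommendations() -> List[str]:
    """Stand-in for a working personalization call."""
    return ["personalized-1", "personalized-2"]


def broken_recommendations() -> List[str]:
    """Stand-in for the same call while the dependency is down."""
    raise ConnectionError("recommendation service unavailable")
```

What to show in the degraded case is the product half of the decision; the wrapper is only the engineering half.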

Page 18: The Case for Chaos

Designing for Failure

Infrastructure Failure

• Instance terminations – single points of failure

• Latency

• Availability Zone

• Regional

Application Failure

• Graceful degradation

• Software bugs

Page 19: The Case for Chaos

Testing

Unit testing

Integration testing

Functional testing

Regression testing

Chaos Testing

Finding bugs earlier

Page 20: The Case for Chaos

Resilience needs to be tested

Testing is hard

Large and growing data sets

Internet-scale traffic

Innovation and New features

Change is constant

Page 21: The Case for Chaos

Resilience needs to be tested

Validate resilience design

Don’t wait for the next outage

Uncontrolled

Unpredictable

Hope is not a strategy

Page 22: The Case for Chaos

Types of Chaos

Instances Fail

Lessons

• Be as stateless as possible

• Autoscaling groups are good

• Invest in automation to rebuild state when necessary

• Running Chaos Monkey on C* (Cassandra)
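The instance-failure lessons can be sketched with a toy, in-memory model: one randomly chosen instance is terminated, and the group is then brought back to its desired size. This is not Chaos Monkey's actual implementation and it calls no AWS API; `ToyAutoScalingGroup` and `unleash_monkey` are hypothetical names used only for illustration.

```python
import random
from typing import List


class ToyAutoScalingGroup:
    """In-memory stand-in for an autoscaling group (not a real AWS API)."""

    def __init__(self, desired: int) -> None:
        self.desired = desired
        self._next_id = 0
        self.instances: List[str] = []
        self.reconcile()

    def reconcile(self) -> None:
        """Launch replacements until the group is back at desired capacity."""
        while len(self.instances) < self.desired:
            self.instances.append(f"i-{self._next_id:04d}")
            self._next_id += 1

    def terminate(self, instance_id: str) -> None:
        """Remove one instance, as a failure or a chaos experiment would."""
        self.instances.remove(instance_id)


def unleash_monkey(group: ToyAutoScalingGroup, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim
```

A replacement instance starts with no local state, which is why the stateless and automation lessons matter: anything the victim held only in memory or on local disk is gone.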

Page 23: The Case for Chaos

Types of Chaos

Many Instances can Fail

Lessons

• Cassandra works as expected

• Moving Traffic back to steady state is just as hard

• Infrastructure Management tools can be a bottleneck

Page 24: The Case for Chaos

Types of Chaos

Natural Disasters Happen

Lessons

• Cassandra works as expected

• Moving Traffic back to steady state is just as hard

• Infrastructure Management can be a bottleneck

• Smaller Blast-Radius Benefits

• Traffic + Capacity orchestration is hard

Page 25: The Case for Chaos

Types of Chaos

Latency

Still Learning

• Functional fallbacks don’t account for system limitations

• Thread pools

• Connection pools

• Slow can be hard to find

• Slow can be hard to contain

• Unbounded Queues are BAD
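To make the latency lessons concrete, the sketch below injects a delay into a stand-in dependency and calls it through a fixed-size thread pool with a timeout. It uses only the Python standard library; the pool size, the delays, and the names `slow_dependency` and `call_with_timeout` are assumptions chosen for illustration, not details from the talk.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# A small, fixed-size pool: slow calls can occupy at most this many threads.
POOL = ThreadPoolExecutor(max_workers=4)


def slow_dependency(delay_s: float) -> str:
    """Stand-in for a remote call with injected latency."""
    time.sleep(delay_s)
    return "fresh"


def call_with_timeout(delay_s: float, timeout_s: float, fallback: str) -> str:
    """Call the dependency, returning the fallback if it exceeds the timeout."""
    future = POOL.submit(slow_dependency, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting, but the worker thread stays busy until
        # the slow call finishes; this is how pools saturate under latency.
        return fallback
```

The fallback here is functionally correct yet does nothing to free the worker thread, which is the system limitation the slide points at. `ThreadPoolExecutor` also queues pending submissions without a bound, so a fuller version would need to cap in-flight calls as well.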

Page 26: The Case for Chaos

Unbounded Queues

Come in many forms, to name a few:

Threads

Memory

Disk

Bounded by physical limitations

VERY difficult to find

Elastic is not Infinite
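One way to avoid an unbounded queue is to give it an explicit limit and reject work once the limit is reached. The sketch below uses the standard library's `queue.Queue` with `maxsize`; the helper name `try_enqueue` and the limit used in the example are hypothetical.

```python
import queue
from typing import Any


def try_enqueue(q: "queue.Queue[Any]", item: Any) -> bool:
    """Add an item without blocking; return False if the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        # An explicit, visible rejection instead of silent growth until
        # threads, memory, or disk run out.
        return False
```

For example, with `q = queue.Queue(maxsize=2)` the third call to `try_enqueue` returns `False`. A rejection like this is a signal that can be counted and alerted on, whereas a queue bounded only by physical limits fails much later and is harder to diagnose.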

Page 27: The Case for Chaos

For Example: Memory and Data

Data is important

In-Memory Queue grows and shrinks

Failure Mode #1 – Out of memory

NOT A MEMORY LEAK!

Page 28: The Case for Chaos

For Example: Memory and Data

Data is important

If Queue gets to size X

Write to disk

Flush later

Failure Mode #2

Disk Full

File Descriptors Saturated

Page 29: The Case for Chaos

For Example: Memory and Data

Data is important

But not as important as uptime
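If uptime matters more than keeping every record, one possible policy is a fixed-capacity buffer that evicts the oldest entry when full and counts what it dropped. This is a sketch of that idea under assumed names (`SheddingBuffer`, `dropped`, `drain`); the talk does not say which policy was actually used.

```python
from collections import deque
from typing import Any, Deque, List


class SheddingBuffer:
    """Fixed-capacity buffer that evicts the oldest item when full."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[Any] = deque(maxlen=capacity)
        self.dropped = 0  # number of items evicted to stay within capacity

    def add(self, item: Any) -> None:
        if len(self._items) == self._items.maxlen:
            # The append below will evict the oldest item; record the loss.
            self.dropped += 1
        self._items.append(item)

    def drain(self) -> List[Any]:
        """Return all buffered items, oldest first, and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items
```

Memory use is bounded by `capacity`, so neither failure mode from the previous slides can occur here; the cost is data loss, which the `dropped` counter at least makes measurable.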

Page 30: The Case for Chaos

Starting Chaos

Start small, very small.

Start with simple, stateless systems

Start manually and in a coordinated way

Failure Injection Fridays

Build confidence

Outages are opportunities


Page 31: The Case for Chaos

Chaos takes time


2010

2012

2014

Page 32: The Case for Chaos

Aspirational Chaos

Increase Frequency & Intensity

Reduces chance of drift

Infrastructure

• Continuous Latency injection

• Chaos Gorilla random AZ weekly

• Latency Gorilla

• CPU, Memory, Disk

Application

• Continuous Validation of fallbacks

• Startup dependency failure injection
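The application-level items could be checked with a small failure-injection harness: mark a dependency as failing, start the service, and assert that it still comes up on its fallback. The sketch below is hypothetical throughout (`FailureInjector`, `start_service`, and the `config-store` dependency are invented for illustration) and is not a description of Netflix tooling.

```python
from typing import Callable, Dict, Set


class InjectedFailure(RuntimeError):
    """Raised in place of a real dependency call during a chaos experiment."""


class FailureInjector:
    """Routes dependency calls and fails the ones marked for injection."""

    def __init__(self) -> None:
        self.failing: Set[str] = set()

    def call(self, name: str, fn: Callable[[], str]) -> str:
        if name in self.failing:
            raise InjectedFailure(name)
        return fn()


def start_service(injector: FailureInjector) -> Dict[str, str]:
    """Load startup config, degrading to a default if the config store is down."""
    try:
        theme = injector.call("config-store", lambda: "theme-from-config")
    except InjectedFailure:
        theme = "default-theme"
    return {"status": "up", "theme": theme}
```

Running a check like this on every build, rather than once, is one way to keep a fallback from drifting out of working order as the code changes.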

Page 33: The Case for Chaos

Questions
