The Case for Chaos

Post on 16-Jul-2015

Category:

Technology

TRANSCRIPT

The Case for Chaos – AWS Pop-up Loft

Bruce Wong – Engineering Manager – Chaos Engineering, Netflix

Who am I?

Bruce Wong

Netflix since 2010

Computer Science

Builds Engineering Teams

5 different teams so far

@bruce_m_wong

Agenda

Why?

Case Studies

How you can start chaos testing

Future chaos

Failure is Unavoidable

Disks Fail

Power outages. And your generator fails

Software bugs

Human Error

What about the cloud?

Cloud Case Study

XSA-108 Security Vulnerability

~10% of EC2 instances rebooted

Spread over 5 days

One Availability Zone at a time

Chaos Validated + Public Cloud Validated

Netflix & Micro-Services

http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Graceful Degradation

Product + Engineering Decision
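A minimal sketch of what graceful degradation can look like in code, assuming a page that prefers personalized results but can still render a generic list. The names `popular_titles`, `personalized_titles`, and the two recommender stubs are hypothetical and are not taken from the talk.

```python
def popular_titles():
    # Hypothetical static fallback: a non-personalized list.
    return ["Title A", "Title B", "Title C"]


def personalized_titles(user_id, recommender):
    """Return personalized results, degrading to a generic list on failure."""
    try:
        return recommender(user_id)
    except Exception:
        # Degrade the feature instead of failing the whole page.
        return popular_titles()


def broken_recommender(user_id):
    # Stand-in for a dependency that is down or timing out.
    raise TimeoutError("recommendation service unavailable")


def healthy_recommender(user_id):
    # Stand-in for a dependency that answers normally.
    return [f"Pick for {user_id}"]
```

What the fallback returns is the product half of the decision; where the `try` boundary sits is the engineering half.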

Designing for Failure

Infrastructure Failure

Instance terminations – single points of failure

Latency

Availability Zone

Regional

Application Failure

Graceful degradation

Software Bugs

Testing

Unit testing

Integration testing

Functional testing

Regression testing

Chaos Testing

Finding bugs earlier

Resilience needs to be tested

Testing is hard

Large and growing data sets

Internet-scale traffic

Innovation and New features

Change is constant

Resilience needs to be tested

Validate resilience design

Don’t wait for the next outage

Uncontrolled

Unpredictable

Hope is not a strategy

Types of Chaos

Instances Fail

Lessons

• Be as stateless as possible

• Autoscaling groups are good

• Invest in automation to rebuild state when necessary

• Running Chaos Monkey on Cassandra (C*)
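The idea of random instance termination can be sketched in a few lines. This is an illustration only, not the actual Chaos Monkey implementation: `pick_victim`, `unleash`, and the `terminate` callback are hypothetical names, and a real run would call a cloud API where this sketch calls `terminate`.

```python
import random


def pick_victim(instances, rng=random):
    """Choose one instance at random from a group, or None if it is empty."""
    return rng.choice(instances) if instances else None


def unleash(groups, terminate, rng=random):
    """Terminate one random instance per group; return what was terminated.

    `groups` maps a group name to a list of instance ids. `terminate` is a
    caller-supplied callback (an assumption of this sketch).
    """
    terminated = []
    for name, instances in groups.items():
        victim = pick_victim(instances, rng)
        if victim is not None:
            terminate(victim)
            terminated.append((name, victim))
    return terminated
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which helps when starting manually and in a coordinated way.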

Types of Chaos

Many Instances can Fail

Lessons

• Cassandra works as expected

• Moving Traffic back to steady state is just as hard

• Infrastructure Management tools can be a bottleneck

Types of Chaos

Natural Disasters Happen

Lessons

• Cassandra works as expected

• Moving Traffic back to steady state is just as hard

• Infrastructure Management can be a bottleneck

• Smaller Blast-Radius Benefits

• Traffic + Capacity orchestration is hard

Types of Chaos

Latency

Still Learning

• Functional fallbacks don’t account for system limitations

• Thread pools

• Connection pools

• Slow can be hard to find

• Slow can be hard to contain

• Unbounded Queues are BAD
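One way to keep a slow dependency from exhausting a thread or connection pool is to cap concurrent calls and reject the overflow immediately instead of queueing it. The sketch below assumes that approach; the `Bulkhead` class is a hypothetical name and not an API from any Netflix library.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to a dependency; reject instead of queueing."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Non-blocking acquire: when every permit is taken, fail fast
        # and return the fallback rather than waiting in line.
        if not self._permits.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._permits.release()
```

Exceptions raised by `func` still propagate to the caller; the sketch limits concurrency only and does not handle errors or timeouts.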

Unbounded Queues

Come in many forms, to name a few

Threads

Memory

Disk

Bounded by physical limitations

VERY difficult to find

Elastic is not Infinite

For Example: Memory and Data

Data is important

In-Memory Queue grows and shrinks

Failure Mode #1 – Out of memory

NOT A MEMORY LEAK!

For Example: Memory and Data

Data is important

If Queue gets to size X

Write to disk

Flush later

Failure Mode #2

Disk Full

File Descriptors Saturated

For Example: Memory and Data

Data is important

But not as important as uptime
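If uptime outranks data, one option is a queue with a hard size limit that drops new items when full, so it can run out of neither memory nor disk. The sketch below assumes that trade-off; `SheddingBuffer`, `offer`, and `drain` are hypothetical names.

```python
import queue


class SheddingBuffer:
    """A bounded in-memory buffer that drops new items when it is full."""

    def __init__(self, max_items):
        if max_items <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, so refuse it.
            raise ValueError("max_items must be positive")
        self._q = queue.Queue(maxsize=max_items)
        self.dropped = 0  # count of shed items, useful as a metric

    def offer(self, item):
        """Enqueue without blocking; drop and count the item when full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Remove and return everything currently buffered."""
        items = []
        while True:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                return items
```

Counting drops keeps the data loss visible, so the limit can be tuned rather than discovered during an outage.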

Starting Chaos

Start small, very small.

Start simple, stateless systems

Start manually, with coordination

Failure Injection Fridays

Build confidence

Outages are opportunities

Chaos takes time

2010

2012

2014

Aspirational Chaos

Increase Frequency & Intensity

Reduces chance of drift

Infrastructure

Continuous Latency injection

Chaos Gorilla random AZ weekly

Latency Gorilla

CPU, Memory, Disk

Application

Continuous Validation of fallbacks

Startup dependency failure injection
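Latency and failure injection at the application level can be approximated by wrapping a dependency call. This is a sketch under assumptions, not the tooling described in the talk: `with_chaos` and its parameters `delay_s`, `failure_rate`, `rng`, and `sleep` are hypothetical.

```python
import random
import time


def with_chaos(func, delay_s=0.0, failure_rate=0.0, rng=random, sleep=time.sleep):
    """Wrap func so each call may be delayed and may raise an injected error.

    delay_s: seconds of added latency before every call.
    failure_rate: probability in [0, 1] that a call raises ConnectionError.
    rng and sleep are injectable so tests can run fast and deterministically.
    """
    def wrapper(*args, **kwargs):
        if delay_s > 0:
            sleep(delay_s)
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper
```

Wrapping a client call with `failure_rate=1.0` during startup is one way to check that a service still boots when that dependency is unavailable.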

Questions
