twilio signal 2016 chaos patterns
TRANSCRIPT
CHAOS PATTERNS
BRUCE M. WONG | @BRUCE_M_WONG
LESSONS ABOUT FAILING WELL AND FAILING OFTEN
FAILURE HAPPENS
“EVERYTHING FAILS ALL THE TIME” - WERNER VOGELS, CTO, AMAZON WEB SERVICES
HTTP://THENEXTWEB.COM/2008/04/04/WERNER-VOGELS-EVERYTHING-FAILS-ALL-THE-TIME/
THE ORIGINAL CHAOS MONKEY
CREATED BY NETFLIX CLOUD ARCHITECT, GREG ORZELL - @CHAOSSIMIA 2010
HTTPS://WWW.LINKEDIN.COM/IN/GORZELL
A STATE OF XEN
AWS EC2 REBOOT, 2014
HTTP://XENBITS.XEN.ORG/XSA/ADVISORY-108.HTML
HTTP://TECHBLOG.NETFLIX.COM/2014/10/A-STATE-OF-XEN-CHAOS-MONKEY-CASSANDRA.HTML
HTTP://AWS.AMAZON.COM/BLOGS/AWS/EC2-MAINTENANCE-UPDATE/
22 COMPLETE NODE FAILURES
2700+ C* NODES, 218 REBOOTS
0 DOWNTIME
LESSON # 1 : TRUST YOUR RESILIENCE
SLOW IS HARD
UNBOUNDED QUEUES - ELASTIC ISN’T INFINITE
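One way to read this slide: a queue with no size limit hides overload until memory or latency gives out, while a bounded queue makes the overload visible and lets the caller shed load. The following is a minimal sketch under that reading, not something shown in the talk. `BoundedWorkQueue`, `offer`, and the capacity of 3 are hypothetical names and values chosen for illustration; only Python's standard `queue` module is used.

```python
import queue


class BoundedWorkQueue:
    """Hypothetical wrapper: a fixed-capacity queue that rejects work when full."""

    def __init__(self, capacity: int) -> None:
        # maxsize > 0 makes queue.Queue bounded; 0 would mean unbounded.
        self._items: queue.Queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def offer(self, item) -> bool:
        """Try to enqueue without blocking; count and report rejections."""
        try:
            self._items.put_nowait(item)
            return True
        except queue.Full:
            # Shed load instead of letting the backlog grow without limit.
            self.rejected += 1
            return False

    def depth(self) -> int:
        return self._items.qsize()


work = BoundedWorkQueue(capacity=3)
accepted = [work.offer(n) for n in range(5)]
print(accepted, work.depth(), work.rejected)
# prints: [True, True, True, False, False] 3 2
```

The rejection count is the useful signal here: it can be exported as a metric so that overload shows up as an explicit number instead of as slowly rising latency.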
LATENCY MONKEY
LATENCY TESTING 2.0 - FIT
HTTP://TECHBLOG.NETFLIX.COM/2014/10/FIT-FAILURE-INJECTION-TESTING.HTML
SLOW IS HARD
START SLOW
•ACCOUNT LEVEL
•+10MS BEFORE +100MS
•+1% ERRORS BEFORE +80% ERRORS
DIAL IT UP
•A -> D NOT * -> D
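The bullets above can be sketched as a small fault plan: limit the blast radius to one account and one caller-to-callee edge (A -> D, not every caller of D), and step through mild stages before harsh ones. This is an illustrative sketch only, not the Latency Monkey or FIT API. `FaultStage`, `FaultScope`, `plan_fault`, and the account name are hypothetical; the 10 ms, 100 ms, 1%, and 80% values come from the slide.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultStage:
    added_latency_ms: int
    error_rate: float  # fraction of in-scope calls to fail, 0.0 to 1.0


# Start slow, then dial it up. The two-stage list itself is an assumption.
RAMP = [
    FaultStage(added_latency_ms=10, error_rate=0.01),
    FaultStage(added_latency_ms=100, error_rate=0.80),
]


@dataclass(frozen=True)
class FaultScope:
    """Hypothetical scope: one account, one caller -> callee edge."""
    account_id: str
    caller: str
    callee: str

    def matches(self, account_id: str, caller: str, callee: str) -> bool:
        return (account_id, caller, callee) == (
            self.account_id, self.caller, self.callee)


def plan_fault(scope, stage, account_id, caller, callee, rng):
    """Return (delay_ms, should_fail) for one call; out-of-scope calls are untouched."""
    if not scope.matches(account_id, caller, callee):
        return 0, False
    return stage.added_latency_ms, rng.random() < stage.error_rate


scope = FaultScope(account_id="test-account", caller="A", callee="D")
rng = random.Random(0)  # seeded so repeated runs make the same decisions

in_scope = plan_fault(scope, RAMP[0], "test-account", "A", "D", rng)
out_of_scope = plan_fault(scope, RAMP[1], "test-account", "B", "D", rng)
print(in_scope[0], out_of_scope)
# prints: 10 (0, False)
```

The function only decides what to inject; applying the delay or raising the error is left to whatever intercepts the call, which keeps the scoping logic easy to test on its own.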
LESSON # 2 : FIXING ONE FAILURE MODE EXPOSES NEW ONES
WHAT’S SO SPECIAL ABOUT CHAOS?
CHAOS IS A CHOICE
OUTAGES VS CHAOS
OUTAGES | CHAOS
Uncontrolled | Controlled
Unpredictable | Scheduled
Time to Detect: Minutes | Time to Detect: 0
Time to Resolve: ???? | Time to Resolve: seconds*
Analysis Time: ???? | Root Cause Analysis: Intentional
MYTH OF RESILIENCE
NATION’S BUSINESS, 1977
LATENCY MONKEY
LESSON # 3 : THE CULTURE ASPECTS OF CHAOS ARE HARD
MOST ENTERPRISES HIRE PEOPLE TO FIX THINGS. NETFLIX HIRES PEOPLE TO BREAK THINGS….
…WE SHOULD EMBRACE NETFLIX'S CULTURE OF "CHAOS ENGINEERING" THROUGHOUT ORGANIZATIONS OF ALL SHAPES AND SIZES.
SEEK PROGRESS OVER PERFECTION
TWILIO LEADERSHIP PRINCIPLE
GAME DAYS - BENEFITS
•Training New Engineers
•Discover Instrumentation gaps
•New Product Launches
•Incident Management Practices
GAME DAYS - THE SETUP
•Two “on-call” teams
•Separate rooms, separate Slack channels
•Master of Disaster
•Incident Commander
LEVERAGE EXISTING TESTBOTS
•Functionally test fallback code
•Early warning!
•Existing Integrations with Telemetry, PagerDuty, Slack
•Incorporate into Canary process (FUTURE)
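"Functionally test fallback code" could look like the sketch below: a testbot swaps in a primary dependency that always fails and checks that the caller still returns the fallback result. This is an assumed shape, not the testbots described in the talk. `DependencyDown`, `lookup_profile`, and the stub functions are hypothetical names.

```python
class DependencyDown(Exception):
    """Stands in for whatever error a real dependency outage would raise."""


def lookup_profile(user_id, primary, fallback):
    """Call the primary dependency; on failure, serve the fallback instead."""
    try:
        return primary(user_id)
    except DependencyDown:
        return fallback(user_id)


def failing_primary(user_id):
    # Injected fault: the primary dependency is always unavailable.
    raise DependencyDown("injected failure")


def default_profile(user_id):
    # Degraded but usable response.
    return {"user_id": user_id, "name": "unknown", "degraded": True}


def testbot_check_fallback() -> bool:
    """Return True if the fallback path produces the expected degraded response."""
    result = lookup_profile("user-42", failing_primary, default_profile)
    return result == {"user_id": "user-42", "name": "unknown", "degraded": True}


print(testbot_check_fallback())
# prints: True
```

A check like this can run on a schedule and report through the same telemetry and paging integrations the testbots already use, so a broken fallback is noticed before a real outage depends on it.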
RECAP
Lesson # 1 : Trust your resilience
Lesson # 2 : Fixing one failure mode exposes new ones
Lesson # 3 : The culture aspects of Chaos are HARD
Get started today!
Game Days are your friend - do them early and often
Testbots + focus on developer productivity
WHEN YOU WISH UPON A BLUE MOON