Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013
DESCRIPTION
This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.
TRANSCRIPT
![Page 1: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/1.jpg)
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability
Neil Hunt, Netflix
November 13, 2013
![Page 2: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/2.jpg)
Are You Designing Systems That Are: • Web-scale • Global • Highly-available • Consumer-facing
• Cloud Native
![Page 3: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/3.jpg)
Cloud Native • Service oriented architecture • Redundancy • Statelessness • NoSQL • Eventual consistency
![Page 4: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/4.jpg)
Assumptions
[Quadrant chart with axes Speed and Scale]
Quadrants: Slowly Changing Large Scale • Rapid Change Large Scale • Slowly Changing Small Scale • Rapid Change Small Scale
Operating assumptions: Everything works • Everything is broken • Hardware will fail • Software will fail
Examples: Enterprise IT • Telcos • Startups • Web-Scale
![Page 5: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/5.jpg)
Netflix Cloud Goals: Availability, Scale, Performance
![Page 6: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/6.jpg)
Performance • Reduce session start by 1s
Save 1 human lifetime per day! Win more moments of truth
• Suggest choices 1% better 500k hours/day additional value delivered
![Page 7: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/7.jpg)
Scale • 50% y/y traffic growth • 50 Countries, 3 continents • Tens of thousands of instances at peak • 4 AWS regions, 12 datacenters • ~$.001 per start
![Page 8: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/8.jpg)
Availability • Aspire to 4 x nines (99.99% of starts successful) • Per Quarter:
– Downtime: < 3 mins (peak time) – Successful starts: 9.999B – Failures: 1M (frustration, calls, lost business)
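The per-quarter failure count follows from simple arithmetic. A minimal sketch, assuming roughly 10 billion attempted starts per quarter (the sum of the 9.999B successes and 1M failures quoted on the slide); the function and variable names are illustrative only:

```python
from fractions import Fraction


def failure_budget(total_starts: int, target: Fraction) -> int:
    """Number of failed starts allowed while still meeting the availability target."""
    return int(total_starts * (1 - target))


# Four nines, written as an exact fraction to avoid floating-point rounding.
FOUR_NINES = Fraction(9999, 10000)

# Assumed volume: 9.999B successful starts plus 1M failures per quarter.
starts_per_quarter = 10_000_000_000

allowed_failures = failure_budget(starts_per_quarter, FOUR_NINES)
print(allowed_failures)  # 1000000
```

Even at four nines, the budget still leaves a million failed starts per quarter at this volume, which is why the slide spells out the cost of each one.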
![Page 9: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/9.jpg)
Availabilities Compound
Chain of N dependencies, each 99.99% available: 99.99% × 99.99% × 99.99% × … = (99.99%)^N
N service dependencies: availability
2: .9998
10: .999
100: .99
1000: .9
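The compounding on this slide can be reproduced in a few lines. This is a sketch under the simplifying assumption that dependencies fail independently and that every one of them must succeed for a request to succeed; the function name is illustrative:

```python
def compound_availability(per_dependency: float, n: int) -> float:
    """Availability of a chain of n dependencies that must all succeed,
    assuming their failures are independent."""
    return per_dependency ** n


# Each dependency is 99.99% available, as on the slide.
for n in (2, 10, 100, 1000):
    print(n, round(compound_availability(0.9999, n), 4))
```

With 1000 chained dependencies the combined figure falls to roughly 0.9, in line with the last row of the slide's table.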
![Page 10: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/10.jpg)
Availabilities Compound
To achieve 99.99% availability with 1000 components requires:
• 99.9999% availability for each dependency (component failure leads to system failure)
or
• Isolation for independence (component failure leads to degradation rather than system failure)
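The first option can be estimated by inverting the compounding formula. A rough sketch, assuming a strict serial model in which all 1000 components must succeed and fail independently; the function name is hypothetical:

```python
def required_per_dependency(target: float, n: int) -> float:
    """Per-dependency availability needed so that n chained, independent
    dependencies together meet the target availability."""
    return target ** (1.0 / n)


needed = required_per_dependency(0.9999, 1000)
print(f"{needed:.7%}")
```

Under this strict model the requirement comes out at least as demanding as the 99.9999% quoted on the slide, which supports the slide's point: without isolation, every component has to be close to perfect.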
![Page 11: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/11.jpg)
Availability, Scale, Performance Are Not Enough!
![Page 12: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/12.jpg)
Rapid Iteration – Rate of Change • Running tests • Rolling out tests
– Engineering the winning test experience for scale
• Adding features • Scaling up • Removing features, simplifying, minimizing
![Page 13: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/13.jpg)
Testing • Up to 1,000 changes per day!
![Page 14: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/14.jpg)
Rate of Change • Change leads to bugs
– New features – New configurations – New types of inputs – Scaling up
• Availability is in tension with rate of change
![Page 15: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/15.jpg)
Availability / Rate of Change Tradeoff
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000), showing the frontier of availability/change]
![Page 16: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/16.jpg)
![Page 17: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/17.jpg)
Shifting the Curve…
[Chart: Availability (99% to 99.999%) plotted against Rate of Change (1 to 1000)]
![Page 18: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/18.jpg)
Shifting the Curve • Must break the chained dependencies that compound in cascading system failure
• Subsystem isolation: – Failure in one component should never result in cascading system failure
![Page 19: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/19.jpg)
Isolating Subsystems Redundant systems with timeout & failover • Failure of instance • Failure of network • Latency monkey to test
[Diagram: Dependent System calls its Dependence with a timeout]
![Page 20: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/20.jpg)
Isolating Subsystems Redundant systems with timeout & failover • Failure of instance • Failure of network • Latency monkey to test
[Diagram: Higher Tier System, Dependent System, and Dependence; labels: Short timeout, Longer timeout]
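Timeout with failover across redundant instances might look roughly like the sketch below. It is not Netflix's implementation: `call_with_failover` and the replica functions are hypothetical, and a thread pool stands in for real network calls. The `time.sleep` in the slow replica imitates the kind of injected delay that a latency-testing tool would add.

```python
import concurrent.futures as cf
import time


def call_with_failover(replicas, request, timeout_s):
    """Try each redundant replica in turn, abandoning any that exceeds timeout_s."""
    last_error = None
    # One worker per replica so a hung call never blocks the next attempt.
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    try:
        for replica in replicas:
            future = pool.submit(replica, request)
            try:
                return future.result(timeout=timeout_s)
            except Exception as exc:  # includes cf.TimeoutError
                last_error = exc      # too slow or failed: try the next replica
    finally:
        # Do not wait for abandoned calls; they finish in the background.
        pool.shutdown(wait=False)
    raise RuntimeError("all replicas failed or timed out") from last_error


def slow_replica(request):
    time.sleep(0.3)  # simulated latency, far above the timeout used below
    return f"slow:{request}"


def healthy_replica(request):
    return f"ok:{request}"


result = call_with_failover([slow_replica, healthy_replica], "GET /profile", timeout_s=0.05)
print(result)  # ok:GET /profile
```

A caller one tier up would presumably need a longer timeout than `timeout_s`, so that the lower tier has time to time out and fail over before the caller itself gives up.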
![Page 21: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/21.jpg)
Isolating Subsystems Timeout with fallback default response • Network failure • Software bug
[Diagram: Dependent System calls its Dependence; on timeout it uses a default response]
Timeout & Default response: { status=mem, plan=4, device=true }
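A fallback default can be sketched as a wrapper that returns a canned answer whenever the dependency is slow or raises. The default below copies the slide's example values; reading them as a dictionary, along with the function names, is an assumption made for illustration.

```python
import concurrent.futures as cf
import time

# Canned response from the slide; the field meanings are not given in the source.
DEFAULT_RESPONSE = {"status": "mem", "plan": 4, "device": True}


def call_with_fallback(dependency, request, timeout_s, default=DEFAULT_RESPONSE):
    """Return the dependency's answer, or a copy of the default if it is slow or fails."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(dependency, request)
    try:
        return future.result(timeout=timeout_s)
    except Exception:  # timeout, network error, or a bug in the dependency
        return dict(default)
    finally:
        pool.shutdown(wait=False)  # do not block on the abandoned call


def hung_dependency(request):
    time.sleep(0.3)  # simulated hang, longer than the timeout used below
    return {"status": "real"}


fallback = call_with_fallback(hung_dependency, "member-info", timeout_s=0.05)
print(fallback)
```

The caller degrades to a plausible default instead of failing outright, so a fault in the dependency stays contained.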
![Page 22: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/22.jpg)
Isolating Subsystems Canary Push • Network failure • Software bug
[Diagram: Dependent System calls its Dependence with a timeout; one canary instance runs the new code]
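One way to picture a canary decision is to compare the canary instance's error rate with the existing fleet's before rolling the new code out further. The thresholds and function name below are hypothetical, and the source does not describe the actual criteria Netflix uses.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   max_ratio=1.5, min_requests=100, error_floor=0.001):
    """Decide whether to promote or roll back new code running on a canary instance.

    All thresholds are illustrative assumptions, not values from the talk.
    """
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # Tolerate a small absolute error rate even when the baseline is near zero.
    threshold = max(baseline_rate * max_ratio, error_floor)
    return "promote" if canary_rate <= threshold else "rollback"


print(canary_verdict(10, 10_000, 1, 1_000))   # promote
print(canary_verdict(10, 10_000, 50, 1_000))  # rollback
print(canary_verdict(10, 10_000, 0, 10))      # wait
```

Because only one instance runs the new code, a bad push affects a small share of requests while the verdict is pending.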
![Page 23: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/23.jpg)
Isolating Subsystems Red/Black deployment • Software bugs
[Diagram: Dependent System calls Dependence V2.3 (bad code pushed) and fails back to Dependence V2.2 (old code)]
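The core of a red/black push is that the old version stays up while the new one takes traffic, so failing back is only a routing change. A toy sketch using the slide's version labels; `RedBlackRouter` is a hypothetical class, not a real deployment tool:

```python
class RedBlackRouter:
    """Keeps the previous version running so traffic can be switched back quickly."""

    def __init__(self, versions, active):
        self.versions = dict(versions)  # version label -> request handler
        self.active = active
        self.previous = None

    def push(self, label, handler):
        """Bring up a new version alongside the old one, then route traffic to it."""
        self.versions[label] = handler
        self.previous, self.active = self.active, label

    def fail_back(self):
        """Route traffic back to the previous version, which was never torn down."""
        if self.previous is None:
            raise RuntimeError("no previous version to fail back to")
        self.active, self.previous = self.previous, None

    def handle(self, request):
        return self.versions[self.active](request)


def bad_v23(request):
    raise ValueError("bug in new code")


router = RedBlackRouter({"V2.2": lambda request: "ok from V2.2"}, active="V2.2")
router.push("V2.3", bad_v23)
try:
    router.handle("play")
except ValueError:
    router.fail_back()  # bad code pushed: return to the old version

print(router.active, router.handle("play"))  # V2.2 ok from V2.2
```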
![Page 24: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/24.jpg)
Isolating Subsystems Standby Blue system • Independent implementation • Simplified logic
[Diagram: Dependent System calls Dependence V2.3 and fails to a static reference implementation]
![Page 25: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/25.jpg)
Isolating Subsystems Zone isolation • Infrastructure failure (e.g. power outage) • Chaos Gorilla
[Diagram: a Load Balancer in front of Zone A and Zone B, each containing a Dependent System and its Dependence]
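Zone isolation depends on the load balancer sending traffic only to zones that are still healthy. The toy balancer below illustrates the idea; the class and zone names are hypothetical, and marking a zone unhealthy stands in for the whole-zone outage that a zone-level chaos test would simulate.

```python
import itertools


class ZoneAwareBalancer:
    """Round-robins across zones, skipping any zone marked unhealthy."""

    def __init__(self, zones):
        self.healthy = {zone: True for zone in zones}
        self._cycle = itertools.cycle(list(zones))

    def mark(self, zone, healthy):
        self.healthy[zone] = healthy

    def pick_zone(self):
        # Look at each zone at most once per pick.
        for _ in range(len(self.healthy)):
            zone = next(self._cycle)
            if self.healthy[zone]:
                return zone
        raise RuntimeError("no healthy zones")


balancer = ZoneAwareBalancer(["zone-a", "zone-b"])
before = [balancer.pick_zone() for _ in range(4)]

balancer.mark("zone-a", False)  # simulate losing a whole zone
after = [balancer.pick_zone() for _ in range(4)]

print(before)  # ['zone-a', 'zone-b', 'zone-a', 'zone-b']
print(after)   # ['zone-b', 'zone-b', 'zone-b', 'zone-b']
```

This only works if each zone holds a complete copy of the dependent system and its dependence and has enough spare capacity to absorb the redirected traffic.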
![Page 26: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/26.jpg)
Isolating Subsystems Region isolation • Infrastructure software bugs (e.g. load balancer fail) • Chaos Kong
[Diagram: DNS in front of Region E and Region W; each region has a Load Balancer in front of Zone A and Zone B, each containing a Dependent System and its Dependence]
![Page 27: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/27.jpg)
Isolating Subsystems
Dependency Mode: Isolating Technique
• Instance failure, network failure: redundant systems with failover and timeout; timeout with default response
• Network failure, software bug: canary push; red-black deployment; blue systems
• Infrastructure failure: zone isolation
• Cross-zone software bugs: region isolation
![Page 28: Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013](https://reader033.vdocuments.us/reader033/viewer/2022051514/54975e53b4795927538b465b/html5/thumbnails/28.jpg)
Trying Harder Won’t Cut It • Trying harder gets a linear return on an exponential problem
• Need to be great at execution AND have the right architecture
• What architectural features are you using to ensure availability, scale, performance, & rapid rate of change?