![Page 1: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/1.jpg)
Cloud Native Architecture at Netflix
YOW! December 2013 (Brisbane)
Adrian Cockcroft @adrianco @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
![Page 2: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/2.jpg)
Netflix History (current size)
• 1998 DVD Shipping Service in USA (~7M users)
• 2007 Streaming video in USA (~31M users)
• International streaming video (~9M users)
– 2010 Canada
– 2011 Latin America
– 2012 UK and Ireland
– 2012 Nordics
– 2013 Netherlands
![Page 3: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/3.jpg)
Netflix Member Web Site Home Page Personalization Driven – How Does It Work?
![Page 4: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/4.jpg)
How Netflix Used to Work
Customer Device (PC, PS3, TV…)
Monolithic Web App
Oracle
MySQL
Monolithic Streaming App
Oracle
MySQL
Limelight/Level 3 Akamai CDNs
Content Management
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Datacenter
![Page 5: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/5.jpg)
How Netflix Streaming Works Today
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Datacenter
![Page 6: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/6.jpg)
![Page 7: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/7.jpg)
Netflix Scale
• Tens of thousands of instances on AWS
– Typically 4 core, 30GByte, Java business logic
– Thousands created/removed every day
• Thousands of Cassandra NoSQL nodes on AWS
– Many hi1.4xl - 8 core, 60GByte, 2TByte of SSD
– 65 different clusters, over 300TB data, triple zone
– Over 40 are multi-region clusters (6, 9 or 12 zone)
– Biggest 288 m2.4xl – over 300K rps, 1.3M wps
![Page 8: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/8.jpg)
Reactions over time
2009 “You guys are crazy! Can’t believe it”
2010 “What Netflix is doing won’t work”
2011 “It only works for ‘Unicorns’ like Netflix”
2012 “We’d like to do that but can’t”
2013 “We’re on our way using Netflix OSS code”
![Page 9: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/9.jpg)
YOW! Workshop
175 slides of Netflix Architecture
See bit.ly/netflix-workshop
A whole day…
![Page 10: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/10.jpg)
This Talk
Abstract the principles from the architecture
![Page 11: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/11.jpg)
Objectives:
Scalability
Availability
Agility
Efficiency
![Page 12: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/12.jpg)
Principles:
Immutability
Separation of Concerns
Anti-fragility
High trust organization
Sharing
![Page 13: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/13.jpg)
Outcomes:
• Public cloud – scalability, agility, sharing
• Micro-services – separation of concerns
• De-normalized data – separation of concerns
• Chaos Engines – anti-fragile operations
• Open source by default – agility, sharing
• Continuous deployment – agility, immutability
• DevOps – high trust organization, sharing
• Run-what-you-wrote – anti-fragile development
![Page 14: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/14.jpg)
When to use public cloud?
![Page 15: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/15.jpg)
![Page 16: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/16.jpg)
"This is the IT swamp draining manual for anyone who is neck deep in alligators."- Adrian Cockcroft, Cloud Architect at Netflix
![Page 17: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/17.jpg)
Goal of Traditional IT: Reliable hardware running stable software
![Page 18: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/18.jpg)
SCALE Breaks hardware
![Page 19: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/19.jpg)
….SPEED Breaks software
![Page 20: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/20.jpg)
SPEED at SCALE
Breaks everything
![Page 21: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/21.jpg)
![Page 22: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/22.jpg)
Incidents – Impact and Mitigation
• PR – X incidents – Public Relations Media Impact
• CS – XX incidents – High Customer Service Calls
• Metrics impact, feature disable – XXX incidents – Affects AB Test Results
• No impact, fast retry or automated failover – XXXX incidents
• Y incidents mitigated by Active Active, game day practicing
• YY incidents mitigated by better tools and practices
• YYY incidents mitigated by better data tagging
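The lowest-impact tier above relies on fast retry or automated failover. A minimal sketch of that idea follows, assuming each zone or region is reachable through a callable. The names `call_with_failover`, `flaky_zone` and `healthy_zone` are hypothetical and do not come from the talk or from any Netflix library.

```python
import time


def call_with_failover(endpoints, request, retries_per_endpoint=2, backoff_seconds=0.0):
    """Try each endpoint in order, retrying quickly before failing over to the next.

    `endpoints` is a list of callables (hypothetical stand-ins for zone or
    region clients); each takes `request` and returns a response or raises.
    """
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return endpoint(request)
            except Exception as error:  # a real client would catch narrower errors
                last_error = error
                if backoff_seconds:
                    time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError("all endpoints failed") from last_error


def flaky_zone(request):
    # Simulates a zone that is currently down.
    raise ConnectionError("zone A unavailable")


def healthy_zone(request):
    # Simulates a zone that answers normally.
    return f"ok:{request}"


result = call_with_failover([flaky_zone, healthy_zone], "play-title")
```

The caller sees a normal response even though the first zone failed twice, which is what keeps an incident in the no-impact tier.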
![Page 23: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/23.jpg)
Web Scale Architecture
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
UltraDNS DynECT
DNS
AWS Route53
DNS Automation
![Page 24: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/24.jpg)
Colonel Boyd, USAF
“Get inside your adversaries' OODA loop to disorient them”
![Page 25: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/25.jpg)
“Agile” vs. “Continuous Delivery”
Speed Wins
![Page 26: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/26.jpg)
Observe
Orient
Decide
Act
Territory Expansion
Competitive Moves
Customer Pain Point
Data Warehouse
Business Buy-in
2 Week Plan
Feature Priority
Code Feature
Install Capacity
Web Display Ads
Capacity Estimate
Measure Sales
2 week agile train/sprint
![Page 27: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/27.jpg)
2 Week Train Model Hand-Off Steps
Product Manager – 2 days
Developer – 2 days coding, 2 days meetings
QA Integration Team – 3 days
Operations Deploy Team – 4 days
BI Analytics Team – 1 day
![Page 28: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/28.jpg)
What’s Next?
Increase rate of change
Reduce cost and size and risk of change
![Page 29: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/29.jpg)
Cloud Native
Construct a highly agile and highly available service from ephemeral and assumed broken components
![Page 30: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/30.jpg)
Cloud Native Microservices
Start Here
memcached
Cassandra
Web service
S3 bucket
Each icon is three to a few hundred instances across three AWS zones
Each microservice is updated independently and continuously at whatever rate seems appropriate
![Page 31: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/31.jpg)
Continuous Deployment
No time for handoff to IT
![Page 32: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/32.jpg)
Developer Self Service
Freedom and Responsibility
![Page 33: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/33.jpg)
Developers run what they wrote
Root access and pagerduty
![Page 34: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/34.jpg)
IT is a Cloud API
DEVops automation
![Page 35: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/35.jpg)
Github all the things!
Leverage social coding
![Page 36: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/36.jpg)
Netflix.github.com
35 repos 36 repos 37 today
![Page 37: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/37.jpg)
Karyon Microservice Template
• Bootstrapping
o Dependency & Lifecycle management via Governator
o Service registry via Eureka
o Property management via Archaius
o Hooks for Latency Monkey injection testing
o Preconfigured status page and healthcheck servlets
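To illustrate what a service registry does for a microservice, here is a minimal in-memory sketch in which instances register, renew a lease with heartbeats, and drop out of discovery when they stop renewing. This is not the Eureka API. The class `ServiceRegistry`, the 90-second lease and the instance ids are assumptions made for the example.

```python
class ServiceRegistry:
    """In-memory registry: instances register, heartbeat, and expire."""

    def __init__(self, lease_seconds=90):
        self.lease_seconds = lease_seconds  # assumed value, chosen for illustration
        self._last_seen = {}  # (service, instance_id) -> last heartbeat time

    def heartbeat(self, service, instance_id, now):
        """Register the instance or renew its lease at time `now` (seconds)."""
        self._last_seen[(service, instance_id)] = now

    def instances(self, service, now):
        """Return instance ids of `service` whose lease has not expired."""
        return sorted(
            instance_id
            for (name, instance_id), seen in self._last_seen.items()
            if name == service and now - seen <= self.lease_seconds
        )


registry = ServiceRegistry(lease_seconds=90)
registry.heartbeat("rss-reader", "i-aaa", now=0)
registry.heartbeat("rss-reader", "i-bbb", now=0)
registry.heartbeat("rss-reader", "i-aaa", now=60)  # i-bbb stops renewing
live = registry.instances("rss-reader", now=120)
```

Passing `now` explicitly keeps the sketch deterministic. A real registry would read a clock and evict expired leases in the background.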
![Page 38: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/38.jpg)
Karyon Microservice Status Page
o Eureka discovery service metadata
o Environment variables
o JMX
o Versions
o Conformity Monkey Support
![Page 39: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/39.jpg)
Sample Application – RSS Reader
![Page 40: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/40.jpg)
Putting it all together…
![Page 41: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/41.jpg)
Observe
Orient
Decide
Act
Land grab opportunity
Competitive Move
Customer Pain Point
Analysis
JFDI
Plan Response
Share Plans
Increment Implement
Automatic Deploy
Launch AB Test
Model Hypotheses
Measure Customers
Continuous Delivery on Cloud
![Page 42: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/42.jpg)
Continuous Deploy Hand-Off
Product Manager - 2 days
A/B test setup and enable
Self service hypothesis test results
Developer – 2 days
Automated test
Automated deploy, on call
Self service analytics
![Page 43: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/43.jpg)
Continuous Deploy Automation
Check in code, Jenkins build
Bake AMI, launch in test env
Functional and performance test
Production canary test
Production red/black push
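The last step above, the red/black push, can be sketched as a traffic switch between two server groups: the new group takes traffic only after a health check, and the old group stays on standby so that a rollback is just another switch. The `LoadBalancer` class, the group names and the `is_healthy` callback are hypothetical. The sketch does not call any AWS or Netflix tooling.

```python
class LoadBalancer:
    """Toy load balancer that routes all traffic to one server group."""

    def __init__(self, active_group):
        self.active_group = active_group
        self.standby_group = None


def red_black_push(balancer, new_group, is_healthy):
    """Switch traffic to `new_group` only if it passes `is_healthy`.

    The previous group is kept on standby rather than destroyed, so a
    rollback is just another traffic switch.
    """
    if not is_healthy(new_group):
        return False  # old group keeps serving; the new one never took traffic
    balancer.standby_group = balancer.active_group
    balancer.active_group = new_group
    return True


def rollback(balancer):
    """Send traffic back to the group that was active before the last push."""
    if balancer.standby_group is None:
        raise RuntimeError("no standby group to roll back to")
    balancer.active_group, balancer.standby_group = (
        balancer.standby_group,
        balancer.active_group,
    )


balancer = LoadBalancer(active_group="api-v041")
pushed = red_black_push(balancer, "api-v042", is_healthy=lambda group: True)
```

Because instances are never modified in place, the push and the rollback are symmetric operations on immutable groups.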
![Page 44: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/44.jpg)
Bad Canary Signature
![Page 45: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/45.jpg)
Happy Canary Signature
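The two signatures above are charts in the original slides. As a rough sketch of how such a comparison could be automated, the function below judges a canary on error rate alone against the baseline fleet. The thresholds `max_ratio`, `min_requests` and the 0.1% floor are assumptions for illustration. A production canary analysis would likely score many more metrics.

```python
def canary_passes(baseline_errors, baseline_requests,
                  canary_errors, canary_requests,
                  max_ratio=1.5, min_requests=1000):
    """Return True if the canary's error rate is acceptable versus the baseline."""
    if canary_requests < min_requests or baseline_requests < min_requests:
        return False  # not enough traffic to judge; keep waiting
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Allow a small absolute floor so a near-zero baseline does not fail every canary.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return canary_rate <= allowed


# Synthetic counts, not measurements from the talk.
happy = canary_passes(50, 100_000, 6, 10_000)   # canary error rate 0.06%
bad = canary_passes(50, 100_000, 40, 10_000)    # canary error rate 0.40%
```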
![Page 46: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/46.jpg)
Global Deploy Automation
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
West Coast Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
East Coast Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Europe Load Balancers
Afternoon in California
Night-time in Europe
Next day on East Coast
Next day on West Coast
After peak in Europe
If passes test suite, canary then deploy
Canary then deploy
Canary then deploy
![Page 47: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/47.jpg)
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
Push
Autoscale Up
Autoscale Down
![Page 48: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/48.jpg)
Scryer - Predictive Auto-scaling
See techblog.netflix.com
More morning load
Sat/Sun high traffic
Lower load on Weds
24 Hours predicted traffic vs. actual
FFT based prediction driving AWS Autoscaler to plan minimum capacity
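A minimal sketch of frequency-based traffic prediction follows. It is not Scryer's algorithm. To stay dependency-free it uses a plain O(n²) discrete Fourier transform, which an FFT computes faster with the same result. It keeps the strongest frequency components, extends them into the future, and turns the forecast into a minimum instance count. The function names, the synthetic traffic and the `rps_per_instance` sizing rule are all assumptions.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform (an FFT gives the same result faster)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def predict(history, steps_ahead, keep=3):
    """Extrapolate `history` by keeping its `keep` strongest frequency pairs."""
    n = len(history)
    spectrum = dft(history)
    # Rank frequencies 1..n//2 by magnitude; frequency 0 (the mean) is always kept.
    ranked = sorted(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True)
    kept = {0}
    for k in ranked[:keep]:
        kept.add(k)
        kept.add((n - k) % n)  # conjugate partner keeps the result real
    forecast = []
    for t in range(n, n + steps_ahead):
        # Inverse transform evaluated past the end of the history window.
        value = sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in kept) / n
        forecast.append(value.real)
    return forecast


def minimum_capacity(forecast, rps_per_instance):
    """Instances needed per step to serve the forecast (hypothetical sizing rule)."""
    return [math.ceil(max(value, 0.0) / rps_per_instance) for value in forecast]


# Two days of hourly traffic with a clean 24-hour cycle (synthetic data).
history = [100 + 50 * math.sin(2 * math.pi * hour / 24) for hour in range(48)]
forecast = predict(history, steps_ahead=24, keep=1)
plan = minimum_capacity(forecast, rps_per_instance=40)
```

The plan is meant as a floor: reactive autoscaling can still add instances above it when actual traffic exceeds the prediction.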
![Page 49: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/49.jpg)
Suro Event Pipeline
1.5 Million events/s
80 Billion events/day
Cloud native, dynamic, configurable offline and realtime data sinks
Error rate alerting
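A quick arithmetic check on the two throughput figures above: 80 billion events per day averages out to roughly 926,000 events per second, so the 1.5 million events/s figure is about 1.6 times the daily average. Reading it as a peak rate is an assumption. The slide does not say which it is.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 80_000_000_000   # from the slide
quoted_rate = 1_500_000           # events/s, from the slide

average_rate = events_per_day / SECONDS_PER_DAY   # roughly 926,000 events/s
peak_to_average = quoted_rate / average_rate      # roughly 1.6
```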
![Page 50: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/50.jpg)
Inspiration
![Page 51: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/51.jpg)
Principles Revisited:
Immutability
Separation of Concerns
Anti-fragility
High trust organization
Sharing
![Page 52: Cloud Native Architecture at Netflix](https://reader034.vdocuments.us/reader034/viewer/2022042619/586b7fe51a28ab2a738be996/html5/thumbnails/52.jpg)
Takeaway
Speed Wins
Assume Broken
Cloud Native Automation
Github is the “app store” and resumé
@adrianco @NetflixOSS
http://netflix.github.com