Maintaining the Netflix Front Door - Presentation at Intuit Meetup
DESCRIPTION
This presentation goes into detail on the key principles behind the Netflix API, including design, resiliency, scaling, and deployment. Among other things, I discuss our migration from our REST API to what we call our Experience-Based API design. I also share several of our open source efforts, such as Zuul, Scryer, Hystrix, RxJava, and the Simian Army.
TRANSCRIPT
Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson
Global Streaming Video for TV Shows and Movies
More than 48 Million Subscribers
More than 40 Countries
Netflix Accounts for >34% of Peak Downstream Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month
Netflix Accounts for >6% of Peak Upstream Traffic in North America
Team Focus: Build the Best Global Streaming Product
Three aspects of the Streaming Product:
• Non-Member
• Discovery
• Streaming
The Netflix API - Background
Netflix API
Netflix API Requests by Audience at Launch in 2008
(Chart legend: Netflix Devices, Open API Developers)
Netflix API Requests by Audience from 2011
(Chart legend: Netflix Devices, Open API Developers)
Current Emphasis of Netflix API
Netflix Devices
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
APIs Do Lots of Things!
Data Gathering
Data Formatting
Data Delivery
Security
Authorization
Authentication
System Scaling
Discoverability
Data Consistency
Translations
Throttling
Orchestration
These are some of the many things APIs do.
Three are at the core: Data Gathering, Data Formatting, and Data Delivery. All others ultimately support them.
Definitions
• Data Gathering – Retrieving the requested data from one or many local or remote data sources
• Data Formatting – Preparing a structured payload for the requesting agent
• Data Delivery – Delivering the structured payload to the requesting agent
Meanwhile…
There are two players in APIs
API Provider: PROVIDES
API Consumer: CONSUMES
Traditional API Interactions
API Provider: PROVIDES EVERYTHING
API Consumer: CONSUMES WHAT IS PROVIDED
"Everything" means the API Provider does:
• Data Gathering
• Data Formatting
• Data Delivery
• (among other things)
Why do most API providers provide everything?
• API design tends to be easier for teams closer to the source
• Centralizing API functions makes them easier to support
• Many APIs have a large set of unknown and external developers
At Netflix, we see it a different way…
Separation of Concerns
To be a better provider, the API should address the separation of concerns of the three core functions.
Data Gathering
• API Consumer: Doesn't care how data is gathered, as long as it is gathered
• API Provider: Cares a lot about how the data is gathered
Data Formatting
• API Consumer: Each consumer cares a lot about the format for that specific use
• API Provider: Only cares about the format to the extent it is easy to support
Data Delivery
• API Consumer: Each consumer cares a lot about how the payload is delivered
• API Provider: Only cares about the delivery method to the extent it is easy to support
Because of our separation of concerns, the Netflix API team is able to focus on different charters
Brokering Data to 1,000+ Device Types
Screen Real Estate
Controller
Technical Capabilities
One-Size-Fits-All API
(Diagram: many Requests sent to the One-Size-Fits-All API)
Courtesy of South Florida Classical Review
Resource-Based API
vs.
Experience-Based API
Resource-Based Requests
• /users/<id>/ratings/title
• /users/<id>/queues
• /users/<id>/queues/instant
• /users/<id>/recommendations
• /catalog/titles/movie
• /catalog/titles/series
• /catalog/people
OSFA API
(Diagram labels: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Network Border)
SERVER CODE: DATA GATHERING, FORMATTING, AND DELIVERY
CLIENT CODE: USER INTERFACE RENDERING
Experience-Based Requests
• /ps3/homescreen
JAVA API
(Diagram labels: RECOMMENDATIONS, MOVIE DATA, SIMILAR MOVIES, AUTH, MEMBER DATA, A/B TESTS, START-UP, RATINGS; Groovy Layer; Network Border)
SERVER CODE: DATA GATHERING
CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER): DATA FORMATTING AND DELIVERY
CLIENT CODE: USER INTERFACE RENDERING
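To make the contrast with resource-based requests concrete, here is a minimal sketch of what a device-specific adapter for a request like /ps3/homescreen might do: gather data from several services and format one payload for that device. The slides say the adapters run in a Groovy layer; Python is used here only for illustration, and every function name, field name, and data value below is hypothetical.

```python
# Hypothetical stand-ins for backend services; names and data are illustrative only.
def get_recommendations(user_id):
    return [{"id": 1, "title": "Title A"}, {"id": 2, "title": "Title B"}]

def get_ratings(user_id):
    return {1: 4.5, 2: 3.0}

def get_queue(user_id):
    return [2]

def ps3_homescreen(user_id):
    """Device-specific adapter: gather from several services, then format
    a single payload shaped for this device's home screen."""
    # Data gathering: several service calls happen on the server side,
    # instead of one network round trip per resource from the device.
    recs = get_recommendations(user_id)
    ratings = get_ratings(user_id)
    queue = set(get_queue(user_id))
    # Data formatting: only the fields this (hypothetical) UI needs.
    return {
        "rows": [
            {
                "title": rec["title"],
                "rating": ratings.get(rec["id"]),
                "inQueue": rec["id"] in queue,
            }
            for rec in recs
        ]
    }

payload = ps3_homescreen("user-123")
```

The device makes one request and receives one payload, where the resource-based design would need separate calls for recommendations, ratings, and queues.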
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
1000+ Device Types
Dozens of Dependencies
(Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine)
Dependency Relationships
2,000,000,000 Incoming Requests Per Day to the Netflix API
30 Distinct Dependent Services for the Netflix API
~500 Dependency jars Slurped into the Netflix API
14,000,000,000 Netflix API Outbound Calls Per Day to those Dependent Services
0 Dependent Services with 100% SLA
(99.99%)^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month
(99.9%)^30 = 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime Per Month
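The arithmetic on these slides can be checked with a short sketch. It assumes the 30 dependencies fail independently and that a request succeeds only if all of them are up; the 30-day month is also an assumption. The request and service counts come from the slides.

```python
def compound_availability(per_service, n_services):
    """Availability of a request path that needs all n independent services up."""
    return per_service ** n_services

REQUESTS_PER_DAY = 2_000_000_000   # from the slides
SERVICES = 30                      # from the slides
HOURS_PER_MONTH = 30 * 24          # assumption: 30-day month

def summarize(per_service):
    overall = compound_availability(per_service, SERVICES)
    failure_rate = 1 - overall
    return {
        "overall": overall,
        "failures_per_day": failure_rate * REQUESTS_PER_DAY,
        "downtime_hours_per_month": failure_rate * HOURS_PER_MONTH,
    }

four_nines = summarize(0.9999)   # each dependency at 99.99%
three_nines = summarize(0.999)   # each dependency at 99.9%
```

Under these assumptions, four nines per dependency gives roughly 99.7% overall, about 6 million failed requests a day and a little over 2 hours of downtime a month. Dropping to three nines per dependency gives roughly 97% overall, close to 60 million failures a day and more than 20 hours a month.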
(Diagram sequence: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine)
Circuit Breaker Dashboard
Call Volume and Health / Last 10 Seconds
Call Volume / Last 2 Minutes
Successful Requests
Successful, But Slower Than Expected
Short-Circuited Requests, Delivering Fallbacks
Timeouts, Delivering Fallbacks
Thread Pool & Task Queue Full, Delivering Fallbacks
Exceptions, Delivering Fallbacks
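Error Rate = (Short-Circuited + Timeouts + Thread Pool & Task Queue Full + Exceptions) / (Successful + Short-Circuited + Timeouts + Thread Pool & Task Queue Full + Exceptions)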
Status of Fallback Circuit
Requests per Second, Over Last 10 Seconds
SLA Information
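The behavior the dashboard reports (short-circuiting and delivering fallbacks) can be illustrated with a toy circuit breaker. This is a sketch of the general pattern, not Hystrix's API or implementation: the class, its thresholds, and the failing service are hypothetical, and it leaves out rolling time windows, timeouts, thread pools, and recovery.

```python
class CircuitBreaker:
    """Toy circuit breaker: opens once the failure ratio of attempted calls
    crosses a threshold, then serves the fallback without calling the dependency."""

    def __init__(self, error_threshold=0.5, min_calls=4):
        self.error_threshold = error_threshold  # hypothetical default
        self.min_calls = min_calls              # hypothetical default
        self.successes = 0
        self.failures = 0
        self.short_circuited = 0

    def error_rate(self):
        """Failed plus short-circuited requests over all requests."""
        total = self.successes + self.failures + self.short_circuited
        return (self.failures + self.short_circuited) / total if total else 0.0

    def is_open(self):
        attempted = self.successes + self.failures
        if attempted < self.min_calls:
            return False
        return self.failures / attempted >= self.error_threshold

    def call(self, func, fallback):
        if self.is_open():
            self.short_circuited += 1   # skip the dependency entirely
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1          # exception: deliver the fallback
            return fallback()
        self.successes += 1
        return result


def failing_reviews():
    """Hypothetical dependency that is currently down."""
    raise RuntimeError("reviews service unavailable")

breaker = CircuitBreaker()
results = [breaker.call(failing_reviews, lambda: []) for _ in range(6)]
```

In this run the first four calls reach the failing dependency and return the fallback; the last two are short-circuited, so the caller still gets an empty list and the unhealthy service gets no extra load.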
(Diagram sequence: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine)
Fallback
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
Netflix API : Requests Per Month
(Chart: requests in billions, axis ticks up to 35, by month from Jan-10 to Jul-11)
50x growth in 18 months
AWS Cloud
Autoscaling
Scryer : Predictive Auto Scaling
Not yet…
Typical Traffic Patterns Over Five Days
Predicted RPS Compared to Actual RPS
Scaling Plan for Predicted Workload
What is Scryer Doing?
• Evaluating needs based on historical data
  – Week over week, month over month metrics
• Adjusting instance minimums based on algorithms
• Relying on Amazon Auto Scaling for unpredicted events
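As a rough illustration of the idea (not Scryer's actual algorithms, which these slides do not describe), the sketch below predicts load for each time slot by averaging the same slot across earlier periods, then converts the prediction into minimum instance counts. The averaging method, the headroom factor, the per-instance capacity, and the sample numbers are all assumptions.

```python
import math

def predict_rps(history, period):
    """Predict the next period by averaging the same slot across prior periods
    (for example, the same hour in each of the previous weeks)."""
    slots = [history[i:i + period]
             for i in range(0, len(history) - period + 1, period)]
    return [sum(values) / len(values) for values in zip(*slots)]

def instance_minimums(predicted_rps, rps_per_instance, headroom=1.2):
    """Turn predicted load into a per-slot minimum instance count.
    headroom is a hypothetical safety margin over the prediction."""
    return [math.ceil(rps * headroom / rps_per_instance)
            for rps in predicted_rps]

# Two prior periods of three slots each (made-up requests per second).
history = [100, 400, 200,
           120, 440, 220]
predicted = predict_rps(history, period=3)
plan = instance_minimums(predicted, rps_per_instance=50)
```

The plan only sets the floor for each slot. Reactive auto scaling would still add capacity on top of it when traffic departs from the prediction.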
Results
Results : Load Average
(Chart legend: Reactive, Predictive)
Results : Response Latencies
(Chart legend: Reactive, Predictive)
Results : Outage Recovery
Results : AWS Costs
Scaling Globally
More than 48 Million Subscribers
More than 40 Countries
Zuul
Gatekeeper for the Netflix Streaming Application
Zuul *
• Multi-Region Resiliency
• Insights
• Stress Testing
• Canary Testing
• Dynamic Routing
• Load Shedding
• Security
• Static Response Handling
• Authentication
* Most closely resembles an API proxy
All of these approaches are designed to prevent failures…
But sometimes the best way to prevent failures is to force them!
Chaos Monkey: I randomly terminate instances in production to identify dormant failures.
Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.
Chaos Kong: I simulate an outage in an AWS region.
Conformity Monkey: I find instances that don’t adhere to best practices.
Security Monkey: I extend Conformity Monkey to find security violations.
Doctor Monkey: I detect unhealthy instances and remove them from service.
Janitor Monkey: I clean up the clutter and waste that runs in the cloud.
Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
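The idea behind Latency Monkey, forcing delays and errors to see how callers cope, can be sketched with a small wrapper. This is not the Simian Army's implementation; the wrapper, the sample service, and the caller are hypothetical, and the sleep function is injectable so the example runs without actually waiting.

```python
import random
import time

def with_injected_faults(func, delay_seconds=0.0, error_probability=0.0,
                         rng=None, sleep=time.sleep):
    """Wrap func so every call is delayed and may raise an artificial error."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        sleep(delay_seconds)                    # artificial delay
        if rng.random() < error_probability:    # artificial error
            raise RuntimeError("injected fault")
        return func(*args, **kwargs)

    return wrapped

def fetch_ratings():
    """Hypothetical dependency."""
    return {"title-1": 4.5}

def caller(dependency):
    """Caller under test: falls back to an empty result if the dependency fails."""
    try:
        return dependency()
    except RuntimeError:
        return {}

slept = []  # records requested delays instead of really sleeping
always_fail = with_injected_faults(fetch_ratings, delay_seconds=0.25,
                                   error_probability=1.0, sleep=slept.append)
never_fail = with_injected_faults(fetch_ratings, sleep=slept.append)
degraded = caller(always_fail)
healthy = caller(never_fail)
```

With the fault injected on every call, the caller returns its fallback rather than propagating the error, which is the kind of response such a test is meant to confirm.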
Netflix API : Key Responsibilities
• Broker data between services and Devices
• Provide features and business logic
• Maintain a resilient front-door
• Scale the system
• Maintain high velocity
• Provide detailed insights into the system health
(Diagram: the API and its dependencies: Personalization Engine, User Info, Movie Metadata, Movie Ratings, Similar Movies, Reviews, A/B Test Engine)
Dependency Relationships
Testing Philosophy:
Act Fast, React Fast
That Doesn’t Mean We Don’t Test
Automated Delivery Pipeline
Cloud-Based Deployment Techniques
Current Code In Production
API Requests from the Internet
Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic)
Canary Analysis Automation
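A minimal sketch of what automated canary analysis could look like: compare the canary's error rate and latency against the current production code and return a verdict. The slides do not say which metrics or thresholds Netflix uses, so the metric names, thresholds, and sample numbers here are all hypothetical.

```python
def canary_verdict(baseline, canary, max_error_ratio=1.5,
                   max_latency_ratio=1.2, error_floor=0.001):
    """Return "pass" or "fail" by comparing canary metrics to the baseline.
    All thresholds are hypothetical defaults."""
    base_err = baseline["errors"] / baseline["requests"]
    canary_err = canary["errors"] / canary["requests"]
    # Tolerate a small absolute error rate so a near-zero baseline
    # does not fail every canary.
    if canary_err > max(base_err * max_error_ratio, error_floor):
        return "fail"
    if canary["p99_ms"] > baseline["p99_ms"] * max_latency_ratio:
        return "fail"
    return "pass"

# Made-up numbers: the canary receives about 1% of the traffic.
baseline = {"requests": 990_000, "errors": 990, "p99_ms": 200}
good_canary = {"requests": 10_000, "errors": 11, "p99_ms": 210}
bad_canary = {"requests": 10_000, "errors": 80, "p99_ms": 205}

good_result = canary_verdict(baseline, good_canary)
bad_result = canary_verdict(baseline, bad_canary)
```

A "fail" verdict would take the canary out of service while the current code keeps serving traffic; a "pass" would let the deployment proceed.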
Current Code In Production
API Requests from the Internet
Error!
Current Code In Production
API Requests from the Internet
Perfect!
Stress Test with Zuul
Current Code In Production
API Requests from the Internet
New Code Getting Prepared for Production
Error!
Current Code In Production
API Requests from the Internet
New Code Getting Prepared for Production
Current Code In Production
API Requests from the Internet
Perfect!
Stress Test with Zuul
Current Code In Production
API Requests from the Internet
New Code Getting Prepared for Production
API Requests from the Internet
New Code
Getting Prepared for Production
https://www.github.com/Netflix