Gluecon 2013 Netflix API Crash Course


DESCRIPTION

Presentation from Gluecon 2013 on building and running the Netflix API.

TRANSCRIPT

Netflix API Crash Course
Building & Running the API in 30 minutes

Ben Schmaus, Netflix
May 2013, Gluecon

bschmaus@netflix.com / @schmaus

Streaming TV Shows & Movies Globally

> 1000 Devices

1/3 of Internet at peak

Programmer not Distributor

More than 36 million subscribers in over 40 countries

How does the API fit into the picture?

[Diagram: the API in front of the Personalization Engine, User Info, Movie Metadata, Ratings, Similar Movies, Instant Queue, and A/B Test Engine services]


Enable UX Innovation

Insulate from Failure

> 2 Billion Requests per Day

Growth Over Time

Automation

Visibility

Operational awareness

Balance speed & quality

How's the API put together?

[Diagram: ELB and Routing Cluster in front of multiple Backend App Clusters, which together form the API Layer and call Mid-tier Services]


Inside an API

[Diagram: an App Server running RxJava and Hystrix on top of Service Client 1, Service Client 2, ... Service Client N]

[Diagram: /device/endpoint (provided script) calls an Application Service, which goes through the Hystrix + RxJava service layer to a Service Client (provided JAR) and on to the backing Service]
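To make that service layer concrete, here is a minimal Java sketch of the pattern the slides describe: a provided service-client JAR wrapped in a HystrixCommand so the dependency gets its own thread pool, timeout, and fallback, and can be consumed as an RxJava Observable. The RatingsClient interface and its getRatings method are hypothetical stand-ins for a real mid-tier client.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

import java.util.Collections;
import java.util.List;

// Hypothetical mid-tier client interface (the "provided JAR" from a service team).
interface RatingsClient {
    List<String> getRatings(long customerId);
}

// Wraps the remote call so failures and latency are isolated from the device request.
public class RatingsCommand extends HystrixCommand<List<String>> {

    private final RatingsClient client;
    private final long customerId;

    public RatingsCommand(RatingsClient client, long customerId) {
        // Commands in the same group share a thread pool, bulkheading this dependency.
        super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
        this.client = client;
        this.customerId = customerId;
    }

    @Override
    protected List<String> run() {
        // The actual service call; Hystrix applies the timeout and circuit breaker here.
        return client.getRatings(customerId);
    }

    @Override
    protected List<String> getFallback() {
        // Degrade gracefully instead of failing the whole request.
        return Collections.emptyList();
    }
}

An endpoint script or application service can then call new RatingsCommand(client, id).observe() to get an rx.Observable and compose it with other service calls, or execute() for a blocking result.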

UI Teams

Mid-tier Service Teams

API Team

Continually changing UI scripts and mid-tier services

Functionality, resiliency, and performance drift over time

Deployment & Ops

REMOVE MANUAL WORK pushing code to multiple AWS regions/clusters

ENABLE RAPID DEPLOYMENT of code despite limited visibility into how it's changed

KEEP TEAM INFORMED about what's happening in prod

MITIGATE RISK of systemic failure

Tools

End-to-end Traceability Using Python/Java Glue

Code Flow

Run 1% of your traffic on the new code and see how it does

[Diagram: side-by-side comparison of API ami-123 vs. API ami-456 on 2xx/4xx/5xx rates, latency, busy threads, load, ...]

Manually looking at graphs and SSH-ing into servers and grep-ing logs doesn't scale (although we used to do that)

Confidence score for each AMI based on comparison of 1000+ metrics
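The slides don't spell out the scoring algorithm, but a toy Java sketch of the idea might compare each canary metric against the baseline cluster and report what fraction stays inside a tolerance band; the metric names and the 20% band below are illustrative assumptions, not the real system.

import java.util.Map;

public final class CanaryScore {

    // Score (0-100): fraction of comparable metrics where the canary stays within 20% of baseline.
    public static double score(Map<String, Double> baseline, Map<String, Double> canary) {
        int comparable = 0;
        int healthy = 0;
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            Double canaryValue = canary.get(entry.getKey());
            if (canaryValue == null || entry.getValue() == 0.0) {
                continue; // skip metrics that can't be compared
            }
            comparable++;
            double ratio = canaryValue / entry.getValue();
            if (ratio >= 0.8 && ratio <= 1.2) {
                healthy++;
            }
        }
        return comparable == 0 ? 0.0 : 100.0 * healthy / comparable;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of("5xx-rate", 0.02, "latency-p99-ms", 180.0);
        Map<String, Double> canary = Map.of("5xx-rate", 0.03, "latency-p99-ms", 185.0);
        System.out.println("confidence: " + score(baseline, canary)); // prints 50.0 here
    }
}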

Scannable visualization of metric space, ordered from more important to less important

Cross-reference Jira, link to code diffs

Track lib changes

Easy-to-access report artifacts for each AMI

Your basic red/black push

Doing red/black by hand for multiple clusters across multiple regions is not fun

Automate multi-cluster/region pushes


Don't forget to automate rollbacks, too!
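As a hedged sketch of what an automated red/black push with rollback could look like, the Java below drives a hypothetical CloudApi; every method on that interface is a stand-in for whatever tooling actually launches clusters and moves ELB traffic, not a real Netflix or AWS API.

import java.util.List;

// Hypothetical deployment hooks; none of these are real AWS SDK calls.
interface CloudApi {
    String launchCluster(String region, String ami);      // start a cluster from the new AMI
    boolean isHealthy(String region, String cluster);     // health checks plus canary signal
    void enableTraffic(String region, String cluster);    // attach to the ELB
    void disableTraffic(String region, String cluster);   // detach from the ELB
    void terminateCluster(String region, String cluster);
}

public class RedBlackPush {

    private final CloudApi cloud;

    public RedBlackPush(CloudApi cloud) {
        this.cloud = cloud;
    }

    // Push the new AMI to every region; keep the old cluster around so rollback is
    // just re-enabling its traffic.
    public void push(List<String> regions, String newAmi, String oldCluster) {
        for (String region : regions) {
            String newCluster = cloud.launchCluster(region, newAmi);
            if (cloud.isHealthy(region, newCluster)) {
                cloud.enableTraffic(region, newCluster);
                cloud.disableTraffic(region, oldCluster); // old cluster idles, ready for rollback
            } else {
                cloud.terminateCluster(region, newCluster); // automated rollback path
            }
        }
    }
}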

$Who, $What, $Where, $When

e.g., "bschmaus, ami-123, Sandbox Canary, 2013-05-06 19:05"

Latest prod change in chat topic
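One way to wire that up, purely as an assumption about the mechanics, is a small Java notifier that formats the $Who/$What/$Where/$When line and POSTs it to a chat webhook; the webhook URL and JSON payload shape below are made up for the sketch.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DeployNotifier {

    private static final String WEBHOOK = "https://chat.example.com/hooks/deploys"; // hypothetical

    public static void notify(String who, String what, String where) throws Exception {
        String when = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"));
        String message = String.join(", ", who, what, where, when);
        String body = "{\"text\": \"" + message + "\"}"; // assumed payload shape

        HttpRequest request = HttpRequest.newBuilder(URI.create(WEBHOOK))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
    }

    public static void main(String[] args) throws Exception {
        notify("bschmaus", "ami-123", "Sandbox Canary"); // mirrors the example in the slide
    }
}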

Quickly see status of all clusters in a region

What the #%*! just happened!?

Historical & realtime metrics, sort realtime by error/request rate

Distributed grep + tail

2013-05-09.20:38:54 MX 200 us-east-1c i-1824cb73 i-1c61b77f prod NFPS3-001-8G50FJCX... 288404769389848058 90ms api-global.netflix.com GET /tvui/release/470/plus/pathEvaluator -
amazon.ami-id: ami-502eb039
amazon.availability-zone: us-east-1c
amazon.instance-id: i-1824cb73
amazon.instance-type: m2.2xlarge
amazon.local-ipv4: 10.6.213.112
amazon.public-hostname: ec2-54-243-4-69.compute-1.amazonaws.com
amazon.public-ipv4: 54.243.4.69
cookie_esn: NFPS3-001-8G50FJCX...
country: MX
currentTime: 1368131934468
duration-millis: 90
esn: NFPS3-001-8G50FJCX...
geo.city: CIUDADOBREGON...

$ ./simple_stream.py -f -q 'e["country"]=="MX" && e["esn"]==~/NFPS3.*/' -r us
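The -q expression in that command does the filtering. A tiny Java analogue of the same predicate, treating each log event as a map of fields (which is an assumption about simple_stream.py's internals), looks like this:

import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class EventFilter {

    // Matches events from a given country whose ESN matches a regex, mirroring the query above.
    public static Predicate<Map<String, String>> countryAndEsn(String country, String esnRegex) {
        Pattern esn = Pattern.compile(esnRegex);
        return event -> country.equals(event.get("country"))
                && event.get("esn") != null
                && esn.matcher(event.get("esn")).matches();
    }

    public static void main(String[] args) {
        Predicate<Map<String, String>> query = countryAndEsn("MX", "NFPS3.*");
        Map<String, String> event = Map.of("country", "MX", "esn", "NFPS3-001-8G50FJCX");
        System.out.println(query.test(event)); // prints true
    }
}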

Go for the haystack handing you the needle

Or at least be able to make smaller haystacks

Continuously experiment to make hard things easier

Even with the best tools, building software is hard work.

Great engineers build great software.

Want to help us build the API?

bschmaus@netflix.com / @schmaus
