sv forum platform architecture sig - netflix open source platform
DESCRIPTION
Architecture overview of the Netflix cloud architecture, with a focus on the open source components that Netflix has released and is planning to release on http://netflix.github.com
TRANSCRIPT
The Netflix Open Source Platform
September 26th, 2012 Adrian Cockcroft, Ruslan Meshenberg
@adrianco @rusmeshenberg #netflixcloud http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg
What Netflix Did
• Moved to SaaS
  – Corporate IT – OneLogin, Workday, Box, Evernote…
  – Tools – Pagerduty, AppDynamics, Elastic MapReduce
• Built our own PaaS
  – Customized to make our developers productive
  – When we started, we had little choice
• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud
Why Use Cloud?
Things we don’t do
Netflix Choice was AWS with our own platform and tools
Unique platform requirements and extreme scale, agility and flexibility
Leverage AWS Scale “the biggest public cloud” AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
What about other PaaS?
• CloudFoundry – Open Source by VMWare
  – Developer-friendly, easy to get started
  – Missing scale and some enterprise features
• Rightscale
  – Widely used to abstract away from AWS
  – Creates its own lock-in problem…
• AWS is growing into this space
  – We didn’t want a vendor between us and AWS
  – We wanted to build a thin PaaS, that gets thinner
What do developers care about?
Keeping up with Developer Trends
In production at Netflix:
• Big Data/Hadoop – 2009
• AWS Cloud – 2009
• Application Performance Management – 2010
• Integrated DevOps Practices – 2010
• Continuous Integration/Delivery – 2010
• NoSQL – 2010
• Platform as a Service; Fine grain SOA – 2010
• Social coding, open development/github – 2011
AWS specific feature dependence….
Portability vs. Functionality
• Portability – the Operations focus
  – Avoid vendor lock-in
  – Support datacenter based use cases
  – Possible operations cost savings
• Functionality – the Developer focus
  – Less complex test and debug, one mature supplier
  – Faster time to market for your products
  – Possible developer cost savings
Portable PaaS
• Portable IaaS Base - some AWS compatibility
  – Eucalyptus – AWS licensed compatible subset
  – CloudStack – Citrix Apache project
  – OpenStack – Rackspace, Cloudscaling, HP etc.
• Portable PaaS
  – VMWare Cloud Foundry - run it yourself in your DC
  – AppFog and Stackato – Cloud Foundry/Openstack
  – Vendor options: Rightscale, Enstratus, Smartscale
Functional PaaS
• IaaS base - all the features of AWS
  – Very large scale, mature, global, evolving rapidly
  – ELB, Autoscale, VPC, SQS, EIP, EMR, DynamoDB etc.
  – Large files (TB) and multipart writes in S3
• Functional PaaS – Netflix added features
  – Very large scale, mature, flexible, customizable
  – Asgard console, Monkeys, Big data tools
  – Cassandra/Zookeeper data store automation
Developers choose Functional
Don’t let the roadie write the set list! (yes you do need all those guitars on tour…)
Freedom and Responsibility
• Developers leverage cloud to get freedom
  – Agility of a single organization, no silos
• But now developers are responsible
  – For compliance, performance, availability etc.
“As far as my rehab is concerned, it is within my ability to change and change for the better” - Eddie Van Halen
Amazon Cloud Terminology Reference
See http://aws.amazon.com/
This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
  – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
  – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
  – Reserved Instances – pre-paid to reduce cost for long term usage
  – Availability Zone – datacenter with own power and cooling hosting cloud instances
  – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)
What Runs in the Cloud?
Step by Step Netflix Product Transition
Non-Member Web Site
Member Web Site
Content Delivery Service
Netflix APIs
Streaming Device API
Netflix Ready Devices
From: May 2008
To: May 2010
Current Architectural Patterns for Availability
• Isolated Services – Resilient Business logic
• Three Balanced Availability Zones – Resilient to Infrastructure outage
• Triple Replicated Persistence – Durable distributed Storage
• Isolated Regions – US and EU don’t take each other down
Isolated Services
Test with Chaos Monkey, Latency Monkey
Three Balanced Availability Zones
Test with Chaos Gorilla
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each with Cassandra and EVCache replicas]
Triple Replicated Persistence
Cassandra maintenance drops individual replicas
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each with Cassandra and EVCache replicas]
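A small sketch of the replica arithmetic behind triple replication. It assumes majority (quorum) reads and writes, which the slides do not state, and the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_replica_losses(replication_factor: int) -> int:
    """Replicas that can be unavailable while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


# Assumption: one replica per zone across three zones (RF = 3).
rf = 3
print(f"RF={rf}: quorum={quorum(rf)}, can lose {tolerated_replica_losses(rf)} replica")
```

Under that assumption, taking one replica out for maintenance, or losing one zone, still leaves two of three replicas to answer.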
Isolated Regions
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of its own Zone A, Zone B and Zone C with Cassandra replicas]
Failure Modes and Effects
Failure Mode | Probability | Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
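To make the zone-failure mitigation concrete, here is a rough capacity sketch. It assumes load is spread evenly across zones and that the surviving zones must carry the full peak; the function names and the instance count in the example are hypothetical, not Netflix figures.

```python
def max_safe_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest steady-state utilization per zone that still lets the
    surviving zones absorb the traffic of the failed ones."""
    survivors = zones - zones_lost
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return survivors / zones


def instances_per_zone(peak_instances_needed: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in each zone so the survivors can carry peak load."""
    survivors = zones - zones_lost
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    # Ceiling division using integers only.
    return -(-peak_instances_needed // survivors)


# Hypothetical service that needs 100 instances at peak, spread over 3 zones.
print(instances_per_zone(100, zones=3))   # instances to provision per zone
print(max_safe_utilization(3))            # fraction of capacity usable in normal operation
```

Running on 2 out of 3 zones therefore means each zone is sized for half of the peak, not a third.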
Observed Regional Failures
• Power Outages
  – Platform survives any one zone outage
  – Two recent zone outages, one OK, one triggered a bug
• Router Bug Takes Region Offline
  – A few minutes of no network traffic, then recovered
  – AWS has redesigned routes to be per zone
• Control Plane Overload Affects Entire Region
  – Consequence of other outages
  – We lose control of our infrastructure
Netflix Deployed on AWS
• Content (2009) – Content Management, EC2 Encoding, S3 Petabytes
• Logs (2009) – S3 Terabytes, EMR, Hive & Pig, Business Intelligence
• Play (2010) – DRM, CDN routing, Bookmarks, Logging
• WWW (2010) – Sign-Up, Search, Movie Choosing, Ratings
• API (2010) – Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS (2011) – International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics
CDNs, ISPs, Terabits, Customers
Cloud Architecture Patterns
Where do we start?
Datacenter to Cloud Transition Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled down time, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
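The "Faster" goal is measured as mean and 99th percentile latency. A minimal sketch of that measurement follows; it uses the nearest-rank percentile definition, which is an assumption since the slides do not say how the percentile is computed, and the sample values are made up.

```python
from statistics import mean


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of len * pct / 100, clamped to at least rank 1.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the two numbers named in the goal: mean and 99th percentile."""
    return {"mean": mean(latencies_ms), "p99": percentile(latencies_ms, 99)}


# Made-up data: mostly fast responses with a slow tail.
latencies = [10] * 98 + [500, 900]
print(summarize(latencies))
```

The mean alone would hide the slow tail in this data, which is why both numbers are tracked.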
Netflix Datacenter vs. Cloud Arch
Datacenter | Cloud
Central SQL Database | Distributed Key/Value NoSQL
Sticky In-Memory Session | Shared Memcached Session
Chatty Protocols | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code | Instrumented Service Patterns
Fat Complex Objects | Lightweight Serializable Objects
Components as Jar Files | Components as Services
Availability and Resilience
Chaos Monkey
• Computers (Datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday-Friday 9am-3pm random instance kill
• Application configuration option
  – Apps now have to opt-out from Chaos Monkey
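The rules on this slide can be sketched as a selection function. This is not the Simian Army code: the data shapes, the names `in_chaos_hours` and `pick_victim`, and the choice to leave the actual termination call to the caller are all assumptions made for illustration.

```python
import random
from datetime import datetime


def in_chaos_hours(now: datetime) -> bool:
    """True Monday to Friday from 9am up to, but not including, 3pm."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, opted_out_apps, now, rng=random):
    """Return the id of one randomly chosen instance to kill, or None.

    instances      -- list of dicts like {"id": "i-1", "app": "api"} (hypothetical shape)
    opted_out_apps -- set of application names that have opted out
    """
    if not in_chaos_hours(now):
        return None
    # Every app is eligible by default; only explicit opt-outs are skipped.
    candidates = [i for i in instances if i["app"] not in opted_out_apps]
    if not candidates:
        return None
    return rng.choice(candidates)["id"]


fleet = [{"id": "i-1", "app": "api"}, {"id": "i-2", "app": "billing"}]
print(pick_victim(fleet, {"billing"}, datetime(2012, 9, 26, 10, 0)))
```

Restricting kills to working hours means engineers are at their desks when a failure is injected.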
Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
  – Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
  – Don’t let cascading timeouts stack up
• Make configuration options dynamic
  – You don’t want to push code to tweak an option
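One way to picture a dynamic option is a property that is looked up on every use, so a change takes effect without a code push. The sketch below is a generic in-memory illustration, not the Archaius API; the class name, the property key and the 200 ms default are hypothetical.

```python
import threading


class DynamicProperties:
    """Thread-safe key/value store whose values can be changed at runtime."""

    def __init__(self, defaults=None):
        self._lock = threading.Lock()
        self._values = dict(defaults or {})

    def set(self, name, value):
        with self._lock:
            self._values[name] = value

    def get(self, name, default=None):
        with self._lock:
            return self._values.get(name, default)


# Hypothetical property: a short client timeout, in milliseconds.
props = DynamicProperties({"ratings.client.timeout_ms": 200})


def current_timeout_seconds() -> float:
    # Read the property on every call so an operator's change
    # applies to the very next request.
    return props.get("ratings.client.timeout_ms", 200) / 1000.0


print(current_timeout_seconds())
props.set("ratings.client.timeout_ms", 50)   # tweak without redeploying
print(current_timeout_seconds())
```

In a real deployment the `set` call would be driven by a configuration service instead of local code.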
Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
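A minimal sketch of the generic circuit breaker pattern, for readers who have not met it. It is not the implementation described in the linked post; the thresholds, the fallback style and all names here are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures, then fails fast to a fallback
    until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def flaky_dependency():
    raise RuntimeError("dependency down")  # stand-in for a failing remote call


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30.0)
for _ in range(3):
    print(breaker.call(flaky_dependency, lambda: "cached default"))
```

Once the circuit is open, callers get the degraded response immediately instead of waiting on timeouts.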
Distributed Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
What’s Left for Corp IT?
• Corporate Security and Network Management
  – Billing and remnants of streaming service back-ends in DC
• Running Netflix’ DVD Business
  – Tens of Oracle instances
  – Hundreds of MySQL instances
  – Thousands of VMWare VMs
  – Zabbix, Cacti, Sumologic, Puppet, Chef
• Employee Productivity
  – Building networks and WiFi
  – SaaS OneLogin SSO Portal
  – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
  – Google Enterprise Apps, Workday HCM/Expense, Box.com
  – Many more SaaS migrations coming…
Corp WiFi Performance
Netflix Organization
DevOps Org Reporting into Product Group, not ITops
Netflix Cloud Platform Team
• Cloud Ops Reliability Engineering – Alert Routing, Incident Lifecycle, PagerDuty
• Architecture – Future planning, Security Arch, Efficiency, AWS VPC, Hyperguard, Powerpoint :)
• Build Tools and Automation – Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console, AWS API
• Platform and Persistence Engineering – Platform jars, Key store, Zookeeper, Cassandra, AWS Instances
• Cloud Performance – Cassandra Benchmarking, JVM GC Tuning, Wiresharking, AWS Instances
• Cloud Solutions – Monitoring, Monkeys, Entrypoints, AWS Instances
Netflix Open Source Strategy
• Steadily release PaaS Components git-by-git
• Source at github.com/netflix – we build from it…
• Intros and techniques at techblog.netflix.com
Give back to Apache licensed OSS community
Lead the Best Practices
Motivate, retain, hire top engineers
“Peer Pressure” code cleanup
External contributions
Clean Code is Re-usable
• Use by other teams and projects inside Netflix
Timeline
http://netflix.github.com
Simian Army (Chaos Monkey) http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
Asgard http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Astyanax, Priam, Curator, Exhibitor
Active Pipeline
Instance creation
[Diagram stages: Image baked, ASG / Instance started, Instance Running. Components: Asgard, Autoscaling scripts, Odin, Bakery & Build tools, Base AMI, Application Code, Instance]
Runtime
[Diagram stages: Registering, configuration; Application initializing. Components: Eureka, Entrypoints, Archaius, Governator, Async logging, Servo]
Runtime, Cont’d
[Diagram stages: Managing service; Resiliency aids; Calling other services. Components: Priam, Exhibitor, Explorers, NIWS LB, Astyanax, Curator, Dependency Command, REST client, Chaos Monkey, Latency Monkey, Janitor Monkey, Cass JMeter]
Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
Priam – Cassandra as a Service
Astyanax – Cassandra client for Java
CassJMeter – Cassandra test suite
Cassandra – Multi-region EC2 datastore support
Aegisthus – Hadoop ETL for Cassandra
Explorers
Governator – Library lifecycle and dependency injection
Odin – Workflow orchestration
Async logging
Exhibitor – Zookeeper as a Service
Curator – Zookeeper Patterns
EVCache – Memcached as a Service
Eureka / Discovery – Service Directory
Archaius – Dynamic Properties Service
EntryPoints
Server-side latency/error injection
REST Client + mid-tier LB
Configuration REST endpoints
Servo and Autoscaling Scripts
Honu – Log4j streaming to Hadoop
Circuit Breaker – Robust service pattern
Asgard – AutoScaleGroup based AWS console
Chaos Monkey – Robustness verification
Latency Monkey
Janitor Monkey
Bakeries and AMI
Build dynaslaves
Repeat after me…
Roadmap for 2012
• More resiliency and improved availability
• More automation, orchestration
• “Hardening” the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – now running in prod, rollout in process
• More open sourced components
• See you at AWS Re:Invent in November…
Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix http://techblog.netflix.com http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg
@adrianco @rusmeshenberg #netflixcloud