SV Forum Platform Architecture SIG - Netflix Open Source Platform

The Netflix Open Source Platform. September 26th, 2012. Adrian Cockcroft, Ruslan Meshenberg. @adrianco @rusmeshenberg #netflixcloud http://www.linkedin.com/in/adriancockcroft http://www.linkedin.com/in/ruslanmeshenberg

Upload: adrian-cockcroft

Post on 15-Jan-2015

DESCRIPTION

Architecture overview of the Netflix cloud architecture, with a focus on the open source components that Netflix has released or is planning to release at http://netflix.github.com

TRANSCRIPT

Page 1: SV Forum Platform Architecture SIG - Netflix Open Source Platform

The Netflix Open Source Platform

September 26th, 2012
Adrian Cockcroft, Ruslan Meshenberg

@adrianco @rusmeshenberg #netflixcloud
http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg

Page 2: SV Forum Platform Architecture SIG - Netflix Open Source Platform

What Netflix Did

• Moved to SaaS
  – Corporate IT: OneLogin, Workday, Box, Evernote…
  – Tools: PagerDuty, AppDynamics, Elastic MapReduce

• Built our own PaaS
  – Customized to make our developers productive
  – When we started, we had little choice

• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud

Page 3: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Why Use Cloud?

Page 4: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Things we don’t do

Page 5: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix Choice was AWS with our own platform and tools

Unique platform requirements and extreme scale, agility and flexibility

Page 6: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Leverage AWS Scale: “the biggest public cloud”

Leverage AWS investment in features and automation

Use AWS zones and regions for high availability, scalability and global deployment

Page 7: SV Forum Platform Architecture SIG - Netflix Open Source Platform

What about other PaaS?

• CloudFoundry – Open Source by VMWare
  – Developer-friendly, easy to get started
  – Missing scale and some enterprise features

• Rightscale
  – Widely used to abstract away from AWS
  – Creates its own lock-in problem…

• AWS is growing into this space
  – We didn’t want a vendor between us and AWS
  – We wanted to build a thin PaaS, that gets thinner

Page 8: SV Forum Platform Architecture SIG - Netflix Open Source Platform

What do developers care about?

Page 9: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Keeping up with Developer Trends (year each went into production at Netflix)

• Big Data/Hadoop: 2009
• AWS Cloud: 2009
• Application Performance Management: 2010
• Integrated DevOps Practices: 2010
• Continuous Integration/Delivery: 2010
• NoSQL: 2010
• Platform as a Service; Fine grain SOA: 2010
• Social coding, open development/github: 2011

Page 10: SV Forum Platform Architecture SIG - Netflix Open Source Platform

AWS specific feature dependence…

Page 11: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Portability vs. Functionality

• Portability – the Operations focus
  – Avoid vendor lock-in
  – Support datacenter based use cases
  – Possible operations cost savings

• Functionality – the Developer focus
  – Less complex test and debug, one mature supplier
  – Faster time to market for your products
  – Possible developer cost savings

Page 12: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Portable PaaS

• Portable IaaS Base - some AWS compatibility
  – Eucalyptus – AWS licensed compatible subset
  – CloudStack – Citrix Apache project
  – OpenStack – Rackspace, Cloudscaling, HP etc.

• Portable PaaS
  – VMWare Cloud Foundry - run it yourself in your DC
  – AppFog and Stackato – Cloud Foundry/Openstack
  – Vendor options: Rightscale, Enstratus, Smartscale

Page 13: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Functional PaaS

• IaaS base - all the features of AWS
  – Very large scale, mature, global, evolving rapidly
  – ELB, Autoscale, VPC, SQS, EIP, EMR, DynamoDB etc.
  – Large files (TB) and multipart writes in S3

• Functional PaaS – Netflix added features
  – Very large scale, mature, flexible, customizable
  – Asgard console, Monkeys, Big data tools
  – Cassandra/Zookeeper data store automation

Page 14: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Developers choose Functional

Don’t let the roadie write the set list! (yes you do need all those guitars on tour…)

Page 15: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Freedom and Responsibility

• Developers leverage cloud to get freedom
  – Agility of a single organization, no silos

• But now developers are responsible
  – For compliance, performance, availability etc.

“As far as my rehab is concerned, it is within my ability to change and change for the better” - Eddie Van Halen

Page 16: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Amazon Cloud Terminology Reference
See http://aws.amazon.com/
This is not a full list of Amazon Web Services features

• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
  – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
  – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
  – Reserved Instances – pre-paid to reduce cost for long term usage
  – Availability Zone – datacenter with own power and cooling hosting cloud instances
  – Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Store (network disk filesystem that can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)

Page 17: SV Forum Platform Architecture SIG - Netflix Open Source Platform

What Runs in the Cloud?

Step by Step Netflix Product Transition

Page 18: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Non-Member Web Site

Page 19: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Member Web Site

Page 20: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Content Delivery Service

Page 21: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix APIs

Page 22: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Streaming Device API

Netflix Ready Devices
From: May 2008
To: May 2010

Page 23: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Current Architectural Patterns for Availability

• Isolated Services
  – Resilient Business logic

• Three Balanced Availability Zones
  – Resilient to Infrastructure outage

• Triple Replicated Persistence
  – Durable distributed Storage

• Isolated Regions
  – US and EU don’t take each other down
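The combination of three zones and triple replication can be checked with a little arithmetic. The sketch below is an illustration, not something stated on the slide: it assumes reads and writes need a majority quorum of replicas (the usual Cassandra QUORUM rule, `RF // 2 + 1`) and that replicas are spread as evenly as possible across zones. The function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def survives_zone_loss(replication_factor: int, zones: int) -> bool:
    """True if a quorum is still reachable after losing one whole zone.

    Assumes replicas are spread as evenly as possible across zones,
    and that the zone holding the most replicas is the one that fails.
    """
    replicas_in_largest_zone = -(-replication_factor // zones)  # ceiling division
    remaining = replication_factor - replicas_in_largest_zone
    return remaining >= quorum(replication_factor)


if __name__ == "__main__":
    for zones in (1, 2, 3):
        print(zones, "zone(s):", survives_zone_loss(3, zones))
```

Under these assumptions, three replicas across three zones keep a quorum of two after one zone fails, while the same three replicas packed into one or two zones do not.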

Page 24: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Isolated Services
Test With Chaos Monkey, Latency Monkey

Page 25: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Three Balanced Availability Zones
Test with Chaos Gorilla

[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each zone holding Cassandra and Evcache Replicas]

Page 26: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Triple Replicated Persistence
Cassandra maintenance drops individual replicas

[Diagram: Load Balancers in front of Zone A, Zone B and Zone C, each zone holding Cassandra and Evcache Replicas]

Page 27: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Isolated Regions

[Diagram: two regions. US-East Load Balancers in front of Zone A, Zone B and Zone C, each holding Cassandra Replicas; EU-West Load Balancers in front of Zone A, Zone B and Zone C, each holding Cassandra Replicas]

Page 28: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Failure Modes and Effects

Failure Mode        | Probability | Mitigation Plan
Application Failure | High        | Automatic degraded response
AWS Region Failure  | Low         | Wait for region to recover
AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
Datacenter Failure  | Medium      | Migrate more functions to cloud
Data store failure  | Low         | Restore from S3 backups
S3 failure          | Low         | Restore from remote archive

Page 29: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Observed Regional Failures

• Power Outages
  – Platform survives any one zone outage
  – Two recent zone outages, one OK, one triggered a bug

• Router Bug Takes Region Offline
  – A few minutes of no network traffic, then recovered
  – AWS has redesigned routes to be per zone

• Control Plane Overload Affects Entire Region
  – Consequence of other outages
  – We lose control of our infrastructure

Page 30: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix Deployed on AWS

• Content (2009): Content Management, EC2 Encoding, S3 Petabytes
• Logs (2009): S3 Terabytes, EMR, Hive & Pig, Business Intelligence
• Play (2010): DRM, CDN routing, Bookmarks, Logging
• WWW (2010): Sign-Up, Search, Movie Choosing, Ratings
• API (2010): Metadata, Device Config, TV Movie Choosing, Social Facebook
• CS (2011): International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics

[Diagram labels: CDNs, ISPs, Terabits, Customers]

Page 31: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Cloud Architecture Patterns

Where do we start?

Page 32: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Datacenter to Cloud Transition Goals

• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user

• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively

• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled down time, no central database schema to change

• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
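The "mean and 99th percentile" measurement in the Faster goal can be sketched in a few lines. This is an illustration only: the sample latencies are made up, and the nearest-rank percentile definition is an assumption, since the slides do not say how Netflix computes it.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p99: ", percentile(latencies_ms, 99))
```

Tracking both numbers matters because a single slow outlier barely shows in the median but dominates the tail that some users actually experience.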

Page 33: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix Datacenter vs. Cloud Arch

Datacenter                 | Cloud
Central SQL Database       | Distributed Key/Value NoSQL
Sticky In-Memory Session   | Shared Memcached Session
Chatty Protocols           | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code          | Instrumented Service Patterns
Fat Complex Objects        | Lightweight Serializable Objects
Components as Jar Files    | Components as Services

Page 34: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Availability and Resilience

Page 35: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Chaos Monkey

• Computers (Datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency

• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact

• Chaos Monkey hours
  – Monday-Friday 9am-3pm random instance kill

• Application configuration option
  – Apps now have to opt-out from Chaos Monkey
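The rules on this slide (a weekday 9am-3pm window, a random victim, and opt-out per application) can be sketched as a small selection function. This is a minimal illustration under stated assumptions, not the Simian Army code: `in_chaos_hours`, `pick_victim` and the instance ids are hypothetical, and the actual termination call is left out.

```python
import random
from datetime import datetime


def in_chaos_hours(now: datetime) -> bool:
    """Monday to Friday, from 09:00 up to (not including) 15:00."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, opted_out, now, rng=random):
    """Return one instance id to terminate, or None.

    instances: dict mapping application name -> list of instance ids
    opted_out: set of application names excluded from Chaos Monkey
    """
    if not in_chaos_hours(now):
        return None
    candidates = [
        instance_id
        for app, ids in sorted(instances.items())
        if app not in opted_out
        for instance_id in ids
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    fleet = {"api": ["i-001", "i-002"], "billing": ["i-100"]}
    print(pick_victim(fleet, {"billing"}, datetime(2012, 9, 26, 10, 0)))
```

Making opt-out the exception rather than the default, as the slide describes, means a new application is exposed to instance failure from its first day in production.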

Page 36: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Responsibility and Experience

• Make developers responsible for failures
  – Then they learn and write code that doesn’t fail

• Use Incident Reviews to find gaps to fix
  – Make sure it’s not about finding “who to blame”

• Keep timeouts short, fail fast
  – Don’t let cascading timeouts stack up

• Make configuration options dynamic
  – You don’t want to push code to tweak an option
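A dynamic configuration option can be as simple as a value that is looked up on every use instead of being read once at startup. The sketch below illustrates that idea only; `PropertyStore`, `DynamicProperty` and the key name are hypothetical and are not the API of Archaius or any other Netflix library.

```python
import threading


class PropertyStore:
    """In-memory stand-in for a central configuration source."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._values.get(key, default)


class DynamicProperty:
    """Handle that re-reads its value from the store on every get()."""

    def __init__(self, store, key, default):
        self._store = store
        self._key = key
        self._default = default

    def get(self):
        return self._store.get(self._key, self._default)


if __name__ == "__main__":
    store = PropertyStore()
    timeout_ms = DynamicProperty(store, "client.timeout_ms", 500)
    print(timeout_ms.get())          # default until an operator overrides it
    store.set("client.timeout_ms", 200)
    print(timeout_ms.get())          # new value, with no code push or restart
```

Because callers hold the handle rather than the value, a short timeout can be tightened or relaxed during an incident without a deployment.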

Page 37: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Resilient Design – Circuit Breakers
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
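For readers unfamiliar with the pattern, a circuit breaker stops calling a failing dependency for a while and serves a fallback instead, which is one way to fail fast. The sketch below is a generic, single-threaded illustration; the class name, thresholds and fallback behavior are assumptions and do not reproduce the implementation described in the linked blog post.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures; retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                return fallback()  # short-circuit: do not touch the dependency
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a hypothetical function; once the breaker is open, the degraded response is returned immediately rather than after a timeout.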

Page 38: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Distributed Operational Model

• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (PagerDuty)
  – Configure autoscalers to handle capacity needs

• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools

Page 39: SV Forum Platform Architecture SIG - Netflix Open Source Platform

What’s Left for Corp IT?

• Corporate Security and Network Management
  – Billing and remnants of streaming service back-ends in DC

• Running Netflix’s DVD Business
  – Tens of Oracle instances
  – Hundreds of MySQL instances
  – Thousands of VMWare VMs
  – Zabbix, Cacti, Sumologic, Puppet, Chef

• Employee Productivity
  – Building networks and WiFi
  – SaaS OneLogin SSO Portal
  – Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
  – Google Enterprise Apps, Workday HCM/Expense, Box.com
  – Many more SaaS migrations coming…

[Chart: Corp WiFi Performance]

Page 40: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix Organization
DevOps Org Reporting into Product Group, not ITops

Netflix Cloud Platform Team

• Cloud Ops Reliability Engineering
  – Alert Routing, Incident Lifecycle
  – PagerDuty

• Architecture
  – Future planning, Security Arch, Efficiency
  – AWS VPC, Hyperguard, Powerpoint :)

• Build Tools and Automation
  – Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console
  – AWS API

• Platform and Persistence Engineering
  – Platform jars, Key store, Zookeeper, Cassandra
  – AWS Instances

• Cloud Performance
  – Cassandra Benchmarking, JVM GC Tuning, Wiresharking
  – AWS Instances

• Cloud Solutions
  – Monitoring, Monkeys, Entrypoints
  – AWS Instances

Page 41: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Netflix Open Source Strategy

• Steadily release PaaS Components git-by-git
• Source at github.com/netflix (we build from it…)
• Intros and techniques at techblog.netflix.com

Page 42: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Give back to Apache licensed OSS community

Page 43: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Lead the Best Practices

Page 44: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Motivate, regain, hire top engineers

Page 45: SV Forum Platform Architecture SIG - Netflix Open Source Platform

“Peer Pressure” code cleanup

Page 46: SV Forum Platform Architecture SIG - Netflix Open Source Platform

External contributions

Page 47: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Clean Code is Re-usable

• Use by other teams and projects inside Netflix

Page 48: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Timeline

Page 49: SV Forum Platform Architecture SIG - Netflix Open Source Platform

http://netflix.github.com

Page 50: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Simian Army (Chaos Monkey)
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

Page 51: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Page 52: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Astyanax, Priam, Curator, Exhibitor

Page 53: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Active Pipeline

Page 54: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Instance creation

[Diagram: stages shown are Image baked, ASG / Instance started, Instance Running. Components shown are Bakery & Build tools, Base AMI, Application Code, Asgard, Autoscaling scripts, Odin, Instance]

Page 55: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Runtime

[Diagram: stages shown are Application initializing; Registering, configuration. Components shown are Eureka, Entrypoints, Archaius, Governator, Async logging, Servo]

Page 56: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Runtime, Cont’d

[Diagram: stages shown are Calling other services, Managing service, Resiliency aids. Components shown are Priam, Exhibitor, Explorers, NIWS LB, Astyanax, Curator, Dependency Command, REST client, Chaos Monkey, Latency Monkey, Janitor Monkey, Cass JMeter]

Page 57: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Open Source Projects

Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon

• Priam: Cassandra as a Service
• Astyanax: Cassandra client for Java
• CassJMeter: Cassandra test suite
• Cassandra: Multi-region EC2 datastore support
• Aegisthus: Hadoop ETL for Cassandra
• Explorers
• Governator: Library lifecycle and dependency injection
• Odin: Workflow orchestration
• Async logging
• Exhibitor: Zookeeper as a Service
• Curator: Zookeeper Patterns
• EVCache: Memcached as a Service
• Eureka / Discovery: Service Directory
• Archaius: Dynamic Properties Service
• EntryPoints
• Server-side latency/error injection
• REST Client + mid-tier LB
• Configuration REST endpoints
• Servo and Autoscaling Scripts
• Honu: Log4j streaming to Hadoop
• Circuit Breaker: Robust service pattern
• Asgard: AutoScaleGroup based AWS console
• Chaos Monkey: Robustness verification
• Latency Monkey
• Janitor Monkey
• Bakeries and AMI
• Build dynaslaves

Page 58: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Repeat after me…

Page 59: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Roadmap for 2012

• More resiliency and improved availability
• More automation, orchestration
• “Hardening” the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – now running in prod, rollout in process
• More open sourced components
• See you at AWS Re:Invent in November…

Page 60: SV Forum Platform Architecture SIG - Netflix Open Source Platform

Takeaway

Netflix has built and deployed a scalable global Platform as a Service.

Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.

http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix

http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg

@adrianco @rusmeshenberg #netflixcloud