Netflix presents at MassTLC Cloud Summit 2013


Upload: masstlc

Post on 09-May-2015


DESCRIPTION

Ariel Tseitlin, Director of the Netflix Cloud, presented on the elasticity and redundancy of its cloud service.

TRANSCRIPT

Page 1: Netflix presents at MassTLC Cloud Summit 2013

@atseitlin  

Netflix Cloud Platform

Netflix's evolution in the cloud

Ariel Tseitlin

http://www.linkedin.com/in/atseitlin @atseitlin

 

Page 2: Netflix presents at MassTLC Cloud Summit 2013

About Netflix

Netflix is the world’s leading Internet television network with nearly 38 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series [1].

[1] http://ir.netflix.com/

Page 3: Netflix presents at MassTLC Cloud Summit 2013

Original Content

Page 4: Netflix presents at MassTLC Cloud Summit 2013

Critical Acclaim

Page 5: Netflix presents at MassTLC Cloud Summit 2013

A complex distributed system

Page 6: Netflix presents at MassTLC Cloud Summit 2013

How Netflix Streaming Works

Customer Device (PC, PS3, TV…)

Web Site or Discovery API

User Data

Personalization

Streaming API

DRM

QoS Logging

OpenConnect CDN Boxes

CDN Management and Steering

Content Encoding

Consumer Electronics

AWS Cloud Services

CDN Edge Locations

Browse

Play

Watch

Page 7: Netflix presents at MassTLC Cloud Summit 2013

Highly Available Architecture

Micro-services, redundancy, resiliency

Page 8: Netflix presents at MassTLC Cloud Summit 2013

Web Server Dependencies Flow

Home page business transaction

Diagram labels: Start Here, memcached, Cassandra, Web service, S3 bucket, Personalization movie group chooser

Each icon is three to a few hundred instances across three AWS zones

Page 9: Netflix presents at MassTLC Cloud Summit 2013

Component Micro-Services

Test with Chaos Monkey, Latency Monkey
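
As a rough illustration of the kind of test this slide names, the sketch below terminates one randomly chosen instance in a simulated service group and checks that the group still has enough capacity. The `ServiceGroup` class, its methods, the instance ids, and the `min_healthy` threshold are all hypothetical; this is not the Chaos Monkey code or its API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ServiceGroup:
    """Hypothetical stand-in for a group of identical service instances."""
    name: str
    instances: set = field(default_factory=set)

    def terminate(self, instance_id: str) -> None:
        self.instances.discard(instance_id)

    def is_serving(self, min_healthy: int = 1) -> bool:
        return len(self.instances) >= min_healthy


def unleash_monkey(group: ServiceGroup, rng: random.Random) -> str:
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


group = ServiceGroup("api", {"i-001", "i-002", "i-003"})
victim = unleash_monkey(group, random.Random(42))
survived = group.is_serving(min_healthy=2)  # still serving on two instances?
```

In a real environment the interesting part is what happens to user-visible behavior after the termination, which a simulation like this cannot show.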

Page 10: Netflix presents at MassTLC Cloud Summit 2013

Three Balanced Availability Zones

Test with Chaos Gorilla

Load Balancers

Zone A: Cassandra and EVCache Replicas

Zone B: Cassandra and EVCache Replicas

Zone C: Cassandra and EVCache Replicas

Page 11: Netflix presents at MassTLC Cloud Summit 2013

Triple Replicated Persistence

Cassandra maintenance affects individual replicas

Load Balancers

Zone A: Cassandra and EVCache Replicas

Zone B: Cassandra and EVCache Replicas

Zone C: Cassandra and EVCache Replicas
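
One way to read "maintenance affects individual replicas": with three replicas, one per zone, a majority of two can still be reached while a single replica is down. The slide does not say which consistency level is used, so quorum reads and writes are an assumption here, and the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def can_serve(replication_factor: int, replicas_down: int) -> bool:
    """True if enough replicas remain available to reach a quorum."""
    return replication_factor - replicas_down >= quorum(replication_factor)


RF = 3  # assumed: one replica in each of zones A, B, and C
one_down = can_serve(RF, replicas_down=1)  # one replica under maintenance
two_down = can_serve(RF, replicas_down=2)  # two replicas unavailable
```

Under these assumptions, taking one replica out for maintenance leaves quorum operations available, while losing two does not.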

Page 12: Netflix presents at MassTLC Cloud Summit 2013

Isolated Regions

Will someday test with Chaos Kong

US-East Load Balancers

Zone A: Cassandra Replicas

Zone B: Cassandra Replicas

Zone C: Cassandra Replicas

EU-West Load Balancers

Zone A: Cassandra Replicas

Zone B: Cassandra Replicas

Zone C: Cassandra Replicas

Page 13: Netflix presents at MassTLC Cloud Summit 2013

Failure Modes and Effects

Failure Mode          Probability   Current Mitigation Plan
Application Failure   High          Automatic degraded response
AWS Region Failure    Low           Wait for region to recover
AWS Zone Failure      Medium        Continue to run on 2 out of 3 zones
Datacenter Failure    Medium        Migrate more functions to cloud
Data store failure    Low           Restore from S3 backups
S3 failure            Low           Restore from remote archive

Until we got really good at mitigating high and medium probability failures, the ROI for mitigating regional failures didn’t make sense. Getting there…

Page 14: Netflix presents at MassTLC Cloud Summit 2013

Application Resilience

Run what you wrote

Rapid detection

Rapid response

Fail often

Page 15: Netflix presents at MassTLC Cloud Summit 2013

Run What You Wrote

•  Make developers responsible for failures
   – Then they learn and write code that doesn’t fail

•  Use Incident Reviews to find gaps to fix
   – Make sure it’s not about finding “who to blame”

•  Keep timeouts short, fail fast
   – Don’t let cascading timeouts stack up
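
The "short timeouts, fail fast" bullet can be sketched as a call wrapper that gives up after a fixed budget and returns a degraded default instead of waiting. Everything here is an assumption made for illustration: the helper name `call_with_timeout`, the 0.1 second budget, and the two fake dependencies. Note that the timeout only bounds how long the caller waits; the slow call keeps running in its worker thread.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block the caller waiting for the slow call to finish.
        pool.shutdown(wait=False)


def slow_dependency():
    time.sleep(0.5)  # simulates a dependency that is responding slowly
    return "personalized rows"


def fast_dependency():
    return "personalized rows"


degraded = call_with_timeout(slow_dependency, timeout_s=0.1, fallback="default rows")
normal = call_with_timeout(fast_dependency, timeout_s=0.1, fallback="default rows")
```

Because each caller waits at most its own budget, a slow dependency cannot make every layer above it wait the full duration in turn.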

Page 16: Netflix presents at MassTLC Cloud Summit 2013

Rapid Detection

•  If your pilot had no instrument panel, would you ever fly on a plane?
   – Never run your service blind

•  Monitor services, not instances
   – Make instance failure a non-event

•  Don’t pay people to watch screens
   – Instead pay them to build alerting
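
A minimal sketch of "monitor services, not instances": the alert is computed from the error rate aggregated over every instance of a service, so a single bad instance does not page anyone by itself. The `InstanceStats` type, the 5% threshold, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    instance_id: str
    requests: int
    errors: int


def service_error_rate(stats):
    """Error rate across all instances of a service, weighted by traffic."""
    total_requests = sum(s.requests for s in stats)
    total_errors = sum(s.errors for s in stats)
    return total_errors / total_requests if total_requests else 0.0


def should_alert(stats, threshold=0.05):
    """Alert on the service-level rate, not on any single instance."""
    return service_error_rate(stats) > threshold


# One instance is failing every request, but it carries little traffic.
stats = [
    InstanceStats("i-001", requests=1000, errors=2),
    InstanceStats("i-002", requests=1000, errors=3),
    InstanceStats("i-003", requests=20, errors=20),
]
rate = service_error_rate(stats)
alert = should_alert(stats)
```

Here the service-level rate stays well under the threshold, so the failing instance can be replaced automatically without raising an alert.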

Page 17: Netflix presents at MassTLC Cloud Summit 2013

Rapid Rollback

•  Use a new Autoscale Group to push code

•  Leave existing ASG in place, switch traffic

•  If OK, auto-delete old ASG a few hours later

•  If “whoops”, switch traffic back in seconds
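
The four steps above can be modeled as a small state machine. This is a simulation only: the `Deployment` class and the ASG names are hypothetical, and no AWS or Asgard API is called.

```python
class Deployment:
    """Hypothetical model of two ASGs behind one traffic switch."""

    def __init__(self, live_asg: str):
        self.asgs = {live_asg}
        self.live = live_asg
        self.previous = None

    def push(self, new_asg: str) -> None:
        """Bring up a new ASG alongside the old one, then switch traffic."""
        self.asgs.add(new_asg)
        self.previous, self.live = self.live, new_asg

    def rollback(self) -> None:
        """Switch traffic back; possible only while the old ASG exists."""
        if self.previous is None or self.previous not in self.asgs:
            raise RuntimeError("no previous ASG to roll back to")
        self.live, self.previous = self.previous, self.live

    def cleanup(self) -> None:
        """Delete the old ASG once the new one is considered good."""
        if self.previous is not None:
            self.asgs.discard(self.previous)
            self.previous = None


deploy = Deployment("api-v041")
deploy.push("api-v042")
live_after_push = deploy.live
deploy.rollback()  # the "whoops" path
live_after_rollback = deploy.live
```

Rollback is fast in this model because the old group is never torn down during the push; the cost is running both groups until `cleanup` is called.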

Page 18: Netflix presents at MassTLC Cloud Summit 2013

Asgard

http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html

Page 19: Netflix presents at MassTLC Cloud Summit 2013

Made possible in the cloud

APIs, Elasticity, Efficiency

Page 20: Netflix presents at MassTLC Cloud Summit 2013

APIs

•  Control everything (start, terminate, scale)

•  Inject failure

•  Monitor & audit

•  Automate operations

Page 21: Netflix presents at MassTLC Cloud Summit 2013

Elasticity

•  Capacity planning replaced with forecasting

•  Dynamic load-based auto-scaling

•  New data centers at the click of a button
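
Load-based auto-scaling can be approximated as sizing the group so that each instance stays at or below a target load. The function name, the requests-per-second figures, and the minimum and maximum bounds below are made up for illustration and do not come from the talk.

```python
import math


def desired_instances(current_load_rps, target_rps_per_instance,
                      min_instances=3, max_instances=300):
    """Instances needed to keep each one at or below its target load."""
    needed = math.ceil(current_load_rps / target_rps_per_instance)
    # Clamp to the configured bounds of the group.
    return max(min_instances, min(max_instances, needed))


trough = desired_instances(1_000, target_rps_per_instance=500)
peak = desired_instances(10_000, target_rps_per_instance=500)
```

With these sample numbers the group holds at its minimum of 3 instances during the trough and grows to 20 at peak.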

Page 22: Netflix presents at MassTLC Cloud Summit 2013

Efficiency

•  ~10x trough to peak ratio. Fill trough with batch workloads

•  Optimize machine class for each service

•  Highly available red/black deployments

Page 23: Netflix presents at MassTLC Cloud Summit 2013

Coming soon to a cloud near you

Billing & Payments, Big Data & Analytics, SaaS

Page 24: Netflix presents at MassTLC Cloud Summit 2013

Billing & Payments

•  PCI compliance

•  Privacy & security

•  Intermediate step of cache in the cloud

Page 25: Netflix presents at MassTLC Cloud Summit 2013

Big Data & Analytics

•  On deck for cloud migration

•  ETL already in cloud with EMR (Hadoop)

•  Many cloud alternatives but not yet as mature as the old guard

Page 26: Netflix presents at MassTLC Cloud Summit 2013

Corporate systems moving to SaaS

•  Email (Exchange -> Google Apps)

•  Expense Management (Concur -> Workday)

•  Document sharing (File Servers -> Box)

•  Goal is 100% SaaS

Page 27: Netflix presents at MassTLC Cloud Summit 2013


Page 28: Netflix presents at MassTLC Cloud Summit 2013

Open Source Projects

Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon

Priam: Cassandra as a Service

Astyanax: Cassandra client for Java

CassJMeter: Cassandra test suite

Cassandra: Multi-region EC2 datastore support

Aegisthus: Hadoop ETL for Cassandra

Ice: Spend analytics

Governator: Library lifecycle and dependency injection

Odin: Cloud orchestration

Blitz4j: Async logging

Exhibitor: Zookeeper as a Service

Curator: Zookeeper Patterns

EVCache: Memcached as a Service

Eureka / Discovery: Service Directory

Archaius: Dynamics Properties Service

Edda: Config state with history

Denominator

Ribbon: REST Client + mid-tier LB

Karyon: Instrumented REST Base Serve

Servo and Autoscaling Scripts

Genie: Hadoop PaaS

Hystrix: Robust service pattern

RxJava: Reactive Patterns

Asgard: AutoScaleGroup based AWS console

Chaos Monkey: Robustness verification

Latency Monkey

Janitor Monkey

Bakeries / Aminator

Page 29: Netflix presents at MassTLC Cloud Summit 2013


Page 30: Netflix presents at MassTLC Cloud Summit 2013

Our Current Catalog of Releases

Free code available at http://netflix.github.com

Page 31: Netflix presents at MassTLC Cloud Summit 2013

We’re hiring!

•  Simian Army

•  Cloud Tools

•  NetflixOSS

•  Cloud Operations

•  Reliability Engineering

•  Many, many more

jobs.netflix.com

Page 32: Netflix presents at MassTLC Cloud Summit 2013

Takeaways

Netflix has built and deployed a scalable, global, and highly available Platform as a Service and open sourced it (NetflixOSS)

The Cloud enables elasticity, efficiency, and fine-grained control via APIs

Credit cards, Big Data, and the rest of corporate systems are next to move to the Cloud

http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix

http://www.linkedin.com/in/atseitlin

@atseitlin @NetflixOSS

Page 33: Netflix presents at MassTLC Cloud Summit 2013

Thank you!

Any questions?

Ariel Tseitlin http://www.linkedin.com/in/atseitlin

@atseitlin