Gluecon 2013 - NetflixOSS Cloud Native Tutorial Introduction
Post on 07-May-2015
Introduction: Building Using the NetflixOSS Architecture
May 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
Presentation vs. Tutorial
• Presentation
– Short duration, focused subject
– One presenter to many anonymous audience members
– A few questions at the end
• Tutorial
– Time to explore in and around the subject
– Tutor gets to know the audience
– Discussion, rat-holes, “bring out your dead”
Introduction – Who are you?
Netflix Open Source Cloud Prize
Cloud Native – More details
NetflixOSS – Cloud Native On-Ramp
Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
– Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
– Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
– 2003-4 Chief Architect High Performance Technical Computing
– 2001 Author: Capacity Planning for Web Services
– 1999 Author: Resource Management
– 1995 & 1998 Author: Sun Performance and Tuning
– 1996 Japanese Edition of Sun Performance and Tuning: SPARC & Solaris パフォーマンスチューニング (サンソフトプレスシリーズ)
• More
– Twitter @adrianco
– Blog http://perfcap.blogspot.com
– Presentations at http://www.slideshare.net/adrianco
Attendee Introductions
• Who are you, where do you work
• Why are you here today, what do you need
• “Bring out your dead”
– Do you have a specific problem or question?
– One sentence elevator pitch
• What instrument do you play?
Boosting the @NetflixOSS Ecosystem
See netflix.github.com
In 2012 Netflix Engineering won this..
We’d like to give out prizes too
But what for?
Contributions to NetflixOSS!
Shared under Apache license
Located on github
Best example application mash-up
Best new monkey
Best contribution to code quality
Best new feature
Best contribution to operational tools
Best portability enhancement
Best datastore integration
Best contribution to performance
Best usability enhancement
Judges choice award
How long do you have?
Entries open March 13th
Entries close September 15th
Six months…
Who can win?
Almost anyone, anywhere…
Except current or former Netflix or AWS employees
Who decides who wins?
Nominating Committee
Panel of Judges
Judges
Aino Corry, Program Chair for QCon/GOTO
Martin Fowler, Chief Scientist, ThoughtWorks
Simon Wardley, Strategist
Yury Izrailevsky, VP Cloud, Netflix
Werner Vogels, CTO, Amazon
Joe Weinman, SVP Telx, Author “Cloudonomics”
What are Judges Looking For?
Eligible, Apache 2.0 licensed
NetflixOSS project pull requests
Original and useful contribution to NetflixOSS
Good code quality and structure
Documentation on how to build and run it
Code that successfully builds and passes a test suite
Evidence that code is in use by other projects, or is running in production
A large number of watchers, stars and forks on github
What do you win?
One winner in each of the 10 categories
Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas
A Trophy
$10,000 cash and $5,000 in AWS Credits
How do you enter?
Get a (free) github account
Fork github.com/netflix/cloud-prize
Send us your email address
Describe and build your entry
Twitter #cloudprize
[Diagram: Cloud Prize process. Labels: Entrants; Registration Opened March 13 (Github); Apache Licensed Contributions (Github); Close Entries September 15 (Github); Netflix Engineering Nominations (Conforms to Rules, Working Code, Community Traction, Categories); Six Judges; Winners; Award Ceremony Dinner, November, AWS Re:Invent; Ten Prize Categories; $10K cash, $5K AWS, AWS Re:Invent Tickets, Trophy]
Cloud Native
Recap the keynote in much more detail and discussion
A new engineering challenge
Construct a highly agile and highly available service from ephemeral and often broken components
Inspiration
Netflix Streaming
A Cloud Native Application based on an open source platform
Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group choosers (for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
New Cloud Native Patterns
Micro-services and Chaos engines
Highly available systems composed from ephemeral components
Open Source is the default
Some Strategic Questions
What changed…
The AWS Question
Why does Netflix use AWS when Amazon Prime is a competitor?
Netflix vs. Amazon Prime
• Do retailers competing with Amazon use AWS?
– Yes, lots of them, Netflix is no different
• Does Prime have a platform advantage?
– No, because Netflix gets to run on AWS
• Does Netflix take Amazon Prime seriously?
– Yes, but so far Prime isn’t impacting our business
[Chart: streaming bandwidth share. Labels: Amazon Video 1.31%; 18x Prime; 25x Prime; Nov 2012 Streaming Bandwidth; March 2013 Mean Bandwidth; +39% 6mo]
The Google Cloud Question
Why doesn’t Netflix use Google Cloud as well as AWS?
Google Cloud – Wait and See
Pros
• Cloud Native
• Huge scale for internal apps
• Exposing internal services
• Nice clean API model
• Starting a price war
• Fast for what it does
• Rapid start & minute billing
Cons
• In beta until last week
• No big customers yet
• Missing many key features
• Different arch model
• Missing billing options
• No SSD or huge instances
• Zone maintenance windows
But: Anyone interested is welcome to port NetflixOSS components to Google Cloud
Cloud Wars: Price and Performance
AWS vs. GCS War
Private Cloud
Power cost increase
Labor cost increase
Aging system failures
Maintenance renewal
Storage cost reduction
Instance cost reduction
Faster newer systems
What Changed: Everyone using AWS or GCS gets the price cuts and performance improvements, as they happen. No need to switch vendor.
No Change: Locked in for three years.
The DIY Question
Why doesn’t Netflix build and run its own cloud?
Fitting Into Public Scale
Public Grey Area Private
1,000 Instances 100,000 Instances
Netflix Facebook Startups
How big is Public?
AWS upper bound estimate based on the number of public IP addresses
Every provisioned instance gets a public IP by default
AWS Maximum Possible Instance Count: 3.7 Million
Growth >10x in Three Years, >2x Per Annum
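The two growth figures on this slide agree with each other: tenfold growth over three years works out to slightly more than doubling each year. A minimal sketch of that arithmetic, assuming steady compound growth (the function name is illustrative and not from the talk):

```python
def annual_growth_factor(total_growth: float, years: float) -> float:
    """Per-year multiplier that compounds to total_growth over the given years."""
    return total_growth ** (1.0 / years)


# 10x over three years, the lower bound quoted on the slide.
factor = annual_growth_factor(10.0, 3.0)
print(f"{factor:.2f}x per year")  # prints "2.15x per year"
```

Since the slide quotes lower bounds (">10x", ">2x"), the real per-year factor would be at least this value.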
The Alternative Supplier Question
What if there is no clear leader for a feature, or AWS doesn’t have what we need?
Things We Don’t Use AWS For
SaaS Applications – Pagerduty, Appdynamics
Content Delivery Service
DNS Service
CDN Scale
AWS CloudFront
Akamai
Limelight
Level 3
Netflix Openconnect
YouTube
Gigabits Terabits
Netflix Facebook Startups
Content Delivery Service
Open Source Hardware Design + FreeBSD, bird, nginx
See openconnect.netflix.com
DNS Service
AWS Route53 is missing too many features
Multiple vendor strategy: Dyn, Ultra, Route53
Abstracted (broken) DNS APIs with Denominator
Availability Questions
Is it running yet?
How many places is it running in?
How far apart are those places?
Netflix Outages
• Running very fast with scissors
– Mostly self inflicted – bugs, mistakes from pace of change
– Some caused by AWS bugs and mistakes
• Incident Life-cycle Management by Platform Team
– No runbooks, no operational changes by the SREs
– Tools to identify what broke and call the right developer
• Next step is multi-region
– Investigating and building in stages during 2013
– Could have prevented some of our 2012 outages
Managing Multi-Region Availability
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
Regional Load Balancers
UltraDNS
DynECT
DNS
AWS Route53
Denominator – manage traffic via multiple DNS providers
Denominator
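The idea of steering traffic through several DNS providers behind one abstraction can be sketched as follows. This is not Denominator's API (Denominator is a Java library); the `DnsProvider` protocol, the `InMemoryProvider` stand-in, the `steer_traffic` helper, and the `example.com` hostnames are all hypothetical names chosen for illustration. Only the provider names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class DnsProvider(Protocol):
    """Hypothetical minimal interface shared by every DNS vendor."""

    name: str

    def set_cname(self, record: str, target: str) -> None: ...

    def get_cname(self, record: str) -> str: ...


@dataclass
class InMemoryProvider:
    """Stand-in for a real vendor client; it only stores records in a dict."""

    name: str
    records: Dict[str, str] = field(default_factory=dict)

    def set_cname(self, record: str, target: str) -> None:
        self.records[record] = target

    def get_cname(self, record: str) -> str:
        return self.records[record]


def steer_traffic(providers: List[DnsProvider], record: str, target: str) -> List[str]:
    """Point the same record at the same target on every provider."""
    updated = []
    for provider in providers:
        provider.set_cname(record, target)
        updated.append(provider.name)
    return updated


providers = [InMemoryProvider("UltraDNS"), InMemoryProvider("DynECT"), InMemoryProvider("Route53")]
# Hypothetical hostnames: move api traffic to a load balancer in another region.
updated = steer_traffic(providers, "api.example.com", "elb-us-west-2.example.com")
```

Because callers depend only on the shared interface, a regional failover becomes one loop over vendors rather than three vendor-specific procedures.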
Cloud Native Big Data
Size the cluster to the data
Size the cluster to the questions
Never wait for space or answers
Netflix Dataoven
Data Warehouse: Over 2 Petabytes
Ursula
Aegisthus
Data Pipelines
From cloud Services: ~100 Billion Events/day
From C*: Terabytes of Dimension data
Hadoop Clusters – AWS EMR
1300 nodes 800 nodes Multiple 150 nodes Nightly
RDS
Metadata
Gateways
Tools
Cloud Native Patterns
Master copies of data are cloud resident
Dynamically provisioned micro-services
Services are distributed and ephemeral
Cloud Native Architecture
Distributed Quorum NoSQL Datastores
Autoscaled Micro Services
Autoscaled Micro Services
Clients Things
JVM JVM
JVM JVM
Cassandra Cassandra Cassandra
Memcached
JVM
Zone A Zone B Zone C
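The quorum arithmetic behind a three-zone datastore can be sketched briefly. This assumes a replication factor of 3 with one replica per zone, which the diagram suggests but the slide does not state; the function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


def tolerated_replica_losses(replication_factor: int) -> int:
    """Replicas that can be lost while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


n = 3  # assumed: one replica in each of Zone A, Zone B, Zone C
q = quorum(n)
print(q, tolerated_replica_losses(n), reads_see_latest_write(q, q, n))  # prints "2 1 True"
```

Under that assumption, quorum reads and writes keep working when one whole zone is lost, which is the property that lets the service be built from components expected to fail.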
Non-Native Cloud Architecture
Datacenter Dinosaurs
Cloudy Buffer
Agile Mobile Mammals
iOS/Android
App Servers
MySQL Legacy Apps
How to get to Cloud Native?
Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
Re-Org!
Four Transitions
• Management: Integrated Roles in a Single Organization
– Business, Development, Operations -> BusDevOps
• Developers: Denormalized Data – NoSQL
– Decentralized, scalable, available, polyglot
• Responsibility from Ops to Dev: Continuous Delivery
– Decentralized small daily production updates
• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
– Hardware in minutes, provisioned directly by developers
Netflix BusDevOps Organization
Chief Product Officer
VP Product Management
Directors Product
VP UI Engineering
Directors Development
Developers + DevOps
UI Data Sources
AWS
VP Discovery Engineering
Directors Development
Developers + DevOps
Discovery Data Sources
AWS
VP Platform
Directors Platform
Developers + DevOps
Platform Data Sources
AWS
Denormalized, independently updated and scaled data
Cloud, independently updated and scaled infrastructure
Code, independently updated continuous delivery
Decentralized Deployment
Asgard
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
Push
Autoscale Up
Autoscale Down
A Cloud Native Open Source Platform
See netflix.github.com
Three Questions
Why is Netflix doing this?
How does it all fit together?
What is coming next?
Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy
Simon Wardley - Researcher at the Leading Edge Forum
How did Netflix get ahead?
Netflix BusDevOps Org
• Doing it since 2009
• SaaS Applications
• PaaS for agility
• Public IaaS for AWS features
• Big data in the cloud
• Integrating many APIs
• FOSS from github
• Renting hardware for 1hr
• Coding in Java/Groovy/Scala
Traditional IT Operations
• Taking their time
• Pilot private cloud projects
• Beta quality installations
• Small scale
• Integrating several vendors
• Paying big $ for software
• Paying big $ for consulting
• Buying hardware for 3yrs
• Hacking at scripts
Netflix Platform Evolution
Bleeding Edge Innovation
Common Pattern
Shared Pattern
2009-2010 2011-2012 2013-2014
Netflix ended up several years ahead of the industry, but it’s becoming commoditized now
Making it easy to follow
Exploring the wild west each time vs. laying down a shared route
Establish our solutions as Best Practices / Standards
Hire, Retain and Engage Top Engineers
Build up Netflix Technology Brand
Benefit from a shared ecosystem
Goals
How does it all fit together?
Example Application – RSS Reader
Github NetflixOSS Source
AWS Base AMI
Maven Central
Cloudbees Jenkins
Aminator Bakery
Dynaslave AWS Build Slaves
Asgard (+ Frigga) Console
AWS Baked AMIs
Odin Orchestration API
AWS Account
NetflixOSS Continuous Build and Deployment
Coming Soon!
AWS Account
Asgard Console
Archaius Config Service
Cross region Priam C*
Pytheas Dashboards
Atlas Monitoring
Genie, Lipstick Hadoop Services
AWS Usage Cost Monitoring
Multiple AWS Regions
Eureka Registry
Exhibitor ZK
Edda History
Simian Army
3 AWS Zones
Application Clusters
Autoscale Groups
Instances
Priam Cassandra Persistent Storage
Evcache Memcached Ephemeral Storage
NetflixOSS Services Scope
Initialization
• Baked AMI – Tomcat, Apache, your code
• Governator – Guice based dependency injection
• Archaius – dynamic configuration properties client
• Eureka – service registration client
Service Requests
• Karyon – Base Server for inbound requests
• RxJava – Reactive pattern
• Hystrix/Turbine – dependencies and real-time status
• Ribbon – REST Client for outbound calls
Data Access
• Astyanax – Cassandra client and pattern library
• Evcache – Zone aware Memcached client
• Curator – Zookeeper patterns
• Denominator – DNS routing abstraction
Logging
• Blitz4j – non-blocking logging
• Servo – metrics export for autoscaling
• Atlas – high volume instrumentation
NetflixOSS Instance Libraries
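Hystrix, listed above for managing dependencies, is best known for the circuit-breaker pattern: after repeated failures, calls to a dependency are short-circuited to a fallback for a while instead of piling up. The following is a minimal sketch of that pattern only, not Hystrix's API (Hystrix is a Java library); the class, its parameters, and the thresholds are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, retries after a delay."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, action: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open and still inside the reset window, skip the dependency entirely.
        if self._opened_at is not None and self._clock() - self._opened_at < self._reset_after:
            return fallback()
        try:
            result = action()  # closed, or a single trial call after the window
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound dependency call, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function, so that a failing dependency degrades the page rather than taking it down.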
Test Tools
• CassJmeter – Load testing for Cassandra
• Circus Monkey – Test account reservation rebalancing
Maintenance
• Janitor Monkey – Cleans up unused resources
• Efficiency Monkey
• Doctor Monkey
• Howler Monkey – Complains about AWS limits
Availability
• Chaos Monkey – Kills Instances
• Chaos Gorilla – Kills Availability Zones
• Chaos Kong – Kills Regions
• Latency Monkey – Latency and error injection
Security
• Security Monkey – security group and S3 bucket permissions
• Conformity Monkey – architectural pattern warnings
NetflixOSS Testing and Automation
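The Chaos Monkey idea, killing instances at random so that services must tolerate the loss, can be sketched in a few lines. This is a simulation under stated assumptions, not the Simian Army code: it makes no AWS calls, and `unleash_chaos`, the group names, and the instance IDs are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash_chaos(groups: Dict[str, List[str]], terminate: Callable[[str], None],
                  probability: float = 1.0, rng: Optional[random.Random] = None) -> List[str]:
    """Pick at most one random victim per autoscale group and terminate it."""
    rng = rng or random.Random()
    victims = []
    for group_name in sorted(groups):
        instances = groups[group_name]
        # Skip empty groups, and skip a group when the dice roll says so.
        if not instances or rng.random() >= probability:
            continue
        victim = rng.choice(instances)
        terminate(victim)  # in a real system this would call the cloud API
        victims.append(victim)
    return victims


# Hypothetical autoscale groups and instance IDs.
groups = {"api": ["i-001", "i-002", "i-003"], "web": ["i-101", "i-102"], "batch": []}
killed: List[str] = []
victims = unleash_chaos(groups, killed.append, rng=random.Random(42))
```

With autoscaled groups, the expectation is that each terminated instance is replaced automatically, so the exercise tests the service's resilience rather than its operators.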
More Use Cases
More Features
Better portability
Higher availability
Easier to deploy
Contributions from end users
Contributions from vendors
What’s Coming Next?
Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Release 3.3 in 2Q13
Some vendor interest
Needs AWS compatible Autoscaler
Some vendor interest
Many missing features
“Confused” AWS API strategy
AWS 2009
Baseline features needed to support NetflixOSS
Eucalyptus 3.3
Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering a cloud native ecosystem
Rapid Evolution - Low MTBIAMSH (Mean Time Between Idea And Making Stuff Happen)
Takeaway
NetflixOSS makes it easier for everyone to become Cloud Native
http://netflix.github.com
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud @NetflixOSS