site reliability engineering - usenix · availability and reliability meet slos • defend customer...
TRANSCRIPT
7/22/16 1
Greg Veith, Director – Microsoft Azure SRE
Site Reliability Engineering
They’re Alive!
Organizations Are Living Organisms
Evolution and Complexity
Azure Service Offerings
% revenue from startups and ISVs
k new Azure customer subscriptions/month
Distinct Azure service offerings
Datacenters
24 datacenter regions
Scale
M messages per second processed by Azure IoT
Transformation
Learning Culture, Growth Mindset
Scaling Up Operational Models
Welcome To The Team!
SR-
North is…
Symptoms of Success
• Availability and reliability meet SLOs: defend customer trust
• Eliminate human touches to prod: toil elimination
• Speed up deployments: reduce inventory, ship fast, safely
All the above areas reinforce measurement. Reliability’s foundation.
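As a rough illustration of what measuring availability against an SLO can look like, here is a minimal sketch. The function names, the 99.9% target, and the request counts used with it are assumptions made for illustration; none of them come from the talk.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully; 1.0 when there was no traffic."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, slo_target: float) -> bool:
    """True when measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= slo_target


def failures_left_before_slo_miss(
    good_requests: int, total_requests: int, slo_target: float
) -> int:
    """Failed requests still tolerable in this window (negative once the SLO is missed).

    The allowed number of failures is rounded to the nearest whole request.
    """
    allowed_failures = round(total_requests * (1 - slo_target))
    actual_failures = total_requests - good_requests
    return allowed_failures - actual_failures


if __name__ == "__main__":
    # Hypothetical window: one million requests, 500 of them failed.
    print(meets_slo(999_500, 1_000_000, slo_target=0.999))
    print(failures_left_before_slo_miss(999_500, 1_000_000, slo_target=0.999))
```

Counting requests is only one possible way to define availability; a time-based definition (minutes of downtime per month) would follow the same shape.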
3 Strategic Pillars
Prove the model – Apply Principles
Start SRE at Microsoft – Establish Principles
Accelerate and improve – Scale the Principles
SRE Engagement Types
Services at Planetary Scale (Ops Transformation at Scale): SRE develops solutions to close operational gaps, fire suppressant, iterate toward transformation
Newer Service Facing Rapid Growth (Growth and Maturation): SRE attaches to team, develops targeted improvements to prepare for growth, get on call
Greenfield Services or Redesign (Design and Architecture): Operability and continuous innovation, design for scale from the beginning
Production Readiness
3 Strategic Pillars
Prove the model – Pilots – Apply Principles
Start SRE at Microsoft – Establish Principles
Accelerate and improve – Scale the Principles
Service Facing Rapid Growth: Azure IoT
Established Service at Planetary Scale: Azure Storage
3 Prong Strategy
Prove the model – Pilots – Apply Principles
Start SRE at Microsoft – Establish Principles
Accelerate and improve – Scale the Principles
Production Virtuous Cycle
Goal: Enable this loop to run as fast and often as possible while maintaining SLOs
Code
Test
Deploy
Monitor, Measure, Alert
Mitigate
Restore
Postmortem
Learn
SRE
SRE Areas of Focus
• Metrics and Monitoring: instrumentation, SLOs, alarms, insights → actions
• Infrastructure Engineering: tooling, infra for global optima
• Release Engineering: change management, deployment
• Incident Response: enough said
• Common Infrastructure: integrating existing best-in-class infra
• Capacity & Fleet Management: buildout, decomm, fleet understanding and mgmt
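To make "alarms, insights → actions" a little more concrete, here is a minimal sketch of an alarm over a sliding window of request outcomes. The class name `ErrorRateAlarm`, the window size, and the threshold are hypothetical choices for illustration; the talk does not describe any specific alerting mechanism.

```python
from collections import deque


class ErrorRateAlarm:
    """Fires when the failure rate over the last `window_size` requests exceeds `threshold`."""

    def __init__(self, window_size: int, threshold: float) -> None:
        # deque with maxlen drops the oldest outcome automatically once full.
        self.samples: deque = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if the alarm should fire."""
        self.samples.append(success)
        if len(self.samples) < self.samples.maxlen:
            # Not enough data yet: stay quiet rather than alert on a partial window.
            return False
        failures = sum(1 for ok in self.samples if not ok)
        return failures / len(self.samples) > self.threshold


if __name__ == "__main__":
    alarm = ErrorRateAlarm(window_size=4, threshold=0.25)
    for outcome in [True, True, True, True, False, False]:
        if alarm.record(outcome):
            print("alarm: error rate above threshold, page on-call")
```

Staying silent until the window is full is one design choice among several; it trades a short blind spot at startup for fewer false alarms.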
Metrics and Monitoring
Incident Response
Critical Moves, Learnings
Build and protect the SRE brand
Manage the change
Meet teams where they are
Grab a Shovel (and build a backhoe)
Find the bright spots