Google SRE (Site Reliability Engineering) Concepts
TRANSCRIPT
Presenters
David Hixson
John Neil
Robert Spier
John Reese
Mikey Dickerson
Nori Heikkinen
Ryan Anderson
Designing Distributed Systems
Limiting Factors
What limits growth?
Resource constraints
How to design to push past limits
Data latency
Failure Modes
Predict far in future
Ways things will fail
Hope is not a strategy
Serve in spite of failures
How to serve & grow past failures
What can you prevent before it starts?
10 Rules for Scale
Scaling Up Safely
Make Good Choices
Constraints
Every part of design has limits
Aggregate capability is probably the minimum
All capacity above that value is wasted
The smallest limit is the failure domain
Gas tank size: the car will run out of fuel first - that's the failure domain
Understand the Whole Stack
Components in a computer: Disk / IOPS
Components in data center
Network
Ports
Rack uplinks
Components at data center
Cooling
WAN connection
Power
Do you concentrate traffic into smaller failure domains?
Next most critical decision
Costs and alternatives
Understand the risks
Time spent evaluating is another risk!
Decide
Reassess
Not too often
Things change fast
3 choices / pick 2
Cheap
Fast
Reliable
Engineering decisions are also driven by things outside engineering
Product design limits
Management directives
Capacity Planning
What are important things to think about?
# users
# viewers
# searches
Dependent requests & subqueries
# requests
Most popular data
Least popular
What defines the service & its capacity?
Total requests sent to entire system
Total capacity per core
Does it change for types of core?
How long does it take to change the system?
How much risk for failure is accounted for?
How perfect are your load balancers?
Planning cycle
Estimate in theory the cost of the work
Validate in practice the cost of work
Monitor demand
Monitor the work
Identify improvements
Caching
Tuning
Better code
Product changes
N+1
N is the capacity you need to serve at peak
+1 = shortcut for thinking about disaster capacity
Expansion: anticipate the future from day 1
N+2
I need x resources to serve y traffic 99% of the time
Like Supply Chain Management
Several cycles
Cheapest safe choice
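The N+1 / N+2 arithmetic above can be sketched as follows; the function name and the example numbers are illustrative, not from the talk.

```python
import math

def capacity_needed(peak_qps: float, qps_per_replica: float, spares: int = 2) -> int:
    """N + spares provisioning: N replicas cover peak load; the spares
    absorb one planned outage plus one unplanned failure (N+2)."""
    n = math.ceil(peak_qps / qps_per_replica)
    return n + spares

# e.g. 9,000 QPS peak at 1,000 QPS per replica: N = 9, provision 11 under N+2
```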
Engineering Tricks
1. Dark Launches
Gain experience without the suffering
New caching?
New image replication?
Avoid embarrassment
Build better estimates before public releases
Identify bottlenecks
Work on optimizing
Turn on backend monitoring of features before making features visible to end users
Collect/analyze all data you would monitor if live
2. Degraded Failure (success) mode
What choices do you have if the system approaches a critical state?
Can you reduce load? Serve lower quality images?
Difference in what work you can do at 1 QPS vs 1 million QPS
Don't accept it if it will make you fail
R2-D2 is offered one more shot of whiskey... Program him to kindly say "no thank you" when he's reached his limit
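One way to "kindly say no" is admission control: refuse requests beyond a concurrency limit instead of failing under the load. A minimal sketch; the class name and limit are illustrative, not from the talk.

```python
import threading

class LoadShedder:
    """Refuse work beyond a concurrency limit instead of collapsing under it."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # say "no thank you": shed this request
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1
```

A server would return an explicit overload error (e.g. HTTP 503) whenever `try_acquire()` is False, so load balancers can route around it.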
3. Monitoring
Can't fix what you can't measure
Types of monitoring
Black box
Monitoring what it is supposed to do
External monitoring
Limited knowledge of "how it works"
Responsive
White box
Predictive of failures
"Approaching peak"
Predictive of what interventions will fix it
Manual interventions (email Sal with instructions)
Automated repair responses
Beyond garbage collection
Responsive to failures
Detailed understanding of the system
Identified critical thresholds
Warning of approaching thresholds
Transparent from day 1
Failure is not an option...
But it's going to happen anyway
You have to have a way to reason about your system
What happens when a piece of your system goes away?
What are the implications?
What other systems absorb the impact?
If it's too big to reason about in your head, you need a tool to visualize
Be able to visualize your system in real time
If you do something a lot, "really rare" becomes twice a day
Use good sources of uniqueness
Clean up temporary files
Validate your config files before you push them
Test all layers of a system
Humans can't review everything
Automated tests are the only way to operate at scale
Error paths need to be exercised regularly
Even in production
Always have safety checks for your automated pushes
Things that are unthinkable are therefore undocumented
Perfectly reasonable code can become a trap
Document assumptions in the code
Check assumptions when you use a library
What % of data is affected by an automated push?
If greater than some %, place in holding pattern for review
1% is a whole freaking lot at scale
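The holding-pattern check might look like this sketch; the function name and the 1% default are illustrative, not from the talk.

```python
def safe_to_push(records_affected: int, total_records: int,
                 threshold: float = 0.01) -> bool:
    """Hold an automated push for human review if it touches more than
    `threshold` (default 1%) of the data -- 1% is a lot at scale."""
    if total_records == 0:
        return True  # nothing to damage
    return records_affected / total_records <= threshold
```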
Avoiding synchronization is important
Small outages become bigger rapidly
On error, don't retry immediately
Add exponential wait
Add jitter
Don't schedule tasks on the hour or half hour
Make it random
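Exponential wait with jitter, plus a jittered schedule offset, can be sketched as follows; the parameter values are illustrative.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)], so retries don't line up."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def jittered_start(period_s: float = 3600.0) -> float:
    """Random offset within the period instead of 'on the hour'."""
    return random.uniform(0, period_s)
```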
1. KISS - Keep servers simple
Do one thing and do it well
Don't mix request types in a single server
Growth Limiting
My_app_server
Handles image uploads
Serves image thumbnails
Mix of requests can change
Capacity unpredictable for mix of services
Growth Potential
My_app_upload_server
My_app_thumbnail_server
Consistent behavior/capacity per server
Easy to understand
Tons of requests from a variety of systems
2. Smaller & Stateless
Prefer smaller stateless servers
Many small jobs vs one big job
Stateful jobs vs stateless jobs
Stateful
A stateful server remembers client data (state) from one request to the next.
A stateful server is simpler
Stateless
A stateless server keeps no state information
A stateless server is more robust
Lost connections can't leave a file in an invalid state
Rebooting the server does not lose state information
Rebooting the client does not confuse a stateless server
Using a stateless file server, the client must specify complete file names in each request, specify the location for reading or writing, and re-authenticate for each request
Using a stateful file server, the client can send less data with each request
Sticky sessions vs stateless sessions
Sticky sessions
Locking a session to a server to maintain identification of the session and its state
The load balancer is forced to send all requests to the original server where the session state was created
Even though that server might be heavily loaded and there might be another less-loaded server available to take on this request
Stateless session
The server does not need to store any session state
All necessary information is stored in the cookie held by the client
Load balancing is easier, as session state does not need to be replicated over multiple front-end servers
Make failure domains smallest & fewest
Growth Limiting
One giant db server
All photos on one server
Failure point: 3K QPS
Growth Potential
Many smaller shardedstorage db servers
Range of photo idsspread across servers
Cache documentstate on servers
Failure point
1K QPS
1K QPS
1K QPS
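Spreading photo ids across shards might be sketched like this. The talk describes spreading ranges of photo ids; this sketch uses a simple modulo mapping for brevity, and the names are illustrative.

```python
def shard_for(photo_id: int, num_shards: int = 3) -> int:
    """Map a photo id to one of several smaller db servers, so each
    shard is its own small failure domain (e.g. 1K QPS each, not one
    giant 3K QPS failure point)."""
    return photo_id % num_shards
```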
3. Retry Safely
Growth Limiting
Retry 3 times w/ 3 second delay
Demand oscillation may occur
Growth Potential
Retry w/ random exponential back off
Random back off displaces requests
So they don't line up when backed up
Make sure requests don't exceed dependent system timeouts
Stateless: ensure clients send identifying info to server
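Having clients send identifying info lets the server treat retries idempotently: a retried request is recognized and not applied twice. A minimal sketch; the class and names are illustrative, and a real server would bound and expire the cache.

```python
class DedupingServer:
    """Remember request ids so a retried request isn't applied twice."""

    def __init__(self):
        self._seen = {}  # request_id -> cached response

    def handle(self, request_id: str, work):
        if request_id in self._seen:
            return self._seen[request_id]  # retry: replay the cached result
        result = work()  # first time: actually do the work
        self._seen[request_id] = result
        return result
```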
4. Bound Resource Usage: Fail Gracefully
Growth Limiting
Load entire objects or docs into RAM
Error: connection timeout
Growth Potential
Operate on chunks of data
10 thumbnails instead of 20 per page
Consider data structure carefully
Don't buffer user input w/o a limit
Reject user requests if overloaded
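Operating on chunks of data instead of whole objects can be sketched as a streaming reader; the function name and chunk size are illustrative.

```python
def iter_chunks(path: str, chunk_size: int = 64 * 1024):
    """Stream a file in fixed-size chunks instead of loading the whole
    object into RAM, keeping memory use bounded regardless of file size."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```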
5. Don't Crash / Assert-Exit
Never die due to unexpected input
Send an exception response
Just throw the request away and ignore it
Growth Limiting
Assert(request.size <= 1000)
Growth Potential
Request size > 1000: respond "request too big"
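The contrast between the assert and the error response might be sketched as follows; the limit mirrors the example above, but the function and status codes are illustrative.

```python
MAX_REQUEST_BYTES = 1000

def handle_request(payload: bytes) -> tuple[int, str]:
    """Reject oversized input with an error response. An
    assert(request.size <= 1000) would instead kill the whole server
    on one bad request."""
    if len(payload) > MAX_REQUEST_BYTES:
        return 413, "request too big"  # serve an error, keep running
    return 200, "ok"
```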
6. Be Transparent
Jobs should not be a black box
Keep track of actions taken
Make it available
Export it
Visible via private url
Provide visibility of internal state
Provide explicit statement of health
Load balancers can use this to send traffic elsewhere
Key value pairs
Config files
Provide debug pages for complex data
Mechanism for doing health checks
Can I read my config file?
Connection to db?
Memory used?
Cpu?
Errors sent to backend?
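Those checks can be exported as an explicit statement of health in key/value form; a sketch with hypothetical field names, not from the talk.

```python
import json

class Server:
    def __init__(self):
        # Hypothetical internal state a real server would track.
        self.config_ok = True
        self.db_connected = True
        self.errors_to_backend = 0

    def healthz(self) -> tuple[int, str]:
        """Explicit statement of health as key/value pairs; a load
        balancer can stop sending traffic on any non-200 status."""
        checks = {
            "config_ok": self.config_ok,
            "db_connected": self.db_connected,
            "errors_to_backend": self.errors_to_backend,
        }
        healthy = self.config_ok and self.db_connected
        return (200 if healthy else 503), json.dumps(checks)
```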
7. Avoid Lazy Initialization
Prepare everything you need at startup
Perform all health checks
Before accepting requests
Include db connections
Loading files from disk, etc.
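Avoiding lazy initialization might be sketched as: do the expensive setup in the constructor, so a broken dependency fails fast at startup instead of surfacing mid-request. Names here are illustrative.

```python
class Frontend:
    """Do all setup and health checks before accepting traffic, so the
    first real request doesn't pay (or expose) the initialization cost."""

    def __init__(self, connect_db, load_config):
        self.db = connect_db()        # fail fast at startup, not mid-request
        self.config = load_config()   # includes reading files from disk
        self.ready = True

    def serve(self, request) -> str:
        if not self.ready:
            raise RuntimeError("not ready")
        return "ok"
```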
8. Maintain Flexibility
Don't change the world at once
Canarying experimental rollouts
Release schedule & QA testing
Don't release at peak
Don't affect users
Do it when workers can respond
Don't release at midnight
New features?
Config protected
Disabled by default
Percentage rollout
A/B testing
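A percentage rollout that defaults to off can be sketched by hashing (feature, user) into a stable bucket; the names and bucket granularity are illustrative, not from the talk.

```python
import hashlib

def feature_enabled(feature: str, user_id: str, rollout_pct: float) -> bool:
    """Deterministic percentage rollout: hash (feature, user) into a
    stable bucket in [0, 100), so each user gets a consistent yes/no
    and the flag defaults to off at 0%."""
    h = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(h[:8], "big") % 10000 / 100.0
    return bucket < rollout_pct
```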
9. Anticipate the Future
Growth trends
Watch them
Have safety buffer
Real disaster: Thailand floods caused a global hard drive supply delay
Plan for more capacity if needed
Consider time to order
Time to implement
Consider growing peaks
Industry changes
New technology?
Bigger images?
New upload bandwidth requirements
10. Check the User Experience
Fast & Reliable
Fast results = more users
Slow performance = drop in user %
Measurable week over week
Probe off network
Emulate real users
Automate it
Selenium
Bandwidth avail
Latency
Don't just check servers
Mind Mapped by Ayori Selassie
Find me on Twitter @iayori
Hosted at blacksintechnology.net