Google SRE (Site Reliability Engineering) Concepts


Upload: dinhanh

Post on 01-Jan-2017

Google SRE (Site Reliability Engineering) Concepts

Presenters

David Hixson

John Neil

Robert Spier

John Reese

Mikey Dickerson

Nori Heikkinen

Ryan Anderson

Designing Distributed Systems

Limiting Factors

What limits growth?

Resource constraints

How to design to push past limits

Data latency

Failure Modes

Predict far in future

Ways things will fail

Hope is not a strategy

Serve in spite of failures

How to serve & grow past failures

What can you prevent before it starts?

10 Rules for Scale

Scaling Up Safely

Make Good Choices

Constraints

Every part of design has limits

Aggregate capability is probably the minimum

All capacity above that value is wasted

The smallest limit is the failure domain

Gas tank size - car will run out of fuel first: failure domain
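
The point about the smallest limit can be sketched in a few lines. This is a minimal illustration, not something from the talk: the component names and QPS figures are hypothetical, and it assumes every request passes through every component.

```python
def bottleneck(capacities_qps):
    """Return (component, capacity) for the smallest limit in the stack."""
    name = min(capacities_qps, key=capacities_qps.get)
    return name, capacities_qps[name]


def wasted_capacity(capacities_qps):
    """Capacity each component has above the aggregate (minimum) limit."""
    _, limit = bottleneck(capacities_qps)
    return {name: cap - limit for name, cap in capacities_qps.items()}


# Hypothetical per-component limits, in queries per second.
stack = {"disk_iops": 4000, "rack_uplink": 2500, "wan_link": 6000}

print(bottleneck(stack))       # ('rack_uplink', 2500)
print(wasted_capacity(stack))  # headroom that cannot be used until the uplink grows
```

Under these assumed numbers the rack uplink plays the role of the gas tank: it runs out first, so the extra disk and WAN capacity is unused.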

Understand the Whole Stack

Components in computer

Disk/IOPS

Components in data center

Network

Ports

Rack uplinks

Components at data center

Cooling

WAN connection

Power

Do you concentrate traffic into smaller failure domains?

Next most critical decision

Costs and alternatives

Understand the risks

Time spent evaluating is another risk!

Decide

Reassess

Not too often

Things change fast

3 choices / pick 2

Cheap

Fast

Reliable

Engineering decisions are also driven by things outside engineering

Product design limits

Management directives

Capacity Planning

What are important things to think about?

# users

# viewers

# searches

Dependent requests & subqueries

# requests

Most popular data

Least popular

What defines the service & its capacity?

Total requests sent to entire system

Total capacity per core

Does it change for types of core?

How long does it take to change the system?

How much risk for failure is accounted for?

How perfect are your load balancers?

Planning cycle

Estimate in theory the cost of the work

Validate in practice the cost of work

Monitor demand

Monitor the work

Identify improvements

Caching

Tuning

Better code

Product changes

N+1

N is the capacity you need to serve at peak

+1 = Shortcut for thinking about disaster capacity

Expansion on "anticipate the future" from day 1

N+2

I need x resources to serve y traffic 99% of the time

Like Supply Chain Management

Several cycles

Cheapest safe choice
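
As a rough sketch of the N+1 and N+2 shorthand: N is the number of servers needed at peak, and the spares are added on top. The function name, the per-server QPS figure, and the idea of counting capacity in whole servers are assumptions made for illustration.

```python
import math


def servers_needed(peak_qps, qps_per_server, spares=1):
    """N servers to carry peak load, plus `spares` held back for failures."""
    if peak_qps < 0 or qps_per_server <= 0 or spares < 0:
        raise ValueError("inputs must be non-negative, with a positive per-server rate")
    n = math.ceil(peak_qps / qps_per_server)  # N: capacity needed at peak
    return n + spares                         # +1 or +2: disaster capacity


print(servers_needed(3000, 1000))            # N+1 -> 4
print(servers_needed(2500, 1000, spares=2))  # N+2 -> 5
```

The per-server rate is the "validate in practice" number from the planning cycle; an estimate that has not been measured makes N itself a guess.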

Engineering Tricks

1. Dark Launches

Gain experience without the suffering

New caching?

New image replication?

Avoid embarrassment

Build better estimates before public releases

Identify bottlenecks

Work on optimizing

Turn on backend monitoring of features before making features visible to end users

Collect/analyze all data you would monitor if live

2. Degraded Failure (success) mode

What choices do you have if the system approaches a critical state?

Can you reduce load?

Serve lower quality images?

Difference in what work you can do at 1 QPS vs 1 million QPS

Don't accept it if it will make you fail

R2D2 is offered one more shot of whiskey...

Program him to kindly say no thank you when he's reached his limit
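
One way to picture the degraded mode is a small policy function that decides, per request, whether to serve the full response, a cheaper one, or to say "no thank you". This is only a sketch: the mode names and the 80% threshold are hypothetical, not values from the talk.

```python
def choose_response_mode(current_qps, capacity_qps, degrade_at=0.8):
    """Pick how to answer a request based on how close the system is to its limit."""
    if capacity_qps <= 0:
        raise ValueError("capacity_qps must be positive")
    load = current_qps / capacity_qps
    if load >= 1.0:
        return "reject"    # politely refuse work that would make the server fail
    if load >= degrade_at:
        return "degraded"  # e.g. serve lower quality images
    return "full"


print(choose_response_mode(500, 1000))   # full
print(choose_response_mode(850, 1000))   # degraded
print(choose_response_mode(1000, 1000))  # reject
```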

3. Monitoring

Can't fix what you can't measure

Types of monitoring

Black box

Monitoring what it is supposed to do

External monitoring

Limited knowledge of "how it works"

Responsive

White box

Predictive of failures

"Approaching peak"

Predictive of what interventions will fix it

Manual interventions (email Sal with instructions)

Automated repair responses

Beyond garbage collection

Responsive to failures

Detailed understanding of the system

Identified critical thresholds

Warning of approaching thresholds

Transparent from day 1

Failure is not an option...

But it's going to happen anyway

You have to have a way to reason about your system

What happens when a piece of your system goes away?

What are the implications?

What other systems absorb the impact?

If it's too big to reason about in your head, you need a tool to visualize it

Be able to visualize your system in real time

If you do something a lot, "really rare" becomes twice a day

Use good sources of uniqueness

Clean up temporary files

Validate your config files before you push them

Test all layers of a system

Humans can't review everything

Automated tests are the only way to operate at scale

Error paths need to be exercised regularly

Even in production

Always have safety checks for your automated pushes

Things that are unthinkable are therefore undocumented

Perfectly reasonable code can become a trap

Document assumptions in the code

Check assumptions when you use a library

What % of data is affected by an automated push?

If greater than some %, place in holding pattern for review

1% is a whole freaking lot at scale
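
The safety check for automated pushes could look roughly like the sketch below. The function name and the 1% default are illustrative assumptions; the talk only says "some %".

```python
def review_push(affected_records, total_records, max_fraction=0.01):
    """Return "apply" for small pushes and "hold" when too much data is touched."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    if not 0 <= affected_records <= total_records:
        raise ValueError("affected_records must be between 0 and total_records")
    fraction = affected_records / total_records
    # Anything above the threshold waits in a holding pattern for human review.
    return "hold" if fraction > max_fraction else "apply"


print(review_push(5, 1000))   # apply
print(review_push(50, 1000))  # hold
```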

Avoiding synchronization is important

Small outages become bigger rapidly

On error don't retry immediately

Add exponential wait

Add jitter

Don't schedule tasks on the hour or on the half hour

Make it random
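
A possible way to keep periodic tasks from lining up on the hour is to give each host a stable pseudo-random offset within the period. The sketch below assumes a hash of hypothetical host and task names; the talk does not prescribe a particular mechanism.

```python
import hashlib


def start_offset_seconds(host, task, period_seconds=3600):
    """Stable pseudo-random offset within the period for this host and task."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    digest = hashlib.sha256(f"{host}/{task}".encode()).digest()
    # Same host and task always map to the same offset, so runs stay evenly spaced.
    return int.from_bytes(digest[:8], "big") % period_seconds


for host in ("web-1", "web-2", "web-3"):  # hypothetical host names
    print(host, start_offset_seconds(host, "rotate-logs"))
```

Hashing rather than calling a random generator on each run keeps the offset the same across restarts, so a task still runs once per period.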

1. KISS - Keep servers simple

Do one thing and do it well

Don't mix request types in a single server

Growth Limiting

My_app_server

Handles image uploads

Serves image thumbnails

Mix of requests can change

Capacity unpredictable for mix of services

Growth Potential

My_app_upload_server

My_app_thumbnail_server

Consistent behavior/capacity per server

Easy to understand

Tons of requests from a variety of systems

2. Smaller & Stateless

Prefer smaller stateless servers

Many small jobs vs one big job

Stateful jobs vs stateless jobs

Stateful

A stateful server remembers client data (state) from one request to the next.

A stateful server is simpler

Stateless

A stateless server keeps no state information

A stateless server is more robust

Lost connections can't leave a file in an invalid state

Rebooting the server does not lose state information

Rebooting the client does not confuse a stateless server

Using a stateless file server, the client must specify complete file names in each request

Specify location for reading or writing

Re-authenticate for each request

Using a stateful file server, the client can send less data with each request

Sticky sessions vs stateless sessions

Sticky sessions

Locking a session to a server to maintain identification of the session and its state

The load balancer is forced to send all requests to the original server where the session state was created, even though that server might be heavily loaded and a less-loaded server might be available to take the request

Stateless session

The server does not need to store any session state

All necessary information is stored in the cookie held by the client

Load balancing is easier, as session state does not need to be replicated over multiple front-end servers
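
A stateless session that lives in a client-held cookie is usually signed so the server can trust what comes back. The following is a minimal sketch using only the Python standard library; the key, the function names, and the cookie layout are hypothetical, and it signs the state without encrypting it.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key shared by all front ends


def encode_session(state, key=SECRET_KEY):
    """Serialize session state and append an HMAC so any server can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def decode_session(cookie, key=SECRET_KEY):
    """Return the session state, or None if the cookie was tampered with."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_session({"user_id": 42})
print(decode_session(cookie))  # {'user_id': 42}
```

Because any front end holding the key can verify the cookie, the load balancer is free to send each request to the least-loaded server.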

Make failure domains smallest & fewest

Growth Limiting

One giant db server

All photos on one server

Failure point

3K QPS

Growth Potential

Many smaller sharded storage db servers

Range of photo ids spread across servers

Cache document state on servers

Failure point

1K QPS

1K QPS

1K QPS
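
Spreading ranges of photo ids across smaller servers can be sketched as a lookup over range boundaries. The shard names and boundaries below are hypothetical; the talk only gives the idea of three servers, each handling about 1K QPS.

```python
import bisect

# Hypothetical layout: each shard owns the photo ids below its upper bound.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["photos-db-0", "photos-db-1", "photos-db-2"]


def shard_for(photo_id):
    """Map a photo id to the database server that owns its id range."""
    if photo_id < 0:
        raise ValueError("photo_id must be non-negative")
    index = bisect.bisect_right(SHARD_UPPER_BOUNDS, photo_id)
    if index >= len(SHARD_NAMES):
        raise ValueError("photo_id is outside every configured range")
    return SHARD_NAMES[index]


print(shard_for(42))         # photos-db-0
print(shard_for(1_000_000))  # photos-db-1
print(shard_for(2_999_999))  # photos-db-2
```

With this layout, losing one server affects only the photos in its range, which is the smaller failure domain the comparison describes.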

3. Retry Safely

Growth Limiting

Retry 3 times with 3 second delay

Demand oscillation may occur

Growth Potential

Retry with random exponential back off

Random back off misplaces requests

So they don't line up when backed up

Make sure requests don't exceed dependent system timeouts

Stateless

Ensure clients send identifying info to server
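
Retrying with random exponential back off, bounded by an overall deadline, might look like the sketch below. The parameter names and defaults are assumptions; the `sleep` and `clock` arguments exist only so the behaviour can be tested without waiting.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       deadline=10.0, sleep=time.sleep, clock=time.monotonic):
    """Call `call()` until it succeeds, waiting a random, growing delay between tries."""
    start = clock()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: a random wait up to an exponentially growing cap.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if clock() - start + delay > deadline:
                raise  # waiting longer would exceed the dependent system's timeout
            sleep(delay)
```

The random delay keeps clients that failed together from retrying together, and the deadline check covers the point about not exceeding dependent system timeouts.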

4. Bound Resource Usage: Fail Gracefully

Growth Limiting

Load entire objects or docs into RAM

Error: connection timeout

Growth Potential

Operate on chunks of data

10 thumbnails instead of 20 per page

Consider data structure carefully

Don't buffer user input w/o a limit

Reject user requests if overloaded
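
Two of the points above, operating on chunks and refusing to buffer unbounded input, can be sketched with ordinary file-like objects. The function names and sizes are illustrative assumptions.

```python
import io


def read_in_chunks(stream, chunk_size=64 * 1024):
    """Yield fixed-size chunks instead of loading the whole object into RAM."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def read_bounded(stream, max_bytes):
    """Read user input, but refuse to buffer more than max_bytes."""
    data = stream.read(max_bytes + 1)  # one extra byte reveals oversized input
    if len(data) > max_bytes:
        raise ValueError("input too large")
    return data


print(list(read_in_chunks(io.BytesIO(b"abcdefg"), chunk_size=3)))  # [b'abc', b'def', b'g']
print(read_bounded(io.BytesIO(b"abc"), max_bytes=3))               # b'abc'
```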

5. Don't Crash/Assert Exit

Never die due to unexpected input

Send an exception response

Just throw the request away and ignore it

Growth Limiting

Assert(request.size<=1000)

Growth Potential

Request size > 1000 (request too big)
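
The contrast between asserting and answering with an error can be shown in a few lines. This sketch assumes the size limit is in bytes and borrows HTTP-style status codes; neither detail is stated in the talk.

```python
MAX_REQUEST_SIZE = 1000  # limit from the example; treating it as bytes is an assumption


def handle_request(body: bytes):
    """Return (status, message) instead of asserting and exiting on bad input."""
    if len(body) > MAX_REQUEST_SIZE:
        return 413, "request too big"  # reject this request, keep serving others
    return 200, "ok"


print(handle_request(b"x" * 10))    # (200, 'ok')
print(handle_request(b"x" * 1001))  # (413, 'request too big')
```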

6. Be Transparent

Jobs should not be a black box

Keep track of actions taken

Make it available

Export it

Visible via private URL

Provide visibility of internal state

Provide explicit statement of health

Load balancers can use this to send traffic elsewhere

Key value pairs

Config files

Provide debug pages for complex data

Mechanism for doing health checks

Can I read my config file?

Connection to db?

Memory used?

CPU?

Errors sent to backend?
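
An explicit statement of health, exported as key value pairs, could be assembled along these lines. The field names and the `db_ping` callable are hypothetical, and memory and CPU figures are left out to keep the sketch portable.

```python
import os


def _passes(check):
    """Run a check and treat any exception as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def health_status(config_path, db_ping, error_count):
    """Collect health as key value pairs that a debug page or load balancer can read."""
    status = {
        "config_readable": os.access(config_path, os.R_OK),
        "db_connected": _passes(db_ping),
        "errors_sent_to_backend": error_count,
    }
    status["healthy"] = status["config_readable"] and status["db_connected"]
    return status


# Hypothetical usage: os.devnull stands in for a real config file.
print(health_status(os.devnull, db_ping=lambda: True, error_count=0))
```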

7. Avoid Lazy Initialization

Prepare everything you need at startup

Perform all health checks

Before accepting requests

Include db connections

Loading files to disk etc
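
Avoiding lazy initialization amounts to running every check before the server reports ready. The class below is a minimal sketch with hypothetical names; real startup checks would open the db connections and load the files mentioned above.

```python
class Server:
    """Runs all startup checks before it is willing to accept requests."""

    def __init__(self, startup_checks):
        self.startup_checks = startup_checks  # name -> callable returning True/False
        self.ready = False

    def start(self):
        for name, check in self.startup_checks.items():
            if not check():
                raise RuntimeError(f"startup check failed: {name}")
        self.ready = True

    def handle(self, request):
        if not self.ready:
            return 503, "not ready"
        return 200, f"handled {request}"


server = Server({"db_connection": lambda: True, "files_loaded": lambda: True})
print(server.handle("GET /photo/1"))  # (503, 'not ready')
server.start()
print(server.handle("GET /photo/1"))  # (200, 'handled GET /photo/1')
```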

8. Maintain Flexibility

Don't change the world at once

Canarying experimental rollouts

Release schedule & QA testing

Don't release at peak

Don't affect users

Do it when workers can respond

Don't release at midnight

New features?

Config protected

Disabled by default

Percentage rollout

AB testing
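
A config-protected feature that is disabled by default and rolled out by percentage might be gated like this. The hashing scheme and names are assumptions for illustration, not the mechanism described in the talk.

```python
import hashlib


def feature_enabled(feature, user_id, rollout_percent=0):
    """Disabled by default; enabled for a stable `rollout_percent` of users."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99 per user
    return bucket < rollout_percent


print(feature_enabled("new_uploader", user_id=7))                       # False
print(feature_enabled("new_uploader", user_id=7, rollout_percent=100))  # True
```

Because a user's bucket never changes, raising the percentage only adds users; nobody who already has the feature loses it mid-rollout.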

9. Anticipate the Future

Growth trends

Watch them

Have safety buffer

Real disaster: Thailand floods caused a global hard drive supply delay

Plan for more capacity if needed

Consider time to order

Time to implement

Consider growing peaks

Industry changes

New technology?

Bigger images?

New upload bandwidth requirements

10. Check the User Experience

Fast & Reliable

Fast results = more users

Slow performance = drop in user %

Measurable week over week

Probe off network

Emulate real users

Automate it

Selenium

Bandwidth avail

Latency

Don't just check servers

Mind Mapped by Ayori Selassie

Find me on Twitter @iayori

Hosted at blacksintechnology.net