SRE FROM SCRATCH
SITE RELIABILITY ENGINEERING
PRODUCTION ENGINEERING
DEVOPS?
WHAT DO SRES DO?
KEEP THE SITE UP
KNOW THE PRODUCTION ENVIRONMENT
KNOW THEIR PRODUCT
LIAISON, ADVISOR, CONSULTANT
TOOLING AND AUTOMATION
TRIAGE
SO? WHY DO I NEED THEM?
UPTIME
THE ENVIRONMENT IS A PRODUCT
THEY’VE DONE THIS BEFORE
OK... LET’S HIRE SOME
WHAT TO LOOK FOR...
SRES!
SYSADMINS THAT PROGRAM
PROGRAMMERS THAT DO SYSADMIN
EXPERIENCE WITH SCALE
HOW DO I INTERVIEW THEM?
FUNDAMENTALS
HARDWARE
SYSTEM INTERNALS
UNIX ENVIRONMENT
NETWORKING
APPLICATION SUPPORT
OPERATING AT SCALE
PROGRAMMING
DON’T HIRE HEROES
OK, I’VE HIRED SOME, WHAT SHOULD THEY DO?
DESIGN REVIEW
DATA FLOWS
DEPENDENCIES
FAILURE CONDITIONS
SCALING
LAUNCH PREPAREDNESS
DOCUMENTATION
BUILD INFRASTRUCTURE
MONITORING
DEPLOYMENT
OPERATOR TOOLS
CONFIGURATION MANAGEMENT
SELF-SERVICE
HOW SHOULD THE TEAMS INTERACT...
DON’T GIVE ALL THE DAY-TO-DAY TASKS TO THE SRES
SHARE THE LOAD
HAVE YOUR SRES SIT WITH YOU
INCLUDE THEM IN DISCUSSIONS THAT AFFECT THE PRODUCTION ENVIRONMENT
SOFTWARE IS NEVER THROWN OVER THE WALL
HAND-OFFS
SRES SHOULD BLOCK DANGEROUS CHANGES
IF YOUR SRES ARE FIGHTING FIRES, THEY’RE NOT BUILDING INFRASTRUCTURE
IF YOUR SOFTWARE IS CAUSING FIRES, FIX IT
ASK YOUR SRE TO HELP MAKE FLAME-PROOF SOFTWARE
DON’T HIDE YOUR PROBLEMS FROM SRE
SRE SHOULD BE INVOLVED TO UNDERSTAND THE PROBLEM
EVERYONE SHOULD BE WRITING CODE OR MAKING HARD DECISIONS
OF COURSE THERE ARE OPTIONS...
SRE CAN DO ALL THE SUPPORT
SRES ARE A LIMITED RESOURCE
SWE CAN SUPPORT PRODUCTS...
APP SUPPORT BY SWE, INFRASTRUCTURE SUPPORT BY SRE
OR JUST ROTATE AROUND
ANY PRODUCTION ADVICE?
SELF-SERVICE
ALL TOOLS SHOULD BE WRITTEN WITH THE IDEA THAT ROBOTS CAN RUN THEM
BEFORE ROBOTS RUN THEM, ANYONE IN THE COMPANY SHOULD BE ABLE TO
PEOPLE SHOULD MAKE HARD DECISIONS, NOT PUSH BUTTONS
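
One way to read "robots can run them" is that an operator tool never blocks on a prompt, takes everything it needs from flags, and reports its result in a machine-readable form. The sketch below is only an illustration under that assumption: the tool name `restart-service`, its flags, and the JSON fields are hypothetical, and the restart itself is stubbed out.

```python
import argparse
import json


def build_parser():
    # Hypothetical operator tool; the name and flags are illustrative only.
    parser = argparse.ArgumentParser(prog="restart-service")
    parser.add_argument("--service", required=True, help="service to restart")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would happen without doing it")
    parser.add_argument("--yes", action="store_true",
                        help="confirm the action up front instead of prompting")
    return parser


def run(argv):
    """Return (exit_code, json_text) so a person or a scheduler can use it."""
    args = build_parser().parse_args(argv)
    if not args.dry_run and not args.yes:
        # Never wait on interactive input: refuse with a clear, parseable reason.
        result = {"service": args.service, "status": "refused",
                  "reason": "pass --yes or --dry-run"}
        return 2, json.dumps(result)
    if args.dry_run:
        return 0, json.dumps({"service": args.service, "status": "would_restart"})
    # The real restart call would go here; it is deliberately left out.
    return 0, json.dumps({"service": args.service, "status": "restarted"})


if __name__ == "__main__":
    import sys
    code, output = run(sys.argv[1:])
    print(output)
    sys.exit(code)
```

Because the same entry point serves both people and automation, anyone in the company can try `--dry-run` first, and a scheduler can later call the tool with `--yes` and act on the exit code.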
GIVE PEOPLE ACCESS
SWE SHOULD HAVE AS MUCH ACCESS AS THEY NEED.
SWE ALREADY WRITES CODE THAT HAS ACCESS TO SENSITIVE DATA
PRODUCTION DATA STAYS IN PRODUCTION
MAKE GOOD SYNTHETIC DATA
MAKE GOOD WAYS TO TEST IN PROD
CANARY, A/B TEST, ETC.
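
A common way to test in production with a canary is to send a small, stable slice of users to the new code path. The following is a minimal sketch of that idea, not a description of any particular rollout system: the function name `in_canary`, the salt value, and the choice of SHA-256 for bucketing are all assumptions made for illustration.

```python
import hashlib


def in_canary(user_id, percent, salt="canary-v1"):
    """Decide deterministically whether a user falls in the canary slice.

    percent is a number from 0 to 100. The same user always gets the same
    answer for a given salt, so their experience does not flip between requests.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets, which gives 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    selected = sum(in_canary(u, 5) for u in users)
    print(f"{selected} of {len(users)} users routed to the canary")
```

Changing the salt reshuffles which users land in the slice, which is useful when a new experiment should not reuse the previous canary population.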
LEARN TO TRIAGE
THINGS BREAK, YOU MUST FIX THEM
MONITORING, METRICS, OPERATOR TOOLS, FAST BUILD AND DEPLOY
TO FIX, YOU NEED TO KNOW IT’S BROKEN
MONITORING
MONITOR APPLICATIONS
MONITOR BEHAVIOR
STANDARDIZE YOUR METRICS
PUSH METRICS OUT
DECOUPLE YOUR SYSTEMS
WATCH SYSTEMS AS A FUNCTION OF CAPACITY
ONLY ALERT ON SYSTEM METRICS KNOWN TO HURT YOU
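
The last two points can be sketched together: express each system metric as a fraction of capacity rather than a raw number, and page only for metrics on an explicit list of ones known to cause harm. This is an illustrative sketch; the `Reading` type, the metric names, and the thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    name: str        # metric name, e.g. "disk_bytes"
    used: float      # current usage in the metric's own unit
    capacity: float  # known limit in the same unit


def utilization(reading):
    """Usage as a fraction of capacity, so one threshold fits any machine size."""
    if reading.capacity <= 0:
        raise ValueError(f"capacity must be positive for {reading.name}")
    return reading.used / reading.capacity


def should_alert(reading, paging_thresholds):
    """Page only for metrics explicitly listed as known to hurt."""
    threshold = paging_thresholds.get(reading.name)
    if threshold is None:
        return False  # still recorded elsewhere, but nobody gets woken up
    return utilization(reading) >= threshold


if __name__ == "__main__":
    # Hypothetical allow-list: only a full disk pages anyone.
    thresholds = {"disk_bytes": 0.90}
    print(should_alert(Reading("disk_bytes", 950.0, 1000.0), thresholds))
    print(should_alert(Reading("cpu_cores", 15.5, 16.0), thresholds))
```

Keeping the allow-list small and explicit makes it a deliberate decision to add a paging alert, rather than a default for every metric collected.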
DATA STORES
BEWARE THE RDBMS
LEARN TO SHARD
DITCH THE DURABILITY WHERE YOU CAN
BUT FIGURE OUT HOW TO BOOTSTRAP NON-DURABLE STORES
MEMCACHE IS A BLESSING AND A CURSE
ALWAYS CONSIDER A SITE-WIDE POWER OUTAGE
USE DURABLE AND NON-DURABLE STORES TOGETHER
ASK YOUR SRE FOR MORE INFO
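
Several of these points can be illustrated in one small sketch: routing keys to shards, putting a non-durable cache in front of a durable store, and rebuilding the cache after it has been wiped (for example by a site-wide power outage). Everything here is an assumption for illustration: plain dictionaries stand in for the real stores, and `shard_for` and `CachedStore` are hypothetical names.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key.

    Simple modulo routing: changing shard_count remaps most keys, so real
    systems often plan resharding separately.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class CachedStore:
    """A non-durable cache in front of a durable store (both are dict stand-ins)."""

    def __init__(self, durable):
        self.durable = durable  # survives restarts in a real deployment
        self.cache = {}         # lost on restart, like an in-memory cache

    def put(self, key, value):
        self.durable[key] = value  # write the durable copy first
        self.cache[key] = value

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.durable[key]  # raises KeyError if the key is unknown
        self.cache[key] = value    # refill the cache on a miss
        return value

    def lose_cache(self):
        """Simulate losing every cache node at once."""
        self.cache.clear()

    def bootstrap(self, hot_keys):
        """Warm the cache from the durable store before taking traffic."""
        for key in hot_keys:
            if key in self.durable:
                self.cache[key] = self.durable[key]
        return len(self.cache)


if __name__ == "__main__":
    store = CachedStore(durable={})
    store.put("user:1", "alice")
    store.put("user:2", "bob")
    store.lose_cache()
    warmed = store.bootstrap(["user:1", "user:2", "user:3"])
    print(f"warmed {warmed} keys; user:1 lives on shard {shard_for('user:1', 4)}")
```

The bootstrap step matters because a cold cache sends every read to the durable store at once, which is exactly the load the cache was hiding.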
DESPITE ALL THIS, YOU CAN STILL FAIL...
OBVIOUS FAILURE
DOWNTIME
DOWNTIME WITHOUT KNOWING
NON-OBVIOUS FAILURES
HEROIC ACTS
WERE YOU UP ALL NIGHT?
DID YOU DO THAT SAME TASK ALL DAY?
DID A WHOLE TEAM STOP WHAT THEY WERE DOING?
THESE ARE HEROIC ACTS, THEY ARE POISON
HEROISM = FAILURE
COMES FROM LEGACY SYSTEMS, PROCEDURES
ALSO FROM PERSONALITY TRAITS...
QUESTIONS?
• Grier Johnson
• @grierj