SRE from Scratch

Post on 16-Nov-2014

Category:

Technology

DESCRIPTION

How to bootstrap an SRE team into your company: how to hire them, what to have them work on, and how to interact with them as a team. Finally, some thoughts on general practices to consider before your SREs arrive. There are also kitten pictures.

TRANSCRIPT

SRE FROM SCRATCH

SITE RELIABILITY ENGINEERING

PRODUCTION ENGINEERING

DEVOPS?

WHAT DO SRES DO?

KEEP THE SITE UP

KNOW THE PRODUCTION ENVIRONMENT

KNOW THEIR PRODUCT

LIAISON, ADVISOR, CONSULTANT

TOOLING AND AUTOMATION

TRIAGE

SO? WHY DO I NEED THEM?

UPTIME

THE ENVIRONMENT IS A PRODUCT

THEY’VE DONE THIS BEFORE

OK... LET’S HIRE SOME

WHAT TO LOOK FOR...

SRES!

SYSADMINS THAT PROGRAM

PROGRAMMERS THAT DO SYSADMIN

EXPERIENCE WITH SCALE

HOW DO I INTERVIEW THEM?

FUNDAMENTALS

HARDWARE

SYSTEM INTERNALS

UNIX ENVIRONMENT

NETWORKING

APPLICATION SUPPORT

OPERATING AT SCALE

PROGRAMMING

DON’T HIRE HEROES

OK, I’VE HIRED SOME, WHAT SHOULD THEY DO?

DESIGN REVIEW

DATA FLOWS

DEPENDENCIES

FAILURE CONDITIONS

SCALING

LAUNCH PREPAREDNESS

DOCUMENTATION

BUILD INFRASTRUCTURE

MONITORING

DEPLOYMENT

OPERATOR TOOLS

CONFIGURATION MANAGEMENT

SELF-SERVICE

HOW SHOULD THE TEAMS INTERACT...

DON’T GIVE ALL THE DAY-TO-DAY TASKS TO THE SRES

SHARE THE LOAD

HAVE YOUR SRES SIT WITH YOU

INCLUDE THEM IN DISCUSSIONS THAT AFFECT THE PRODUCTION ENVIRONMENT

SOFTWARE IS NEVER THROWN OVER THE WALL

HAND-OFFS

SRES SHOULD BLOCK DANGEROUS CHANGES

IF YOUR SRES ARE FIGHTING FIRES, THEY’RE NOT BUILDING INFRASTRUCTURE

IF YOUR SOFTWARE IS CAUSING FIRES, FIX IT

ASK YOUR SRE TO HELP MAKE FLAME-PROOF SOFTWARE

DON’T HIDE YOUR PROBLEMS FROM SRE

SRE SHOULD BE INVOLVED TO UNDERSTAND THE PROBLEM

EVERYONE SHOULD BE WRITING CODE OR MAKING HARD DECISIONS

OF COURSE THERE ARE OPTIONS...

SRE CAN DO ALL THE SUPPORT

SRES ARE A LIMITED RESOURCE

SWE CAN SUPPORT PRODUCTS...

APP SUPPORT BY SWE, INFRASTRUCTURE SUPPORT BY SRE

OR JUST ROTATE AROUND

ANY PRODUCTION ADVICE?

SELF-SERVICE

ALL TOOLS SHOULD BE WRITTEN WITH THE IDEA THAT ROBOTS CAN RUN THEM

BEFORE ROBOTS RUN THEM, ANYONE IN THE COMPANY SHOULD BE ABLE TO

PEOPLE SHOULD MAKE HARD DECISIONS, NOT PUSH BUTTONS
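
One way to read "tools that robots can run" is: no interactive prompts, machine-readable output, and a meaningful exit code. The sketch below is only an illustration under those assumptions. The `drain_host` action, the `--dry-run` flag, and the JSON fields are hypothetical and do not come from the talk.

```python
import argparse
import json


def drain_host(host: str, dry_run: bool) -> dict:
    """Hypothetical operator action; a real tool would call out to infrastructure here."""
    return {"action": "drain", "host": host, "dry_run": dry_run, "ok": True}


def main(argv=None) -> int:
    """Parse arguments, run the action, print one JSON line, return an exit code."""
    parser = argparse.ArgumentParser(description="Drain traffic from a host (sketch).")
    parser.add_argument("host", help="host name to drain")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would happen without doing it")
    args = parser.parse_args(argv)

    result = drain_host(args.host, args.dry_run)
    # One JSON object per run, so a script or scheduler can parse the outcome.
    print(json.dumps(result, sort_keys=True))
    # Exit code 0 on success, 1 on failure, so automation can branch on it.
    return 0 if result["ok"] else 1
```

A person can run this by hand today, and a scheduler can call the same entry point later, for example by wiring `main()` to `sys.exit` in a small wrapper script.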

GIVE PEOPLE ACCESS

SWE SHOULD HAVE AS MUCH ACCESS AS THEY NEED

SWE ALREADY WRITES CODE THAT HAS ACCESS TO SENSITIVE DATA

PRODUCTION DATA STAYS IN PRODUCTION

MAKE GOOD SYNTHETIC DATA

MAKE GOOD WAYS TO TEST IN PROD

CANARY, A/B TEST, ETC.
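
A canary or A/B test needs a stable way to decide which users see the new behavior. A minimal sketch, assuming users are identified by a string ID and bucketed by hash; the function name `in_canary` and the `salt` value are illustrative and not taken from the talk.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "example-release") -> bool:
    """Return True if this user falls inside the canary slice.

    The same user always gets the same answer for a given salt, so a user
    does not flip between old and new behavior from request to request.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets, which gives 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Changing the salt per release reshuffles the buckets, so the same small group of users is not always the first to see every change.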

LEARN TO TRIAGE

THINGS BREAK, YOU MUST FIX THEM

MONITORING, METRICS, OPERATOR TOOLS, FAST BUILD AND DEPLOY

TO FIX, YOU NEED TO KNOW IT’S BROKEN

MONITORING

MONITOR APPLICATIONS

MONITOR BEHAVIOR

STANDARDIZE YOUR METRICS

PUSH METRICS OUT

DECOUPLE YOUR SYSTEMS

WATCH SYSTEMS AS A FUNCTION OF CAPACITY

ONLY ALERT ON SYSTEM METRICS KNOWN TO HURT YOU
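
The last two monitoring points can be sketched together: express each reading as a fraction of capacity, and page only for metrics that appear on an explicit list of things known to cause pain. The metric names and thresholds below are made-up examples, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    name: str        # standardized metric name
    used: float      # current usage, in the metric's own unit
    capacity: float  # the known limit, in the same unit

    @property
    def utilization(self) -> float:
        """Usage as a fraction of capacity (0.0 to 1.0 and beyond)."""
        if self.capacity <= 0:
            raise ValueError(f"{self.name}: capacity must be positive")
        return self.used / self.capacity


# Hypothetical allow-list: only metrics known to hurt get a paging threshold.
ALERT_AT = {"disk_bytes": 0.90, "db_connections": 0.80}


def alerts(readings, alert_at=ALERT_AT):
    """Return the names of readings that should page someone."""
    return [
        r.name
        for r in readings
        if r.name in alert_at and r.utilization >= alert_at[r.name]
    ]
```

A metric that is not on the list, such as a busy CPU, is still recorded and graphed; it just does not wake anyone up.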

DATA STORES

BEWARE THE RDBMS

LEARN TO SHARD

DITCH THE DURABILITY WHERE YOU CAN

BUT FIGURE OUT HOW TO BOOTSTRAP NON-DURABLE STORES

MEMCACHE IS A BLESSING AND A CURSE

ALWAYS CONSIDER A SITE-WIDE POWER OUTAGE

USE DURABLE AND NON-DURABLE STORES TOGETHER

ASK YOUR SRE FOR MORE INFO
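
A minimal sketch of using durable and non-durable stores together, and of bootstrapping the non-durable one after it has been wiped (for example by a site-wide power outage). Plain dictionaries stand in for a database and a memcache-style cache; the class and method names are illustrative assumptions, not the speaker's design.

```python
class ReadThroughStore:
    """Durable store as the source of truth, with a lossy cache in front."""

    def __init__(self, durable: dict):
        self.durable = durable  # stand-in for a durable store (survives restarts)
        self.cache = {}         # stand-in for a non-durable cache (can vanish)

    def put(self, key, value):
        # Write to the durable store first so a cache loss never loses data.
        self.durable[key] = value
        self.cache[key] = value

    def get(self, key):
        # Serve from the cache when possible, fall back to the durable store.
        if key in self.cache:
            return self.cache[key]
        value = self.durable[key]
        self.cache[key] = value
        return value

    def lose_cache(self):
        """Simulate the cache tier restarting empty."""
        self.cache.clear()

    def warm(self, hot_keys):
        """Bootstrap the cache from the durable store before taking traffic."""
        for key in hot_keys:
            if key in self.durable:
                self.cache[key] = self.durable[key]
```

The point of `warm` is that an empty cache pushes every read onto the durable store at once; preloading the known hot keys first is one way to soften that.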

DESPITE ALL THIS, YOU CAN STILL FAIL...

OBVIOUS FAILURE

DOWNTIME

DOWNTIME WITHOUT KNOWING

NON-OBVIOUS FAILURES

HEROIC ACTS

WERE YOU UP ALL NIGHT?

DID YOU DO THAT SAME TASK ALL DAY?

DID A WHOLE TEAM STOP WHAT THEY WERE DOING?

THESE ARE HEROIC ACTS, THEY ARE POISON

HEROISM = FAILURE

COMES FROM LEGACY SYSTEMS, PROCEDURES

ALSO FROM PERSONALITY TRAITS...

QUESTIONS?

• Grier Johnson

• @grierj

• grierj@gmail.com