clearly, i have made some bad decisions

“I don’t have scaling problems”

Scaling is about change

not about quantity

Problems don’t occur when things are normal

If things change, you will have scaling problems

Work takes time to do


Email needs to be read



Code runs on a server

Mistake! “I don’t have scaling problems”

Not a mistake we’re making if we’re here?

Mistakes will be made

Problems will happen


But there are things we can do to be prepared

#1 Measure Everything

How do you know if something is wrong?

Or not wrong?

# uptime
17:27:18 up 405 days, 2:36, 1 user, load average: 26.93, 10.46, 6.16

!?!?
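A load average only means something relative to the number of cores. A minimal Python sketch of that check, assuming a threshold of 1.0 runnable processes per core; the function names and the threshold are illustrative, not from the talk:

```python
import os


def load_per_core(load_1min, cpu_count):
    """Return the 1-minute load average divided by the number of cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_1min / cpu_count


def is_overloaded(load_1min, cpu_count, threshold=1.0):
    """True when the load per core exceeds the (assumed) threshold."""
    return load_per_core(load_1min, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    status = "OVERLOADED" if is_overloaded(one_minute, cores) else "ok"
    print(f"load per core: {load_per_core(one_minute, cores):.2f} ({status})")
```

The point is less the threshold than having a number you watch all the time, so that 26.93 stands out against whatever "normal" looks like for that machine.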

Read your log files

(Exceptions aren’t always exceptional)
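If exceptions show up all the time, counting them tells you which ones are background noise and which ones are new. A rough sketch, assuming log lines that mention Python-style exception names; the regular expression and the sample lines are made up for illustration:

```python
import re
from collections import Counter

# Matches names such as "KeyError" or "ConnectionResetException".
EXCEPTION_NAME = re.compile(r"\b([A-Z][A-Za-z]*(?:Error|Exception))\b")


def count_exceptions(lines):
    """Count how often each exception name appears in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(EXCEPTION_NAME.findall(line))
    return counts


if __name__ == "__main__":
    sample = [
        "12:00:01 ERROR KeyError: 'user_id'",
        "12:00:02 INFO request ok",
        "12:00:03 ERROR KeyError: 'user_id'",
        "12:00:04 WARN TimeoutError talking to db",
    ]
    for name, seen in count_exceptions(sample).most_common():
        print(name, seen)
```

Run against yesterday's log and today's log, the difference between the two counts is usually more interesting than either count alone.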

Measure in production (hat tip: Coda, “metrics, metrics everywhere”)

That’s the only place where things are really happening


But don’t let your metrics cause performance problems
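One common way to keep measurement cheap is to time only a sample of calls. A minimal sketch, assuming an in-memory `TIMINGS` store and a hypothetical `timed` decorator (neither is from the talk; a real setup would ship the numbers to a metrics system):

```python
import functools
import random
import time
from collections import defaultdict

# In-memory store for this sketch only.
TIMINGS = defaultdict(list)


def timed(name, sample_rate=0.1):
    """Decorator that records the duration of roughly sample_rate of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() >= sample_rate:
                # Unsampled call: no clock reads, no bookkeeping.
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("render_page", sample_rate=0.05)
def render_page(user):
    return f"hello {user}"
```

With `sample_rate=0.05` about one call in twenty pays for the two clock reads and the list append; the rest run untouched.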

PING web (192.168.19.1): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3

Sometimes you can just tell things are wrong

#2 Infrastructure as code (and config management)

Don’t do this.

Chef or Puppet (or cfengine or bcfg2)

Server config is code

Revision control

Feature branches

Commenting and authorship

Centralized (not in someone’s head)
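The idea behind these tools is that you declare the state you want and the tool converges the machine to it, doing nothing if it is already there. This is not the Chef or Puppet API, only a hedged Python sketch of that idea with a hypothetical `ensure_file` helper:

```python
from pathlib import Path


def ensure_file(path, content):
    """Make the file at path contain exactly content.

    Returns True if a change was made, False if it was already correct.
    """
    target = Path(path)
    if target.exists() and target.read_text() == content:
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return True
```

Because the desired content lives in a file you can commit, it gets revision control, branches and authorship for free, and running it a second time changes nothing.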

Should I choose Chef or Puppet?

Yes (Seriously, this is non-negotiable.)

How do I switch my servers to start using config management?

My advice: build new ones, throw the old ones away.

Clean, known state

test clusters: Build, Destroy, Build

live machines: Build, Use, Destroy

One-button servers

What about your code?

#3a Real deployment

Don’t do this.

$ svn up
U www/index.php
U www/payments.php
U www/settings-live.php
U www/settings-dev.php
A www/specials.php
 U .
Updated to revision 9703.

Deployment is more than just putting code in place.

reproducible, idempotent rollouts

tied to a known build number

with separately-versioned known configuration

triggered non-manually across any number of servers

with full dependency management

and automated regression testing.
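Two of these properties, idempotent rollouts and a known build number, can be sketched in a few lines. This is one possible layout, not how any of the tools named in the talk work: each build is installed into its own `releases/<build_number>` directory and a `current` symlink is switched atomically. The directory names and function names are assumptions, and the symlink swap assumes a POSIX filesystem.

```python
import os
from pathlib import Path


def install_release(root, build_number, files):
    """Write a build's files into root/releases/<build_number>."""
    release = Path(root) / "releases" / str(build_number)
    release.mkdir(parents=True, exist_ok=True)
    for name, content in files.items():
        (release / name).write_text(content)
    return release


def activate(root, build_number):
    """Point root/current at an installed build. Safe to run repeatedly."""
    root = Path(root)
    release = root / "releases" / str(build_number)
    if not release.is_dir():
        raise FileNotFoundError(f"build {build_number} is not installed")
    current = root / "current"
    tmp = root / "current.tmp"
    if tmp.is_symlink():
        tmp.unlink()  # left over from an interrupted run
    tmp.symlink_to(release.resolve())
    os.replace(tmp, current)  # atomic rename on POSIX
    return current
```

Rolling back is the same call with the previous build number, which is the practical payoff of tying every rollout to a build.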

Etsy’s Deployinator

Vlad the Deployer

Fabric

Capistrano

OS Packages

Roll your own

#3b Continuous deployment

Holy Grail: trunk = live

tests block commits

feature flags?

dark launches?
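Feature flags and dark launches both come down to shipping the code while deciding at run time who sees it. A minimal sketch, assuming a hypothetical in-process `FLAGS` table of rollout percentages (the flag names are invented; real systems usually keep this in configuration):

```python
import hashlib

# Hypothetical flag table: name -> percentage of users who see the feature.
FLAGS = {
    "new_checkout": 10,    # dark launch to roughly 10% of users
    "specials_page": 100,  # fully on
    "old_banner": 0,       # off
}


def bucket(flag, user_id):
    """Map (flag, user) to a stable number from 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, flags=FLAGS):
    """Unknown flags are treated as off."""
    return bucket(flag, user_id) < flags.get(flag, 0)
```

Hashing the flag name together with the user id keeps each user's answer stable between requests while putting different users in the 10% for different features.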

Cowboy

vs

Perfectionist

Fast iteration = fast test results

One huge feature tested... and rejected

Ten new tiny features tested

Two accepted

Failure is comfortable

Blame out, responsibility in

Consequences immediately visible

Okay, fine: Continuous Integration

After all that, things still go wrong

#4 Plan for failure

Take backups

Test backups
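A backup only counts once it has been restored somewhere and checked. A small sketch of that loop, using SQLite from the Python standard library as a stand-in for the real database (the talk's stack uses MySQL; `backup_and_verify` is a hypothetical helper, and the table name is assumed to come from trusted code):

```python
import sqlite3


def backup_and_verify(source, table):
    """Copy source into a scratch database and check the copy is usable.

    Returns the row count of `table` in the restored copy.
    """
    restored = sqlite3.connect(":memory:")
    try:
        source.backup(restored)  # sqlite3.Connection.backup, Python 3.7+
        status = restored.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"restored copy failed integrity check: {status}")
        # Table name is interpolated, so it must not come from user input.
        query = f'SELECT COUNT(*) FROM "{table}"'
        expected = source.execute(query).fetchone()[0]
        actual = restored.execute(query).fetchone()[0]
        if actual != expected:
            raise RuntimeError(f"row count mismatch: {actual} != {expected}")
        return actual
    finally:
        restored.close()
```

The same shape applies to any database: restore into a scratch instance, run a query whose answer you already know, and alert if it differs.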

Automate servers

Test server crashes

Netflix’s Chaos Monkey

And cousins: the Simian Army

Server failures predicted and foiled

What about code? New features?

#5 Future Compatibility

ALTER TABLE `user` ADD COLUMN `twootr` VARCHAR(16);
CREATE INDEX `twootr_idx` ON `user` (`twootr`);

Don’t do this. (on live)

“Future compatible” schemas

“Future compatible” code

Normalized tables are performance heavy

Don’t assume any columns?

Shiny new / Yucky old

?

Read / Write

Migrate
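The slide gives only the labels, so the following is one plausible reading rather than the talk's actual design: reads prefer the new store and fall back to the old one, writes go only to the new store, and a background sweep migrates whatever is left. Both stores are plain dicts here, and `MigratingStore` is a hypothetical name.

```python
class MigratingStore:
    """Read from new, fall back to old; write to new; migrate the rest."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def read(self, key):
        if key in self.new:
            return self.new[key]
        if key in self.old:
            # Migrate on read so the old store drains over time.
            self.new[key] = self.old[key]
            return self.new[key]
        raise KeyError(key)

    def write(self, key, value):
        self.new[key] = value

    def migrate_all(self):
        """Background sweep: copy whatever has not been touched yet."""
        moved = 0
        for key, value in self.old.items():
            if key not in self.new:
                self.new[key] = value
                moved += 1
        return moved
```

The live system keeps answering requests throughout, and no step requires an `ALTER TABLE` on a table that is in use.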

What about other bad decisions?

#6 Wing It

spof.yola.com
- Django
- MySQL

Scheduled for reboot

Step by step: second MySQL (slave replication), second Django, LB, drop DNS TTL

But it’s okay

Jonathan Hitchcock

@vhata

github.com/vhata
