DevOps Nirvana: Seven Steps to a Peaceful Life on AWS (ARC210) | AWS re:Invent 2013

Seven Steps to a Peaceful Life on AWS

Andrew Shieh, SmugMug, @shandrew

Philip Jacob, Stackdriver, @whirlycott

Friday, November 15, 2013

Stuff we have in common

✓ Years of AWS experience
✓ Success and failure with many lessons learned
✓ Both using Stackdriver for infrastructure monitoring
✓ Lots of data
✓ Philosophically aligned on how to run on AWS
‣ Superheroes

Transition to Distributed Systems

Lure of Elasticity

Peak of Expectations

DevOps Nirvana

Operational Enlightenment

CLOUD HYPE

1: Apply lean production principles

Release all the time: continuous improvement

Make it frictionless

$ stack deploy

2: Choose the right instance type

Factors to Consider

CPU
Network
Disk I/O
Workload
Cost

Tools to help you decide

vmstat
iostat
sar
R
Excel
Stackdriver + agent
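One way to combine these factors is to measure the workload's peak needs (for example from vmstat, iostat, or sar output) and then pick the cheapest instance type that covers them with some headroom. The sketch below is only an illustration: the catalog names, capacities, and hourly prices are hypothetical placeholders, not real AWS specifications or rates, and `cheapest_fit` is a made-up helper. It considers only CPU and memory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_usd: float  # placeholder price, NOT a real AWS rate


# Hypothetical catalog for illustration only.
CATALOG = [
    InstanceType("small-general", 1, 2.0, 0.05),
    InstanceType("medium-general", 2, 4.0, 0.10),
    InstanceType("large-general", 4, 8.0, 0.20),
    InstanceType("large-compute", 8, 8.0, 0.30),
]


def cheapest_fit(
    peak_vcpus: float,
    peak_memory_gib: float,
    headroom: float = 0.25,
    catalog: List[InstanceType] = CATALOG,
) -> Optional[InstanceType]:
    """Return the cheapest type covering the measured peaks plus headroom.

    Returns None when nothing in the catalog is large enough.
    """
    need_cpu = peak_vcpus * (1 + headroom)
    need_mem = peak_memory_gib * (1 + headroom)
    candidates = [
        t for t in catalog
        if t.vcpus >= need_cpu and t.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda t: t.hourly_usd, default=None)


if __name__ == "__main__":
    choice = cheapest_fit(peak_vcpus=1.5, peak_memory_gib=3.0)
    print(choice.name if choice else "no single instance type fits")
```

A fuller version would add network and disk I/O as extra dimensions in the same filter.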

[Chart: Distribution of EC2 Instance Usage. Instance types shown: m1.large, m1.small, c1.medium, c1.xlarge, t1.micro, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, m3.xlarge, m3.2xlarge, cc2.8xlarge, hi1.4xlarge, cg1.4xlarge, hs1.8xlarge, cc1.4xlarge, cr1.8xlarge. Value labels: 21%, 20%, 7%, 7%, 3%, 2%, 2%, 2%, 1%, 1%, 0%, 0%, 0%, 0%, 0%]

3: Use configuration management

4: Choose the right monitoring solution

Rapid Setup
Full-stack
AWS Integration
Intelligent
Cluster-aware

5: Design effective alerting policies

Simple rules for confidently waking up ops@ at 3am

1. Something had better be broken (or close to it) for the customer

2. The broken thing should be as obvious as possible

3. It should be clear what action I can take to make the situation better

Customers seeing huge spike in 5XX errors

Code deploy to web cluster one hour ago

Revert!
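The three rules and the 5XX example above can be sketched as a small policy function. It pages only when customers are affected, states plainly what is broken, and attaches context plus a suggested action. The thresholds, the `Alert` type, and the `evaluate` function are assumptions made for this illustration, not part of any monitoring product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    summary: str  # rule 2: what is broken, stated plainly
    context: str  # supporting evidence for the on-call engineer
    action: str   # rule 3: what to do about it


def evaluate(
    error_rate_5xx: float,
    baseline_rate: float,
    minutes_since_deploy: Optional[int],
    spike_factor: float = 5.0,     # assumed threshold
    min_rate: float = 0.01,        # assumed threshold
    deploy_window_min: int = 120,  # assumed threshold
) -> Optional[Alert]:
    """Return an Alert only when customers are clearly affected."""
    # Rule 1: no page unless the customer-facing error rate is both
    # significant in absolute terms and a real spike over baseline.
    if error_rate_5xx < min_rate:
        return None
    if error_rate_5xx < baseline_rate * spike_factor:
        return None

    summary = (
        f"Customers seeing 5XX error rate of {error_rate_5xx:.1%} "
        f"(baseline {baseline_rate:.1%})"
    )
    recent_deploy = (
        minutes_since_deploy is not None
        and minutes_since_deploy <= deploy_window_min
    )
    if recent_deploy:
        context = f"Code deploy to web cluster {minutes_since_deploy} minutes ago"
        action = "Revert the deploy"
    else:
        context = "No recent code deploy to web cluster"
        action = "Investigate the web cluster"
    return Alert(summary, context, action)


if __name__ == "__main__":
    print(evaluate(0.12, 0.004, minutes_since_deploy=60))
```

The point of the design is that the page itself carries the likely cause and the remedy, so the person woken at 3am does not have to assemble them.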

6: Architect for high availability

Elastic Load Balancing
Amazon RDS
Apache Zookeeper

[Architecture diagram. Labels: DNS; Load Balancing 1, Load Balancing 2, ... Load Balancing n; Cell-1 GW, Cell-2 GW, ... Cell-n GW; Cloud Integration, System Agents, Custom Metrics; Data Ingestion, Archival, Online Analysis, Serving; Workers; Agents; API; Cassandra; Aggregation, Correlation, Trending, Anomaly, Health; Web/Mobile UI; Elastic Load Balancing w/ haproxy]

Localized failure
Identical dimensions
Easy to reason about
Network partitions ok
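To illustrate why cells localize failure, the sketch below pins each customer to one cell deterministically, so losing a cell affects only the customers mapped to it. The talk does not say how customers are assigned to cells. Hash-based assignment, and the `cell_for` and `affected_customers` helpers, are assumptions made for this example.

```python
import hashlib
from typing import Iterable, List


def cell_for(customer_id: str, cell_count: int) -> int:
    """Map a customer to a cell index in [0, cell_count).

    Uses a stable hash so every gateway computes the same answer
    without shared state.
    """
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_customers(
    customers: Iterable[str], failed_cell: int, cell_count: int
) -> List[str]:
    """List the customers whose cell has failed; all others are untouched."""
    return [c for c in customers if cell_for(c, cell_count) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    down = affected_customers(customers, failed_cell=0, cell_count=4)
    print(f"{len(down)} of {len(customers)} customers affected by one cell failure")
```

Plain modulo assignment remaps many customers when the number of cells changes. A lookup table or consistent hashing would avoid that, at the cost of extra state.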

Handling failure

Avoid it

Mask it

Minimize it

Recover quickly
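As one small example of masking failure, a caller can retry a transient error with exponential backoff so that a brief outage in a dependency never reaches the customer. This is a generic sketch, not a technique the talk names. `call_with_retries` and its parameters are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt. The last error is
    re-raised once all attempts are used up, so a persistent failure
    still surfaces instead of being hidden forever.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        # Fails twice, then succeeds, imitating a transient outage.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(call_with_retries(flaky))
```

In practice the `except` clause would be narrowed to errors known to be transient, and the total retry time kept short so that recovery stays quick.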

Cluster, AZ, Region

Resilience

Tolerance

7: Think holistically about quality assurance

AUTOSCALING +
AUTOMATION +
CONTINUOUS INTEGRATION +
DEVOPS GOVERNANCE +
ELASTICITY +
PROGRAMMABLE INFRASTRUCTURE =
CONSTANT CHANGE

You cannot pre-test every change

You need to be really good at detecting issues

Very quickly

Monitoring is a key part of quality assurance for dynamic systems

But monitoring tools need to be intelligent

Distributed sensors
Cloud-aware
Anomaly detection
Synthetic transactions
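As a rough illustration of anomaly detection, the sketch below flags a metric sample that falls far outside the recent rolling window. A rolling z-score is one simple technique chosen for the example. It is not presented as what Stackdriver or any other tool implements, and the class name and defaults are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag samples more than `threshold` standard deviations from the
    mean of the most recent `window` samples."""

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.samples = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a sample and report whether it looks anomalous."""
        is_anomaly = False
        # Stay quiet until there is enough history to judge against.
        if len(self.samples) >= self.min_samples:
            mu = mean(self.samples)
            sigma = pstdev(self.samples)
            if sigma > 0:
                is_anomaly = abs(value - mu) > self.threshold * sigma
            else:
                # Perfectly flat history: any change is a deviation.
                is_anomaly = value != mu
        self.samples.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingAnomalyDetector()
    latencies_ms = [100, 101, 99, 100, 102, 98, 100, 101, 500]
    for sample in latencies_ms:
        if detector.observe(sample):
            print(f"anomaly: {sample} ms")
```

Note that the anomalous sample is still added to the window, so a sustained shift eventually becomes the new normal. Whether that is desirable depends on the metric.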

• Training

• Recommended reading:

• Systemantics (aka The Systems Bible)

• High Scalability (http://highscalability.com/)

• James Hamilton’s blog (http://perspectives.mvdirona.com/)

Visit us at http://www.smugmug.com/

Visit us at booth 315!
