DevOps Nirvana: Seven Steps to a Peaceful Life on AWS (ARC210) | AWS re:Invent 2013

Post on 12-Jan-2015

Category:

Technology

DESCRIPTION

(Presented by Stackdriver) Key decisions related to architecture, tools, processes, and even team composition can have a dramatic effect on the human effort required to operate distributed applications on AWS. If you make the wrong decisions in these areas, you spend your days, nights, weekends, and vacations dealing with issues and noise. If you make the right decisions, you and your team can focus on building customer value, and your time away from work is spent… not working. Stackdriver and SmugMug describe the seven most important practices that world-class operations teams employ to minimize operational overhead, highlighting real-world examples to illustrate the importance of each.

TRANSCRIPT

Seven Steps to a Peaceful Life on AWS

Andrew Shieh, SmugMug, @shandrew

Philip Jacob, Stackdriver, @whirlycott

Friday, November 15, 13

Stuff we have in common

✓ Years of AWS experience
✓ Success and failure with many lessons learned
✓ Both using Stackdriver for infrastructure monitoring
✓ Lots of data
✓ Philosophically aligned on how to run on AWS
‣ Superheroes

[Figure: cloud hype plotted against time. Labels on the curve: Transition to Distributed Systems, Lure of Elasticity, Peak of Expectations, DevOps Nirvana, Operational Enlightenment. Axes: Cloud Hype, Time.]

STEPS

1: Apply lean production principles

Release all the time: continuous improvement

Make it frictionless

$ stack deploy

2: Choose the right instance type

Factors to Consider

CPU
Network
Disk I/O
Workload
Cost

Tools to help you decide

vmstat
iostat
sar
R
Excel
Stackdriver + agent
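
As a rough illustration of how measurements from tools like these might feed the decision, the sketch below finds the resource that is closest to saturation. The sample values, the 0.75 threshold, and the function name are all hypothetical; they are not from the talk.

```python
from statistics import mean

# Hypothetical utilization samples (fraction of capacity, 0.0 to 1.0),
# e.g. transcribed by hand from vmstat, iostat, or sar output.
samples = {
    "cpu": [0.35, 0.40, 0.38, 0.42],
    "network": [0.10, 0.12, 0.09, 0.11],
    "disk_io": [0.85, 0.90, 0.88, 0.93],
}


def bottleneck(samples, threshold=0.75):
    """Return (resource, mean utilization) for the busiest resource,
    or None if no resource averages at or above the threshold."""
    averages = {name: mean(values) for name, values in samples.items()}
    busiest = max(averages, key=averages.get)
    if averages[busiest] >= threshold:
        return busiest, averages[busiest]
    return None


result = bottleneck(samples)
print(result)
```

With these made-up numbers the workload is disk-bound, which would point toward an instance type with better disk I/O rather than more CPU.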

[Bar chart: Distribution of EC2 Instance Usage]

Instance types, in chart order: m1.large, m1.small, m1.medium, c1.medium, c1.xlarge, t1.micro, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, m3.xlarge, m3.2xlarge, cc2.8xlarge, hi1.4xlarge, cg1.4xlarge, hs1.8xlarge, cc1.4xlarge, cr1.8xlarge

Usage share, in chart order: 21%, 20%, 12%, 11%, 9%, 7%, 7%, 3%, 2%, 2%, 2%, 1%, 1%, 0%, 0%, 0%, 0%, 0%

+ EC2

3: Use configuration management

4: Choose the right monitoring solution

Rapid Setup
Full-stack
AWS Integration
Intelligent
Cluster-aware

5: Design effective alerting policies

Simple rules for confidently waking up ops@ at 3am

1. Something had better be broken (or close to it) for the customer

2. The broken thing should be as obvious as possible

3. It should be clear what action I can take to make the situation better

Customers seeing huge spike in 5XX errors

Code deploy to web cluster one hour ago

Revert!
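
The three rules and the 5XX example above could be encoded roughly as follows. This is only a sketch: the `Alert` type, the `evaluate` function, the 5% threshold, and the two-hour deploy window are assumptions made for illustration, not Stackdriver's policy model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    summary: str   # what is broken for the customer (rule 1)
    context: str   # what makes the cause obvious (rule 2)
    action: str    # what the on-call person can do (rule 3)


def evaluate(error_rate_5xx: float,
             minutes_since_deploy: Optional[float],
             threshold: float = 0.05,
             deploy_window_min: float = 120) -> Optional[Alert]:
    """Return an Alert only when customers are affected, else None."""
    # Rule 1: do not page unless the customer-facing error rate is high.
    if error_rate_5xx < threshold:
        return None
    summary = f"Customers seeing 5XX errors on {error_rate_5xx:.0%} of requests"
    # Rules 2 and 3: attach the most likely cause and a concrete action.
    if minutes_since_deploy is not None and minutes_since_deploy <= deploy_window_min:
        context = f"Code deploy to web cluster {minutes_since_deploy:.0f} minutes ago"
        return Alert(summary, context, "Revert the deploy")
    return Alert(summary, "No recent deploy", "Check upstream dependencies")


print(evaluate(error_rate_5xx=0.20, minutes_since_deploy=60))
```

A quiet result (`None`) for low error rates is the point: nothing wakes ops@ unless the first rule holds.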

6: Architect for high availability

Elastic Load Balancing
Amazon RDS
Apache ZooKeeper

[Architecture diagram. Labels: Cloud Integration, System Agents, Custom Metrics; DNS; Load Balancing 1, 2, ... n; Data Ingestion; Cell-1, Cell-2, ... Cell-n, each with GW, MQ, AI, and F; Q 1, 2, 3, ... n; Workers; Agents; API; Cassandra; S3; Archival; Online Analysis; Serving; Batch; Aggregation, Correlation, Trending; Anomaly; Health; Web/Mobile UI.]

Elastic Load Balancing w/ haproxy

Localized failure
Identical dimensions
Easy to reason
Network partitions ok
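
The slides do not say how work is assigned to cells. One common approach, shown here purely as an assumption, is to hash a stable key so that each customer always lands in the same cell and a failure stays localized to that cell's customers. The function name and key format are hypothetical.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Map a customer to a cell number in 1..num_cells, deterministically."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_cells + 1


for customer in ["acme", "globex", "initech"]:
    print(customer, "-> Cell-%d" % cell_for(customer, num_cells=3))
```

Plain modulo hashing reassigns most customers when the number of cells changes, so a real system would more likely keep an explicit customer-to-cell mapping or use consistent hashing.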

Handling failure

Avoid it

Mask it

Minimize it

Recover quickly

Cluster, AZ, Region

Resilience

Tolerance
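
To make "Mask it" and "Recover quickly" concrete, here is a minimal sketch of a client that retries with exponential backoff and then fails over to the next endpoint, for example one per AZ. The endpoint names and the `request` callable are hypothetical, and the sketch assumes failures surface as `ConnectionError`.

```python
import time


def call_with_failover(endpoints, request, retries_per_endpoint=2,
                       base_delay=0.1, sleep=time.sleep):
    """Try each endpoint in order, retrying with exponential backoff.

    `request` is any callable that takes an endpoint and either returns
    a result or raises ConnectionError.
    """
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return request(endpoint)
            except ConnectionError as err:
                last_error = err
                sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
    raise RuntimeError("all endpoints failed") from last_error


def fake_request(endpoint):
    # Simulate an outage in one AZ.
    if endpoint == "az-a.example.internal":
        raise ConnectionError("az-a unreachable")
    return "ok from " + endpoint


print(call_with_failover(
    ["az-a.example.internal", "az-b.example.internal"],
    fake_request,
    sleep=lambda seconds: None,  # skip real waiting in this demo
))
```

The caller never sees the failure in the first AZ; it only sees an error if every endpoint is down.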

7: Think holistically about quality assurance

AUTOSCALING + AUTOMATION + CONTINUOUS INTEGRATION + DEVOPS GOVERNANCE + ELASTICITY + PROGRAMMABLE INFRASTRUCTURE = CONSTANT CHANGE

You cannot pre-test every change

So

You need to be really good at detecting issues

Very quickly

Monitoring is a key part of quality assurance for dynamic systems

But monitoring tools need to be intelligent

Distributed sensors
Cloud-aware
Anomaly detection
Synthetic transactions
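
As one simple illustration of anomaly detection, the sketch below flags a metric value that lies more than a chosen number of standard deviations from the mean of a trailing window. The talk does not describe Stackdriver's actual method; the window size, threshold, and sample latencies are assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Return indexes of values far from the mean of the preceding window."""
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                anomalies.append(index)
        history.append(value)  # the window slides forward by one sample
    return anomalies


latencies_ms = [100, 102, 98, 101, 99, 100, 400, 101, 99]
print(detect_anomalies(latencies_ms))  # [6]
```

Note that the spike stays in the window for the next few samples and inflates the standard deviation, so a production detector would usually exclude flagged points or use a more robust statistic such as the median.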

• Training

• Recommended reading:

• Systemantics (aka The Systems Bible)

• High Scalability (http://highscalability.com/)

• James Hamilton’s blog (http://perspectives.mvdirona.com/)

Visit us at http://www.smugmug.com/

Visit us at booth 315!
