Transcript
Page 1: Designing apps for resiliency

Designing Apps for Resiliency

Masashi Narumoto
Principal Lead PM
AzureCAT patterns & practices

Page 2: Designing apps for resiliency

Agenda
• What is ‘resiliency’?
• Why is it so important?
• Process to improve resiliency
• Resiliency checklist

Page 3: Designing apps for resiliency

What is ‘Resiliency’?
• Resiliency is the ability to recover from failures and continue to function. It's not about avoiding failures, but responding to failures in a way that avoids downtime or data loss.
• High availability is the ability of the application to keep running in a healthy state, without significant downtime.
• Disaster recovery is the ability to recover from rare but major incidents: non-transient, wide-scale failures, such as service disruption that affects an entire region.

Page 4: Designing apps for resiliency

Why is it so important?
• More transient faults in the cloud
• Dependent services may go down
• SLA < 100% means something could go wrong at some point
• More focus on MTTR (mean time to recover) rather than MTBF (mean time between failures)
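One way to see why MTTR gets the attention: steady-state availability is commonly modeled as MTBF / (MTBF + MTTR). The sketch below uses that textbook formula with made-up numbers (the hour values are assumptions, not figures from the talk) to show that halving recovery time buys the same availability as doubling the time between failures.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails about every 999 hours, takes 1 hour to recover.
baseline = availability(mtbf_hours=999, mttr_hours=1.0)
# Option A: recover twice as fast.
faster_recovery = availability(mtbf_hours=999, mttr_hours=0.5)
# Option B: fail half as often.
fewer_failures = availability(mtbf_hours=1998, mttr_hours=1.0)

print(f"baseline={baseline:.5f}")
print(f"faster recovery={faster_recovery:.5f}")
print(f"fewer failures={fewer_failures:.5f}")
```

In the cloud, failures of underlying hardware and dependent services are largely outside the application's control, so shortening recovery is often the more reachable of the two options.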

Page 5: Designing apps for resiliency

Process to improve resiliency

Plan → Design → Implement → Test → Deploy → Monitor → Respond

• Plan: Define requirements
• Design: Identify failures
• Implement: Implement recovery strategies
• Test: Inject failures, simulate failover (FO)
• Deploy: Deploy apps in a reliable manner
• Monitor: Monitor failures
• Respond: Take actions to fix issues

Page 6: Designing apps for resiliency

Defining resiliency requirements

[Timeline: Data backup → Data backup → Data backup → Major incident occurs → Service recovered → Business recovered]

• Recovery Point Objective (RPO): The maximum time period in which data might be lost
• Recovery Time Objective (RTO): Duration of time in which the service must be restored after an incident
• Maximum Tolerable Outage (MTO)
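The RPO and RTO definitions above can be turned into a simple check. This is a minimal sketch under one simplifying assumption: if an incident hits just before the next backup, the data lost is roughly one full backup interval. The function name and the sample durations are hypothetical.

```python
from datetime import timedelta


def check_objectives(backup_interval, measured_recovery_time, rpo, rto):
    """Compare a backup schedule and a measured restore time with RPO and RTO."""
    # Assumption: worst case, the incident happens right before the next backup.
    worst_case_data_loss = backup_interval
    return {
        "rpo_met": worst_case_data_loss <= rpo,
        "rto_met": measured_recovery_time <= rto,
    }


result = check_objectives(
    backup_interval=timedelta(hours=1),
    measured_recovery_time=timedelta(minutes=45),
    rpo=timedelta(minutes=30),
    rto=timedelta(hours=1),
)
print(result)  # hourly backups cannot meet a 30-minute RPO
```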

Page 7: Designing apps for resiliency

SLA (Service Level Agreement)

Page 8: Designing apps for resiliency

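Composite SLA

Two services that must both be available (Composite SLA = ?):
99.95% × 99.99% = 99.94%

With a cache as fallback (Composite SLA = ?):
Fallback action: Return data from local cache
The call fails only if the 99.99% service and the 99.9% fallback fail at the same time:
1.0 − (0.0001 × 0.001) = 99.99999%

Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA
With N = 0.9995 and a Traffic Manager SLA of 0.9999:
1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
(1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 = 0.99989975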
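The composite SLA arithmetic on this slide can be reproduced with two small helpers. This is a sketch, assuming component failures are independent; the function names `serial_sla` and `parallel_sla` are made up for illustration.

```python
def serial_sla(*slas: float) -> float:
    """All components must be up, so availabilities multiply."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def parallel_sla(*slas: float) -> float:
    """Redundant components: the whole fails only if every one fails."""
    all_fail = 1.0
    for sla in slas:
        all_fail *= 1.0 - sla
    return 1.0 - all_fail


# Two dependent services in series.
print(f"{serial_sla(0.9995, 0.9999):.4%}")
# A 99.99% service with a 99.9% fallback.
print(f"{parallel_sla(0.9999, 0.999):.5%}")
# Two regions (N = 0.9995 each) behind Traffic Manager (0.9999).
two_regions = parallel_sla(0.9995, 0.9995)
print(f"{serial_sla(two_regions, 0.9999):.6f}")
```

Note that the front-end router is itself a serial dependency, which is why the two-region result ends up just below the Traffic Manager SLA.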

Page 9: Designing apps for resiliency

Designing for resiliency

Example failures:
• Reading data from SQL Server fails
• A web server goes down
• An NVA (network virtual appliance) goes down

1. Identify possible failures
2. Rate risk of each failure (impact × likelihood)
3. Design resiliency strategy
- Detection
- Recovery
- Diagnostics
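Step 2 above (impact × likelihood) can be sketched as a small ranking. The 1 to 5 scales and the scores assigned to each failure are assumptions chosen for illustration; the talk does not give numbers.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    impact: int      # hypothetical scale: 1 (minor) to 5 (severe)
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        return self.impact * self.likelihood


def rank_by_risk(modes):
    """Highest-risk failure modes first."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


modes = [
    FailureMode("Reading data from SQL Server fails", impact=4, likelihood=3),
    FailureMode("A web server goes down", impact=2, likelihood=4),
    FailureMode("An NVA goes down", impact=5, likelihood=2),
]
for mode in rank_by_risk(modes):
    print(mode.risk, mode.name)
```

The ranking then tells you where to spend design effort on detection, recovery, and diagnostics first.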

Page 11: Designing apps for resiliency

Rack awareness

[Diagram: the web tier, middle tier, and data tier each run in their own availability set, with instances spread across fault domains 1, 2, and 3. The data tier shows Shard #1 and Shard #2, with Replica #1 and Replica #2.]

Page 12: Designing apps for resiliency

Load balance multiple instances

Application gateway for
- L7 routing
- SSL termination

Page 13: Designing apps for resiliency

Failover / Failback

Traffic manager: Priority routing method

[Diagram: a primary region and a secondary region (regional pair), each with Web, Application, and Data tiers. Automated failover from the primary to the secondary region; manual failback.]

Page 14: Designing apps for resiliency

Data replication

Azure storage
Geo replica (RA-GRS)

LocationMode = PrimaryThenSecondary
LocationMode = SecondaryOnly

Periodically check if the primary is back online
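The read pattern on this slide can be sketched without any SDK. The class below is a hypothetical illustration, not the Azure Storage client API: `read_primary` and `read_secondary` are caller-supplied functions, and `probe_interval` is an assumed setting. It tries the primary first, switches to secondary-only reads after a failure, and probes the primary again once the interval has passed.

```python
import time


class GeoReplicatedReader:
    """Reads from the primary; falls back to the secondary replica on failure."""

    def __init__(self, read_primary, read_secondary,
                 probe_interval=30.0, clock=time.monotonic):
        self._read_primary = read_primary
        self._read_secondary = read_secondary
        self._probe_interval = probe_interval
        self._clock = clock
        self._next_primary_attempt = float("-inf")  # try the primary right away

    def read(self, key):
        now = self._clock()
        if now >= self._next_primary_attempt:
            try:
                return self._read_primary(key), "primary"
            except ConnectionError:
                # Primary looks down: read secondary-only until the next probe.
                self._next_primary_attempt = now + self._probe_interval
        return self._read_secondary(key), "secondary"
```

Skipping the primary between probes avoids paying a failed-request timeout on every read while the primary region is down. Keep in mind that a geo-replicated secondary may return slightly stale data.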

Page 15: Designing apps for resiliency

Retry transient failures

See ‘Azure retry guidance’ for more details

Total time spent on retries < E2E (end-to-end) latency requirement
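A minimal sketch of retrying a transient failure with exponential backoff while staying inside an end-to-end latency budget. `TransientError`, `retry_transient`, and all default values are hypothetical; real services document which errors are safe to retry.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for an error that is expected to clear up on its own."""


def retry_transient(operation, max_attempts=4, base_delay=0.2, deadline=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Retry with exponential backoff and jitter, within a total time budget."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts
            delay = base_delay * 2 ** (attempt - 1)
            delay += random.uniform(0, base_delay)  # jitter spreads out retries
            if clock() - start + delay > deadline:
                raise  # the next wait would exceed the latency budget
            sleep(delay)
```

The `deadline` check is what keeps the total retry time below the end-to-end latency requirement: the function gives up early instead of making the caller wait past its budget.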

Page 16: Designing apps for resiliency

Circuit Breaker

[Diagram: User → Your application → Remote service (Failed)]

• Retrying holds resources while the operation is retried
• This can lead to cascading failures

Page 17: Designing apps for resiliency

Circuit Breaker

https://github.com/App-vNext/Polly
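Polly is a .NET library; the sketch below is not its API. It is a hypothetical, minimal Python illustration of the circuit breaker idea: after a number of consecutive failures the breaker opens and calls fail fast, and after a timeout one trial call is let through (half-open). The threshold and timeout values are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast; remote service assumed down")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

Failing fast releases threads and connections that would otherwise be held while retrying against a service that is known to be down.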

Page 18: Designing apps for resiliency

Bulkhead

[Diagram: Workload 1 and Workload 2 call Service A, Service B, and Service C, with a separate thread pool for each.]

Resources that can be partitioned:
• Memory
• CPU
• Disk
• Thread pool
• Connection pool
• Network connection
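A minimal sketch of the bulkhead idea using one thread pool per downstream service, so a slow service can exhaust only its own pool. The `Bulkheads` class, the service names, and the pool sizes are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded thread pool per downstream service."""

    def __init__(self, pool_sizes):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, service, fn, *args):
        return self._pools[service].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


bulkheads = Bulkheads({"service_a": 1, "service_b": 1})
gate = threading.Event()
stuck = bulkheads.submit("service_a", gate.wait)        # simulates a hung call to A
healthy = bulkheads.submit("service_b", lambda: "b-ok")  # B has its own pool
print(healthy.result(timeout=5))  # completes even though A's pool is blocked
gate.set()
bulkheads.shutdown()
```

With a single shared pool, the hung call to Service A could have occupied every worker and starved calls to Service B as well.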

Page 19: Designing apps for resiliency

Other design patterns for resiliency
• Compensating transaction
• Scheduler-agent-supervisor
• Throttling
• Load leveling
• Leader election

See ‘Cloud design patterns’

Page 20: Designing apps for resiliency

Principles of chaos engineering
• Build a hypothesis around steady-state behavior
• Vary real-world events
• Run experiments in production
• Automate experiments to run continuously

http://principlesofchaos.org/

[Diagram: production traffic is fed to a control group and an experimental group. HW/SW failures and a spike in traffic are injected into the experimental group, and the difference between the two groups is verified in terms of steady state.]
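The control-versus-experimental comparison can be sketched as a steady-state check. This is an illustration only: it assumes the steady-state metric is request success rate and that a 1-point drop is the acceptable tolerance; both choices are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of requests that succeeded (outcomes is a list of booleans)."""
    return sum(outcomes) / len(outcomes)


def steady_state_holds(control, experiment, tolerance=0.01):
    """True if the experimental group stays within tolerance of the control group."""
    return success_rate(control) - success_rate(experiment) <= tolerance


control = [True] * 1000                    # no faults injected
resilient = [True] * 995 + [False] * 5     # faults injected, system absorbed them
fragile = [True] * 900 + [False] * 100     # faults injected, users saw errors

print(steady_state_holds(control, resilient))
print(steady_state_holds(control, fragile))
```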

Page 21: Designing apps for resiliency

Testing for resiliency
• Fault injection testing
  - Shut down VM instances
  - Crash processes
  - Expire certificates
  - Change access keys
  - Shut down the DNS service on domain controllers
  - Limit available system resources, such as RAM or number of threads
  - Unmount disks
  - Redeploy a VM
• Load testing
  - Use production data as much as you can
  - VSTS, JMeter
• Soak testing
  - Longer period under normal production load

Page 22: Designing apps for resiliency

Blue/Green and Canary release

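Blue/Green Deployment
[Diagram: two full stacks (Web, App, DB), one running the current version and one running the new version.]

Canary release
[Diagram: two stacks (Web, App, DB); 90% of traffic goes to the current version and 10% to the new version.]

[Diagram labels: Load Balancer, Reverse Proxy]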
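The 90% / 10% split in a canary release can be sketched as a weighted routing decision. The function name and the injectable `rng` parameter are hypothetical; in practice this decision is usually made by the load balancer or reverse proxy rather than by application code.

```python
import random


def choose_version(canary_fraction=0.10, rng=random.random):
    """Route a request to the new version with probability canary_fraction."""
    return "new" if rng() < canary_fraction else "current"


seeded = random.Random(42)
picks = [choose_version(0.10, seeded.random) for _ in range(10_000)]
print(picks.count("new") / len(picks))  # close to 0.10
```

Raising `canary_fraction` step by step shifts more traffic to the new version as confidence grows, and setting it to 0 rolls the canary back.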

Page 23: Designing apps for resiliency

Deployment slots at App Service

Page 24: Designing apps for resiliency

Dark launching

[Diagram: in the production environment, a new feature sits behind a toggle (enable/disable) that controls whether it is exposed in the user interface.]
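A minimal sketch of the toggle behind dark launching: the new code ships to production but stays hidden from the user interface until the toggle is enabled. The `FeatureToggles` class, the `recommendations` feature, and the widget names are all hypothetical.

```python
class FeatureToggles:
    """In-memory feature switches; a real system would load these from config."""

    def __init__(self, enabled=()):
        self._enabled = set(enabled)

    def enable(self, name):
        self._enabled.add(name)

    def disable(self, name):
        self._enabled.discard(name)

    def is_enabled(self, name):
        return name in self._enabled


def render_home(toggles):
    """Return the widgets shown in the UI for the current toggle state."""
    widgets = ["search", "orders"]
    if toggles.is_enabled("recommendations"):  # the dark-launched feature
        widgets.append("recommendations")
    return widgets


toggles = FeatureToggles()
print(render_home(toggles))   # feature deployed but hidden
toggles.enable("recommendations")
print(render_home(toggles))   # feature exposed without a redeploy
```

If the new feature misbehaves, disabling the toggle hides it again without rolling back the deployment.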

Page 26: Designing apps for resiliency

Other resources

http://docs.microsoft.com/Azure

Page 27: Designing apps for resiliency

Resiliency / High Availability / Disaster Recovery

• Throttling
• Circuit breaker
• Zero downtime deployment
• Eventual consistency
• Data restore
• Retry
• Graceful degradation
• Geo-replica
• Multi-region deployment

