Designing Apps for Resiliency
Posted on 16-Apr-2017
Masashi Narumoto, Principal Lead PM, AzureCAT patterns & practices
Agenda
- What is resiliency?
- Why is it so important?
- Process to improve resiliency
- Resiliency checklist
Everybody is talking about resiliency, but its definition is not clear, so I'll clarify what it means. Why is everybody talking about it? There are a number of reasons. The main part of this topic is how to make your app resilient. I'll show you some examples from the checklist.
What is Resiliency?
Resiliency is the ability to recover from failures and continue to function. It's not about avoiding failures, but responding to failures in a way that avoids downtime or data loss. High availability is the ability of the application to keep running in a healthy state, without significant downtime.
Disaster recovery is the ability to recover from rare but major incidents: Non-transient, wide-scale failures, such as service disruption that affects an entire region.
DR? Data backup? These are all true statements but none of them clearly define what resiliency means.
To be HA, an app doesn't need to go down and come back online. If your app is running with 100% uptime without any failures, it's HA, but you never know if it's resilient. Once something bad happens, it may take days to come back online, which is not really resilient at all.
DR covers catastrophic failures, such as something that could take down an entire datacenter. For example…
Why is it so important?
- More transient faults in the cloud
- A dependent service may go down
- SLA < 100% means something could go wrong at some point
- More focus on MTTR rather than MTBF
Why is it so important? Why is everybody talking about resiliency?
Transient faults occur because of commodity hardware, networking, and the multi-tenant shared model. Remote services could go down at any time. 99.99% availability means about 4 minutes of downtime a month. Do you want to sit and wait for 4 minutes, or do something else? I'd rather do something, because you never know whether it's going to be 4 minutes or 4 hours.
Based on the assumption that anything can go wrong at some point, the focus has been shifting from MTBF (mean time between failures) to MTTR (mean time to recovery).
Process to improve resiliency
- Define requirements
- Identify failures
- Implement recovery strategies
- Inject failures (simulate failover)
- Deploy apps in a reliable manner
- Monitor failures
- Take actions to fix issues
We're getting into the more interesting part. We discussed what resiliency means and why it's so important. Now we're getting into the "how" part.
This is the process to improve resiliency in your system, in 7 steps from planning to response. Let's talk about each step. Clearly define your requirements; otherwise you don't know what you're aiming for. Identify all possible failures you may see, and implement recovery strategies to bounce back from these failures. To make sure these strategies work, you need to test them by injecting failures. Deployment needs to be resilient too, because deploying a new version is the most common cause of failures. Monitoring is key to QoS: monitor errors, latency, throughput, etc., in percentiles. You need to take action quickly to mitigate the downtime.
Defining resiliency requirements
Diagram: a timeline running from the last data backup, through a major incident, to the point where the business is recovered.
- Recovery Point Objective (RPO): the maximum time period in which data might be lost
- Recovery Time Objective (RTO): the duration of time within which the service must be restored after an incident
- Maximum Tolerable Outage (MTO): the maximum time a business process can be down
There are two common requirements when it comes to resiliency.
- RPO defines the interval of data backup.
- RTO defines the requirements for hot/warm/cold stand-by.
- MTO defines how long a particular business process can be down.
SLA (Service Level Agreement)
If you look at well-experienced customers, they define availability requirements per use case. Decompose your workload and define availability requirements (uptime, latency, etc.) for each. A higher SLA comes with cost because of redundant services/components. Measuring downtime becomes an issue when you target five nines.
Composite SLA = ?
- Example: App Service (99.95%) calling SQL Database (99.99%) in series: 99.95% × 99.99% = 99.94%
- Fallback action: return data from local cache. The data path then fails only if both the database (0.0001 failure probability) and the cache (0.001 failure probability) fail: 1.0 − (0.0001 × 0.001) = 99.99999%
- Composite SLA for two regions = (1 − (1 − N)(1 − N)) × Traffic Manager SLA:
  1 − (1 − 0.9995) × (1 − 0.9995) = 0.99999975
  (1 − (1 − 0.9995) × (1 − 0.9995)) × 0.9999 = 0.999899
The fact that App Service offers 99.95% doesn't mean that the entire system has 99.95%. Another important fact is that the SLA doesn't guarantee the service is always up 99.95% of the time; you'll get money back when the provider violates the SLA. It's not just a numbers game. This is where resiliency comes into play.
An SLA is not a guarantee: if we don't meet the SLA, you get money back. The definition of the SLA varies depending on the service.
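The composite-SLA arithmetic on the slide can be sketched in a few lines. The 99.95%/99.99% figures are the example SLAs from the slide; the helper names are mine, not any Azure API:

```python
def serial(*slas):
    """Services composed in series: the system is up only if all are up."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(sla, copies=2):
    """Independent redundant deployments: up if at least one copy is up."""
    return 1.0 - (1.0 - sla) ** copies

def downtime_minutes_per_month(sla, days=30):
    """Downtime budget implied by an SLA, in minutes per 30-day month."""
    return (1.0 - sla) * days * 24 * 60

single_region = serial(0.9995, 0.9999)          # ~0.9994 (the 99.94% above)
two_regions = serial(parallel(0.9995), 0.9999)  # ~0.9999 behind Traffic Manager
print(single_region, two_regions)
print(downtime_minutes_per_month(0.9999))       # ~4.3 minutes: the "4 mins a month"
```

This also shows why redundancy pays: two 99.95% regions in parallel are far more available than either alone, and the Traffic Manager SLA then becomes the limiting factor.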
Designing for resiliency
Examples of failures:
- Reading data from SQL Server fails
- A web server goes down
- An NVA (network virtual appliance) goes down

Process:
- Identify possible failures
- Rate the risk of each failure (impact × likelihood)
- Design a resiliency strategy: detection, recovery, diagnostics
In order to design your app to be resilient, you need to identify all possible failures first, then implement resiliency strategies against them.
Failure mode analysis
To help you identify all possible failures, we published a list of the most common failures on Azure. It has a few items per service, 30 to 40 items in total. Let's take a look.
In the case of DocumentDB: when you fail to read data from it, the client SDK retries the operation for you. The only transient fault it retries on is throttling (HTTP 429). If you constantly get 429, consider increasing its scale (RUs). DocumentDB now supports geo-replication: if the primary region fails, it will switch traffic to the other regions in the list you configure. For diagnostics, you need to log all errors on the client side.
Diagram: web tier, middle tier, and data tier, each in its own availability set, with VMs spread across fault domains 1, 2, and 3; the data tier's replicas (#1, #2) and shards (#1, #2) are likewise distributed across fault domains.
You can think of a rack as a power module: if it goes down, everything belonging to it goes down together. So it's better to distribute VMs across different racks for redundancy's sake. This is where availability sets come into play: each machine in the same availability set belongs to a different rack. VM scale sets automatically put VMs into 5 fault domains and 5 update domains, but they don't support data disks yet.
Load balance multiple instances
Application Gateway for:
- L7 routing
- SSL termination
Avoiding a single point of failure (SPOF) is critical for resiliency. Many customers still don't know these basics; they deploy critical workloads on a single machine. To avoid this, you need redundant components: if one goes down, the others are still running. In this case, put the VMs in the same tier into the same availability set behind a load balancer. The LB distributes requests to the VMs in its backend address pool. The health probe can be either HTTP or TCP depending on the workload. By default it pings the root path /; you may want to expose a health endpoint that monitors all critical components.
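A minimal sketch of such a health endpoint, using Python's stdlib HTTP server. The `/healthz` path, port, and dependency names are assumptions for illustration, not anything the Azure LB mandates:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def health_status(dependencies):
    """Map a dict of dependency-check results to an HTTP status and body."""
    healthy = all(dependencies.values())
    return (200, b"OK") if healthy else (503, b"UNHEALTHY")

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Probe the critical components; stubbed out here.
            status, body = health_status({"database": True, "cache": True})
            self.send_response(status)
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve, point the LB's HTTP probe at http://<vm>:8080/healthz and run:
#     HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Returning 503 when any critical dependency fails lets the probe pull the instance out of rotation instead of sending it traffic it cannot serve.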
Failover / Failback
Traffic Manager with the priority routing method
- Automated failover from the primary region to the secondary region (regional pair)
- Manual failback
There's a risk of data loss during failover, so take a snapshot and ensure data integrity.
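The priority-routing decision itself can be sketched as follows; the endpoint dicts and field names are hypothetical, not the Traffic Manager API:

```python
def select_endpoint(endpoints):
    """Pick the healthy endpoint with the best (lowest) priority number,
    mimicking priority routing: all traffic goes to the primary region
    until its health probe fails, then to the secondary."""
    for ep in sorted(endpoints, key=lambda e: e["priority"]):
        if ep["healthy"]:
            return ep["name"]
    raise RuntimeError("no healthy endpoint available")

endpoints = [
    {"name": "primary-region", "priority": 1, "healthy": False},
    {"name": "secondary-region", "priority": 2, "healthy": True},
]
print(select_endpoint(endpoints))  # primary is down, so: secondary-region
```

Failback is the same logic in reverse, but as the slide notes it should be a manual decision, taken only after data integrity is verified.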
Azure Storage geo-replication (RA-GRS)
- LocationMode = PrimaryThenSecondary
- LocationMode = SecondaryOnly
- Periodically check if the primary is back online
For infrequent transient faults, set the property to PrimaryThenSecondary; it will switch to the secondary region for you. For more frequent or non-transient faults, set the property to SecondaryOnly; otherwise the client keeps hitting the primary and getting errors.
You need to monitor the primary region; when it comes back online, set the property back to PrimaryOnly or PrimaryThenSecondary.
One thing to note: Azure Storage won't fail over to the secondary until a region-wide disaster happens, which I don't think we have had yet.
This strategy applies to reads, not writes.
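The two LocationMode behaviors can be sketched generically. Here `read_primary` and `read_secondary` are hypothetical stand-ins for the storage SDK's region-specific reads, not real Azure calls:

```python
PRIMARY_THEN_SECONDARY = "PrimaryThenSecondary"
SECONDARY_ONLY = "SecondaryOnly"

def read_with_location_mode(mode, read_primary, read_secondary):
    """Read from the geo-replicated store according to the location mode."""
    if mode == SECONDARY_ONLY:
        # Primary is known to be failing: stop hitting it entirely.
        return read_secondary()
    try:
        return read_primary()
    except IOError:
        # Infrequent transient fault: fall back to the geo-replica.
        return read_secondary()
```

A background check would probe the primary and flip the mode back once it is healthy, matching the "periodically check if it's back online" step above.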
Retry transient failures
See the Azure retry guidance for more details. Keep the total retry time under the end-to-end latency requirement.
Lets take a look at a few resiliency strategies to recover from failures you identified above.
Use exponential back-off for non-interactive transactions, and quick linear retry for interactive transactions.
Anti-patterns:
- Cascading retries (5 × 5 = 25 attempts)
- More than one immediate retry
- Many attempts at a regular interval (randomize the interval instead)
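A sketch of exponential back-off with a randomized interval, which avoids the regular-interval and immediate-retry anti-patterns above. `TransientError` is a stand-in for whatever transient fault your SDK surfaces, and the delay parameters are illustrative:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a transient fault (timeout, throttling, etc.)."""

def retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry an operation with exponential back-off and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: sleep a random time in [0, base * 2^attempt],
            # capped so back-off never exceeds max_delay.
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The randomization matters under load: if every client retries on the same fixed schedule, the retries arrive in synchronized waves and keep the struggling service down.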
Holding resources while retrying an operation can lead to cascading failures.
People often say "don't waste your time, circuit-break and fail fast." That's only part of the problem. The real issue is cascading failures. Also, by continually retrying failed operations, you prevent the remote service from recovering from its failed state.
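The circuit-breaker idea can be sketched like this; a hypothetical minimal implementation, not any particular library, with arbitrary threshold and timeout values:

```python
import time

class CircuitBreaker:
    """After repeated failures, trip open: callers fail fast instead of
    holding resources, and the remote service gets room to recover."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Note the half-open state: after the reset timeout, one trial call is allowed through, so the breaker re-trips quickly if the dependency is still down.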
Bulkhead
Diagram: Service A, Service B, and Service C are each called through a dedicated thread pool; likewise Workload 1 and Workload 2 get their own pools. Resources to partition include memory, CPU, disk, thread pools, connection pools, and network connections.
The types of resources to isolate are not limited to these, but they are the most common ones.
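A bulkhead can be as simple as giving each downstream service its own bounded thread pool. This sketch uses Python's `concurrent.futures`; the pool sizes and service names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Each downstream service gets its own bounded pool, so a hung Service A
# can exhaust at most its own 10 threads and calls to Service B keep flowing.
service_a_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="svc-a")
service_b_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="svc-b")

def call_remote(service, request):
    # Placeholder for the actual remote call to the service.
    return f"{service}:{request}"

def call_service_a(request):
    return service_a_pool.submit(call_remote, "A", request)

def call_service_b(request):
    return service_b_pool.submit(call_remote, "B", request)
```

With a single shared pool, one slow dependency eventually occupies every thread and takes the healthy dependencies down with it; the partition is the whole point of the pattern.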
Other design patterns for resiliency
- Compensating transaction
- Scheduler-agent-supervisor
- Throttling
- Load leveling
- Leader election
See Cloud design patterns
Principles of chaos engineering
- Build a hypothesis around steady-state behavior
- Vary real-world events
- Run experiments in production
- Automate experiments to run continuously
http://principlesofchaos.org/
Diagram: production traffic is fed to a control group and an experimental group; hardware/software failures and spikes in traffic are injected into the experimental group, and the difference between the groups is verified in terms of steady state.
Given the chaotic nature of the cloud and distributed systems, something is always happening somewhere, so it makes sense to follow chaos engineering principles.
Define the steady state as the measurable output of the system, rather than its internal attributes. Introduce real-world chaotic events such as hardware failures, software failures, spikes in traffic, etc. The best way to validate the system at production scale is to run experiments in production. Netflix, at least once a month, injects faults into one of their regions to see if their system can keep up and running. Since it's such a time-consuming task, you should automate the experiments and run them continuously.
Chaos engineering is not testing; it's validation of the system.
Testing for resiliency
Fault injection testing:
- Shut down VM instances
- Crash processes
- Expire certificates
- Change access keys
- Shut down the DNS service on domain controllers
- Limit available system resources, such as RAM or number of threads
- Unmount disks
- Redeploy a VM
Load testing:
- Use production data as much as you can
- VSTS, JMeter
Soak testing:
- Run for a longer period under normal production load
Tools: Chaos Monkey, Chaos Kong, Toxiproxy. See https://en.wikipedia.org/wiki/Soak_testing
Blue/Green and Canary release
Diagram: two identical environments, each with a web app and a DB; for a canary release, a load balancer or reverse proxy routes 90% of traffic to the current version and 10% to the new version.
Deploy the current and new versions into two identical environments (blue and green). Smoke-test the new version, then switch traffic to it.
A canary release incrementally switches traffic from the current version to the new one using a load balancer. Use Akamai or an equivalent to do canary releases. The unique name for this technique comes from a tactic used by coal miners: they'd bring canaries with them into the coal mines to monitor the level of carbon monoxide in the air; if the canary died, they knew that the level of toxic gas was high, and they'd leave the mine.
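The 90/10 split can be sketched as a routing decision. In practice this split lives in the load balancer or reverse proxy, not in application code; the fraction and version labels here are the slide's example numbers:

```python
import random

def route_request(new_version_fraction=0.10, rng=random.random):
    """Send roughly 10% of requests to the new version, the rest to the
    current one. `rng` is injectable so the decision can be tested."""
    return "new" if rng() < new_version_fraction else "current"
```

Ramping the canary is then just raising `new_version_fraction` in steps (10% → 50% → 100%) while watching error rates and latency at each step.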
In either case you should be able to roll back if the new version doesn't work. Graceful shutdown and switching the DB/storage are the challenges.
GitHub routes requests to both blue and green and compares the results to make sure they are identical.
Dark launch: deploy a new feature without enabling it for users. Make sure it won't cause any issues in production, then enable it.
Deployment slots in App Service
This is how it works in App Service. You can have up to 15 deployment slots.
Diagram: a new feature deployed in the production environment, with a toggle in the user interface to enable or disable it.
Deploy a new feature to the production environment without enabling it for users. Make sure it works within the production infrastructure: no memory leaks, nothing. Then enable it for users in the UI. If something bad happens, disable it in the UI. Facebook does this.
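The toggle itself can be sketched as a feature flag. The flag name and in-memory store are illustrative only; a real system would read flags from a config service or database so they can be flipped without redeploying:

```python
# Deployed dark: the new code path is live in production but switched off.
FLAGS = {"new-checkout": False}

def is_enabled(flag):
    return FLAGS.get(flag, False)

def checkout(cart):
    # The flag decides which code path serves the request at runtime.
    if is_enabled("new-checkout"):
        return new_checkout(cart)
    return old_checkout(cart)

def old_checkout(cart):
    return "old checkout"

def new_checkout(cart):
    return "new checkout"
```

Because disabling is just flipping the flag back, recovery from a bad feature takes seconds rather than a full redeployment.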
All other proven practices are in this doc.
You can use this list when you have an architecture design review with your customers. Give us feedback.
Resiliency / High Availability / Disaster Recovery
- Throttling
- Circuit breaker
- Zero-downtime deployment
- Eventual consistency
- Data restore
- Retry
- Graceful degradation
- Geo-replication
- Multi-region deployment