advanced topics - session 4 - architecting for high availability

Ianni Vamvadelis, Solution Architect

Architecting for high availability

What is High Availability (HA)?

• Percentage of time an application operates

• Loss of availability is known as an outage or downtime

– Planned and unplanned

– App is offline, unreachable, or partially available

– App is unresponsive
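Since availability is defined above as a percentage of time, a target can be translated into a downtime budget. The sketch below is illustrative only: the function names and the example targets (99%, 99.9%, 99.99%) are assumptions, not figures from the slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_pct(uptime_hours: float, total_hours: float) -> float:
    """Availability as the percentage of time the application operated."""
    return 100.0 * uptime_hours / total_hours


def downtime_hours_per_year(availability: float) -> float:
    """Hours of downtime per year that an availability percentage allows."""
    return HOURS_PER_YEAR * (1.0 - availability / 100.0)


# Example targets are assumptions chosen for illustration.
for target in (99.0, 99.9, 99.99):
    hours = downtime_hours_per_year(target)
    print(f"{target}% available -> {hours:.2f} hours of downtime per year")
```

Planned and unplanned outages both count against the same budget.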

HA is related to …

• Scalability

– Often slow is indistinguishable from unavailable.

• Fault Tolerance

– Apps continue functioning when components fail

• Disaster Recovery

– Restoring service after a catastrophic event

HA and DR

• A continuum

• Business continuity plan

• Not an all-or-nothing proposition

In the face of internal or external events, how do you…

– Keep your applications running 24x7

– Make sure your data is safe

– Get an application recovered after a major disaster

High Availability Disaster Recovery

How does AWS help with High Availability?

[Map of AWS regions: US-EAST (Virginia), US-WEST (Oregon), US-WEST (N. California), AWS GovCloud (US), SOUTH AMERICA (Sao Paulo), EU-WEST (Ireland), ASIA PAC (Tokyo), ASIA PAC (Singapore), ASIA PAC (Sydney)]

Automation

AWS SERVICES

Inherently Highly Available and Fault Tolerant Services

• Amazon S3

• Amazon DynamoDB

• Amazon CloudFront

• Amazon Route 53

• Elastic Load Balancing

• Amazon SQS

• Amazon SNS

• Amazon SES

• Amazon SWF

Highly Available with the right architecture

• Amazon EC2

• Amazon EBS

• Amazon RDS

• Amazon VPC

Principles for HA

1. DESIGN FOR FAILURE

2. MULTIPLE AVAILABILITY ZONES

3. SCALING

4. SELF-HEALING

5. LOOSE COUPLING

LET’S BUILD A HIGHLY AVAILABLE SYSTEM

#1 DESIGN FOR FAILURE

●○○○○

« Everything fails all the time »

Werner Vogels

CTO of Amazon

AVOID SINGLE POINTS OF FAILURE

ASSUME EVERYTHING FAILS, AND WORK BACKWARDS

YOUR GOAL

Applications should continue to function

AMAZON EBS: ELASTIC BLOCK STORE

AMAZON ELB: ELASTIC LOAD BALANCING

HEALTH CHECKS
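A load balancer health check typically changes an instance's state only after several consecutive results disagree with the current state. The sketch below simulates that idea in plain Python; the class name and the threshold values are assumptions for illustration and do not call any AWS API.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one instance's health from a stream of check results."""

    unhealthy_threshold: int = 2  # consecutive failures before marking unhealthy
    healthy_threshold: int = 3    # consecutive successes before marking healthy again
    healthy: bool = True
    _streak: int = 0              # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the resulting health state."""
        if check_passed == self.healthy:
            # Result agrees with the current state, so any opposing streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            limit = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= limit:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
for result in (False, False, True, True, True):
    state = "in service" if tracker.record(result) else "out of service"
    print(f"check passed={result} -> {state}")
```

Requiring a streak in both directions avoids taking an instance out of rotation, or putting it back, on a single noisy result.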

#2 MULTIPLE AVAILABILITY ZONES

●●○○○

AMAZON RDS MULTI-AZ

AMAZON ELB AND MULTIPLE AZs

#3 SCALING

●●●○○

AUTO SCALING: SCALE UP/DOWN EC2 CAPACITY

#4 SELF-HEALING

●●●●○

HEALTH CHECKS + AUTO SCALING = SELF-HEALING
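Combining health checks with Auto Scaling amounts to a reconciliation loop: drop instances that fail their checks, then launch replacements until the desired capacity is met. A minimal sketch, assuming a hypothetical `new_id` callback that stands in for launching an instance:

```python
from itertools import count
from typing import Callable, Dict


def self_heal(
    instances: Dict[str, bool],
    desired_capacity: int,
    new_id: Callable[[], str],
) -> Dict[str, bool]:
    """Return the fleet after replacing unhealthy instances.

    `instances` maps an instance id to its latest health check result.
    `new_id` is a hypothetical stand-in for launching a fresh instance.
    """
    # Keep only instances that passed their health check.
    fleet = {iid: True for iid, ok in instances.items() if ok}
    # Launch replacements until the group is back at its desired capacity.
    while len(fleet) < desired_capacity:
        fleet[new_id()] = True
    return fleet


ids = count(1)
before = {"i-a": True, "i-b": False, "i-c": True}
after = self_heal(before, desired_capacity=3, new_id=lambda: f"i-new{next(ids)}")
print(after)
```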

DEGRADED MODE

AMAZON S3 STATIC WEBSITE

+ AMAZON ROUTE 53

WEIGHTED RESOLUTION
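Weighted resolution sends each lookup to a record with probability proportional to its weight, so shifting the weights moves traffic from the application to a static fallback site. The sketch below only simulates that selection locally; the endpoint names and weights are assumptions, and no Route 53 API is called.

```python
import random
from typing import Dict


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Expected fraction of lookups answered with each record."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_endpoint(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick one record at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical record names: the live application and a static S3 website.
normal = {"app-load-balancer": 100, "static-s3-site": 0}
degraded = {"app-load-balancer": 0, "static-s3-site": 100}

rng = random.Random(0)
print(traffic_share(normal))
print(pick_endpoint(degraded, rng))
```

In degraded mode every lookup resolves to the static site, which keeps a reduced version of the application reachable.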

#5 LOOSE COUPLING

●●●●●

BUILD LOOSELY COUPLED SYSTEMS

The more loosely they are coupled, the bigger they scale and the more fault tolerant they get.

AMAZON SQS: SIMPLE QUEUE SERVICE

[Diagram: RECEIVE, TRANSCODE, PUBLISH & NOTIFY]

VISIBILITY TIMEOUT
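A visibility timeout hides a received message from other consumers for a fixed period; if the worker does not delete it in time (for example because it crashed), the message becomes visible again and another worker can pick it up. The in-memory queue below is a simplified sketch of that behaviour, not the SQS API: the class, its methods and the explicit `now` clock are assumptions made to keep the example deterministic.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryQueue:
    """Toy queue that mimics a visibility timeout."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[int, List] = {}  # id -> [body, invisible_until]
        self._next_id = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = [body, 0.0]
        return self._next_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return the first visible message and hide it for the timeout period."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return msg_id, entry[0]
        return None

    def delete(self, msg_id: int) -> None:
        """Called by a worker once processing has succeeded."""
        self._messages.pop(msg_id, None)


queue = InMemoryQueue(visibility_timeout=30)
queue.send("transcode video-1")
print(queue.receive(now=0))    # worker A receives the message, then crashes
print(queue.receive(now=10))   # still hidden from worker B
print(queue.receive(now=30))   # timeout expired, worker B receives it
```

Because the message is only removed on an explicit delete, a failed worker costs a delay rather than a lost job.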

BUFFERING

CLOUDWATCH METRICS FOR AMAZON SQS

+ AUTO SCALING
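Queue depth is a natural scaling signal: the backlog buffers bursts, and the worker fleet grows or shrinks to drain it. The sketch below shows one possible sizing rule, assuming the depth comes from a metric such as the SQS `ApproximateNumberOfMessagesVisible` CloudWatch metric; the function name, the per-worker backlog figure and the fleet bounds are illustrative assumptions.

```python
import math


def desired_workers(
    queue_depth: int,
    messages_per_worker: int,
    min_workers: int,
    max_workers: int,
) -> int:
    """Size the worker fleet so each worker handles a bounded backlog."""
    needed = math.ceil(queue_depth / messages_per_worker)
    # Clamp to the group's minimum and maximum size.
    return max(min_workers, min(max_workers, needed))


for depth in (0, 250, 5000):
    workers = desired_workers(depth, messages_per_worker=100, min_workers=1, max_workers=10)
    print(f"queue depth {depth} -> {workers} workers")
```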

3. SCALING

4. SELF-HEALING

5. LOOSE COUPLING

YOUR GOAL

Applications should continue to function

IT’S ALL ABOUT CHOICE

BALANCE COST & HIGH AVAILABILITY


Summary

Leverage AWS Services

Apply 5 principles for HA

Automate

Test your HA implementation


aws.amazon.com/architecture

JUST EAT: HIGH AVAILABILITY WITH AWS

JUST EAT

13 countries

34,000+ restaurants

8m+ members

Over 50m orders

16,000+ restaurants in UK, 8m visits a month

PLATFORM

Devices in restaurants

Consumer Website

Public API

Order API Ratings API Search API …

Restaurant Services

SQL Server Networking Monitoring

Customer Care Tools

Emails

Common Infrastructure

Apps and External Services

DESIGN FOR FAILURE

[Diagram: Web Service and Device Service, each in an Auto Scaling group spanning eu-west-1a, eu-west-1b and eu-west-1c; also shown: JCT Service, Orders queue, Orders data, Devices in restaurants]

SCALING – PROACTIVE

Web servers in data center

SCALING – PROACTIVE

Web EC2 instances

SCALING – REACTIVE

Web EC2 instances

EVERYTHING MULTI AZ – CONSUMER WEBSITE

Auto scaling Group

eu-west-1a eu-west-1b eu-west-1c

Monitor to keep resource usage at a maximum of 66% of capacity in each AZ when everything’s available.

All three AZs available: 66% 66% 66%

One AZ lost: 99% 99%
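The 66% figure follows from simple arithmetic: when one of three AZs is lost, its load is spread over the remaining two, so each survivor carries 3/2 of its previous load. A small sketch of that calculation (the function names are assumptions):

```python
def load_after_az_loss(per_az_utilization_pct: float, total_azs: int, lost_azs: int = 1) -> float:
    """Utilization of each surviving AZ once the lost AZs' load is redistributed."""
    remaining = total_azs - lost_azs
    if remaining <= 0:
        raise ValueError("at least one AZ must remain")
    return per_az_utilization_pct * total_azs / remaining


def max_safe_utilization_pct(total_azs: int, lost_azs: int = 1) -> float:
    """Highest per-AZ utilization that still fits after losing `lost_azs` AZs."""
    return 100.0 * (total_azs - lost_azs) / total_azs


print(load_after_az_loss(66, total_azs=3))          # load per AZ after losing one of three
print(round(max_safe_utilization_pct(3), 1))        # ceiling for a three-AZ deployment
```

Running at 66% per AZ therefore leaves just enough headroom for the two remaining AZs to absorb the full load.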


EVERYTHING MULTI AZ – INTERNAL APIS

Auto scaling Group

Alarms tell us that performance has been degraded, but the platform will self-heal as new instances are launched.

Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or of instances and will just degrade gracefully.

All three AZs available: 80% 80% 80%

One AZ lost: 100% 100%
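Assuming that an internal API may fail or run slowly usually means calling it with a timeout and falling back to a reduced response. The sketch below illustrates the pattern; `fetch_ratings` is a hypothetical client function (not Just Eat's actual code) that accepts a timeout and raises `TimeoutError` or `ConnectionError` on failure.

```python
from typing import Any, Callable, Dict


def with_fallback(call: Callable[[float], Any], fallback_value: Any, timeout_seconds: float = 0.5) -> Any:
    """Run `call(timeout)` and return `fallback_value` if it times out or cannot connect."""
    try:
        return call(timeout_seconds)
    except (TimeoutError, ConnectionError):
        return fallback_value


def restaurant_page(restaurant_id: int, fetch_ratings: Callable[[int, float], Any]) -> Dict[str, Any]:
    """Build a page that still renders, without ratings, if the ratings call fails."""
    ratings = with_fallback(lambda timeout: fetch_ratings(restaurant_id, timeout), fallback_value=None)
    return {"restaurant_id": restaurant_id, "ratings": ratings, "degraded": ratings is None}


def slow_ratings_api(restaurant_id: int, timeout: float) -> Any:
    # Hypothetical failing dependency used only to exercise the fallback path.
    raise TimeoutError(f"ratings lookup for {restaurant_id} exceeded {timeout}s")


print(restaurant_page(42, slow_ratings_api))
```

The caller stays responsive and marks the response as degraded instead of failing the whole request.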


EVERYTHING MULTI AZ – SQL SERVER 2012

Connection strings simply contain both primary and secondary servers; no code changes required.

[Diagram: Primary, Witness, Secondary]

Alarms tell us that failover has occurred, but it happens without manual intervention.
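As an illustration of a connection string that names both servers, the sketch below builds one in the ADO.NET style, where `Failover Partner` identifies the mirror server. This assumes the setup is SQL Server database mirroring (suggested by the primary, witness and secondary roles); the helper name, server names and database name are hypothetical.

```python
def mirrored_connection_string(primary: str, secondary: str, database: str) -> str:
    """Build an ADO.NET-style connection string listing both mirroring partners."""
    parts = {
        "Data Source": primary,          # server tried first
        "Failover Partner": secondary,   # server tried if the primary is unavailable
        "Initial Catalog": database,
        "Integrated Security": "True",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


# Hypothetical server and database names.
print(mirrored_connection_string("sql-primary", "sql-secondary", "orders"))
```

Because the client library handles the switch to the partner, the application code itself does not need to change when a failover occurs.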

DANIEL RICHARDSON

DIRECTOR OF ENGINEERING, JUST EAT

daniel.richardson@just-eat.com

www.just-eat.com/jobs

twitter.com/JustEatUK

www.facebook.com/justeat
