Advanced Topics - Session 4 - Architecting for High Availability

Post on 10-Jul-2015

Category:

Technology

DESCRIPTION

AWS provides a platform that is ideally suited for building highly available systems, enabling you to build reliable, affordable, fault-tolerant systems that operate with a minimal amount of human interaction. This presentation covers many of the high-availability and fault-tolerance concepts and features of the various services that you can use to build highly reliable and highly available applications in the AWS Cloud:

• architectures involving multiple Availability Zones, including EC2 best practices and RDS Multi-AZ deployments;
• loosely coupled and self-healing systems involving SQS and Auto Scaling;
• networking best practices for high availability, including Elastic IP addresses, load balancing, and DNS;
• leveraging services that are inherently built with high availability and fault tolerance in mind, including S3, Elastic Beanstalk and more.

Ianni Vamvadelis, Manager, Solution Architecture, AWS
Daniel Richardson, Director of Engineering, JustEat

TRANSCRIPT

Ianni Vamvadelis, Solution Architect

Architecting for high availability

What is High Availability (HA)?

• Percentage of time an application operates

• Loss of availability is known as an outage or downtime

– Planned and unplanned

– App is offline, unreachable, or partially available

– App is unresponsive
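
Since availability is defined above as the percentage of time an application operates, the arithmetic can be sketched in a few lines. This is only an illustration: the function names and the 365-day window are assumptions, not something from the slides.

```python
def availability_percent(uptime_seconds: float, downtime_seconds: float) -> float:
    """Share of the observation window during which the app was operating."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be positive")
    return 100.0 * uptime_seconds / total


def allowed_downtime_minutes(target_percent: float, window_days: float = 365.0) -> float:
    """Downtime budget (planned plus unplanned) implied by an availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    # A 99.9% target over an assumed 365-day window.
    print(round(allowed_downtime_minutes(99.9), 1), "minutes per year")
```

The same budget has to cover both planned and unplanned outages, which is why the two are listed together above.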

HA is related to …

• Scalability

– Often slow is indistinguishable from unavailable.

• Fault Tolerance

– Apps continue functioning when components fail

• Disaster Recovery

– Restoring service after a catastrophic event

HA and DR

• A continuum

• Business continuity plan

• Not an all-or-nothing proposition

In the face of internal or external events, how do you…

– Keep your applications running 24x7

– Make sure your data is safe

– Get an application recovered after a major disaster

High Availability Disaster Recovery

How does AWS Help with High Availability?

• US-WEST (Oregon)
• EU-WEST (Ireland)
• ASIA PAC (Tokyo)
• ASIA PAC (Singapore)
• US-WEST (N. California)
• SOUTH AMERICA (Sao Paulo)
• US-EAST (Virginia)
• AWS GovCloud (US)
• ASIA PAC (Sydney)

Automation

AWS SERVICES

Inherently Highly Available and Fault Tolerant Services

• Amazon S3
• Amazon DynamoDB
• Amazon CloudFront
• Amazon Route 53
• Elastic Load Balancing
• Amazon SQS
• Amazon SNS
• Amazon SES
• Amazon SWF

Highly Available with the right architecture

• Amazon EC2
• Amazon EBS
• Amazon RDS
• Amazon VPC

AWS

Principles for HA

1. DESIGN FOR FAILURE

2. MULTIPLE AVAILABILITY ZONES

3. SCALING

4. SELF-HEALING

5. LOOSE COUPLING

LET’S BUILD A

HIGHLY AVAILABLE SYSTEM

#1 DESIGN FOR FAILURE

●○○○○

« Everything fails all the time »

Werner Vogels

CTO of Amazon

AVOID SINGLE POINTS OF FAILURE

ASSUME EVERYTHING FAILS,

AND WORK BACKWARDS

YOUR GOAL

Applications should continue to function

AMAZON EBS ELASTIC BLOCK STORE

AMAZON ELB ELASTIC LOAD BALANCING

HEALTH CHECKS

#2 MULTIPLE

AVAILABILITY ZONES ●●○○○

AMAZON RDS

MULTI-AZ

AMAZON ELB AND

MULTIPLE AZs

#3 SCALING

●●●○○

AUTO SCALING SCALE UP/DOWN EC2 CAPACITY

#4 SELF-HEALING

●●●●○

HEALTH CHECKS + AUTO SCALING = SELF-HEALING
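
The combination of health checks and Auto Scaling can be pictured as a reconciliation loop: drop whatever fails its health check, then launch replacements until the desired capacity is restored. The sketch below is an in-memory simulation only. `Instance`, `ScalingGroup` and `reconcile` are hypothetical names, and it does not call any AWS API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True  # would be set by a health check in a real system


@dataclass
class ScalingGroup:
    desired_capacity: int
    instances: List[Instance] = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def launch(self) -> Instance:
        """Add one new (healthy) instance to the group."""
        inst = Instance(f"i-{next(self._ids):04d}")
        self.instances.append(inst)
        return inst

    def reconcile(self) -> int:
        """Remove unhealthy instances, then top up to desired capacity.

        Returns the number of unhealthy instances that were removed.
        """
        before = len(self.instances)
        self.instances = [i for i in self.instances if i.healthy]
        removed = before - len(self.instances)
        while len(self.instances) < self.desired_capacity:
            self.launch()
        return removed


if __name__ == "__main__":
    group = ScalingGroup(desired_capacity=3)
    group.reconcile()                   # initial launch of three instances
    group.instances[1].healthy = False  # simulate a failed health check
    print("replaced:", group.reconcile())
    print([i.instance_id for i in group.instances])
```

No operator is involved at any point in the loop, which is the sense in which the system heals itself.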

DEGRADED MODE

AMAZON S3 STATIC WEBSITE

+ AMAZON ROUTE 53

WEIGHTED RESOLUTION
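
One way to picture weighted resolution for a degraded mode: each DNS record carries a weight, and shifting the weights moves traffic from the main site to a static fallback. The sketch below is a simplified model, not Route 53's actual algorithm. The record names `main-site` and `static-fallback` and the helper `resolve_weighted` are assumptions made for illustration.

```python
import random
from typing import Dict, Optional


def resolve_weighted(records: Dict[str, int],
                     rng: Optional[random.Random] = None) -> str:
    """Pick one target with probability proportional to its weight.

    Targets with weight 0 are never returned. Raising when every weight
    is 0 is a simplification chosen for this sketch.
    """
    rng = rng or random.Random()
    candidates = {target: w for target, w in records.items() if w > 0}
    if not candidates:
        raise ValueError("no record has a positive weight")
    targets = list(candidates)
    weights = [candidates[t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


if __name__ == "__main__":
    normal = {"main-site": 100, "static-fallback": 0}
    degraded = {"main-site": 0, "static-fallback": 100}
    print(resolve_weighted(normal))    # always the main site
    print(resolve_weighted(degraded))  # always the static fallback
```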

#5 LOOSE

COUPLING ●●●●●

BUILD LOOSELY COUPLED SYSTEMS

The looser they are coupled, the bigger they scale,

the more fault tolerant they get…

AMAZON SQS SIMPLE QUEUE SERVICE

RECEIVE

TRANSCODE

PUBLISH & NOTIFY

VISIBILITY TIMEOUT
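
The idea behind a visibility timeout is that a received message is hidden from other consumers for a while rather than removed. If the worker deletes it in time, it is gone. If the worker dies, the message reappears and another worker picks it up. The following is a toy in-memory model of that behaviour with an explicit clock. `TinyQueue` is a hypothetical class and does not reflect the SQS API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class _Message:
    body: str
    invisible_until: float = 0.0  # time before which the message is hidden


class TinyQueue:
    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[int, _Message] = {}
        self._next_id = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = _Message(body)
        return self._next_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return the first visible message and hide it for the timeout."""
        for msg_id, msg in self._messages.items():
            if msg.invisible_until <= now:
                msg.invisible_until = now + self.visibility_timeout
                return msg_id, msg.body
        return None

    def delete(self, msg_id: int) -> None:
        """Called by a worker once processing has succeeded."""
        self._messages.pop(msg_id, None)


if __name__ == "__main__":
    q = TinyQueue(visibility_timeout=30)
    q.send("video-1")
    print(q.receive(now=0))   # worker A takes the message
    print(q.receive(now=10))  # hidden while A is working
    print(q.receive(now=30))  # A never deleted it, so it is visible again
```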

BUFFERING

CLOUDWATCH METRICS FOR AMAZON SQS

+ AUTO SCALING
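
Driving Auto Scaling from queue metrics comes down to sizing the worker fleet from the backlog. A minimal sketch of that decision follows, assuming the backlog figure comes from a queue-depth metric such as SQS's `ApproximateNumberOfMessagesVisible`. The function name, the per-worker throughput and the min/max bounds are all illustrative assumptions.

```python
import math


def desired_workers(visible_messages: int,
                    messages_per_worker: int,
                    min_workers: int = 1,
                    max_workers: int = 10) -> int:
    """Worker count needed to drain the backlog, clamped to group limits."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    needed = math.ceil(visible_messages / messages_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    for backlog in (0, 250, 5000):
        print(backlog, "->", desired_workers(backlog, messages_per_worker=100))
```

Because the queue buffers work, the fleet can lag behind a spike without losing messages, which is what makes this kind of scaling safe.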

1. DESIGN FOR FAILURE

2. MULTIPLE AVAILABILITY ZONES

3. SCALING

4. SELF-HEALING

5. LOOSE COUPLING

YOUR GOAL

Applications should continue to function

IT’S ALL ABOUT CHOICE

BALANCE COST & HIGH AVAILABILITY

Summary

Leverage AWS Services

Apply 5 principles for HA

Automate

Test your HA implementation

aws.amazon.com/architecture

JUST EAT HIGH AVAILABILITY WITH AWS

JUST EAT

13 countries

34,000+ restaurants

8m+ members

Over 50m orders

16,000+ restaurants in UK, 8m visits a month

PLATFORM Devices in restaurants

Consumer Website

Public API

Order API Ratings API Search API …

Restaurant Services

SQL Server Networking Monitoring

Customer Care Tools

Emails

Common Infrastructure

Apps and External Services

APIs

DESIGN FOR FAILURE

[Diagram labels: Devices in restaurants; Device Service (Auto Scaling group across eu-west-1a, eu-west-1b, eu-west-1c); Web Service (Auto Scaling group across eu-west-1a, eu-west-1b, eu-west-1c); JCT Service; Orders queue; Orders data]

SCALING – PROACTIVE

[Chart legend: Web servers in data center, Web EC2 instances]

SCALING – REACTIVE

[Chart legend: Web servers in data center, Web EC2 instances]

EVERYTHING MULTI AZ – CONSUMER WEBSITE

Auto scaling Group

eu-west-1a eu-west-1b eu-west-1c

Monitor to keep resource usage at a maximum of 66% of capacity in each AZ when everything’s available.

66% 66% 66% (all three AZs available) → 99% 99% (one AZ lost)
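
The 66% figure follows from simple arithmetic: if three equally sized AZs each run at 66% and one is lost, the same load lands on two AZs at about 99%. A small sketch of that calculation, assuming load is spread evenly and AZs have equal capacity (the function names are illustrative):

```python
def utilisation_after_az_loss(per_az_utilisation: float,
                              total_azs: int,
                              lost_azs: int = 1) -> float:
    """Per-AZ utilisation once the same load is squeezed into fewer AZs."""
    surviving = total_azs - lost_azs
    if surviving <= 0:
        raise ValueError("at least one AZ must survive")
    return per_az_utilisation * total_azs / surviving


def max_safe_utilisation(total_azs: int, lost_azs: int = 1) -> float:
    """Highest steady-state utilisation that still fits after losing AZs."""
    surviving = total_azs - lost_azs
    if surviving <= 0:
        raise ValueError("at least one AZ must survive")
    return surviving / total_azs


if __name__ == "__main__":
    print(round(utilisation_after_az_loss(0.66, total_azs=3), 2))
    print(round(max_safe_utilisation(total_azs=3), 3))
```

With three AZs the ceiling is two thirds, so running at 66% leaves just enough headroom to absorb the loss of one zone.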

EVERYTHING MULTI AZ – INTERNAL APIS

Auto scaling Group

eu-west-1a eu-west-1b eu-west-1c

Alarms tell us that performance has been degraded, but the platform will self-heal as new instances are launched.

Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or of instances and will just degrade gracefully.

80% 80% 80% (all three AZs available) → 100% 100% (one AZ lost)
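
Assuming that an internal API may fail or run slowly usually means wrapping each call with a timeout and a fallback value, so the caller degrades instead of failing. The following is a minimal sketch of that pattern, not Just Eat's implementation. `call_with_fallback`, `slow_ratings_api` and `broken_ratings_api` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T],
                       fallback: T,
                       timeout_seconds: float) -> T:
    """Return primary() if it answers in time, otherwise the fallback.

    Both errors and timeouts lead to the fallback. A call that times out
    is not cancelled; it simply finishes in the background.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:  # covers timeouts and errors raised by primary()
        return fallback
    finally:
        pool.shutdown(wait=False)


def slow_ratings_api() -> list:
    """Stand-in for an internal API that is running slowly."""
    time.sleep(0.2)
    return [5, 4, 5]


def broken_ratings_api() -> list:
    """Stand-in for an internal API that is failing."""
    raise ConnectionError("ratings service unavailable")


if __name__ == "__main__":
    # The page still renders, just without ratings.
    print(call_with_fallback(slow_ratings_api, [], timeout_seconds=0.05))
    print(call_with_fallback(broken_ratings_api, [], timeout_seconds=0.05))
```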

EVERYTHING MULTI AZ – SQL SERVER 2012

eu-west-1a eu-west-1b eu-west-1c

Connection strings simply contain both primary and secondary servers – no code changes required.

Primary, Witness, Secondary

Alarms tell us that failover has occurred, but it happens without manual intervention.
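
A connection string that names both servers might look like the one built below. This is a sketch under assumptions: the slides do not say which SQL Server feature is used, so it assumes database mirroring and the `Failover Partner` keyword of the .NET SqlClient connection string. The server and database names are made up.

```python
def mirrored_connection_string(primary: str, secondary: str, database: str) -> str:
    """Build a SqlClient-style connection string listing both servers.

    Assumes database mirroring: the client tries the primary first and
    falls back to the failover partner without any change to the code.
    """
    return (
        f"Data Source={primary};"
        f"Failover Partner={secondary};"
        f"Initial Catalog={database};"
        "Integrated Security=True"
    )


if __name__ == "__main__":
    # Hypothetical host names, one per Availability Zone.
    print(mirrored_connection_string("sql-a.example.internal",
                                     "sql-b.example.internal",
                                     "Orders"))
```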

DANIEL RICHARDSON

DIRECTOR OF ENGINEERING, JUST EAT

daniel.richardson@just-eat.com

www.just-eat.com/jobs

twitter.com/JustEatUK

www.facebook.com/justeat
