©Continuent 2012.
Surviving An Amazon Outage
Neil Armitage, Cluster Implementation Engineer, Continuent
Wednesday, 24 April 13
Overview
• Continuent’s external/internal infrastructure is built in AWS
• Review carried out in the Summer of 2012 after several AWS Outages
• Treated the review as a Customer engagement
• Further review in Autumn of 2012 leading to the Multi-Cloud deployment
What is AWS
Amazon Web Services is a collection of remote computing services (also called web services)
that together make up a cloud computing platform.
The central services are EC2 (compute) and S3 (storage).
AWS Regions
• Ireland (3 AZ)
• Sao Paulo (2 AZ)
• Northern Virginia (5 AZ)
• Oregon (3 AZ)
• California (3 AZ)
• Singapore (2 AZ)
• Tokyo (3 AZ)
• Sydney (2 AZ)
AWS Availability Zones
[Diagram: two Regions, one containing three Availability Zones and one containing two]
AWS Services
• Compute - EC2
• Network - Route 53 and Virtual Private Cloud (VPC)
• Content Delivery - CloudFront
• Storage - S3, Glacier, EBS
• Database - DynamoDB, RDS, Redshift, SimpleDB
• Deployment - CloudFormation, Beanstalk, OpsWorks
AWS Size*
• Between 100K and 500K physical servers
• 1.5 million public IP addresses
• S3 holds > 2 trillion objects - 1.1 million requests per second
• 1/3 of daily users access a site running on AWS
• 1% of internet traffic goes through Amazon infrastructure
* Estimates based on various internet sources
Continuent Systems
• External facing website
• Jira/Confluence internal systems
• Subversion
• Jenkins build system
External Website
[Diagram: Internet, Elastic IP, Web Server and DB Server; one Availability Zone in one Region]
Jira/Confluence/Subversion
[Diagram labels: Internet, Elastic IP, App Server, Jira, Confluence, SVN Server, MySQL; one Availability Zone in one Region]
AWS Problems Summer 2012
“Amazon Cloud Hit by Real Clouds, Downing Netflix, Instagram, Other Sites”
Severe storms caused power outages at AWS US-East data centers; generators failed, taking out 7% of EC2 instances.
http://www.pcworld.com/article/258627/amazon_cloud_hit_by_real_clouds_knocking_out_popular_sites_like_netflix_instagram.html
Migration Plan
• Move to a clustered Continuent Tungsten environment
• Ensure all components are replicated into at least one other AWS Region
• Limited downtime on Customer facing systems
• Minimal downtime on internal systems
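The second goal above (every component replicated into at least one other Region) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical inventory of `component region` pairs; the component names and the inventory format are illustrative and do not come from the talk. It lists every component that runs in fewer than two distinct Regions.

```shell
#!/bin/sh
# Hypothetical inventory: one "component region" pair per line.
# Names and layout are assumptions for illustration only.
inventory() {
cat <<'EOF'
website-db us-east-1
website-db us-west-1
website-web us-east-1
website-web us-west-1
jira us-east-1
EOF
}

# Print each component that appears in fewer than two distinct regions.
single_region_components() {
    sort -u | awk '
        { count[$1]++ }   # input is de-duplicated, so this counts distinct regions
        END { for (c in count) if (count[c] < 2) print c }
    ' | sort
}

inventory | single_region_components
```

With the sample inventory, only `jira` is reported, because it has no second Region.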
[Diagram: Tungsten data service "nyc" with one Master and two Slaves, each with a Replicator and a Manager; two application hosts (App Logic + Tungsten Connector); arrows labelled "Monitoring and control"]
Website Database Tier - Round 1
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Connectors]
DB Failures - Failure in US-EAST-1C
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Connectors]
DB Failures - Failure in US-EAST
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Connectors]
DEMO
Website Web Tier - Round 1
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Internet; EIP]
Web Failures - Failure in US-EAST-1C
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Internet; EIP]
Web Failures - Failure in US-EAST
[Diagram: US-EAST-1 (Availability Zones 1B and 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Internet; EIP; DNS Update]
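The region-failure case on this slide ends with a DNS update. Here is a minimal sketch of the decision behind that step, assuming hypothetical health-check results passed in as `up`/`down` strings. The hostnames are placeholders, and the DNS change itself is left as an `echo` because the talk does not say how it was made.

```shell
#!/bin/sh
# Hypothetical endpoints; the talk does not name the real hosts.
EAST_HOST="web.us-east-1.example.com"
WEST_HOST="web.us-west-1.example.com"

# choose_target EAST_STATUS WEST_STATUS
# Prints the host DNS should point at, preferring US-EAST when it is healthy.
# Returns 1 and prints nothing when neither region is up.
choose_target() {
    east_status=$1
    west_status=$2
    if [ "$east_status" = "up" ]; then
        echo "$EAST_HOST"
    elif [ "$west_status" = "up" ]; then
        echo "$WEST_HOST"
    else
        return 1
    fi
}

# Example: US-EAST is down, so DNS should move to US-WEST.
target=$(choose_target down up) && echo "point www at $target"
```

Keeping the decision in a pure function makes it easy to test separately from whichever DNS provider performs the update.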
Jira/Confluence/SVN - Round 1
[Diagram: US-EAST-1 (Availability Zone 1C) and US-WEST-1 (Availability Zone 1C); S3 Backups; Internet; EIP]
AWS Failures - Autumn 2012
“Amazon Web Services outage takes out popular websites again”
• EBS degraded performance
• Problems allocating new volumes
http://www.pcworld.com/article/2012852/amazon-web-services-outage-takes-out-popular-websites-again.html
Website Database Tier - Round 2
[Diagram: US-EAST-1 (Availability Zones 1B and 1C), US-WEST-1 (Availability Zone 1C) and Rackspace; S3 Backups]
Website Web Tier - Round 2
[Diagram: US-EAST-1 (Availability Zones 1B and 1C), US-WEST-1 (Availability Zone 1C) and Rackspace; S3 Backups; Internet; EIP]
Jira/Confluence/SVN - Round 2
[Diagram: US-EAST-1 (Availability Zone 1C), US-WEST-1 (Availability Zone 1C) and Rackspace; S3 Backups; Internet; EIP]
Best Practices
• RAID EBS Volumes (RAID1)
• Backups
  • xtrabackup (backed up into S3)
  • EBS Snapshot
ec2-consistent-snapshot \
  --mysql \
  --freeze-filesystem /vol \
  --region eu-west-1 \
  --description "$(hostname) RAID snapshot $(date +'%Y-%m-%d %H:%M:%S')" \
  vol-1f9a6446 vol-649a643d
Best Practices
• Monitoring
• Nagios scripts converted to email alerts
• New Relic
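Turning Nagios-style checks into email alerts mostly comes down to mapping the plugin exit code to a status and formatting a message. Below is a minimal sketch of that step, assuming the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The host name, check name and message layout are hypothetical, and the mailer is left out so the example runs without one.

```shell
#!/bin/sh
# Map a Nagios plugin exit code to its status name.
# Codes outside 0-2 are treated as UNKNOWN (3 is the standard UNKNOWN code).
nagios_status() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# alert_message HOST CHECK_NAME EXIT_CODE OUTPUT
# Prints an email-style alert for any non-OK result; prints nothing for OK.
alert_message() {
    status=$(nagios_status "$3")
    [ "$status" = "OK" ] && return 0
    printf 'Subject: [%s] %s on %s\n\n%s\n' "$status" "$2" "$1" "$4"
}

# Example with a stand-in result; in practice the exit code and output would
# come from running a real plugin, and the message would be piped to a mailer.
alert_message db1.example.com check_mysql 2 "CRITICAL - cannot connect"
```

Because OK results print nothing, the output can be piped to a mailer only when it is non-empty.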
Lessons Learnt
• EC2 Instances fail
• One of anything is never enough
• Don’t assume you can spin up more resources instantly
• Think multi-cloud, public/private
• Resources are disposable - throw away and rebuild if needed
Further Plans
• Real-time replication of web assets (GlusterFS?)
• Introduce an Elastic Load Balancer in front of the US-EAST web servers to allow automatic web failover
• Migrate into a VPC
• Investigate Route 53 for DNS Failover
We are Recruiting
Come to our booth for more information
Continuent Website: http://www.continuent.com
Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator
Our Blogs:
http://scale-out-blog.blogspot.com
http://datacharmer.blogspot.com
http://flyingclusters.blogspot.com
560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642, Fax +1 (408) 668-1009
e-mail: [email protected]