SECONDSITE: DISASTER TOLERANCE AS A SERVICE
Shriram Rajagopalan
Brendan Cully
Ryan O’Connor
Andrew Warfield
FAILURES IN A DATACENTER
TOLERATING FAILURES IN A DATACENTER
Initial idea behind Remus was to tolerate Datacenter level failures.
REMUS
CAN A WHOLE DATACENTER FAIL ?
Yes! It’s a “Disaster”!
DISASTERS
Illustrative Image courtesy of TangoPango, Flickr.
“Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.”
- Om Malik, GigaOM
“Truck driver in Texas kills all the websites you really use”
…Southlake FD found that he had low blood sugar
- valleywag.com
DISASTERS..
Water-main break cripples Dallas County computers, operations
The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays - keeping some prisoners in jail longer than normal.
- Dallas Morning News, Jun 2010
DISASTERS..
MORE FODDER BACK HOME
“An explosion … near our server bank … electrical box containing 580 fiber cables.
electrical box … was covered in asbestos … mandated the wearing of hazmat suits ....
Worse yet, the dynamic rerouting —which is the hallmark of the internet … did not function.
In other words, the perfect storm. Oh well. S*it happens.”
- Dan Empfield, Slowtwitch.com - a Gossamer Threads customer.
DISASTER RECOVERY – THE OLD FASHIONED WAY
Storage replication between a primary and backup site.
Manually restore physical servers from backup images.
Data Loss and Long Outage periods.
Expensive Hardware – Storage Arrays, Replicators, etc.
STATE OF THE ART DISASTER RECOVERY
[Figure: a Protected Site and a Recovery Site, each running VirtualCenter with Site Recovery Manager; Datastore Groups at the two sites are linked by Array Replication, and one Datastore Groups box is marked with an X. VM state labels: VMs online in Protected Site; VMs become unavailable; VMs offline; VMs powered on.]
Source: VMware Site Recovery Manager – Technical Overview
PROBLEMS WITH EXISTING SOLUTIONS
Data Loss & Service Disruption (RPO ~15min, RTO ~few hours)
Complicated Recovery Planning (e.g. service A needs to be up before B, etc.)
Application Level Recovery
Bottom Line: Current State of DR is complicated, expensive, and not suitable for a general purpose cloud-level offering.
DISASTER TOLERANCE AS A SERVICE ?
Our Vision
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
PRIMARY & BACKUP SITES
5ms RTT
FAILOVER & FAILBACK WITHOUT OUTAGE
Primary Site: Vancouver; Backup Site: Kamloops
Primary Site: Vancouver; Primary Site: Kamloops
Primary Site: Kamloops; Backup Site: Vancouver
Complete State Recovery (CPU, disk, memory, network)
No Application Level Recovery
MAIN CONTRIBUTIONS
Remus (NSDI ’08)
• Checkpoint based State Replication
• Fully Transparent HA
• Recovery Consistency
• No Application level recovery
RemusDB (VLDB ’11)
• Optimize Server Latency
• Reduce Replication Bandwidth by up to 80% using Page Delta Compression and Disk Read Tracking
SecondSite (VEE ’12)
• Failover Arbitration in Wide Area
• Stateful Network Failover over Wide Area
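To make "Page Delta Compression" concrete: the general idea is to send only the difference between a memory page and its previously replicated version. The sketch below is an illustration under stated assumptions, not RemusDB's actual encoding. The XOR delta, the use of zlib, and both function names are assumptions made for this example.

```python
import zlib


def compress_page_delta(previous: bytes, current: bytes) -> bytes:
    """Encode `current` as a compressed XOR delta against `previous`."""
    if len(previous) != len(current):
        raise ValueError("pages must have the same size")
    # Unchanged bytes XOR to zero, so a lightly modified page yields a
    # delta that is mostly zeros and compresses very well.
    delta = bytes(a ^ b for a, b in zip(previous, current))
    return zlib.compress(delta)


def apply_page_delta(previous: bytes, payload: bytes) -> bytes:
    """Rebuild the current page on the receiving side."""
    delta = zlib.decompress(payload)
    return bytes(a ^ b for a, b in zip(previous, delta))


if __name__ == "__main__":
    old_page = bytes(4096)
    new_page = bytearray(old_page)
    new_page[10] = 0xFF  # a single modified byte
    payload = compress_page_delta(old_page, bytes(new_page))
    print(len(payload), "bytes sent instead of", len(new_page))
```

The receiver must hold the same previous version of the page, which is why this kind of scheme fits a checkpoint stream where both sides already share the last checkpoint.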
CONTRIBUTIONS..
FAILURE DETECTION IN REMUS
[Figure labels: External Network; Primary (NIC1, NIC2); Backup (NIC1, NIC2); Checkpoints; LAN]
• A pair of independent dedicated NICs carry replication traffic.
• Backup declares Primary failure only if
  • It cannot reach Primary via NIC1 and NIC2
  • It can reach External N/W via NIC1
• Failure of Replication link alone results in Backup shutdown.
• Split Brain occurs only when both NICs/links fail.
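The Backup's failure-declaration rule on this slide can be written as a small predicate. This is a minimal sketch, not Remus code: the function name and the three boolean inputs are hypothetical, and how reachability is actually probed is left out. The Backup-shutdown case is deliberately not modeled.

```python
def backup_declares_primary_failed(primary_via_nic1: bool,
                                   primary_via_nic2: bool,
                                   external_via_nic1: bool) -> bool:
    """Return True only when the Backup should treat the Primary as failed.

    Both conditions from the slide must hold:
      1. the Primary is unreachable over NIC1 and over NIC2, and
      2. the external network is still reachable over NIC1.
    """
    primary_unreachable = not primary_via_nic1 and not primary_via_nic2
    return primary_unreachable and external_via_nic1


if __name__ == "__main__":
    # Primary silent on both NICs, external network still visible.
    print(backup_declares_primary_failed(False, False, True))
    # Primary still answers on NIC2, so no failure is declared.
    print(backup_declares_primary_failed(False, True, True))
```

The second condition guards against the Backup itself being the isolated node: if it cannot see the external network either, the missing Primary is not enough evidence to take over.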
FAILURE DETECTION IN WIDE AREA DEPLOYMENTS
Cannot distinguish between link and node failure.
Higher chances of Split Brain as the network is not reliable anymore
[Figure labels: External Network; Primary (NIC1, NIC2); Backup (NIC1, NIC2); Checkpoints; LAN; WAN; Primary Datacenter; Backup Datacenter; Replication Channel; INTERNET]
FAILOVER ARBITRATION
Local Quorum of Simple Reachability Detectors.
Stewards can be placed on third party clouds.
Google App Server implementation with ~100 LoC.
Provider/User could have other sophisticated implementations.
FAILOVER ARBITRATION..
[Figure: Stewards 1 to 5 (Apriori Steward Set Agreement). Primary and Backup are joined by the Replication Stream, and the Quorum Logic at each site sends POLL 1 to POLL 5 to the stewards; several links are marked with an X.]
Primary Quorum Logic: I need majority to stay alive
Backup Quorum Logic: I need exclusive majority to failover
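The two quorum rules on this slide can be sketched as follows. This is an illustration only, not the SecondSite implementation. In particular, reading "exclusive majority" as "a majority of stewards that the Backup can reach and that have not recently been polled by the Primary" is an assumption, as are all function and parameter names.

```python
from typing import AbstractSet


def has_majority(votes: int, steward_count: int) -> bool:
    """Strict majority of the agreed steward set."""
    return votes > steward_count // 2


def primary_stays_alive(stewards: AbstractSet[int],
                        reachable_by_primary: AbstractSet[int]) -> bool:
    """Primary rule: it needs a majority of stewards to stay alive."""
    return has_majority(len(stewards & reachable_by_primary), len(stewards))


def backup_may_failover(stewards: AbstractSet[int],
                        reachable_by_backup: AbstractSet[int],
                        recently_polled_by_primary: AbstractSet[int]) -> bool:
    """Backup rule (assumed reading): only stewards that the Backup reaches
    and that have NOT heard from the Primary count toward its majority."""
    exclusive = (stewards & reachable_by_backup) - recently_polled_by_primary
    return has_majority(len(exclusive), len(stewards))


if __name__ == "__main__":
    stewards = {1, 2, 3, 4, 5}
    # Primary reaches only stewards 1 and 2: no majority, so it must stop.
    print(primary_stays_alive(stewards, {1, 2}))
    # Backup reaches all five; three of them have not heard from the Primary.
    print(backup_may_failover(stewards, stewards, {1, 2}))
```

Under this reading the two rules cannot both succeed at once: if the Primary holds a majority, fewer than half of the stewards remain that have not heard from it.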
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION
Remus – LAN - Gratuitous ARP from Backup Host
SecondSite – WAN/Internet – BGP Route Update from Backup Datacenter
Need support from upstream ISP(s) at both Datacenters
IP Migration achieved through BGP Multi-homing
NETWORK FAILOVER WITHOUT SERVICE INTERRUPTION..
[Figure labels: Internet; BCNet (AS-271); Vancouver (134.87.2.173); Kamloops (207.23.255.237); 134.87.2.174; 207.23.255.238; AS-64678 (stub) (134.87.3.0/24), shown at both sites; VMs; Primary Site; Backup Site; BGP Multi-homing; Replication]
Routing traffic to Primary Site
Re-routing traffic to Backup Site on Failover
as-path prepend 64678 64678
as-path prepend 64678 64678 64678 64678
as-path prepend 64678
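AS-path prepending steers traffic because, when other attributes such as local preference are equal, BGP prefers the route with the shorter AS path. The sketch below models only that one tie-breaker. The mapping of prepend counts to sites is an assumption for illustration: 2 at Vancouver, 4 at Kamloops during normal operation, and 1 at Kamloops after failover. The slide lists the three prepend values but the transcript does not say where each applies.

```python
from typing import Dict, List

STUB_AS = 64678  # stub AS number shown on the slide


def announced_path(prepends: int) -> List[int]:
    """AS path for the prefix: the origin AS plus `prepends` extra copies."""
    return [STUB_AS] * (prepends + 1)


def preferred_site(paths: Dict[str, List[int]]) -> str:
    """Pick the site whose announcement has the shortest AS path."""
    return min(paths, key=lambda site: len(paths[site]))


# Assumed prepend counts; see the note above.
normal = {"Vancouver": announced_path(2), "Kamloops": announced_path(4)}
after_failover = {"Vancouver": announced_path(2), "Kamloops": announced_path(1)}

if __name__ == "__main__":
    print(preferred_site(normal))
    print(preferred_site(after_failover))
```

With these numbers the backup site wins after failover even if the old announcement from the primary site is still visible, since its path is now shorter.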
OVERVIEW
A Case for Commoditizing Disaster Tolerance
SecondSite – System Design
Evaluation & Experiences
EVALUATION
I want periodic failovers with no downtime!
Did you run regression tests?
Failover Works!!
More than one failure?
I will have to restart HA!
RESTARTING HA
Need to Resynchronize Storage.
Avoiding Service Downtime requires Online Resynchronization
Leverage DRBD – only resynchronizes blocks that have changed
Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
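The idea of resynchronizing only changed blocks can be sketched with a set of dirty block indices. This toy model is not DRBD's on-disk bitmap or wire protocol; the class, function, and attribute names are hypothetical.

```python
from typing import List, Set


class TrackedDisk:
    """A toy block device that remembers which blocks were written."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: List[bytes] = [b""] * num_blocks
        self.dirty: Set[int] = set()

    def write(self, index: int, data: bytes) -> None:
        self.blocks[index] = data
        self.dirty.add(index)  # mark the block as out of sync with the peer


def resynchronize(source: TrackedDisk, peer_blocks: List[bytes]) -> int:
    """Copy only the dirty blocks to the peer; return how many were copied."""
    copied = 0
    for index in sorted(source.dirty):
        peer_blocks[index] = source.blocks[index]
        copied += 1
    source.dirty.clear()  # peer is in sync again
    return copied


if __name__ == "__main__":
    disk = TrackedDisk(num_blocks=8)
    peer = list(disk.blocks)   # peer starts out identical
    disk.write(2, b"new data")  # writes made while the peer was away
    disk.write(5, b"more data")
    print(resynchronize(disk, peer), "of", len(disk.blocks), "blocks copied")
```

Because the cost scales with the number of changed blocks rather than the disk size, resynchronization after a short outage can run while the service stays online.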
REGRESSION TESTS
Synthetic Workloads to stress test the Replication Pipeline
Failovers every 90 minutes
Discovered some interesting corner cases
Page-table corruptions in memory checkpoints
Write-after-write I/O ordering in disk replication
SECONDSITE – THE COMPLETE PICTURE
• Service Downtime includes timeout for failure detection (10s)
• Failure Detection Timeout is configurable
4 VMs x 100 Clients/VM
REPLICATION BANDWIDTH CONSUMPTION
4 VMs x 100 Clients/VM
DEMO
Expect a real disaster (conference demos are not a good idea!)
APPLICATION THROUGHPUT VS. REPLICATION LATENCY
SPECWeb w/ 100 Clients
Kamloops
RESOURCE UTILIZATION VS. APPLICATION LOAD
[Figure panels: Domain-0 CPU Utilization; Bandwidth usage on Replication Channel]
Cost of HA as a function of Application Load (OLTP w/ 100 Clients)
RESYNCHRONIZATION DELAYS VS. OUTAGE PERIOD
OLTP Workload
SETUP WORKFLOW – RECOVERY SITE
The user creates a recovery plan that is associated with one or more protection groups.
Source: VMware Site Recovery Manager – Technical Overview
RECOVERY PLAN
VM Shutdown
High Priority VM Recovery
Prepare Storage
High Priority VM Shutdown
Normal Priority VM Recovery
Low Priority VM Recovery
Source: VMware Site Recovery Manager – Technical Overview