zero-downtime datacenter failovers · 2020. 4. 21. · zero-downtime datacenter failovers...
ZERO-DOWNTIME DATACENTER FAILOVERS
(SWITCHING HOSTING PROVIDERS FOR DUMMIES)
1 — Luka Kladaric @ AWS Adria 2017.
WHO?
Luka Kladaric
formerly a web developer for >10 years
now: freelancing, consulting, architecting, securing
migrating an entire company's infrastructure
from Rackspace to Amazon AWS
60 virtual machines
3 baremetal boxes (db)
assorted networking equipment
the migration took 2 months to execute
but a year and a half to prepare
FOUND STATE
hand-crafted build server, unreproducible
half the servers are not deployable from scratch
or their deployability is unknown
same mysql account used by everyone everywhere
that mysql account is "root"
that mysql db is 1.5 TB big
no access to LB config
has a bunch of magic in it
changes often result in issues and outages
no server metrics / perfdata
no idea if overprovisioned and by how much
no access to disaster recovery instance
in case the primary DC went down
(access goes through primary DC)
RACKSPACE WAS REALLY TERRIBLE
a constant pain to deal with
unexpected outages with never-explained causes
unresponsive support team
zero flexibility
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months
realistically: a year
NO LEADERSHIP BUY-IN
2 failed attempts to get approval
Infrastructure team makes a pact
"Do Things The Right Way From Now On"
mask cleanup work with ongoing maintenance
A YEAR AND A HALF LATER...
majority of the issues were fixed
or at least significantly improved
PLOT TWIST
RACKSPACE STARTS FALLING APART
New estimate: 19 man-days
(after final push for preparation)
SAVINGS ESTIMATE
$18k -> $6k
that's -66%
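As a quick check of the arithmetic on this slide, the sketch below recomputes the relative change from the two quoted figures. The slide does not state the billing period, so none is assumed here; the helper name `percent_change` is made up for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


before_usd = 18_000  # "$18k" from the slide
after_usd = 6_000    # "$6k" from the slide

change = percent_change(before_usd, after_usd)
print(f"{change:.1f}%")  # -66.7%
```

The exact value is two thirds, so the slide's "-66%" is the truncated form of roughly -66.7%.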
GOT APPROVAL!
Actually executed in 25-30 man-days
over 2 months
HOW?
"upgrading the fleet to Ubuntu 16.04"
all servers rebuilt and redeployed with Ansible
build server rebuilt from scratch
deployed from Ansible
all build jobs defined in code
no more tweaking jobs through UI
CloudFlare implemented for faster DNS failover
all LB logic slowly moved to our own haproxies
haproxy configuration auto-generated from Ansible
makes it easy to shuffle things around
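The idea of generating haproxy configuration from an inventory can be illustrated with a minimal sketch. The talk used Ansible for this; the plain Python below is only a stand-in for a template, and the inventory contents, server names, and addresses are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: backend name -> list of (server name, address).
# In the talk this data would live in Ansible; these values are invented.
INVENTORY: Dict[str, List[Tuple[str, str]]] = {
    "web": [("web1", "10.0.0.11:8080"), ("web2", "10.0.0.12:8080")],
}


def render_backend(name: str, servers: List[Tuple[str, str]]) -> str:
    """Render one haproxy `backend` section with health-checked servers."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, address in servers:
        lines.append(f"    server {server_name} {address} check")
    return "\n".join(lines)


config = "\n\n".join(
    render_backend(name, servers) for name, servers in sorted(INVENTORY.items())
)
print(config)
```

Because the config is derived from data, moving an app to another host means editing the inventory and regenerating, rather than hand-editing a load balancer.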
all apps slowly migrated to be served through haproxies
avoiding Rackspace LB magic
VPN bridge between DCs
~20 MB/s, ~20ms ping
good enough to treat as a "local" connection
for shorter periods of time
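To get a feel for what these link numbers imply, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and, purely for illustration, that the 1.5 TB database from the earlier slide would be copied over this link; the slides do not say how the initial copy was actually done.

```python
DB_SIZE_BYTES = 1.5e12    # "1.5 TB" from the earlier slide (decimal units assumed)
LINK_BYTES_PER_S = 20e6   # "~20 MB/s"
RTT_MS = 20               # "~20ms ping"


def bulk_copy_hours(size_bytes: float, bytes_per_s: float) -> float:
    """Hours needed to push size_bytes through the link at full speed."""
    return size_bytes / bytes_per_s / 3600


def added_latency_ms(sequential_round_trips: int, rtt_ms: int = RTT_MS) -> int:
    """Extra latency when a request makes N sequential cross-DC calls."""
    return sequential_round_trips * rtt_ms


copy_hours = bulk_copy_hours(DB_SIZE_BYTES, LINK_BYTES_PER_S)
print(f"{copy_hours:.1f} h")   # 20.8 h
print(added_latency_ms(10))    # 200
```

A request that makes ten sequential cross-DC queries would pick up about 200 ms, which is consistent with treating the link as "local" only for shorter periods.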
mysql master-master replication between DCs
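The slides do not say how conflicting writes were avoided with two masters. One common technique, shown here only as an assumption, is to give each master a distinct `auto_increment_offset` with a shared `auto_increment_increment`, so the two sides never generate the same auto-increment key. The sketch is a simplified model of that behavior, not MySQL itself.

```python
from itertools import islice
from typing import Iterator


def auto_increment_ids(offset: int, increment: int) -> Iterator[int]:
    """Simplified model: ids handed out with a given offset and step."""
    value = offset
    while True:
        yield value
        value += increment


# Two masters, one per DC, sharing increment=2 with different offsets.
dc_a = list(islice(auto_increment_ids(offset=1, increment=2), 5))
dc_b = list(islice(auto_increment_ids(offset=2, increment=2), 5))

print(dc_a)  # [1, 3, 5, 7, 9]
print(dc_b)  # [2, 4, 6, 8, 10]
```

This only prevents primary-key collisions on inserts; concurrent updates to the same row on both sides would still need to be avoided by other means.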
app servers in both DCs
haproxies in both DCs
aware of app servers in both DCs
but preferring local ones
"no request left behind"
DNS failover at CloudFlare is near-instant
and even stray requests get handled
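The "prefer local, fall back to remote" routing that lets stray requests still be served can be sketched as a small selection function. This is an illustrative model, not the actual haproxy configuration from the talk; the `Server` type, the DC labels, and the server names are hypothetical. In haproxy, a similar effect can be had by marking remote servers with the `backup` keyword, which keeps them out of rotation while any non-backup server is up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    dc: str
    healthy: bool = True


def pick_server(servers: List[Server], local_dc: str) -> Optional[Server]:
    """Prefer a healthy server in the local DC, else any healthy server."""
    healthy = [s for s in servers if s.healthy]
    local = [s for s in healthy if s.dc == local_dc]
    candidates = local or healthy
    return candidates[0] if candidates else None


servers = [Server("rs-app1", "rackspace"), Server("aws-app1", "aws")]

first = pick_server(servers, local_dc="rackspace")
print(first.name)  # rs-app1

# Local app server goes away: the request is still served from the other DC.
servers[0].healthy = False
second = pick_server(servers, local_dc="rackspace")
print(second.name)  # aws-app1
```

With this arrangement, a request that reaches the old DC after the DNS switch is forwarded across the VPN bridge instead of failing.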
metrics, metrics, metrics
(Datadog ftw)
RESULTS
core production migrated in days
internal tools migrated within a week or two
developer tools migrated within a month
(git hosting, build server, etc.)
obscure legacy services migrated within 2 months
all hardware at Rackspace
decommissioned within 3 months
side effect: actual HA instead of fake HA
old "two or more of everything" approach
translated well into Availability Zones
AND IT WAS GOOD
QUESTIONS?