SREcon Americas 2017: TrafficShift: Avoiding Disasters at Scale

TrafficShift - Avoiding Disasters at Scale, by Michael Kehoe (Staff SRE, LinkedIn) and Anil Mallapur (SRE, LinkedIn)

Upload: michael-kehoe

Posted on 11-Apr-2017

Category: Engineering

TRANSCRIPT

Page 1:

TrafficShift - Avoiding Disasters at Scale

Michael Kehoe, Staff SRE, LinkedIn

Anil Mallapur, SRE, LinkedIn

Page 2:

Overview

● LinkedIn Architectural Overview

● Fabric Disaster Recovery

● Questions

Page 3:

467+ million members

World’s largest professional network

200+ Countries

Page 4:

Who are we?

Production-SRE team at LinkedIn

● Assist in restoring stability to services during site critical issues

● Develop applications to improve MTTD (mean time to detect) and MTTR (mean time to recover)

● Provide direction and guidelines for site monitoring

● Build tools for efficient site troubleshooting, issue detection, and correlation

Page 5:

Terminology

Fabric/Colo: Data center with the full application stack deployed

PoP/Edge: Entry point to the LinkedIn network (TCP/SSL termination)

Load Test: Planned stress testing of data centers

Page 6:

(Timeline) 2003 → 2010 → 2011 → 2013 → 2014 → 2015: the serving architecture evolved from Active & Passive, to Active & Active, to multi-colo 3-way Active & Active, to multi-colo n-way Active & Active.

Page 7:

2017

4 Data Centers, 13 PoPs, 1000+ services

Page 8:
Page 9:

What are Disasters?

Service Degradation

Infrastructure Issues

Human Error

Data Center on Fire

Page 10:

One solution for all disasters

TrafficShift - Reroute user traffic to different data centers without any user interruption.

Note (speaker): in a microservices environment, partial fail-out for internal APIs is difficult.
Page 11:

Whaaaat?

Page 12:

(Diagram) Request path: Internet → Edge (Border Router → IPVS → ATS) → Fabric (ATS → Frontend), with the Stickyrouting Service consulted along the way.

Page 13:

(Diagram) Edge routing with Stickyrouting: ATS inspects each incoming request. If the header carries a colo cookie (e.g. DC1), the request is routed straight to that fabric. If there is no cookie, ATS asks the Stickyrouting Service for the user's primary colo (e.g. DC2) and routes the request there.
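The cookie-or-lookup decision above, as a minimal Python sketch. parse_colo_cookie, get_primary_colo, and ONLINE_FABRICS are illustrative names, not LinkedIn's actual ATS plugin API.

ONLINE_FABRICS = {"DC1", "DC2"}   # fabrics currently taking traffic

def route_request(headers, stickyrouting):
    """Pick the fabric for a request, preferring the colo pinned in the cookie."""
    colo = parse_colo_cookie(headers.get("Cookie", ""))
    if colo is None:
        # No cookie: ask the Stickyrouting Service for the user's primary colo.
        colo = stickyrouting.get_primary_colo(headers)
    if colo not in ONLINE_FABRICS:
        # Primary colo is failed out: fall back to any online fabric.
        colo = next(iter(ONLINE_FABRICS))
    return colo

def parse_colo_cookie(cookie_header):
    """Extract a colo name from a 'colo=DC1'-style cookie, if present."""
    for part in cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == "colo" and value:
            return value
    return None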

Page 14:

(Diagram) Stickyrouting partitions a fabric's users into numbered buckets (1 to 100 shown for US-East), so traffic can be moved bucket by bucket.
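The deck does not say how members map to buckets; a stable hash is one plausible scheme, sketched here purely as an assumption.

import hashlib

NUM_BUCKETS = 100   # the slide shows buckets 1..100 per fabric

def bucket_for_member(member_id):
    """Stable member-to-bucket mapping; hashing is an assumed implementation."""
    digest = hashlib.md5(str(member_id).encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS + 1

Shifting traffic then means moving whole buckets (say, 1 through 10) between fabrics rather than individual members.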

Page 15:

How does StickyRouting assign users to a colo?

An offline job assigns each user a primary colo, weighing:

Capacity of each fabric

Geographic distance to users
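One way such an offline job could work, as a greedy sketch: give each member the nearest fabric that still has spare capacity. The distance function and capacity numbers are placeholders, not the production job.

def assign_colos(members, colos, capacity, distance):
    """Greedy assignment: the nearest colo with spare capacity wins."""
    remaining = dict(capacity)   # colo -> remaining member slots
    assignment = {}
    for member in members:
        # Consider colos from nearest to farthest for this member.
        for colo in sorted(colos, key=lambda c: distance(member, c)):
            if remaining[colo] > 0:
                remaining[colo] -= 1
                assignment[member] = colo
                break   # member placed
    return assignment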

Page 16:

Advantages of sticky routing

Lower latency for users

Store data where it’s necessary

Provides precise control over capacity allotment

Page 17:

When to TrafficShift?

Impact Mitigation

Planned Maintenance

Stress Test

Page 18:

Site Traffic and Disaster Recovery

(Diagram) Four fabrics behind the edge: US-West, US-Central, US-East, and APAC. Failed-out fabrics take 0% of distributed load while each remaining online fabric absorbs a larger share (50% each in the diagram). Traffic stops being served to offline fabrics and is shifted to online fabrics.

Page 19:

TrafficShift Architecture

(Diagram) A web application drives backend worker processes through a Salt master; the workers update bucket-to-fabric assignments in the Stickyrouting Service, which is backed by Couchbase.
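A hedged sketch of the worker step implied by this diagram: flipping a bucket's serving fabric in the store behind Stickyrouting. The store argument is an abstract key-value client, not the real Couchbase SDK, and the key layout is invented for illustration.

def shift_buckets(store, buckets, from_fabric, to_fabric):
    """Reassign buckets one at a time so edge routing picks up the change gradually."""
    for bucket in buckets:
        key = f"bucket:{from_fabric}:{bucket}"   # invented key layout
        store.set(key, {"serving_fabric": to_fabric})
        # Stickyrouting reads these records, so members in this bucket
        # start being routed to to_fabric on their next request.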

Page 20:

What is Load Testing?

3 times a week

Peak-hour traffic

Fixed SLA

(Diagram) Live traffic is shifted from the other fabrics into a target data center (US-West in the diagram).

Page 21:

Load Testing

(Diagram) Traffic percentage by fabric during a load test: the target fabric, US-West, is ramped up (the chart marks 50%) while US-East drains.
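A load-test ramp might look like the sketch below; the step sizes, settle time, and the shift_percent/check_sla hooks are assumptions for illustration, not LinkedIn tooling.

import time

def ramp_load_test(target_fabric, shift_percent, check_sla,
                   steps=(60, 70, 80, 90, 100), settle_seconds=300):
    """Ramp the target fabric toward peak traffic, aborting on SLA breach."""
    for pct in steps:
        shift_percent(target_fabric, pct)   # route pct% of site traffic there
        time.sleep(settle_seconds)          # let the metrics settle
        if not check_sla(target_fabric):
            shift_percent(target_fabric, steps[0])   # back off to the first step
            raise RuntimeError(f"SLA breach at {pct}% on {target_fabric}")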

Page 22:

Benefits of Load Test

Capacity planning

Leverage production traffic to stress test services

Identify bugs in production

Confidence in disaster recovery

Page 23:

Big Red Button

Kill switch (no kidding): fail out of a data center and a PoP in less than 10 minutes, with minimal user impact.
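A minimal sketch of what such a kill switch might orchestrate, reusing the bucket-shift and PoP-drain ideas sketched earlier; every name here is illustrative, not LinkedIn's tooling.

def big_red_button(fabric, pops, online_fabrics, shift_all_buckets, drain_pop):
    """Evacuate one fabric and its PoPs in a single action.
    shift_all_buckets(fabric, targets) and drain_pop(pop) are injected hooks."""
    targets = [f for f in online_fabrics if f != fabric]
    if not targets:
        raise RuntimeError("refusing to fail out the last online fabric")
    shift_all_buckets(fabric, targets)   # move every user bucket elsewhere
    for pop in pops:
        drain_pop(pop)                   # withdraw anycast / fail healthchecks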

Page 24:

Key Takeaways

● Design infrastructure to facilitate disaster recovery

● Stress test regularly to avoid surprises

● Automate everything to reduce time to mitigate impact

Page 25:

Questions

Page 26:
Page 27:

Edge Failout

Page 28:

Edge Presence

Page 29:

LinkedIn’s PoP Architecture

• Using IPVS, each PoP announces a unicast address and a regional anycast address

• APAC, EU and NAMER anycast regions

• Use GeoDNS to steer users to the ‘best’ PoP

• DNS will either provide users with an anycast or unicast address for www.linkedin.com

• US and EU members are nearly all anycast

• APAC is all unicast
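A toy GeoDNS policy consistent with these bullets, assuming a per-region table; the addresses use reserved documentation ranges and the PoP hostnames are invented.

ANYCAST = {"NAMER": "192.0.2.10", "EU": "198.51.100.10"}   # documentation IPs
UNICAST_POPS = {"APAC": ["pop-sin.example.net", "pop-hkg.example.net"]}  # invented

def resolve(region, nearest_pop):
    """Answer a DNS query for www.linkedin.com with an anycast or unicast address."""
    if region in ANYCAST:
        return ANYCAST[region]   # regional anycast address
    return nearest_pop           # unicast address of the 'best' PoP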

Page 30:

LinkedIn’s PoP DR

• Sometimes need to fail out of PoPs:
  • 3rd-party provider issues (e.g. transit links going down)
  • Infrastructure maintenance

• Withdraw anycast route announcements

• Fail healthchecks on the proxy to drain unicast traffic
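A minimal sketch of the unicast-drain technique, assuming a flag-file convention: when the flag exists, the proxy's healthcheck returns 503 and the load balancer stops sending traffic to this PoP. This is a generic pattern, not LinkedIn's implementation.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/etc/proxy/drain"   # operators touch this file to drain the PoP

class HealthCheck(BaseHTTPRequestHandler):
    def do_GET(self):
        if os.path.exists(DRAIN_FLAG):
            self.send_response(503)   # failing check: LB drains unicast traffic
        else:
            self.send_response(200)   # healthy: keep serving
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HealthCheck).serve_forever()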
