srecon-americas-2017: trafficshift: avoiding disasters at scale

TrafficShift - Avoiding Disasters at Scale

Michael Kehoe, Staff SRE, LinkedIn

Anil Mallapur, SRE, LinkedIn

Overview

LinkedIn Architectural Overview

Fabric Disaster Recovery

Questions

467+ million members

World’s largest professional network

200+ Countries

Who are we?

Production-SRE team at LinkedIn

● Assist in restoring stability to services during site critical issues

● Develop applications to improve MTTD and MTTR

● Provide direction and guidelines for site monitoring

● Build tools for efficient site issue troubleshooting, issue detection & correlation

Terminology

Fabric/Colo: Data center with the full application stack deployed

PoP/Edge: Entry point to the LinkedIn network (TCP/SSL termination)

Load Test: Planned stress testing of data centers

Active & Passive

Active & Active

Multi-colo 3-way Active & Active

Multi-colo n-way Active & Active

4 Data Centers

13 PoPs

1000+ services

What are Disasters?

Service Degradation

Infrastructure Issues

Human Error

Data Center on

One solution for all disasters

TrafficShift - Reroute user traffic to different datacenters without any user interruption.

Whaaaat ?

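[Diagram: request path from the Internet through the EDGE (Border Router, IPVS, ATS) to the FABRIC (ATS Frontend, Stickyrouting Service). Annotations: a request with DC1 in its cookie goes to DC1; if there is no cookie in the header, the Stickyrouting Service gets the primary colo for the user (DC2 in the example).]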

[Diagram: Stickyrouting maps BUCKETS (1, 2, 3 ... 10 through 91, 92, 93 ... 100) to a FABRIC such as US-East.]
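The routing decision described on these slides can be sketched roughly as follows. This is a minimal illustration, not LinkedIn's implementation: the hash-based `bucket_for` function, the `BUCKET_TO_FABRIC` table, and the 50/50 split between two fabrics are all assumptions made for the example. Only the ideas of 100 buckets, a colo cookie, and a lookup of the primary colo come from the slides.

```python
import hashlib

NUM_BUCKETS = 100  # the slides show buckets numbered 1 to 100

# Hypothetical bucket-to-fabric table; in the talk this mapping lives in
# the Stickyrouting service, and the split below is purely illustrative.
BUCKET_TO_FABRIC = {
    bucket: ("US-East" if bucket <= 50 else "US-West")
    for bucket in range(1, NUM_BUCKETS + 1)
}


def bucket_for(member_id: str) -> int:
    """Map a member to a stable bucket in 1..NUM_BUCKETS (assumed scheme)."""
    digest = hashlib.sha256(member_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS + 1


def route(member_id: str, cookie_fabric=None) -> str:
    """Pick the fabric that should serve a request."""
    if cookie_fabric is not None:
        # The request already carries a colo in its cookie: honour it.
        return cookie_fabric
    # No cookie in the header: look up the member's primary colo.
    return BUCKET_TO_FABRIC[bucket_for(member_id)]
```

Because traffic is controlled per bucket rather than per user, moving a bucket to another fabric moves every member in that bucket at once.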

How does StickyRouting assign users to a colo?

Capacity of Fabric

Offline job to assign colo to users

Geographic distance to users
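One way to picture an offline job that combines fabric capacity with geographic distance is a greedy assignment: give each user the nearest fabric that still has room. The sketch below is an assumption for illustration only; the function name `assign_users`, the input shapes, and the greedy strategy are hypothetical and the slides do not describe the actual algorithm.

```python
def assign_users(distances, capacity):
    """Assign each user to the nearest fabric with spare capacity.

    distances: {user: {fabric: distance}} (hypothetical input shape)
    capacity:  {fabric: maximum number of users} (hypothetical input shape)
    Users that fit nowhere are left out of the result.
    """
    remaining = dict(capacity)
    assignment = {}
    for user, by_fabric in distances.items():
        # Try fabrics from nearest to farthest.
        for fabric in sorted(by_fabric, key=by_fabric.get):
            if remaining.get(fabric, 0) > 0:
                assignment[user] = fabric
                remaining[fabric] -= 1
                break
    return assignment


example = assign_users(
    {
        "a": {"US-East": 10, "US-West": 50},
        "b": {"US-East": 20, "US-West": 40},
        "c": {"US-East": 5, "US-West": 100},
    },
    {"US-East": 2, "US-West": 5},
)
```

In this example user `c` is closest to US-East but lands in US-West because US-East is already full, which is the capacity-versus-distance trade-off the slide lists.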

Advantages of sticky routing

Lower latency for users

Store data where it’s necessary

Provides precise control over capacity allotment

When to TrafficShift?

Impact Mitigation

Planned Maintenance

Stress Test

Site Traffic and Disaster Recovery

[Diagram: distributed load across the US-West, US-Central, US-East and APAC fabrics. Offline fabrics show 0% distributed load and an online fabric shows 50%.]

Traffic stops being served to offline fabrics

Traffic is shifted to online fabrics
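The arithmetic behind such a shift can be sketched as below: offline fabrics drop to 0% and their share is spread over the online fabrics in proportion to their existing weights. The function name `shift_traffic` and the proportional rule are assumptions for illustration; the slides only show the resulting percentages.

```python
def shift_traffic(weights, offline):
    """Return the percentage of traffic per fabric after a failout.

    weights: {fabric: relative share of traffic before the shift}
    offline: set of fabric names that must stop serving traffic
    """
    online = {f: w for f, w in weights.items() if f not in offline}
    total = sum(online.values())
    if total <= 0:
        raise ValueError("no online fabric has capacity to take traffic")
    # Offline fabrics get 0%; online fabrics are rescaled to sum to 100%.
    return {f: 100.0 * online.get(f, 0) / total for f in weights}


before = {"US-West": 25, "US-Central": 25, "US-East": 25, "APAC": 25}
after = shift_traffic(before, offline={"US-West", "APAC"})
```

With two of four equally weighted fabrics offline, each remaining fabric carries 50% of the load.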

TrafficShift Architecture

[Diagram: TrafficShift components: Web application, Salt master, Stickyrouting Service, Couchbase, Backend Worker Processes, controlling which BUCKETS are served by which FABRIC.]

What is Load Testing?

3 times a week

Peak hour traffic

Fixed SLA

Target Data Center

Load Testing

[Diagram: load test against a target fabric, showing the traffic percentage for the US-West and US-East fabrics.]
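A load test of this kind can be thought of as raising the target fabric's traffic percentage in steps and backing out if the site looks unhealthy. The sketch below assumes a stepwise ramp and a caller-supplied health check; `ramp_steps`, `run_load_test`, and `is_healthy` are hypothetical names, and the slides do not state how the real tool ramps traffic.

```python
def ramp_steps(current, target, step):
    """List the traffic percentages to apply, ending exactly at target."""
    if step <= 0:
        raise ValueError("step must be positive")
    levels = []
    level = current
    while level < target:
        level = min(level + step, target)
        levels.append(level)
    return levels


def run_load_test(current, target, step, is_healthy):
    """Ramp towards target; return the percentage left in place.

    is_healthy(level) is a hypothetical callback that reports whether the
    target fabric is still healthy at the given traffic percentage.
    """
    applied = current
    for level in ramp_steps(current, target, step):
        if not is_healthy(level):
            return current  # abort: shift traffic back to where it started
        applied = level
    return applied
```

For example, `ramp_steps(25, 50, 10)` yields the levels 35, 45 and 50, and a run whose health check fails at any level returns the original percentage.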

Benefits of Load Test

Capacity Planning

Leverage production traffic to stress test services

Identify bugs in production

Confidence in Disaster Recovery

Big Red Button

Kill switch (No Kidding)

Failout of a datacenter and PoP in less than 10 minutes

Minimal user impact

Key Takeaways

● Design infrastructure to facilitate disaster recovery

● Stress test regularly to avoid surprises

● Automate everything to reduce time to mitigate impact

Questions

Edge Failout

Edge Presence

LinkedIn’s PoP Architecture

• Using IPVS - Each PoP announces a unicast address and a regional anycast address

• APAC, EU and NAMER anycast regions

• Use GeoDNS to steer users to the ‘best’ PoP

• DNS will either provide users with an anycast or unicast address for www.linkedin.com

• Traffic from US and EU members is nearly all anycast

• APAC is all unicast

LinkedIn’s PoP DR

• Sometimes need to fail out of PoPs:

• 3rd party provider issues (e.g. transit links going down)

• Infrastructure maintenance

• Withdraw anycast route announcements

• Fail healthchecks on proxy to drain unicast traffic
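The healthcheck part of a PoP failout can be pictured as a drain flag: once set, the proxy's healthcheck reports failure so that DNS and load balancers stop sending unicast traffic to it. The class below is a minimal sketch under that assumption; `PopHealth`, its method names, and the 200/503 status codes are hypothetical and do not describe LinkedIn's actual proxy or tooling. Withdrawing anycast routes happens at the routing layer and is not modelled here.

```python
class PopHealth:
    """Tracks whether a PoP proxy should pass or fail its healthcheck."""

    def __init__(self):
        self.draining = False

    def fail_out(self):
        # Start failing healthchecks so unicast traffic drains away.
        self.draining = True

    def fail_in(self):
        # Resume passing healthchecks so traffic can return.
        self.draining = False

    def healthcheck_status(self):
        # Assumed convention: 200 means healthy, 503 means take me out.
        return 503 if self.draining else 200


pop = PopHealth()
status_before = pop.healthcheck_status()
pop.fail_out()
status_during = pop.healthcheck_status()
pop.fail_in()
status_after = pop.healthcheck_status()
```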
