APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale


Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Michael Kehoe, Staff Site Reliability Engineer, LinkedIn


Overview

• Problem Statement
• Solution – how LinkedIn trafficshifts
  • Datacenter shifting
  • PoP steering
• Challenges of the APAC region
• IPv4 vs IPv6
• Questions

$ whoami – Michael Kehoe

• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American

$ whatis SRE

• Site Reliability Engineering – operations for the production application environment
• Responsibilities include:
  • Architecture design
  • Capacity planning
  • Operations
  • Tooling
• Also responsible for DNS/CDN management & traffic infrastructure


Terminology

• PoP – a Point of Presence, where LinkedIn terminates incoming requests
• Fabric – a datacenter with the full LinkedIn production stack deployed
• Loadtest – a stress test of a Fabric, to simulate a disaster scenario

Problem Statement: Disaster Recovery

• Fail between Fabrics
  • Performance of applications is degraded
  • Validate disaster recovery (DR) scenarios
  • Expose bugs and suboptimal configurations via loadtests
  • Planned maintenance
• Fail between PoPs
  • Mitigate the impact of 3rd-party provider maintenance or failure (e.g. transport links)
  • Software/configuration bugs

Problem Statement: Performance

• Fabric assignment
  • Assign a preferred and secondary fabric to all members based on:
    • Member location
    • Capacity
• PoP/CDN steering
  • Use GeoDNS to steer users to the 'best' PoP
  • Use RUM DNS to steer users to the 'best' CDN

Problem Statement: United States Performance (Global)

Problem Statement: APAC Performance (APAC cities)

Problem Statement: Performance Delta, US vs APAC

Problem Statement: Site Speed

• Site speed affects user engagement
• User engagement affects page views & transactions
• Bottom line: site speed has an impact on revenue

Solution: LinkedIn's Traffic Architecture

Solution: Fabric Shifting

• Stickyrouting (see the sketch below)
  • A Hadoop job calculates a primary and secondary datacenter for each user based on location
  • This data is stored in a key-value store (Espresso)
  • Stickyrouting serves this information over a RESTful interface to our edge PoPs
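
A minimal sketch of the lookup an edge PoP might do against Stickyrouting. The endpoint path and response fields are assumptions; the deck only says the data is served over a RESTful interface.

```python
import requests

# Hypothetical internal endpoint; the real API shape is not described in the talk.
STICKYROUTING_URL = "https://stickyrouting.internal.example.com"

def fabrics_for_member(member_id: int) -> tuple[str, str]:
    """Read the precomputed (primary, secondary) fabric for a member.

    The assignment is calculated offline by a Hadoop job and stored in a
    key-value store (Espresso); this call only fetches the result.
    """
    resp = requests.get(f"{STICKYROUTING_URL}/fabrics/{member_id}", timeout=2.0)
    resp.raise_for_status()
    body = resp.json()
    return body["primary"], body["secondary"]
```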


Solution: Fabric Shifting

• Different traffic types are partitioned and controlled separately:
  • Logged-in vs logged-out
  • CDNs
  • Monitoring
  • Microsites
• Logged-in users are placed into 'buckets'
• Buckets are marked online/offline to move site traffic (see the sketch below)
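
To make bucket-based shifting concrete, here is an illustrative sketch. The bucket count, hashing scheme and class names are assumptions, not LinkedIn's actual implementation; the idea is only that buckets are stable per member and can be taken offline one at a time.

```python
NUM_BUCKETS = 1000

def bucket_for(member_id: int) -> int:
    # Stable assignment: a member always lands in the same bucket.
    return member_id % NUM_BUCKETS

class BucketControl:
    """Marks buckets offline so their traffic moves to the members' secondary fabric."""

    def __init__(self) -> None:
        self.offline: set[int] = set()

    def drain(self, buckets: list[int]) -> None:
        # Draining bucket by bucket makes the shift gradual and controllable.
        self.offline.update(buckets)

    def route(self, member_id: int, primary: str, secondary: str) -> str:
        return secondary if bucket_for(member_id) in self.offline else primary
```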


Solution: Fabric Shifting

• Stickyrouting benefits:
  • Ensures we serve requests as close to the user as possible
  • Capacity management for datacenters: we can assign a percentage of users to a datacenter (see the sketch below)
  • Enables personal data routing (PDR): only store data where we need it
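
A sketch of what percentage-based assignment could look like, ignoring the location factor for brevity. Fabric names and shares are invented; the deck states only that a percentage of users can be assigned to a datacenter.

```python
# Hypothetical capacity shares per fabric (must sum to 1.0).
CAPACITY_SHARE = {"fabric-west": 0.40, "fabric-east": 0.35, "fabric-central": 0.25}

def assign_fabric(member_id: int) -> str:
    # Map the member deterministically onto [0, 1), then walk the shares.
    point = (member_id % 10_000) / 10_000
    cumulative = 0.0
    for fabric, share in CAPACITY_SHARE.items():
        cumulative += share
        if point < cumulative:
            return fabric
    return next(reversed(CAPACITY_SHARE))  # guard against float rounding
```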

Solution: Fabric Shifting Automation

Solution: Fabric Shifting Loadtests

Solution: LinkedIn's Traffic Architecture


Solution: LinkedIn's PoP Distribution


Solution: LinkedIn's PoP Architecture

• Using IPVS, each PoP announces a unicast address and a regional anycast address
  • APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the 'best' PoP
• DNS provides users with either an anycast or a unicast address for www.linkedin.com (see the sketch below)
  • US and EU members are nearly all on anycast
  • APAC is all unicast
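
The steering policy above can be summarized in a few lines. This is an illustration only: the region mapping is from the slide, but the addresses are documentation-range placeholders and the function is not LinkedIn's code.

```python
# Hypothetical VIPs (documentation address ranges).
ANYCAST_VIP = {"NAMER": "203.0.113.1", "EU": "203.0.113.2"}
UNICAST_VIP = {"singapore": "198.51.100.1", "mumbai": "198.51.100.2",
               "hongkong": "198.51.100.3"}

def dns_answer(resolver_region: str, best_pop: str) -> str:
    # US/EU resolvers get the regional anycast record; APAC resolvers get
    # the unicast address of the PoP picked for them.
    if resolver_region in ANYCAST_VIP:
        return ANYCAST_VIP[resolver_region]
    return UNICAST_VIP[best_pop]
```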


Solution: LinkedIn's PoP DR

• Sometimes we need to fail out of PoPs:
  • 3rd-party provider issues (e.g. transit links going down)
  • Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on the proxy to drain unicast traffic (see the sketch below)
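
A minimal sketch of the healthcheck-based drain, assuming a flag-file convention and port that are invented here: once an operator creates the flag, the check returns 503 and upstream load balancing stops sending unicast traffic to the PoP.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/var/run/pop-drain"   # hypothetical path

class Healthcheck(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Fail the check while the drain flag exists; pass otherwise.
        status = 503 if os.path.exists(DRAIN_FLAG) else 200
        self.send_response(status)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Healthcheck).serve_forever()
```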


Solution: LinkedIn's PoP Performance

• PoP DNS steering
  • LinkedIn currently uses GeoDNS for routing
  • Piloting RUM DNS: pick the best PoP based on network, not country
• CDN steering (see the sketch below)
  • Mix CDNs to get the best performance
  • Constantly evaluate performance and availability
  • Automatically adjust CDN weighting
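
One plausible reading of "automatically adjust CDN weighting", sketched below: recompute weights from recent RUM latency so faster CDNs serve a larger share. The CDN names, the inverse-latency rule and the numbers are all assumptions.

```python
import random

def reweight(median_latency_ms: dict[str, float]) -> dict[str, float]:
    # Weight each CDN by inverse latency, normalized to sum to 1.
    inverse = {cdn: 1.0 / ms for cdn, ms in median_latency_ms.items()}
    total = sum(inverse.values())
    return {cdn: w / total for cdn, w in inverse.items()}

def pick_cdn(weights: dict[str, float]) -> str:
    return random.choices(list(weights), weights=list(weights.values()))[0]

weights = reweight({"cdn-a": 80.0, "cdn-b": 120.0})  # hypothetical RUM medians
print(weights)   # the faster CDN ends up with ~0.6 of the traffic
print(pick_cdn(weights))
```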


Solution: LinkedIn's PoP Performance

[Chart: US CDN request time, 50th percentile, over 24 hours]


APAC Challenges: Working Around Fiber Cuts

• Case study: failing out of the India PoP due to fiber cuts

[Chart: connection time for Indian members, 90th percentile]


APAC Challenges: GeoDNS Suboptimal PoPs

[Diagram: traffic from ASN15802 and ASN5384 reaching the Singapore and Mumbai PoPs, with link latencies of 45ms, 220ms and 70ms. ASN15802's RTT to Singapore is (220 + 70) = 290ms, all at the 50th percentile.]

Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
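
This case is the motivation for the RUM DNS pilot above: choose the PoP by measured RTT instead of geography. A toy version, where the probe values are assumed apart from the figures taken from the diagram:

```python
# Assumed per-network probe results; the 290ms Singapore figure is the
# (220 + 70) path from the slide, the others are illustrative.
measured_rtt_ms = {"singapore": 290, "mumbai": 45, "hongkong": 160}

# Answer DNS with the PoP that is actually fastest for this network.
best_pop = min(measured_rtt_ms, key=measured_rtt_ms.get)
print(best_pop)   # mumbai
```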


APAC Challenges: GeoDNS Suboptimal PoPs

[Diagram: RTTs from ASN15802 and ASN5384 to PoPs in London, Dublin, Singapore, Mumbai and Hong Kong; labeled latencies include 160ms, 45ms, 70ms, 35ms, 350ms and 160ms.]


APAC Challenges: GeoDNS Suboptimal PoPs

[Chart: y-axis values range from 600 to 1200]

IPv4 vs IPv6: Performance & Adoption

• IPv6 performs better for our members
  • Fewer request timeouts on IPv6 for mobile users
  • Mobile carriers are adopting IPv6 faster
  • A win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6


Conclusion: Key Takeaways

• Application-level traffic engineering is extremely important for content providers

• RUM data is extremely useful for finding anomalies

• Route traffic based on performance, not just location

• IPv6 performs better for LinkedIn users


Questions?
