APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale


Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Michael Kehoe, Staff Site Reliability Engineer, LinkedIn


Overview

• Problem Statement
• Solution – how LinkedIn trafficshifts
  • Datacenter shifting
  • PoP steering
• Challenges of the APAC region
• IPv4 vs IPv6
• Questions

$ whoami – Michael Kehoe

• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American

$ whatis SRE

• Site Reliability Engineering – operations for the production application environment
• Responsibilities include:
  • Architecture design
  • Capacity planning
  • Operations
  • Tooling
• Also responsible for DNS/CDN management & traffic infrastructure


Terminology

• PoP – a Point of Presence, where LinkedIn terminates incoming requests
• Fabric – a datacenter with the full LinkedIn production stack deployed
• Loadtest – a stress test of a Fabric, to simulate a disaster scenario

Problem Statement: Disaster Recovery

• Fail between Fabrics
  • Performance of applications is degraded
  • Validate disaster recovery (DR) scenarios
  • Expose bugs and suboptimal configurations via loadtests
  • Planned maintenance
• Fail between PoPs
  • Mitigate the impact of 3rd-party provider maintenance or failure (e.g. transport links)
  • Software/configuration bugs

Problem Statement: Performance

• Fabric assignment
  • Assign a preferred and secondary fabric to all members based on:
    • Member location
    • Capacity
• PoP/CDN steering
  • Use GeoDNS to steer users to the 'best' PoP
  • Use RUM DNS to steer users to the 'best' CDN

Problem Statement: United States Performance (Global)

Problem Statement: APAC Performance (APAC cities)

Problem Statement: Performance Delta, US vs APAC

Problem Statement: Site Speed

• Site speed affects user engagement
• User engagement affects page views & transactions
• Bottom line: site speed has an impact on revenue

Solution: LinkedIn's Traffic Architecture

Solution: Fabric Shifting

• Stickyrouting (see the sketch below)
  • A Hadoop job calculates a primary and secondary datacenter for each user based on location
  • This data is stored in a key-value store (Espresso)
  • Stickyrouting serves this information over a RESTful interface to our edge PoPs
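
A minimal sketch of the lookup an edge PoP might do against Stickyrouting. The endpoint path and response fields are assumptions; the deck only says the data is served over a RESTful interface.

```python
import requests

# Hypothetical internal endpoint; the real API shape is not described in the talk.
STICKYROUTING_URL = "https://stickyrouting.internal.example.com"

def fabrics_for_member(member_id: int) -> tuple[str, str]:
    """Read the precomputed (primary, secondary) fabric for a member.

    The assignment is calculated offline by a Hadoop job and stored in a
    key-value store (Espresso); this call only fetches the result.
    """
    resp = requests.get(f"{STICKYROUTING_URL}/fabrics/{member_id}", timeout=2.0)
    resp.raise_for_status()
    body = resp.json()
    return body["primary"], body["secondary"]
```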


Solution: Fabric Shifting

• Different traffic types are partitioned and controlled separately:
  • Logged-in vs logged-out
  • CDNs
  • Monitoring
  • Microsites
• Logged-in users are placed into 'buckets'
• Buckets are marked online/offline to move site traffic (see the sketch below)
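
To make bucket-based shifting concrete, here is an illustrative sketch. The bucket count, hashing scheme and class names are assumptions, not LinkedIn's actual implementation; the idea is only that buckets are stable per member and can be taken offline one at a time.

```python
NUM_BUCKETS = 1000

def bucket_for(member_id: int) -> int:
    # Stable assignment: a member always lands in the same bucket.
    return member_id % NUM_BUCKETS

class BucketControl:
    """Marks buckets offline so their traffic moves to the members' secondary fabric."""

    def __init__(self) -> None:
        self.offline: set[int] = set()

    def drain(self, buckets: list[int]) -> None:
        # Draining bucket by bucket makes the shift gradual and controllable.
        self.offline.update(buckets)

    def route(self, member_id: int, primary: str, secondary: str) -> str:
        return secondary if bucket_for(member_id) in self.offline else primary
```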


Solution: Fabric Shifting

• Stickyrouting benefits:
  • Ensures we serve requests as close to the user as possible
  • Capacity management for datacenters: we can assign a percentage of users to a datacenter (see the sketch below)
  • Enables personal data routing (PDR): only store data where we need it
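
A sketch of what percentage-based assignment could look like, ignoring the location factor for brevity. Fabric names and shares are invented; the deck states only that a percentage of users can be assigned to a datacenter.

```python
# Hypothetical capacity shares per fabric (must sum to 1.0).
CAPACITY_SHARE = {"fabric-west": 0.40, "fabric-east": 0.35, "fabric-central": 0.25}

def assign_fabric(member_id: int) -> str:
    # Map the member deterministically onto [0, 1), then walk the shares.
    point = (member_id % 10_000) / 10_000
    cumulative = 0.0
    for fabric, share in CAPACITY_SHARE.items():
        cumulative += share
        if point < cumulative:
            return fabric
    return next(reversed(CAPACITY_SHARE))  # guard against float rounding
```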

Solution: Fabric Shifting Automation

Solution: Fabric Shifting Loadtests

Solution: LinkedIn's Traffic Architecture


Solution: LinkedIn's PoP Distribution


Solution: LinkedIn's PoP Architecture

• Using IPVS, each PoP announces a unicast address and a regional anycast address
  • APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the 'best' PoP
• DNS provides users with either an anycast or a unicast address for www.linkedin.com (see the sketch below)
  • US and EU members are nearly all on anycast
  • APAC is all unicast
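
The steering policy above can be summarized in a few lines. This is an illustration only: the region mapping is from the slide, but the addresses are documentation-range placeholders and the function is not LinkedIn's code.

```python
# Hypothetical VIPs (documentation address ranges).
ANYCAST_VIP = {"NAMER": "203.0.113.1", "EU": "203.0.113.2"}
UNICAST_VIP = {"singapore": "198.51.100.1", "mumbai": "198.51.100.2",
               "hongkong": "198.51.100.3"}

def dns_answer(resolver_region: str, best_pop: str) -> str:
    # US/EU resolvers get the regional anycast record; APAC resolvers get
    # the unicast address of the PoP picked for them.
    if resolver_region in ANYCAST_VIP:
        return ANYCAST_VIP[resolver_region]
    return UNICAST_VIP[best_pop]
```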


Solution: LinkedIn's PoP DR

• Sometimes we need to fail out of PoPs:
  • 3rd-party provider issues (e.g. transit links going down)
  • Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on the proxy to drain unicast traffic (see the sketch below)
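
A minimal sketch of the healthcheck-based drain, assuming a flag-file convention and port that are invented here: once an operator creates the flag, the check returns 503 and upstream load balancing stops sending unicast traffic to the PoP.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_FLAG = "/var/run/pop-drain"   # hypothetical path

class Healthcheck(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Fail the check while the drain flag exists; pass otherwise.
        status = 503 if os.path.exists(DRAIN_FLAG) else 200
        self.send_response(status)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Healthcheck).serve_forever()
```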


Solution: LinkedIn's PoP Performance

• PoP DNS steering
  • LinkedIn currently uses GeoDNS for routing
  • Piloting RUM DNS: pick the best PoP based on network, not country
• CDN steering (see the sketch below)
  • Mix CDNs to get the best performance
  • Constantly evaluate performance and availability
  • Automatically adjust CDN weighting
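
One plausible reading of "automatically adjust CDN weighting", sketched below: recompute weights from recent RUM latency so faster CDNs serve a larger share. The CDN names, the inverse-latency rule and the numbers are all assumptions.

```python
import random

def reweight(median_latency_ms: dict[str, float]) -> dict[str, float]:
    # Weight each CDN by inverse latency, normalized to sum to 1.
    inverse = {cdn: 1.0 / ms for cdn, ms in median_latency_ms.items()}
    total = sum(inverse.values())
    return {cdn: w / total for cdn, w in inverse.items()}

def pick_cdn(weights: dict[str, float]) -> str:
    return random.choices(list(weights), weights=list(weights.values()))[0]

weights = reweight({"cdn-a": 80.0, "cdn-b": 120.0})  # hypothetical RUM medians
print(weights)   # the faster CDN ends up with ~0.6 of the traffic
print(pick_cdn(weights))
```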


Solution: LinkedIn's PoP Performance

[Chart: US CDN request time, 50th percentile, over 24 hours]


APAC Challenges: Working Around Fiber Cuts

• Case study: failing out of the India PoP due to fiber cuts

[Chart: connection time for Indian members, 90th percentile]


APAC Challenges: GeoDNS Suboptimal PoPs

[Diagram: traffic from ASN15802 and ASN5384 reaching the Singapore and Mumbai PoPs, with link latencies of 45ms, 220ms and 70ms. ASN15802's RTT to Singapore is (220 + 70) = 290ms, all at the 50th percentile.]

Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg
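
This case is the motivation for the RUM DNS pilot above: choose the PoP by measured RTT instead of geography. A toy version, where the probe values are assumed apart from the figures taken from the diagram:

```python
# Assumed per-network probe results; the 290ms Singapore figure is the
# (220 + 70) path from the slide, the others are illustrative.
measured_rtt_ms = {"singapore": 290, "mumbai": 45, "hongkong": 160}

# Answer DNS with the PoP that is actually fastest for this network.
best_pop = min(measured_rtt_ms, key=measured_rtt_ms.get)
print(best_pop)   # mumbai
```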


APAC Challenges: GeoDNS Suboptimal PoPs

[Diagram: RTTs from ASN15802 and ASN5384 to PoPs in London, Dublin, Singapore, Mumbai and Hong Kong; labeled latencies include 160ms, 45ms, 70ms, 35ms, 350ms and 160ms.]


APAC Challenges: GeoDNS Suboptimal PoPs

[Chart: y-axis values range from 600 to 1200]

IPv4 vs IPv6: Performance & Adoption

• IPv6 performs better for our members
  • Fewer request timeouts on IPv6 for mobile users
  • Mobile carriers are adopting IPv6 faster
  • A win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6


Conclusion: Key Takeaways

• Application-level traffic engineering is extremely important for content providers

• RUM data is extremely useful for finding anomalies

• Route traffic based on performance, not just location

• IPv6 performs better for LinkedIn users


Questions?
