APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale
Michael Kehoe, Staff Site Reliability Engineer, LinkedIn

TRANSCRIPT

Page 1: Trafficshifting: Avoiding Disasters & Improving Performance at Scale

Michael Kehoe, Staff Site Reliability Engineer, LinkedIn

Page 2: Overview

• Problem Statement
• Solution: how LinkedIn trafficshifts
  • Datacenter shifting
  • PoP steering
• Challenges of the APAC region
• IPv4 vs IPv6
• Questions

Page 3: $ whoami

• Staff Site Reliability Engineer (SRE) @ LinkedIn
• Production-SRE team
• Funny accent = Australian + 3 years American

Page 4: $ whatis SRE

• Site Reliability Engineering
• Operations for the production application environment
• Responsibilities include:
  • Architecture design
  • Capacity planning
  • Operations
  • Tooling
• Also responsible for DNS/CDN management & traffic infrastructure

Page 5: Terminology

• PoP: point of presence, where LinkedIn terminates incoming requests
• Fabric: a datacenter with the full LinkedIn production stack deployed
• Loadtest: a stress test of a Fabric, to simulate a disaster scenario

Page 6: Problem Statement: Disaster Recovery

• Fail between Fabrics
  • Performance of applications is degraded
  • Validate disaster recovery (DR) scenarios
  • Expose bugs and suboptimal configurations via loadtests
  • Planned maintenance
• Fail between PoPs
  • Mitigate the impact of third-party provider maintenance/failure (e.g. transport links)
  • Software/configuration bugs

Page 7: Problem Statement: Performance

• Fabric assignment (a toy sketch follows this list)
  • Assign a preferred and a secondary fabric to every member based on:
    • Member location
    • Capacity
• PoP/CDN steering
  • Use GeoDNS to steer users to the 'best' PoP
  • Use RUM DNS to steer users to the 'best' CDN
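To make the assignment rule concrete, here is a toy sketch in Python of picking a preferred and a secondary fabric from member-to-fabric RTT plus capacity headroom. The fabric names, the 20% headroom floor and the function itself are invented for illustration; this is not LinkedIn's actual algorithm.

# Toy sketch: choose (preferred, secondary) fabric for a member.
# Fabrics without spare capacity are skipped; the rest are ranked by RTT.
def assign_fabrics(member_rtt_ms, headroom, min_headroom=0.2):
    """member_rtt_ms: fabric -> RTT from the member (ms);
    headroom: fabric -> spare capacity as a 0-1 fraction."""
    usable = [f for f in member_rtt_ms if headroom[f] >= min_headroom]
    ranked = sorted(usable, key=lambda f: member_rtt_ms[f])
    return ranked[0], ranked[1]

# Example: 'fabric-c' is near capacity, so it is skipped.
print(assign_fabrics({"fabric-w": 20, "fabric-e": 80, "fabric-c": 50},
                     {"fabric-w": 0.4, "fabric-e": 0.5, "fabric-c": 0.1}))
# -> ('fabric-w', 'fabric-e')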

Page 8: Problem Statement: United States Performance (Global)

Page 9: Problem Statement: APAC Performance (APAC cities)

Page 10: Problem Statement: Delta Between US & APAC

Page 11: Problem Statement: Site Speed

• Site Speed affects user engagement
• User engagement affects page views & transactions
• Bottom line: Site Speed has an impact on revenue

Page 12: Solution: LinkedIn's Traffic Architecture

Page 13: Solution: LinkedIn's Traffic Architecture

Page 14: Solution: Fabric Shifting

• Stickyrouting
  • Using a Hadoop job, we calculate a primary and a secondary datacenter for each user based on location
  • This data is stored in a key-value store (Espresso)
  • Stickyrouting serves this information over a RESTful interface to our edge PoPs (a sketch of such a lookup follows)
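As a rough sketch of that lookup path: an edge PoP could query the stickyrouting service per member and fall back to the secondary fabric when the primary is offline. The endpoint URL, response fields and fallback logic here are hypothetical, not LinkedIn's actual API.

import requests

# Hypothetical stickyrouting endpoint queried from an edge PoP.
STICKYROUTING_URL = "http://stickyrouting.internal/v1/members"

def fabric_for(member_id):
    """Fetch a member's datacenter assignment and pick the fabric to route to."""
    resp = requests.get(f"{STICKYROUTING_URL}/{member_id}", timeout=0.2)
    resp.raise_for_status()
    a = resp.json()  # e.g. {"primary": "fabric-a", "secondary": "fabric-b", "primary_online": true}
    return a["primary"] if a.get("primary_online", True) else a["secondary"]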

Page 15: Solution: Fabric Shifting

• Different traffic types are partitioned and controlled separately:
  • Logged-in vs logged-out
  • CDNs
  • Monitoring
  • Microsites
• Logged-in users are placed into 'buckets'
  • Buckets are marked online/offline to move site traffic (see the sketch below)
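A toy model of the bucket mechanism; the bucket count, hashing and function names are invented for illustration.

# Logged-in members hash into buckets; each bucket is served by
# whichever fabric it is currently assigned to.
NUM_BUCKETS = 100  # invented; the real count is an implementation detail

bucket_fabric = {b: "fabric-a" for b in range(NUM_BUCKETS)}  # bucket -> fabric

def route(member_id):
    return bucket_fabric[hash(member_id) % NUM_BUCKETS]

def shift(percent, src, dst):
    """Move `percent` of src's buckets to dst (mark offline in src, online in dst)."""
    candidates = [b for b, f in bucket_fabric.items() if f == src]
    for b in candidates[: len(candidates) * percent // 100]:
        bucket_fabric[b] = dst

shift(50, "fabric-a", "fabric-b")  # e.g. drain half of fabric-a for a loadtest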

Page 16: Solution: Fabric Shifting

• Stickyrouting benefits:
  • Ensures we serve each request as close to the user as possible
  • Capacity management for datacenters: we can assign a percentage of users to a datacenter
  • Enables personal data routing (PDR): only store data where we need it

Page 17: Solution: Fabric Shifting Automation

Page 18: Solution: Fabric Shifting Automation

Page 19: Solution: Fabric Shifting

Page 20: Solution: Fabric Shifting Loadtests

Page 21: Solution: Fabric Shifting Loadtests

Page 22: Solution: LinkedIn's Traffic Architecture

Page 23: Solution: LinkedIn's PoP Distribution

Page 24: Solution: LinkedIn's PoP Architecture

• Using IPVS, each PoP announces a unicast address and a regional anycast address
  • APAC, EU and NAMER anycast regions
• Use GeoDNS to steer users to the 'best' PoP
  • DNS provides users with either an anycast or a unicast address for www.linkedin.com (a sketch of this decision follows)
  • US and EU traffic is nearly all anycast
  • APAC is all unicast
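A minimal sketch of that DNS decision: anycast answers for US/EU clients, per-PoP unicast answers for APAC. The region table and the addresses (RFC 5737 test ranges) are invented placeholders, not LinkedIn's real records.

# Choose the A record for www.linkedin.com by client region.
ANYCAST = {"NAMER": "192.0.2.1", "EU": "192.0.2.2"}                # invented
UNICAST = {"singapore": "203.0.113.10", "mumbai": "203.0.113.20"}  # invented

def resolve_www(client_region, nearest_pop):
    if client_region in ANYCAST:      # US/EU: hand out the regional anycast address
        return ANYCAST[client_region]
    return UNICAST[nearest_pop]       # APAC: hand out a specific PoP's unicast address

print(resolve_www("NAMER", "singapore"))  # -> 192.0.2.1
print(resolve_www("APAC", "mumbai"))      # -> 203.0.113.20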

Page 25: Solution: LinkedIn's PoP DR

• Sometimes we need to fail out of PoPs:
  • Third-party provider issues (e.g. transit links going down)
  • Infrastructure maintenance
• Withdraw anycast route announcements
• Fail healthchecks on the proxy to drain unicast traffic (a sketch of this pattern follows)
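One common way to implement the healthcheck side of that drain, sketched with the Python standard library (a generic pattern, not LinkedIn's code): the proxy exposes a healthcheck endpoint that an operator flips to 'draining', so upstream checks fail and unicast traffic moves to another PoP.

from http.server import BaseHTTPRequestHandler, HTTPServer

DRAINING = False  # an operator flips this to True to fail out of the PoP

class Healthcheck(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 makes the load balancer / DNS healthcheck fail,
        # which drains unicast traffic away from this PoP.
        self.send_response(503 if DRAINING else 200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), Healthcheck).serve_forever()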

Page 26: Solution: LinkedIn's PoP Performance

• PoP DNS steering
  • LinkedIn currently uses GeoDNS for routing
  • Piloting RUM DNS: pick the best PoP based on network measurements, not country
• CDN steering (a toy weighting sketch follows)
  • Mix CDNs to get the best performance
  • Constantly evaluate performance/availability
  • Automatically adjust CDN weighting
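A toy version of RUM-driven CDN weighting; weighting each CDN by inverse median latency is an assumption chosen for illustration, not LinkedIn's actual algorithm.

from statistics import median

def cdn_weights(rum_latency_ms):
    """rum_latency_ms: CDN name -> list of RUM latency samples (ms).
    Returns normalized weights that favor lower median latency."""
    inverse = {cdn: 1.0 / median(s) for cdn, s in rum_latency_ms.items()}
    total = sum(inverse.values())
    return {cdn: w / total for cdn, w in inverse.items()}

# Roughly a 60/40 split when cdn_a's median latency is ~2/3 of cdn_b's:
print(cdn_weights({"cdn_a": [80, 85, 90], "cdn_b": [120, 125, 130]}))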

Page 27: Solution: LinkedIn's PoP Performance

[Chart: US CDN request time, 50th percentile, over 24 hours]

Page 28: APAC Challenges: Working Around Fiber Cuts

• Case Study: Fail out of India PoP due to fiber cuts

[Chart: Connection time for Indian members, 90th percentile]

Page 29: APAC Challenges: GeoDNS Suboptimal PoPs

[Map: ASN15802 and ASN5384 relative to the Singapore and Mumbai PoPs, with path latencies of 45ms, 220ms and 70ms. ASN15802's RTT to Singapore is (220 + 70) = 290ms, all at the 50th percentile.]

Source: http://www.submarinecablemap.com/#/submarine-cable/bay-of-bengal-gateway-bbg

Page 30: APAC Challenges: GeoDNS Suboptimal PoPs

[Map: path latencies from ASN15802 and ASN5384 to the London, Dublin, Singapore, Mumbai and Hong Kong PoPs: 160ms, 45ms, 70ms, 35ms and 350ms, with Hong Kong at 160ms.]

Page 31: APAC Challenges: GeoDNS Suboptimal PoPs


Page 32: IPv4 vs IPv6: Performance & Adoption

• IPv6 performs better for our members
  • Fewer request timeouts on IPv6 for mobile users
  • Mobile carriers are adopting IPv6 faster
  • A win for LinkedIn and our members!
• In July 2014 (IPv6 launch): 3% of traffic was IPv6
• Today: ~12% of traffic is IPv6

Page 33: Conclusion: Key Takeaways

• Application-level traffic engineering is extremely important for content providers

• RUM data is extremely useful for finding anomalies

• Route traffic based on performance, not just location

• IPv6 performs better for LinkedIn users

Page 34: Questions?

Page 35