
Post on 06-Jul-2020


Advancing the Art of Internet Edge Outage Detection

Philipp Richter MIT / Akamai

Ramakrishna Padmanabhan University of Maryland

Neil Spring University of Maryland

Arthur Berger Akamai / MIT

David Clark MIT

ACM Internet Measurement Conference 2018

Internet Edge Outages

➡ Uninterrupted Internet availability becomes increasingly critical
➡ Growing interest in systems to detect Internet edge outages
➡ Regulatory bodies, governments, ISPs, academic research

Goal: Track Internet edge outages on (i) a broad scale and (ii) a detailed level

Edge Outage Detection: Existing Approaches

• Detecting outages in the control plane?
➡ Edge outages often invisible in BGP

• Deploying hardware in end user premises?
➡ Potentially highly accurate, but does not scale

• Actively probing addresses?
➡ Challenging to scale, unresponsive addresses, difficult to interpret

This work: Passive edge outage detection based on CDN request patterns

Outline

• Disruption Detection

• Global View on Disruptions

• Device-Centric View on Disruptions

• U.S. Broadband Case Study

Disruption Detection using CDN Access Logs

200,000+ servers, 1500+ ASes, 120+ countries, > 3 trillion daily requests

Dataset: Hourly request counts per IPv4 /24 address block

Assumption: Edge outages will be reflected in absence/reduction of CDN requests.
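To make the dataset concrete, here is a minimal sketch of how raw log records could be reduced to hourly per-/24 activity. The record layout (an epoch timestamp plus a client IPv4 string) and the function name `hourly_active_per_block` are assumptions for illustration, not the CDN's actual log format or pipeline. The sketch counts distinct active addresses per hour, which is the quantity plotted in the per-/24 figures of this deck.

```python
import ipaddress
from collections import defaultdict


def hourly_active_per_block(records):
    """Count distinct active IPv4 addresses per (/24 block, hour).

    `records` is an iterable of (epoch_seconds, ipv4_string) pairs;
    this layout is a hypothetical stand-in for real access logs.
    """
    seen = defaultdict(set)
    for ts, ip in records:
        hour = int(ts) // 3600  # hour index since the epoch
        # strict=False masks the host bits, yielding the covering /24
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        seen[(str(block), hour)].add(ip)
    return {key: len(ips) for key, ips in seen.items()}


records = [
    (0, "192.0.2.10"),
    (120, "192.0.2.10"),   # same address, same hour: counted once
    (300, "192.0.2.77"),
    (3700, "192.0.2.10"),  # falls into the next hour
    (10, "198.51.100.5"),
]
counts = hourly_active_per_block(records)
```

Keying on the (block, hour) pair keeps the result a flat dictionary that can be turned into one time series per /24.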

Active IP Addresses per /24: Baseline Activity

[Figure: active IPv4 addresses per hour (y-axis 0 to 250) over hours 1 to 672 for a typical residential /24 address block; baseline value: 93 IP addresses]

The number of addresses contacting the CDN every hour never drops below a per-block "baseline value".

[Diagram: home network to CDN: content requests, update requests, widgets, beacons, etc. Icons: mavadee / flaticon]

• Some 2.3M /24 address blocks (83% of all client IPs) have a baseline > 40
• Robust signal, largely independent of user-triggered activity
• Dependent on a functioning network
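The baseline notion can be sketched in a few lines. The sketch assumes a list of hourly active-address counts for one /24, takes the baseline as the minimum over the observation period, and treats the "> 40" figure from the slide as a cutoff. The helper names `baseline_value` and `is_trackable` are hypothetical.

```python
def baseline_value(hourly_counts):
    """Lowest number of active addresses seen in any hour."""
    if not hourly_counts:
        raise ValueError("need at least one hourly count")
    return min(hourly_counts)


def is_trackable(hourly_counts, min_baseline=40):
    """True if the block's baseline exceeds `min_baseline` (assumed cutoff)."""
    return baseline_value(hourly_counts) > min_baseline


# Toy diurnal pattern: activity oscillates but never drops below 93.
week = [93 + (h % 24) * 5 for h in range(168)]
print(baseline_value(week), is_trackable(week))
```

Because the minimum ignores daytime peaks, the value reflects always-on background traffic rather than user-triggered activity.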

Disruption Detection

[Figure: active IPv4 addresses per hour (y-axis 0 to 200) over hours 1 to 672 for one /24, annotated with the steps below]

• Baseline b0: sliding window, alpha * min [last 168 hours]
• Activity in next hour violates baseline criterion
• Non-steady state
• Baseline b1: sliding window, beta * min [last 168 hours]
• New baseline found
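A minimal sketch of the first step of the sliding-window criterion, under stated assumptions: the input is a list of hourly active-address counts, the baseline b0 for an hour is the minimum of the preceding 168 hours, and the hour violates the criterion if its count falls below alpha * b0. The value alpha = 0.5 and the function name `first_violation` are illustrative choices, not taken from the slides. The search for a new baseline b1 with beta is not covered here.

```python
def first_violation(counts, alpha=0.5, window=168):
    """Return the first hour index whose activity violates the baseline.

    The baseline b0 for hour t is the minimum of the preceding `window`
    hours; hour t violates it if counts[t] < alpha * b0.
    Returns None if no hour violates the criterion.
    alpha=0.5 is an illustrative value, not taken from the slides.
    """
    for t in range(window, len(counts)):
        b0 = min(counts[t - window:t])
        if counts[t] < alpha * b0:
            return t
    return None


# One steady week, a mild dip (95), then a sharp drop (20).
series = [100] * 168 + [95, 20, 100]
print(first_violation(series))  # prints 169
```

The mild dip to 95 lowers the baseline slightly but does not trigger detection; only the drop to 20 falls below half of the window minimum.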

Did this Address Block really go offline?

[Figure: active IPv4 addresses per hour (y-axis 0 to 150) over hours 0 to 350, starting 2017-08-30; series: ICMP responsive, CDN active]

➡ "Address Space Survey Data" (provided by ISI)
➡ Ping every address in 1% of the allocated IPv4 /24s every 11 mins
➡ Some address blocks show a very steady number of responsive IPs

CDN Disruptions vs. ICMP

[Figure: four panels, one per address block (180.253.118.0/24, 170.250.45.0/24, 68.200.4.0/24, 73.204.202.0/24); each plots active IPv4 addresses per hour (y-axis 0 to 250) over hours 0 to 336; series: CDN active IPs, ICMP responsive IPs]

Global View on Disruptions

[Figure: hourly disrupted /24s, Apr 2017 to Feb 2018; y-axis 0 to 8K (~0.7%), with 4K at ~0.35%; series: partial /24 disrupted, entire /24 disrupted; annotation: Hurricane Irma]

1.5M disruption events over the course of one year

• At least 0.2% of the monitored space faces a disruption
• Natural disasters, intended Internet shutdowns
• Weekly pattern, mostly absent during Christmas/NYE

Weekly Pattern: Scheduled Maintenance Window

[Figure: fraction of disruption events (y-axis 0.00 to 0.20) by day of the week, Mon to Sun, local time; series: all, entire /24]

[Figure: fraction of disruption events (y-axis 0.00 to 0.15) by start hour, 1 to 23, local time; series: all, entire /24]

Outline

• Disruption Detection

• Global View on Disruptions

• Device-Centric View on Disruptions

• U.S. Broadband Case Study

Device Perspective on Disruptions

[Figure: active IPv4 addresses per hour (y-axis 0 to 200) over hours 1 to 672 for one /24]

➡ Established that this address block lost connectivity
➡ Do devices really lose Internet connectivity?
➡ Or can they connect from somewhere else?

Device-Specific Dataset for Subset of Users

[Diagram: devices identified by software ID A, software ID B, software ID C]

➡ Device information for 52K "entire /24" disruption events
➡ Only consider disruptions which affected an entire /24 (no activity during the disruption)

[Diagram: timeline of a disruption in 1.2.3.0/24: IP before ∈ 1.2.3.0/24, IP during ∉ 1.2.3.0/24, IP after ∈/∉ 1.2.3.0/24]
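The before/during membership test from the timeline can be sketched with Python's standard `ipaddress` module. The function name `classify_device` and the returned labels are hypothetical, and the sketch assumes each device contributes one observed address before the disruption and at most one during it.

```python
import ipaddress


def classify_device(ip_before, ip_during, disrupted_block):
    """Label one device's behaviour around a disruption.

    `ip_during` is None when the device was not seen at all while the
    block was disrupted. Labels are hypothetical, for illustration.
    """
    block = ipaddress.ip_network(disrupted_block)
    if ipaddress.ip_address(ip_before) not in block:
        return "not-in-block-before"
    if ip_during is None:
        return "inactive-during"
    if ipaddress.ip_address(ip_during) in block:
        # Would contradict an "entire /24" disruption
        return "active-in-block"
    return "active-elsewhere"


print(classify_device("1.2.3.4", None, "1.2.3.0/24"))
print(classify_device("1.2.3.4", "5.6.7.8", "1.2.3.0/24"))
```

The "active-elsewhere" case is the interesting one: the device kept connectivity from an address outside the disrupted block.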

Device Perspective on Disruptions

Disruptions with active IDs < 1hr before: N=52K

• No active IDs during disruption: 86%
  ➡ Expected case, devices do not have connectivity
• Active IDs during disruption from different IP address: 14%
  • Switch to/from cellular (20%) or switch AS (13%): mobility and/or multi-homed devices
  • Same AS (67%): ? (10% of disruptions)
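The "10% of disruptions" figure can be checked by multiplying the two levels of the breakdown. This short calculation assumes the "?" annotation belongs to the same-AS branch, and it uses the rounded percentages shown on the slide, so the result (about 9.4%) only approximately matches the quoted 10%.

```python
total_events = 52_000          # N on the slide (rounded)
share_active_elsewhere = 0.14  # active IDs during disruption, other IP
split = {
    "switch to/from cellular": 0.20,
    "switch AS": 0.13,
    "same AS": 0.67,
}

# Share of ALL disruption events that falls into each sub-case.
overall = {label: share_active_elsewhere * frac for label, frac in split.items()}
for label, frac in overall.items():
    print(f"{label}: {frac:.1%} of all disruptions, ~{frac * total_events:.0f} events")
```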

Anti-Disruptions: Temporary Surges in Address Activity

[Figure: IPs in disrupted /24 and in alternate /24 over time (hours 0 to 100); series: disrupted /24, alternate /24; device "ID XYZ" is marked on both]

➡ Some disruptions are not service outages but address reassignment!
➡ Developed mechanism to detect anti-disruptions
➡ Rank ASes by correlation of disruptions with anti-disruptions

AS-Level Disruptions and Anti-Disruptions

[Figure: two panels of (anti-)disrupted IPs over time, one per ISP]

• Major US cable ISP: No correlation between disruptions and anti-disruptions (Pearson r = 0.02)
• Uruguayan ISP: Strong correlation between disruptions and anti-disruptions (Pearson r = 0.63)

Anti-disruptions can heavily skew per-AS and per-country assessment of Internet reliability
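The Pearson coefficient quoted for the two ISPs can be computed as sketched below. The sketch assumes two aligned per-AS time series (disrupted IPs and anti-disrupted IPs per time bin). How the series are actually binned and aligned is not stated on the slides, and the sample numbers are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up counts of disrupted and anti-disrupted IPs per time bin in one AS.
disrupted = [0, 120, 0, 300, 10, 0]
anti_disrupted = [5, 110, 0, 280, 0, 10]
r = pearson_r(disrupted, anti_disrupted)
print(round(r, 2))
```

An AS where surges in one set of blocks coincide with drops in another yields a coefficient close to 1, which hints at address reassignment rather than loss of service.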

Case Study: US Broadband ISPs

Metric                                            ISP A  ISP B  ISP C  ISP D  ISP E  ISP F  ISP G
Access technology                                 Cable  Cable  Cable  DSL    DSL    DSL    DSL
Anti-disruption correlation                       0.22   0.03   -0.02  0.03   0.00   -0.04  0.05
% disruptions w/ intermittent activity            4%     1%     1%     0%     3%     6%     14%
% /24s only disrupted during maintenance window   67%    54%    75%    29%    60%    71%    62%
% /24s only disrupted during Hurricane Irma       11%    1%     2%     23%    1%     0%     3%

• Most major US broadband ISPs do not show strong signs of anti-disruptions.
• All ISPs but one have the majority of their disrupted /24s during the scheduled maintenance window (Mo-Fr Midnight-6AM).
• ISP A: Some 80% of all disruptions fall in the maintenance window, or are caused by force majeure (Hurricane Irma).

Relevant for SLAs, policies, and reliability assessment.
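The maintenance-window test (Mo-Fr Midnight-6AM) can be sketched as a small predicate. The sketch assumes the disruption start time has already been converted to the local time of the affected block and that 6 AM itself is outside the window. Both are assumptions, as is the function name `in_maintenance_window`.

```python
from datetime import datetime


def in_maintenance_window(local_start):
    """True if a disruption starts Mo-Fr between midnight and 6 AM.

    `local_start` is a naive datetime already expressed in the local
    time of the affected block (the conversion is out of scope here).
    """
    is_weekday = local_start.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and 0 <= local_start.hour < 6


print(in_maintenance_window(datetime(2017, 9, 12, 2, 30)))  # a Tuesday
print(in_maintenance_window(datetime(2017, 9, 16, 2, 30)))  # a Saturday
```

Applying the predicate to every disruption of a /24 and checking whether all of them return True gives a per-block flag of the kind summarized in the "maintenance window" row.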

Implications

[Figure: hourly disrupted /24s, Apr 2017 to Feb 2018; series: partial /24 disrupted, entire /24 disrupted]

• Outage Detection: Methodological Insights
  • Baseline activity enables fine-grained detection of disruptions
  • Anti-disruptions due to reassignment
  ➡ Can bias active and passive outage detection systems

• Interpreting Service Outages
  • Majority of outages (for many ISPs) during scheduled maintenance
  ➡ Implications for SLAs, reporting requirements, regulations
