advancing the art of internet edge outage detection (slides) › richterp ›...
Post on 06-Jul-2020
1 Views
Preview:
TRANSCRIPT
Advancing the Art of Internet Edge Outage Detection
Philipp Richter MIT / Akamai
Ramakrishna Padmanabhan University of Maryland
Neil Spring University of Maryland
Arthur Berger Akamai / MIT
David Clark MIT
ACM Internet Measurement Conference 2018
Internet Edge Outages
1
Internet Edge Outages
➡ Uninterrupted Internet availability becomes increasingly critical ➡ Growing interest in systems to detect Internet edge outages➡ Regulatory Bodies, Governments, ISPs, Academic Research
1
Internet Edge Outages
➡ Uninterrupted Internet availability becomes increasingly critical ➡ Growing interest in systems to detect Internet edge outages➡ Regulatory Bodies, Governments, ISPs, Academic Research
Goal: Track Internet edge outages on (i) a broad scale and (ii) a detailed level
1
Edge Outage Detection: Existing Approaches
• Detecting outages in the control plane? ➡ Edge outages often invisible in BGP
• Deploying hardware in end user premises? ➡ Potentially highly accurate, but does not scale
• Actively probing addresses? ➡ Challenging to scale, unresponsive addresses, difficult to interpret
2
Edge Outage Detection: Existing Approaches
• Detecting outages in the control plane? ➡ Edge outages often invisible in BGP
• Deploying hardware in end user premises? ➡ Potentially highly accurate, but does not scale
• Actively probing addresses? ➡ Challenging to scale, unresponsive addresses, difficult to interpret
This work: Passive edge outage detection based on CDN request patterns
2
Outline
•Disruption Detection
•Global View on Disruptions
•Device-Centric View on Disruptions
•U.S. Broadband Case Study
Disruption Detection using CDN Access Logs
200,000+ servers1500+ ASes120+ countries> 3 trillion daily requests
Dataset: Hourly request counts per IPv4 /24 address block
3
Disruption Detection using CDN Access Logs
200,000+ servers1500+ ASes120+ countries> 3 trillion daily requests
Dataset: Hourly request counts per IPv4 /24 address block
Assumption: Edge outages will be reflected in absence/reduction of CDN requests.
3
Active IP Addresses per /24
hours
activ
e IP
v4 a
ddre
sses
per
hou
r
1 168 336 504 672
050
100
150
200
250
typical residential /24 address block
4
Active IP Addresses per /24: Baseline Activity
hours
activ
e IP
v4 a
ddre
sses
per
hou
r
1 168 336 504 672
050
100
150
200
250
typical residential /24 address block
baseline value:93 IP addresses
Number of addresses contacting the CDN every hour never drops below ‘baseline value’ value per block.
4
Active IP Addresses per /24: Baseline Activity
hours
activ
e IP
v4 a
ddre
sses
per
hou
r
1 168 336 504 672
050
100
150
200
250
typical residential /24 address block
baseline value:93 IP addresses
home networkCDN
content requests,update requests,
widgets, beacons, etc.
icons: mavadee / flaticon4
Active IP Addresses per /24: Baseline Activity
hours
activ
e IP
v4 a
ddre
sses
per
hou
r
1 168 336 504 672
050
100
150
200
250
typical residential /24 address block
baseline value:93 IP addresses
home networkCDN
content requests,update requests,
widgets, beacons, etc.
• Some 2.3M /24 address blocks (83% of all client IPs) baseline > 40• Robust signal, largely independent of user-triggered activity• Dependent on a functioning network 4
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
Disruption Detection
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
Disruption Detection
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
activity in next hour violates baseline criterion
Disruption Detection
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
baseline b1: sliding window
beta * min [last 168 hours]
Disruption Detection
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
Disruption Detection
baseline b1: sliding window
beta * min [last 168 hours]
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
new baseline found
Disruption Detection
baseline b1: sliding window
beta * min [last 168 hours]
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
}
non-steadystate
Disruption Detection
baseline b1: sliding window
beta * min [last 168 hours]
5
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
baseline b0: sliding window
alpha * min [last 168 hours]
}
Disruption Detection
non-steadystate
baseline b1: sliding window
beta * min [last 168 hours]
5
Did this Address Block really go offline?
hour [starting 2017−08−30]
activ
e IP
v4 a
ddre
sses
0 50 100 150 200 250 300 350
050
100
150
ICMP responsiveCDN active
6
Did this Address Block really go offline?
hour [starting 2017−08−30]
activ
e IP
v4 a
ddre
sses
0 50 100 150 200 250 300 350
050
100
150
ICMP responsiveCDN active
➡ “Address Space Survey Data” (provided by ISI) ➡ Ping every address in 1% of the allocated IPv4 /24s every 11 mins➡ Some address blocks show a very steady number of responsive IPs
6
hour [starting 2017−08−30]
activ
e IP
v4 a
ddre
sses
0 50 100 150 200 250 300 350
050
100
150
ICMP responsiveCDN active
Did this Address Block really go offline?
➡ “Address Space Survey Data” (provided by ISI) ➡ Ping every address in 1% of the allocated IPv4 /24s every 11 mins➡ Some address blocks show a very steady number of responsive IPs
6
CDN Disruptions vs. ICMP180.253.118.0/24
hour
activ
e IP
v4 a
ddre
sses
CDN active IPsICMP responsive IPs
0 168 336
050
100
150
200
250
170.250.45.0/24
hour
activ
e IP
v4 a
ddre
sses
CDN active IPsICMP responsive IPs
0 168 336
050
100
150
200
250
68.200.4.0/24
hour
activ
e IP
v4 a
ddre
sses
CDN active IPsICMP responsive IPs
0 168 336
050
100
150
200
250
73.204.202.0/24
hour
activ
e IP
v4 a
ddre
sses
CDN active IPsICMP responsive IPs
0 168 336
050
100
150
200
250
6
Global View on Disruptions
Apr 2017 May 2017 Jun 2017 Jul 2017 Aug 2017 Sep 2017 Oct 2017 Nov 2017 Dec 2017 Jan 2018 Feb 2018
0
4K(~0.35%)
8K(~0.7%)
hour
ly di
srup
ted
/24s partial /24 disrupted
entire /24 disrupted
1.5M disruption events over the course of one year
8
Global View on Disruptions
• At least 0.2% of the monitored space faces a disruption• Natural disasters, intended Internet shutdowns• Weekly pattern, mostly absent during Christmas/NYE
Apr 2017 May 2017 Jun 2017 Jul 2017 Aug 2017 Sep 2017 Oct 2017 Nov 2017 Dec 2017 Jan 2018 Feb 2018
0
4K(~0.35%)
8K(~0.7%)
hour
ly di
srup
ted
/24s partial /24 disrupted
entire /24 disrupted
1.5M disruption events over the course of one year
8
Global View on Disruptions
• At least 0.2% of the monitored space faces a disruption• Natural disasters, intended Internet shutdowns• Weekly pattern, mostly absent during Christmas/NYE
Apr 2017 May 2017 Jun 2017 Jul 2017 Aug 2017 Sep 2017 Oct 2017 Nov 2017 Dec 2017 Jan 2018 Feb 2018
0
4K(~0.35%)
8K(~0.7%)
hour
ly di
srup
ted
/24s partial /24 disrupted
entire /24 disrupted
1.5M disruption events over the course of one year
Hurricane Irma
8
Weekly Pattern
Mon Tue Wed Thu Fri Sat Sun
day of the week [local time]
fract
ion
of d
isru
ptio
n ev
ents
0.00
0.05
0.10
0.15
0.20
allentire /24
9
Weekly Pattern
Mon Tue Wed Thu Fri Sat Sun
day of the week [local time]
fract
ion
of d
isru
ptio
n ev
ents
0.00
0.05
0.10
0.15
0.20
allentire /24
●
●●
●
●
●
●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
start hour [local time]
fract
ion
of d
isru
ptio
n ev
ents
1 3 5 7 9 11 13 15 17 19 21 23
0.00
0.05
0.10
0.15
●
●
●
●
●
●
●● ●
● ●● ●
● ● ● ●● ● ● ● ● ●
●
●
●
allentire /24
9
Weekly Pattern: Scheduled Maintenance Window
Mon Tue Wed Thu Fri Sat Sun
day of the week [local time]
fract
ion
of d
isru
ptio
n ev
ents
0.00
0.05
0.10
0.15
0.20
allentire /24
●
●●
●
●
●
●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
start hour [local time]
fract
ion
of d
isru
ptio
n ev
ents
1 3 5 7 9 11 13 15 17 19 21 23
0.00
0.05
0.10
0.15
●
●
●
●
●
●
●● ●
● ●● ●
● ● ● ●● ● ● ● ● ●
●
●
●
allentire /24
9
Outline
•Disruption Detection
•Global View on Disruptions
•Device-Centric View on Disruptions
•U.S. Broadband Case Study
Device Perspective on Disruptions
hours
activ
e IP
v4 a
ddre
sses
1 168 336 504 672
050
100
150
200
➡ Established that this address block lost connectivity➡ Do devices really lose Internet connectivity? ➡ Or can they connect from somewhere else?
?
10
Device-Specific Dataset for Subset of Users
softwareID A software
ID Bsoftware
ID C
10
softwareID A software
ID Bsoftware
ID C
➡ Device-information for 52K “entire /24” disruption events➡ Only consider disruptions which affected an entire /24
(no activity during the disruption)
Device-Specific Dataset for Subset of Users
disruption in1.2.3.0/24
IPbefore
∈ 1.2.3.0/24
time
IPduring
∉ 1.2.3.0/24IPafter
∈/∉ 1.2.3.0/24
10
Device Perspective on Disruptions
disruptions with active IDs < 1hr before
N=52K
11
Device Perspective on Disruptions
disruptions with active IDs < 1hr before
N=52K
no active IDs during disruption
86%
expected case, devicesdo not have connectivity
11
Device Perspective on Disruptions
disruptions with active IDs < 1hr before
N=52K
no active IDs during disruption
86%
expected case, devicesdo not have connectivity
active IDs during disruption
from different IP address
14%
11
Device Perspective on Disruptions
disruptions with active IDs < 1hr before
N=52K
no active IDs during disruption
86%
expected case, devicesdo not have connectivity
active IDs during disruption
from different IP address
14%
switch to/from cellular switch AS
20% 13%
mobility and/or multi-homed devices
11
Device Perspective on Disruptions
disruptions with active IDs < 1hr before
N=52K
no active IDs during disruption
86%
expected case, devicesdo not have connectivity
active IDs during disruption
from different IP address
14%
switch to/from cellular switch AS same AS
20% 13%
mobility and/or multi-homed devices
67%
? (10% of disruptions)
11
Anti-Disruptions: Temporary Surges in Address Activity
IPs
in d
isru
pted
/alte
rnat
e /2
4
0 20 40 60 80 100
200
100
010
020
0
disrupted /24alternate /24
time [hours]
ID XYZ
ID XYZ
12
Anti-Disruptions: Temporary Surges in Address Activity
IPs
in d
isru
pted
/alte
rnat
e /2
4
0 20 40 60 80 100
200
100
010
020
0
disrupted /24alternate /24
time [hours]
➡ Some disruptions not service outages but address reassignment!
➡ Developed mechanism to detect anti-disruptions➡ Rank ASes by correlation of disruptions with anti-disruptions
ID XYZ
ID XYZ
12
AS-Level Disruptions and Anti-Disruptions
(ant
i−) d
isru
pted
IPs
010
0K25
0K
(ant
i−) d
isru
pted
IPs
40K
040
K
major US cable ISP: No correlation between disruptions and anti-disruptions (pearson r= 0.02)
Uruguayan ISP: Strong correlation between disruptions and anti-disruptions (pearson r= 0.63)
13
AS-Level Disruptions and Anti-Disruptions
(ant
i−) d
isru
pted
IPs
010
0K25
0K
(ant
i−) d
isru
pted
IPs
40K
040
K
major US cable ISP: No correlation between disruptions and anti-disruptions (pearson r= 0.02)
Uruguayan ISP: Strong correlation between disruptions and anti-disruptions (pearson r= 0.63)
Anti-Disruptions can heavily skew per-AS and per-country assessment of Internet reliability 13
Case Study: US Broadband ISPs
ISP A Cable
ISP BCable
ISP CCable
ISP DDSL
ISP EDSL
ISP FDSL
ISP GDSL
anti-disruptioncorrelation 0.22 0.03 -0.02 0.03 0.00 -0.04 0.05
% disruptions w/ intermittent activity 4% 1% 1% 0% 3% 6% 14%
% /24s only disruptedduring Hurricane Irma 11% 1% 2% 23% 1% 0% 3%
% /24s only disrupted maintenance window 67% 54% 75% 29% 60% 71% 62%
14
Case Study: US Broadband ISPs
ISP A Cable
ISP BCable
ISP CCable
ISP DDSL
ISP EDSL
ISP FDSL
ISP GDSL
anti-disruptioncorrelation 0.22 0.03 -0.02 0.03 0.00 -0.04 0.05
% disruptions w/ intermittent activity 4% 1% 1% 0% 3% 6% 14%
% /24s only disruptedduring Hurricane Irma 11% 1% 2% 23% 1% 0% 3%
% /24s only disrupted maintenance window 67% 54% 75% 29% 60% 71% 62%
Most major US broadband ISPs do not show strong signs of anti-disruptions. 14
Case Study: US Broadband ISPs
ISP A Cable
ISP BCable
ISP CCable
ISP DDSL
ISP EDSL
ISP FDSL
ISP GDSL
anti-disruptioncorrelation 0.22 0.03 -0.02 0.03 0.00 -0.04 0.05
% disruptions w/ intermittent activity 4% 1% 1% 0% 3% 6% 14%
% /24s only disrupted maintenance window 67% 54% 75% 29% 60% 71% 62%
% /24s only disruptedduring Hurricane Irma 11% 1% 2% 23% 1% 0% 3%
All (but one) ISP has majority of all disrupted /24s duringscheduled maintenance window (Mo-Fr Midnight-6AM). 14
Case Study: US Broadband ISPs
ISP A Cable
ISP BCable
ISP CCable
ISP DDSL
ISP EDSL
ISP FDSL
ISP GDSL
anti-disruptioncorrelation 0.22 0.03 -0.02 0.03 0.00 -0.04 0.05
% disruptions w/ intermittent activity 4% 1% 1% 0% 3% 6% 14%
% /24s only disrupted maintenance window 67% 54% 75% 29% 60% 71% 62%
% /24s only disruptedduring Hurricane Irma 11% 1% 2% 23% 1% 0% 3%
ISP A: Some 80% of all disruptions fall in the maintenance window, or are caused by force majeure (Hurricane Irma).
Relevant for SLAs, policies, and reliability assessment. 14
Implications
Apr 2017 May 2017 Jun 2017 Jul 2017 Aug 2017 Sep 2017 Oct 2017 Nov 2017 Dec 2017 Jan 2018 Feb 2018
0
4K(~0.35%)
8K(~0.7%)
hour
ly di
srup
ted
/24s
partial /24 disruptedentire /24 disrupted
• Outage Detection: Methodological Insights • Baseline activity enables fine-granular detection of disruptions• Anti-disruptions due to reassignment
➡ Can bias active and passive outage detection systems
• Interpreting Service Outages • Majority of outages (for many ISPs) during scheduled maintenance
➡ Implications for SLAs, reporting requirements, regulations
15
top related