blink: fast connectivity recovery entirely in the data plane · blink: fast connectivity recovery...
TRANSCRIPT
Blink: Fast Connectivity Recovery Entirely in the Data Plane
Joint work with
Edgar Costa Molero Maria Apostolaki Stefano VissicchioAlberto DainottiLaurent Vanbever
ETH Zürich ETH Zürich University College LondonCAIDA, UC San DiegoETH Zürich
Thomas HolterbachETH Zürich
NSDI26th February 2019
https://blink.ethz.ch
2
www.opte.org
AS level topologyin 2015
3
www.opte.org
AS level topologyin 2015
4
Your network
www.opte.org
AS level topologyin 2015
5
Your networkLocal
www.opte.org
AS level topologyin 2015
6
Remote
Remote
RemoteRemote
Remote
Remote
Your networkLocal
Upon local failures, connectivity can be quickly restored
7
Upon local failures, connectivity can be quickly restored
8
Fast failure detectionusing e.g., hardware-generated signals
Fast traffic reroutingusing e.g., Prefix Independent Convergenceor MPLS Fast Reroute
Upon remote failures, the only way to restore connectivity isto wait for the Internet to converge
9
10
… and the Internet converges very slowly*
*Holterbach et al. SWIFT: Predictive Fast RerouteACM SIGCOMM, 2017
Upon remote failures, the only way to restore connectivity isto wait for the Internet to converge
11
12
0 100 200 300 400 500 600Time difference (s) between the
outage and the first and last withdrawal
0.0
0.2
0.4
0.6
0.8
1.0
CD
F ov
er th
e BG
P pe
ers First
withdrawal
Lastwithdrawal
AS11427
Time difference between the outage
CDF over the BGP peers
Time difference between the outageand the BGP withdrawals (s)
BGP took minutes to converge upon the Time Warner Cable outage in 2014
13
0 100 200 300 400 500 600Time difference (s) between the
outage and the first and last withdrawal
0.0
0.2
0.4
0.6
0.8
1.0
CD
F ov
er th
e BG
P pe
ers First
withdrawal
Lastwithdrawal
AS11427
Time difference between the outage
CDF over the BGP peers
Time difference between the outageand the BGP withdrawals (s)
BGP took minutes to converge upon the Time Warner Cable outage in 2014
Control-plane (e.g., BGP) based techniques typically converge slowlyupon remote outages
14
15
What about using data-plane signals for fast rerouting?
Control-plane (e.g., BGP) based techniques typically converge slowlyupon remote outages
16
Blink: Fast Connectivity Recovery Entirely in the Data Plane
Thomas HolterbachETH Zürich
NSDI26th February 2019
https://blink.ethz.ch
Outline
4. Blink works in practice, on existing devices
1. Why and how to use data-plane signals for fast rerouting
2. Blink infers more than 80% of the failures, often within 1s
3. Blink quickly reroutes traffic to working backup paths
17
Outline
4. Blink works in practice, on existing devices
1. Why and how to use data-plane signals for fast rerouting
2. Blink infers more than 80% of the failures, often within 1s
3. Blink quickly reroutes traffic to working backup paths
18
TCP flows exhibit the same behavior upon failures
19
A:1000
S:500
source destination
20
TCP flows exhibit the same behavior upon failures
A:1000
failure
S:500
source destination
21
TCP flows exhibit the same behavior upon failures
cwnd:4 pkts
S:4100
t
S:3100
S:2100
S:1000
A:1000
failure
(=congestion window)
S:500
source destination
22
TCP flows exhibit the same behavior upon failures
RTO: 200ms cwnd:4 pkts
S:4100
t
S:3100
S:2100
S:1000
A:1000
failure
(=congestion window)
S:500
source destination
23
TCP flows exhibit the same behavior upon failures
Retransmission timeout (RTO) = SRTT + 4∗RTT_VAR
RTO: 200ms cwnd:4 pkts
S:4100
t
t + 200ms cwnd:1
S:3100
S:2100
S:1000
A:1000
failure
(=congestion window)
S:500
source destination
24
TCP flows exhibit the same behavior upon failures
S:1000
Retransmission timeout (RTO) = SRTT + 4∗RTT_VAR
RTO: 200ms cwnd:4 pkts
S:4100
t
t + 200ms cwnd:1
cwnd:1
S:3100
S:2100
S:1000
A:1000
failure
exponential backoff
(=congestion window)
t + 600ms
……
S:500
source destination
25
TCP flows exhibit the same behavior upon failures
S:1000
S:1000
Retransmission timeout (RTO) = SRTT + 4∗RTT_VAR
RTO: 200ms cwnd:4 pkts
S:4100
t
t + 200ms cwnd:1
cwnd:1
S:3100
S:2100
S:1000
A:1000
failure
exponential backoff
(=congestion window)
t + 600ms
…
……
…
t + 1400ms cwnd:1
S:500
source destination
26
TCP flows exhibit the same behavior upon failures
S:1000
S:1000
S:1000
Retransmission timeout (RTO) = SRTT + 4∗RTT_VAR
When multiple flows experience the same failure the signal is a wave of retransmissions
27
*CAIDA equinix-chicagodirection A, 2015
28
Same RTT distributionthan in a real trace*
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
29
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
Number ofretransmissions
Time (s)
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
30
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
Number ofretransmissions
Time (s)
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
31
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
Number ofretransmissions
Time (s)
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
32
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
Number ofretransmissions
Time (s)
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
33
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
Time (s)
Number ofretransmissions
34
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
Time (s)
Number ofretransmissions
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
Outline
4. Blink works in practice, on existing devices
1. Why and how to use data-plane signals for fast rerouting
2. Blink infers more than 80% of the failures, often within 1s
3. Blink quickly reroutes traffic to working backup paths
35
To detect failures, Blink looks at TCP retransmissions
36
To detect failures, Blink looks at TCP retransmissionsProblem: TCP retransmissions can be unrelated to a failure (i.e., noise)
37
38
number of retransmissions
Time
To detect failures, Blink looks at TCP retransmissionsProblem: TCP retransmissions can be unrelated to a failure (i.e., noise)
39
number of retransmissions
Time
congestions
To detect failures, Blink looks at TCP retransmissionsProblem: TCP retransmissions can be unrelated to a failure (i.e., noise)
40
number of retransmissions
Time
one "bogus" flow
To detect failures, Blink looks at TCP retransmissionsProblem: TCP retransmissions can be unrelated to a failure (i.e., noise)
41
number of retransmissions
Time
failure
To detect failures, Blink looks at TCP retransmissionsProblem: TCP retransmissions can be unrelated to a failure (i.e., noise)
Solution #1: Blink looks at consecutive packets with the same sequence number
RTO: 200ms cwnd:4 pkts
S:4100
t
t + 200ms cwnd:1
cwnd:1
S:3100
S:2100
S:1000
A:1000
failure
exponential backoff
(=congestion window)
t + 600ms
…
……
…
t + 1400ms cwnd:1
S:500
source destination
S:1000
S:1000
S:1000
Retransmission timeout (RTO) = SRTT + 4∗RTT_VAR
Solution #1: Blink looks at consecutive packets with the same sequence number
44
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
45
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
46
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
47
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
48
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
49
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
50
number of retransmissions
Time
number of flowsexperiencingretransmissions
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
congestionsone "bogus" flow
failure
51
number of retransmissions
Time
number of flowsexperiencingretransmissions
800ms
congestionsone "bogus" flow
failure
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
52
number of retransmissions
Time
number of flowsexperiencingretransmissions
congestionsone "bogus" flow
failure
Solution #2: Blink monitors the number of flows experiencingretransmissions over time using a sliding window
53
Blink is intended to run in programmable switches
54
Blink is intended to run in programmable switchesProblem: those switches have very limited resources
Solution #1: Blink focuses on the popular prefixes,i.e., the ones that attract data traffic
55
56
As Internet traffic follows a Zipf-like distribution* (1k pref. account for >50%),Blink covers the vast majority of the Internet traffic
*Sarra et al. Leveraging Zipf’s Law for Traffic offloadingACM CCR, 2012
Solution #1: Blink focuses on the popular prefixes,i.e., the ones that attract data traffic
Solution #2: Blink monitors a sample of the flowsfor each monitored prefix
57
TCP flows
Traffic to a destination prefix
58
TCP flows
Traffic to a destination prefix
Solution #2: Blink monitors a sample of the flowsfor each monitored prefix
default 64 flowsmonitored
To monitor active flows, Blink evicts a flow from the sampleif it does not send a packet for a given time (default 2s)
59
60
and selects a new one in a first-seen, first-selected manner
To monitor active flows, Blink evicts a flow from the sampleif it does not send a packet for a given time (default 2s)
61
Blink infers a failure for a prefix when the majority ofthe monitored flows experience retransmissions
62
Time
number of flowsexperiencingretransmissions
Blink infers a failure for a prefix when the majority ofthe monitored flows experience retransmissions
63
Time
number of flowsexperiencingretransmissions
32
FAILURE
Blink infers a failure for a prefix when the majority ofthe monitored flows experience retransmissions
We evaluated Blink failure inference using 15 real traces,13 from CAIDA, 2 from MAWI, covering a total of 15.8 hours
We evaluated Blink failure inference using 15 real traces,13 from CAIDA, 2 from MAWI, covering a total of 15.8 hours
We are interested in:
Accuracy: True Positive Rate vs False Positive Rate
Speed: How long does Blink take to infer failures
As we do not have ground truth, we generated synthetic tracesfollowing the traffic characteristics extracted from the real traces
66
67
Step #1 - We extracted the RTT, Packet rate, Flow durationfrom the real traces
Step #2 - We used NS3 to replay these flowsand simulate a failure
Step #3 - We ran a Python-based version of Blinkon the resulting traces
As we do not have ground truth, we generated synthetic tracesfollowing the traffic characteristics extracted from the real traces
12
34
56
78
910
1112
1314
15Trace ID
0.0
0.2
0.4
0.6
0.8Tr
ue P
ositi
ve R
ate
68
Blink failure inference accuracy is above 80% for 13 real traces out of 15
Real traces ID
True Positive Rate
12
34
56
78
910
1112
1314
15Trace ID
0.0
0.2
0.4
0.6
0.8Tr
ue P
ositi
ve R
ate
69
Real traces ID
True Positive Rate
Blink failure inference accuracy is above 80% for 13 real traces out of 15
70
Blink avoids incorrectly inferring failures when packet loss is below 4%
packet loss % 1 2 3 4 5 8 9…
False Positive Rate
71
Blink avoids incorrectly inferring failures when packet loss is below 4%
packet loss % 1 2 3 4 5 8 9
0 0 0 0.67 0.67 1.3 2.7False Positive Rate …
…
72
Blink infers a failure within 1s for the majority of the cases
12
34
56
78
910
1112
1314
15Trace ID
0
2
4
6Sp
eed
(s)
1s
Real traces ID
Speed (s)
12
34
56
78
910
1112
1314
15Trace ID
0
2
4
6Sp
eed
(s)
1s
73
Blink infers a failure within 1s for the majority of the cases
Real traces ID
Speed (s)
Outline
4. Blink works in practice, on existing devices
1. Why and how to use data-plane signals for fast rerouting
2. Blink infers more than 80% of the failures, often within 1s
3. Blink quickly reroutes traffic to working backup paths
74
75
Upon detection of a failure, Blink immediately activatesbackup paths pre-populated by the control-plane
Problem: since the rerouting is done entirely in the data-plane,Blink cannot prevent forwarding issues
77
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
Problem: since the rerouting is done entirely in the data-plane,Blink cannot prevent forwarding issues
AS4
Problem: since the rerouting is done entirely in the data-plane,Blink cannot prevent forwarding issues
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
AS4
Problem: since the rerouting is done entirely in the data-plane,Blink cannot prevent forwarding issues
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
AS4
BLACKHOLE
Problem: since the rerouting is done entirely in the data-plane,Blink cannot prevent forwarding issues
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
AS4
LOOP
81
Solution: As for failures, Blink uses data-plane signalsto pick a working backup path
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
AS4
32 monitoredflows
32 monitored flows +the non-monitored ones
The probing periodlasts up to 1s
Solution: As for failures, Blink uses data-plane signalsto pick a working backup path
AS3 (backup #2)
AS2(backup #1)
AS1 (primary)
Blink
destination
AS4
Solution: As for failures, Blink uses data-plane signalsto pick a working backup path
As for failures, Blink compares the sequence number ofconsecutive packets to detect blackholes or loops*
84
*See the paper for an evaluation of the rerouting
Outline
4. Blink works in practice, on existing devices
1. Why and how to use data-plane signals for fast rerouting
2. Blink infers more than 80% of the failures, often within 1s
3. Blink quickly reroutes traffic to working backup paths
85
86
We ran Blink on the 15 real traces (15.8 hours)
87
We ran Blink on the 15 real traces (15.8 hours)and it detected 6 outages, each affecting at least 42% of all the flows
88
On current programmable switches, Blink supports up to 10k prefixes
On current programmable switches, Blink supports up to 10k prefixes
89
Number of prefixes
Memory
90
Number of prefixes
Memory
1 pref.
6418 bits
On current programmable switches, Blink supports up to 10k prefixes
91
Number of prefixes
Memory
1 pref.
6418 bits
8 Mb
10k pref.
On current programmable switches, Blink supports up to 10k prefixes
Blink works on a real Barefoot Tofino switch
Blink works on a real Barefoot Tofino switch
TOFINO
Blink works on a real Barefoot Tofino switch
TOFINO
Source
Destination
Number of packetsevery 100ms
Time (s)
RTTs in [10ms; 300ms]
95
Blink works on a real Barefoot Tofino switch
Number of packetsevery 100ms
Time (s)
RTTs in [10ms; 300ms]
96
Blink works on a real Barefoot Tofino switch
Number of packetsevery 100ms
Time (s)
RTTs in [10ms; 300ms]
97
1.1s
Blink works on a real Barefoot Tofino switch
Blink: Fast Connectivity Recovery Entirely in the Data Plane
Infers failures from data-plane signalswith more than 80% accuracy, and often within 1s
Fast reroutes traffic at line rateto working backup paths
Works on real traffic traces and on existing devices
https://blink.ethz.ch
Blink: Fast Connectivity Recovery Entirely in the Data Plane
Joint work with
Edgar Costa Molero Maria Apostolaki Stefano VissicchioAlberto DainottiLaurent Vanbever
ETH Zürich ETH Zürich University College LondonCAIDA, UC San DiegoETH Zürich
Thomas HolterbachETH Zürich
NSDI26th February 2019
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
100
*CAIDA equinix-chicagodirection A, 2015
Same RTT distributionthan in a real trace*
Time (s)
Number ofretransmissions
We simulated a failure affecting100k flows with NS3
When multiple flows experience the same failure the signal is a wave of retransmissions
RTT x1.5
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
101
Time (s)
When multiple flows experience the same failure the signal is a wave of retransmissions
RTT x1.5
0 1 2 3 4 5 6 7Time (s)
0
10K
20K
30K
40K
50K
60K
70K
Num
ber o
f ret
rans
mis
sion
s
Failu
re
102
Blink failure inference accuracy is close to a best case scenario,and is above 80% for 13 real traces out of 15
12
34
56
78
910
1112
1314
15Trace ID
0.0
0.2
0.4
0.6
0.8
1.0
True
Pos
itive
Rat
e
"best case", i.e.,no sampling butthreshold still 32
12
34
56
78
910
1112
1314
15Trace ID
0.0
0.2
0.4
0.6
0.8
1.0
True
Pos
itive
Rat
eBlink
12
34
56
78
910
1112
1314
15Trace ID
0
2
4
6Sp
eed
(s)
1s
103
Blink infers a failure within 1s for the majority of the cases
"best case", i.e.,no sampling butthreshold still 32
Blink
Real traces ID
Speed (s)
104
Blink avoids incorrectly inferring failures when packet loss is below 4%
packet loss % 1 2 3 4 5 8 9
0 0 0 0.67 0.67 1.3 2.7
False Positive Rate
Blink
no sampling butthreshold still 32 59 85 93 94 95 97
…
…
… 98
Blink quickly infers and avoids forwarding loops
Time (s)
Number of packetsevery 100ms