detailed analysis of isis routing protocol on the qwest ... · isis basics detection link up/down...
Post on 13-Apr-2020
4 Views
Preview:
TRANSCRIPT
Packet Design 1
Detailed Analysis of ISISRouting Protocol on the Qwest Backbone:
Cengiz Alaettinoglucengiz@packetdesign.com
Stephen Casnercasner@packetdesign.com
A recipe for subsecond ISIS convergence
Packet Design 2
Why subsecond convergence?
Increased network reliability
Support for multi−service trafficVoice over IP, ATM over IP, TDM over IP, ...
Lower cost/complexity compared to layer 2 protection schemes like SONET
Packet Design 3
Where are we today?
Current IP re−route times are typically tens of seconds
We need to do better. There are two choices:
Figure out what’s wrong with IP routing and fix it
Replace IP routing with something else
Packet Design 4
Qwest Backbone ISIS Analysis
Collected multiple week−long ISIS packet traces
Identified problem areas:
causes of ISIS churn and stability
sequence of events and delays during routing convergence
Conclude with a recipe for achieving subsecond ISIS convergence
Packet Design 5
Monitoring Qwest ISIS Routing
Multi−vendor backbone: Ciscos, Junipers, ...All point−to−point backbone: OC48, OC192, ...
R
Collection Host
Collection Host
RBurbank Houston
tracerouteevery 5s
ISISHELLOs +capture
UDP stream (tg)packet per avg 6 msec
skgig−eth
Collection host ispassive peer,
sends no LSPs
QwestBackbone
Packet Design 6
Our Typical Path
6 hops
R1 R2
R4 R1
R2R1
BurbankHouston
Los Angeles
R3
R1Sunyvale
Packet Design 7
ISIS Basics
DetectionLink up/down or peer reachabilityHardware detection is fast & preferredSoftware detection using an HELLO protocol is slower but is a backup
PropagationFlood a Link State Packet (LSP)Link propagation delays + per hop processing delayRate limiting may slow propagation
New Route ComputationRun Dijkstra’s Shortest Path First (SPF) algorithm CPU resource intensiveRate limiting may delay SPF computation & consistency
Packet Design 8
ISIS Churn and Instability
Churn: number of LSPs received over a time periodrequires SPF calculationsconsumes CPU resources
Busy CPU may cause HELLO packet missescan falsely bring adjacencies downincreases churn
Instabilitychurn => busy CPU => HELLO misses => more churn => ...rate limits are for avoiding this instability
Packet Design 9
HELLO Packets
Excellent HELLO behavior
0
2000
4000
6000
8000
10000
Oct 06 Oct 07 Oct 08 Oct 09 Oct 10 Oct 11 Oct 12
Inte
rval
bet
wee
n he
llo p
acke
ts (
mse
cs)
Packet Design 10
Churn
Very stable network:Average rate for total churn: 1 LSP every 2 seconds97.6% of the LSPs are state refreshes
0
20
40
60
80
100
Oct 06 Oct 07 Oct 08 Oct 09 Oct 10 Oct 11 Oct 12
Num
ber
of L
SP
s pe
r m
inut
e
Total Churn
0
20
40
60
80
100
Oct 06 Oct 07 Oct 08 Oct 09 Oct 10 Oct 11 Oct 12N
umbe
r of
LS
Ps
per
min
ute
Physical Churn
High churn region
Packet Design 11
Churn by Routers & Links
About 800 LSPs per week per router is for refreshesThis can be configured to be less
0
500
1000
1500
2000
2500
3000
3500
4000
0 20 40 60 80 100
Num
ber
of L
SP
s ge
nera
ted
Percentage of Routers
0
100
200
300
400
500
600
700
800
900
0 20 40 60 80 100
Num
ber
of F
laps
Percentage of Links
Problem routers
Problem links
Packet Design 12
One Atypical Router
This router is responsible for the initial high churn
62000
62500
63000
63500
64000
64500
65000
65500
66000
Oct 06 Oct 07 Oct 08 Oct 09 Oct 10 Oct 11 Oct 12
LSP
seq
uenc
e nu
mbe
r
110 times its usual churn
Packet Design 13
Up
Down
06:00 06:10 06:20 06:30 06:40 06:50 07:00
September 24, 2001
An Unstable Link
No protection against instabilityGoes on for a day
Opposite of fast convergence requirement30 seconds to go down, 8 seconds to go up
Packet Design 14
Dealing with Instability
SPF & LSP propagation rate limits don’t reduce churndoes keep CPU from melting by ignoring change
To reduce churn without impacting convergence:
Asymetric up/down filters for fast convergence
detect bad news fast
slow down on good news
Adaptive filters
linear or exponential adaptation to level of instability
Less CPU intensive incremental SPF algorithms
Packet Design 15
An Example Adaptive Filter
An example exponential filter with 20 minute max penalty
It reduces the churn without hurting convergence
Up
Down
06:00:00 06:15:00 06:30:00 06:45:00 07:00:00
Up
Down
06:00:00 06:15:00 06:30:00 06:45:00 07:00:00
20 mins 20 mins
Packet Design 16
Qwest ISIS Stability Summary
The backbone is extremely stable
3 out of the 4 week−long data collection periods have no route change on our path
the churn is caused by few problem links
Convergence timesHard to find a link failure to diagnoseConvergence as fast as today’s technology allowsCan be improved to subsecond
Packet Design 17
ISIS Convergence Delay
Time from the physical change to new routing tables
Failure/repair detection
LSP propagation
Delay due to SPF−interval
SPF computation
Packet Design 18
What Happens During Convergence?
Routers perform SPF while their views of the network are not consistent, causing:
routing loops
black hole routes
suboptimal routes
If fast convergence, this is not an issue. But,Convergence times are not fastOn high speed links, lots of packets are affectedNew services are less tolerant
Packet Design 19
A Link Failure
0
0.02
07:05:40 07:05:55 07:06:10 07:06:25
Del
ay a
nd L
oss
October 10, 2001
Packet LossDelay
8.5 seconds
1310 packets R1 R2
R4 R1
R2R1
BurbankHouston
Los Angeles
R3
R1Sunyvale
Why so long?
Packet Design 20
Slow Link Failure Detection
0
0.02
07:05:40 07:05:55 07:06:10 07:06:25
Del
ay, L
oss,
and
LS
P r
ecep
tion
October 10, 2001
Packet LossDelay
Burbank LSPLos Angeles LSP
20 seconds, HELLO protocol
3 seconds, link layer detection
Packet Design 21
LSP Propagation Delay
0
200
400
600
800
1000
Oct 06 Oct 07 Oct 08 Oct 09 Oct 10 Oct 11 Oct 12
LSP
pro
paga
tion
times
(m
secs
)
Difference of 2 clocks
LSP propagations are rate limited
Their scheduling may be improved
R1 R2
R4 R1
R2R1
BurbankHouston
Los Angeles
R3
R1Sunyvale
Packet Design 22
SPF Rate Limiting
spf−interval parameter delays SPF computation
default: SPF computation after 5 seconds from the change
visible in our convergence delay example
Two goals:
to contain the CPU load
no more than one SPF computation per 5 seconds
to capture 2−4 LSPs reporting the failure in one SPF run
fails to do this in our case
Packet Design 23
Why Loss and Delay at Link Repair?
00.02
0.19
07:06:30 07:06:35 07:06:40 07:06:45
Del
ay a
nd P
acke
t Los
s
October 10, 2001
Packet LossDelay
R1 R2
R4 R1
R2R1
BurbankHouston
Los Angeles
R3
227 packets lost
SunyvaleR1
Routing loop?
1.5s
Packet Design 24
TTLs Confirm the Routing Loop
00.02
0.19
07:06:30 07:06:35 07:06:40 07:06:45
Del
ay, T
TL
October 10, 2001
DelayTTL
Longer path
Routing loopConvergence
1.6s 1.5s
ttl=8ttl=11 ttl=8
ttl=60,...,10
Packet Design 25
spf−interval Spreads SPFs
00.02
0.19
07:06:30 07:06:35 07:06:40 07:06:45
Del
ay, T
TL
and
LSP
rec
eptio
ns
October 10, 2001
DelayTTL
5secs
5secs
Router1 SPF
5secs
Router2 SPFRouter3 SPF
3 routers start timers after 3 different LSPs
Bbank’s lspLA’s lsp
Some lsp
Packet Design 26
SPF Computation Times
1
10
100
1000
10000
0 10 20 30 40 50 60 70 80 90 100
SP
F C
ompu
tatio
n T
imes
(m
icro
seco
nds)
Percentage of SPF runs
Dijkstra SPFIncremental SPF
Qwest topology and events using equal−cost multiple paths
On average 84 times faster
avg=13usec
avg=1069usec
Packet Design 27
SPF Scaling
0.01
0.1
1
10
100
1000
1000 2000 3000 4000 5000
SP
F C
ompu
tatio
n tim
es (
mill
isec
onds
)
Number of Routers
Dijkstra SPFIncremental SPF Using random
topology and events
Incremental SPF lets the network grow very large
Source: Haobo Yu.
Packet Design 28
Where does the time go?
Detection times were several seconds, must be improved
LSP Propagation times were subsecond, but still much larger than link propagation delays
SPF rate limitingspf−interval causes most of the convergence delay
spf−interval spreads SPFs, groups wrong set of LSPs into the same SPF
Packet Design 29
A Recipe for Subsecond ISIS Convergence
Step 1. Vendors: fast link failure detectionHardware detection is preferred
Vendors have fast failure detection solution for MPLS fast reroute
It will benefit convergence immediately
Step 2. Vendors: adaptive and asymmetric Up/Down filtersIt will reduce the ISIS churn w/o hurting convergence
Step 3. Operators: eliminate current LSP & SPF rate limitsAdaptive asymmetric filters make it safe
Step 4. Vendors: incremental SPF algorithmA must for avoiding CPU meltdowns even as the network gets bigger
Packet Design 30
Acknowledgements
We would like to thank Chris Sieber, Paul Mabey and Shankar Rao of Qwest for providing access to their network for our study and for their review of our results.
We would also like to thank Haobo Yu, Van Jacobson, Kathleen Nichols, and Kedar Poduri of Packet Design for their contributions to this work.
Packet Design 31
What can you do?
If you want subsecond IGP convergence,ask your vendor to implement this recipe.
top related