Packet Design 1
Detailed Analysis of ISIS Routing Protocol on the Qwest Backbone:
A recipe for subsecond ISIS convergence
Cengiz [email protected]
Stephen [email protected]
Why subsecond convergence?
Increased network reliability
Support for multi-service traffic
  Voice over IP, ATM over IP, TDM over IP, ...
Lower cost/complexity compared to layer 2 protection schemes like SONET
Where are we today?
Current IP re-route times are typically tens of seconds
We need to do better. There are two choices:
Figure out what’s wrong with IP routing and fix it
Replace IP routing with something else
Qwest Backbone ISIS Analysis
Collected multiple week-long ISIS packet traces
Identified problem areas:
causes of ISIS churn and instability
sequence of events and delays during routing convergence
Conclude with a recipe for achieving subsecond ISIS convergence
Qwest ISIS Findings Summary
The backbone is extremely stable
  3 out of the 4 week-long data collection periods have no route change on our path
  the churn is caused by a few problem links
Convergence times
  Hard to find a link failure to diagnose
  Convergence as fast as today’s technology allows
  Can be improved to subsecond
ISIS Routing Protocol 101: Link State Packets
An LSP describes the local topology of a router
  List of adjacencies
  Hello protocol
[Figure: network diagram with legend: router, link / adjacency]
ISIS Routing Protocol 101: Flooding
LSPs are flooded
  link propagation delays + per-hop processing delays
ISIS Routing Protocol 101: Shortest Path Calculation
A router constructs the current topology
Dijkstra’s SPF algorithm determines the routes
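The two steps on this slide can be sketched in a few lines of Python. This is a minimal textbook Dijkstra over a link-state database, not the code running in any router; the `lsdb` layout, the router names A to D, and the metrics are all hypothetical.

```python
import heapq

def spf(lsdb, root):
    """Return {router: (cost, first_hop)} computed from a link-state database.

    lsdb maps each router to the adjacencies listed in its LSP:
    {router: {neighbor: metric}}. An adjacency is used only when both
    ends report it (two-way check).
    """
    dist = {root: (0, None)}
    heap = [(0, root, None)]
    done = set()
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry, a shorter path was already settled
        done.add(node)
        for nbr, metric in lsdb.get(node, {}).items():
            if node not in lsdb.get(nbr, {}):
                continue  # neighbor does not report the adjacency back
            new_cost = cost + metric
            hop = nbr if node == root else first_hop
            if nbr not in dist or new_cost < dist[nbr][0]:
                dist[nbr] = (new_cost, hop)
                heapq.heappush(heap, (new_cost, nbr, hop))
    return dist

# Hypothetical four-router topology, one LSP per router.
lsdb = {
    "A": {"B": 10, "C": 5},
    "B": {"A": 10, "D": 5},
    "C": {"A": 5, "D": 20},
    "D": {"B": 5, "C": 20},
}
routes = spf(lsdb, "A")
```

In this example A reaches D through B at cost 15. If the B to D adjacency disappears from both LSPs, rerunning `spf` moves D to the path through C at cost 25.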
ISIS Routing Protocol 102: Convergence after a Topology Change
Detection
  Link up/down or peer reachability
  Hardware detection is fast & preferred
  Software detection using a HELLO protocol is slower but is a backup
Propagation
  Flood a Link State Packet (LSP)
  Link propagation delays + per-hop processing delay
  Rate limiting may slow propagation
New Route Computation
  Run Dijkstra’s Shortest Path First (SPF) algorithm
  CPU resource intensive
  Rate limiting may delay SPF computation & consistency
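To see why HELLO-based detection is the slow path, the sketch below models a hold timer: the adjacency is declared down once no HELLO has arrived for `hello_interval * hello_multiplier` seconds. The 10-second interval and multiplier of 3 are assumed values chosen for illustration, not settings read from the measured routers.

```python
def hold_time(hello_interval, hello_multiplier):
    """Seconds of HELLO silence after which the adjacency is declared down."""
    return hello_interval * hello_multiplier

def detection_delay(failure_time, last_hello_time,
                    hello_interval=10.0, hello_multiplier=3):
    """Seconds between the failure and the adjacency being declared down.

    last_hello_time is the arrival time of the last HELLO received
    before the link failed.
    """
    if last_hello_time > failure_time:
        raise ValueError("last HELLO must precede the failure")
    expiry = last_hello_time + hold_time(hello_interval, hello_multiplier)
    return expiry - failure_time

worst = detection_delay(failure_time=0.0, last_hello_time=0.0)  # fails right after a HELLO
best = detection_delay(failure_time=9.0, last_hello_time=0.0)   # fails just before the next one
```

Under these assumptions detection takes between roughly 20 and 30 seconds, which is why hardware (link layer) detection is preferred.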
Our Measurement Setup
Multi-vendor backbone: Ciscos, Junipers, ...
All point-to-point backbone: OC48, OC192, ...
[Figure: measurement setup. A collection host in Burbank and a collection host in Houston each attach over sk gig-eth to a router (R) on the Qwest backbone]
  traceroute every 5s
  ISIS HELLOs + capture
  UDP stream (tg): one packet per 6 msec on average
  Collection host is a passive peer, sends no LSPs
Our Typical Path
6 hops
[Figure: path diagram between Burbank and Houston, showing routers in Burbank, Los Angeles, Sunnyvale, and Houston]
ISIS Churn and Instability
Churn: number of LSPs received over a time period
  requires SPF calculations
  consumes CPU resources
Busy CPU may cause HELLO packet misses
  can falsely bring adjacencies down
  increases churn
Instability
  churn => busy CPU => HELLO misses => more churn => ...
  rate limits are for avoiding this instability
HELLO Packets
Excellent HELLO behavior
[Figure: interval between HELLO packets (msecs, 0 to 10000), Oct 06 to Oct 12]
Churn
Very stable network:
  Average rate for total churn: 1 LSP every 2 seconds
  97.6% of the LSPs are state refreshes
[Figure: two panels, Total Churn and Physical Churn, each plotting number of LSPs per minute (0 to 100), Oct 06 to Oct 12; a high churn region is marked]
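The two churn figures above can be turned into per-minute rates with simple arithmetic. The helper below is only a worked calculation; the function name is made up for this sketch.

```python
def churn_breakdown(seconds_per_lsp, refresh_fraction):
    """Return (total LSPs per minute, non-refresh LSPs per minute)."""
    lsps_per_minute = 60.0 / seconds_per_lsp
    non_refresh_per_minute = lsps_per_minute * (1.0 - refresh_fraction)
    return lsps_per_minute, non_refresh_per_minute

# 1 LSP every 2 seconds, 97.6% of them refreshes.
total, changes = churn_breakdown(2.0, 0.976)
print(f"{total:.0f} LSPs/min in total, {changes:.2f} LSPs/min are not refreshes")
```

At 1 LSP every 2 seconds the total is 30 LSPs per minute, and the remaining 2.4% amounts to about 0.72 LSPs per minute, or roughly 43 per hour network-wide.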
Churn by Routers & Links
About 800 LSPs per week per router are refreshes
  This can be configured to be less
[Figure: two panels. Number of LSPs generated (0 to 4000) vs. percentage of routers, with the problem routers marked; number of flaps (0 to 900) vs. percentage of links, with the problem links marked]
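The refresh figure implies a refresh period that can be backed out directly. The sketch assumes each router originates a single LSP, so that 800 refreshes per week means one refresh timer firing 800 times; the function names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600

def implied_refresh_interval(lsps_per_week):
    """Refresh period in seconds implied by a weekly refresh count."""
    return SECONDS_PER_WEEK / lsps_per_week

def refreshes_per_week(refresh_interval_s):
    """Weekly refresh count for a configured refresh period."""
    return SECONDS_PER_WEEK / refresh_interval_s

interval = implied_refresh_interval(800)   # 756 seconds, about 12.6 minutes
hourly = refreshes_per_week(3600)          # 168 refreshes per week
```

Under that assumption, 800 refreshes per week corresponds to a refresh roughly every 12.6 minutes, and stretching the period to one hour would cut the count to 168 per router per week.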
One Atypical Router
This router is responsible for the initial high churn
[Figure: LSP sequence number (62000 to 66000) for this router, Oct 06 to Oct 12]
110 times its usual churn
An Unstable Link
[Figure: link state (Up/Down) from 06:00 to 07:00, September 24, 2001]
No protection against instability
  Goes on for a day
Opposite of fast convergence requirement
  30 seconds to go down, 8 seconds to go up
Dealing with Instability
SPF & LSP propagation rate limits don’t reduce churn
  they do keep the CPU from melting by ignoring change
To reduce churn without impacting convergence:
Asymmetric up/down filters for fast convergence
  detect bad news fast
  slow down on good news
Adaptive filters
  linear or exponential adaptation to level of instability
Less CPU intensive incremental SPF algorithms
An Example Adaptive Filter
An example exponential filter with 20 minute max penalty
It reduces the churn without hurting convergence
[Figure: two Up/Down link-state timelines, 06:00:00 to 07:00:00, with two intervals labeled 20 mins]
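One way such a filter could work is sketched below. It is an illustration of the idea, not the filter used to produce the figure: down events are advertised immediately, up events are held for a delay that doubles on every flap and is capped at 20 minutes. The class name, the 1-second base delay, and the reset rule are assumptions of this sketch.

```python
class UpDownFilter:
    """Asymmetric, exponentially adaptive filter (hypothetical design)."""

    def __init__(self, base_delay=1.0, max_delay=1200.0, quiet_reset=1200.0):
        self.base_delay = base_delay
        self.max_delay = max_delay      # 20 minute cap on the penalty
        self.quiet_reset = quiet_reset  # stable period that clears the penalty
        self.delay = base_delay
        self.last_down = None

    def on_down(self, now):
        """Bad news is never delayed; repeated downs grow the up penalty."""
        if self.last_down is not None:
            if now - self.last_down >= self.quiet_reset:
                self.delay = self.base_delay
            else:
                self.delay = min(self.delay * 2, self.max_delay)
        self.last_down = now

    def on_up(self, now):
        """Return the earliest time the up event may be advertised."""
        return now + self.delay


def advertised_changes(events, flt):
    """events: time-ordered (time, 'up' or 'down'). Returns advertised changes."""
    out = []
    advertised_up = True   # assume the link starts out advertised as up
    pending_up = None
    for t, state in events:
        if pending_up is not None and pending_up <= t:
            out.append((pending_up, "up"))   # the held up event survived its penalty
            advertised_up = True
            pending_up = None
        if state == "down":
            pending_up = None                # an up that did not last is suppressed
            flt.on_down(t)
            if advertised_up:
                out.append((t, "down"))
                advertised_up = False
        else:
            pending_up = flt.on_up(t)
    if pending_up is not None:
        out.append((pending_up, "up"))
    return out


# Hypothetical flapping link: down every 60 s, back up 8 s later, 20 times.
flaps = [(60 * k + dt, s) for k in range(20) for dt, s in ((0, "down"), (8, "up"))]
changes = advertised_changes(flaps, UpDownFilter())
```

For this synthetic trace the 40 raw transitions collapse to 14 advertised ones, and every advertised down carries the timestamp of the event that caused it, so failure detection is not slowed.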
ISIS Convergence Delay
Time from the physical change to new routing tables
Failure/repair detection
LSP propagation
Delay due to spf-interval
SPF computation
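Since the convergence delay is the sum of these four components, a small budget calculation shows which one dominates. Of the example inputs, the 3-second detection time and the 5-second spf-interval are figures that appear elsewhere in this deck; the 0.2-second propagation time and 1-millisecond SPF time are placeholders chosen for illustration.

```python
def convergence_budget(detection, lsp_propagation, spf_wait, spf_compute):
    """Return (total_seconds, {component: share_of_total})."""
    parts = {
        "detection": detection,
        "lsp_propagation": lsp_propagation,
        "spf_wait": spf_wait,
        "spf_compute": spf_compute,
    }
    total = sum(parts.values())
    shares = {name: value / total for name, value in parts.items()}
    return total, shares

total, shares = convergence_budget(3.0, 0.2, 5.0, 0.001)
dominant = max(shares, key=shares.get)
```

With these inputs the total is about 8.2 seconds, and the wait imposed by spf-interval accounts for roughly 61% of it.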
What Happens During Convergence?
Routers perform SPF while their views of the network are not consistent, causing:
routing loops
black hole routes
suboptimal routes
If convergence is fast, this is not an issue. But:
  Convergence times are not fast
  On high speed links, lots of packets are affected
  New services are less tolerant
A Link Failure
[Figure: delay and packet loss, 07:05:40 to 07:06:25, October 10, 2001, alongside the Burbank to Houston path diagram (Los Angeles, Sunnyvale)]
8.5 seconds, 1310 packets
Why so long?
Slow Link Failure Detection
[Figure: delay, loss, and LSP reception, 07:05:40 to 07:06:25, October 10, 2001; arrivals of the Burbank LSP and the Los Angeles LSP are marked]
20 seconds, HELLO protocol
3 seconds, link layer detection
LSP Propagation Delay
[Figure: LSP propagation times (msecs, 0 to 1000), Oct 06 to Oct 12, annotated "Difference of 2 clocks", alongside the Burbank to Houston path diagram]
LSP propagations are rate limited
Their scheduling may be improved
SPF Rate Limiting
spf-interval parameter delays SPF computation
  default: SPF computation runs 5 seconds after the change
  visible in our convergence delay example
Two goals:
  to contain the CPU load
    no more than one SPF computation per 5 seconds
  to capture 2-4 LSPs reporting the failure in one SPF run
    fails to do this in our case
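The batching goal, and how it can fail, can be illustrated with a simplified model of the timer: the first LSP starts a 5-second timer, and the SPF run at expiry covers every LSP received by then. This model is an assumption made for the sketch; actual vendor timers may behave differently. The arrival times 3 s and 20 s mirror the two detection times reported for the link failure.

```python
def spf_runs(lsp_arrivals, spf_interval=5.0):
    """Return [(run_time, [arrival times covered])] under a hold-then-compute timer."""
    runs = []
    batch, deadline = [], None
    for t in sorted(lsp_arrivals):
        if deadline is not None and t > deadline:
            runs.append((deadline, batch))   # timer expired before this LSP arrived
            batch, deadline = [], None
        if deadline is None:
            deadline = t + spf_interval      # first LSP of a batch starts the timer
        batch.append(t)
    if batch:
        runs.append((deadline, batch))
    return runs

apart = spf_runs([3.0, 20.0])   # two LSPs 17 s apart: two separate SPF runs
close = spf_runs([0.5, 1.0])    # two LSPs 0.5 s apart: one SPF run covers both
```

When the two ends of a failed link report it 17 seconds apart, a 5-second window cannot batch them: each LSP triggers its own SPF run, and each run is still delayed by the full interval.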
Why Loss and Delay at Link Repair?
[Figure: delay (axis marks at 0, 0.02, and 0.19) and packet loss, 07:06:30 to 07:06:45, October 10, 2001, alongside the Burbank to Houston path diagram (Los Angeles, Sunnyvale)]
227 packets lost over 1.5s
Routing loop?
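The loss counts can be cross-checked against the probe rate. The measurement stream sent about one packet every 6 msec on average, so a packet count converts to an approximate outage length and back. This is only a sanity check on the reported numbers; the function names are made up for the sketch.

```python
def outage_seconds(packets_lost, packet_interval_s=0.006):
    """Approximate outage length implied by a loss count at the nominal probe rate."""
    return packets_lost * packet_interval_s

def implied_interval_ms(packets_lost, outage_s):
    """Average probe spacing implied by a loss count and a measured outage."""
    return 1000.0 * outage_s / packets_lost

failure_ms = implied_interval_ms(1310, 8.5)   # link failure: 1310 packets, 8.5 s
repair_ms = implied_interval_ms(227, 1.5)     # link repair: 227 packets, 1.5 s
```

Both events imply a spacing of about 6.5 to 6.6 msec per packet, which is in line with the stated average of 6 msec.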
TTLs Confirm the Routing Loop
[Figure: delay and TTL, 07:06:30 to 07:06:45, October 10, 2001. Annotations in order: ttl=8; longer path (1.6s, ttl=11); routing loop (1.5s, ttl=60,...,10); convergence (ttl=8)]
spf-interval Spreads SPFs
[Figure: delay, TTL, and LSP receptions, 07:06:30 to 07:06:45, October 10, 2001. Three 5-sec timers end in Router1 SPF, Router2 SPF, and Router3 SPF; the LSPs marked are Burbank’s LSP, LA’s LSP, and some other LSP]
3 routers start timers after 3 different LSPs
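The effect can be stated as arithmetic: if every router waits the same fixed interval, the SPF runs end up spread exactly as far apart as the timer starts were. The sketch below computes that window; the start times are hypothetical, not values read from the trace.

```python
def spf_spread(timer_starts, spf_interval=5.0):
    """timer_starts: {router: time its SPF timer started}.

    Returns ({router: SPF run time}, seconds between first and last SPF run).
    """
    run_at = {router: start + spf_interval for router, start in timer_starts.items()}
    window = max(run_at.values()) - min(run_at.values())
    return run_at, window

# Hypothetical timer starts, each triggered by a different LSP.
starts = {"Router1": 0.0, "Router2": 1.5, "Router3": 3.0}
run_at, window = spf_spread(starts)
```

Here the three routers compute new routes at 5.0, 6.5, and 8.0 seconds, so their views disagree for 3 seconds. A fixed interval shifts every run by the same amount and does nothing to narrow that window.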
SPF Computation Times
[Figure: SPF computation times (microseconds, log scale from 1 to 10000) vs. percentage of SPF runs, for Dijkstra SPF (avg=1069usec) and Incremental SPF (avg=13usec)]
Qwest topology and events using equal-cost multiple paths
On average 84 times faster
SPF Scaling
[Figure: SPF computation times (milliseconds, log scale from 0.01 to 1000) vs. number of routers (1000 to 5000), for Dijkstra SPF and Incremental SPF, using random topology and events]
Incremental SPF lets the network grow very large
Source: Haobo Yu.
Where does the time go?
Detection times were several seconds, must be improved
LSP Propagation times were subsecond, but still much larger than link propagation delays
SPF rate limiting
  spf-interval causes most of the convergence delay
  spf-interval spreads SPFs and groups the wrong set of LSPs into the same SPF
A Recipe for Subsecond ISIS Convergence
Vendors: fast link failure detection
  Hardware detection is preferred
  Vendors have fast failure detection solution for MPLS fast reroute
  It will benefit convergence immediately
Vendors: adaptive and asymmetric Up/Down filters
  It will reduce the ISIS churn w/o hurting convergence
Operators: eliminate current LSP & SPF rate limits
  Adaptive asymmetric filters make it safe
Vendors: incremental SPF algorithm
  A must for avoiding CPU meltdowns even as the network gets bigger
Acknowledgements
We would like to thank Chris Sieber, Paul Mabey and Shankar Rao of Qwest for providing access to their network for our study and for their review of our results.
We would also like to thank Haobo Yu, Van Jacobson, Kathleen Nichols, and Kedar Poduri of Packet Design for their contributions to this work.