bgp anomaly detection in an isp
TRANSCRIPT
![Page 1: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/1.jpg)
1
BGP Anomaly Detection in an ISP
Jian Wu (U. Michigan)Z. Morley Mao (U. Michigan) Jennifer Rexford (Princeton)
Jia Wang (AT&T Labs)
http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
![Page 2: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/2.jpg)
2
Goal
Identify important anomalies Lost reachability Persistent flapping Large traffic shifts
Contributions:
•Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.
•Use the tool to characterize routing disruptions in an operational network
![Page 3: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/3.jpg)
3
Capturing Routing Changes
CBRCBR
CPEBGP Monit
or
CBRCBR
CBRCBR
CBRCBR
CBRCBR
CBRCBR
iBGP
iBGP
iBG
P
iBG
P
iBGP
iBGP
eBGP
eBG
P
eBGP
eBG
P
eBGP
eBGP
UpdatesUpdates
Best routesBest ro
utes
Large operational network(8/16/2004 – 10/10-2004)
![Page 4: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/4.jpg)
4
Challenges
Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage
Different than root-cause analysis Identify changes and their effects Focus on actionable events Diagnose causes only in/near the AS
![Page 5: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/5.jpg)
5
System Architecture
Event Classification
Event Classification
“Typed”Events
EEBR
EEBR
EEBR
BGP Updates
(106)
BGP Update Grouping
BGP Update Grouping
Events
Persistent Flapping Prefixes
(101)
(105)
EventCorrelation
EventCorrelation
Clusters
Frequent Flapping Prefixes
(103)
(101)
Traffic ImpactPrediction
Traffic ImpactPrediction
EEBREEBR EEBR
LargeDisruptions
Netflow Data
(101)
![Page 6: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/6.jpg)
6
Grouping BGP Update into Events
Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers
Solution: •Group all updates for a prefix with inter-arrival < 70 seconds•Flag prefixes with changes lasting > 10 minutes.
BGP Update Grouping
BGP Update Grouping
EEBR
EEBR
EEBR
BGP Updates
Events
Persistent Flapping Prefixes
![Page 7: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/7.jpg)
7
Grouping Thresholds
Based on data analysis and our understanding of BGP
Event timeout: 70 seconds 2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds
Convergence timeout: 10 minutes BGP usually converges within minutes 99.9% events < 10 minutes
![Page 8: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/8.jpg)
8
Persistent Flapping Prefixes
Causes of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interface or BGP session (3.0%)
Surprising finding: 15.2% of updates were caused by persistent flapping prefixes, even though flap damping was enabled!
![Page 9: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/9.jpg)
9
Example: Unstable eBGP Session
ISP Peer
CustomerEC
EB
EA ED
p
Flap damping parameters are session-based Damping not implemented for iBGP sessions
![Page 10: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/10.jpg)
10
Event Classification
Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of traffic through the network
Event Classification
Event Classification
Events “Typed”Events
Solution: classify events by severity of their impacts
![Page 11: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/11.jpg)
11
Event Category – “No Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
No Traffic Shift
“No Disruption”: each of the border routers has no traffic shift. (50.3%)
![Page 12: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/12.jpg)
12
Event Category – “Internal Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
Internal Traffic Shift
“Internal Disruption”: all of the traffic shifts are internal traffic shift. (15.6%)
![Page 13: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/13.jpg)
13
Event Category – “Single External Disruption”
ISP
EA
p
EB
EC
EE
AS2
ED
AS1
external Traffic Shift
“Single External Disruption”: only one of the traffic shifts is external traffic shift. (20.7%)
![Page 14: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/14.jpg)
14
Statistics on Event Classification
Events Updates
No Disruption 50.3% 48.6%
Internal Disruption 15.6% 3.4%
Single External Disruption 20.7% 7.9%
Multiple External Disruption 7.4% 18.2%
Loss/Gain of Reachability 6.0% 21.9%
First 3 categories have significant variations from day to day
Updates per event depends on the type of events and the number of affected routers
![Page 15: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/15.jpg)
15
Event Correlation
Challenge: A single routing change affects multiple destination prefixes
EventCorrelation
EventCorrelation“Typed”
EventsClusters
Solution: group events of same type that occur close in time
![Page 16: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/16.jpg)
16
EBGP Session Reset Caused most “single external disruption” events Check if the number of prefixes using that
session as the best route changes dramatically
Validation with Syslog router report (95%)
time
Number of prefixes
session failure
session recovery
![Page 17: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/17.jpg)
17
Hot-Potato Changes Hot-Potato Changes
Caused “internal disruption” events Validation with OSPF measurement (95%)
[Teixeira et al – SIGMETRICS’ 04]
ISP
P
EA EB
EC
10119
“Hot-potato routing” = route to closest egress point
![Page 18: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/18.jpg)
18
Traffic Impact Prediction
Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations
Traffic ImpactPrediction
Traffic ImpactPrediction
EEBR
Clusters LargeDisruptions
Netflow Data
EEBR EEBR
Solution: weigh each cluster by traffic volume
![Page 19: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/19.jpg)
19
Traffic Impact Prediction
Traffic weight Per-prefix measurement from Netflow 10% prefixes accounts for 90% of traffic
Traffic weight of a cluster Sum of “traffic weight” of the prefixes A few clusters have large traffic weight Mostly session resets & hot-potato changes
![Page 20: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/20.jpg)
20
Performance Evaluation Memory
Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB
Speed 99% of intervals of 1 second of updates can
be process within 1 second Occasional execution lag Every interval of 70 seconds of updates can
be processed within 70 seconds
Measurements were based on 900MHz CPU
![Page 21: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/21.jpg)
21
Conclusion
BGP anomaly detection Fast, online fashion Operator concerns (reachability, flapping,
traffic) Significant information reduction
Uncovered important network behaviors Persistent flapping prefixes Hot-potato changes Session resets and interface failures
![Page 22: Bgp Anomaly Detection In An Isp](https://reader035.vdocuments.us/reader035/viewer/2022062706/55762dd2d8b42a015c8b47cc/html5/thumbnails/22.jpg)
22
Detecting Peering Violations Consistent export requirement
Peer should advertise prefixes at all peering points, with the same AS path length
Allows the AS to do hot-potato routing Detecting violations
Using iBGP feeds from the border routers Some inference tricks to identify inconsistencies
Results of the study http://www.nanog.org/mtg-0410/feamster.html http://www.cs.princeton.edu/~jrex/papers/imc04.pdf