Effective Diagnosis of Routing Disruptions from End Systems
Ying Zhang, Z. Morley Mao, Ming Zhang
Post on 19-Jan-2018
Routing disruptions impact application performance
  More applications today have high QoS requirements
  Routing events can cause high loss and long delays
[Figure: path from Src to Dst across the Internet through AS B, AS C, AS D, and AS E]
Existing approaches to diagnose routing disruptions are ISP-centric
  Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05]
  Passive and accurate
[Figure: BGP collectors; AS B, AS C, and AS D in the Internet]
Limitations of ISP-centric approaches
  Difficult to gain access to data from many ISPs
  BGP data reflects "expected" data-plane paths
[Figure: AS B, AS C, and AS D in the Internet; end systems and the ISP, annotated with question marks]
Can we diagnose entirely from end systems?
  Goal: infer data-plane paths of many routers
[Figure: probing host, ISP A, AS B, AS C, AS D, Dst]
Our approach: end-system-based monitoring
  Requires probing only from end hosts
  Covers all the PoPs of a target ISP
[Figure: probing host, target ISP, AS B, AS C, AS D, Dst]
Our approach: end-system-based monitoring
  Covers most of the destinations on the Internet
[Figure: probing host, ISP A, AS B, AS C, AS D, multiple Dst]
Our approach: end-system-based monitoring
  Identifies routing changes by comparing paths measured consecutively
[Figure: probing host, ISP A, AS B, AS C, AS D, Dst]
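The comparison of consecutively measured paths can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Snapshot` layout (a mapping from a (probing host, destination) pair to a hop sequence) and the name `find_routing_events` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]
# Hypothetical layout: (probing host, destination) -> measured hop sequence
Snapshot = Dict[Tuple[str, str], Path]


def find_routing_events(prev: Snapshot, curr: Snapshot) -> List[Tuple[str, str, Path, Path]]:
    """Return (host, dst, old_path, new_path) for each pair whose path changed."""
    events = []
    # Only pairs measured in both snapshots can be compared.
    for key in sorted(prev.keys() & curr.keys()):
        if prev[key] != curr[key]:
            host, dst = key
            events.append((host, dst, prev[key], curr[key]))
    return events


# Two consecutive snapshots; only the path from h1 changes.
prev = {("h1", "P"): ("ISP A", "AS B", "AS D"), ("h2", "P"): ("ISP A", "AS C", "AS D")}
curr = {("h1", "P"): ("ISP A", "AS C", "AS D"), ("h2", "P"): ("ISP A", "AS C", "AS D")}
events = find_routing_events(prev, curr)
```

Pairs that appear in only one snapshot are skipped here, since a missing measurement is not by itself evidence of a routing change.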
Advantages and challenges
  Advantages:
    No need for access to ISP-proprietary data
    Identify actual data-plane paths
    Monitor data-plane performance
  Challenges:
    Limited resources to probe
    Coverage of probed paths
    Timing granularity
    Measurement noise
System architecture
[Figure: components: collaborative probing, event identification and classification, event correlation and inference, event impact analysis; inputs: target ISPs; output: reports]
Outline
  Collaborative probing
  Event identification and classification
  Event correlation and inference
  Result and validation
Collaborative probing
  Using a set of hosts
    To learn the routing state
    To improve coverage
    To reduce overhead
[Figure: probing host, ISP A, AS B, AS C, AS D]
Outline
  Collaborative probing
  Event identification and classification
  Event correlation and inference
  Result and validation
Event classification
  Classify events according to ingress/egress changes
    Type 1: Ingress PoP changes
    Type 2: Ingress PoP same, egress PoP different
    Type 3: Ingress PoP same, egress PoP same
[Figure: probing host, target ISP, destination prefix P]
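The three event types can be expressed as a small decision function. This is a sketch under assumptions: the `IspTraversal` record and `classify_event` name are hypothetical, and each measured path is assumed to have already been mapped to the ingress and egress PoPs it uses inside the target ISP.

```python
from typing import NamedTuple


class IspTraversal(NamedTuple):
    """Hypothetical summary of how one measured path crosses the target ISP."""
    ingress_pop: str
    egress_pop: str


def classify_event(old: IspTraversal, new: IspTraversal) -> int:
    """Return the event type (1, 2, or 3) as defined on the slide."""
    if old.ingress_pop != new.ingress_pop:
        return 1  # Type 1: ingress PoP changes
    if old.egress_pop != new.egress_pop:
        return 2  # Type 2: same ingress PoP, different egress PoP
    return 3      # Type 3: same ingress PoP, same egress PoP


type1 = classify_event(IspTraversal("NYC", "LAX"), IspTraversal("CHI", "LAX"))
type2 = classify_event(IspTraversal("NYC", "LAX"), IspTraversal("NYC", "SEA"))
type3 = classify_event(IspTraversal("NYC", "LAX"), IspTraversal("NYC", "LAX"))
```

The PoP names are placeholders. Type 3 events keep both PoPs, so the change must lie in the internal PoP-level path or in the external AS path.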
Outline
  Collaborative probing
  Event identification and classification
  Event correlation and inference
  Result and validation
Likely causes: link failures
[Figure: probing host; target ISP with old path and new path; old egress PoP and new egress PoP; neighbor AS; destination prefix P]
Likely causes: internal distance changes
  Hot potato changes
    Cost of old internal path increases
    Cost of new internal path decreases
[Figure: probing host, old egress PoP, new egress PoP, neighbor AS; internal distance labels: 120, 80, 100, 120]
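A hot potato change can be illustrated with a toy egress selection. This is a sketch, assuming the ISP prefers the egress PoP with the smallest internal distance when routes are otherwise equally good. The function name and the assignment of distances to paths are assumptions (the values are loosely modeled on the figure's labels, whose mapping to specific paths is not recoverable from the transcript).

```python
from typing import Dict


def choose_egress(internal_distance: Dict[str, int]) -> str:
    """Pick the egress PoP with the smallest internal distance (ties broken by name)."""
    return min(sorted(internal_distance), key=internal_distance.get)


# Assumed scenario: the old internal path's cost rises from 80 to 120,
# while the alternative stays at 100.
before = {"old egress PoP": 80, "new egress PoP": 100}
after = {"old egress PoP": 120, "new egress PoP": 100}

egress_before = choose_egress(before)
egress_after = choose_egress(after)
```

In this scenario the egress moves from the old PoP to the new one with no external route change, which is the situation the slide labels a hot potato change.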
Event correlation
  Spatial correlation: a single network failure often affects multiple routers
  Temporal correlation: routing events occurring close together are likely due to only a few causes
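Temporal correlation can be sketched as grouping events whose timestamps fall close together. The talk does not state how closeness is defined, so the `max_gap` threshold and the `group_by_time` name are assumptions for illustration only.

```python
from typing import List


def group_by_time(timestamps: List[float], max_gap: float) -> List[List[float]]:
    """Group sorted timestamps; a gap larger than max_gap starts a new group."""
    groups: List[List[float]] = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= max_gap:
            groups[-1].append(t)  # close to the previous event: same group
        else:
            groups.append([t])    # too far from the previous event: new group
    return groups


# Event times in seconds, grouped with an assumed 60-second gap.
groups = group_by_time([1000, 0, 30, 50, 1010], max_gap=60)
```

Each resulting group is a candidate set of events to be explained together by a small number of causes.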
Inference methodology
  Evidence: an event that supports the cause
[Figure: probing hosts, target ISP, link L, new path, new egress, destination prefix P; cause: link L is down]
Inference methodology
  Conflict: a measurement trace that conflicts with the cause
[Figure: probing hosts, target ISP, link L, new path, new egress, destination prefix P; cause: link L is down]
Inference methodology
  Evidence node: [1,2,3] -> [1,2,4]
  Cause: link 2-3 down
  Cause: node 3 withdraws the route
[Figure: AS 1, AS 2, AS 3, AS 4; withdrawal]
Inference methodology
  Evidence graph
    Evidence node: [1,2,3] -> [1,2,4]
    Evidence node: [0,2,3] -> [0,2,4]
    Cause: link 2-3 down
    Cause: node 3 withdraws the route
[Figure: AS 0, AS 1, AS 2, AS 3, AS 4; withdrawal]
Inference methodology
  Conflict graph
    Conflict node: [1,2,3,6]
    Conflict node: [0,2,3,6]
    Conflict node: [0,2,3]
    Cause: link 2-3 down
    Cause: node 3 withdraws the route
[Figure: AS 0, AS 1, AS 2, AS 3, AS 6]
Inference methodology
  Greedy algorithm: minimum set of causes that can explain all the evidence while minimizing conflicts
  Evidence graph
    Evidence node: [1,2,3] -> [1,2,4]
    Evidence node: [0,2,3] -> [0,2,4]
  Conflict graph
    Conflict node: [1,2,3,6]
    Conflict node: [0,2,3,6]
    Conflict node: [0,2,3]
  Candidate cause scores: (Evidence: 2, Conflicts: 3) and (Evidence: 2, Conflicts: 0)
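One greedy strategy consistent with the slide can be sketched as follows: repeatedly pick the cause that explains the most still-unexplained evidence, breaking ties in favor of fewer conflicts. This is not necessarily the authors' exact procedure. The tie-breaking rule, the dictionary representation of the two bipartite graphs, and the assignment of the three conflict nodes to "link 2-3 down" are assumptions.

```python
from typing import Dict, List, Set


def infer_causes(evidence: Dict[str, Set[str]], conflicts: Dict[str, Set[str]]) -> List[str]:
    """Greedily choose causes until every evidence node is explained."""
    unexplained: Set[str] = set().union(*evidence.values())
    chosen: List[str] = []
    while unexplained:
        # Prefer more newly explained evidence, then fewer conflicts.
        best = max(
            sorted(evidence),
            key=lambda c: (len(evidence[c] & unexplained), -len(conflicts.get(c, set()))),
        )
        if not evidence[best] & unexplained:
            break  # safety stop: no cause explains the remaining evidence
        chosen.append(best)
        unexplained -= evidence[best]
    return chosen


# Evidence graph: cause -> evidence nodes it explains.
evidence = {
    "link 2-3 down": {"[1,2,3]->[1,2,4]", "[0,2,3]->[0,2,4]"},
    "node 3 withdraws the route": {"[1,2,3]->[1,2,4]", "[0,2,3]->[0,2,4]"},
}
# Conflict graph: cause -> traces that contradict it (assumed assignment).
conflicts = {
    "link 2-3 down": {"[1,2,3,6]", "[0,2,3,6]", "[0,2,3]"},
    "node 3 withdraws the route": set(),
}
causes = infer_causes(evidence, conflicts)
```

Both causes explain the same two evidence nodes, so under this assumed assignment the conflict count decides, and the withdrawal is selected.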
Outline
  Collaborative probing
  Event identification and classification
  Event correlation and inference
  Result and validation
ISPs studied

AS Name | ASN (Tier) | Periods | # of Src | # of PoPs | # of Probes | Probe Gap
AT&T |  | 3/23-4/9 | 230 | 111 | 61453 | 18.3 min
Verio |  | 4/10-4/22, 9/13-9/22 | 218 | 46 | 81024 | 19.3 min
Deutsche Telekom |  | 4/23-5/22 | 149 | 64 | 27958 | 17.5 min
Savvis |  | 5/23-6/24 | 178 | 39 | 40989 | 17.4 min
Abilene |  | 9/23-9/30, 2/3-2/17 | 113 | 11 | 51037 | 18.4 min
Results of event classification
  Many events are internal changes
  Abilene has many ingress changes

Target AS | Total events (% all traces) | Same ingress, diff egress | Same ingress, same egress: internal PoP path | Same ingress, same egress: external AS path | Diff ingress
AT&T | 0.35% | 12.1% | 51% | 35% | 11%
Verio | 0.31% | 27.3% | 48% | 19% | 9.8%
Deutsche Telekom | 0.66% | 4.9% | 8.5% | 80.7% | 7.2%
Savvis | 0.35% | 11% | 45% | 31% | 14%
Abilene | 0.24% | 13.6% | 37% | 40% | 17%
Validation with BGP-based approach [Wu05]
  Hot potato changes: egress point changes due to internal distance changes
  Columns: number of incidents identified by the BGP method, by our method, and by both, with (false negative, false positive) rates

Hot potato changes | BGP-based | Our method | Both (false negatives, false positives)
Tier-1 AS | 147 | 185 | 101 (31%, 45%)
Abilene network | 79 | 88 | 60 (24%, 31%)
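One plausible reading of the (false negative, false positive) pair, assumed here rather than stated on the slide, is that false negatives are incidents found by the BGP method but missed by the end-system method, as a share of the BGP count, and false positives are incidents found only by the end-system method, as a share of its own count. The function name is hypothetical.

```python
from typing import Tuple


def validation_rates(bgp_count: int, our_count: int, both_count: int) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) as fractions."""
    false_negative = (bgp_count - both_count) / bgp_count  # missed by our method
    false_positive = (our_count - both_count) / our_count  # not confirmed by BGP
    return false_negative, false_positive


# Counts from the hot potato table.
tier1_fn, tier1_fp = validation_rates(147, 185, 101)
abilene_fn, abilene_fp = validation_rates(79, 88, 60)
```

For the hot potato rows, this reading lands within one percentage point of the reported values.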
Validation with BGP-based approach
  Session resets: peering link up/down
  Inaccuracy reasons:
    Limited coverage
    Coarse-grained probing
    Measurement noise

Session reset | BGP-based | Our method | Both
Tier-1 AS | 9 | 15 | 6 (33%, 50%)
Abilene network | 7 | 11 | 7 (0%, 36%)
System performance
  Can keep up with generated routing state
  Applicable for real-time diagnosis and mitigation
    Reactive: construct alternate paths to bypass the problem
    Proactive: avoid paths with many historical routing disruptions
Conclusion
  Developed the first system to diagnose routing disruptions purely from end systems
  Used a simple greedy algorithm on two bipartite graphs to infer causes
  Comprehensively validated the accuracy
Thank you!
Questions?
Performance impact analysis
  End-to-end latency changes caused by different types of routing events
Validation with BGP data
  BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
  The destination prefix coverage and the routing event detection rate

Target AS | Dst. prefix coverage | Dst. prefix traversing PoPs with BGP feeds | Detected events (AS change, next hop change) | Missed events (short-duration, filtering, other)
AT&T | 15% | 1.5% | 11% (10.3%, 3.2%) | 89% (75%, 13%, 1%)
Verio | 18.6% | 18.1% | 23% (19.1%, 8.6%) | 77% (73%, 4%, 0%)
Savvis | 7.8% | 1.1% | 6% (5.8%, 0.5%) | 94% (80%, 9%, 5%)
Abilene | 6% | 6% | 21% (17.3%, 5.8%) | 79% (61%, 15%, 3%)
Event classification: same ingress PoP, different egress PoP
  Policy changes
    Local preference in the old route decreases
    Local preference in the new route increases
[Figure: probing host; target ISP with old path and new path; old egress PoP and new egress PoP; neighbor AS; Local Pref: 100->50 and Local Pref: 60->110]
Event classification: same ingress PoP, different egress PoP
  External routing changes
    Old route worsens due to external factors (withdrawal, longer AS path)
    New route improves due to external factors
[Figure: probing host; target ISP with old path and new path; old egress PoP and new egress PoP; AS A and AS B; AS path changes ABCD->ABEFD and BCEFD->BEFD]
Event classification: same ingress PoP, same egress PoP
  Internal PoP path changes
    Cost of old internal path increases
    Cost of new internal path decreases
  External AS path changes
[Figure: probing host; target ISP with old path and new path; destination prefix P]
Results of cause inference
  Effectiveness of inference algorithm
  Clusters: a group of events with the same root cause
Event identification
  A routing event: path changes
  Event identification: comparing consecutive routing snapshots
[Figure: probing host, ISP A, AS B, AS C, AS D, Dst]