IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization

TRANSCRIPT

  • IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization
    Connie Logg, Joint Techs Workshop, February 4-9, 2006

  • BW Change Detection: Important
    Know what you are looking for.
    How long must a change persist before alerting?
    What threshold to use for alerting (drop of N%)?
    What probes provide quality data and are relevant? May differ between network types and technologies.
    Once an alert is detected, what circumstances must be met before another alert is generated for the same or a new drop?
    Alerting and forecasting/predicting future performance are two different things; however, the data taken may be relevant to both.
    Remember: we don't want to respond to every little glitch, and more probing may escalate a minor momentary congestion event.

  • What to do with ALERTS
    Study them for accuracy and relevance.
    What information would help diagnose the drop?
    Were there traceroute changes?
    Do changes in other probes seem to have occurred in the same time frame?
    Was there an increase in the ping RTT times?
    If TCP RTT is available, was there a change in that?
    What does OWAMP show? (to be implemented)

  • Algorithm - Simplified
    Stream of data t0 ... tn.
    Two buffers: a history buffer (hbuff) and a trigger buffer (tbuff), of sizes hmax and tmax.
    Load data t0 ... thmax into the history buffer and calculate the baseline histmean (hm) and histsd (hsd).

  • Algorithm - Simplified
    Loop over data t = {thmax+1 ... tn}:
    If t > hm - 2*hsd: move the oldest tbuff entry to hbuff, add t to hbuff, drop the oldest hbuff entry, recalculate hm and hsd, and go to the next point.
    Otherwise: add t to tbuff; if size(tbuff) < tmax, go to the next point.
    Calculate the tbuff mean (tm); if (hm - tm)/hm > threshold, generate an alert, move tbuff into hbuff, recalculate hm and hsd, and go to the next point (a sketch of this loop follows below).
    Once an alert is generated, the drop threshold must be met again from tm, or the data stream must recover for ... of the drop time, before another alert is raised.
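    A minimal Python sketch of this two-buffer scheme (the function name, the statistics helpers, and the exact buffer shuffling are assumptions; the slides only give the outline above, and here the whole trigger buffer is folded back into the history buffer rather than one entry at a time):

        from collections import deque
        import statistics

        def detect_drops(values, hmax, tmax, threshold):
            """Two-buffer change detection: the history buffer supplies the
            baseline mean/std dev; values falling below mean - 2*sigma collect
            in the trigger buffer, and when it fills, a relative drop beyond
            `threshold` raises an alert."""
            hbuff = deque(values[:hmax], maxlen=hmax)   # history buffer (baseline)
            tbuff = deque(maxlen=tmax)                  # trigger buffer (suspect points)
            alerts = []
            for i, t in enumerate(values[hmax:], start=hmax):
                hm = statistics.mean(hbuff)
                hsd = statistics.stdev(hbuff)
                if t > hm - 2 * hsd:
                    # value looks normal: fold queued trigger points back into
                    # history, then append t (the deque drops its oldest entries)
                    hbuff.extend(tbuff)
                    tbuff.clear()
                    hbuff.append(t)
                    continue
                tbuff.append(t)                          # suspiciously low value
                if len(tbuff) < tmax:
                    continue
                tm = statistics.mean(tbuff)
                if (hm - tm) / hm > threshold:           # relative drop big enough?
                    alerts.append((i, hm, tm))
                hbuff.extend(tbuff)                      # fold trigger data into the baseline
                tbuff.clear()
            return alerts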

  • Overview
    What we currently look for:
    Look for a drop lasting at least 6 hours.
    Look for a drop of 33%.
    Before reporting another drop, require 3 hours of restored throughput.

    [Figure: bandwidth vs. time, annotated with a 33% drop lasting 6 hours, a further 33% drop for 6 more hours, and a period of at least 3 hours back up before another drop is reported.]
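    Applied to the hypothetical detect_drops() sketch above, and assuming hourly probes (the talk does not state the probing interval or the history-buffer size), the current settings translate roughly to:

        # hourly_throughput is a placeholder list of measurements; with hourly
        # probes, a 6-sample trigger buffer approximates "drop persists 6 hours"
        # and a 33% relative drop is the alert threshold.
        alerts = detect_drops(hourly_throughput, hmax=24, tmax=6, threshold=0.33)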

  • Observations
    Traceroute changes occasionally coincide with bandwidth drops.
    Challenge: how do you define a traceroute change, and which changes have the highest priority?
    Checksum error
    Duplicate responding or non-responding hop
    ! Annotations
    IP addr differs in the 4th octet (or the 3rd and 4th octets); a comparison sketch follows below.
    How do you quickly review traceroute changes?
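    As an illustration of the 4th-octet rule above (not the actual IEPM-BW selector, which this talk does not spell out), a route comparison might treat last-octet differences as minor:

        def significant_route_change(route_a, route_b):
            """Treat two hop lists (IPv4 address strings) as unchanged when they
            differ only in the 4th octet of some hops; any other difference, or a
            change in length, counts as a significant route change."""
            if len(route_a) != len(route_b):
                return True
            for hop_a, hop_b in zip(route_a, route_b):
                if hop_a.rsplit(".", 1)[0] != hop_b.rsplit(".", 1)[0]:
                    return True
            return False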

  • Traceroute Visualization
    One compact page per day.
    One row per host, one column per hour.
    One character per traceroute to indicate pathology or change (period (.) = no change).
    Identify unique routes with a number.
    Inspect the route associated with a route number.
    Provide for analysis of long-term route evolutions.
    The route # at the start of the day gives an idea of route stability (a rendering sketch follows below).
    Example: multiple route changes (due to GEANT), later restored to the original route.
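    A sketch of that one-character-per-hour layout; the data structure (host -> hourly route numbers) and the use of route numbers as the printed characters are assumptions, and the real pathology encodings are the ones listed on the next slide:

        def route_change_grid(hosts, hourly_routes):
            """Print one row per host and one character per hourly traceroute:
            '.' when the route number is unchanged from the previous hour,
            otherwise the new route number.  The first column therefore shows
            the route # at the start of the day."""
            for host in hosts:
                row, prev = [], None
                for route_no in hourly_routes[host]:   # e.g. 24 hourly route numbers
                    row.append("." if route_no == prev else str(route_no))
                    prev = route_no
                print(f"{host:20s} {''.join(row)}")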

  • Pathology Encodings
    Stutter
    Probe type
    End host not pingable
    ICMP checksum
    Change in only the 4th octet
    Hop does not respond
    No change
    ! Annotation (!X)
    Change but same AS

  • Navigation
    traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
     1 rtr-gsr-test (134.79.243.1) 0.102 ms
    13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X

    #rt# firstseen  lastseen   route
    0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
    1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
    2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
    3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
    4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
    5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
    6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
    7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
    8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
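    The #rt#/firstseen/lastseen table above suggests each distinct hop sequence is assigned a small integer the first time it is seen; a minimal sketch of that bookkeeping (the names and data structures are assumptions, not the IEPM-BW implementation):

        route_numbers = {}   # hop-sequence tuple -> route number
        route_spans = {}     # route number -> [firstseen, lastseen] (epoch seconds)

        def record_route(hops, timestamp):
            """Assign a stable number to each distinct route and track when it
            was first and last observed, mirroring the firstseen/lastseen
            columns in the table above."""
            key = tuple(hops)
            if key not in route_numbers:
                rt = len(route_numbers)
                route_numbers[key] = rt
                route_spans[rt] = [timestamp, timestamp]
            rt = route_numbers[key]
            route_spans[rt][1] = max(route_spans[rt][1], timestamp)
            return rt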

  • AS information

  • Changes in network topology (BGP) can result in dramatic changes in performance.
    Snapshot of the traceroute summary table, and samples of traceroute trees generated from the table.
    ABwE measurements, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, plotting dynamic BW capacity (DBC), cross-traffic (XT), and available BW = (DBC - XT) in Mbits/s, per hour and per remote host.
    Drop in performance (from the original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech), then back to the original path; the changes were detected by IEPM-Iperf and ABwE; the ESnet-LosNettos segment (100 Mbits/s) was in the path.
    Notes:
    1. Caltech misrouted via the Los-Nettos 100 Mbps commercial net 14:00-17:00.
    2. ESnet/GEANT working on routes from 2:00 to 14:00.
    3. A previous occurrence went unnoticed for 2 months.
    4. The next step is to auto-detect and notify.

  • New Graphical Map Display
    New Traceroute Map Display

  • Quality Control Bandwidth Monitoring
    It is good to have a local target host for a sanity check:

    The problem here was that the monitoring host rebooted into single-CPU mode after maintenance had been performed on it.

  • More Sanity Checks
    The target host iepm-bw@caltech was not completely installed: process cleanup did not have the Perl modules that it needed to kill lingering processes (needs an install check).

  • Probe Correlation
    Pathchirp analysis shows the drop.
    Multi-stream iperf shows the drop.
    Single-stream iperf shows the drop.
    The traceroute change affected all 3.

  • Analysis Results
    Email is sent to interested parties with links to graphs, data, and the traceroute analysis.
    Alerts are saved in the ALERT table and graphs are saved in the GRAPH table for future reference.
    Every analysis run (about every 2 hours), a table is generated showing which alerts occurred for which probes and when; it has links to the more detailed alert information.
    Reports are generated nightly from these tables for the last month's alerts.

  • Future Improvements
    Integrate ping RTTmin and RTTmax analysis.
    Optimize code for speed of execution (estimating the mean and std dev).
    Upload alerts to MonALISA. What info?
    Compare detection algorithms (KS, HW, PCA?).
    Recommendations on data-taking frequencies, and on how to define the trigger and history buffer sizes, still need more exploration.
    Implement prediction/forecasting algorithm(s).

    QUESTIONS?