Presentation by Michael Smathers, Usman Jafarey CS395/495 IMRE, April 24, 2006 PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Page 1: PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Presentation by Michael Smathers, Usman Jafarey

CS395/495 IMRE, April 24, 2006

Page 2: Detecting Path Anomalies

• A large volume of traffic data is required to characterize misbehavior; wide-area services carry such traffic:

– Peer-to-peer (P2P) systems
– Content distribution networks (CDN)

• Solution: Combine passive monitoring of wide-area networks with active probes to quantify and characterize anomalies.

Page 3: Traditional Detection…

• Traceroute maps only the forward path; without cooperation from the destination, it is difficult to infer whether a problem lies on the forward or the reverse path.

• BGP/OSPF propagate failure information. Traceroute may stop at a hop that is not the source of the failure.

• High variance in failure duration makes it difficult to respond in time.

• Few sites had enough coverage to identify all paths affected by a failure.

Page 4: Advantages of this approach

• More accurate, complete view of failures thanks to the geographical diversity of nodes

• Minimal overhead; active probing is initiated only after passive monitoring detects an anomaly

• High rate of failure detection thanks to large volumes of traffic

Page 5: PlanetLab Test Bed

• Passively monitoring traffic on PlanetLab since February 2004 to detect anomalous behaviour

– Coordinate active probes between PlanetLab sites to confirm/characterize anomaly and measure scope

• ~90,000 anomalies confirmed each month with PlanetSeer.

Page 6: Components

Wide-area service network: CoDeeN

• 7-12K clients/day

• 100-200 GB/day

• 5-7 million requests/day

• 120 nodes in North America (350 world-wide)

Passive Monitoring Daemons (MonD) run on all CoDeeN nodes to detect anomalous TCP traffic behaviour.

Active Probing Daemons (ProbeD) run on all PlanetLab nodes, including CoDeeN nodes, awaiting requests from MonDs.

Page 7: Operation

1. MonD detects an anomaly and sends a request to the local ProbeD.

2. ProbeD contacts ProbeDs on other nodes to coordinate a planet-wide probe.

3. ProbeDs are organized in groups for the distributed probe.

Page 8: MonD - Operation

• Uses PlanetLab's tcpdump to observe all incoming and outgoing TCP packets.

• Uses this information to generate path- and flow-level statistics, which are used to identify possible anomalies in real time.

• Two indicators of anomalies:

– Change in TTL (Time To Live) field
– Multiple consecutive timeouts

Current threshold: 4 timeouts

If MonD is on the receiving side, ACKs are not reaching the sender. We can assume the forward path is at fault.

If MonD is the sender, we cannot determine from timeouts which path contains the problem.
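The TTL indicator can be illustrated with a small sketch. Since every router decrements the TTL by one, a TTL change between packets of the same flow suggests the packets crossed a different number of hops. The class name `TtlWatcher` and the flow key are hypothetical, chosen only for illustration; the slides do not describe MonD's data structures.

```python
class TtlWatcher:
    """Remember the last TTL seen per flow and flag any change (illustrative only)."""

    def __init__(self):
        self._last_ttl = {}  # flow key -> TTL of the most recent packet

    def observe(self, flow, ttl):
        """Record a packet's TTL; return True if it differs from the previous one."""
        previous = self._last_ttl.get(flow)
        self._last_ttl[flow] = ttl
        return previous is not None and previous != ttl


watcher = TtlWatcher()
flow = ("198.51.100.7", 80)          # hypothetical (client IP, port) flow key
print(watcher.observe(flow, 52))     # first packet, nothing to compare against
print(watcher.observe(flow, 49))     # TTL changed: candidate route change
```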

Page 9: MonD - Timeout Detection

When MonD is the sender, maintain two variables for each flow:

• SendSeqNo, the sequence number of the most recently sent packet.

• SendRtxCount, the number of times that packet has been retransmitted.

For each outgoing packet, compare its sequence number (CurrentSeqNo) with SendSeqNo:

• CurrentSeqNo > SendSeqNo: the flow is making progress. Clear SendRtxCount and set SendSeqNo to CurrentSeqNo.

• CurrentSeqNo < SendSeqNo: fast retransmit. Set SendSeqNo to CurrentSeqNo.

• CurrentSeqNo = SendSeqNo: timeout. Increment SendRtxCount. If SendRtxCount exceeds the threshold, MonD notifies ProbeD of a possible anomaly.
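The three cases above can be sketched as a small per-flow state machine. This is a minimal illustration, not MonD's actual code: the names `SenderFlowState` and `on_outgoing_packet` are hypothetical, sequence-number wraparound is ignored, and the sketch assumes the notification fires on the fourth consecutive timeout (the slides say only that the count must exceed a threshold currently set to 4 timeouts).

```python
from dataclasses import dataclass

TIMEOUT_THRESHOLD = 4  # "Current threshold: 4 timeouts" from the slides


@dataclass
class SenderFlowState:
    send_seq_no: int = -1    # SendSeqNo: most recently sent sequence number
    send_rtx_count: int = 0  # SendRtxCount: timeouts seen for that packet


def on_outgoing_packet(state: SenderFlowState, current_seq_no: int) -> bool:
    """Update the flow state; return True when ProbeD should be notified."""
    if current_seq_no > state.send_seq_no:
        # Flow is making progress: clear the counter and advance.
        state.send_seq_no = current_seq_no
        state.send_rtx_count = 0
        return False
    if current_seq_no < state.send_seq_no:
        # Older data resent: treated as a fast retransmit, not a timeout.
        state.send_seq_no = current_seq_no
        return False
    # Same sequence number again: treated as a timeout retransmission.
    state.send_rtx_count += 1
    # Assumption: notify once the count reaches the threshold.
    return state.send_rtx_count >= TIMEOUT_THRESHOLD


state = SenderFlowState()
for seq in [1000, 1000, 1000, 1000, 1000]:
    if on_outgoing_packet(state, seq):
        print("possible anomaly, notify ProbeD")
```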

Page 10: MonD - cont’d…

When MonD is the receiver, maintain the largest sequence number seen per flow. If the current packet has the same sequence number, increment a counter.

When the counter hits the threshold, notify ProbeD that the sender is not seeing ACKs.
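A receiver-side counterpart might look like the sketch below. The class name `ReceiverFlowState` is hypothetical, and two details are assumptions rather than statements from the slides: the counter is reset whenever a larger sequence number arrives, and the threshold equals the 4-timeout value used on the sender side.

```python
RECEIVER_THRESHOLD = 4  # assumed equal to the sender-side threshold


class ReceiverFlowState:
    """Track the largest sequence number seen and count repeats of it."""

    def __init__(self):
        self.max_seq_no = -1
        self.repeat_count = 0

    def on_incoming_packet(self, seq_no):
        """Return True when ProbeD should be told the sender is not seeing ACKs."""
        if seq_no > self.max_seq_no:
            # New data: the sender is making progress (reset is an assumption).
            self.max_seq_no = seq_no
            self.repeat_count = 0
            return False
        if seq_no == self.max_seq_no:
            # The sender resent the newest packet, so our ACK likely got lost.
            self.repeat_count += 1
            return self.repeat_count >= RECEIVER_THRESHOLD
        return False  # older packet; ignored in this sketch


flow = ReceiverFlowState()
for seq in [500, 500, 500, 500, 500]:
    if flow.on_incoming_packet(seq):
        print("sender is not seeing ACKs, notify ProbeD")
```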

Page 11: ProbeD

Three probing operations:

1. Baseline probes: run when a new IP is added to the MonD path table.

2. Forward probes: traceroutes invoked at multiple geographically distributed nodes when MonD detects an anomaly. These are rate limited; ProbeD will not forward probe the same destination more than once in 10 minutes.

3. Reprobes: if the anomaly is confirmed by the forward probe, the initial ProbeD sends reprobes to determine the duration and effects of the anomaly. Reprobes are sent at 0.5, 1.5, 3.5 and 7.5 hours after the anomaly detection time and are compared to the original baseline and forward probes.
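The rate limit and the reprobe schedule can be sketched as follows. The constants come from the slide (10 minutes; 0.5, 1.5, 3.5 and 7.5 hours), while the names `ForwardProbeLimiter` and `reprobe_times` and the use of seconds as the time unit are assumptions made for this illustration.

```python
FORWARD_PROBE_INTERVAL_S = 10 * 60        # at most one forward probe per 10 minutes
REPROBE_OFFSETS_H = (0.5, 1.5, 3.5, 7.5)  # hours after anomaly detection


class ForwardProbeLimiter:
    """Allow at most one forward probe per destination per interval."""

    def __init__(self):
        self._last_probe = {}  # destination -> time of last allowed probe (seconds)

    def allow(self, destination, now_s):
        last = self._last_probe.get(destination)
        if last is not None and now_s - last < FORWARD_PROBE_INTERVAL_S:
            return False
        self._last_probe[destination] = now_s
        return True


def reprobe_times(detected_at_s):
    """Absolute times (seconds) at which reprobes would be sent."""
    return [detected_at_s + hours * 3600 for hours in REPROBE_OFFSETS_H]


limiter = ForwardProbeLimiter()
print(limiter.allow("192.0.2.1", now_s=0))    # first probe is allowed
print(limiter.allow("192.0.2.1", now_s=300))  # 5 minutes later: suppressed
print(reprobe_times(0))
```

Note that the gaps between consecutive reprobes double (0.5, 1, 2 and 4 hours), so short anomalies are sampled more densely than long ones.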

Page 12: ProbeD - Operation

• 353 ProbeDs running on 145 PlanetLab sites.

• Distributed across North/South America, Europe, Asia and elsewhere.

• Membership information kept for ProbeDs to avoid unnecessary communication to dead nodes.

• 30 ProbeD node groups based on geographic diversity.

• ProbeD receives a request from the local MonD, then

– forwards the request to one ProbeD from each group
– ProbeDs perform the probe and send results to the requester
– the originator collects the data
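Choosing one prober per group might be sketched as below. The slide does not say how the member within a group is chosen, so the random choice here is an assumption, as are the function name `pick_probers` and the group and node names in the example.

```python
import random


def pick_probers(groups, alive, rng=random):
    """Pick one live ProbeD from each group; skip groups with no live member.

    groups: dict mapping group name -> list of node names
    alive:  set of node names currently believed to be up (membership info)
    """
    chosen = []
    for members in groups.values():
        candidates = [node for node in members if node in alive]
        if candidates:
            # Assumption: any live member of the group is equally suitable.
            chosen.append(rng.choice(candidates))
    return chosen


groups = {
    "north-america": ["na-1", "na-2"],
    "europe": ["eu-1"],
    "asia": ["as-1"],
}
alive = {"na-2", "eu-1"}  # membership info marks na-1 and as-1 as dead
print(pick_probers(groups, alive))
```

Filtering by the membership set before choosing reflects the slide's point about avoiding unnecessary communication with dead nodes.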

Page 13: ProbeD - Dataset

• 887,521 unique client IPs from 9,232 ASes.

• Probes traversed 10,090 ASes (over half the ASes on the Internet).

• 2,259,558 possible anomalies

• 271,898 confirmed
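As a quick sanity check on these figures, the share of possible anomalies that were confirmed, and the collection period implied by the roughly 90,000 confirmations per month quoted earlier in the deck, can be computed directly. The variable names are only illustrative.

```python
possible_anomalies = 2_259_558
confirmed_anomalies = 271_898
confirmed_per_month = 90_000  # approximate figure quoted earlier in the deck

# Fraction of passively flagged anomalies that active probing confirmed.
confirmation_rate = confirmed_anomalies / possible_anomalies

# Rough length of the collection period implied by the monthly rate.
months_covered = confirmed_anomalies / confirmed_per_month

print(f"confirmed: {confirmation_rate:.1%} of possible anomalies")
print(f"implied collection period: about {months_covered:.1f} months")
```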

Page 14: Repairing Traceroute Data

• Unusable hops (shown as * in place of a name) are removed; the relative hop count is maintained.

• Missing hops are found by comparing traceroutes that share a destination.
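Both repair steps can be sketched on traceroutes represented as lists of hop names. The first function keeps each surviving hop's original position, which is one way to maintain the relative hop count. The fill rule in the second function (copy a hop from another traceroute only when both neighbouring hops match) is an assumption for illustration; the slides do not spell out the comparison.

```python
def usable_hops(trace):
    """Drop '*' hops but keep each remaining hop's 1-based position."""
    return [(position, hop) for position, hop in enumerate(trace, start=1) if hop != "*"]


def fill_missing(trace, other):
    """Fill '*' hops in `trace` from `other`, a traceroute to the same destination.

    Assumed rule: copy other[i] only if the hops before and after position i
    are known and identical in both traceroutes.
    """
    repaired = list(trace)
    for i in range(1, len(trace) - 1):
        if trace[i] != "*" or i + 1 >= len(other) or other[i] == "*":
            continue
        same_before = trace[i - 1] == other[i - 1] != "*"
        same_after = trace[i + 1] == other[i + 1] != "*"
        if same_before and same_after:
            repaired[i] = other[i]
    return repaired


trace = ["r1", "*", "r3", "dst"]
other = ["r1", "r2", "r3", "dst"]
print(usable_hops(trace))
print(fill_missing(trace, other))
```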

Page 15: Anomaly Detection

An anomaly is confirmed if any of the following conditions is met:

• There is a loop in the traceroute

• The local traceroute disagrees with the baseline

• The local traceroute doesn't reach the destination but other traceroutes do

• The traceroute returns ICMP destination unreachable
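The four confirmation conditions can be sketched as a single check that reports which of them hold. The function name, the reason labels, and the representation of traceroutes as lists of hop names are hypothetical; loop detection and the ICMP result are passed in as flags to keep the sketch small.

```python
def confirm_anomaly(local, baseline, remote_traces, dst,
                    loop_detected=False, icmp_unreachable=False):
    """Return the list of conditions that confirm an anomaly (empty if none)."""
    reasons = []
    if loop_detected:
        reasons.append("loop")
    if local != baseline:
        reasons.append("baseline_mismatch")
    local_reached = bool(local) and local[-1] == dst
    remote_reached = any(trace and trace[-1] == dst for trace in remote_traces)
    if not local_reached and remote_reached:
        reasons.append("local_unreachable")
    if icmp_unreachable:
        reasons.append("icmp_unreachable")
    return reasons


baseline = ["a", "b", "dst"]
local = ["a", "b"]            # stops short of the destination
remote = [["x", "y", "dst"]]  # another vantage point still gets through
print(confirm_anomaly(local, baseline, remote, "dst"))
```

An empty list means none of the conditions held, so the possible anomaly reported by MonD would not be confirmed.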

Page 16: Routing Loops

• Detected if the same sequence of hops is observed at least 3 times in a traceroute.

• Persistent loops: the traceroute stays in the loop until max hops.

• Temporary loops: the loop is resolved before max hops.

• Reprobes determine the duration of a persistent loop.
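A loop detector along these lines might look like the sketch below. It searches for a block of hops that occurs at least three times back to back, then labels the loop by whether the trace still reaches the destination. Requiring at least two distinct routers per cycle, and treating "trace ends at the destination" as "resolved before max hops", are assumptions of this sketch; the function names are hypothetical.

```python
def find_loop(hops, min_repeats=3):
    """Return the repeating hop sequence if it occurs min_repeats times in a row."""
    n = len(hops)
    for cycle_len in range(2, n // min_repeats + 1):
        for start in range(n - cycle_len * min_repeats + 1):
            cycle = hops[start:start + cycle_len]
            if len(set(cycle)) != cycle_len:
                continue  # assumption: a cycle visits distinct routers
            if all(hops[start + r * cycle_len:start + (r + 1) * cycle_len] == cycle
                   for r in range(1, min_repeats)):
                return cycle
    return None


def classify_loop(hops, dst):
    """Label a trace as 'none', 'temporary' or 'persistent'."""
    if find_loop(hops) is None:
        return "none"
    # Assumption: the trace is cut at max hops, so ending anywhere but the
    # destination means the loop never resolved.
    return "temporary" if hops[-1] == dst else "persistent"


print(classify_loop(["a", "b", "c", "b", "c", "b", "c"], "d"))
print(classify_loop(["a", "b", "c", "b", "c", "b", "c", "d"], "d"))
```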

Page 17: Measuring Scope

• Number of routers/ASes involved in the loop.

• Loop length: number of routers involved

• Temporary loops have longer lengths than persistent loops

• Persistent loops generally involve a single AS

• Loops mapped by tiers of the ASes involved

Page 18: Loop Effects

• Temporary loops overload routers

• Persistent loops cause loss of connectivity

• Degrade latency

Page 19: Reference Paths

• Distinguish between forward and reverse anomalies

• Scope of anomaly: hops between anomaly and end host

• Classify as either path change or path outage

• Evaluating reference paths

– Hazards: destination behind firewall, intermediate router filtering
– Firewall heuristics: choosing an appropriate distance n between host and anomaly
• 0 < RevHop(dst) - RevHop(Sx) < n

Page 20: Non-Loop Anomalies

• Comparing reference path (R) with local path (L)

– Path change: L diverges from R but reaches the last hop of R
– Path outage: L cuts out before R
– Path outage + path change: L diverges from R and does not arrive at R’s last hop

• Breakdown of all anomalies observed:

– Path Change: 48%
– Forward Outage: 10%
– Other: 24%
– Temporary: 18%
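The comparison of the local path L against the reference path R can be sketched as below. The sketch assumes that a path change means L differs from R but still ends at R's last hop, that an outage means L is a truncated prefix of R, and that the combined case means L both diverges from R and fails to reach R's last hop. The function name and labels are hypothetical.

```python
def classify_non_loop(reference, local):
    """Classify the local path against a non-empty reference path."""
    if local == reference:
        return "no anomaly"
    if local and local[-1] == reference[-1]:
        # Different hops on the way, same final hop.
        return "path change"
    if local == reference[:len(local)]:
        # Follows the reference path, then stops early.
        return "path outage"
    # Leaves the reference path and never reaches its last hop (assumed meaning).
    return "path outage + path change"


R = ["a", "b", "c", "dst"]
print(classify_non_loop(R, ["a", "x", "y", "dst"]))
print(classify_non_loop(R, ["a", "b"]))
print(classify_non_loop(R, ["a", "x", "y"]))
```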

Page 21: Path Changes

• Define scope: number of hops on R that could change their next-hop value

• Remote traceroutes from various locations find an intercept path

– The intercept path narrows the scope

• Find the relative location of the anomaly, e.g. near the host

– Find the distance of the path change by averaging the distances of all paths in scope

Page 22: Path Outage

• Distinguish between forward and reverse paths

• Forward path:

– Route change on forward path, in addition to outage
– ICMP destination unreachable
– Reported as timeout on forward path by MonD

• 35% of anomalies found to be forward timeouts (inferred by MonD)

– Indistinguishable without passive/active probes

Page 23: Path Change Detection - AS

Page 24: Bypassing Anomalies

• How many failures can be bypassed?

– For all clients with a reference path, 62,815 reachability failures
– Of these, PlanetSeer nodes were able to reach the destination in 27,263 cases (43% of failures)
– The same results were achieved using 15 vantage points as with all 30

• Bypass ratio: ratio between the minimum RTT of any bypass path and the RTT of the baseline path

– Improves latency in 23% of new paths
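The bypass figures can be reproduced with a few lines. The sketch takes the bypass ratio to be the minimum bypass RTT divided by the baseline RTT, so a value below 1 means the bypass path has lower latency than the baseline; the function name and the sample RTT values are invented for illustration.

```python
def bypass_ratio(bypass_rtts_ms, baseline_rtt_ms):
    """Minimum RTT over all bypass paths, relative to the baseline path RTT."""
    return min(bypass_rtts_ms) / baseline_rtt_ms


reachability_failures = 62_815
bypassable_failures = 27_263
bypass_fraction = bypassable_failures / reachability_failures
print(f"bypassable: {bypass_fraction:.0%} of failures")

# Hypothetical RTTs (ms) through three vantage points vs. the baseline path.
ratio = bypass_ratio([120.0, 80.0, 95.0], baseline_rtt_ms=100.0)
print(f"bypass ratio: {ratio:.2f}")
```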

Page 25: Related Work

• BGP: misconfiguration classification

– Locate origin via time, prefix, view

• Traceroute: path symmetry; 49% asymmetric, 91% persist for more than several hours

• Ping/Traceroute hybrids

Page 26: Conclusions

• Passive monitoring

– Enables much faster detection of anomalies
– Better resolution, temporary anomaly detection

• Failure distribution (AS topology)

– Tier 1 most stable, Tier 3 least stable

• Loop behaviour

– Temporary loops have much longer lengths
– Most span 4 routers

• Path change resolution

– 63% of outages occur within 3 hops of end host
– Over half confined to 2 ASes, 50% confined within 3 hops

• Alternate path discovery

– Largely unsuccessful; most outages near the network edge lack any redundancy