
Presentation by Michael Smathers, Usman Jafarey

CS395/495 IMRE, April 24, 2006

PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

• Characterizing path misbehavior requires a large volume of traffic data, which wide-area services carry:
– Peer-to-peer (P2P) systems
– Content distribution networks (CDN)

• Solution: Combine passive monitoring of wide area networks with active probes to quantify and characterize anomalies.

Detecting Path Anomalies

• Traceroute only maps forward path; difficult to infer if problem is with forward or reverse path without destination cooperation.
• BGP/OSPF propagate failure information. Traceroute may stop at a hop that is not the source of the failure.
• High variance in failure duration makes it difficult to respond in time.
• Few sites had enough coverage to identify all affected paths of a failure.

Traditional Detection…

• More accurate, complete view of failures thanks to geographical diversity of nodes
• Minimum overhead; active probing is initiated only after passive monitoring detects anomaly
• High rate of failure detection thanks to large volumes of traffic

Advantages of this approach

• Passively monitoring traffic on PlanetLab since February 2004 to detect anomalous behaviour

– Coordinate active probes between PlanetLab sites to confirm/characterize anomaly and measure scope

• ~90,000 anomalies confirmed each month with PlanetSeer.

PlanetLab Test Bed

Wide-area service network: CoDeeN
• 7-12K clients/day
• 100-200 GB/day
• 5-7 million requests/day
• 120 nodes in North America (350 world-wide)

Passive Monitoring Daemons (MonD) run on all CoDeeN nodes to detect anomalous TCP traffic behaviour.

Active Probing Daemons (ProbeD) run on all PlanetLab nodes, including CoDeeN nodes, awaiting requests from MonDs.

Components

1. MonD detects anomaly, sends request to local ProbeD.

2. ProbeD contacts ProbeDs on other nodes to coordinate planet-wide probe.

3. ProbeDs are organized in groups for distributed probe.

Operation

• Uses PlanetLab's tcpdump to observe all incoming and outgoing TCP packets.

• Uses this information to generate path and flow level statistics which are used to identify possible anomalies in real-time.

• Two indicators of anomalies:
– Change in the TTL (Time To Live) field
– Multiple consecutive timeouts (current threshold: 4 timeouts)
• If MonD is on the receiving side, ACKs are not reaching the sender, so we can assume the forward path is at fault.
• If MonD is the sender, we cannot determine from timeouts alone which path contains the problem.

MonD - Operation

When MonD is sender, maintain two variables for each flow:

• SendSeqNo, sequence number of most recently sent packet.

• SendRtxCount, count of times the packet has been retransmitted.

• CurrentSeqNo > SendSeqNo; flow is making progress, clear SendRtxCount and set SendSeqNo to current.

• CurrentSeqNo < SendSeqNo; fast retransmit. Set SendSeqNo to current.

• CurrentSeqNo = SendSeqNo, timeout; Increment SendRtxCount. If SendRtxCount exceeds threshold, MonD notifies ProbeD of possible anomaly.
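The three sender-side cases above can be sketched in Python. This is a minimal illustration, not MonD's actual code: the names `SenderFlowState` and `observe_outgoing` are hypothetical, sequence-number wraparound is ignored, and notifying on the 4th consecutive timeout (rather than the 5th) is an assumption about how "exceeds threshold" is applied.

```python
from dataclasses import dataclass

TIMEOUT_THRESHOLD = 4  # "Current threshold: 4 timeouts" from the slides


@dataclass
class SenderFlowState:
    """Per-flow state kept when MonD is on the sending side (hypothetical type)."""
    send_seq_no: int = -1    # SendSeqNo: sequence number of most recently sent packet
    send_rtx_count: int = 0  # SendRtxCount: times that packet has been retransmitted


def observe_outgoing(state: SenderFlowState, current_seq_no: int) -> bool:
    """Update state for one outgoing packet; return True if ProbeD should be notified."""
    if current_seq_no > state.send_seq_no:
        # Flow is making progress: clear the counter and advance.
        state.send_rtx_count = 0
        state.send_seq_no = current_seq_no
        return False
    if current_seq_no < state.send_seq_no:
        # Treated as fast retransmit on the slides: only move SendSeqNo back.
        state.send_seq_no = current_seq_no
        return False
    # Same sequence number again: counted as a timeout.
    state.send_rtx_count += 1
    # Assumption: notify once the count reaches the threshold.
    return state.send_rtx_count >= TIMEOUT_THRESHOLD
```

Feeding the same sequence number five times in a row (one original send plus four retransmissions) makes the last call return True; any later packet with a higher sequence number resets the counter.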

MonD - Timeout Detection

On the receiver side, MonD maintains the largest sequence number seen per flow. If the current packet has the same sequence number, it increments a counter.

When the counter hits the threshold, MonD notifies ProbeD that the sender is not seeing ACKs.
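A rough Python sketch of the receiver-side counter follows. The class name `ReceiverMonitor` and the flow key are hypothetical, and resetting the counter when a larger sequence number arrives is an assumption; the slides only describe the increment and the threshold.

```python
class ReceiverMonitor:
    """Tracks, per flow, how often the largest sequence number is seen again."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold
        self.max_seq = {}       # flow id -> largest sequence number seen
        self.repeat_count = {}  # flow id -> repeats of that sequence number

    def observe_incoming(self, flow_id, seq_no: int) -> bool:
        """Return True exactly when the repeat counter hits the threshold."""
        largest = self.max_seq.get(flow_id)
        if largest is None or seq_no > largest:
            # New data: remember it and (assumption) restart the counter.
            self.max_seq[flow_id] = seq_no
            self.repeat_count[flow_id] = 0
            return False
        if seq_no == largest:
            # The sender is resending the same data, so our ACKs are likely lost.
            self.repeat_count[flow_id] += 1
            return self.repeat_count[flow_id] == self.threshold
        return False  # older data; ignored in this sketch
```

Returning True only on the exact threshold crossing keeps a single stalled flow from generating repeated notifications.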

MonD - cont’d…

Three probing operations:

1. Baseline probes, run when new IP is added to MonD path table.

2. Forward probes, traceroutes invoked at multiple geographically distributed nodes when MonD detects anomaly. Rate limited: ProbeD will not forward probe the same destination more than once in 10 minutes.

3. Reprobes, if anomaly is confirmed by forward probe, reprobes sent by initial ProbeD to determine duration and effects of anomaly. Reprobes sent at 0.5, 1.5, 3.5 and 7.5 hours after anomaly detection time. Reprobes compared to original baseline and forward probes.
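The forward-probe rate limit and the reprobe schedule can be illustrated with a short Python sketch. The names `ForwardProbeLimiter` and `reprobe_times` are hypothetical, and timestamps are plain seconds supplied by the caller; only the 10-minute interval and the four reprobe offsets come from the slides.

```python
from typing import Dict, List

FORWARD_PROBE_INTERVAL_S = 10 * 60         # at most one forward probe per destination per 10 minutes
REPROBE_OFFSETS_H = (0.5, 1.5, 3.5, 7.5)   # hours after anomaly detection time


class ForwardProbeLimiter:
    """Remembers when each destination was last forward probed."""

    def __init__(self) -> None:
        self._last_probe: Dict[str, float] = {}

    def allow(self, destination: str, now: float) -> bool:
        """Return True and record the probe if the destination may be probed now."""
        last = self._last_probe.get(destination)
        if last is not None and now - last < FORWARD_PROBE_INTERVAL_S:
            return False
        self._last_probe[destination] = now
        return True


def reprobe_times(detected_at: float) -> List[float]:
    """Absolute times (seconds) at which reprobes would be sent."""
    return [detected_at + hours * 3600 for hours in REPROBE_OFFSETS_H]
```

The gaps between successive reprobes double (1, 2, then 4 hours), so short anomalies are sampled densely while long ones cost only a few extra probes.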

ProbeD

• 353 ProbeDs running on 145 PlanetLab sites.
• Distributed across North/South America, Europe, Asia and elsewhere.
• Membership information kept for ProbeDs to avoid unnecessary communication to dead nodes.
• 30 ProbeD node groups based on geographic diversity.
• ProbeD receives request from local MonD, then
– forwards request to one ProbeD from each group
– ProbeDs perform probe, send results to requester.
– originator collects data
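As a sketch of the fan-out step, the Python function below picks one live ProbeD from each group. The function name `pick_probers`, the group and node identifiers, and the random choice within a group are all assumptions; the slides only say that one ProbeD per group receives the request and that membership information is used to avoid dead nodes.

```python
import random


def pick_probers(groups, alive, rng=random):
    """Pick one live ProbeD per group; groups with no live member are skipped.

    groups: dict mapping group name -> list of node names
    alive:  set of node names currently believed to be up
    """
    chosen = []
    for members in groups.values():
        live = [node for node in members if node in alive]
        if live:
            # Assumption: any live member will do, so choose one at random.
            chosen.append(rng.choice(live))
    return chosen
```

With 30 groups this yields at most 30 remote probes per anomaly, regardless of how many ProbeDs are running in total.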

ProbeD - Operation

• 887,521 unique client IPs from 9,232 ASes.
• Probes traversed 10,090 ASes (over half the ASes on the Internet).
• 2,259,558 possible anomalies
• 271,898 confirmed

ProbeD - Dataset

• Unusable hops identified by * in place of name, removed. Relative hop count maintained.

• Missing hops found by comparing traceroutes that share destination.
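The first repair step can be shown with a small Python sketch that drops "*" hops while keeping each remaining hop's original position. The function name `drop_unusable_hops` and the list-of-names input format are assumptions made for illustration.

```python
def drop_unusable_hops(hops):
    """Remove '*' hops but keep each remaining hop's 1-based position.

    hops: list of hop names in traceroute order, with '*' for unusable hops.
    Returns a list of (hop_number, name) pairs.
    """
    return [(number, name)
            for number, name in enumerate(hops, start=1)
            if name != "*"]
```

Keeping the hop number alongside the name means distances between routers stay comparable across traceroutes even after the unusable entries are gone.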

Repairing Traceroute Data

Anomaly confirmed if any of the following conditions are met:

• There is a loop in the traceroute
• Local traceroute disagrees with baseline
• Local traceroute doesn't reach destination but other traceroutes make it
• Traceroute returns ICMP destination unreachable
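The four confirmation conditions translate fairly directly into a predicate. The Python sketch below is illustrative only: the `Trace` record and its fields are hypothetical, and "disagrees with baseline" is simplified to an inequality of hop lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Summary of one traceroute (hypothetical record for this sketch)."""
    hops: List[str]
    reached_destination: bool
    has_loop: bool = False
    icmp_unreachable: bool = False


def confirmation_reasons(local: Trace, baseline: Trace, remote: List[Trace]) -> List[str]:
    """Return the conditions that hold; the anomaly is confirmed if any does."""
    reasons = []
    if local.has_loop:
        reasons.append("loop in traceroute")
    if local.hops != baseline.hops:
        # Assumption: any difference in the hop list counts as disagreement.
        reasons.append("disagrees with baseline")
    if not local.reached_destination and any(t.reached_destination for t in remote):
        reasons.append("unreachable locally, reachable from elsewhere")
    if local.icmp_unreachable:
        reasons.append("ICMP destination unreachable")
    return reasons
```

Returning the list of matching conditions rather than a bare boolean keeps the reason for each confirmation available for later classification.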

Anomaly Detection

• Detected if same sequence observed at least 3 times in a traceroute.

• Persistent loops, traceroute stays in loops until max hops.

• Temporary loops, loops resolved before max hops.

• Reprobes determine duration of persistent loop.
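One way to read the loop rules above is sketched below in Python. It assumes that "same sequence observed at least 3 times" means a run of hops repeated back to back, and that a trace is persistent when it is still cycling at the hop limit; `find_loop`, `classify_loop` and the default of 30 maximum hops are choices made for this sketch.

```python
def find_loop(hops, min_repeats=3):
    """Return (start, period, repeats) for the first hop sequence that repeats
    consecutively at least min_repeats times, or None if there is none."""
    n = len(hops)
    for start in range(n):
        for period in range(1, (n - start) // min_repeats + 1):
            cycle = hops[start:start + period]
            repeats = 1
            pos = start + period
            while hops[pos:pos + period] == cycle:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return start, period, repeats
    return None


def classify_loop(hops, max_hops=30):
    """Return 'persistent', 'temporary', or None when no loop is found."""
    found = find_loop(hops)
    if found is None:
        return None
    start, period, repeats = found
    # Whatever follows the last full repetition of the cycle.
    tail = hops[start + period * repeats:]
    # Still looping if that tail is just the beginning of another repetition.
    still_looping = tail == hops[start:start + len(tail)]
    if len(hops) >= max_hops and still_looping:
        return "persistent"
    return "temporary"
```

For example, a trace that cycles between two routers three times and then reaches a new hop is classified as temporary, while one that is still cycling at hop 30 is classified as persistent.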

Routing Loops

• Number of routers/ASes involved in loop.
• Loop length – number of routers involved
• Temporary loops have longer lengths than persistent loops
• Persistent loops generally involve a single AS
• Loops mapped by tiers of ASes involved

Measuring Scope

• Temporary loops overload routers
• Persistent loops cause loss of connectivity
• Degrade latency

Loop Effects

• Distinguish between forward/reverse anomalies
• Scope of anomaly; hops between anomaly & end host
• Classify as either path change or path outage

• Evaluating Reference Paths
– Hazards; destination behind firewall, intermediate router filtering
– Firewall heuristics; choosing appropriate distance n between host & anomaly
• 0 < RevHop(dst) - RevHop(Sx) < n

Reference Paths

• Comparing reference path (R) with local path (L)
– Path change; L reaches last hop of R
– Path outage; L cuts out before R
– Path outage + Path change; L diverges from R, arrives at R’s last hop

• Breakdown of all anomalies observed:
– Path Change: 48%
– Forward Outage: 10%
– Other: 24%
– Temporary: 18%

Non-Loop Anomalies

• Define scope; number of hops on R that could change next hop value

• Remote traceroute from various locations, find intercept path
– Intercept path narrows scope

• Find relative location of anomaly, i.e. near host
– Find distance of path change by average distances of all paths in scope

Path Changes

• Distinguish between forward, reverse paths
• Forward path:
– Route change on forward path, in addition to outage
– ICMP destination unreachable
– Reported as timeout on forward path by MonD

• 35% of anomalies found to be Fwd Timeout (inferred by MonD)
– Indistinguishable without passive/active probes

Path Outage

Path Change Detection - AS

• How many failures can be bypassed?
– For all clients with reference path, 62,815 reachability failures
– Of these, PlanetSeer nodes able to reach destination in 27,263 cases (43% of failures)
– Same results achieved using 15 vantage points as all 30

• Bypass ratio; ratio between the minimum RTT of any bypass path and the RTT of the baseline path
– Improves latency in 23% of new paths
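The bypass figures above reduce to two small calculations, sketched here in Python. The function name `bypass_ratio` and the sample RTT values are hypothetical, and putting the bypass RTT in the numerator (so that a ratio below 1 means the bypass path is faster) is an assumption about the direction of the ratio.

```python
def bypass_ratio(bypass_rtts_ms, baseline_rtt_ms):
    """Minimum RTT over all bypass paths divided by the baseline path RTT.

    Assumption: a value below 1.0 means the best bypass path beats the baseline.
    """
    return min(bypass_rtts_ms) / baseline_rtt_ms


# Share of reachability failures that some PlanetSeer node could bypass,
# using the counts reported on the slide.
bypassable_percent = 27263 / 62815 * 100

# Hypothetical RTTs: two bypass paths at 120 ms and 80 ms, baseline at 100 ms.
example_ratio = bypass_ratio([120.0, 80.0], 100.0)
```

Here `bypassable_percent` rounds to the 43% quoted above, and `example_ratio` comes out at 0.8, which would count as a latency improvement.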

Bypassing Anomalies

• BGP – misconfiguration classification– Locate origin via time, prefix, view

• Traceroute; Path symmetry; 49% asymmetric, 91% persist for more than several hours

• Ping/Traceroute hybrids

Related Work

• Passive Monitoring
– Enables much faster detection of anomalies
– Better resolution, temporary anomaly detection

• Failure distribution (AS topology)
– Tier 1 most stable, Tier 3 least stable

• Loop Behaviour
– Temporary loops have much longer lengths
– Most span 4 routers

• Path Change resolution
– 63% of outages occur within 3 hops of end host
– Over half confined to 2 ASes, 50% confined within 3 hops

• Alternate path discovery
– Largely unsuccessful, most outages near network edge lack any redundancy

Conclusions
