An Algebraic Approach to Practical and Scalable Overlay Network Monitoring
Yan Chen, David Bindel, Hanhee Song, Randy H. Katz
Presented by Mahesh Balakrishnan
Motivation
Overlay networks
Monitoring of end-to-end paths
The need for a separate Monitoring Service
Metrics: Latency... Loss Rate?
The Goal: A Scalable Overlay Loss Rate Monitoring Service
Existing Work…
Latency-only Schemes
Clustering:
– Nodes are clustered together, and cluster representative is monitored
– Claim: Inaccurate for congestion detection
Co-ordinates:
– Cannot give congestion information
Existing Work.
Network Tomography: Determining internal network properties from black-box measurements
Shavitt, et al.: Algebraic approach
Ozmutlu, et al.: Selecting minimal set of paths to cover all links
General Metric Systems: RON
Core Idea
Assumptions:
– Access to link composition of paths
– Ability to measure path (but not link) characteristics
From the possible $n^2$ end-to-end paths, select a basis set of k paths ($k \ll n^2$) to monitor.
The characteristics of all paths can be inferred from this basis set.
Centralized algorithm: all nodes send measurements to central node.
The Math
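Eq 1: Represent paths as vectors. For path $p_1$, which traverses links $l_1$ and $l_2$:

\[ 1 - p_1 = (1 - l_1)(1 - l_2), \qquad v = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix}^T \]

Taking logarithms makes the relation linear:

\[ \log(1 - p_1) = \begin{bmatrix} 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} \log(1 - l_1) \\ \log(1 - l_2) \\ \log(1 - l_3) \end{bmatrix} = \log(1 - l_1) + \log(1 - l_2) \]

(Figure: example network with nodes A, B, C, D; links $l_1$, $l_2$, $l_3$; path $p_1$; labels AD, BD, AC.)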
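The log transform above can be checked numerically. The following is a minimal sketch, not code from the talk: the link loss rates are made-up values and the function name `path_loss_from_links` is hypothetical.

```python
import math

def path_loss_from_links(link_loss, v):
    """Compose per-link loss rates into a path loss rate via the log transform."""
    # x_j = log(1 - l_j) for each link; b = v^T x = log(1 - p) for the path.
    b = sum(vj * math.log(1.0 - lj) for vj, lj in zip(v, link_loss))
    return 1.0 - math.exp(b)

link_loss = [0.01, 0.05, 0.10]  # hypothetical loss rates for l1, l2, l3
v = [1, 1, 0]                   # p1 traverses l1 and l2 but not l3
p1 = path_loss_from_links(link_loss, v)
```

The result agrees with the product form, since 1 - p1 = (1 - 0.01)(1 - 0.05).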
System of Linear Equations
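\[ G x = b, \qquad G \in \{0|1\}^{r \times s}, \quad x \in \mathbb{R}^{s \times 1}, \quad b \in \mathbb{R}^{r \times 1} \]

G: Path Matrix; x: Link Rates; b: Path Rates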
Example Network
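(Figure: example network with nodes A, B, C, D; links $l_1$, $l_2$, $l_3$; path $p_1$.)
Paths: AB, AC, BC

\[ G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad G x = b \]

k = Number of essential paths, 1 < k <= s
G is rank deficient: k < s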
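The rank deficiency of the example can be verified directly. This sketch assumes the example path matrix has rows (1, 1, 0), (0, 0, 1) and (1, 1, 1), and uses NumPy, which the talk does not mention.

```python
import numpy as np

# Assumed example path matrix: one row per path, one column per link.
G = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
])

r, s = G.shape                   # r paths, s links
k = np.linalg.matrix_rank(G)     # number of essential (linearly independent) paths
```

Here k comes out smaller than s because the third row is the sum of the first two, so measuring two paths suffices to infer the third.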
More Math
k = # of essential paths = rank(G), k <= s
Usually G is rank-deficient: k < s
Select k linearly independent paths to monitor:

\[ \bar{G} x_G = \bar{b} \]

One-time QR Decomposition: $O(rk^2)$ time… $O(n^4)$!
Inferring other paths: $O(k^2)$
(Figure: dimensions k and s of the reduced system.)
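Basis selection and inference can be illustrated with a small sketch. Note that this swaps in a greedy rank check and a minimum-norm least-squares solve for clarity; it is not the QR-based procedure cited on the slide and is much slower. The function names and the link loss rates are hypothetical.

```python
import numpy as np

def select_basis(G):
    """Greedily pick indices of linearly independent rows (paths) of G."""
    basis = []
    for i in range(G.shape[0]):
        # Keep row i only if it raises the rank of the rows chosen so far.
        if np.linalg.matrix_rank(G[basis + [i]]) == len(basis) + 1:
            basis.append(i)
    return basis

def infer_all_paths(G, basis, b_measured):
    """Infer b for every path from measurements on the basis paths only."""
    G_bar = G[basis]
    # Minimum-norm solution of G_bar x = b_measured; it lies in the row space of G.
    x_G, *_ = np.linalg.lstsq(G_bar, b_measured, rcond=None)
    return G @ x_G

G = np.array([[1, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
x_true = np.log(1.0 - np.array([0.01, 0.05, 0.10]))  # hypothetical link loss rates
b_true = G @ x_true

basis = select_basis(G)
b_inferred = infer_all_paths(G, basis, b_true[basis])
path_loss = 1.0 - np.exp(b_inferred)
```

Individual link rates are generally not recoverable this way, but every path rate is, because each row of G is a linear combination of the basis rows.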
Assessment Criteria
Accuracy
Scalability: How does k grow w.r.t. n?
Other concerns:
– centralized solution
– compute time under churn
– storage load
Effect of Topology on k growth
Star Topology, Strict Hierarchy: s = O(n) => k = O(n)
Clique: Each path (end host pair) contains a unique link, hence $k = O(n^2)$
Hierarchy is good, Dense Connectivity is bad
Conjecture: $k = O(n \log n)$ for the Internet
What if only a small % of end nodes are on the overlay?
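The two extreme cases can be reproduced on toy topologies. This sketch assumes undirected host pairs as paths (so n(n-1)/2 rows rather than n^2), and the helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def star_path_matrix(n):
    """Path matrix for n end hosts, each attached to one hub by its own link."""
    pairs = list(combinations(range(n), 2))
    G = np.zeros((len(pairs), n), dtype=int)
    for row, (i, j) in enumerate(pairs):
        G[row, i] = G[row, j] = 1   # a path uses the access links of both hosts
    return G

def clique_path_matrix(n):
    """Path matrix when every host pair is joined by its own direct link."""
    return np.eye(n * (n - 1) // 2, dtype=int)

n = 6
k_star = np.linalg.matrix_rank(star_path_matrix(n))      # grows like n
k_clique = np.linalg.matrix_rank(clique_path_matrix(n))  # grows like n^2
```

For n = 6 the star needs only 6 monitored paths out of 15, while the clique needs all 15.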
Linear Regression Tests
(Figure panels: Synthetic Hierarchical; Real.)
Handling Change
Path Addition: $O(k^2)$
Path Removal: $O(k^2)$ [Naïve: $O(rk^2)$]
Node Addition: $O(nk^2)$
Node Removal: $O(nk^2)$
– Cannot use path removal algorithm directly; path will be replaced using another path involving node
– Remove all paths, then look for replacements
Cubic in n: Churn in large systems?
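The question behind path addition is whether a new path extends the basis or is already covered by it. The sketch below answers that with a from-scratch least-squares test; it is an illustration only and does not implement the incremental $O(k^2)$ update the slide refers to. The function name and tolerance are assumptions.

```python
import numpy as np

def path_is_redundant(G_bar, v, tol=1e-9):
    """Return True if path vector v is a linear combination of the rows of G_bar."""
    # Solve G_bar^T c ~= v in the least-squares sense and inspect the residual.
    coeffs, *_ = np.linalg.lstsq(G_bar.T, v, rcond=None)
    return bool(np.linalg.norm(G_bar.T @ coeffs - v) < tol)

G_bar = np.array([[1, 1, 0], [0, 0, 1]], dtype=float)  # current basis paths
covered = path_is_redundant(G_bar, np.array([1.0, 1.0, 1.0]))   # sum of both rows
new_info = path_is_redundant(G_bar, np.array([1.0, 0.0, 0.0]))  # splits l1 from l2
```

A redundant path can be inferred without monitoring; a non-redundant one must join the basis.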
Routing Changes
End-to-end Internet paths are generally stable
Traceroute topology checked on a daily basis, in presence of drastic loss rate changes
If path has changed at certain links, other paths with that link are checked as well
Load Balancing/Topology Measurement Errors
Paths in G are randomly reordered before basis set is selected
Untraceable paths/segments are modeled as single links; they always get selected in basis
Router aliases – one physical link presented as several virtual links – all virtual links get similar loss rates
Evaluation: Simulation
Three synthetic BRITE topologies: Barabasi-Albert, Waxman, hierarchical
One ‘real’ router topology (Mercator)
Methodology:
– Loss Distribution: Good = 0-1%, Bad = 5-10%
– Loss Model: Bernoulli, Gilbert
Simulate loss for selected paths, infer for other paths
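The two loss models named above can be sketched as follows. This is a generic illustration, not the simulator used in the evaluation: the Gilbert model here is the common two-state Markov chain in which every packet sent in the bad state is lost, and the transition probabilities are hypothetical parameters.

```python
import random

def bernoulli_losses(loss_rate, n_packets, rng):
    """Each packet is lost independently with probability loss_rate."""
    return [rng.random() < loss_rate for _ in range(n_packets)]

def gilbert_losses(p_good_to_bad, p_bad_to_good, n_packets, rng):
    """Two-state Markov chain; packets sent while in the bad state are lost."""
    bad = False
    lost = []
    for _ in range(n_packets):
        if bad:
            bad = rng.random() >= p_bad_to_good  # stay bad unless we recover
        else:
            bad = rng.random() < p_good_to_bad   # enter a loss burst
        lost.append(bad)
    return lost

rng = random.Random(0)  # fixed seed so runs are repeatable
bernoulli_rate = sum(bernoulli_losses(0.05, 100_000, rng)) / 100_000
gilbert_rate = sum(gilbert_losses(0.01, 0.19, 200_000, rng)) / 200_000
```

With these parameters both models have a long-run loss rate near 5%, but the Gilbert model produces losses in bursts rather than independently.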
Accuracy: Synthetic Topology
All Configurations under 0.008, 1.18
Accuracy: Real Topology
Accuracy
(Figure panels: Real Topology; Synthetic Hierarchical Topology.)
Running Time
3 seconds for 100 nodes, 21 minutes for 500!
Load Balancing
Effect of Churn/Routing Change
Path Addition: 125 msec
Path Removal: 445 msec
Node Addition: 1.18 sec
Node Removal: 16.9 sec
What about n >> 60?
Node Deletion
Node Addition
Network Link Removal
PlanetLab Experiments
51 hosts, each from a different organization
Each node sends a UDP packet to every other host in each trial
300 trials of 300 msec each
Receiver counts packets for loss rate
Traceroute used for topology measurement
PlanetLab Results
(Figure panels: Cumulative coverage/FP; Cumulative error (Worst Run).)
Average Abs. Error = 0.0027, Average Error Factor = 1.1
Effect of traffic on loss rates
Sensitivity Analysis done at night, on empty networks
Threshold at 12.8 Mbps
Why do this?
Conclusion
Algebraic Method for inferring loss rates of all paths from a basis set
Quite Accurate
Reasonable load imposed on each node
But is it really scalable? Centralized solution, cubic dependence on n for handling node addition/removal