Multisite Internet Data Analysis
Alfred O. Hero, Clyde Shih, David Barsic University of Michigan - Ann Arbor
http://www.eecs.umich.edu/~hero
Research supported in part by: NSF CCR-0325571
1. Network Data Collection
2. Distributed Data Analysis
   2.1 Dimension Reduction
   2.2 Model-Based Data Analysis
3. Conclusions
1. Network Data Collection
• Objectives
– Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate global network state while satisfying information privacy constraints.
– Local: collection sites gather data relevant to local network state and share information as necessary to enhance local analysis.
• Types of data measured
– Active: queries and requests, packet probes
– Passive: netflow, router fields, honeypots, backscatter
[Figure: network diagram showing ISP 1, ISP 2, and ISP 3, a local data collection and probing site, a data collection site, and a monitoring center. Legend: data collector.]
Abilene Netflow Data
[Figure: Protocol. Row and column labels: No. Flows, Avg. Duration, Std. Duration, Avg Packets, Std. Packets, Avg Bytes, Std. Bytes. Annotations: Dataset 1, Dataset 2.]
Abilene Netflow Data
[Figure: Router. Row and column labels: No. Flows, Avg. Duration, Std. Duration, Avg Packets, Std. Packets, Avg Bytes, Std. Bytes. Annotations: Dataset 1, Dataset 2.]
Abilene Netflow Data
[Figure: Total Number of Flows for Data Set 1. x-axis: Time in sec. (0 to 200); y-axis: Num of Flows (5.5 to 8 ×10^4).]
Challenges and Approaches
• Challenges
– High-dimensional measurement space
– Non-linear dependencies and non-stationarity
– Privacy and proprietary concerns
– Insufficient bandwidth for continuously sampled data
• Approaches
– Dimension reduction
– Model-based distributed inference
– Controlled information sharing
– Hierarchical and modular collection/analysis
2. Distributed Data Analysis
• Hypothesis: data collected at sites A,B,C follow a statistical distribution defined over a lower dimensional manifold.
• Overall objective: find distributed strategies that perform reliable statistical inference with a minimum of data sharing.
[Figure: Site A, Site B, and Site C, each sampling.]
2.1 Distributed Dimension Reduction
[Figure: diagram with labels Unknown Distribution, Observed Sample, Unknown Manifold, Unknown Embedding.]
GMST Dimension Estimation
[Figure, two panels, each titled "MST Length for 3 Land Vehicles (=1, m=20)". x-axis: n; y-axis: Ln. First panel: n from 0 to 1200, Ln from 0 to 9 ×10^5. Second panel (zoom): n from 1020 to 1038, Ln from 8.5 to 8.64 ×10^5.]
GMST estimates: d = 13, H = 120 (bits)
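The dimension estimate rests on how the MST length Ln grows with the sample size n. A minimal Python sketch of that idea follows, under loudly stated assumptions: it uses plain Euclidean distances rather than the geodesic distances of the GMST, fits the slope a of log Ln against log n by least squares, and reads off d = γ/(1 − a), which follows if Ln grows like n^((d−γ)/d). The names `mst_length` and `estimate_dimension`, the subsample sizes, and the seed are illustrative and do not come from the slides.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of |edge|**gamma over the Euclidean MST (Prim's algorithm, O(n^2))."""
    n = len(points)
    best = [math.inf] * n    # cheapest known link from each point to the tree
    in_tree = [False] * n
    best[0] = 0.0            # start the tree at the first point
    total = 0.0
    for _ in range(n):
        # Add the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def estimate_dimension(points, sizes, gamma=1.0, seed=0):
    """Fit log L_n = a*log n + b over random subsamples; return gamma / (1 - a).

    Assumption: Euclidean distances stand in for geodesic ones, which is
    only reasonable when the data lie on a nearly flat manifold.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for n in sizes:
        subsample = rng.sample(points, n)
        xs.append(math.log(n))
        ys.append(math.log(mst_length(subsample, gamma)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return gamma / (1.0 - slope)
```

Calling `estimate_dimension(points, [50, 100, 200, 400])` on points drawn from a two-dimensional sheet embedded in a higher-dimensional space should return a value near 2. The estimate is noisy for small subsamples, so averaging over several resamplings is a natural refinement.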
Distributed GMST Estimator
• Principal MST convergence result:
• Distributed BHH (Aggregation rule):
• Tight upper and lower bounds on the limit are available if the sites exchange rooted dual graphs [Yukich:97]
BHH Theorem:
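The formulas on this slide did not survive extraction. For orientation only, the standard statement of the Beardwood-Halton-Hammersley (BHH) theorem for power-weighted minimal spanning trees is sketched below. The notation (L_γ, β_{d,γ}, f) is the conventional one and is an assumption here; the slide's own symbols and its distributed aggregation rule may differ.

```latex
% Assumptions: X_1,...,X_n are i.i.d. with density f supported on a compact
% subset of R^d, the edge exponent satisfies 0 < \gamma < d, and
% L_\gamma(\mathcal{X}_n) is the MST length with each edge weighted |e|^\gamma.
\[
  \lim_{n\to\infty} \frac{L_\gamma(\mathcal{X}_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f(x)^{(d-\gamma)/d}\,dx
  \quad \text{almost surely},
\]
% where \beta_{d,\gamma} depends only on d and \gamma. Hence, for large n,
\[
  \log L_\gamma(\mathcal{X}_n) \approx \frac{d-\gamma}{d}\,\log n
  + \log\!\Big(\beta_{d,\gamma} \int f(x)^{(d-\gamma)/d}\,dx\Big).
\]
```

The slope of the log-log relation carries the dimension d, and the intercept carries an integral of the density that is related to its Rényi entropy of order (d−γ)/d.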
2.2 Distributed Model-based Inference
• Global likelihood model
• Global M-estimator recursion:
– Global Fisher score function
• Local Fisher score functions
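To make the structure concrete: when the sites' data are independent, the global log-likelihood is a sum of local log-likelihoods, so the global score is the sum of the local scores. The Python sketch below illustrates this with a Fisher-scoring update, which is one possible recursion and not necessarily the one on the slide. The exponential model, the `Site` class, and `global_recursion` are hypothetical choices made for illustration.

```python
class Site:
    """A collection site holding private samples.

    Assumption: samples are i.i.d. exponential with unknown rate lam,
    so the local log-likelihood is n*log(lam) - lam*sum(x).
    """

    def __init__(self, samples):
        self._n = len(samples)
        self._total = sum(samples)

    def local_score(self, lam):
        # Derivative of the local log-likelihood with respect to lam.
        return self._n / lam - self._total

    def local_information(self, lam):
        # Fisher information of n exponential samples at rate lam.
        return self._n / lam ** 2


def global_recursion(sites, lam0, iters=20):
    """Fisher scoring on the global likelihood using only per-site summaries.

    Each update needs one score and one information value from every site;
    raw samples never leave a site. The starting point lam0 must be positive
    and reasonably close to the solution for the iteration to converge.
    """
    lam = lam0
    for _ in range(iters):
        score = sum(s.local_score(lam) for s in sites)
        info = sum(s.local_information(lam) for s in sites)
        lam = lam + score / info
    return lam
```

With two sites holding `[0.5, 1.5]` and `[1.0, 1.0, 2.0]`, the recursion started at 0.5 settles at the pooled maximum-likelihood rate, the number of samples divided by their sum, without any site revealing its individual samples.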
Properties
• Communication requirement: 2p bytes/update/site.
• If the data are independent, the recursion attains stationary points of the global likelihood.
• All local MLEs are available to each site.
• For a multimodal likelihood, the local MLEs can be improved by aggregation under a mixture model.
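Key Theoretical Result
• The asymptotic distribution of local estimates is a Gaussian mixture dependent on the global likelihood
• Parameters
\[
f_{\hat{\theta}}(t) = \sum_{m=0}^{M} \frac{p_m}{(2\pi)^{K/2}\,|C_m|^{1/2}}
\exp\!\left(-\frac{1}{2}\,(t-\theta_m)^T C_m^{-1} (t-\theta_m)\right)
\]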
Proof: asymptotic normal theory of local maxima (Huber:67): see Blatt&Hero:2003
Sample Covariance Analysis
Local Estimator Aggregation Algorithm
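[Figure: block diagram. Estimator 1, Estimator 2, …, Estimator N feed "Estimation of Gaussian Mixture Parameters (FS, EM, …)", followed by "Aggregation to Final Estimate". A second plot shows the ambiguity function with a local maximum and the global maximum marked.]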
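The pipeline in the diagram (collect local estimates, fit a mixture to them, aggregate) can be sketched in a few lines of Python. This is a rough illustration under several assumptions, not the authors' algorithm: estimates are scalar, a hard two-cluster split (1-D k-means) stands in for the mixture fit by FS, EM, or CEM2, and the cluster around the global maximum is taken to be the one whose spread best matches a caller-supplied variance predicted from the Fisher information. The names `two_means_1d`, `aggregate`, and `expected_variance` are hypothetical.

```python
from statistics import fmean, pvariance


def two_means_1d(values, iters=50):
    """Split scalar estimates into at most two clusters (1-D k-means).

    Stand-in for a proper Gaussian mixture fit; centers start at the
    smallest and largest value.
    """
    centers = [min(values), max(values)]
    groups = [list(values), []]
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        new_centers = [fmean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return [g for g in groups if g]   # drop an empty cluster, if any


def aggregate(local_estimates, expected_variance):
    """Return the mean of the cluster judged to surround the global maximum.

    expected_variance(theta) is an assumption of this sketch: the variance a
    local estimate would have near theta if theta were the global maximum,
    for example the inverse Fisher information divided by the sample size.
    """
    clusters = two_means_1d(local_estimates)

    def mismatch(cluster):
        return abs(pvariance(cluster) - expected_variance(fmean(cluster)))

    return fmean(min(clusters, key=mismatch))
```

The design choice worth noting is the discrimination step: clustering alone does not say which cluster is the right one, so some extra criterion is needed, and comparing each cluster's empirical spread with a theoretical prediction is one candidate.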
IID Observation Model:
• Each site observes a 2-component Gaussian mixture
• Identical component variances
• Unknown mixing parameters
• Unknown component means
• 200 data collection sites
• 100 samples/site
• CEM2 algorithm implemented for estimation and aggregation
Simple Example
[Figure: plot with both axes running from 0 to 3; the axis labels carry the indices 1 and 2.]
Clustering and Discrimination
[Figure annotations: global maximum, inverse FIM, local maximum; empirically estimated covariances via CEM2.]
Conclusions
• Lossless distributed dimension reduction and model-based inference require:
– Reliable local inference methods
– Aggregation rules for combining local statistics
• Information sharing constraints?
• Effects of bandwidth constraints - data compression?
• Tracking in dynamical models?
References
• A. O. Hero, B. Ma, O. Michel and J. D. Gorman, “Application of entropic graphs,” IEEE Signal Processing Magazine, Sept 2002.
• J. Costa and A. O. Hero, “Manifold learning with geodesic minimal spanning trees,” accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
• D. Blatt and A. Hero, “Asymptotic distribution of log-likelihood maximization based algorithms and applications,” in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
• M.F. Shih and A. O. Hero, “Unicast-based inference of network link delay distributions using mixed finite mixture models,” IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
• N. Patwari, A. O. Hero, and B. Sadler, “Hierarchical censoring sensors for change detection,” Proc. of SSP, St. Louis, Sept. 2003.