Multisite Internet Data Analysis
![Page 1: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/1.jpg)
Multisite Internet Data Analysis
Alfred O. Hero, Clyde Shih, David Barsic
University of Michigan - Ann Arbor
[email protected]
http://www.eecs.umich.edu/~hero
Research supported in part by: NSF CCR-0325571
1. Network Data Collection
2. Distributed Data Analysis
   1. Dimension Reduction
   2. Model-Based Data Analysis
3. Conclusions
![Page 2: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/2.jpg)
1. Network Data Collection
• Objectives
– Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate global network state while respecting information privacy constraints
– Local: collection sites gather data relevant to local network state and share information as necessary to enhance local analysis.
• Types of data measured
– Active: queries and requests, packet probes
– Passive: netflow, router fields, honeypots, backscatter
![Page 3: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/3.jpg)
[Figure: network diagram with ISP 1, ISP 2, and ISP 3, local data collection and probing sites, data collection sites, and a Monitoring Center. Legend: data collector.]
![Page 4: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/4.jpg)
Abilene Netflow Data
[Figure: flow statistics by protocol for Dataset 1 and Dataset 2. Quantities shown: No. Flows, Avg. Duration, Std. Duration, Avg Packets, Std. Packets, Avg Bytes, Std. Bytes.]
![Page 5: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/5.jpg)
Abilene Netflow Data
[Figure: flow statistics by router for Dataset 1 and Dataset 2. Quantities shown: No. Flows, Avg. Duration, Std. Duration, Avg Packets, Std. Packets, Avg Bytes, Std. Bytes.]
![Page 6: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/6.jpg)
Abilene Netflow Data
[Figure: "Total Number of Flows for Data Set 1". x-axis: Time in sec. (0 to 200); y-axis: Num of Flows (5.5 to 8, x 10^4).]
![Page 7: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/7.jpg)
Challenges and Approaches
• Challenges
– High dimensional measurement space
– Non-linear dependencies and non-stationarity
– Privacy and proprietary concerns
– Insufficient bandwidth for continuously sampled data
• Approaches
– Dimension reduction
– Model-based distributed inference
– Controlled information sharing
– Hierarchical and modular collection/analysis
![Page 8: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/8.jpg)
Hierarchical Architecture
![Page 9: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/9.jpg)
2. Distributed Data Analysis
• Hypothesis: data collected at sites A, B, C follow a statistical distribution defined over a lower-dimensional manifold.
• Overall objective: find distributed strategies to perform reliable statistical inference with a minimum amount of data sharing
[Figure: Site A, Site B, Site C]
![Page 10: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/10.jpg)
2.1 Distributed Dimension Reduction
[Figure: diagram labeled Unknown Manifold, Unknown Embedding, Unknown Distribution, Sampling, Observed Sample]
![Page 11: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/11.jpg)
Geodesic Entropic Graphs
A Planar Sample and its Euclidean MST
![Page 12: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/12.jpg)
GMST Dimension Estimation
[Figure: two panels titled "MST Length for 3 Land Vehicles (=1,m=20)", plotting MST length L_n (x 10^5) against n. Left panel: n from 0 to 1200. Right panel: zoom on n from 1020 to 1038, L_n from 8.5 to 8.64.]
GMST estimates: d=13, H=120 (bits)
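The dimension estimate on this slide rests on how MST length grows with sample size: if L_n grows like n^((d-γ)/d), the slope a of log L_n against log n gives d = γ/(1-a). A minimal Python sketch of that idea follows. It is not the authors' implementation: it assumes Euclidean rather than geodesic edge lengths, a synthetic 2-D surface embedded in 3-D in place of real data, and a plain least-squares fit. The names `mst_length`, `sample_surface`, and `estimate_dimension` are hypothetical.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of edge lengths raised to the power gamma over the Euclidean MST.

    Prim's algorithm, O(n^2); adequate for the small samples used here.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from the tree to each point
    best[0] = 0.0          # the tree starts at point 0, which adds no edge
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def sample_surface(n, rng):
    """Assumed toy data: n points on a curved 2-D surface embedded in 3-D."""
    pts = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        pts.append((u, v, u * v))
    return pts


def estimate_dimension(sizes, trials, rng, gamma=1.0):
    """Fit log L_n = a log n + b, then invert a = (d - gamma) / d for d."""
    xs, ys = [], []
    for n in sizes:
        mean_len = sum(mst_length(sample_surface(n, rng), gamma)
                       for _ in range(trials)) / trials
        xs.append(math.log(n))
        ys.append(math.log(mean_len))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return gamma / (1.0 - slope)


d_hat = estimate_dimension([50, 100, 200, 400], trials=5, rng=random.Random(0))
print(f"estimated intrinsic dimension: {d_hat:.2f}")
```

The estimate is a real number; rounding it to the nearest integer gives a dimension estimate for the toy surface, whose intrinsic dimension is 2.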
![Page 13: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/13.jpg)
Distributed GMST Estimator
• Principal MST convergence result (BHH Theorem):
• Distributed BHH (Aggregation rule):
• Tight upper and lower bounds on the limit if sites exchange rooted dual graphs [Yukich:97]
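The formulas on this slide did not survive in the transcript. For orientation only, the standard Euclidean form of the BHH (Beardwood-Halton-Hammersley) theorem, as generalized to power-weighted MSTs, can be written as below. The notation ($L_\gamma$, $\beta_{d,\gamma}$, $f$) is assumed here and may differ from the slide's, and the distributed aggregation rule is not reconstructed.

```latex
% X_1, ..., X_n i.i.d. with density f of compact support in R^d;
% L_gamma = sum of MST edge lengths raised to the power gamma, 0 < gamma < d;
% beta_{d,gamma} is a constant that does not depend on f.
\lim_{n\to\infty} \frac{L_\gamma(X_1,\dots,X_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f(x)^{(d-\gamma)/d}\,dx \qquad \text{(a.s.)}
```

The integral on the right is a monotone function of the Rényi entropy of order $\alpha = (d-\gamma)/d$, $H_\alpha(f) = \frac{1}{1-\alpha}\log\int f(x)^{\alpha}\,dx$, which is why the growth rate of the MST length carries information about the dimension and its scale about the entropy.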
![Page 14: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/14.jpg)
2.2 Distributed Model-based Inference
• Global likelihood model
• Global M-estimator recursion:
– Global Fisher score function
• Local Fisher score functions
![Page 15: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/15.jpg)
Distributed M-estimator
[Figure: flow diagram in which sites A and B each run a Compute step followed by k=k+1]
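A minimal sketch of how a score-based recursion of this kind can work is given below. It assumes, beyond what the transcript states, that sites hold independent data so that the global Fisher score is the sum of the local scores, and it uses a toy Gaussian-mean model with known variance and a fixed step size. Each site shares only its local score (a single number here, since p = 1) per update. The names `local_score` and `distributed_m_estimate` are hypothetical.

```python
def local_score(data, theta, sigma2=1.0):
    """Local Fisher score for a Gaussian mean with known variance sigma2."""
    return sum(x - theta for x in data) / sigma2


def distributed_m_estimate(sites, theta0=0.0, sigma2=1.0, iters=200):
    """Gradient-ascent recursion driven by the sum of local scores."""
    n_total = sum(len(data) for data in sites)
    step = sigma2 / (2.0 * n_total)  # assumed step size: half the Newton step
    theta = theta0
    for _ in range(iters):
        # each site transmits only its score; raw data never leave the site
        global_score = sum(local_score(data, theta, sigma2) for data in sites)
        theta += step * global_score
    return theta


# toy data held at three sites (illustrative values only)
sites = [[1.0, 2.0, 3.0], [4.0, 6.0], [2.0, 2.0, 2.0, 2.0]]
theta_hat = distributed_m_estimate(sites)
pooled_mean = sum(sum(d) for d in sites) / sum(len(d) for d in sites)
print(theta_hat, pooled_mean)
```

For this model the global likelihood has a single stationary point, the pooled sample mean, so the recursion reaches the same answer a centralized fit would without any site revealing its samples.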
![Page 16: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/16.jpg)
Properties
• Communication requirement: 2p bytes/update/site.
• If the data are independent, the recursion attains stationary points of the global likelihood.
• All local MLEs are available to each site.
• For a multimodal likelihood, improvement on the local MLEs can be achieved by aggregation under a mixture model.
![Page 17: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/17.jpg)
Global Likelihood Function
[Figure: likelihood curve with the global maximum and local maxima marked; x marks along the axis show the local MLEs]
![Page 18: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/18.jpg)
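Key Theoretical Result
• The asymptotic distribution of local estimates is a Gaussian mixture dependent on global likelihood

$$f_{\hat{\theta}}(t) = \sum_{m=0}^{M} \frac{p_m}{(2\pi)^{K/2}\,|C_m|^{1/2}} \exp\left(-\frac{1}{2}(t-\theta_m)^T C_m^{-1}(t-\theta_m)\right)$$

• Parameters: $p_m$, $\theta_m$, $C_m$
Proof: asymptotic normal theory of local maxima (Huber:67); see Blatt & Hero:2003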
![Page 19: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/19.jpg)
Local Estimator Aggregation Algorithm
[Figure: block diagram. Estimator 1, Estimator 2, ..., Estimator N feed Estimation of Gaussian Mixture Parameters (FS, EM, ...), together with Sample Covariance Analysis, followed by Aggregation to Final Estimate.]
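The pipeline on this slide can be sketched as follows, under loud assumptions: the local estimates are scalars, the mixture has exactly two components, plain EM stands in for the unspecified "FS, EM…" step, and the final estimate is taken as the mean of the cluster whose sites report the higher average log-likelihood. That selection rule is an illustration, not the rule from the talk, and the data are synthetic. The names `em_two_gaussians` and `aggregate` are hypothetical.

```python
import math
import random


def em_two_gaussians(xs, iters=100):
    """EM for a 1-D two-component Gaussian mixture fitted to local estimates."""
    mu = [min(xs), max(xs)]  # assumed initialization at the extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each local estimate
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-0.5 * (x - mu[k]) ** 2 / var[k])
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: update weights, means, and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var


def aggregate(estimates, logliks):
    """Assumed rule: pick the cluster with the higher mean reported log-likelihood."""
    w, mu, var = em_two_gaussians(estimates)
    sums, counts = [0.0, 0.0], [0, 0]
    for x, ll in zip(estimates, logliks):
        k = 0 if abs(x - mu[0]) <= abs(x - mu[1]) else 1  # hard assignment
        sums[k] += ll
        counts[k] += 1
    avg = [sums[k] / counts[k] if counts[k] else -math.inf for k in range(2)]
    best = 0 if avg[0] >= avg[1] else 1
    return mu[best], (w, mu, var)


rng = random.Random(1)
estimates, logliks = [], []
for _ in range(200):  # synthetic stand-in for 200 sites
    if rng.random() < 0.6:
        estimates.append(rng.gauss(2.5, 0.05))  # sites near the global maximum
        logliks.append(rng.gauss(-120.0, 2.0))
    else:
        estimates.append(rng.gauss(1.0, 0.05))  # sites stuck at a local maximum
        logliks.append(rng.gauss(-150.0, 2.0))

final, (w, mu, var) = aggregate(estimates, logliks)
print(final, w, mu)
```

Clustering first and selecting afterwards means a site that converged to a local maximum does not drag the final estimate toward it, which a simple average of all local estimates would do.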
![Page 20: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/20.jpg)
Simple Example
IID Observation Model:
• Each site observes a 2-component Gaussian mixture
• Identical component variances
• Unknown mixing parameters
• Unknown component means
• 200 data collection sites
• 100 samples/site
• CEM2 algorithm implemented for estimation and aggregation
[Figure: ambiguity function with the global maximum and a local maximum marked]
![Page 21: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/21.jpg)
Clustering and Discrimination
[Figure: scatter plot over axes labeled 1 and 2, each from 0 to 3. Labels: Global maximum, Local maximum, Inverse FIM, Empirically estimated covariances via CEM2.]
![Page 22: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/22.jpg)
Validation of Key Result
[Figure: QQ plots for Cluster 1 and Cluster 2]
![Page 23: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/23.jpg)
Conclusions
• Lossless distributed dimension reduction and model-based inference require:
– Reliable local inference methods
– Aggregation rules for combining local statistics
• Information sharing constraints?
• Effects of bandwidth constraints - data compression?
• Tracking in dynamical models?
![Page 24: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/24.jpg)
References
• A. O. Hero, B. Ma, O. Michel and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.
• J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
• D. Blatt and A. O. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
• M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
• N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.
![Page 25: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/25.jpg)
Information Sharing Game
![Page 26: Multisite Internet Data Analysis](https://reader035.vdocuments.us/reader035/viewer/2022062501/56815bef550346895dc9e039/html5/thumbnails/26.jpg)
Addition of other Discriminants
[Figure: 3-D plot over axes labeled 1 and 2, each from 0 to 3. Vertical axis: log f(y_i; 1, 2) - E{ log f(y_i; 1, 2) }, from -3 to 1. Annotation: value added due to transmission of likelihood values.]