
Multisite Internet Data Analysis

Alfred O. Hero, Clyde Shih, David Barsic University of Michigan - Ann Arbor

[email protected]
http://www.eecs.umich.edu/~hero

Research supported in part by: NSF CCR-0325571

1. Network Data Collection
2. Distributed Data Analysis
   2.1 Dimension Reduction
   2.2 Model-Based Data Analysis
3. Conclusions

1. Network Data Collection

• Objectives
– Global: monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate the global network state while satisfying information privacy constraints.
– Local: collection sites gather data relevant to the local network state and share information as necessary to enhance local analysis.
• Types of data measured
– Active: queries and requests, packet probes
– Passive: netflow, router fields, honeypots, backscatter

[Figure: network diagram showing ISP 1, ISP 2, and ISP 3, a local data collection and probing site, data collection sites, and a monitoring center. Legend: data collector.]

Abilene Netflow Data: Protocol

[Figure: plot with axes labeled Dataset 1 and Dataset 2. Feature labels on both axes: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]

Abilene Netflow Data: Router

[Figure: plot with axes labeled Dataset 1 and Dataset 2. Feature labels on both axes: No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]
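The seven features named on the Abilene slides (No. Flows, and the average and standard deviation of duration, packets, and bytes) can be computed per protocol or per router from raw flow records. The following is a minimal sketch only: the record layout (`protocol`, `router`, `duration`, `packets`, `bytes`) and the sample values are hypothetical and do not come from the Abilene data.

```python
import statistics
from collections import defaultdict


def summarize_flows(flows, key):
    """Group flow records by `key` and compute the seven per-group features."""
    groups = defaultdict(list)
    for flow in flows:
        groups[flow[key]].append(flow)

    summary = {}
    for name, rows in groups.items():
        feats = {"no_flows": len(rows)}
        for field in ("duration", "packets", "bytes"):
            values = [row[field] for row in rows]
            feats["avg_" + field] = statistics.fmean(values)
            # Population standard deviation; a group with one flow gives 0.0.
            feats["std_" + field] = statistics.pstdev(values)
        summary[name] = feats
    return summary


# Hypothetical flow records, for illustration only.
flows = [
    {"protocol": "tcp", "router": "r1", "duration": 2.0, "packets": 10, "bytes": 4000},
    {"protocol": "tcp", "router": "r2", "duration": 4.0, "packets": 30, "bytes": 8000},
    {"protocol": "udp", "router": "r1", "duration": 1.0, "packets": 2, "bytes": 300},
]
by_protocol = summarize_flows(flows, "protocol")
by_router = summarize_flows(flows, "router")
```

Grouping by a caller-supplied key keeps one function usable for both the protocol view and the router view.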

Abilene Netflow Data

[Figure: "Total Number of Flows for Data Set 1". Horizontal axis: time in sec. (0 to 200). Vertical axis: number of flows (5.5 × 10^4 to 8 × 10^4).]

Challenges and Approaches

• Challenges
– High-dimensional measurement space
– Non-linear dependencies and non-stationarity
– Privacy and proprietary concerns
– Insufficient bandwidth for continuously sampled data
• Approaches
– Dimension reduction
– Model-based distributed inference
– Controlled information sharing
– Hierarchical and modular collection/analysis

Hierarchical Architecture

2. Distributed Data Analysis

• Hypothesis: data collected at sites A, B, and C follow a statistical distribution defined over a lower-dimensional manifold.

• Overall objective: find distributed strategies that perform reliable statistical inference with a minimum amount of data sharing.

[Figure: Site A, Site B, and Site C. Label: Sampling.]

2.1 Distributed Dimension Reduction

[Figure labels: Unknown Distribution, Unknown Manifold, Unknown Embedding, Observed Sample.]

Geodesic Entropic Graphs

[Figure: A Planar Sample and its Euclidean MST.]

GMST Dimension Estimation

[Figure: "MST Length for 3 Land Vehicles (=1,m=20)", two panels plotting L_n against n. First panel: n from 0 to 1200, L_n from 0 to 9 × 10^5. Second panel: n from 1020 to 1038, L_n from 8.5 × 10^5 to 8.64 × 10^5.]

GMST estimates: d = 13, H = 120 (bits)
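The way MST length grows with sample size carries the dimension information: for a sample on a d-dimensional set, log L_n is roughly linear in log n with slope (d − γ)/d. The sketch below illustrates that idea with a plain Euclidean MST on a planar sample, not the geodesic MST used in the slides. The function names, the subsample sizes, and the least-squares fit are assumptions made for illustration.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of edge lengths raised to gamma over the Euclidean MST (Prim, O(n^2))."""
    n = len(points)
    best = [math.inf] * n      # shortest known edge from the tree to each point
    in_tree = [False] * n
    best[0] = 0.0              # start the tree at point 0 (contributes no edge)
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def estimate_dimension(points, sizes, trials=5, gamma=1.0, seed=0):
    """Fit log L_n = a log n + b on random subsamples and return gamma / (1 - a)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for n in sizes:
        lengths = [mst_length(rng.sample(points, n), gamma) for _ in range(trials)]
        xs.append(math.log(n))
        ys.append(math.log(sum(lengths) / trials))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return gamma / (1.0 - slope)


# A 2-D uniform sample embedded linearly in 3-D (synthetic, for illustration).
rng = random.Random(1)
plane = [(u, v, 0.5 * u + 0.25 * v)
         for u, v in ((rng.random(), rng.random()) for _ in range(500))]
d_hat = estimate_dimension(plane, sizes=[50, 100, 200, 400])
```

In practice the estimate would be rounded to the nearest integer. A geodesic variant would replace `math.dist` with approximate geodesic distances on a neighborhood graph.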

Distributed GMST Estimator

• Principal MST convergence result:

• Distributed BHH (aggregation rule):

• Tight upper and lower bounds on the limit are obtained if sites exchange rooted dual graphs [Yukich:97]

BHH Theorem:
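For reference, the BHH (Beardwood-Halton-Hammersley) theorem in the power-weighted form commonly used with entropic graphs can be stated as below. The notation (the sample X_n, the length functional L_γ, the constant β_{d,γ}) is assumed here and may differ from the slide's. In the geodesic setting of Costa and Hero, the ambient dimension d is replaced by the intrinsic manifold dimension and the integral is taken over the manifold.

```latex
% X_n = {x_1, ..., x_n} i.i.d. in R^d, d >= 2, with Lebesgue density f
% of compact support; L_gamma = MST length with edges raised to gamma,
% 0 < gamma < d; beta_{d,gamma} is a constant independent of f.
\[
  \lim_{n\to\infty} \frac{L_\gamma(\mathcal{X}_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f^{(d-\gamma)/d}(x)\,dx
  \qquad \text{(a.s.)}
\]
% With alpha = (d-gamma)/d and the Renyi entropy
% H_alpha(f) = (1-alpha)^{-1} \log \int f^alpha(x) dx, this gives
\[
  \log L_\gamma(\mathcal{X}_n) \approx
  \alpha \log n + \log \beta_{d,\gamma} + (1-\alpha)\, H_\alpha(f)
\]
```

The second line shows why a single fit of log L_n against log n yields both a dimension estimate (from the slope) and an entropy estimate (from the intercept).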

2.2 Distributed Model-based Inference

• Global likelihood model

• Global M-estimator recursion:

– Global Fisher score function

• Local Fisher score functions
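One way to write the quantities named above, offered as a sketch under stated assumptions rather than as the slide's own equations: suppose there are S sites, site s holds data y_s, and the sites' data are statistically independent given θ. The symbols S, ℓ_s, and the gain A_k are hypothetical notation introduced here.

```latex
\begin{align*}
  \ell(\theta) &= \sum_{s=1}^{S} \ell_s(\theta),
  \qquad \ell_s(\theta) = \log f_s(y_s;\theta)
  && \text{(global log-likelihood)} \\
  \nabla_\theta \ell(\theta) &= \sum_{s=1}^{S} \nabla_\theta \ell_s(\theta)
  && \text{(global score = sum of local scores)} \\
  \theta^{(k+1)} &= \theta^{(k)} + A_k \sum_{s=1}^{S}
  \nabla_\theta \ell_s\big(\theta^{(k)}\big)
  && \text{(generic scaled-gradient recursion)}
\end{align*}
```

Here A_k is a positive-definite gain matrix, for example a step size times the identity or an inverse Fisher information matrix. Under the independence assumption, each update needs only the local scores, not the raw data.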

Distributed M-estimator

[Figure: sites A and B each perform a "Compute" step and increment the iteration counter, k = k+1.]

Properties

• Communication requirement: 2p bytes/update/site.

• If the data are independent, the iterations attain stationary points of the global likelihood.

• All local MLEs are available to each site.

• For a multimodal likelihood, the local MLEs can be improved by aggregation under a mixture model.
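A toy illustration of the independent-data property: when sites share only their local scores, a gradient-type iteration reaches the stationary point of the global likelihood without pooling raw data. This is a minimal sketch, assuming a unit-variance Gaussian mean model with a scalar parameter (p = 1) and a fixed step size. The model, the function names, and the synthetic data are assumptions, not the authors' recursion.

```python
import random


def local_score(samples, theta):
    """Local Fisher score for a unit-variance Gaussian mean: sum of (y - theta)."""
    return sum(y - theta for y in samples)


def distributed_estimate(sites, theta0=0.0, iterations=50, step=0.5):
    """Iterate on theta using only the scores reported by each site."""
    total = sum(len(samples) for samples in sites)
    theta = theta0
    for _ in range(iterations):
        # Each site reports one number per update (its local score).
        global_score = sum(local_score(samples, theta) for samples in sites)
        # Dividing by the total sample count scales by the inverse Fisher
        # information of this model, so the step is a damped Newton step.
        theta += step * global_score / total
    return theta


# Three synthetic sites (A, B, C) with 100 samples each, for illustration.
rng = random.Random(0)
sites = [[rng.gauss(1.5, 1.0) for _ in range(100)] for _ in range(3)]
theta_hat = distributed_estimate(sites)
pooled_mean = sum(sum(samples) for samples in sites) / 300
```

For this model the global MLE is the pooled sample mean, so `theta_hat` and `pooled_mean` should agree to numerical precision even though no site ever sees another site's samples.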

[Figure: Global Likelihood Function, with the global maximum and local maxima marked, and the local MLEs shown as x marks along the parameter axis.]

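Key Theoretical Result

• The asymptotic distribution of the local estimates is a Gaussian mixture that depends on the global likelihood:

$$ f_{\hat{\theta}}(t) = \sum_{m=0}^{M} \frac{p_m}{(2\pi)^{K/2}\,|C_m|^{1/2}} \exp\left( -\frac{1}{2} (t-\theta_m)^T C_m^{-1} (t-\theta_m) \right) $$

• Parameters: $p_m$, $\theta_m$, $C_m$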

Proof: asymptotic normal theory of local maxima (Huber:67): see Blatt&Hero:2003

Local Estimator Aggregation Algorithm

[Figure: block diagram. Blocks: Estimator 1, Estimator 2, ..., Estimator N; Estimation of Gaussian Mixture Parameters (FS, EM…); Sample Covariance Analysis; Aggregation to Final Estimate.]
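The aggregation pipeline can be sketched in a few lines: collect the local estimates, fit a Gaussian mixture to them, then pick one component as the final estimate. The sketch below uses plain EM on scalar estimates with two components. The selection rule (choose the component whose variance is closest to a caller-supplied inverse Fisher information) is a hypothetical stand-in for the sample covariance analysis block, and the synthetic estimates are illustrative only.

```python
import math
import random


def fit_two_gaussians(x, iterations=100):
    """Plain EM for a two-component 1-D Gaussian mixture over the estimates x."""
    mu = [min(x), max(x)]
    spread = (max(x) - min(x)) / 4 or 1.0
    var = [spread ** 2, spread ** 2]
    w = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each estimate.
        resp = []
        for xi in x:
            dens = [w[k] * math.exp(-0.5 * (xi - mu[k]) ** 2 / var[k])
                    / math.sqrt(2 * math.pi * var[k]) for k in (0, 1)]
            s = dens[0] + dens[1]
            resp.append([d / s for d in dens] if s > 0 else [0.5, 0.5])
        # M-step: update weights, means, and variances (with small floors).
        for k in (0, 1):
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var[k] = max(sum(r[k] * (xi - mu[k]) ** 2
                             for r, xi in zip(resp, x)) / nk, 1e-9)
    return w, mu, var


def select_component(var, inv_fim):
    """Hypothetical rule: index of the variance closest to inv_fim in log ratio."""
    return min(range(len(var)), key=lambda k: abs(math.log(var[k] / inv_fim)))


# Synthetic local estimates: a tight cluster near 0.0 and a looser one near 2.0.
rng = random.Random(0)
estimates = ([rng.gauss(0.0, 0.1) for _ in range(140)]
             + [rng.gauss(2.0, 0.3) for _ in range(60)])
w, mu, var = fit_two_gaussians(estimates)
k = select_component(var, inv_fim=0.01)
final_estimate = mu[k]
```

Averaging within the selected cluster is what gives the aggregated estimate lower variance than any single local estimate in that cluster.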

Simple Example

IID Observation Model:

• Each site observes a 2-component Gaussian mixture
• Identical component variances
• Unknown mixing parameters
• Unknown component means
• 200 data collection sites
• 100 samples/site
• CEM2 algorithm implemented for estimation and aggregation

[Figure: ambiguity function, with the global maximum and a local maximum marked.]
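The setup of the simple example (200 sites, 100 samples per site, a two-component Gaussian mixture with a common variance) is easy to simulate. The sketch below is hedged in several ways: it substitutes plain EM for CEM2, treats the common variance as known, and uses hypothetical true parameter values and random initial guesses that do not come from the slides.

```python
import math
import random


def sample_site(rng, n, weight, means, sigma):
    """Draw n i.i.d. samples from a two-component mixture with common sigma."""
    return [rng.gauss(means[0] if rng.random() < weight else means[1], sigma)
            for _ in range(n)]


def local_em(y, sigma, init, iterations=50):
    """Plain EM for the mixing weight and the two means (sigma assumed known)."""
    w, m1, m2 = init
    for _ in range(iterations):
        # E-step: probability that each sample came from component 1.
        r = []
        for yi in y:
            a = w * math.exp(-0.5 * ((yi - m1) / sigma) ** 2)
            b = (1.0 - w) * math.exp(-0.5 * ((yi - m2) / sigma) ** 2)
            r.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: floors avoid division by zero if a component empties out.
        n1 = max(sum(r), 1e-12)
        n2 = max(len(y) - sum(r), 1e-12)
        w = min(max(n1 / len(y), 0.0), 1.0)
        m1 = sum(ri * yi for ri, yi in zip(r, y)) / n1
        m2 = sum((1.0 - ri) * yi for ri, yi in zip(r, y)) / n2
    return w, m1, m2


# Hypothetical ground truth, chosen only for illustration.
TRUE_WEIGHT, TRUE_MEANS, SIGMA = 0.5, (1.0, 2.0), 0.5

rng = random.Random(0)
local_estimates = []
for _ in range(200):                       # 200 data collection sites
    y = sample_site(rng, 100, TRUE_WEIGHT, TRUE_MEANS, SIGMA)  # 100 samples/site
    init = (0.5, rng.uniform(0.0, 3.0), rng.uniform(0.0, 3.0))
    local_estimates.append(local_em(y, SIGMA, init))
```

Plotting the (m1, m2) pairs in `local_estimates` gives a scatter of local estimates that can then be clustered and aggregated.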

Clustering and Discrimination

[Figure: plot over two parameter axes (subscripts 1 and 2), each ranging from 0 to 3. Annotations: global maximum, inverse FIM; local maximum; empirically estimated covariances via CEM2.]

Validation of Key Result

[Figure: QQ plots for Cluster 1 and Cluster 2.]

Conclusions

• Lossless distributed dimension reduction and model-based inference require:
– Reliable local inference methods
– Aggregation rules for combining local statistics

• Information sharing constraints?

• Effects of bandwidth constraints - data compression?

• Tracking in dynamical models?

References

• A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Application of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.

• J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.

• D. Blatt and A. O. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.

• M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.

• N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.

Information Sharing Game

Addition of other Discriminants

[Figure: 3-D plot over the two parameter axes (subscripts 1 and 2, each from 0 to 3). Vertical axis: log f(y_i; θ_1, θ_2) − E{log f(y_i; θ_1, θ_2)}, ranging from −3 to 1.]

Value added due to transmission of likelihood values
