Toyota InfoTechnology Center U.S.A, Inc.

Mixture Models of End-host Network Traffic

John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft and Daniel Ting

Toyota-ITC, Technicolor, Boston U., Technicolor, Facebook

Outline

• We collected traffic at the end host, something rarely monitored.
• Conventional distributions don’t fit heavy-tailed data.
• The dense part of the distribution doesn’t look Pareto, and just fitting the Pareto tail doesn’t describe the data.
• Fit by mixture models (not the typical Gaussian mixtures) of a Pareto tail with exponentials as a proxy for the dense part.
• Model selection: the best number of components is constrained by a complexity penalty, and the result is a model of the entire distribution.
• Uses:
  • Better tail parameter estimates than conventional measures.
  • Soft clustering: assign traffic to exponential vs. Pareto components, by protocol.
  • More stable threshold setting.

Data collection effort

End-host flows:
• Collected at laptop network port
• Collection moved around with device
• Assembled from packet trace headers
• On enterprise XP build
• Periodic server uploads
• Logged with user & CPU activity, to eliminate off periods

Data sets:
• 270 personal machine data sets
• 90% laptops
• 5 week duration
• 400G raw data, total
• Flow initiation counts are binned in intervals from 4 to 512 seconds
• Removed zero-count intervals
• Median sample 9800 points
• Max sample size 264k
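The binning step listed above can be sketched as follows. This is a minimal illustration and not the collection pipeline itself: the function name `bin_flow_counts` and the input format (a list of flow start times in seconds) are assumptions made for the example.

```python
from collections import Counter

def bin_flow_counts(start_times, bin_seconds):
    """Count flow initiations per fixed-width interval, dropping empty intervals.

    start_times: flow start timestamps in seconds (hypothetical input format).
    bin_seconds: interval width, e.g. anywhere from 4 to 512 seconds.
    Returns the non-zero counts ordered by interval index.
    """
    if bin_seconds <= 0:
        raise ValueError("bin_seconds must be positive")
    counts = Counter(int(t // bin_seconds) for t in start_times)
    # Intervals with no flows never appear in the Counter, so zero-count
    # intervals are removed implicitly.
    return [counts[k] for k in sorted(counts)]
```

Running the same trace through several values of `bin_seconds` gives one sample per binning window, which matches how the later results are reported per bin size.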

Heavy-tailed data is extremely wide compared to conventional distributions.

Fitting any exponential family distribution (e.g. Gaussian, Poisson…) fails. Any exponential tail is too steep.

Fitting a mixture of exponential families requires an impractical number of components.

But just fitting the power-law tail ignores most of the probability mass.
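A small numerical comparison makes the "too steep" point concrete. The sketch below compares the tail probability P(X > x) of a Pareto distribution with that of an exponential distribution with the same mean. The shape value 1.6 is the mean tail exponent reported in the summary; the scale `x_min = 1` and the function names are assumptions chosen for illustration.

```python
import math

def exponential_sf(x, mean):
    """P(X > x) for an exponential distribution with the given mean."""
    return math.exp(-x / mean)

def pareto_sf(x, alpha, x_min=1.0):
    """P(X > x) for a Pareto distribution with shape alpha and scale x_min."""
    return 1.0 if x <= x_min else (x_min / x) ** alpha

alpha = 1.6  # mean tail exponent reported in the summary
pareto_mean = alpha / (alpha - 1.0)  # Pareto mean for x_min = 1 (needs alpha > 1)

# Print x, the Pareto tail probability, and the matched exponential one.
for x in (10, 100, 1000):
    print(x, pareto_sf(x, alpha), exponential_sf(x, pareto_mean))
```

The exponential tail probability collapses many orders of magnitude faster than the Pareto one as x grows, which is why a single exponential-family fit cannot cover the observed tail.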

Best fit normal

The distribution looks like an exponential above and a power law below

Figure: power law fit and exponential fit panels, with regions annotated "good fit" and "bad fit".

Exponential – Pareto mixture models.

A mixture model is a hierarchical model where the mixing weights determine the probability of each of the component models, which in turn generate the sample points.

Since all components share the same support, any sample point could in principle have been generated by any component, with a probability governed by that component's mixing weight.
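The two-level generative process can be sketched in a few lines. This is an illustrative sampler under stated assumptions, not code from the study: the exponential is shifted to start at `x_min` so that both components share the support x >= x_min, and the function name and parameter names are hypothetical.

```python
import random

def sample_ep_mixture(n, w_pareto, rate, alpha, x_min=1.0, seed=None):
    """Draw n points from a two-component exponential-Pareto mixture.

    Returns (points, labels); labels[i] is "P" or "E", the hidden component.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        # Level 1: the mixing weight picks the component.
        if rng.random() < w_pareto:
            # Level 2: the chosen component generates the sample point.
            points.append(x_min * rng.paretovariate(alpha))
            labels.append("P")
        else:
            points.append(x_min + rng.expovariate(rate))
            labels.append("E")
    return points, labels
```

In real data the labels are unobserved; fitting the mixture amounts to recovering the weights and component parameters from the points alone.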

We consider three models:
• Pareto (P): one power-law component
• Exponential – Pareto (EP): one of each
• Two exponentials, one Pareto (EEP)

Any more exponential components cannot be resolved.

By modeling the entire data set, mixture models give more accurate tail α-parameter estimates than methods that consider only the tail data.
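One way to see how a mixture fit uses the whole sample to estimate the tail exponent is a small expectation-maximization (EM) loop. The sketch below is an illustration under assumptions, not the fitting procedure behind the results on these slides: it treats `x_min` as known, uses a shifted exponential so both components share the support x >= x_min, starts from crude initial values, and the name `fit_ep_mixture` is hypothetical.

```python
import math

def fit_ep_mixture(xs, x_min=1.0, n_iter=200):
    """EM for a shifted-exponential + Pareto mixture on x >= x_min.

    Returns (w_pareto, rate, alpha, loglik_history).
    """
    n = len(xs)
    shifted = [x - x_min for x in xs]
    logs = [math.log(x / x_min) for x in xs]
    # Crude starting values (an assumption; results can depend on them).
    w_p, rate, alpha = 0.5, 1.0 / (sum(shifted) / n), 1.0
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from the Pareto.
        resp, loglik = [], 0.0
        for s, lg in zip(shifted, logs):
            f_e = (1.0 - w_p) * rate * math.exp(-rate * s)
            # Pareto density alpha * x_min**alpha / x**(alpha + 1), via logs.
            f_p = w_p * (alpha / x_min) * math.exp(-(alpha + 1.0) * lg)
            resp.append(f_p / (f_e + f_p))
            loglik += math.log(f_e + f_p)
        history.append(loglik)
        # M-step: weighted maximum-likelihood updates in closed form.
        r_sum = sum(resp)
        w_p = r_sum / n
        alpha = r_sum / sum(r * lg for r, lg in zip(resp, logs))
        rate = (n - r_sum) / sum((1.0 - r) * s for r, s in zip(resp, shifted))
    return w_p, rate, alpha, history
```

Every point contributes to the `alpha` update in proportion to its Pareto responsibility, so no hard cutoff between "body" and "tail" has to be chosen by hand. The log-likelihood history is returned so that convergence can be checked.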

When tested on synthetic Pareto-tailed data, the EP mixture model estimator performs significantly better than the well-known AEST method. (AEST estimates are shown on the left, and EP-based estimates on the right, in each panel.)

Model Selection versus Goodness-of-Fit

Goodness-of-fit tests, while useful for initial characterization, don’t have an explicit acceptance criterion, and, as data set size increases, will eventually reject all models.

Model selection is a relative, pairwise criterion derived from comparing likelihoods.

We use the Bayesian Information Criterion (BIC) to approximate the Bayes Factor terms. It penalizes the maximum likelihood by the model degrees of freedom, d, so that models with different numbers of parameters can be compared.

The Bayes Factor is the ratio of the marginal likelihood of one model (EP) to that of another (P). For instance, a natural-log Bayes Factor of 5 indicates that the probability of the data given one model versus the other is over 100:1 (e^5 ≈ 148).

With the BIC approximation, the log Bayes Factor becomes
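The expression that followed on the slide is not reproduced in this text. As a stand-in, the sketch below assumes the standard Schwarz form, in which each model's log marginal likelihood is approximated by its maximized log-likelihood minus (d/2) log n, so the log Bayes Factor is the difference of two such scores. The function names, the parameter counts, and the log-likelihood values are hypothetical and serve only to show the arithmetic.

```python
import math

def bic_score(max_loglik, d, n):
    """Schwarz approximation to the log marginal likelihood of one model."""
    return max_loglik - 0.5 * d * math.log(n)

def log_bayes_factor(loglik_a, d_a, loglik_b, d_b, n):
    """Approximate log Bayes factor of model A over model B from BIC scores."""
    return bic_score(loglik_a, d_a, n) - bic_score(loglik_b, d_b, n)

# Hypothetical numbers, for illustration only: EP (assumed 3 free parameters)
# versus P (assumed 1 free parameter) on a sample of 9800 points.
lbf = log_bayes_factor(-41200.0, 3, -41350.0, 1, 9800)

# A natural-log Bayes factor of 5 converts to odds of exp(5), about 148:1.
odds = math.exp(5.0)
```

The penalty term grows only with log n, so on samples of thousands of points a real improvement in likelihood from an extra component easily outweighs it, while a negligible improvement does not.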

Pairwise BIC comparisons of the models reveal large log BF values for EP vs. P and smaller values for EEP vs. EP.

Boxplot of BIC comparison for Pareto vs. EP Mixture Model.

Boxplot of BIC comparison for EP vs. EEP Mixture Model.

Model Selection Results

Model selection results based on Bayes Factors, over all users. Each bar represents the same user set with a different binning time window.

For the P, EP, and EEP models:
• P: Only a handful of users are given the Pareto-only model.
• EP: Overall, the EP model is selected for 50-85% of the users, depending upon the bin size.
• EEP: Between 15% and 40% of user machines are best modeled by EEP, again depending upon the bin size.

Histograms of Heavy-Tail Parameters’ Variation, EP Model.

• The difference across users is significant.

Partitioning traffic into Exponential and Pareto ranges

Mixture fractions, as a function of the connection count, indicate the (soft) membership of the data in each component.

In this example, bins with fewer than 82 counts are almost entirely exponential, and those with more than 82 are almost entirely Pareto.

This way different sources of the traffic can be characterized as heavy-tailed or not.
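Soft membership is the posterior probability of a component given the observed count, obtained from Bayes' rule on the fitted mixture. The sketch below assumes a shifted-exponential plus Pareto mixture with known `x_min`; the function names and any parameter values passed to them are hypothetical and are not the fitted values for any user in the study.

```python
import math

def pareto_membership(x, w_pareto, rate, alpha, x_min=1.0):
    """Posterior probability that a count x came from the Pareto component."""
    f_e = (1.0 - w_pareto) * rate * math.exp(-rate * (x - x_min))
    f_p = w_pareto * alpha * x_min ** alpha / x ** (alpha + 1.0)
    return f_p / (f_e + f_p)

def crossover_count(w_pareto, rate, alpha, x_min=1.0, x_max=100000):
    """Smallest integer count >= x_min at which Pareto membership reaches 0.5."""
    for x in range(int(math.ceil(x_min)), x_max + 1):
        if pareto_membership(x, w_pareto, rate, alpha, x_min) >= 0.5:
            return x
    return None
```

For example, `crossover_count(0.25, 1.0, 1.6)` returns the count at which these illustrative parameters switch from mostly exponential to mostly Pareto. Where the crossover falls depends entirely on the fitted parameters, and with some parameter choices the membership curve is not monotone near `x_min`.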

Figure: Mixture Fractions, User 256 (plot labels: mPareto, P(traffic)).

Traffic Fractions, in Exponential and Pareto Components, by Protocol

Although Exponential traffic dominates in all cases, the long-tail (i.e. Pareto) traffic appears to come largely from bursts of ICMP, DNS and web traffic flows.

In summary

1. We have modeled traffic as flow initiations from end hosts in an enterprise, using mixture models and model selection.

2. We have discovered strong evidence that the traffic is almost always heavy-tailed, with the Pareto component contributing about 1/4 of the probability mass, and with a power-law scaling parameter with mean 1.6 that varies widely, between 1.0 and 2.0.

3. Apparently DNS, ICMP and some web traffic make up the tail component.

See the full paper at http://arxiv.org/abs/1212.2744

BACKUP

Pareto & Exponential components of selected users

Anomaly thresholds derived from models are more stable than empirical thresholds.
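A model-derived threshold is the count level that the fitted distribution exceeds with a chosen small probability, while an empirical threshold reads the same level off the sorted sample. The sketch below shows both under assumptions: a shifted-exponential plus Pareto mixture with known `x_min`, a simple order-statistic convention for the empirical version, and hypothetical function names. It is not the thresholding code used for the slide.

```python
import math

def mixture_sf(x, w_pareto, rate, alpha, x_min=1.0):
    """P(X > x) under the shifted-exponential + Pareto mixture, for x >= x_min."""
    return ((1.0 - w_pareto) * math.exp(-rate * (x - x_min))
            + w_pareto * (x_min / x) ** alpha)

def model_threshold(p_exceed, w_pareto, rate, alpha, x_min=1.0):
    """Count level exceeded with probability p_exceed (0 < p_exceed < 1)."""
    lo, hi = x_min, 2.0 * x_min
    while mixture_sf(hi, w_pareto, rate, alpha, x_min) > p_exceed:
        hi *= 2.0  # expand the bracket until the tail mass falls below p_exceed
    for _ in range(100):
        # Bisection: the survival function is decreasing in x.
        mid = 0.5 * (lo + hi)
        if mixture_sf(mid, w_pareto, rate, alpha, x_min) > p_exceed:
            lo = mid
        else:
            hi = mid
    return hi

def empirical_threshold(xs, p_exceed):
    """Order-statistic estimate of the same level from raw data."""
    ordered = sorted(xs)
    k = min(len(ordered) - 1, int(math.ceil((1.0 - p_exceed) * len(ordered))))
    return ordered[k]
```

The empirical version depends on a handful of extreme observations when `p_exceed` is small, whereas the model version depends on parameters estimated from the whole sample, which is one plausible reading of why the model-derived thresholds are reported as more stable.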

Component parameters are independent

This implies that the exponential and Pareto components are generated by separate sources.
