Post on 16-Dec-2015
Toyota InfoTechnology Center U.S.A, Inc.
Mixture Models of End-host Network Traffic
John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft and Daniel Ting
Toyota-ITC, Technicolor, Boston U., Technicolor, Facebook
Outline
• We collected traffic at the end-host, something rarely monitored.
• Conventional distributions don’t fit heavy-tailed data.
• The dense part of the distribution doesn’t look Pareto, and just fitting the Pareto tail doesn’t describe the data.
• Fit by mixture models (not the typical Gaussian mixtures) of a Pareto tail with exponentials as a proxy for the dense part.
• Model selection: the best number of components, constrained by a complexity penalty, returns a model of the entire distribution.
• Uses:
  - Better tail parameter estimates than conventional measures.
  - Soft clustering: assign traffic to exponential vs. Pareto components, by protocol.
  - More stable threshold setting.
Data collection effort
End-host flows:
• Collected at the laptop network port
• Collection moved around with the device
• Assembled from packet trace headers
• On an enterprise XP build
• Periodic server uploads
• Logged with user & CPU activity, to eliminate off periods
Data sets:
• 270 personal machine data sets
• 90% laptops
• 5-week duration
• 400G raw data, total
• Flow initiation counts are binned in intervals from 4 to 512 seconds
• Removed zero-count intervals
• Median sample 9800 points
• Max sample size 264k
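The binning step can be illustrated with a minimal sketch. The function name `bin_flow_counts` and the timestamps are hypothetical; only the idea of counting flow initiations per fixed-width interval and dropping zero-count intervals comes from the slide.

```python
import numpy as np

def bin_flow_counts(start_times, bin_width):
    """Count flow initiations per fixed-width time bin, dropping empty bins."""
    t = np.asarray(start_times, dtype=float)
    if t.size == 0:
        return np.array([], dtype=int)
    # Integer bin index of each flow, measured from the first flow seen.
    idx = np.floor((t - t.min()) / bin_width).astype(int)
    counts = np.bincount(idx)
    # Zero-count intervals are removed, as in the data preparation above.
    return counts[counts > 0]

# Hypothetical flow start times, in seconds.
times = [0.5, 1.0, 3.9, 4.1, 20.0, 21.5, 22.0, 23.9]
counts_4s = bin_flow_counts(times, 4)      # narrowest bin width on the slide
counts_512s = bin_flow_counts(times, 512)  # widest bin width on the slide
```

Each bin width yields a separate sample of counts for the same machine, which is why later results are reported per bin size.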
Heavy-tailed data is extremely wide compared to conventional distributions.
Fitting any exponential family distribution (e.g. Gaussian, Poisson…) fails. Any exponential tail is too steep.
Fitting a mixture of exponential families requires an impractical number of components.
But just fitting the power-law tail ignores most of the probability mass.
[Figure: data with best-fit normal distribution]
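To see why an exponential tail is too steep, it helps to compare survival probabilities directly. This is an illustrative sketch only: the exponential rate of 0.1 is an arbitrary assumption, and the Pareto index of 1.6 is borrowed from the mean value reported in the summary.

```python
import math

def exp_survival(x, rate):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def pareto_survival(x, alpha, x_min=1.0):
    """P(X > x) for a Pareto distribution with tail index alpha and scale x_min."""
    return (x_min / x) ** alpha if x >= x_min else 1.0

# Assumed illustrative parameters: rate 0.1, alpha 1.6.
for x in (10, 100, 1000):
    print(x, exp_survival(x, 0.1), pareto_survival(x, 1.6))
```

The exponential survival probability falls by a constant factor for every fixed step in x, while the Pareto falls by a constant factor only for every multiplicative step, so at large counts the exponential assigns vanishing probability to events the power law still treats as plausible.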
The distribution looks like an exponential in its dense part and a power law in its tail.
[Figure: two panels, power-law fit and exponential fit, each annotated with a region of good fit and a region of bad fit]
Exponential – Pareto mixture models.
A mixture model is a hierarchical model where the mixing weights determine the probability of each of the component models, which in turn generate the sample points.
Since all components share the same support, any sample point could in principle have been generated by any component, with a probability governed by that component's mixing weight.
We consider three models:
• Pareto (P): one power-law component.
• Exponential – Pareto (EP): one of each.
• Two exponentials, one Pareto (EEP).
Any more exponential components cannot be resolved.
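The hierarchical structure described above can be sketched in code: pick a component by its mixing weight, then draw from that component. The parameterization is an assumption, not taken from the slides: every component is placed on the common support x ≥ X_MIN, with the exponential shifted to start there, and all function names are illustrative.

```python
import numpy as np

X_MIN = 1.0  # assumed common lower bound of the support (smallest nonzero count)

def exp_pdf(x, rate):
    """Exponential density shifted to start at X_MIN (an assumed parameterization)."""
    return rate * np.exp(-rate * (x - X_MIN))

def pareto_pdf(x, alpha):
    """Pareto density with scale X_MIN and tail index alpha."""
    return alpha * X_MIN**alpha / x**(alpha + 1)

def mixture_pdf(x, weights, rates, alpha):
    """Density of len(rates) exponentials plus one Pareto.
    weights lists the exponential weights first and the Pareto weight last."""
    x = np.asarray(x, dtype=float)
    dens = weights[-1] * pareto_pdf(x, alpha)
    for w, r in zip(weights[:-1], rates):
        dens = dens + w * exp_pdf(x, r)
    return dens

def sample_mixture(n, weights, rates, alpha, rng):
    """Hierarchical sampling: choose a component, then draw from it."""
    comp = rng.choice(len(weights), size=n, p=weights)
    out = np.empty(n)
    for k, r in enumerate(rates):
        mask = comp == k
        out[mask] = X_MIN + rng.exponential(1.0 / r, size=mask.sum())
    mask = comp == len(rates)
    # Generator.pareto draws Lomax variates; adding 1 and scaling gives a classical Pareto.
    out[mask] = X_MIN * (1.0 + rng.pareto(alpha, size=mask.sum()))
    return out

# EP model with assumed illustrative parameters.
rng = np.random.default_rng(0)
ep_samples = sample_mixture(1000, [0.75, 0.25], [0.1], 1.6, rng)
```

Passing an empty `rates` list with weights `[1.0]` gives the P model, and two rates with three weights gives EEP.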
By modeling the entire data set, mixture models give more accurate tail α-parameter estimates than methods that consider only the tail data.
When tested on synthetic Pareto-tailed data, the EP mixture model estimator performs significantly better than the well-known AEST method. (AEST estimates are shown on the left, and EP-based estimates on the right, in each pane.)
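One way to estimate the tail parameter from the entire data set is expectation-maximization on the EP mixture. The slides do not say which fitting algorithm was used, so the EM procedure below is an assumption, as are the shifted-exponential parameterization, the starting values, and the synthetic test parameters.

```python
import numpy as np

def fit_ep_em(x, x_min=1.0, n_iter=500):
    """EM for a one-exponential, one-Pareto mixture on x >= x_min.
    Returns (pareto_weight, exp_rate, alpha, log_likelihood)."""
    x = np.asarray(x, dtype=float)
    z = x - x_min            # statistic for the shifted exponential
    lg = np.log(x / x_min)   # statistic for the Pareto
    # Crude starting values (assumed, not tuned).
    w, rate, alpha = 0.5, 1.0 / max(np.median(z), 1e-9), 1.5
    for _ in range(n_iter):
        # E-step: posterior probability that each point is Pareto.
        pe = (1 - w) * rate * np.exp(-rate * z)
        pp = w * (alpha / x_min) * np.exp(-(alpha + 1) * lg)
        r = pp / (pe + pp)
        # M-step: weighted maximum-likelihood updates.
        w = r.mean()
        rate = (1 - r).sum() / ((1 - r) * z).sum()
        alpha = r.sum() / (r * lg).sum()
    pe = (1 - w) * rate * np.exp(-rate * z)
    pp = w * (alpha / x_min) * np.exp(-(alpha + 1) * lg)
    return w, rate, alpha, float(np.log(pe + pp).sum())

# Synthetic check: 25% Pareto (alpha 1.6), 75% shifted exponential (mean 20).
rng = np.random.default_rng(0)
n = 20000
is_pareto = rng.random(n) < 0.25
x = np.where(is_pareto,
             1.0 + rng.pareto(1.6, size=n),         # classical Pareto, x_min = 1
             1.0 + rng.exponential(20.0, size=n))   # shifted exponential
w_hat, rate_hat, alpha_hat, loglik = fit_ep_em(x)
```

Because every point contributes to the Pareto update in proportion to its membership probability, the estimate of α uses the whole sample rather than a hand-picked tail cutoff.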
Model Selection versus Goodness-of-Fit
Goodness-of-fit tests, while useful for initial characterization, don’t have an explicit acceptance criterion, and, as data set size increases, will eventually reject all models.
Model selection is a relative, pairwise criterion that derives from a comparison of likelihoods.
We use the Bayes Information Criterion to approximate the Bayes Factor terms. It penalizes the maximum likelihood by the model degrees of freedom, d, so that models with different numbers of parameters can be compared.
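The Bayes Factor is the ratio of the marginal likelihood of one model (EP) to another (P). For instance, a log Bayes Factor of 5 indicates that the data are over 100 times more probable under one model than under the other (e^5 ≈ 148).
With the BIC approximation, the log Bayes Factor becomes, in its standard form (with \hat{L} the maximized likelihood, d the degrees of freedom, and n the number of sample points):
\[ \log BF_{EP,P} \approx \left( \log \hat{L}_{EP} - \log \hat{L}_{P} \right) - \tfrac{1}{2} \left( d_{EP} - d_{P} \right) \log n \]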
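A minimal sketch of a BIC-based comparison, assuming the standard Schwarz form, in which the log Bayes Factor is approximated by the difference in maximized log-likelihoods minus half the difference in degrees of freedom times log n. The log-likelihood values are hypothetical, and the parameter counts (1 for P, 3 for EP, 5 for EEP) assume the lower bound of the support is fixed rather than fitted.

```python
import math

def bic_log_bayes_factor(loglik_a, d_a, loglik_b, d_b, n):
    """BIC (Schwarz) approximation to the log Bayes Factor of model A over model B.
    loglik_*: maximized log-likelihoods; d_*: degrees of freedom; n: sample size."""
    return (loglik_a - loglik_b) - 0.5 * (d_a - d_b) * math.log(n)

# Assumed parameter counts, with the lower bound of the support held fixed.
D_P, D_EP, D_EEP = 1, 3, 5

# Hypothetical maximized log-likelihoods for one user; n is the median sample size.
n = 9800
log_bf_ep_vs_p = bic_log_bayes_factor(-30000.0, D_EP, -30500.0, D_P, n)
log_bf_eep_vs_ep = bic_log_bayes_factor(-29995.0, D_EEP, -30000.0, D_EP, n)

# A log Bayes Factor of 5 corresponds to odds of about 148:1.
odds_at_5 = math.exp(5)
```

In this made-up example the extra exponential component improves the log-likelihood by only 5, which does not cover its complexity penalty of about 9.2, so EP would be preferred over EEP.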
Pairwise BIC comparisons of the models reveal large log BF values for EP vs. P and smaller values for EEP vs. EP.
Boxplot of BIC comparison for Pareto vs. EP Mixture Model.
Boxplot of BIC comparison for EP vs. EEP Mixture Model.
Model Selection Results
Model selection results based on Bayes Factors, over all users. Each bar represents the same user set with a different binning time window.
For the P, EP, and EEP models:
• P: Only a handful of users are given the Pareto-only model.
• EP: Overall, the EP model is selected for 50-85% of the users, depending upon the bin size.
• EEP: Between 15% and 40% of user machines are best modeled by EEP, again depending upon the bin size.
Histograms of Heavy-Tail Parameters’ Variation, EP Model.
• The difference across users is significant.
Partitioning traffic into Exponential and Pareto ranges
Mixture fractions, as a function of the number of connections, indicate the (soft) membership of the data in a component.
In this example, bins with fewer than 82 counts are almost entirely exponential, and those with more than 82 are almost entirely Pareto.
This way different sources of the traffic can be characterized as heavy-tailed or not.
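The soft membership can be sketched as a posterior probability computed from a fitted EP model. The parameter values below are hypothetical and are not those of the user shown on the slide; the shifted-exponential parameterization and the function names are likewise assumptions.

```python
import numpy as np

def pareto_fraction(x, w, rate, alpha, x_min=1.0):
    """Posterior probability that a bin with count x belongs to the Pareto component."""
    x = np.asarray(x, dtype=float)
    pe = (1 - w) * rate * np.exp(-rate * (x - x_min))
    pp = w * alpha * x_min**alpha / x**(alpha + 1)
    return pp / (pe + pp)

def tail_crossover(w, rate, alpha, x_max, x_min=1.0):
    """Smallest integer count from which the Pareto fraction stays at or above 0.5
    up to x_max, or None if the exponential still dominates at x_max."""
    xs = np.arange(int(x_min), int(x_max) + 1)
    frac = pareto_fraction(xs, w, rate, alpha, x_min)
    below = np.nonzero(frac < 0.5)[0]
    if below.size == 0:
        return int(xs[0])
    if below[-1] == len(xs) - 1:
        return None
    return int(xs[below[-1] + 1])

# Hypothetical fitted parameters: Pareto weight 0.25, rate 0.1, alpha 1.6.
params = dict(w=0.25, rate=0.1, alpha=1.6)
cut = tail_crossover(x_max=1000, **params)
```

Counts at or above `cut` are attributed mostly to the heavy-tailed component, which is the kind of split the figure for User 256 illustrates.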
[Figure: Mixture Fractions, User 256. Curves: m_Pareto, m_exp, P(traffic).]
Traffic Fractions, in Exponential and Pareto Components, by Protocol
Although exponential traffic dominates in all cases, the long-tail (i.e. Pareto) traffic arises largely from bursts of ICMP, DNS and web traffic flows.
In summary
1. We have modeled traffic as flow initiations from end hosts in an enterprise, using mixture models and model selection.
2. We have discovered strong evidence that the traffic is almost always heavy-tailed, with the Pareto component contributing about 1/4 of the probability mass, and with a power-law scaling parameter that has mean 1.6 and varies widely, between 1.0 and 2.0.
3. Apparently DNS, ICMP and some web traffic make up the tail component.
See the full paper at http://arxiv.org/abs/1212.2744
BACKUP
Pareto & Exponential components of selected users
Anomaly thresholds derived from models are more stable than empirical thresholds.
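One way to derive a threshold from a fitted model is to take a high quantile of the mixture distribution. The slides do not define how the thresholds were computed, so the quantile rule, the 0.99 level, the parameter values, and the function names below are all assumptions.

```python
import math

def ep_cdf(x, w, rate, alpha, x_min=1.0):
    """CDF of an EP mixture with a shifted exponential and a Pareto on x >= x_min."""
    if x <= x_min:
        return 0.0
    exp_part = 1.0 - math.exp(-rate * (x - x_min))
    pareto_part = 1.0 - (x_min / x) ** alpha
    return (1 - w) * exp_part + w * pareto_part

def ep_quantile(q, w, rate, alpha, x_min=1.0, tol=1e-9):
    """Invert the mixture CDF by bisection, for 0 < q < 1 and x_min > 0."""
    lo, hi = x_min, 2.0 * x_min
    while ep_cdf(hi, w, rate, alpha, x_min) < q:   # grow the bracket until it contains q
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if ep_cdf(mid, w, rate, alpha, x_min) < q:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical fitted parameters; threshold set at the 99th percentile.
threshold = ep_quantile(0.99, w=0.25, rate=0.1, alpha=1.6)
```

A threshold computed this way depends on a handful of fitted parameters rather than on the few largest observations, which is one plausible reason for the added stability.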
Component parameters are independent
This implies that the exponential and Pareto components are generated by separate sources.