

Computer Networks 52 (2008) 2645–2662

Contents lists available at ScienceDirect

Computer Networks

journal homepage: www.elsevier.com/locate/comnet

Internet traffic modeling by means of Hidden Markov Models q,qq

Alberto Dainotti a, Antonio Pescapé a, Pierluigi Salvo Rossi b,c, Francesco Palmieri c, Giorgio Ventre a,*

a Department of Computer Science and Systems, University of Naples "Federico II", Via Claudio 21, 80125 Napoli, Italy
b Department of Electronics and Telecommunications, Norwegian University of Science and Technology, O.S. Bragstads plass 2B, 7491 Trondheim, Norway
c Department of Information Engineering, Second University of Naples, Via Roma 29, 81031 Aversa (CE), Italy

Article info

Article history:
Received 1 March 2007
Received in revised form 12 December 2007
Accepted 7 May 2008
Available online 28 May 2008

Responsible Editor: M. Smirnow

1389-1286/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2008.05.004

q This work has been partially supported by PRIN 2007 RECIPE Project, by CONTENT EU Network of Excellence, and by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 216585 (INTERSECTION Project).
qq Preliminary results within the same framework of this work have been recently published in [1,2].

* Corresponding author. Tel.: +39 081 7683908; fax: +39 081 7682950.
E-mail addresses: [email protected] (A. Dainotti), [email protected] (A. Pescapé), [email protected] (P. Salvo Rossi), francesco.palmieri@unina2.it (F. Palmieri), [email protected] (G. Ventre).

Abstract

In this work, we propose a Hidden Markov Model for Internet traffic sources at packet level, jointly analyzing Inter Packet Time and Packet Size. We give an analytical basis and the mathematical details regarding the model, and we test the flexibility of the proposed modeling approach with real traffic traces related to common Internet services with strong differences in terms of both applications/users and protocol behavior: SMTP, HTTP, a network game, and an instant messaging platform. The presented experimental analysis shows that, even while maintaining a simple structure, the model is able to achieve good results in terms of estimation of statistical parameters and synthetic series generation, taking into account marginal distributions as well as mutual and temporal dependencies. Moreover, we show how, by exploiting such temporal dependencies, the model is able to perform short-term prediction by observing traffic from real sources.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Understanding and solving performance-related issues of current and future networks requires the availability of realistic, but still simple and manageable, traffic models. Therefore, the modeling of Internet traffic represents a critical task in the study and design of Internet architectures. Many efforts have focused on modeling source traffic related to specific application-level protocols, also with the purpose of conducting realistic network traffic simulation and emulation experiments (i.e. generating synthetic traffic in real networks).

Here we present a source-based modeling approach relying on a packet-level view of Internet traffic. The analysis of network traffic, indeed, can be made at different abstraction levels, e.g. session, conversation, connection/flow, packet, byte. With the term packet-level we mean the characterization of traffic in terms of Inter Packet Time (IPT) and Packet Size (PS). Such an approach is particularly attractive because of its conciseness and flexibility, and because it allows us to look at traffic from the lowest point of view. With the term source-based approach, we mean the characterization and modeling of traffic generated by Internet applications running on single hosts.

As for the analytical approach, we adopted a specifically suited Hidden Markov Model (HMM). The idea is to keep the model analytically simple and tractable, but capable of capturing important joint dynamics (in terms of both marginal distributions and time dependencies) of IPT and PS. We evaluate the model capabilities (learning, generation, and prediction) in order to construct realistic packet-level models from the automated analysis of empirical traffic traces, by considering the marginal distributions and the auto- and cross-covariances of IPT and PS. In addition, we compare synthetically generated sequences of IPT–PS pairs against those from real traces. Also, we show the capability of the proposed model to predict the short-term future behavior of the analyzed traffic on the basis of a very small amount of monitored traffic.

An important objective of this work is the design of a model flexible enough to work with different kinds of Internet traffic sources. For this reason, we apply our approach to traffic traces of various application-layer protocols related to very different Internet services. More precisely, we separately consider traffic generated by (i) SMTP, (ii) HTTP, (iii) a network game, and (iv) an instant messaging application. The conducted experimental investigation shows that, with a very limited complexity, the proposed model achieves acceptable results. Moreover, the prediction capabilities of the model, tested here in an off-line fashion, suggest that this modeling approach can be usefully applied to resource reservation and admission control.

Finally, according to the source-based approach, we do not focus on aggregate link traffic; instead, we separately analyze several sessions of traffic exiting from single hosts and related to specific application-level protocols. The purpose is to model the average single session for each considered application. The effect of superposition of multiple synthetic traffic sources, e.g. the presence of self-similarity and long-range dependence in the synthetic aggregated traffic, falls beyond the scope of this paper.

The rest of the paper is organized as follows. Section 2 presents related work and the motivations at the basis of this work. Section 3 provides an introduction to HMMs and a description of their application to build the proposed analytical model, giving details about model statistics and the learning stage. Section 4 describes the measurement approach, giving insights and motivations on the specific traffic taken into consideration. In Section 5, we show results of the model applied to SMTP, HTTP, Age of Mythology (AoM), and MSN Messenger. Section 6 ends the paper by discussing the presented results and giving concluding remarks.

2. Motivation and related work

Source traffic models are necessary to reproduce realistic user/application behavior in simulation environments or in network testbeds by injecting synthetic network traffic (e.g. traffic emulation). This allows the study of performance problems of network architectures by reconstructing the flows of packets generated by single sources. In the past, several source models related to HTTP traffic have been proposed [3,4], HTTP being the dominant Internet application, whereas only simple statistical characterizations of source traffic related to other applications like SMTP, network games, etc., have been presented [5,6]. Recent years, though, have seen a growing heterogeneity of Internet applications, making models for different kinds of applications necessary. Here, we explore the feasibility of a single modeling approach, flexible enough to work with different categories of sources, to be easily integrated into a traffic generation (or simulation) framework [7]. Moreover, even though this was the main focus of the present work, we note that for some of the considered traffic categories, such as network games, this work represents one of the first attempts to build a thorough statistical model able to take into account multiple properties of the traffic. Indeed, while there exists a rich literature on traffic characterization and sometimes modeling of network games [6,8–11], it usually focuses on fitting the marginal distributions of IPT and PS, sometimes by arbitrarily splitting the fitting into different analytical distributions for different portions of the sample set; time dependence (a relevant property to consider when studying traffic modeling and simulation) and mutual dependence are usually not taken into account.

The proposed model relies on HMMs to reproduce traffic sources at packet level. We focused on a packet-level view of traffic because it provides the following benefits when compared with higher-level approaches: (i) we look at traffic at the deepest level of detail while basing the observations on just two variables; (ii) switching devices often operate on a packet-by-packet basis, therefore it is important to have realistic packet-level models to evaluate their performance; (iii) most network performance problems (e.g. loss, delay, jitter) happen at packet level; (iv) working at packet level makes our approach independent of protocol evolution and applicable to different applications/protocols; (v) this kind of model is usable in traffic generators and simulators; (vi) traffic at packet level remains observable after encryption by, for example, end-to-end cryptographic protocols such as SSL or IPSec; (vii) packet-level traffic models enable robust approaches to traffic profiling for anomaly detection.

As far as the analytical modeling approach is concerned, we had to face the trade-off among accuracy (the capability to capture as many statistical properties of traffic dynamics as possible), flexibility, and simplicity. The use of HMMs allowed us to build an easily tractable model, capable of jointly taking into account IPT and PS first-order statistics as well as temporal dynamics and correlation. In spite of the large number of references related to network traffic modeling, very few works aim at joint modeling of IPT and PS [12–14]. However, it has been demonstrated that neglecting aspects related to PS (e.g. assuming a constant value) significantly affects performance analysis [12]. Correlation structure is also a fundamental aspect that must be considered [15] when realistic replication of traffic is needed.

Recently the interest in HMM-based models has grown, and HMMs have been proposed as a tool for several network traffic related research problems. In [16,17] HMMs have been used to model the states of packet channels via corresponding loss probabilities and end-to-end delay distributions. Similar works have been proposed to model wired [18] and wireless [19] packet channels. To the best of our knowledge, few works using HMMs to model traffic sources at packet level are present in the literature. Specifically, we found approaches to Internet traffic modeling able to capture temporal structures based on MMPP (Markov Modulated Poisson Process) [20] and BMAP (Batch Markovian Arrival Process) [12,13]. In [20], the authors propose a layered model to replicate traffic at edge routers that takes into account hierarchical characteristics of Internet traffic as well as long-range dependence properties, while in [13] an efficient implementation of analytically tractable models for aggregate IP traffic is presented, focusing on burstiness and self-similarity properties. The BMAP model proposed in [12], which considers both packet IPT and PS, is designed to capture the long-range dependence present in traces of aggregate link traffic, and it is evaluated in terms of queue analysis. In our work, instead, we concentrate on the traffic generated by single sources, which is then mixed in network links; therefore, here we do not consider queue analysis. In [21], a Markov-based model has been proposed and applied to variable bitrate MPEG traffic; GOP layer traffic characteristics for MPEG video traffic and sources are constructed from MPEG1 encoded video sequences. The same model has been applied to other traffic types (e.g. VoIP traffic). In [22], HMMs have been used to disjointly model IPT and PS of both aggregated and WWW traffic, comparing results against those from a stochastic generator based on a chaotic attractor. Finally, in [23], again considering IPT and PS disjointly, HMMs have been used to build traffic classifiers based on packet-level statistics related to some Internet applications. It is worth noting that the HMM-based modeling approach presented here is part of a more general framework that also includes packet-channel modeling [18,19]. The long-term objective is a powerful homogeneous analytical framework for effective modeling of packet-level environments in heterogeneous scenarios (both in terms of traffic sources and end-to-end network paths).

1 Notation – Upper (resp. lower) case bold letters denote matrices (resp. column vectors), $A_{i,j}$ (resp. $a_i$) denotes the $(i,j)$th (resp. $i$th) element of matrix $\mathbf{A}$ (resp. column vector $\mathbf{a}$), $\mathbf{1}$ denotes a column vector whose elements are all 1, $\delta_{i,j}$ denotes the Kronecker delta, $[\cdot]^T$ and $E\{\cdot\}$ denote the transpose and expectation operators, respectively, and the symbol $\sim$ means "distributed as".

2 If the observable variable is continuous, the observable matrix is replaced with a set of $N$ conditional pdfs, say $\{B_1(y), B_2(y), \ldots, B_N(y)\}$.

To highlight and summarize the significance of the approach proposed in this work, we underline that, to the best of our knowledge, it extends the results present in the literature in that

• it allows IPT/PS joint description;
• it allows synthetic series generation of both IPT and PS;
• it allows source state estimation with traffic prediction;
• it is derived from real traffic traces;
• it has been tested on different traffic types (quite different from each other in terms of both used protocols and users/applications behavior), deriving analogies and differences on the equivalent traffic models;
• results obtained with the analyzed traffic categories show the flexibility of the proposed model, making it generalizable;
• as regards games traffic, this represents one of the first works to present a more complete model taking into account the aforementioned statistical properties.

Finally, we would like to underline that at [24] we make publicly available the open-source tool (called Plab) used for traffic analysis and measurement at packet level, the algorithms developed for the analytical model, and the large set of heterogeneous data/traffic traces used in this work.

3. The model

3.1. Hidden Markov Models

We propose a statistical model¹ for packet-level network traffic. More specifically, we model the single source of traffic as an HMM. Generally speaking, an HMM may be viewed as a probabilistic function of a (hidden) Markov chain [25]; thus it is composed of two variables:

• the hidden-state variable, whose temporal evolution follows a Markov-chain behavior;
• the observable variable, which stochastically depends on the hidden state.

Its topology is shown in Fig. 1, where $x_n \in \{s_1, \ldots, s_N\}$ and $y_n \in \{o_1, \ldots, o_M\}$ represent the state and the observable at discrete time $n$, respectively, with $N$ and $M$ being the number of states and the number of observables, respectively. An HMM is characterized by

• $\boldsymbol{\varphi}$ – the initial state distribution, where $\varphi_i = \Pr(x_1 = s_i)$;
• $\mathbf{A}$ – the $N \times N$ state transition matrix, where $A_{i,j} = \Pr(x_n = s_j \mid x_{n-1} = s_i)$;
• $\mathbf{B}$ – the $N \times M$ observable generation matrix,² where $B_{i,j} = \Pr(y_n = o_j \mid x_n = s_i)$.

We denote by $\lambda = \{\boldsymbol{\varphi}, \mathbf{A}, \mathbf{B}\}$ the complete set of parameters. The three fundamental problems of an HMM are

• evaluation – given a model $\lambda$ and a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, efficiently compute the probability of the sequence given the model, $\Pr(\mathbf{y} \mid \lambda)$. It is solved via the forward–backward algorithm.
• reconstruction – given a model $\lambda$ and a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, find the most likely corresponding sequence of states $\mathbf{x} = (x_1, \ldots, x_L)$. It is solved via the Viterbi algorithm, a dynamic programming technique that computes the best score while keeping tracking variables.
• learning – given a sequence of observations $\mathbf{y} = (y_1, \ldots, y_L)$, find the set of parameters $\lambda$ such that the likelihood of the model $\mathcal{L}(\mathbf{y}; \lambda) = \Pr(\mathbf{y} \mid \lambda)$ is maximum. It is solved via the Baum–Welch algorithm, a special case of the Expectation–Maximization algorithm [26], which iteratively updates the parameters in order to find a local maximum of the likelihood over the parameter set.

Fig. 1. Hidden Markov Model topology.

It is worth noting that the recursive computation of the forward and backward variables has complexity $O(LN^2)$, compared with the complexity $O(LN^L)$ of direct calculation, with $L$ being the length of the sequence of observations. For a more comprehensive discussion on HMMs refer to [25,27,28].
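The evaluation problem and the complexity gap mentioned above can be illustrated with a minimal Python sketch for a discrete-observation HMM. The function names and the list-of-lists layout for the initial distribution, transition matrix, and observable generation matrix are assumptions made for this illustration, not part of the paper or of the Plab tool. The forward recursion costs on the order of L·N² operations, while the brute-force version enumerates all N^L state paths.

```python
from itertools import product


def forward_likelihood(phi, A, B, obs):
    """Pr(y | lambda) for a discrete HMM via the forward recursion."""
    n_states = len(phi)
    # alpha[i] = Pr(y_1..y_n, x_n = s_i); start with the first observation
    alpha = [phi[i] * B[i][obs[0]] for i in range(n_states)]
    for y in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][y]
            for j in range(n_states)
        ]
    return sum(alpha)


def brute_force_likelihood(phi, A, B, obs):
    """Same quantity by summing over every possible state path (N^L paths)."""
    n_states = len(phi)
    total = 0.0
    for path in product(range(n_states), repeat=len(obs)):
        p = phi[path[0]] * B[path[0]][obs[0]]
        for n in range(1, len(obs)):
            p *= A[path[n - 1]][path[n]] * B[path[n]][obs[n]]
        total += p
    return total
```

Both functions return the same value on small examples; only the forward version remains practical as the sequence grows. For long sequences a scaled or log-domain recursion would be needed to avoid underflow, which this sketch omits.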

3.2. A packet-level source model

Referring to a single source of traffic, we consider an HMM in which the state variable is discrete, $x_n \in \{s_1, \ldots, s_N\}$, and the observable variable is a continuous bi-dimensional vector, $\mathbf{y}_n = [d_n, b_n]^T$. The first and second components of $\mathbf{y}_n$ represent the IPT and the PS of the $n$th packet, expressed respectively in dBμ (which we define as $10\log_{10}(\mathrm{IPT}/1\,\mu\mathrm{s})$) and in bytes.³ The state variable has been introduced to account for memory and correlation phenomena between IPT and PS. We assume that IPT and PS are statistically independent given the state. Also, in order to reduce the number of parameters, we assume $\boldsymbol{\varphi} = \boldsymbol{\rho}$, where $\boldsymbol{\rho}$ is the steady-state distribution,⁴ given by $\mathbf{A}^T \boldsymbol{\rho} = \boldsymbol{\rho}$.
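Two ingredients of the paragraph above can be sketched in a few lines of Python: the logarithmic IPT unit and the steady-state distribution of the transition matrix. The function names are hypothetical, and power iteration is only one possible way to solve the fixed-point equation; it is assumed here that the chain is irreducible and aperiodic so that the iteration converges.

```python
import math


def ipt_to_dbmu(ipt_seconds):
    """Convert an inter-packet time in seconds to 10*log10(IPT / 1 microsecond)."""
    return 10.0 * math.log10(ipt_seconds / 1e-6)


def steady_state(A, iterations=1000):
    """Approximate the distribution rho satisfying A^T rho = rho by power iteration.

    A is a row-stochastic matrix given as a list of rows.
    """
    n = len(A)
    rho = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # one step of rho <- A^T rho
        rho = [sum(rho[i] * A[i][j] for i in range(n)) for j in range(n)]
    return rho
```

For instance, an IPT of 1 ms corresponds to 30 in this unit, and a two-state chain with rows (0.9, 0.1) and (0.5, 0.5) has steady-state probabilities 5/6 and 1/6.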

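$\Lambda = \{\mathbf{A}, \boldsymbol{\eta}^{(t)}, \boldsymbol{\psi}^{(t)}, \boldsymbol{\eta}^{(p)}, \boldsymbol{\psi}^{(p)}\}$ is the set of parameters characterizing the model, comprising the state transition matrix and the conditional IPT and PS distribution parameter vectors, respectively, i.e.

• $A_{i,j} = \Pr(x_{n+1} = s_j \mid x_n = s_i)$;
• $d_n \mid x_n = s_i \sim \Gamma(\eta_i^{(t)}, \psi_i^{(t)})$;
• $b_n \mid x_n = s_i \sim \Gamma(\eta_i^{(p)}, \psi_i^{(p)})$;

then the conditional pdfs for IPT and PS are:

\[
f_i^{(t)}(d) = \frac{\left(d/\psi_i^{(t)}\right)^{\eta_i^{(t)}-1} e^{-d/\psi_i^{(t)}}}{\psi_i^{(t)}\,\Gamma\!\left(\eta_i^{(t)}\right)} \quad (d > 0),
\]

\[
f_i^{(p)}(b) = \frac{\left(b/\psi_i^{(p)}\right)^{\eta_i^{(p)}-1} e^{-b/\psi_i^{(p)}}}{\psi_i^{(p)}\,\Gamma\!\left(\eta_i^{(p)}\right)} \quad (b > 0).
\]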

3 We measure IPT with a resolution of 1 μs (as explained in Section 4) and apply a logarithmic transformation because the values range over several orders of magnitude.

4 If $x_n$ is an irreducible and aperiodic process, the steady-state distribution equals the limit distribution, $\rho_i = \lim_{n\to\infty} \Pr(x_n = s_i)$, see [29].

Gamma distributions were chosen for IPT and PS for the following reasons: a mixture of normal distributions can easily approximate a general distribution; a Gamma distribution can closely resemble a normal distribution; and it has the desirable characteristic of being null for negative values (negative IPT and PS being meaningless). Summarizing, we have a model where $x_n$ is a discrete random variable whose dynamic behavior is governed by the transition matrix $\mathbf{A}$, with a Markovian assumption for the evolution, and $\mathbf{y}_n$ is a bi-dimensional continuous random variable describing IPT and PS as mixtures of conditionally independent (given the state) Gamma distributions, i.e.

\[
f_i(\mathbf{y}_n) = f_i^{(t)}(d_n)\, f_i^{(p)}(b_n). \tag{1}
\]
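As a rough illustration of Eq. (1), the following Python sketch evaluates the per-state density of an (IPT, PS) pair as the product of two Gamma pdfs in shape/scale form. The dictionary keys `eta_t`, `psi_t`, `eta_p`, `psi_p` are names invented for this example; the pdf is computed in the log domain with `math.lgamma` for numerical stability, which is a design choice of the sketch rather than something stated in the paper.

```python
import math


def gamma_pdf(x, shape, scale):
    """Gamma pdf with the given shape and scale; zero for non-positive x."""
    if x <= 0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x / scale)
               - x / scale
               - math.log(scale)
               - math.lgamma(shape))
    return math.exp(log_pdf)


def emission_density(d, b, state):
    """Joint density of (IPT d, PS b) in one state, assuming conditional independence."""
    return (gamma_pdf(d, state["eta_t"], state["psi_t"])
            * gamma_pdf(b, state["eta_p"], state["psi_p"]))
```

A state would be described as, for example, `{"eta_t": 4.0, "psi_t": 8.0, "eta_p": 2.0, "psi_p": 300.0}`; these numbers are placeholders, not fitted values from the traces.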

3.2.1. Model statistics

The IPT and PS conditional means and standard deviations are

\[
\mu_i^{(t)} = \eta_i^{(t)}\psi_i^{(t)}, \quad
\sigma_i^{(t)} = \sqrt{\eta_i^{(t)}}\,\psi_i^{(t)}, \quad
\mu_i^{(p)} = \eta_i^{(p)}\psi_i^{(p)}, \quad
\sigma_i^{(p)} = \sqrt{\eta_i^{(p)}}\,\psi_i^{(p)}, \tag{2}
\]

respectively, due to the conditional Gamma distribution assumption; the IPT and PS global means and standard deviations of the model are then

\[
\mu^{(t)} = \sum_{i=1}^{N} \rho_i \mu_i^{(t)}, \quad
\sigma^{(t)} = \sqrt{\sum_{i=1}^{N} \rho_i \mu_i^{(t)} \left(1+\eta_i^{(t)}\right)\psi_i^{(t)} - \left(\mu^{(t)}\right)^2},
\]

\[
\mu^{(p)} = \sum_{i=1}^{N} \rho_i \mu_i^{(p)}, \quad
\sigma^{(p)} = \sqrt{\sum_{i=1}^{N} \rho_i \mu_i^{(p)} \left(1+\eta_i^{(p)}\right)\psi_i^{(p)} - \left(\mu^{(p)}\right)^2}. \tag{3}
\]

Also, global IPT and PS pdfs are

\[
f_{\mathrm{IPT}}(d) = \sum_{i=1}^{N} \rho_i f_i^{(t)}(d), \quad
f_{\mathrm{PS}}(b) = \sum_{i=1}^{N} \rho_i f_i^{(p)}(b).
\]

The mean duration of a visit to state $s_i$ (conditional on being in that state) is

\[
\phi_i = \frac{1}{1 - A_{i,i}}. \tag{4}
\]
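The global mean and standard deviation of Eq. (3) and the state duration of Eq. (4) are simple to compute once the per-state Gamma shape and scale values and the steady-state probabilities are known. The sketch below is a hedged Python illustration; the function names and the plain-list inputs are assumptions of this example.

```python
import math


def global_mean_std(rho, eta, psi):
    """Global mean and standard deviation of a Gamma mixture weighted by rho.

    rho: steady-state probabilities; eta, psi: per-state shape and scale.
    """
    mean = sum(r * e * p for r, e, p in zip(rho, eta, psi))
    # per-state second moment: mu_i * (1 + eta_i) * psi_i, with mu_i = eta_i * psi_i
    second = sum(r * (e * p) * (1.0 + e) * p for r, e, p in zip(rho, eta, psi))
    return mean, math.sqrt(second - mean ** 2)


def mean_state_duration(A):
    """Mean number of consecutive steps spent in each state, 1 / (1 - A[i][i])."""
    return [1.0 / (1.0 - A[i][i]) for i in range(len(A))]
```

With a single state the global standard deviation reduces to the conditional one, which gives a quick sanity check of the formula.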

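IPT and PS auto- and cross-correlations of the model are

\[
R^{(t)}(m) = E\{d_n d_{n+m}\} =
\begin{cases}
\boldsymbol{\rho}^T \mathbf{E}_{II}^{(t)} \mathbf{1}, & m = 0, \\
\boldsymbol{\rho}^T \mathbf{E}^{(t)} \mathbf{A}^{|m|-1} \mathbf{E}^{(t)} \mathbf{1}, & m \neq 0,
\end{cases}
\]

\[
R^{(p)}(m) = E\{b_n b_{n+m}\} =
\begin{cases}
\boldsymbol{\rho}^T \mathbf{E}_{II}^{(p)} \mathbf{1}, & m = 0, \\
\boldsymbol{\rho}^T \mathbf{E}^{(p)} \mathbf{A}^{|m|-1} \mathbf{E}^{(p)} \mathbf{1}, & m \neq 0,
\end{cases}
\]

\[
R^{(tp)}(m) = E\{d_n b_{n+m}\} =
\begin{cases}
\boldsymbol{\rho}^T \mathbf{E}_{II}^{(tp)} \mathbf{1}, & m = 0, \\
\boldsymbol{\rho}^T \mathbf{E}^{(t)} \mathbf{A}^{|m|-1} \mathbf{E}^{(p)} \mathbf{1}, & m \neq 0,
\end{cases}
\]

where

\[
\left[\mathbf{E}_{II}^{(t)}\right]_{i,j} = A_{i,i}\left(1+\eta_i^{(t)}\right)\eta_i^{(t)}\left(\psi_i^{(t)}\right)^2 \delta_{i,j}, \quad
\left[\mathbf{E}^{(t)}\right]_{i,j} = A_{i,i}\,\eta_i^{(t)}\psi_i^{(t)}\,\delta_{i,j},
\]

\[
\left[\mathbf{E}_{II}^{(p)}\right]_{i,j} = A_{i,i}\left(1+\eta_i^{(p)}\right)\eta_i^{(p)}\left(\psi_i^{(p)}\right)^2 \delta_{i,j}, \quad
\left[\mathbf{E}^{(p)}\right]_{i,j} = A_{i,i}\,\eta_i^{(p)}\psi_i^{(p)}\,\delta_{i,j},
\]

\[
\left[\mathbf{E}_{II}^{(tp)}\right]_{i,j} = A_{i,i}\,\eta_i^{(t)}\psi_i^{(t)}\,\eta_i^{(p)}\psi_i^{(p)}\,\delta_{i,j}.
\]

It is worth noting that $\mathbf{E}$ and $\mathbf{E}_{II}$ are first-order and second-order statistics matrices, respectively. To show traffic dynamics without the biasing effects of IPT and PS global means, in Section 5 covariances are taken into account instead of correlations.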


3.2.2. Learning the model parameters

The Expectation–Maximization algorithm [26] is an optimization procedure that allows learning of a new set of parameters for a stochastic model according to improvements of the likelihood of a given sequence of observable variables. For structures like HMMs this optimization technique reduces to the Baum–Welch algorithm [25,27,28], studied for discrete and continuous observable variables with a broad class of allowed conditional pdfs. The Baum–Welch algorithm is an iterative procedure looking for a local maximum of the likelihood function, which typically depends on the starting point $\Lambda$. When necessary, multiple training runs with different initial conditions are used to search for the global solution.

More specifically, consider a set of observable sequences $\mathcal{Y} = \{\mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(K)}\}$, referred to as the training set, where each sequence $\mathbf{Y}^{(k)} = [\mathbf{y}_1^{(k)}, \ldots, \mathbf{y}_{L_k}^{(k)}]$ represents IPT and PS from a single session.⁵ We want to find the set of parameters such that the likelihood $\mathcal{L}(\mathcal{Y}; \Lambda) = \Pr(\mathcal{Y} \mid \Lambda)$ of the training set is maximum. The Baum–Welch algorithm for the proposed source-traffic model is then based on the following equations:

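\[
A_{i,j} = \frac{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, A_{i,j}\, f_j\!\left(\mathbf{y}_{n+1}^{(k)}\right) \beta_{n+1}^{(k)}(j)}{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},
\]

\[
\eta_i^{(t)}\psi_i^{(t)} = \frac{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, d_n^{(k)}}{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},
\]

\[
\eta_i^{(p)}\psi_i^{(p)} = \frac{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)\, b_n^{(k)}}{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},
\]

\[
\eta_i^{(t)}\left(\psi_i^{(t)}\right)^2 = \frac{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i) \left(d_n^{(k)} - \mu_i^{(t)}\right)^2}{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},
\]

\[
\eta_i^{(p)}\left(\psi_i^{(p)}\right)^2 = \frac{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i) \left(b_n^{(k)} - \mu_i^{(p)}\right)^2}{\sum_{k=1}^{K} \frac{1}{\mathcal{L}^{(k)}} \sum_{n=1}^{L_k-1} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i)},
\]

where the left-hand sides denote the re-estimated quantities and, referring to the $k$th sequence, the likelihood is

\[
\mathcal{L}^{(k)} = \Pr\!\left(\mathbf{Y}^{(k)} \mid \Lambda\right) = \sum_{i=1}^{N} \alpha_n^{(k)}(i)\, \beta_n^{(k)}(i),
\]

and the Forward and Backward variables are computed according to the following recursions:

\[
\alpha_n^{(k)}(j) =
\begin{cases}
\sum_{i=0}^{N-1} \alpha_{n-1}^{(k)}(i)\, A_{i,j}\, f_j\!\left(\mathbf{y}_n^{(k)}\right), & n = 1, \ldots, L_k, \\
\delta_{1,j}, & n = 0,
\end{cases}
\]

\[
\beta_n^{(k)}(i) =
\begin{cases}
\sum_{j=0}^{N-1} A_{i,j}\, f_j\!\left(\mathbf{y}_{n+1}^{(k)}\right) \beta_{n+1}^{(k)}(j), & n = 0, \ldots, L_k - 1, \\
1, & n = L_k.
\end{cases}
\]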
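The two recursions above can be sketched in Python as follows. This is an unscaled illustration for a single sequence, with a hypothetical `emission(j, y)` callable standing in for the per-state density; it uses 0-based state indices and reads the initial condition as "all probability mass on the first state", which is an interpretation assumed for this example. A practical implementation would add scaling or work in the log domain to avoid underflow on long sequences.

```python
def forward_backward(A, emission, ys):
    """Return (alpha, beta), each indexed as [n][state] for n = 0..L.

    A: row-stochastic transition matrix; emission(j, y): density of y in state j;
    ys: observations y_1..y_L stored at positions 0..L-1.
    """
    N, L = len(A), len(ys)
    # n = 0: Kronecker-delta initial condition on the first state
    alpha = [[1.0 if j == 0 else 0.0 for j in range(N)]]
    for n in range(1, L + 1):
        y = ys[n - 1]  # y_n in 1-based notation
        alpha.append([
            sum(alpha[n - 1][i] * A[i][j] for i in range(N)) * emission(j, y)
            for j in range(N)
        ])
    # n = L: backward variables are all ones
    beta = [[1.0] * N for _ in range(L + 1)]
    for n in range(L - 1, -1, -1):
        y_next = ys[n]  # y_{n+1} in 1-based notation
        beta[n] = [
            sum(A[i][j] * emission(j, y_next) * beta[n + 1][j] for j in range(N))
            for i in range(N)
        ]
    return alpha, beta


def likelihood(alpha, beta, n):
    """Sequence likelihood as the sum over states of alpha_n(i) * beta_n(i)."""
    return sum(a * b for a, b in zip(alpha[n], beta[n]))
```

A useful consistency check is that `likelihood(alpha, beta, n)` gives the same value for every `n`, which is why the likelihood expression above can be evaluated at any time index.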

5 The meaning of "session" will be better defined in Section 4.

In our experiments, the parameter set $\Lambda$ was initialized so that the conditional pdfs, for both IPT and PS, are evenly spread over the whole observed range. More specifically, the state-transition matrix is given by

\[
A_{i,j} = 1/N, \tag{5}
\]

while, denoting

\[
d_{\min} = \min_{k,n} d_n^{(k)}, \quad d_{\max} = \max_{k,n} d_n^{(k)}, \quad
b_{\min} = \min_{k,n} b_n^{(k)}, \quad b_{\max} = \max_{k,n} b_n^{(k)},
\]

$\{\boldsymbol{\eta}^{(t)}, \boldsymbol{\psi}^{(t)}, \boldsymbol{\eta}^{(p)}, \boldsymbol{\psi}^{(p)}\}$ are chosen as

\[
\mu_{i+1}^{(t)} - \mu_i^{(t)} = \frac{d_{\max} - d_{\min}}{N+1}, \quad
\sigma_i^{(t)} = \frac{d_{\max} - d_{\min}}{5(N+1)},
\]

\[
\mu_{i+1}^{(p)} - \mu_i^{(p)} = \frac{b_{\max} - b_{\min}}{N+1}, \quad
\sigma_i^{(p)} = \frac{b_{\max} - b_{\min}}{5(N+1)}. \tag{6}
\]
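The initialization in Eqs. (5) and (6) can be sketched for one observable (IPT or PS) as follows. Eq. (6) fixes only the spacing between consecutive state means, so this Python example additionally assumes that the first mean sits one step above the observed minimum; that offset, like the function name, is an assumption of the sketch. Gamma shape and scale are then recovered from mean and standard deviation through Eq. (2), which requires strictly positive means.

```python
def initial_parameters(values, n_states):
    """Initial guess for one observable: uniform A, evenly spaced means, common sigma.

    Returns (A, means, sigma, shapes, scales).
    """
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_states + 1)          # spacing between state means
    sigma = (hi - lo) / (5 * (n_states + 1))   # common standard deviation
    # assumption: the first mean is placed one step above the minimum
    means = [lo + (i + 1) * step for i in range(n_states)]
    # invert mu = eta * psi and sigma = sqrt(eta) * psi
    shapes = [(m / sigma) ** 2 for m in means]
    scales = [sigma ** 2 / m for m in means]
    A = [[1.0 / n_states] * n_states for _ in range(n_states)]
    return A, means, sigma, shapes, scales
```

Calling it separately on the pooled IPT values and on the pooled PS values of the training set would give the two parameter pairs; the transition matrix is the same in both cases.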

4. Traffic traces and measurement approach

4.1. Considered traffic

To verify its flexibility and general applicability, the proposed modeling approach has been tested with different categories of Internet traffic sources. The choice of such applications takes into account their level of novelty and popularity. Also, we considered applications differing from several points of view, all of which are reflected in traffic peculiarities: human–computer interaction, transferred objects, underlying network protocols, etc.

The list of considered Internet applications, along with details of the corresponding traffic traces we used,⁶ is reported in Table 1.

Firstly, we considered more traditional services such as the Web and email. Although HTTP and SMTP are two applications that involve almost the whole Internet population (they are the most used by common users), they substantially differ in the kinds of objects handled as well as in the level of user interaction. The characteristics of traffic generated by HTTP clients can be heavily affected by the human factor, above all as regards timings [2,3], whereas SMTP client traffic is affected by users mostly in terms of the number and size of packets to be transferred.

Secondly, we considered applications which have become popular in recent years and currently represent an increasing portion of the overall Internet traffic: instant messaging and multi-player network games. They both present novel and interesting characteristics with respect to other applications. Due to these differences, interest in the characterization and modeling of the traffic of both games and instant messaging has increased in recent years [6,11,31,32].

Network games have strict latency requirements and traffic properties which substantially differ from more traditional Internet applications [6]. Moreover, while their traffic already represents a relevant percentage (in [33] it was reported that about 4% of all packets in a backbone could be associated with only six popular network games), it is constantly increasing. Thus, analysis of such traffic is crucial to properly design and provision networks for future needs. We studied traffic generated by Age of Mythology (AoM), a Microsoft Real Time Strategy Multiplayer Game [34]. As regards instant messengers, they are used by 50% of Internet users all around the world [35], with MSN Messenger being the most popular application, followed by AOL and Yahoo Messenger. In this work we model the traffic generated by MSN Messenger (MSN in the following) clients [36]. The level of user interaction in this kind of application is obviously much higher. Moreover, because they represent a new vehicle for viruses, worms, and other kinds of malicious use, the study of instant messaging applications, besides email, also has interesting security implications. For example, the traffic behavior characterization and modeling of such applications could be exploited for security purposes (classification, detection, prevention).

Table 1. Traffic traces details

Traffic  Link  Protocol  Port  Date    Size   Pkts   Sessions
SMTP     WAN   TCP       25    9/2005  3 GB   43 M   56 K
HTTP     WAN   TCP       80    7/2004  60 GB  830 M  1 M
AoM      LAN   UDP       2300  8/2003  12 MB  180 K  6
MSN      WAN   TCP       1863  4/2006  1 GB   9 M    1 M

6 Apart from the AoM traffic traces, available at [30], they are freely available at [24].

Table 1 gives details about the traffic traces that we analyzed. As regards SMTP, HTTP, and MSN, we captured traffic by passively monitoring the WAN access link of the University of Napoli "Federico II" network during the period January 2004–April 2006. The observed link is the only connection of the University network to the Internet, and it has a maximum throughput of 200 Mbps.

With the term "session", in the case of SMTP (resp. HTTP), we mean all the traffic exchanged between two hosts related to port TCP 25 (resp. TCP 80), with a timeout of 15 min. As regards SMTP, we present results from the sessions with fewer than 100 packets, which we defined as short-lived, and which account for ≈97% of the SMTP sessions. We restricted the analysis to this class because we found that the remaining sessions exhibit extremely different statistical properties. This was confirmed by a K-means clustering we performed using a few features per session, e.g. number of packets, bytes, IPT and PS mean and variance. Note that considering only this class does not affect our approach, as we do not aim to provide a comprehensive model for SMTP traffic; at this stage we want to show that the proposed approach is also applicable to this kind of traffic.

As regards MSN, its protocol uses a client–server communication model in which user clients interact with Microsoft servers that belong to the MSN Messenger network and accept connections on TCP port 1863 [37]. There are mainly two kinds of servers, which offer presence and instant messaging services, respectively [38]. Analysis of both the communication protocol and real traffic traces allowed us to identify the subnets associated with each service. We collected traffic related to both services and both directions (inbound and outbound). In this work we report results related only to the outbound direction (i.e. from the clients to the servers) of instant messaging traffic. Each session is made of all the traffic exchanged between a single client–server pair (related to server port TCP 1863), with an inactivity timeout of 15 min.

The AoM traces, instead, were provided, in Tcpdump format, by the Worcester Polytechnic Institute (WPI), MA (USA) [30]. They consist of packet sequences of complete gaming sessions between two players, captured in a LAN environment. We consider an AoM session to be all the traffic exchanged from the beginning to the end of a match. Only six gaming sessions were studied, because packet-level traffic of RTS games has been shown to be very predictable and strongly dependent on the specific game application, whereas it depends only weakly on user behavior [39]. Indeed, past works studying the statistical characterization of the traffic generated by this game have used only such traces. Such works show that this traffic is substantially different from the traffic of more classical network applications. Moreover, in [40] we presented and discussed results on the invariance of gaming traffic when observed under different conditions, which makes the use of a small number of traces reasonable. As regards the SMTP, HTTP, and MSN traces, instead, we observed a much larger set of sessions. This is because of the more complex nature of such traffic [2] and also because we could gather our own traces.

Fig. 2. Life cycle of data analysis. [Figure not reproduced; stages shown: capture; trace inspection; trace sanitization & anonymization; measurements and preliminary analysis; data analysis.]

4.2. Tools and issues with the data

Obtaining and making available traffic data useful for characterization and modeling is a complex task, which consists not only of traffic collection and selection of the appropriate traffic flows, but also involves activities such as data sanitization and anonymization (see Fig. 2).

Fig. 3. Filtered UNINA upstream: IPT PDF (top), PS CDF (bottom). [Plots not reproduced; top x-axis: log10(x), x = 1E–5 s; bottom x-axis: bytes; curves: sanitized, unsanitized.]

We used Plab [2,24] to capture the traffic traces we collected and analyzed. Plab is open-source software, partially based on the Libpcap library [41], that we developed for the analysis of live traffic and of trace files in tcpdump format, and that focuses on packet-level measurement and analysis. This platform, also employed in previous works on traffic analysis and modeling, can efficiently analyze very large traffic traces and separate traffic into different sessions. Depending on user-specified parameters, a session can be identified by: (i) all packets sent and received by a host (host mode); (ii) all packets identified by source and destination IP and ports, with a default timeout of 60 s (flow mode); (iii) all packets exchanged by 2 hosts related to a specific service (e.g. TCP port 80), with a user-definable timeout (conversation mode). Given one of the above modes, sessions are assigned an ID, and for each session the IPTs between packets flowing in the same direction are calculated, along with the PS. We call such data packet-level data series.
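The conversation mode described above can be illustrated with a minimal Python sketch. It is not Plab's implementation: the packet tuple layout `(timestamp, src, dst, payload_len)`, the function names, and the choice of pairing each IPT with the size of the packet that closes the interval are assumptions made only for illustration.

```python
def split_conversations(packets, timeout=900.0):
    """Group packets into conversations between host pairs.

    packets: iterable of (timestamp_s, src, dst, payload_len), sorted by time
             (hypothetical layout, not Plab's format).
    A new session starts when a host pair stays silent longer than `timeout`
    seconds (900 s = the 15 min used in the text).
    """
    sessions = []        # list of sessions, each a list of packets
    open_session = {}    # host pair -> index of its current session
    last_seen = {}       # host pair -> timestamp of its latest packet
    for pkt in packets:
        ts, src, dst, _ = pkt
        key = frozenset((src, dst))          # direction-independent key
        if key not in open_session or ts - last_seen[key] > timeout:
            open_session[key] = len(sessions)
            sessions.append([])
        sessions[open_session[key]].append(pkt)
        last_seen[key] = ts
    return sessions


def directional_series(session, src):
    """(IPT, PS) pairs for the packets of one session sent by `src`.

    Assumption: each IPT is paired with the size of the later packet.
    """
    sent = [(ts, size) for ts, s, _, size in session if s == src]
    return [(t1 - t0, size1) for (t0, _), (t1, size1) in zip(sent, sent[1:])]


packets = [
    (0.0, "c", "s", 100),
    (0.5, "s", "c", 40),
    (1.0, "c", "s", 200),
    (2000.0, "c", "s", 50),   # silent for more than 15 min: new session
]
sessions = split_conversations(packets)
upstream = directional_series(sessions[0], "c")
```

In this toy trace the last packet opens a second session, and the upstream series of the first session contains a single IPT–PS pair.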

With Plab it is possible to specify command-line filters in tcpdump/Berkeley Packet Filter syntax to select the type of traffic to be captured or analyzed, e.g. layer 3 protocol, port, etc. We also added more intelligence to the software, such as the ability to decode TCP header options like the MSS, or to filter packets or entire sessions based on several other criteria. For example, we introduced some payload inspection capabilities, which we used for data sanitization.

As regards sanitization, we report here on how we removed, from the considered data, samples related to traffic which was not HTTP but tried to masquerade as such by running on TCP port 80. We instructed Plab to analyze the first 3 bytes of payload data exchanged between each host pair of a conversation related to TCP port 80. Under normal conditions, such bytes should correspond to the method invoked by the client in an HTTP request. As reported in Table 2, we observed that almost 94% of the sessions started with a GET request, 4% with a POST request, etc. Only a small fraction of the sessions presented packets starting with a byte not corresponding to an alphabetic character. Within this category, 99% of the conversations started with the byte 0xe3, the first byte exchanged by peers opening a communication session based on the eDonkey2000 protocol [42], used by the eMule and eDonkey file-sharing applications. Also, 0.44% of the sessions were initiated by the host communicating from TCP port 80 (labeled as "downstream" in Table 2). Because our interest was in modeling traffic generated only by applications running over HTTP, we instructed Plab to recognize such sessions and to filter them out. By filtering our traces, we observed that 5.12% of the processed packets were discarded. Therefore, this non-HTTP traffic represents a non-negligible portion of the captured traffic. As regards the number of filtered sessions, they account for about 0.7% of the total. This suggests that the filtered sessions tend to generate more packets than authentic HTTP ones. By comparing the results obtained with and without filtering such sessions, we observed that the discarded traffic had a considerable impact in terms of payload size and inter-packet time. Comparisons of the obtained distributions for upstream traffic at the UNINA site are shown in Fig. 3.
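The payload check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Plab code: the function names and the exact keep/discard rule (drop conversations opened from the service port or starting with a non-alphabetic byte) are assumptions derived from the description.

```python
def classify_start(first_payload: bytes, src_port: int, service_port: int = 80) -> str:
    """Label a conversation by the first payload bytes of its opening packet."""
    if src_port == service_port:
        return "downstream"                  # opened by the host on port 80
    if not first_payload:
        return "empty"
    if not first_payload[:1].isalpha():      # e.g. 0xe3 for eDonkey2000
        return f"0x{first_payload[0]:02x}"
    # First 3 bytes of the method name, as in Table 2 (GET, POS, HEA, PRO)
    return first_payload[:3].decode("ascii", errors="replace")


def keep_as_http(label: str) -> bool:
    """Assumed filter rule: keep only upstream starts with an alphabetic byte."""
    return label not in ("downstream", "empty") and not label.startswith("0x")


labels = [
    classify_start(b"GET / HTTP/1.1\r\n", 51000),
    classify_start(b"POST /form HTTP/1.1\r\n", 51001),
    classify_start(b"\xe3\x26\x00\x00", 4662),
    classify_start(b"HTTP/1.1 200 OK\r\n", 80),
]
kept = [label for label in labels if keep_as_http(label)]
```

Here the four example conversations are labeled `GET`, `POS`, `0xe3` and `downstream`, and only the first two would be retained.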

Table 2. Payload inspection on the first packet opening a conversation related to TCP port 80

| Conversation start | GET | POS | HEA | Downstream | 0xe3 | PRO |
|---|---|---|---|---|---|---|
| Percentage | 93.94 | 4.23 | 0.7 | 0.44 | 0.27 | 0.2 |

Observing the properties of such distributions, it is clear that the filtered sessions increase the portion of back-to-back packets with full payload, probably due to the presence of file transfers. As reported in [2], after filtering out such traffic from the traces captured at the UNINA site, we found packet-level profiles strongly similar to those obtained by observing traffic at another site in which no Peer-to-Peer applications were running. This not only confirms the correctness of the sanitization we performed, but also reveals important invariants (with respect to space and time) in the characteristics of the studied traffic.

The above example shows that acquiring realistic and reliable data to be used as a reference for traffic modeling is a delicate and sometimes not straightforward task, which requires attention and appropriate tools.

Finally, as regards data anonymization, to preserve users' privacy we kept only the IP and TCP headers of each packet, and we scrambled IP addresses using the wide-tcpdpriv tool from the MAWI-WIDE project [43].

Fig. 4. Log-likelihood (normalized with respect to the average length of the session) vs. iteration. [Plot not reproduced; x-axis: iteration (1–10); y-axis: log-likelihood/average length; curves: AoM, SMTP, HTTP, MSN.]

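Table 3. Model discrepancy: IPT-λ² and PS-λ²

| Traffic | Model | IPT-λ² | PS-λ² |
|---|---|---|---|
| SMTP | Starting | 2.1 × 10^8 | 3.3 × 10^46 |
| SMTP | Trained | 1.4 | 4.6 |
| HTTP | Starting | 1.7 × 10^6 | 1.1 × 10^44 |
| HTTP | Trained | 0.31 | 0.30 |
| AoM | Starting | 0.99 × 10^2 | 2.2 × 10^16 |
| AoM | Trained | 0.24 | 1.6 |
| MSN | Starting | 1.7 × 10^2 | 1.3 × 10^47 |
| MSN | Trained | 0.68 | 1.8 |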

Table 4. Covariance EF: IPT-K, IPT-m, PS-K, PS-m, IPT/PS-K and IPT/PS-m

| Traffic | Model | IPT-K | IPT-m | PS-K | PS-m | IPT/PS-K | IPT/PS-m |
|---|---|---|---|---|---|---|---|
| SMTP | Data | 1.0 | 1.2 | 1.0 | 0.91 | −0.31 | 0.46 |
| SMTP | Starting | 1.0 | 50 | 1.0 | 50 | 1.0 | 50 |
| SMTP | Trained | 1.0 | 0.43 | 1.0 | 0.25 | −0.54 | 0.18 |
| HTTP | Data | 1.0 | 0.75 | 1.0 | 0.63 | 0.16 | 1.1 |
| HTTP | Starting | 1.0 | 50 | 1.0 | 50 | 1.0 | 50 |
| HTTP | Trained | 1.0 | 1.8 | 1.0 | 0.29 | 0.17 | 0.98 |
| AoM | Data | 1.0 | 42 | 1.0 | 42 | −0.10 | 0.086 |
| AoM | Starting | 1.0 | 50 | 1.0 | 50 | 1.0 | 50 |
| AoM | Trained | 1.0 | 47 | 1.0 | 47 | −0.17 | 2.4 |
| MSN | Data | 1.0 | 1.0 | 1.0 | 0.49 | −0.38 | 0.19 |
| MSN | Starting | 1.0 | 50 | 1.0 | 50 | 1.0 | 50 |
| MSN | Trained | 1.0 | 0.42 | 1.0 | 0.21 | −0.59 | 0.17 |

4.3. Measurement methodology and analyzed data

For each session between two hosts, two separate flows of data can be identified depending on their direction, which we call upstream and downstream. In the case of SMTP, HTTP, and MSN, we identify as upstream the traffic flowing from a client to a server, that is, packets with destination port set to TCP 25, TCP 80, and TCP 1863, respectively, whereas downstream traffic flows in the opposite direction. In the case of SMTP, it is worth mentioning that the vast majority of downstream flows (for each session) are made of only a few packets (about 5) of small size. Thus they represent a very small portion of SMTP traffic. This can be explained by the SMTP protocol specification: the peer acting as a server usually answers requests and data transfers from the client with small messages that must have a numeric ID prepended. As for HTTP, instead, large volumes of traffic are generated in both directions, owing to the intrinsic nature of Web traffic.

In this paper, we concentrate on the traffic sources represented by SMTP, HTTP, and MSN clients; we therefore model only upstream traffic. We adopt the same approach for AoM, modeling the traffic flowing in the outbound direction when seen from the point of view of a specific peer (i.e. leaving the workstation of a gaming user). In any case, since the observed AoM traces relate to matches with two players, the traffic flowing in the other direction is almost symmetrical.

An important aspect of our methodology is that, in the evaluation of IPT and PS distributions, we did not take into account packets with empty payload at transport level. Since we wanted to characterize the traffic generated by the applications, as independently as possible of the transport-level protocol itself, we decided to drop all TCP-specific traffic, such as connection establishment packets (SYN, SYN-ACK, ACK) and pure acknowledgment packets [44]. For the same reason, in the estimation of the packet size, we measured the byte length of the TCP payload or, in the case of AoM, of the UDP payload. These choices make our results usable for simulation purposes as an input for TCP state machines and UDP/IP stacks, as in D-ITG [24] and TCPlib [45].

As regards the time resolution of the measurements, the packet timestamping resolution provided by the Libpcap library (which is used by both Tcpdump and Plab), and by the kernel drivers that it links to, is 1 μs. Moreover, because of the wide range of the considered IPTs, as reported in Section 3, we applied a logarithmic transformation to the measured values, $10\log_{10}(\mathrm{IPT}/1\,\mu\mathrm{s})$, whose unit we will refer to as dBμ.
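As a quick illustration of the dBμ scale, the following Python sketch applies the transformation $10\log_{10}(\mathrm{IPT}/1\,\mu\mathrm{s})$ and its inverse. The function names are hypothetical; only the formula comes from the text.

```python
import math

MICROSECOND = 1e-6  # reference interval of the dBμ scale, in seconds


def to_dbmu(ipt_seconds: float) -> float:
    """Inter-packet time in seconds -> dBμ, i.e. 10*log10(IPT / 1 μs)."""
    return 10.0 * math.log10(ipt_seconds / MICROSECOND)


def from_dbmu(value: float) -> float:
    """dBμ -> inter-packet time in seconds."""
    return MICROSECOND * 10.0 ** (value / 10.0)


examples = {ipt: to_dbmu(ipt) for ipt in (1e-6, 1e-2, 1.0)}
```

On this scale 1 μs maps to 0 dBμ, 10 ms to about 40 dBμ and 1 s to about 60 dBμ, so each factor of ten in the IPT adds 10 dBμ.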

5. Experimental results

This section presents some results of our model when it is applied to SMTP, HTTP, AoM, and MSN traffic.

We used the model with N = 4 states for AoM and MSN traffic and N = 5 states for SMTP and HTTP traffic, due to their more complex structure. The choices N = 4 and N = 5 were found to be effective empirically, as they provided a sufficient number of modes to capture traffic behavior for the considered applications. In our experiments, models with a smaller number of states failed to capture the correct behavior, with some modes missing and/or correlation mismatches. Also, we do not explore models with a larger number of states, as increasing N has a twofold negative effect: (i) increased computational complexity, affecting learning, monitoring, and generation; (ii) a higher risk of "overfitting", affecting prediction. It is worth noting that this section aims to show: (i) the effectiveness of HMMs in traffic modeling at packet level and (ii) specific HMMs for the 4 considered traffic typologies.

Fig. 5. Histogram and pdf for IPT (left) and PS (right). [Plots not reproduced; four rows of panels; IPT pdf vs. dBμ (left) and PS pdf vs. bytes (right); curves: data, starting, trained.]

Fig. 6. Auto- and cross-covariance for IPT and PS. [Plots not reproduced; four rows of panels; covariance vs. lag (0–20) between IPT and IPT, between PS and PS, and between IPT and PS; curves: data, starting, trained.]

Table 5. Global statistics: traffic, model, global mean IPT (dBμ), global mean PS (bytes), IPT global st. dev. (dBμ), PS global st. dev. (bytes)

| Traffic | Model | μ(t) (dBμ) | μ(p) (bytes) | σ(t) (dBμ) | σ(p) (bytes) |
|---|---|---|---|---|---|
| SMTP | Data | 40 | 710 | 19 | 619 |
| SMTP | Starting | 48 | 731 | 20 | 347 |
| SMTP | Trained | 43 | 688 | 18 | 647 |
| HTTP | Data | 49 | 542 | 15 | 324 |
| HTTP | Starting | 49 | 731 | 19 | 347 |
| HTTP | Trained | 56 | 541 | 17 | 348 |
| AoM | Data | 47 | 12 | 8 | 4 |
| AoM | Starting | 42 | 69 | 11 | 30 |
| AoM | Trained | 47 | 12 | 8 | 4 |
| MSN | Data | 56 | 557 | 16 | 570 |
| MSN | Starting | 48 | 731 | 19 | 331 |
| MSN | Trained | 58 | 511 | 15 | 561 |

Table 7. HTTP conditional statistics: state, steady-state probability, conditional mean IPT (dBμ), conditional mean PS (bytes), IPT conditional st. dev. (dBμ), PS conditional st. dev. (bytes), conditional duration

| State | Steady-state prob. | Mean IPT (dBμ) | Mean PS (bytes) | St. dev. IPT (dBμ) | St. dev. PS (bytes) | Duration |
|---|---|---|---|---|---|---|
| s1 | 0.153 | 41 | 314 | 23 | 344 | 3 |
| s2 | 0.356 | 54 | 345 | 14 | 70 | 27 |
| s3 | 0.297 | 64 | 530 | 13 | 104 | 19 |
| s4 | 0.111 | 59 | 880 | 15 | 195 | 11 |
| s5 | 0.083 | 60 | 1387 | 15 | 77 | 3 |

Table 8. AoM conditional statistics: state, steady-state probability, conditional mean IPT (dBμ), conditional mean PS (bytes), IPT conditional st. dev. (dBμ), PS conditional st. dev. (bytes), conditional duration

| State | Steady-state prob. | Mean IPT (dBμ) | Mean PS (bytes) | St. dev. IPT (dBμ) | St. dev. PS (bytes) | Duration |
|---|---|---|---|---|---|---|
| s1 | 0.881 | 49 | 12 | 3 | 2 | 8 |
| s2 | 0.100 | 27 | 15 | 5 | 7 | 1 |
| s3 | 0.013 | 48 | 31 | 3 | 9 | 1 |
| s4 | 0.006 | 52 | 25 | 1 | 8 | 1 |

Table 9. MSN conditional statistics: state, steady-state probability, conditional mean IPT (dBμ), conditional mean PS (bytes), IPT conditional st. dev. (dBμ), PS conditional st. dev. (bytes), conditional duration

| State | Steady-state prob. | Mean IPT (dBμ) | Mean PS (bytes) | St. dev. IPT (dBμ) | St. dev. PS (bytes) | Duration |
|---|---|---|---|---|---|---|
| s1 | 0.578 | 66 | 104 | 8 | 73 | 7 |
| s2 | 0.092 | 44 | 437 | 17 | 299 | 1 |
| s3 | 0.060 | 61 | 655 | 5 | 193 | 1 |
| s4 | 0.270 | 45 | 1378 | 14 | 62 | 4 |

For all traffic typologies the learning algorithm converged in terms of likelihood after a few iterations. Fig. 4 shows the log-likelihood normalized with respect to the average length of the sessions. The normalization is needed because $\log L(Y;\Lambda)$ decreases with the length of the sequence Y.

In the following, we call "starting" model the uniformly distributed initialization of the parameters and "trained" model the parameters obtained after 10 iterations of the Baum–Welch algorithm; the training set will simply be referred to as "data".
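To see why the log-likelihood is normalized by the sequence length, the sketch below computes the log-likelihood of an HMM with the standard scaled forward recursion. It is a generic illustration, not the paper's training code: the emission model is abstracted into a precomputed matrix `B` of per-state observation likelihoods, and all numeric values are hypothetical.

```python
import math


def log_likelihood(pi, A, B):
    """Log-likelihood of a sequence under an HMM (scaled forward recursion).

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[t][i] : likelihood of the t-th observation given state i
              (assumed precomputed from whatever emission model is used)
    """
    n = len(pi)
    alpha = [pi[i] * B[0][i] for i in range(n)]
    total = 0.0
    for t in range(len(B)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[t][j]
                     for j in range(n)]
        scale = sum(alpha)               # P(y_t | y_0 .. y_{t-1})
        total += math.log(scale)
        alpha = [a / scale for a in alpha]
    return total


# Hypothetical 2-state model; every observation has the same likelihoods.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
short_seq = [[0.3, 0.1]] * 5
long_seq = [[0.3, 0.1]] * 50

ll_short = log_likelihood(pi, A, short_seq)
ll_long = log_likelihood(pi, A, long_seq)
per_sample_short = ll_short / len(short_seq)
per_sample_long = ll_long / len(long_seq)
```

With per-observation likelihoods below 1, the raw log-likelihood of the longer sequence is far more negative, while the per-sample values stay on a comparable scale. Here each sequence is divided by its own length, whereas Fig. 4 normalizes by the average session length.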

Table 3 compares the discrepancy of both the starting and the trained models with respect to the data, via the parameter λ², commonly used to evaluate the fit of distributions [48]. Table 4 compares the amplitude and decay parameters (K and m) of the covariances of the data, the starting model, and the trained model when an exponential fitting (EF), with a minimum mean square error criterion, is applied; that is, covariances are described in the form $K\exp(-m\cdot\mathrm{lag})$. More specifically, the amplitude parameter is considered fixed (K = 1) for the auto-covariances, whereas it is a free parameter for the cross-covariances. Both tables show that the model, even though it works in a joint fashion, is able to fit both marginal distributions and covariances with good accuracy. Figs. 5 and 6, analyzed in the following, confirm graphically the results of Tables 3 and 4. Finally, Table 5 compares the global means and standard deviations of the starting and trained models with those of the data: the trained models exhibit good results in terms of mean and standard deviation for each considered traffic type.
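The exponential fitting of a covariance sequence can be sketched as follows. The source only states that a minimum mean square error criterion is used; the grid search over the decay parameter, the search range, and the function name are assumptions chosen for simplicity, so this is not necessarily the procedure used to produce Table 4.

```python
import math


def fit_exponential(cov, fix_amplitude=False, m_max=60.0, step=0.01):
    """Least-squares fit of cov[lag] ~ K * exp(-m * lag), lag = 0, 1, 2, ...

    For every candidate m on a grid, the best K has a closed form
    (or is fixed to 1, as done for auto-covariances). Returns (K, m).
    """
    best = None
    for i in range(1, int(round(m_max / step)) + 1):
        m = i * step
        basis = [math.exp(-m * lag) for lag in range(len(cov))]
        if fix_amplitude:
            k = 1.0
        else:
            k = (sum(c * b for c, b in zip(cov, basis))
                 / sum(b * b for b in basis))
        sse = sum((c - k * b) ** 2 for c, b in zip(cov, basis))
        if best is None or sse < best[0]:
            best = (sse, k, m)
    return best[1], best[2]


# Synthetic check: sequences generated with known parameters.
auto_cov = [math.exp(-0.5 * lag) for lag in range(21)]
cross_cov = [-0.3 * math.exp(-0.2 * lag) for lag in range(21)]

k_auto, m_auto = fit_exponential(auto_cov, fix_amplitude=True)
k_cross, m_cross = fit_exponential(cross_cov)
```

On these synthetic sequences the fit recovers the generating parameters (K = 1, m = 0.5 and K = −0.3, m = 0.2) up to the grid resolution.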

Table 6. SMTP conditional statistics: state, steady-state probability, conditional mean IPT (dBμ), conditional mean PS (bytes), IPT conditional st. dev. (dBμ), PS conditional st. dev. (bytes), conditional duration

| State | Steady-state prob. | Mean IPT (dBμ) | Mean PS (bytes) | St. dev. IPT (dBμ) | St. dev. PS (bytes) | Duration |
|---|---|---|---|---|---|---|
| s1 | 0.447 | 57 | 48 | 8 | 134 | 10 |
| s2 | 0.109 | 18 | 542 | 14 | 269 | 1 |
| s3 | 0.176 | 31 | 1344 | 10 | 140 | 4 |
| s4 | 0.057 | 51 | 1107 | 7 | 204 | 1 |
| s5 | 0.211 | 35 | 1458 | 13 | 7 | 2 |

The starting values were set via Eqs. (5) and (6) in Section 3, and are useful to show how, in a few iterations, the model converges to values close to the empirical data. Such global statistics are obtained as weighted averages of the conditional statistics of each state, Eq. (3). We comment on the behavior of the single states in the following subsection, where data, starting, and trained models are compared in terms of global pdfs (see Fig. 5) and of auto- and cross-covariances (see Fig. 6).
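The relation between conditional and global statistics can be checked numerically. The sketch below assumes that Eq. (3), which is not reproduced in this section, corresponds to the usual mixture formulas (steady-state probabilities as weights; global variance equal to the weighted second moment minus the squared global mean). The input values are the SMTP conditional statistics of Table 6.

```python
import math


def global_stats(weights, means, stds):
    """Global mean and st. dev. of a mixture of per-state statistics.

    Assumed form of the weighted average: standard mixture moments.
    """
    total = sum(weights)
    w = [x / total for x in weights]     # guard against rounding in the table
    mean = sum(wi * mi for wi, mi in zip(w, means))
    second = sum(wi * (si ** 2 + mi ** 2) for wi, mi, si in zip(w, means, stds))
    return mean, math.sqrt(second - mean ** 2)


# SMTP trained model, Table 6
prob = [0.447, 0.109, 0.176, 0.057, 0.211]
ipt_mean, ipt_std = [57, 18, 31, 51, 35], [8, 14, 10, 7, 13]
ps_mean, ps_std = [48, 542, 1344, 1107, 1458], [134, 269, 140, 204, 7]

g_ipt_mean, g_ipt_std = global_stats(prob, ipt_mean, ipt_std)
g_ps_mean, g_ps_std = global_stats(prob, ps_mean, ps_std)
```

Under this assumption the results are about 43.2 dBμ and 688 bytes for the global means and about 17.4 dBμ and 647 bytes for the global standard deviations. These agree with the "Trained" SMTP row of Table 5 (43, 688, 18, 647), with the small IPT st. dev. difference presumably due to the rounding of the tabulated inputs.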

Further considerations on the single modes discovered by the model are made by looking at the conditional statistics (see Tables 6–9). In addition, we made the trained models generate output variables, to compare the synthetically generated IPT–PS pairs with those from real data (see Fig. 7). Finally, the trained models are investigated in terms of prediction capabilities (see Figs. 8–11).

5.1. Model construction and validation

5.1.1. SMTP traffic

Fig. 5 shows how SMTP traffic presents two main modes in the IPT distribution, separated by three orders of magnitude, and two main modes in the PS distribution, essentially made of very small packets and full-payload packets, respectively. More specifically, it is apparent from Table 6 that s1 is responsible for transmission with large IPT and small PS, while s3 and s5 are responsible for transmission with small IPT and large PS. These three dominant modes account for approximately 85% (from the steady-state distributions) of the behavior of SMTP, which mainly alternates between two phases: s1 with low bitrate, and s3 and s5 with high bitrate. States s2 and s4 can be viewed as transient states used to switch between these two modes.

Fig. 7. Training (left) and Synthetic (right) traces for IPT and PS. [Plots not reproduced; panels: IPT sequence, 10 log10(IPT/1 μs), and PS sequence, PS (bytes), each vs. sequence number.]

Fig. 6 shows how the model captures temporal dynamics: both IPT and PS have significant memory, as indicated by the non-negligible auto-covariances. More interesting is the negative cross-covariance, confirming that IPT and PS are usually not both large or both small at the same time.

Fig. 8. Monitoring and prediction for SMTP: IPT (top) and PS (bottom). [Plots not reproduced; 10 log10(IPT/1 μs) and PS (bytes) vs. sequence number; markers: data, monitoring, prediction.]

In Fig. 7 we show the capability of the model to jointly reproduce time series of both IPT and PS by synthetic generation of traffic patterns; the model may thus be integrated into a traffic generation or simulation framework.
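Synthetic generation from a trained HMM can be sketched as a walk over the hidden states that emits one IPT–PS pair per step. The code below is only a toy illustration: the two states reuse the conditional means and standard deviations of s1 and s3 in Table 6, but the transition matrix, the initial distribution, and the independent Gaussian emissions are hypothetical stand-ins, since the actual emission model and trained parameters are not given in this section.

```python
import random


def generate(n, pi, A, means, stds, seed=0):
    """Generate n (state, IPT in dBμ, PS in bytes) samples from a toy HMM.

    Emissions are drawn from independent Gaussians (an assumption made
    for illustration only); PS is rounded and clamped to at least 1 byte.
    """
    rng = random.Random(seed)
    states = list(range(len(pi)))
    s = rng.choices(states, weights=pi)[0]
    out = []
    for _ in range(n):
        ipt = rng.gauss(means[s][0], stds[s][0])
        ps = max(1, round(rng.gauss(means[s][1], stds[s][1])))
        out.append((s, ipt, ps))
        s = rng.choices(states, weights=A[s])[0]   # move to the next state
    return out


pi = [0.5, 0.5]                        # hypothetical
A = [[0.9, 0.1], [0.25, 0.75]]         # hypothetical
means = [(57, 48), (31, 1344)]         # (IPT dBμ, PS bytes), from Table 6
stds = [(8, 134), (10, 140)]

trace = generate(1000, pi, A, means, stds, seed=42)
```

Because both variables are emitted from the same hidden state, the generated IPT and PS values are coupled, which is the point of the joint modeling.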

5.1.2. HTTP traffic

In Fig. 5 the fitting of marginal distributions is shown. Note that IPTs are spread over eight orders of magnitude, but the majority of them are concentrated approximately between 10 ms and 1 s. These values are compatible with RTTs found in Wide Area Networks. Indeed, HTTP clients often perform many subsequent requests to the same server. In the case of the Web, for example, the first request for an HTML document is typically followed by further requests for the embedded objects. If such objects are small enough to be sent within one or a few packets (as is often the case [46]), the requests are sent at intervals close to the RTT from the client to the server. As for the PS distribution, the HMM captures the characteristics of real data but seems to slightly under-estimate small values in favor of larger ones.

Unlike the previous traffic typology, HTTP does not present any dominant state, and alternates among various behaviors. It also appears less regular because states with all combinations of small/large IPT and PS are present, as confirmed by the small cross-covariance. The correlation structures of HTTP, shown in Fig. 6, are very interesting. They present correlation at several lags with an oscillating behavior. The envelope decays faster for cross- than for auto-covariances, and it is accurately captured by the trained model. It is worth noting that the model trained with single sessions (not reported here) captured the oscillating behavior too, whereas for the whole traffic, where different kinds of sessions are considered, this is quite hard and was not possible. Fig. 7 shows the results of synthetically generated HTTP traffic patterns.

5.1.3. AoM traffic

N = 4 states proved sufficient to capture the behavior of the data. Fig. 5 shows how PS are usually smaller than 20 bytes and concentrated around a few close values, while the IPT distribution, on the contrary, presents a bi-modal behavior, with the 2 modes separated by 2–3 orders of magnitude. We found similar behaviors in other real-time strategy games, where stations typically send periodic update packets plus additional update packets when a user action must be immediately transmitted [47]. This evident link between user actions and packet-level traffic is probably one of the causes of the greater randomness that was found in AoM traffic, when compared to the other sources considered. Table 8 shows more specifically how both IPT modes are associated with the same range of PS, although the mode with lower IPTs (s2) is more spread in terms of PS than the mode with larger IPTs (s1). In addition, looking at steady-state probabilities and conditional durations in Table 8, we can affirm that s1 and s2 capture the 2 typical situations of real-time strategy games (periodic updates and user actions), while s3 and s4 are transient states introduced by the model. Fig. 5 confirms the previous analysis.


Fig. 6 shows the results obtained for auto- and cross-covariance. We found that all three covariances decay rapidly, an aspect that is well captured by the trained model. Also note (see the cross-covariance at lag 0) the presence of a small dependence between IPT and PS of the single packet, well captured by the trained model. This reflects the dominance of the large IPT–small PS mode.

Fig. 7 shows how the model is able to accurately reproduce the AoM traffic pattern.

5.1.4. MSN traffic

MSN presents some characteristics similar to SMTP: two main modes for both IPT and PS, as shown in Fig. 5, captured by states s1 and s4, as shown in Table 9. The former accounts for low-bitrate behavior, with large IPT (≈60 dBμ) and small PS (≈0.1 KB), the latter for high-bitrate behavior, with small IPT (≈40 dBμ) and large PS (≈1.3 KB). Again, the negative cross-covariance in Fig. 6 confirms this kind of coupling between IPT and PS, whereas the auto-covariance reveals the presence of memory in the IPT and PS characteristics. Fig. 6 shows an exponentially decaying trend for the data that is well captured by the model, denoting the presence of a significant dependence between the IPT–PS pairs of successive packets.

In Fig. 7, it can be seen that the model is able to accurately reproduce the MSN Messenger traffic patterns, replicating the two main IPT and PS modes. For this traffic category too, these results look promising as regards the suitability of the model for synthetic traffic generation [7] and simulation.

Fig. 9. Monitoring and prediction for HTTP: IPT (top) and PS (bottom). [Plots not reproduced; 10 log10(IPT/1 μs) and PS (bytes) vs. sequence number; markers: data, monitoring, prediction.]

5.2. Prediction: some preliminary results

After presenting the performance of the proposed HMM-based model, here we show some preliminary results on its prediction capability. Indeed, the correlation structure of the various traffic typologies suggests using the trained model for prediction purposes on a sample trace. The main objective is to show the capability of the model to provide the expected short-term future behavior of the traffic with sufficient accuracy. Such a characteristic is particularly appealing when seen as part of a more complex network-sensing and adaptive-management system. In order to give an idea of what kind of information the proposed approach could provide to higher-level applications, we performed (off-line) the following basic steps on the traces previously described:

• Monitoring – W samples (in terms of IPT–PS pairs) are observed iteratively to obtain an estimate of the current state via the Viterbi algorithm [25];

• Prediction – on the basis of the current state estimate and of the trained model parameters, the traffic is assumed to remain in that state (thus keeping the conditional mean values for IPT and PS) for a number of samples proportional to the conditional duration (Eq. (4)).
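The two steps above, monitoring and prediction, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure actually used for the figures: emissions are modeled as independent Gaussians on (IPT, PS), the transition matrix and initial distribution are hypothetical, and the proportionality constant between the prediction horizon and the conditional duration is taken as 1. The two states reuse the conditional statistics of s1 and s3 in Table 6.

```python
import math


def gaussian_logpdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def viterbi_last_state(obs, pi, A, means, stds):
    """Final state of the most likely state path for the window `obs`.

    obs: list of (IPT dBμ, PS bytes) pairs; Gaussian emissions are an
    assumption made for illustration only.
    """
    n = len(pi)

    def log_b(i, y):
        return sum(gaussian_logpdf(v, m, s) for v, m, s in zip(y, means[i], stds[i]))

    delta = [math.log(pi[i]) + log_b(i, obs[0]) for i in range(n)]
    for y in obs[1:]:
        delta = [max(delta[i] + math.log(A[i][j]) for i in range(n)) + log_b(j, y)
                 for j in range(n)]
    return max(range(n), key=lambda i: delta[i])


def predict(state, means, durations, scale=1.0):
    """Hold the conditional means for a run proportional to the duration."""
    return [means[state]] * max(1, round(scale * durations[state]))


pi = [0.5, 0.5]                     # hypothetical
A = [[0.9, 0.1], [0.25, 0.75]]      # hypothetical
means = [(57, 48), (31, 1344)]      # s1 and s3 of Table 6
stds = [(8, 134), (10, 140)]
durations = [10, 4]

window = [(55, 60), (60, 40), (58, 52)]     # W = 3 monitored samples
state = viterbi_last_state(window, pi, A, means, stds)
forecast = predict(state, means, durations)
```

For this window, whose samples lie close to the first state's means, the estimated state is the first one, and the forecast repeats (57 dBμ, 48 bytes) for 10 samples.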

Fig. 10. Monitoring and prediction for AoM: IPT (top) and PS (bottom). [Plots not reproduced; 10 log10(IPT/1 μs) and PS (bytes) vs. sequence number; markers: data, monitoring, prediction.]

Fig. 11. Monitoring and prediction for MSN: IPT (top) and PS (bottom). [Plots not reproduced; 10 log10(IPT/1 μs) and PS (bytes) vs. sequence number; markers: data, monitoring, prediction.]


Figs. 8–11 show the results for SMTP, HTTP, AoM, and MSN traffic, respectively. We considered an observation window of W = 3 samples to obtain the current-state estimate. Also, we assume that the traffic holds the conditional mean values (see Eq. (2)) for a number of samples proportional to the conditional duration of the state (see Eq. (4)). The values of the conditional statistics used for the results shown are those reported in Tables 6–9. Asterisks, circles, and diamonds represent data, monitored samples, and predicted samples, respectively. Also, for better precision, a confidence interval proportional to the conditional standard deviation (see Eq. (2)) is reported with a green segment.

Observing the frequent superposition of asterisks and diamonds, it can be noticed how the model captures and predicts the traffic dynamics for all the considered traffic typologies. Such a result is quite interesting, especially when looking at the source behavior in the states with large duration, i.e., s1 for SMTP (whose conditional mean values are 57 dBμ and 48 bytes), s2 for HTTP (54 dBμ and 345 bytes), s1 for AoM (49 dBμ and 12 bytes), and s1 for MSN (66 dBμ and 104 bytes).

Also, the very small ratio between diamonds and circles gives an idea of the small amount of sensing that is needed to infer quite reliable information about what we should expect in the short-term future behavior of the traffic. Note again that, due to the joint modeling we proposed, estimation of the state variable allows inferring knowledge about the expected behavior of both IPT and PS simultaneously.

Table 10 shows the relative mean square error (RMSE) and the percentage of monitoring data (MP) with respect to three different sizes of the monitoring window. It is apparent how, for all the considered traffic typologies, the RMSE is not significantly affected by the window size, both for IPT and PS, while the MP obviously increases with it. IPT prediction performs very well for all the considered traffic typologies, especially for AoM and MSN, while PS prediction is not very accurate for SMTP and MSN traffic. Indeed, the trained models for such applications present some discrepancy with the training data in the PS pdf as well as in the PS auto-covariance.
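The two figures of merit can be computed along the following lines. This is a hedged sketch: the exact normalization of the relative mean square error is not restated in this section, so the ratio of squared prediction error to squared signal energy used below is an assumption, and the function names are hypothetical.

```python
import numpy as np

def relative_mse(actual, predicted):
    """ASSUMED definition: sum of squared errors over sum of squared actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sum((actual - predicted) ** 2) / np.sum(actual ** 2)

def monitoring_percentage(n_monitored, n_total):
    """Share of trace samples spent in monitoring windows, in percent."""
    return 100.0 * n_monitored / n_total
```

Under this definition a perfect prediction gives an RMSE of 0, and a larger window W raises the MP whenever the prediction horizon stays the same.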

Table 10. RMSE and MP for prediction with different sizes of the monitoring window

Traffic   Window size   IPT RMSE   PS RMSE   MP (%)
SMTP      W = 3         0.12       0.44      37
SMTP      W = 5         0.16       0.58      48
SMTP      W = 7         0.15       0.52      56
HTTP      W = 3         0.13       0.30      24
HTTP      W = 5         0.12       0.27      33
HTTP      W = 7         0.12       0.27      40
AoM       W = 3         0.027      0.13      27
AoM       W = 5         0.027      0.13      38
AoM       W = 7         0.027      0.14      46
MSN       W = 3         0.024      0.43      30
MSN       W = 5         0.020      0.42      41
MSN       W = 7         0.024      0.41      50

It is worth noting that the experiment we have performed is just a simple, non-optimal example of using the model. Another possibility could be building, from the state estimated via the Viterbi algorithm, an m-best list of the most likely n-transitions. The real value of the model, when used jointly with a monitoring algorithm, is the probabilistic representation, in terms of the state transition matrix A, of the possible evolution of the traffic.
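As an illustration of how the matrix A describes the possible evolution of the traffic, the sketch below ranks the most likely sequences of n transitions from the estimated current state and computes the state distribution after n steps. It is a brute-force enumeration that is only practical for the small state spaces and short horizons considered here; the function names are hypothetical and the approach is one possible reading of the m-best list mentioned above, not a method described in the paper.

```python
from itertools import product
import numpy as np

def m_best_transition_paths(A, state, n, m):
    """Return the m most probable length-n state sequences starting from `state`."""
    N = A.shape[0]
    scored = []
    for path in product(range(N), repeat=n):
        prob, prev = 1.0, state
        for nxt in path:
            prob *= A[prev, nxt]   # chain the one-step transition probabilities
            prev = nxt
        scored.append((prob, path))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:m]

def n_step_distribution(A, state, n):
    """State distribution after n transitions: row `state` of A raised to the n-th power."""
    return np.linalg.matrix_power(A, n)[state]
```

Each entry of the returned list pairs a probability with a candidate future state sequence, which a higher-level application could map to conditional mean IPT and PS values.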

6. Discussion and conclusion

In this work, we proposed an HMM-based model of traffic sources at packet level. It jointly models the IPT and PS of Internet application traffic. It has been shown how the proposed HMM approach is able to capture the behavior of marginal distributions, mutual dependencies, and temporal structures of the traffic generated by a heterogeneous set of sources. The capability to accurately replicate and predict traffic makes the proposed approach quite promising.

Results obtained from four kinds of traffic sources, related to totally different Internet applications, have been analyzed. Empirical data clearly show that this heterogeneity is also reflected in the traffic that they generate at packet level. Indeed, they differ in the behavior of the marginal distributions of IPT and PS, but also in their correlation structure. As for the last point, it is worth noting that we found larger autocorrelations with a slower decay for SMTP, HTTP, and MSN traffic when compared to AoM. Such behavior can be partially explained by the influence of TCP end-to-end flow control, which introduces dependencies between IPTs. Indeed, while SMTP, HTTP, and MSN run over TCP, AoM traffic is carried by UDP packets. Furthermore, the rigid application-level protocol rules of SMTP and HTTP induce more structure into their traffic patterns. On the other hand, as regards AoM, the interaction of the gaming user introduces more randomness into the traffic. Again, we underline that the paper aims at modeling the average behavior of a single session. The study of the superposition of several sessions generated by multiple sources may indeed lead to an aggregate traffic showing long-range dependence and self-similarity characteristics, but such an investigation falls beyond the scope of the present work.

In all cases, the level of computational and structural complexity associated with the model is quite low. Training the models for SMTP, HTTP, AoM, and MSN required few iterations, and though SMTP and HTTP traffic present a much more complex structure, they only required one more state (with respect to AoM and MSN) for effective modeling. Thus, the flexibility of an HMM approach, even when applied to low-level traffic modeling, appears quite encouraging.

In conclusion, it is worth highlighting that the most exciting result of the proposed model is, in our opinion, its capability to fit at the same time both IPT and PS statistics and dynamics, even if not with extreme accuracy, of four different traffic sources with a relatively small set of parameters. Benefits and possible applications of such a modeling approach include: (i) a better understanding of source traffic dynamics (also taking into account temporal structures) related to different Internet applications; (ii) exploitation of the short-term prediction capabilities of the model; (iii) usage in traffic simulation and generation frameworks [7]; (iv) application for traffic classification purposes. Finally, we foresee the integration of the presented model within a larger analytical framework, based on HMMs, which includes modeling of heterogeneous packet channels [18,19].

References

[1] A. Dainotti, A. Pescapé, P. Salvo Rossi, G. Iannello, F. Palmieri, G. Ventre, An HMM approach to internet traffic modeling, in: Proc. of IEEE GLOBECOM, November 2006, pp. 1–6.

[2] A. Dainotti, A. Pescapé, G. Ventre, A packet-level characterization of network traffic, in: Proc. of IEEE CAMAD, June 2006, pp. 38–45.

[3] B.A. Mah, An empirical model of HTTP network traffic, in: Proc. of IEEE INFOCOM, vol. 2, April 1997, pp. 592–600.

[4] J. Cao, W.S. Cleveland, Y. Gao, K. Jeffay, F.D. Smith, M.C. Weigle, Stochastic models for generating synthetic HTTP source traffic, in: Proc. of IEEE INFOCOM, vol. 3, 2004, pp. 1546–1557.

[5] R. Ohri, E. Chlebus, Measurement based e-mail traffic characterization, in: Proc. of Int. Symp. on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), Philadelphia, PA, July 2005.

[6] W. Feng, F. Chang, W. Feng, J. Walpole, A traffic characterization of popular on-line games, IEEE/ACM Transactions on Networking 13 (3) (2005) 488–500.

[7] <http://www.grid.unina.it/software/ITG>, September 2007.

[8] J. Farber, Network Game Traffic Modelling, NetGames2002, Braunschweig, Germany, 2002, pp. 53–57.

[9] R. Bangun, E. Dutkiewicz, Modelling multi-player games traffic, in: Proc. of International Conference on Information Technology: Coding and Computing, 27–29 March 2000, pp. 228–233.

[10] M. Borella, Source models of network game traffic, Computer Communications 23 (4) (2000) 403–410.

[11] T. Lang, G.J. Armitage, P. Branch, H. Choo, A synthetic traffic model for half-life, in: Proc. of Australian Telecommunication Networks and Application Conference 2003 (ATNAC 2003), Melbourne, Australia, December 2003.

[12] P. Salvador, A. Pacheco, R. Valadas, Modeling IP traffic: joint characterization of packet arrivals and packet sizes using BMAPs, Elsevier Computer Networks 44 (October) (2004) 335–352.

[13] A. Klemm, C. Lindemann, M. Lohmann, Modeling IP traffic using the batch Markovian arrival process, Performance Evaluation Journal 54 (2) (2003) 149–173.

[14] J. Gao, I. Rubin, Multifractal analysis and modeling of long-range-dependent traffic, in: Proc. of IEEE ICC, June 1999, pp. 382–386.

[15] A.T. Andersen, B.F. Nielsen, A Markovian approach for modeling packet traffic with long-range dependence, IEEE Journal on Selected Areas in Communications 16 (5) (1998) 719–732.

[16] K. Salamatian, S. Vaton, Hidden Markov Modeling for network communication channels, in: Proc. of ACM SIGMETRICS 2001, vol. 29, 2001, pp. 92–101.

[17] W. Wei, B. Wang, D. Towsley, Continuous-time Hidden Markov Models for network performance evaluation, Performance Evaluation 49 (1–4) (2002) 129–146.

[18] P. Salvo Rossi, G. Romano, F. Palmieri, G. Iannello, Joint end-to-end loss-delay Hidden Markov Model for periodic UDP traffic over the Internet, IEEE Transactions on Signal Processing 54 (2) (2006) 530–541.

[19] G. Iannello, F. Palmieri, A. Pescapé, P. Salvo Rossi, End-to-end packet-channel Bayesian model applied to heterogeneous wireless networks, in: Proc. of IEEE GLOBECOM, November 2005, pp. 484–489.

[20] L. Muscariello, M. Mellia, M. Meo, M.A. Marsan, R. Lo Cigno, Markov models of internet traffic and a new hierarchical MMPP model, Computer Communications Journal 28 (16) (2005) 1835–1851.

[21] O. Rose, Simple and efficient models for variable bit rate MPEG video traffic, Performance Evaluation Journal 30 (1) (1997) 69–85.

[22] E. Costamagna, L. Favalli, F. Tarantola, Modeling and analysis of aggregate and single stream internet traffic, in: Proc. of IEEE GLOBECOM, December 2003, pp. 3830–3834.

[23] C. Wright, F. Monrose, G. Masson, HMM profiles for network traffic classification, in: Proc. of VizSEC/DMSEC, October 2004, pp. 9–15.

[24] <http://www.grid.unina.it/Traffic/>, September 2007.

[25] L.R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–285.

[26] J.A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, University of Berkeley, CA, Technical Report ICSI-TR-97-021, 1998.

[27] L.A. Liporace, Maximum likelihood estimation for multivariate observations of Markov sources, IEEE Transactions on Information Theory IT-28 (5) (1982) 729–734.

[28] B.H. Juang, S.E. Levinson, M.M. Sondhi, Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory IT-32 (2) (1986) 307–309.

[29] E. Çinlar, Introduction to Stochastic Processes, Prentice Hall, 1975.

[30] <http://nile.wpi.edu/downloads>, September 2007.

[31] R.B. Jennings, E.M. Nahum, D.P. Olshefski, D. Saha, Zon-Yin Shae, C. Waters, A study of Internet instant messaging and chat protocols, IEEE Network 20 (4) (2006) 16–21.

[32] X. Zhen, G. Lei, J. Tracey, Understanding instant messaging traffic characteristics, in: Proc. of the 27th International Conference on Distributed Computing Systems (IEEE ICDCS 2007), Toronto, Canada, 25–29 June 2007.

[33] S. McCreary, K. Claffy, Trends in wide area IP traffic patterns – a view from Ames Internet Exchange, in: Proc. of ITC Specialist Seminar of Measurement and Modeling of IP Traffic, September 2000, pp. 1–11.

[34] <http://www.microsoft.com/games/ageofmythology/>, September 2007.

[35] <http://www.comscore.com/>, September 2007.

[36] <http://join.msn.com/messenger/overview>, September 2007.

[37] <http://www.microsoft.com/technet/prodtechnol/isa/2000/maintain/isaimsec.mspx>, September 2007.

[38] <http://www.hypothetic.org/docs/msn/general/overview.php>, September 2007.

[39] M. Claypool, The effect of latency on user performance in real-time strategy games, Elsevier Computer Networks 49 (1) (2005) 52–70.

[40] A. Dainotti, A. Botta, A. Pescapé, G. Ventre, Searching for invariants in network games traffic, in: Proc. of ACM Co-Next 2006 Student Workshop, Lisboa, Portugal, December 2006.

[41] S. McCanne, V. Jacobson, The BSD packet filter: a new architecture for user level packet capture, in: Proc. of Winter 1993 USENIX, January 1993, pp. 259–269.

[42] T. Karagiannis, A. Broido, M. Faloutsos, K. Claffy, Transport layer identification of P2P traffic, in: Proc. of ACM SIGCOMM IMC, October 2004, pp. 121–134.

[43] <http://www.wide.ad.jp/wg/mawi/>, September 2007.

[44] R. Caceres, P. Danzig, S. Jamin, D. Mitzel, Characteristics of wide-area TCP/IP conversations, ACM SIGCOMM Computer Communication Review 21 (4) (1991) 101–112.

[45] P. Danzig, S. Jamin, R. Caceres, D. Mitzel, D. Estrin, An empirical workload model for driving wide-area TCP/IP network simulations, Journal of Internetworking: Research and Experience 3 (1) (1992) 1–26.

[46] F.D. Smith, F.H. Campos, K. Jeffay, D. Ott, What TCP/IP protocol headers can tell us about the Web, in: Proc. of ACM SIGMETRICS, June 2001, pp. 245–256.

[47] A. Dainotti, A. Pescapé, G. Ventre, A packet-level model of Starcraft traffic, in: Proc. of IEEE Hot-P2P, July 2005, pp. 33–42.

[48] S.P. Pederson, M.E. Johnson, Estimating model discrepancy, Technometrics 32 (3) (1990) 305–314.

Alberto Dainotti is a Ph.D. student in Computer Engineering and Systems at the Computer Science Department of the University of Napoli "Federico II", Italy, where he received the M.S. Laurea Degree in Computer Engineering in 2004. His research interests fall in the areas of network measurements, traffic analysis, and network security.


Antonio Pescapé is Assistant Professor at the Department of Computer Engineering and Systems of the University of Napoli Federico II. He received the M.S. Laurea Degree in Computer Engineering and the Ph.D. in Computer Engineering and Systems at the University of Napoli Federico II. His research interests are in the networking field, with a focus on models and algorithms for Internet Traffic, Network Measurement and Management of heterogeneous IP networks, and Network Security. He has co-authored a large number of journal and conference publications. He is an IEEE member, has served and serves on several conference technical program committees (IEEE Globecom, IEEE ICC, IEEE WCNC, IEEE HPSR, etc.), and has served as Guest Editor of the Special Issue of Computer Networks on "Traffic classification and its applications to modern networks".

Pierluigi Salvo Rossi was born in Naples, Italy, on April 26, 1977. He received the "Laurea" degree in Telecommunications Engineering (summa cum laude) in January 2002 and the Ph.D. in Computer Science in January 2005, both from the University of Naples "Federico II", Naples, Italy. In 2002, he worked as a Research Engineer at the CIRASS (Interdepartmental Research Center for Signal Analysis and Synthesis), University of Naples "Federico II", Naples, Italy. In 2003, he worked as a Research Engineer at the Department of Information Engineering, Second University of Naples, Aversa (CE), Italy. In 2004, he was a Visiting Research Engineer at the CSPL (Communications and Signal Processing Laboratory), Electrical and Computer Engineering Department, Drexel University, Philadelphia, PA, US. In 2005, he worked as a Postdoc Research Engineer at the ITeM (Multimedia Information and Telematic National Laboratory), CINI (Italian University Consortium for Computer Science and Engineering), Naples, Italy. In 2006, he worked as a Postdoc Research Engineer at the CRdC–ICT (Regional Institute for Research on Information and Communication Technology), Benevento, Italy. He is currently a Postdoc Research Engineer at the Department of Electronics and Telecommunications, Norwegian University of Science and Technology, Trondheim, Norway. Since 2004, he has been Adjunct Professor at the Second University of Naples, Aversa (CE), Italy. His research interests fall within the areas of speech processing, communication system modeling, and wireless communications.

Francesco Palmieri received his Laurea in Ingegneria Elettronica cum laude from Università degli Studi di Napoli Federico II, Italy, in 1980. In 1983, he was awarded a Fulbright scholarship to conduct graduate studies at the University of Delaware, Newark, where he received an M.S. degree in applied sciences and a Ph.D. in electrical engineering in 1985 and 1987, respectively. In 1981, he served as a 2nd Lieutenant in the Italian Army in fulfillment of draft duties. In 1982 and 1983, he was with the ITT firms FACE SUD Selettronica in Salerno (currently Alcatel), Italy, and Bell Telephone Manufacturing Company in Antwerpen, Belgium, as a designer of digital telephone systems. He was appointed Assistant Professor in Electrical and Systems Engineering at the University of Connecticut, Storrs, in 1987, where he was awarded tenure and promotion to associate professor in 1993. In the same year, after a national competition, he was awarded the position of Professore Associato at the Dipartimento di Ingegneria Elettronica e delle Telecomunicazioni at Università degli Studi di Napoli Federico II, Italy, where he remained until October 2000. In February 2000, he was nominated Professore Ordinario di Telecomunicazioni after a national competition and appointed in November 2000 at the Dipartimento di Ingegneria dell'Informazione, Seconda Università di Napoli, Aversa, Italy. His research interests are in the areas of signal processing, communications, information theory, and neural networks.

Giorgio Ventre is Professor of Computer Networks in the Department of Computer Engineering and Systems of the University of Napoli Federico II, where he is leader of the COMICS team. COMICS stands for Computers for Interaction and Communications and is a research initiative in the areas of networking and multimedia communications. After starting ITEM, the first research laboratory of the Italian University Consortium for Informatics (CINI), he is now President and CEO of CRIAI, a research company active in the areas of Information Technologies. As leader of the networking research group at the University of Napoli Federico II, he is principal investigator for several national and international research projects. His research interests are in the area of network protocols and architectures. He has co-authored more than 150 publications and is a member of the IEEE and of the ACM.