
An Information Theoretical Approach to High-Speed Flow Nature Identification

Amir R. Khakpour    Alex X. Liu∗

Abstract—This paper concerns the fundamental problem of identifying the content nature of a flow, namely text, binary, or encrypted, for the first time. We propose Iustitia, a framework for identifying flow nature on the fly. The key observation behind Iustitia is that text flows have the lowest entropy and encrypted flows have the highest entropy, while the entropy of binary flows stands in between. We further extend Iustitia to the finer-grained classification of binary flows, so that we can differentiate the types of binary flows (such as image, video, and executable) and even the file formats (such as JPEG and GIF for images, or MPEG and AVI for videos) carried by binary flows. The basic idea of Iustitia is to classify flows using machine learning techniques, where a feature is the entropy of every certain number of consecutive bytes. Our experimental results show that classification can be done with high speed and high accuracy. On average, Iustitia classifies flows with 88.27% accuracy using a buffer size of 1 KB, with a classification time of less than 10% of the packet inter-arrival time for 91.2% of flows.

Index Terms—Flow Content Analysis, Flow Identification.

I. INTRODUCTION

A. Motivation

In this paper, we aim to identify the content nature of flows at high speed. We define the nature of a flow as text, encrypted, or binary. A flow is text if and only if the entire flow content consists of text in a natural language and is human readable (e.g., the printable ASCII subset, text/plain MIME type). Typical text flow content includes plain-text HTML pages, email, chat, telnet, source code, etc. A flow is encrypted if and only if the entire flow content consists of cipher-text. An encrypted flow mostly consists of traffic over SSL, encrypted files, etc. A flow is binary if and only if it is neither text nor encrypted. A binary flow mainly consists of binary content, such as executable code, multimedia files, etc. We further classify binary flows by their type (e.g., image, video, and executable) and also by their file format (e.g., JPEG and GIF for images, MPEG and AVI for videos). Fast identification of flow nature can enable a variety of applications in networking. Below we give a few examples.

∗ Alex X. Liu is the corresponding author of this paper. Email: [email protected]. Tel: +1 (517) 353-5152. Fax: +1 (517) 432-1061.

Amir R. Khakpour and Alex X. Liu are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824. The preliminary version of this paper, titled “Iustitia: An Information Theoretical Approach to High-speed Flow Nature Identification,” was published in ICDCS 2009. This work is supported in part by the National Science Foundation under Grant Numbers CNS-0716407, CNS-0916044, and CNS-0845513. This work is also supported in part by the National Natural Science Foundation of China (Grant No. 61272546).

E-mail: {khakpour, alexliu}@cse.msu.edu
Manuscript received July 9th, 2010.

Network monitoring and management. High-speed flow nature identification allows ISPs to prioritize their traffic for better quality of service. Consider an ISP serving a bank and a call center: among the traffic to/from the bank network, the ISP may give higher priority to the encrypted flows because they most likely carry banking transactions. Among the traffic to/from the call center, the ISP may give higher priority to the binary flows because they most likely carry voice data. High-speed flow nature identification also allows ISPs to gather various statistics on traffic passing through a network. Currently, up to 40% of Internet traffic belongs to unknown applications [1]. Identifying the nature of flows helps ISPs to better understand the type of traffic passing through their networks and to manage their networks better.

Forensics. High-speed flow nature identification allows ISPs to monitor and efficiently log their traffic for law enforcement purposes when appropriate. For example, identifying binary flows may help copyright enforcement, as such flows may carry copyrighted software and multimedia. As another example, identifying text flows may allow law enforcement to perform complex keyword searching for finding possible human communications on the fly.

Performance improvement for Intrusion Detection/Prevention Systems (IDS/IPS). With the rapid growth of traffic demands and increasing complexity, IDSes/IPSes are becoming performance bottlenecks. When performing deep packet inspection (DPI) on each packet cannot meet the performance demand of users, high-speed flow nature identification allows an IDS/IPS to apply binary-related attack signatures to binary flows and text-related attack signatures to text flows, which is more efficient than applying all signatures to all flows.

Content-based switching, caching, and filtering. The current techniques for content switching, caching, and filtering are either port-based or require deep packet inspection to parse the flows' application headers. The port-based techniques are inaccurate, as Internet services can operate on any port rather than their well-known ports, and different types of content can be delivered over the same protocol. The DPI-based techniques, on the other hand, are computationally expensive and may not satisfy the performance demand of users. Moreover, these techniques are not useful if the content type and format are not stated in the flow's application header. High-speed nature and content format identification can indeed improve these techniques for higher accuracy and better performance. For instance, we can distinguish between encrypted messages, movies, images, executables, and chat messages in a fast and efficient way.


B. Technical Challenges

Although identifying flow nature is a fundamental networking problem, to the best of our knowledge it has not been investigated in prior literature. High-speed flow nature identification is technically challenging. First, flow nature identification cannot rely on well-known port numbers, because any port number can potentially be customized for any application. Thus, flow nature identification has to rely on examining packet payload. Second, examining packet payload at high speed is difficult on high-bandwidth links.

C. Our Information Theoretical Approach

In this paper, we propose Iustitia¹, a fast content-based flow classifier that classifies flows into text, binary, and encrypted. When a flow starts, Iustitia quickly identifies its nature based on a small amount of buffered packet payload and forwards it to its corresponding buffer. The key observation behind Iustitia is that text flows have the lowest entropy and encrypted flows have the highest entropy, while the entropy of binary flows stands in between. The basic idea of Iustitia is to classify flows using machine learning techniques, where a feature is the entropy of every certain number of consecutive bytes.

The architecture of Iustitia is shown in Fig. 1. The classifier has a Classification Table (CT) that contains flow IDs with their corresponding class labels. Upon arrival of a packet, the classifier calculates the hash of the packet header to identify its flow ID. If the flow ID is found in the CT, the flow splitter forwards the packet to its corresponding output queue. Otherwise, the packet is buffered in its corresponding queue. When the buffer of a flow is full, a set of entropy-based classification features is extracted, where a feature is the entropy of every certain number of consecutive bytes. These extracted features are then fed into the classification module. The output of the classification module is a label of text, binary, or encrypted for coarse-grained classification, plus binary sub-labels for finer-grained classification. When the nature of a flow is identified, the flow ID along with its corresponding label is stored in the CT. Flows carried by TCP are removed from the CT when FIN or RST packets are seen. Flows in the CT are also removed when they are inactive for a certain amount of time. The classification module is trained offline using labeled data points extracted from known flows. As Internet traffic evolves over time, it is important to retrain the classifier frequently to keep it up to date and maintain high classification accuracy. A sketch of this per-packet dispatch loop is given below.
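To make the online path of Fig. 1 concrete, the following minimal Python sketch hashes the header 5-tuple to a flow ID, consults the CT, and either forwards the packet or buffers payload until classification. All names here (FlowClassifier, BUFFER_SIZE, the pkt dictionary layout) are illustrative assumptions of ours, not identifiers from the Iustitia implementation, and only the single-byte feature h_1 is computed as a stand-in for the full entropy vector of Section III.

    from collections import Counter
    import hashlib
    import math

    BUFFER_SIZE = 1024  # b: payload bytes buffered before classification

    def entropy_vector(buf):
        # Single-byte entropy h_1 only, as a stand-in for the full
        # feature set developed in Section III.
        n = len(buf)
        return [-sum(c / n * math.log(c / n, 256)
                     for c in Counter(buf).values())]

    class FlowClassifier:
        def __init__(self, model):
            self.ct = {}        # Classification Table: flow ID -> class label
            self.buffers = {}   # per-flow payload buffers for new flows
            self.model = model  # classifier trained offline (CART or SVM)

        def flow_id(self, pkt):
            # Hash the header 5-tuple into a fixed-size flow ID.
            key = (pkt['src'], pkt['dst'], pkt['sport'],
                   pkt['dport'], pkt['proto'])
            return hashlib.md5(repr(key).encode()).hexdigest()

        def on_packet(self, pkt):
            fid = self.flow_id(pkt)
            if fid in self.ct:                 # flow already classified:
                return self.ct[fid]            # forward to its output queue
            buf = self.buffers.setdefault(fid, bytearray())
            buf.extend(pkt['payload'])
            if len(buf) >= BUFFER_SIZE:        # buffer full: classify once
                feats = entropy_vector(bytes(buf[:BUFFER_SIZE]))
                self.ct[fid] = self.model.predict([feats])[0]
                del self.buffers[fid]
                return self.ct[fid]
            return None                        # still buffering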

We further use Iustitia for finer-grained classification of binary and encrypted flows. We first classify binary flows into executables, images, documents, compressed files, video, and audio. Second, we further classify binary flows based on application-specific extensions; for instance, we classify image flows into the categories of JPEG, GIF, PNG, etc. For encrypted flows, we further classify them based on their encryption algorithms, namely AES, PGP, DES3, BF, DES, and RC4.

¹ Iustitia (aka Lady Justice) is the Roman goddess of justice, mostly depicted and sculpted blindfolded, with a sword and measuring balances.

Fig. 1. Iustitia Architecture.

II. RELATED WORK

To the best of our knowledge, there is no prior work on classifying flows as text, binary, and encrypted. Prior work on traffic classification for anomaly detection purposes mainly focuses on the inspection of packet headers and the detection of different network attacks using the statistical features of packet header fields [2], [3], [4]. The anomalies that can be detected and identified with such techniques are limited to attacks that manifest in packet headers, such as port scans, DoS, etc. These techniques are unable to discern attacks that cannot be detected solely from packet headers, such as malware, viruses, etc. Prior work on packet payload classification using machine learning techniques focuses on identifying which application a flow is associated with, or whether a flow contains code. Signature-based approaches include schemes for detecting well-known application protocols using packet size, timing characteristics, and packet payload [5], [6], [7], [8], [9]. TIE [7] and NeTraMark [8] are two examples of traffic classification platforms and benchmarks for classifying traffic into classes such as web, FTP, DNS, games, file sharing, etc. Machine learning and other classification techniques have also been used to infer application protocols from encrypted flows [10], [11], [12]. Yet, these techniques are orthogonal to Iustitia, as they classify flows based on their application protocol rather than their nature; Iustitia instead uses the randomness of a flow's payload to identify the nature of the flow. Thus, the domain of applications of such work is different from that of Iustitia.

Some work has been done on file type identification using file “fingerprints” (also called “fileprints”) [13], [14]. The fingerprint of a set of files of the same type is the histogram of byte frequencies along with their variances. Given all file type fingerprints, an unknown file is identified based on its Mahalanobis distance to each file type fingerprint. These file type identification algorithms cannot solve our problem, because we cannot buffer all packets of a flow to reassemble a file.

Recently, Finamore et al. proposed KISS, a method for identifying application protocol types based on packet payload [15]. KISS is designed specifically for UDP traffic and uses a chi-square-like test to extract statistical features from the payload. Wang et al. tested Iustitia on 8 different real-life datasets in [16] and obtained results similar to those reported in this paper. They further split each byte into two 4-bit halves and used Iustitia to improve classification accuracy on encrypted traffic.


III. FILE CLASSIFICATION USING ENTROPY VECTOR

In this section, we first show how the randomness of a file can be represented by an entropy vector. Second, we show how we can change the entropy vector such that it can be computed on the fly. Third, we present two hypotheses, which will be used as the basis of Iustitia. Fourth, we show to what level of accuracy these hypotheses hold and how file classification can be useful for flow classification.

A. Measuring Randomness Using Entropy

According to information theory, the entropy of a sequence of m elements, where each element is drawn from the set S = {e_1, e_2, . . . , e_n}, is defined as −∑_{i=1}^{n} (m_i/m) log(m_i/m), where m_i is the number of occurrences of element e_i. In this paper, we use normalized entropy, where the base of the logarithm is n and the unit of entropy is element per symbol. Note that we assume 0 log 0 = 0. The entropy of a sequence of elements is a measure of the diversity or randomness of the sequence. The minimum entropy value is 0, attained when all m elements in the sequence are the same, and the maximum entropy value is 1, attained when all n elements occur the same number of times (i.e., m/n) in the sequence. For simplicity, we use “entropy” to mean “normalized entropy” in the rest of the paper.

To compute the entropy of the content of a file F of size m, we first consider F as a Markov information source where each state (i.e., element) is a character. We then represent the entropy of the content of F by an entropy vector H(F) = ⟨ℏ_1, · · · , ℏ_m⟩, where ℏ_i is the entropy of the file modeled as an (i−1)-order Markov information source. Hence, ℏ_i is calculated as follows:

  ℏ_i = −∑_{a_1∈S} P(a_1) ∑_{a_2∈S} P(a_2|a_1) · · · ∑_{a_i∈S} P(a_i|a_1 · · · a_{i−1}) log P(a_i|a_1 · · · a_{i−1})    (1)

where a_i is a state (i.e., a specific character), P(a_1) is the probability of occurrence of a_1 (i.e., P(a_1) = P(a_1 = e_j) = m_j/m), and P(a_2|a_1) is the probability of occurrence of a_2 given the occurrence of a_1 as the preceding character (i.e., P(a_2 = e_j | a_1 = e_l) = m_{jl}/m_l) [17].

Computing ℏ_i using Formula (1) has time and space complexity O(m^i); therefore, the complexity of computing H(F) is O(m^m). Due to this exponential complexity, it is very difficult to measure the randomness of a file (or a flow) using the entropy vector on the fly.

B. Measuring Randomness on-the-fly

We measure the randomness of file content with quadratic time and space complexity (O(m^2)) by overlooking the conditional probabilities of elements and calculating entropy under the assumption that the elements are independent. Hence, given a file F, we can treat each byte as an element and F as a multiset of such elements, and then calculate the entropy of F over the set of all 2^8 possible bytes. In general, we can treat each sequence of k consecutive bytes in file F as an independent element and F as a multiset of such elements, and then calculate the entropy of F over the set of all 2^{8k} possible sequences of k bytes. For example, given F = ⟨a, b, c, d⟩, we can treat every 2 consecutive bytes as an element and F as the multiset {ab, bc, cd}, and then calculate the entropy of F over the set of all 2^{16} possible two-byte sequences. Let f_k denote the set of all 2^{8k} possible sequences of k bytes, and let h_k denote the entropy of a given file F of m bytes, deemed a multiset of sequences of k bytes over the set f_k. We calculate h_k using Formula (2), where m_{ik} is the number of occurrences of f_k's i-th element in F. Note that |F| = ∑_{i=1}^{|f_k|} m_{ik} = m − k + 1 and |f_k| = 2^{8k}.

  h_k = −∑_{i=1}^{|f_k|} (m_{ik}/(m−k+1)) log_{|f_k|}(m_{ik}/(m−k+1))
      = (1/(m−k+1)) [ ∑_i m_{ik} log(m−k+1) − ∑_i m_{ik} log m_{ik} ]
      = log(m−k+1) − (1/(m−k+1)) ∑_i m_{ik} log m_{ik}    (2)

In this paper, we calculate h_i using Formula (2) instead of ℏ_i using Formula (1) for entropy vectors. For classification, each h_i is treated as a feature of F, where we call i the feature width of h_i. Accordingly, each feature of the new entropy vector captures a distinctive property of F and can be computed in linear time and space; a sketch of the computation is given below.

C. Hypotheses and Validation

We make the following two hypotheses, which will serve as the basis of our flow classification scheme: (1) using file entropy vectors, we can classify files into text, encrypted, and binary files; (2) the randomness of the beginning portion of a file represents the randomness of the entire file.

To validate these two hypotheses, we collected a pool of three classes of files: 52,273 binary files (including executables and JPG, GIF, AVI, MPG, PDF, and ZIP files), 24,985 text files (including text documents, manuals, txt files, log files, and HTML files), and 13,656 encrypted files (generated using PGP, AES, DES, etc.). We validated hypothesis 1 using 10 rounds of cross-validation over these files. Each cross-validation round uses 6,000 files drawn equally from each class. For each file, we calculated an entropy vector of size 10, from h_1 to h_10. Fig. 2 shows our dataset in the space of the three features h_1, h_2, h_3. As expected, most data points for text files have the lowest entropies, most data points for encrypted files have the highest entropy values, and most data points for binary files stand in between.

We then apply two classification methods, decision tree (CART) [18] and Support Vector Machines (SVM) [19], to each dataset. The results are shown in Fig. 3(a) and 3(b) and Table I. Classification accuracy in this paper is reported as the F1 score, which is a single-valued summary of precision and recall and enables simple comparison of classifier accuracy across classes. Using CART, we achieve a classification accuracy of about 79.19%. The classification accuracy for each class is very close to the total accuracy (i.e., the accuracy over all classes), and the misclassification rate for each class is about 10%. Using SVM, which is commonly used in machine learning for complex feature spaces, after model selection we achieve the best classification accuracy with a Radial Basis Function (RBF) kernel and parameters γ = 50 and C = 1000. For multi-class classification, we also use DAGSVM [20], which is the fastest among multi-class voting methods [21]. Using this classification model, the total accuracy improves to 86.51%.


Fig. 2. Dataset in the space of h_1, h_2, and h_3 (text, binary, and encrypted classes).

Fig. 3. File classification accuracy for CART and SVM: (a) classification accuracy using CART; (b) classification accuracy using SVM with RBF kernel (C=1000, γ=50). Each panel plots per-class accuracy and pairwise misclassification rates against the cross-validation index.

TABLE I
CLASSIFICATION ACCURACY FOR CART AND SVM

                       Decision Tree (CART)                      |  SVM, RBF kernel (γ=50, C=1000)
                 Precision/Recall   Misclassification            |  Precision/Recall   Misclassification
                                    Text    Binary   Encrypted   |                     Text    Binary   Encrypted
Total accuracy   79.19%                                          |  86.51%
Text file        80.90%/79.94%      -       8.09%    11.97%      |  95.69%/78.65%      -       3.18%    18.16%
Binary file      80.38%/79.34%      8.57%   -        12.05%      |  93.15%/84.11%      3.35%   -        12.56%
Encrypted file   76.52%/78.25%      10.46%  11.29%   -           |  75.92%/96.79%      0.20%   3.01%    -

Fig. 4. The JSD between the element frequency distribution of the first b bytes of each file and that of the entire file, for text, binary, and encrypted files: (a) single-byte, (b) two-byte, and (c) three-byte probability distribution distances. The x-axis is the buffer size from the beginning of the file (KBytes); the y-axis is JSD(P‖Q) (element/symbol).

Comparing CART and SVM, the results show that when we use SVM instead of CART, the classification precision for the text class increases by up to 14.79% while the recall decreases slightly. This indicates that with SVM we have significantly fewer false positives and almost the same number of false negatives. For the binary class, SVM increases both precision and recall, by 12.87% and 4.77% respectively. For the encrypted class, SVM decreases precision by 0.6% but increases recall significantly, by 21.54%. The reason is that the misclassification rates (i.e., the numbers of false positives and false negatives) of SVM with the RBF kernel are quite different from those of CART, due to the nonlinear nature of the RBF kernel. For instance, using SVM, the misclassification rate for text data points classified as encrypted increases by up to 8%, while the misclassification rate for encrypted data points classified as text decreases by up to 10%.
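For readers who wish to approximate this experiment with off-the-shelf learners, the following sketch assumes scikit-learn is available; it uses synthetic stand-in data rather than our file pool, and note that scikit-learn's SVC performs one-vs-one voting rather than DAGSVM.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    # Stand-in data: 300 ten-feature entropy vectors <h_1 ... h_10> per
    # class, mimicking the ordering text < binary < encrypted.
    X = np.vstack([rng.uniform(lo, hi, size=(300, 10))
                   for lo, hi in [(0.0, 0.5), (0.4, 0.9), (0.8, 1.0)]])
    y = np.array(["text"] * 300 + ["binary"] * 300 + ["encrypted"] * 300)

    for name, clf in [("CART", DecisionTreeClassifier()),
                      ("SVM-RBF", SVC(kernel="rbf", gamma=50, C=1000))]:
        scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
        print(name, round(scores.mean(), 4))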

To validate hypothesis 2, we use the Jensen-Shannon divergence (JSD) measure (also known as the total divergence to the average) [22], which is a popular method for calculating the distance between two probability distributions. The JSD is the average of the Kullback-Leibler (KL) distances of two distributions to their average distribution. It is smooth, bounded within [0,1], and satisfies the symmetry condition. Let P and Q be two probability distributions. The KL distance (also known as relative entropy) between P and Q is defined as KLD(P‖Q) = ∑_i p_i log(p_i/q_i). Let M be the average probability distribution, M = (P+Q)/2. The Jensen-Shannon divergence, denoted JSD(P‖Q), is defined as follows. Note that JSD(P‖Q) = 0 if and only if P = Q.

that JSD(P‖Q) = 0 if and only if P = Q.

  JSD(P‖Q) = (1/2) KLD(P‖M) + (1/2) KLD(Q‖M)
           = (1/2) ( ∑_i p_i log(p_i/m_i) + ∑_i q_i log(q_i/m_i) )
           = H(M) − (1/2) H(P) − (1/2) H(Q)    (3)

where H(M) = −∑_i m_i log m_i.
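Formula (3) transcribes directly into code. The sketch below (function names and the sample file path are our own) computes the JSD between the byte distribution of a file's first b bytes and that of the whole file, mirroring the hypothesis-2 experiment.

    import math

    def entropy_bits(p):
        # H(P) = -sum p_i log2 p_i, with the convention 0 log 0 = 0.
        return -sum(x * math.log2(x) for x in p if x > 0)

    def jsd(p, q):
        # Formula (3): JSD(P||Q) = H(M) - H(P)/2 - H(Q)/2, M = (P + Q)/2.
        m = [(a + b) / 2 for a, b in zip(p, q)]
        return entropy_bits(m) - (entropy_bits(p) + entropy_bits(q)) / 2

    def byte_distribution(buf):
        counts = [0] * 256
        for v in buf:
            counts[v] += 1
        return [c / len(buf) for c in counts]

    # Compare the first 8 KB of a file against the entire file:
    data = open("sample.bin", "rb").read()   # any test file of your choosing
    print(jsd(byte_distribution(data[:8192]), byte_distribution(data)))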

In hypothesis 2, for a given file, P is the byte probability distribution of the first b bytes of the file and Q is the byte probability distribution of the entire file. We randomly chose 3,000 files from our dataset (1,000 files from each class), with an average file size of 153.93 KB (90% of the files are smaller than 300 KB and 10% are between 300 KB and 1 MB). Fig. 4(a), (b), and (c) show JSD(P‖Q) for the single-byte element set f_1, the two-byte element set f_2, and the three-byte element set f_3, respectively. The results show that the distance between the randomness of the beginning byte string of a file and that of the entire file is relatively small. For example, Fig. 4(b) shows that the two-byte probability distribution of only an 8 KB byte string from the beginning of a file represents that of the entire file with 96.57%, 89.26%, and 87.24% similarity for text, binary, and encrypted files, respectively. Therefore, we can deduce that the randomness of the beginning of a file represents the randomness of the entire file.


IV. FLOW CLASSIFICATION USING ENTROPY VECTORS

By treating a data flow as a sequence of bytes, we can classify flows based on our hypotheses from Section III. As no real-world packet traces with payload are publicly available for our experiments, due to privacy concerns, we use real-world packet header traces with payload randomly chosen from our collected files. For better classification performance, we first use feature selection methods to shrink the feature space without considerable accuracy degradation. We then choose the buffer size b such that it is small enough to avoid long delays and large space usage, yet big enough to keep accuracy high. We further improve classification efficiency and reduce memory consumption using entropy vector estimation.

A. Feature Selection

Feature selection is necessary for many reasons, including improving computational efficiency and alleviating the curse of dimensionality to achieve higher accuracy. For CART, the higher a feature appears in a tree, the more effective it is for classification. We use a voting scheme for feature selection over the 10 decision trees from the 10 cross-validation rounds. To find distinctive features, we prune the trees until we reach the threshold of a 2% decrease in accuracy. The features that are most frequently used in the pruned trees are then chosen as the selected features. This feature selection method yields the feature set {h_1, h_3, h_4, h_10}. For SVM, we use the Sequential Forward Searching (SFS) algorithm [23] for feature selection. Let n be the total number of features and n′ the number of features to select. In SFS, we start with an empty set of features; on each iteration, we tentatively add each remaining feature, evaluate the classification accuracy using SVM, and keep the feature that yields the maximum accuracy. The search terminates when n′ features have been selected (a sketch is given below). As we apply this algorithm to all cross-validation sets, we use a voting mechanism to choose the best features. The feature set chosen by this process is {h_1, h_2, h_3, h_9}. For both CART and SVM, we prefer features h_k with smaller k because they are much more memory-efficient to calculate. Factoring in this preference, the feature sets that we choose are φ_CART = {h_1, h_3, h_4, h_5} and φ_SVM = {h_1, h_2, h_3, h_5}.
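A compact sketch of the SFS loop may help; the evaluation function accuracy() is abstracted away, and all names here are our own illustrative choices.

    def sfs(features, n_prime, accuracy):
        # Greedy Sequential Forward Search: grow the selected set one
        # feature at a time, keeping whichever addition scores best.
        selected, remaining = [], list(features)
        while len(selected) < n_prime and remaining:
            best = max(remaining, key=lambda f: accuracy(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        return selected

    # accuracy(subset) must train and cross-validate the SVM on that
    # subset, e.g. sfs([f"h{i}" for i in range(1, 11)], 4, accuracy).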

Table II shows that after feature selection, classification accuracy changes only slightly, and in some cases even increases. It also shows that replacing h_10 with h_5 for CART and h_9 with h_5 for SVM has little impact on classification accuracy.

TABLE II
CLASSIFICATION ACCURACY AFTER FEATURE SELECTION

Decision Tree (CART)
                 ⟨h_1 ... h_10⟩   ⟨h_1, h_3, h_4, h_10⟩   ⟨h_1, h_3, h_4, h_5⟩
Total            79.19%           79.20%                  78.61%
Text file        80.34%           80.23%                  79.90%
Binary file      79.88%           79.89%                  79.05%
Encrypted file   77.36%           77.43%                  76.87%

SVM, RBF kernel (γ=50, C=1000)
                 ⟨h_1 ... h_10⟩   ⟨h_1, h_2, h_3, h_9⟩    ⟨h_1, h_2, h_3, h_5⟩
Total            86.51%           86.08%                  85.41%
Text file        86.33%           86.29%                  86.06%
Binary file      88.39%           87.69%                  86.65%
Encrypted file   85.09%           84.56%                  83.85%

B. Choosing Buffer Size

Next, we discuss how to determine b, the size of the buffer used to store packet payload for calculating entropy vectors. To achieve high time and space efficiency, this buffer needs to be small; to achieve high classification accuracy, it needs to be large enough. Finding a good tradeoff for the buffer size is therefore an important issue. The optimal value of b depends on how the classifier was trained. If we train the classifier using the entire content of every training file and classify flows using their first b bytes, Fig. 5(a) shows that we can achieve an accuracy of 86% with a buffer of 1 KB using SVM, based on the same classification model that was used for file classification. If we instead train the classifier using the first b bytes of every training file and classify flows using their first b bytes, Fig. 5(b) shows that we can achieve an accuracy of 86% with a buffer of only 32 bytes, for both SVM and CART.

Fig. 5. Classification accuracy as a function of buffer size b, for CART and SVM (RBF kernel): (a) training with entire file content; (b) training with the first b bytes of each file.

We choose the second training method because it requires a much smaller buffer to achieve the same accuracy; a 1 KB buffer costs much more time and space than a 32-byte buffer. Fig. 6 shows that the time and space used by our classifier increase linearly with buffer size. Our classifier was implemented in C++, and the experiments were carried out on a Linux workstation with an AMD Athlon 64 dual-core processor and 2.5 GB of RAM. Based on Fig. 6(a), the entropy vector calculation time using the second training method with b = 32 is almost 10 times less than with the first training method with b = 1024. Based on Fig. 6(b), the space per flow using the second training method is 30 times less than with the first. The error bars show the maximum and minimum values.

Fig. 6. Entropy vector calculation cost versus buffer size b: (a) calculation time (ms); (b) calculation space (bytes).


C. Handling Application Layer Headers

Many application protocols have headers. For instance, a picture in an HTML page is fetched from a web server in a separate flow with an HTTP header, which consists of text. Such a binary flow with application headers in text could be misclassified if the classification were based solely on the first few bytes of the flow. For headers of well-known application protocols, such as HTTP, SMTP, IMAP, and POP, which have a known format, our classifier strips them off using signature-based header detection techniques [5], [6] and uses only the application payload for classification. However, based on Internet flow statistics [1], known application flows and encrypted flows constitute around 50% and 10% of Internet traffic, respectively; thus, almost 40% of flows are unknown. Those include flows that have no application headers (such as FTP file transfer flows [24]) and flows that have unknown application headers. As application header sizes vary across applications, we use a cut-off threshold T as the maximum number of header bytes. That is, to calculate an entropy vector, we treat the (T+1)-th byte of a flow as the beginning of the flow.

To examine classification accuracy using the cut-off threshold T, for each experimented flow we randomly choose an application header length Y (Y ≤ T), assuming that the (Y+1)-th byte is the true beginning of the flow. We now have three methods to train our classifier: (1) entire file, using the entire file content; (2) static-b, using the first b bytes of each file; and (3) dynamic-b, using b consecutive bytes starting at a random location ranging from the first byte to the (T+1)-th byte (the window extraction is sketched below). For testing, however, we use b consecutive bytes starting at the (T+1)-th byte. Fig. 7 compares the classification accuracy achieved by these three training methods for different buffer sizes. The results show that the classification accuracies of the three training methods do not differ significantly, which implies that the randomness of a portion of a flow does not change significantly over the entire flow. The results also show that classification accuracy improves as buffer size increases. Comparing SVM and CART, SVM achieves up to 10% better accuracy than CART for most buffer sizes. Our experimental results show that with unknown application headers removed, our classifier achieves a classification accuracy of 80% with a buffer of 1024 bytes.
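The three training regimes reduce to how a byte window is cut from each file; this small sketch (names are ours) states them precisely.

    import random

    def training_window(content, b, T, method):
        # Byte window used to train the classifier (Section IV-C methods).
        if method == "entire":
            return content                     # (1) entire file content
        if method == "static":
            return content[:b]                 # (2) first b bytes
        if method == "dynamic":
            start = random.randrange(T + 1)    # (3) random start in [0, T]
            return content[start:start + b]
        raise ValueError(method)

    def testing_window(content, b, T):
        # Testing always takes b bytes starting at the (T+1)-th byte.
        return content[T:T + b]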

Fig. 7. Classification accuracy of the three training methods (entire, static, dynamic) versus buffer size b: (a) using SVM; (b) using CART.

D. Entropy Vector Estimation

1) Overview: Calculating exact entropy vectors for each flow can be costly in time and space when we use a 1 KB buffer. In Iustitia, we use a (δ, ε)-approximation algorithm to estimate S_k ≡ ∑_i m_{ik} log m_{ik} in Formula (2). This algorithm was proposed for estimating the entropy of data streams by Lall et al. in [25]; the basic idea is to use frequency moment estimation techniques [26]. The estimation algorithm has relative error of at most ε with probability at least 1 − δ. Thus, given X whose estimated value is X̃, we have Pr(|X − X̃| ≤ Xε) ≥ 1 − δ. Note that the entropy estimation algorithm is based on the assumption |f_i| ≫ b. This implies that we cannot use the estimation algorithm for h_1, because |f_1| = 256 is too small. For the remaining features, however, we can use the estimation algorithm, because |f_i| = 256^i is big enough for i ≥ 2.

In the (δ, ε)-approximation algorithm, the number of counters required for entropy estimation is significantly less than the number of counters required for exact entropy calculation, because the number of elements in f_i increases exponentially. The number of counters for feature φ_i is g × z_i, where g and z_i are calculated from δ and ε as follows: z_i = ⌈32 log_{|f_i|} b / ε²⌉ and g = 2 log_2(1/δ). A small helper for these sizes is sketched below.
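The counter budget follows directly from these two formulas; this illustrative helper (the ceiling on g is our own choice, to obtain a whole number of groups) computes g and z_i for given δ, ε, b, and feature width i.

    import math

    def counter_sizes(delta, eps, b, i):
        # z_i = ceil(32 * log_{|f_i|}(b) / eps^2), with |f_i| = 256^i;
        # g = 2 * log2(1/delta), rounded up to a whole number of groups.
        z_i = math.ceil(32 * math.log(b, 256 ** i) / eps ** 2)
        g = math.ceil(2 * math.log2(1 / delta))
        return g, z_i

    # e.g. the SVM operating point delta=0.75, eps=0.25, b=1024, feature h2:
    print(counter_sizes(0.75, 0.25, 1024, 2))  # g groups of z_2 counters each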

To save space, we need to ensure that the number of counters used in estimation is less than the number of counters used in exact entropy vector calculation. In theory, exact calculation assigns one counter to each element of each feature, so the total number of counters is ∑_{φ_i≠h_1} |f_i|. In practice, however, most of these counters are 0, because |f_i| ≫ b. Let α be the total number of counters in exact calculation. We calculate the lower bound for ε and δ as follows:

  ∑_{φ_i≠h_1} g × z_i = ∑_{φ_i≠h_1} (32 log_{|f_i|} b · 2 log_2(1/δ)) / ε² < α    (4)

Replacing |f_i| by 2^{8i}, we obtain the following bound for ε, where K_φ is the feature set coefficient, K_φ = 8 ∑_{φ_i≠h_1} 1/i:

  ∑_{φ_i≠h_1} (8 log_2 b · log_2(1/δ)) / (ε² i) < α  ⇒  ε > √( K_φ log_2 b · log_2(1/δ) / α )    (5)

For our feature sets, K_{φ_SVM} = 8.26 and K_{φ_CART} = 6.26. Also, for our dataset with b = 1024, we estimate α ≃ 1911; thus, for our dataset, Formula (5) reduces to ε > 0.18 √(log_2(1/δ)).

Our entropy vector estimation algorithm works in two stages. In the pre-processing stage, for each feature h_i ∈ φ, an array L_i is filled with g × z_i random locations in the buffer. In the online stage, the estimator waits until the buffer is filled or its timeout is reached. Then, for each flow, the entropy vector is estimated as follows. First, for each feature h_i and for each location L_{ij}, we pick the i consecutive bytes at that location in the buffer as an element, then scan the buffer from that location to the end and count the number of i-byte sequences equal to that element. Second, we group these counters into g groups of z_i counters each. Third, for each counter c_{ij}, we calculate b(c_{ij} log c_{ij} − (c_{ij} − 1) log(c_{ij} − 1)) as an unbiased estimator of S_k. Fourth, for each group, we calculate the average of its z_i unbiased estimators. Fifth, we calculate the median of the g group averages as the estimated value of S_k. Finally, the estimated entropy for feature h_i is calculated using Formula (2). A sketch of this online stage appears below.
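As an illustration of the online stage, the following Python sketch implements the five steps for one feature width k. It is a simplified rendering of the Lall et al. estimator under our own naming, with the number of element positions in the buffer standing in for the stream length.

    import math
    import random
    import statistics

    def estimate_hk(buf, k, g, z, seed=0):
        # (delta, eps)-style estimate of h_k for a filled buffer,
        # following the five online-stage steps described above.
        n = len(buf) - k + 1                  # number of k-byte elements
        rng = random.Random(seed)
        locations = [rng.randrange(n) for _ in range(g * z)]  # pre-processing
        ests = []
        for loc in locations:
            elem = buf[loc:loc + k]
            # Count occurrences of elem from loc to the end of the buffer.
            c = sum(1 for j in range(loc, n) if buf[j:j + k] == elem)
            x = c * math.log(c) - (c - 1) * math.log(c - 1) if c > 1 else 0.0
            ests.append(n * x)                # unbiased estimator of S_k
        # Average within each of the g groups, then take the median.
        groups = [ests[i * z:(i + 1) * z] for i in range(g)]
        s_k = statistics.median(sum(grp) / z for grp in groups)
        # Plug the S_k estimate into Formula (2), normalized to base |f_k|.
        return (math.log(n) - s_k / n) / math.log(256.0 ** k)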


Fig. 8. Classification accuracy based on different g and z values, plotted against ε and δ, (i) using SVM with RBF kernel and (ii) using decision tree (CART): (a) text class accuracy rate, (b) binary class accuracy rate, (c) encrypted class accuracy rate, (d) total accuracy rate.

Fig. 9. Comparing CT size with the total number of flows and packets over time: total # of packets, total # of flows, CT size without purging, CT size with purging.

2) Identifying ε and δ: As our main objective is to achieve higher classification accuracy with a smaller number of counters, we examine classification accuracy for each possible value of ε and δ. As shown in Fig. 2, the data points for the different classes are close to each other in the feature space and are not well segregated. This implies that the (δ, ε)-estimated entropy vector data points induce more overlap than exact entropy vector data points, so we expect the misclassification rate to increase. Based on the experimental results using the (δ, ε)-approximation algorithm shown in Fig. 8, we make the following observations. (1) Using SVM with the dynamic-b training method and a 1 KB buffer, we obtain a classification accuracy of 81.3%, with optimal values of ε and δ of 0.25 and 0.75, respectively. To improve classification accuracy, we reapply model selection on the new dataset and obtain a classification accuracy of 83% using γ = 10 and C = 1000. Fig. 8(i) shows the classification accuracy for the different classes over different ε and δ values, classified by SVM with the new model parameters. (2) Using CART with the dynamic-b training method and a 1 KB buffer, we obtain a classification accuracy of 76.03%, with optimal values of ε and δ of 0.5 and 0.1, respectively. Fig. 8(ii) shows the classification accuracy for the different classes over different ε and δ values. (3) The entropy vector estimation algorithm is not effective for small buffers, such as 32 bytes.

3) Time and Space Analysis: Table III compares the time and space requirements for calculating and estimating entropy vectors. Compared with exact calculation, the estimation process requires almost three times less memory, although it takes three times more time. In fact, the buffer is not big enough for this algorithm to be time efficient; however, it reduces the required space significantly.

TABLE III
TIME AND SPACE FOR CLASSIFICATION

                      Calculation            Estimation
                      Time        Space      Time         Space
b=1024B   SVM         5,428 µs    5.1 KB     16,421 µs    1.6 KB
          CART        5,658 µs    5.3 KB     18,751 µs    1.85 KB
b=32B     SVM         326 µs      195 B      -            -
          CART        276 µs      192 B      -            -

E. Classifier Delay Analysis

As shown in Fig. 1, Iustitia assigns a buffer of size b to each incoming traffic flow, and once the buffer is full, it calculates or estimates the flow's entropy vector for classification. To analyze classifier delay, we use a gigabit gateway trace from the UMASS Trace Repository [27], [28] that has 11,976,410 packet records, of which 41.16% are UDP or TCP data packets. The packet rate in this trace is 146,714.38 packets per second, and the trace contains 299,564 data flows. Without application header removal, we performed two sets of experiments using buffer sizes of 32 and 1024 bytes, respectively. With application header removal, we performed another two sets of experiments using 476 and 976 bytes as the maximum application header lengths, respectively; both sets use a buffer size of 1024. Because we do not know the exact application header length of each flow, we cut all flows uniformly by the chosen maximum application header length.

There are two types of classifier delay in Iustitia: post-classification packet delay (PCPD), denoted τ, and pre-classification flow delay (PCFD), denoted τ′. Post-classification packet delay is the classifier delay for packets that belong to a flow that has already been classified. Pre-classification flow delay is the classifier delay for the buffered packets that belong to a new flow. Based on Iustitia's architecture shown in Fig. 1, PCPD comprises the packet header hash calculation delay (τ_hash) and the classification table (CT) query delay (τ_CT), whereas PCFD comprises the buffer filling delay (τ_b), the entropy vector calculation delay (τ_EV), and the classification delay (τ_c). PCPD and PCFD are calculated as follows:

  τ = τ_hash + τ_CT    (6)
  τ′ = τ_b + τ_EV + τ_c    (7)

1) Post-Classification Packet Delay (PCPD): According to Formula (6), PCPD is the sum of the hash function delay τ_hash and the CT search delay τ_CT. The hash calculation delay of hash functions such as MD5 is relatively constant and less than 15 µs. The CT query delay is also constant, as we use hash tables to implement the CT. Yet, to keep the CT memory usage reasonable, we need to ensure that the CT does not contain records for obsolete flows. In other words, the size of the CT (i.e., the number of records) must stay very close to the number


of actual flows passing through the classifier. To purge the CT of obsolete TCP flows, once a packet with the RST or FIN flag is received, we remove its corresponding flow record from the CT. Fig. 9 shows that up to 46% of the TCP flows can be removed using this technique. However, some applications do not close their TCP sockets properly, and furthermore UDP flows have no packets with FIN or RST flags. Hence, we also use the packet arrival time to detect and remove obsolete flows. We consider a flow F_i obsolete if the following condition holds: t − t_{F_i} ≥ n λ_{F_i}, where t is the current time, t_{F_i} is the arrival time of the last packet of flow F_i, n is an arbitrary coefficient, and λ_{F_i} is the inter-arrival time of the last two packets of flow F_i. For implementation, we have two options: (1) use a default λ = 0.5 s that applies to all flows, or (2) keep λ_{F_i} for each flow (i.e., for each record in the CT). Hence, each record in the CT contains 130 bits under option (1) and 162 bits under option (2): 128 bits for the MD5 hash as the key, 32 bits for λ_{F_i} (option (2) only), and 2 bits for the flow label as the value. Note that λ_{F_i} may vary significantly across packet traces and across time periods, because packet inter-arrival time depends heavily on the packet inter-departure times of the flow's application protocol, the network throughput and available bandwidth, and network congestion. It is therefore advisable to choose between options (1) and (2) based on the dynamics of the network, the network flow statistics, and the Iustitia application. Fig. 10(a) and (b) show the cumulative probability (CDF) of packet payload size and of λ_F for our packet trace, respectively. A sketch of the purging rule follows.
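The timeout rule is a one-line predicate over the CT. This illustrative sketch (all names ours) keeps a per-flow λ as in option (2), defaulting to the 0.5 s of option (1) until two packets have been seen.

    N = 4  # timeout coefficient n; 4 was optimal for the trace studied here

    class CTRecord:
        def __init__(self, label, now, lam=0.5):
            self.label = label        # flow label (2 bits in the paper)
            self.last_arrival = now   # t_Fi: arrival time of the last packet
            self.lam = lam            # lambda_Fi: last inter-arrival time (s)

    def touch(rec, now):
        # Update the per-flow inter-arrival estimate on every packet.
        rec.lam = now - rec.last_arrival
        rec.last_arrival = now

    def purge(ct, now):
        # Drop flows satisfying t - t_Fi >= n * lambda_Fi.
        stale = [fid for fid, r in ct.items()
                 if now - r.last_arrival >= N * r.lam]
        for fid in stale:
            del ct[fid]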

Fig. 10. CDF of packet size and inter-arrival time: (a) cumulative probability of packet payload size (bytes); (b) cumulative probability of packet inter-arrival time (s).

We also need to determine the parameter n based on the given time and space requirements. Larger n values may lead to a larger CT, as more obsolete records remain in it, while small n values may lead to unnecessary reclassification of flows that were classified before but have since been removed from the CT. Thus, considering the classification space requirements shown in Table III, we are interested in n values that are just large enough to avoid reclassification of ongoing flows. Our experimental results show that n = 4 is optimal for the given traffic trace, though it could be tuned differently for other packet traces. Fig. 9 illustrates that the CT size after purging stays fairly constant at 29,713 flows over time.

Fig. 11 shows the cumulative probability of the average PCPD per flow in our trace. Up to 98% of the packets have a delay of less than 20 µs, which is significantly less than the average inter-arrival time for most flows.

Fig. 11. CDF of the average post-classification packet delay (τ, microseconds) per flow.

2) Pre-Classification Flow Delay (PCFD): According to Formula (7), PCFD is the sum of the buffering delay (τ_b), the entropy vector calculation/estimation delay (τ_EV), and the classification delay (τ_c). Clearly, τ_b can be calculated as τ_b = ∑_{j=1}^{c} λ_j, where c is the number of packets required to fill the buffer. For a fixed buffer size b, the number of packets is inversely proportional to the packet payload size. The cumulative distribution of packet payload size in the given packet trace (shown in Fig. 10(a)) follows a bimodal packet size distribution, where up to 10% of the packets have a payload size of 1480 bytes and more than 60% of the packets have a payload size of less than 140 bytes. This implies that the average number of packets required to fill a buffer with b = 32 is almost 1, although it increases for larger buffers. Note that some buffers may not be filled completely, because the buffer timeout can elapse and an incomplete buffer is then sent for classification. To evaluate PCFD and the total number of packets needed to fill a buffer, we divide our packet trace into chunks within 1-second time units. For each unit, we group the flows initiated in that unit and calculate the average number of required packets and the PCFD for each group of flows. Fig. 12(a) shows the average number of packets before classification, where the buffering timeout is 2 seconds. According to the figure, the average number of packets before classification is 1.1 for b = 32 and between 1.4 and 1.6 for larger buffers. Note that the percentage of buffers that are completely filled before classification is 75% for b = 32, and less than 14% for larger buffers.

Fig. 12. Number of packets required to fill the buffer and the total pre-classification delay for different buffer sizes (buffer size = 32; buffer size = 1024; MHL = 476, buffer size = 1024; MHL = 976, buffer size = 1024): (a) average number of packets c used for classification; (b) average total pre-classification delay τ′ (seconds).


The entropy vector calculation/estimation delay (τ_EV) was shown in Table III. The classification delay (τ_c) using SVM or CART is almost constant at less than 2 µs, which is negligible compared to τ_b and τ_EV. Fig. 13 shows the CDF of the Iustitia online classification delay (τ_EV + τ_c) relative to the average inter-arrival time of a flow. The graph shows that for b = 32 and b = 1024, the online classification delay for 94.9% and 91.2% of the flows, respectively, is less than 0.1 of the average inter-arrival time of the flows. Therefore, the buffering delay (τ_b) is the dominant factor in PCFD in Iustitia. Note that in Fig. 13, the SVM and CART curves are almost identical, which suggests that they induce the same delay in the system. Fig. 12(b) shows the pre-classification flow delay, and clearly illustrates the domination of τ_b: for b = 32, τ′ = 44 ms ± 13 ms, whereas for larger buffer sizes, which require more packets, τ_b > 230 ms. Note that Fig. 12(b) does not include the buffer timeout delay.

Fig. 13. CDF of the ratio of classification delay (τ_EV + τ_c) to average inter-arrival time, for b = 32 and b = 1024 with SVM and CART.

F. Discussion

1) Classifying Multi-nature Flows: In practice, a flow may include contents of different natures. For instance, a Simple Mail Transfer Protocol (SMTP) flow may include a text email with an image attachment; the SMTP client can also turn on SSL and transmit the email message encrypted. Although such flows are considered binary flows under our flow nature definitions, they can be misclassified because only the entropy vector of the beginning of the flow is calculated. Such misclassification can also be abused by attackers to defraud Iustitia when it is used for security applications: attackers can add deceiving padding at the beginning of a flow to cause misclassification. For instance, if we prioritize output buffers such that binary flows have the highest priority and encrypted flows the lowest, an attacker may prepend encrypted-like padding to a flow and send it over to bypass complex signature matching. For security applications, and for applications concerned about misclassification of multi-nature flows, one solution is to periodically delete the CT record of a flow, which in turn causes the flow to be reclassified. Since flows can have different rates, the period is based on a certain amount of payload bytes received. We further examine periodic classification of multi-nature flows and its accuracy on real-life flows in Section VI.

2) Network Tunneling: Network tunneling adds further complication to flow classification. A tunnel may contain multiple flows of different natures. If the tunnel is encrypted, we classify it as an encrypted flow; if it is not encrypted, it can be classified as either a binary or a multi-nature flow. However, if we can strip off the tunnel header, we can apply Iustitia to every flow inside the tunnel and classify each separately.

V. FINER-GRAINED FLOW NATURE IDENTIFICATION

In this section, we investigate finer-grained classification of flows beyond the three broad categories of text, binary, and encrypted. Specifically, for encrypted flows, we classify them by encryption algorithm; for binary flows, we first classify them into audio, video, image, executable, compressed, and document flows. Going further, we classify binary flows by their binary content format; for example, we classify audio flows into categories such as MP3, WAV, and WMA.

A. Encrypted Flows

We created a dataset of 13,656 flows using 6 encryption algorithms: Advanced Encryption Standard (AES), Pretty Good Privacy (PGP), Data Encryption Standard (DES), Blowfish (BF), Triple DES (DES3), and RC4. Fig. 14(a) shows that the total classification accuracy based on the first b bytes of each flow varies from 25% to 50%. If each of these encryption algorithms produced totally random output for its first b bytes, the classification accuracy would have been 1/6 = 16.6%. Our classification accuracy is much better than 16.6%, which implies that the distribution of byte randomness in the beginning portion of encrypted flows is not quite uniform. The descending trend of the graph, however, indicates that with larger buffers the distribution of byte randomness becomes more uniform. To better understand the distribution of byte randomness in encrypted flows, we repeated the experiment using a randomly chosen run of consecutive bytes from the middle of each flow. Fig. 14(b) shows that the classification accuracy then varies from 16% to 25%. In summary, we find that the beginning portion of encrypted flows carries more information for identifying the encryption algorithm than the middle portion. These findings are interesting and surprising, and may have applications in cryptanalysis and digital forensics.
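As a sketch of how such accuracy figures can be produced, the snippet below trains an SVM with an RBF kernel and a CART-style decision tree on entropy vectors of the first b bytes, assuming scikit-learn and the entropy_vector helper above; the single train/test split is our simplification of the evaluation protocol, and flows are assumed to be at least b bytes long:

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def train_and_score(flows, labels, b=1024):
    """flows: byte strings (each at least b bytes of an encrypted flow);
    labels: algorithm names (AES, PGP, DES, BF, DES3, RC4)."""
    X = [entropy_vector(f[:b]) for f in flows]
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3)
    svm = SVC(kernel="rbf").fit(X_tr, y_tr)          # SVM, RBF kernel
    cart = DecisionTreeClassifier().fit(X_tr, y_tr)  # CART-style tree
    return svm.score(X_te, y_te), cart.score(X_te, y_te)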

We further study the effectiveness of our method for each encryption method. Fig. 15 shows that, surprisingly, flows encrypted with AES and PGP can be classified with up to 99.9% accuracy. For the other encryption methods, the classification accuracy is not as good, although still better than random guessing (1/6 = 16.6%). The classification accuracy for b bytes chosen at random within the flow, instead of from the beginning, is shown in Fig. 16. As expected, the classification accuracy is much lower. Surprisingly, the classification accuracy for AES flows is better than for the others.


Fig. 14. Total classification accuracy for encrypted flows: (a) using the first b bytes of each flow; (b) using random b′ bytes of each flow. Results are shown for the SVM (RBF kernel) and decision tree (CART) classifiers with buffer sizes from 8 bytes to 64K.

Fig. 15. Classification accuracy for each encryption method (AES, PGP, DES3, BF, DES, RC4) using the first b bytes of each flow: (a) SVM; (b) decision tree (CART). The x-axis is the buffer size in bytes (8 to 64K).

B. Binary Flows

We created a dataset of 15,855 flows, the details of which are shown in Table IV. For these binary flows, we first classify them into audio, video, image, executable, compressed, and document flows; we call this “binary flow type classification”. Then, we classify them based on the binary content format; we call this “binary flow format classification”.

1) Binary Flow Type Classification: Fig. 17(a) shows that the total accuracy ranges from 87% to 96% for SVM and from 77% to 96% for CART. This high accuracy indicates that Iustitia can indeed be useful for finer-grained classification of binary flows. The reason for the high classification accuracy at very small buffer sizes is that files of the same format often have similar application headers and often contain the same file magic numbers [29].

Fig. 16. Classification accuracy for each encryption method (AES, PGP, DES3, BF, DES, RC4) using random b bytes of each flow: (a) SVM; (b) decision tree (CART). The x-axis is the buffer size in bytes (8 to 64K).

TABLE IV
DATASET DETAILS FOR BINARY FLOW CLASSIFICATION

Class        Extension   Format
Executables  N/A         Linux binaries
             .exe        Windows executable files
             N/A         Packed executables
Image        .jpg        JPEG file
             .png        Portable Network Graphics
             .gif        Graphics Interchange Format
Documents    .pdf        Portable Document Format
Compressed   .zip        ZIP file
             .tar.gz     GNU zip
Video        .avi        Audio Video Interleave
             .mpg        MPEG-1 format
Audio        .mp3        MPEG-1 Audio Layer 3
             .wav        WAVE file (for Windows)
             .wma        Windows Media Audio

The accuracy on larger buffer sizes is relatively higher, which implies that the beginning portion of each binary format has its own, almost unique byte-randomness distribution. We further investigate the classification accuracy using the entropy vector of random b bytes within the flow. Fig. 17(b) shows that we can achieve up to 80% total classification accuracy using SVM with relatively large buffer sizes of b > 16KB.
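To illustrate the magic-number effect mentioned above, the sketch below checks a buffer's leading bytes against a few widely documented signatures. This is an illustration only: Iustitia itself relies on entropy features rather than exact signature matching.

# Widely documented file magic numbers (illustrative subset).
MAGIC = {
    b"\x7fELF": "linux_executable",
    b"MZ": "windows_executable",
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
}

def magic_lookup(buf: bytes):
    """Return the format whose signature prefixes the buffer, if any."""
    for sig, fmt in MAGIC.items():
        if buf.startswith(sig):
            return fmt
    return None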

Fig. 17. Total classification accuracy for binary flows: (a) using the first b bytes of the flow; (b) using random b′ bytes of the flow. Results are shown for SVM (RBF kernel) and decision tree (CART) with buffer sizes from 32 bytes to 64K.

The flow classification accuracy for each flow type is shown in Fig. 18. We observe that some binary flow types, such as videos and executables, have very high accuracy: 96% to 99% and 92% to 96%, respectively. These results not only suggest that Iustitia is very effective for extracting executable flows for security purposes and video flows for copyright validation, but also imply that the other binary flow types can be classified with relatively high accuracy (images: 70% to 86%, documents: 81% to 99.88%, compressed: 57% to 97%, audio: 63% to 96%). Fig. 19 also shows that with large buffer sizes we can achieve a classification accuracy of 80% to 95% for executable, document, compressed, and audio flows.

2) Binary Flow Format Classification: Fig. 20 shows classification accuracy results for popular formats of binary flows. In this figure, the rows show the accuracy results for each specific flow type, and the columns show the accuracy results of classification using SVM and CART when the entropy vector is computed over the first b bytes and over random b consecutive bytes of each flow.

The first row of the figure is for executable flows. Figures (a) and (b) show that the first b bytes of each flow can be used effectively for classification, especially for small values of b. This suggests that these flows have specific, well-formatted header structures. Figures (c) and (d) also show high accuracy results on random b bytes of each flow.


Fig. 20. Classification accuracy for different application-specific formats of executable, image, compressed, video, and audio flows. Each row corresponds to one flow type (executables: Linux, Windows, packed; images: JPEG, PNG, GIF; compressed: GZIP, ZIP; video: AVI, MPEG; audio: MP3, WAV, WMA), and the four columns, panels (a)–(t), show SVM and CART results using the first b bytes and random b bytes of each flow, for buffer sizes from 32 bytes to 64K.

Fig. 18. Classification accuracy for each binary flow type (Exec, Image, Doc, Comp, Video, Audio) using the first b bytes of the flow: (a) SVM; (b) decision tree (CART). The x-axis is the buffer size in bytes (32 to 64K).

Fig. 19. Classification accuracy for each binary flow type (Exec, Image, Doc, Comp, Video, Audio) using random b bytes of the flow: (a) SVM; (b) decision tree (CART). The x-axis is the buffer size in bytes (32 to 64K).

This shows that Iustitia can be used effectively to distinguish between different executable flows, which is important because executable flow classification can be very useful for detecting exploits and packed executables.

The second row of the figure is for image flows. The accuracy results in Figures (e-h) show that the first b bytes of each flow are more informative than random b bytes. They also suggest that JPEG and GIF files can be detected with higher accuracy than PNG files in most cases.

The third row of the figure is for compressed flows. From the results in Figures (i-l), it is clear that using the first b bytes of each flow is preferable to using random b bytes. We further observe that the byte frequency distributions of the GZIP and ZIP formats are almost the same, which explains the low accuracy results in Figures (k) and (l).

The fourth row of the figure is for video flows. Figures (m-p) show that by using the first b bytes of each flow, we can achieve almost 100% accuracy. The accuracy is not as high when we use random b bytes of each flow, although for b > 8K it can reach 90% for both AVI and MPEG flows.

The last row of the figure is for audio flows. Figures (q-t) show that the MP3 and WAV formats can be classified with more than 95% accuracy in most cases. The WMA format, on the other hand, requires relatively larger buffer sizes to achieve high accuracy. Moreover, the figures show that, in general, using the first b bytes of each flow is preferable to using random b bytes for audio flow formats.


VI. EVALUATION ON REAL-LIFE TRACES

In this section, we evaluate the accuracy of Iustitia on real-life traffic traces. Note that application protocol information is not used for classification; although such information could be used to increase the accuracy of Iustitia, doing so is beyond the scope of this paper.

A. Traffic Trace Collection

Using publicly available traffic traces to evaluate Iustitia is not practical due to obvious privacy issues. To build a labeled dataset as ground truth, we must look into packet payloads of a given set of flows and label the flows manually. Since this clearly violates user privacy and is not approved by Institutional Review Boards (IRB), no accredited traces with payload are publicly available. Moreover, manually distinguishing between binary and encrypted flows by looking at the payload is extremely inaccurate, if not infeasible. Hence, to evaluate Iustitia on real-life traces, we first collected our own traces captured locally over the course of a week. We also built a testbed to collect traces for less popular services and protocols. For web service protocols such as HTTP and HTTPS, in addition to personal web browsing traces, we used the WebSPHINX web crawler [30] to collect additional traces. For mail service protocols such as IMAP, SMTP, and IMAPS/POP3S/SMTPS, in addition to personal mail transactions, we set up a mail server to send and receive emails. For video conferencing services such as Skype and ooVoo [31], we made and received calls and collected the traces. For remote desktop protocols such as RDP and VNC [32] (which uses the Remote Frame Buffer (RFB) protocol), we installed a client and a server on our testbed to generate traffic and collect traces. For file transfer protocols such as TFTP, FTP, and SFTP/SCP, we downloaded and uploaded some known files and collected traces. Finally, for remote shell protocols such as telnet and SSH, we installed a server and a client on the testbed and collected traces. We then labeled known flows according to their application layer protocol; for the others, such as HTTP and mail services, we used the Multipurpose Internet Mail Extensions (MIME) type of the carried message for labeling. Table V shows the statistics of the 6,409 flows collected from our traces, including 1,556 text, 2,425 binary, and 2,428 encrypted flows, together with their labels. Note that, as mentioned before, multi-nature flows are labeled as binary in the coarse-grained classification.
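A sketch of the labeling rules just described, assuming per-flow protocol metadata and, where available, the MIME type of the carried content (the mapping mirrors Table V; the helper names are ours):

from typing import Optional

# Protocols whose nature follows directly from the protocol (per Table V).
PROTOCOL_LABEL = {
    "https": "encrypted", "imaps/pop3s/smtps": "encrypted",
    "rdp": "encrypted", "skype": "encrypted", "ssh": "encrypted",
    "sftp/scp": "encrypted", "telnet": "text",
    "oovoo": "binary", "vnc": "binary",
}

def label_flow(protocol: str, mime_type: Optional[str] = None) -> Optional[str]:
    """Label by protocol; for HTTP, mail, and file transfer protocols, fall
    back to the MIME type of the carried message or file."""
    if protocol.lower() in PROTOCOL_LABEL:
        return PROTOCOL_LABEL[protocol.lower()]
    if mime_type is not None:  # e.g., HTTP, IMAP, SMTP, FTP, TFTP
        return "text" if mime_type.startswith("text/") else "binary"
    return None  # undecidable without inspecting content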

B. Evaluation Results

1) Classification Accuracy with No Application Header Removal: We first evaluate Iustitia on real-life flows with no application header removal. First, we train a classifier on the first b bytes of files and use it to classify all single-nature flows. Second, we use 10-fold cross-validation on the real-life flows to evaluate classification accuracy. Fig. 21 and Fig. 22 show the classification accuracy for different buffer sizes and for different protocols, respectively. Fig. 21(a) shows that classification accuracy is relatively low, varying from 60% to 70%, when we train the classifier on the first b bytes of files. Fig. 21(b) shows that classification accuracy improves substantially, varying from 75% to 90%, when we train the classifier on the first b bytes of flows.

Fig. 21. Classification accuracy of real-life flows with no header removal: (a) training with the first b bytes of files; (b) training with the first b bytes of real-life flows. Results are shown for CART and SVM (RBF kernel) with buffer sizes from 8 bytes to 8K.

Fig. 22 shows the classification accuracy per flow protocol when the buffer size is 1KB. Fig. 22(a) indicates that while some protocols, such as SMTP, RDP, Skype, and SFTP, are classified with high accuracy when the classifier is trained on the first 1K bytes of files, IMAPS/POP3S/SMTPS and telnet have very low accuracy. On the other hand, Fig. 22(b) shows that if we train the classifier on the first 1K bytes of flows, the accuracy increases significantly, yet telnet still has a very low accuracy rate. Looking into the telnet traces, we realized that even though telnet is labeled as text based on our definition, telnet traces include many control characters that cause them to be misclassified as binary flows. Because of such encoding differences in some application protocols (such as interactive text and binary sessions like telnet, and video conferencing), having enough training data points for a wide range of application protocols is very important for achieving the highest accuracy.

Fig. 22. Classification accuracy of real-life flows with no header removal, per flow protocol (https, http, imaps, imap, smtp, oovoo, rdp, vnc, skype, tftp, telnet, sftp, ftp): (a) training with the first 1K bytes of files; (b) training with the first 1K bytes of real-life flows. Results are shown for CART and SVM (RBF kernel).

2) Classification Accuracy with Application Header Removal: Next, we evaluate Iustitia on real-life flows when the application header is cut off. As the application header size varies across protocols, we evaluate Iustitia with cut-off thresholds ranging from 8 bytes to 8K bytes. Figure 23(a) shows the classification accuracy for a buffer size of 1K bytes when the classifier is trained on random 1K bytes of files.


TABLE V
COLLECTED REAL-LIFE FLOW TRACE DESCRIPTIONS AND STATISTICS

Application Layer Protocol  Transport  Total  Text   Binary  % Multi-nature  Encrypted  Service Description
HTTP                        TCP        1,459  1,006  453     50.33%          -          web
HTTPS                       TCP        238    -      -       -               238        secure web
IMAP                        TCP        66     48     18      99%             -          mail
SMTP                        TCP        220    43     177     98%             -          mail
IMAPS/POP3S/SMTPS           TCP        75     -      -       -               75         secure mail
oovoo                       UDP        311    -      311     0%              -          voice/video chat
RDP                         TCP        1,366  -      -       -               1,366      remote desktop
VNC                         TCP        1,250  -      1,250   0%              -          remote desktop
Skype                       UDP        466    -      -       -               466        secure voice/video chat
TFTP                        UDP        19     14     5       0%              -          file transfer
FTP                         TCP        332    120    212     0%              -          file transfer
Telnet                      TCP        325    325    -       -               -          shell
SSH                         TCP        33     -      -       -               33         secure shell
SFTP/SCP                    TCP        250    -      -       -               250        secure file transfer

Figure 23(b) shows the classification accuracy when the classifier is trained on 1K bytes of flows taken after the header cut-off threshold. The results indicate that when we use real-life flows for training, the classification accuracy increases by 31.01% and 23.9%, to 87.87% and 88.27% on average, for CART and SVM, respectively. The results also indicate that there is no single cut-off threshold that maximizes the accuracy for all flows. The reasons are threefold. First, the application header causes misclassification not for all flows, but only for the subset of flows that (1) are binary or encrypted, (2) have an application header of a different nature (e.g., text), and (3) have an application header that is relatively large compared to the buffer size. Second, the size of the application header varies widely across applications; even for one specific application, the header size can change dramatically. For instance, the size of a cookie in an HTTP Set-Cookie header can be as large as 4KB per cookie. Third, depending on the classification algorithm we use, the application header can actually help classification: since the headers of flows from the same application can be very similar, the byte randomness at the beginning of those flows is almost the same, which in turn can increase the classification accuracy.
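A minimal sketch of the header cut-off step, assuming a per-flow stream of reassembled payload chunks; the cut-off threshold T and buffer size b are the parameters varied in these experiments, and the helper is our illustration rather than the system's code:

def buffer_after_cutoff(stream, T: int, b: int) -> bytes:
    """Discard the first T payload bytes of a flow (the approximate
    application header), then buffer the next b bytes for classification."""
    buf = bytearray()
    skipped = 0
    for chunk in stream:
        if skipped < T:
            drop = min(T - skipped, len(chunk))
            skipped += drop
            chunk = chunk[drop:]
        buf.extend(chunk)
        if len(buf) >= b:
            return bytes(buf[:b])
    return bytes(buf)  # flow ended before b bytes accumulated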

Fig. 23. Classification accuracy of real-life flows with header removal: (a) training with random 1K bytes of files; (b) training with 1K bytes taken after the cut-off threshold of real-life flows. Results are shown for CART and SVM (RBF kernel) with header cut-off thresholds from 8 to 8192 bytes.

Figure 24 shows the per-protocol classification accuracy with application header removal. The results in Fig. 24(a) and (b) show that by training the classifier with the first 1K bytes of files after a given cut-off threshold, some flows are classified with very high accuracy, such as SMTP (text flows), oovoo, RDP, and Skype, while others are classified with low accuracy, such as IMAPS/POP3S/SMTPS, IMAP, and VNC. The results in Fig. 24(c) and (d), however, show that by training the classifier with the first 1K bytes of real-life flows after the cut-off threshold, the accuracy increases for almost all flows except telnet. This means the first classifier is better trained for classifying telnet flows. Apart from telnet, the remaining flows have very high accuracy for all application header cut-off thresholds using the second classifier. Although the 8K cut-off threshold achieves the highest total accuracy, the other cut-off thresholds lead to high accuracy as well.

Fig. 24. Classification accuracy of real-life flows with header removal, per flow protocol (https, http, imaps, imap, smtp, oovoo, rdp, vnc, skype, tftp, telnet, sftp, ftp), for header cut-off thresholds from 8 to 8192 bytes: (a) CART and (b) SVM trained with the first 1KB after the cut-off point of files; (c) CART and (d) SVM trained with the first 1KB after the cut-off point of real-life flows.

3) Classification Accuracy on Multi-nature Flows: As explained in Section IV-F1, flows may include contents of different natures. To classify such flows based on their content nature, we periodically buffer flow content and reclassify the flow based on the buffered content. Hence, to evaluate classification accuracy, we divide a given flow into windows and treat each window as a new flow. To evaluate Iustitia's accuracy on multi-nature flows, we first label each window based on its content to build our dataset; since a window can include contents of multiple natures, we label it with the nature of the content that makes up most of the window. We then use cross-validation to evaluate Iustitia's classification accuracy for different window sizes. The buffer size in these experiments is 1K bytes.
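A sketch of the windowing step just described; the window size W and the majority-label rule follow the text, while the helper names are ours:

def split_into_windows(payload: bytes, W: int) -> list:
    """Treat each W-byte window of a flow's payload as a new flow."""
    return [payload[i:i + W] for i in range(0, len(payload), W)]

def label_window(byte_labels: list) -> str:
    """Ground-truth label: the nature covering most of the window's bytes."""
    return max(set(byte_labels), key=byte_labels.count)

Each window is then classified from its buffered bytes, using the 1K buffer size noted above.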

Fig. 25(a) shows the accuracy results when the classifier is trained on random 1K bytes of files. The accuracy across window sizes is, on average, 50.33% for CART and 50.84% for SVM. Fig. 25(b) shows the accuracy results when the classifier is trained on the first 1K bytes of each window. The accuracy across window sizes improves significantly, to 88.97% and 87.12% on average for CART and SVM, respectively. The window size does not have a significant impact on the accuracy results; it can therefore be chosen by the operator, based on the Iustitia application, the classification system load, and user demand, to guarantee good service and a reliable classifier.

Fig. 25. Classification accuracy of multi-nature real-life flows: (a) training with random 1K bytes of files; (b) training with multi-nature real-life flows. Results are shown for CART and SVM (RBF kernel) with sample window sizes from 1K to 16K bytes.

VII. CONCLUSION

We make three major contributions in this paper. First, we propose the first method for classifying flows into text, binary, and encrypted at high speed with a small amount of memory. Our method is mainly based on machine learning and information theory techniques. Second, we extend our method to perform finer-grained classification of network flows and achieve high classification accuracy. Third, we implemented our method and performed extensive experiments.

REFERENCES

[1] “Internet2 NetFlow Statistics,” http://netflow.internet2.edu/.
[2] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” SIGCOMM Comput. Commun. Rev., vol. 35, pp. 217–228, 2005.
[3] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” in Proc. SIGCOMM, 2004.
[4] W. Lee and D. Xiang, “Information-theoretic measures for anomaly detection,” in Proc. IEEE S&P, 2001.
[5] Y. Zhang and V. Paxson, “Detecting backdoors,” in Proc. USENIX Security, 2000.
[6] H. Dreger, A. Feldmann, M. Mai, V. Paxson, and R. Sommer, “Dynamic application-layer protocol analysis for network intrusion detection,” in Proc. 15th USENIX Security Symposium, 2006.
[7] A. Dainotti, W. de Donato, and A. Pescape, “TIE: A community-oriented traffic classification platform,” in Proc. Traffic Monitoring and Analysis (TMA), 2009.
[8] S. Lee, S. Lee, H. Kim, D. Barman, C.-K. Kim, T. T. Kwon, and Y. Choi, “NeTraMark: A network traffic classification benchmark,” ACM SIGCOMM Computer Communication Review, 2011.
[9] J. P. Early, C. E. Brodley, and C. Rosenberg, “Behavioral authentication of server flows,” in Proc. 19th Annual Computer Security Applications Conference (ACSAC), 2003.
[10] C. V. Wright, F. Monrose, and G. M. Masson, “On inferring application protocol behaviors in encrypted network traffic,” Journal of Machine Learning Research, vol. 7, pp. 2745–2769, 2006.
[11] P. Dorfinger, G. Panholzer, and W. John, “Entropy estimation for real-time encrypted traffic identification,” in Proc. TMA, 2011.
[12] R. Alshammari and A. N. Zincir-Heywood, “Can encrypted traffic be identified without port numbers, IP addresses and payload inspection?” Computer Networks, vol. 55, no. 6, pp. 1326–1350, 2011.
[13] M. McDaniel and M. H. Heydari, “Content based file type detection algorithms,” in Proc. 36th Annual Hawaii Int. Conf. on System Sciences, 2003, pp. 323–333.
[14] W.-J. Li, K. Wang, S. J. Stolfo, and B. Herzog, “Fileprints: Identifying file types by n-gram analysis,” in Proc. 2005 IEEE Workshop on Information Assurance, 2005.
[15] A. Finamore, M. Mellia, M. Meo, and D. Rossi, “KISS: Stochastic packet inspection classifier for UDP traffic,” IEEE Transactions on Networking, to appear, 2010.
[16] Y. Wang, Z. Zhang, L. Guo, and S. Li, “Using entropy to classify traffic more deeply,” in Proc. IEEE Int. Conf. on Networking, Architecture, and Storage, 2011.
[17] K. Sayood, Introduction to Data Compression. Morgan Kaufmann, 1996.
[18] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[19] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[20] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” in Advances in Neural Information Processing Systems. MIT Press, 2000.
[21] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.
[22] J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Trans. on Information Theory, vol. 37, pp. 145–151, 1991.
[23] P. Somol, P. Pudil, J. Novovicova, and P. Paclik, “Adaptive floating search methods in feature selection,” Pattern Recognition Letters, vol. 20, pp. 11–13, 1999.
[24] J. Postel and J. Reynolds, “File transfer protocol,” RFC 959, 1985.
[25] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang, “Data streaming algorithms for estimating entropy of network traffic,” in Proc. SIGMETRICS, 2006, pp. 145–156.
[26] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating the frequency moments,” in Proc. ACM Symposium on Theory of Computing, 1996.
[27] “UMass Trace Repository,” http://traces.cs.umass.edu/.
[28] K. Suh, Y. Guo, J. Kurose, and D. Towsley, “Locating network monitors: Complexity, heuristics, and coverage,” in Proc. INFOCOM, vol. 1, pp. 351–361, 2005.
[29] “File Magic Number,” http://www.astro.keele.ac.uk/oldusers/rno/Computing/File_magic.html.
[30] R. C. Miller and K. Bharat, “SPHINX: A framework for creating personal, site-specific web crawlers,” in Proc. WWW, 1998. Available at http://www.cs.cmu.edu/~rcm/websphinx/.
[31] “ooVoo LLC,” www.oovoo.com.
[32] “Virtual Network Computing (VNC),” www.realvnc.com.

Amir R. Khakpour received his BS in Electrical Engineering from the University of Tehran in 2005 and his MSc in Computer and Communication Networks from Telecom SudParis in 2007. He received his Ph.D. degree in Computer Science from Michigan State University in 2012. His research interests include the Internet, network security, and large-scale systems.

Alex X. Liu received his Ph.D. degree in computer science from the University of Texas at Austin in 2006. He received the IEEE & IFIP William C. Carter Award in 2004 and an NSF CAREER award in 2009. He received the Withrow Distinguished Scholar Award in 2011 at Michigan State University. His research interests focus on networking, security, and dependable systems.