An Information Theoretical Approach to High-Speed Flow Nature Identification
Amir R. Khakpour    Alex X. Liu∗
Abstract—This paper concerns the fundamental problem of identifying the content nature of a flow, namely text, binary, or encrypted, for the first time. We propose Iustitia, a framework for identifying flow nature on the fly. The key observation behind Iustitia is that text flows have the lowest entropy and encrypted flows have the highest entropy, while the entropy of binary flows stands in between. We further extend Iustitia for the finer-grained classification of binary flows so that we can differentiate different types of binary flows (such as image, video, and executables) and even the file formats (such as JPEG and GIF for images, MPEG and AVI for videos) carried by binary flows. The basic idea of Iustitia is to classify flows using machine learning techniques where a feature is the entropy of every certain number of consecutive bytes. Our experimental results show that the classification can be done with high speed and high accuracy. On average, Iustitia can classify flows with 88.27% accuracy using a buffer size of 1K, with a classification time of less than 10% of packet inter-arrival time for 91.2% of flows.
Index Terms—Flow Content Analysis, Flow Identification.
I. INTRODUCTION
A. Motivation
In this paper, we aim to identify the content nature of flows
at high speed. We define the nature of a flow as text, encrypted,
or binary. A flow is text if and only if the entire flow content
consists of text in a natural language and is human readable
(e.g., printable ASCII subset, text/plain MIME type). Typical
text flow content includes plain-text HTML pages, email, chat,
telnet, source codes, etc. A flow is encrypted if and only if the
entire flow content consists of cipher-text. An encrypted flow
mostly consists of traffic over SSL, encrypted files, etc. A flow
is binary if and only if it is neither text nor encrypted. A binary
flow mainly consists of binary content, such as executable
code, multimedia files, etc. We further classify binary flows
into their type (e.g., image, video, and executable) and also
the file format (e.g., JPEG and GIF for images, MPEG and
AVI for videos). Fast identification of flow nature can enable
a variety of applications in networking. Below we give a few
examples.
∗ Alex X. Liu is the corresponding author of this paper. Email: [email protected]. Tel: +1 (517) 353-5152. Fax: +1 (517) 432-1061.
Amir R. Khakpour and Alex X. Liu are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824. The preliminary version of this paper, titled “Iustitia: An Information Theoretical Approach to High-speed Flow Nature Identification”, was published in ICDCS 2009. This work is supported in part by the National Science Foundation under Grant Numbers CNS-0716407, CNS-0916044, and CNS-0845513. This work is also supported in part by the National Natural Science Foundation of China (Grant No. 61272546).
E-mail: {khakpour, alexliu}@cse.msu.edu. Manuscript received July 9th, 2010.
Network monitoring and management. High-speed flow
nature identification allows ISPs to prioritize their traffic for
better quality of service. Considering an ISP serving a bank
and a call center, among the traffic to/from the bank network,
the ISP may give higher priority to the encrypted flows because
they most likely carry banking transactions. Among the traffic
to/from the call center, the ISP may give higher priority to
the binary flows because they most likely carry voice data.
High-speed flow nature identification also allows ISPs to
gather various statistics on traffic passing through a network.
Currently, up to 40% of Internet traffic belongs to unknown
applications [1]. Identifying the nature of flows helps ISPs
to understand better the type of traffic passing through their
networks and to manage their networks better.
Forensics. High-speed flow nature identification allows ISPs
to monitor and efficiently log their traffic for law enforcement
purposes when appropriate. For example, identifying binary
flows may help copyright enforcement as they may carry
copyrighted software and multimedia. As another example,
identifying text flows may allow law enforcement to perform
complex keyword searching for finding possible human com-
munications on the fly.
Performance improvement for Intrusion Detection/Prevention Systems (IDS/IPS). With the rapid growth of traffic demands and increasing signature complexity, IDSes/IPSes are becoming performance bottlenecks. When performing deep packet inspection (DPI) on each packet cannot meet the performance demand of users, high-speed flow nature identification allows an IDS/IPS to apply binary-related attack signatures on binary flows and text-related attack signatures on text flows, which is more efficient than applying all signatures on all flows.
Content-based switching, caching, and filtering. The current techniques for content switching, caching, and filtering are either port-based or require deep packet inspection to parse the flows' application headers. The port-based techniques are inaccurate because Internet services can operate on any port rather than their well-known ports, and different types of content can be delivered over the same protocol. The DPI-based techniques, on the other hand, are computationally expensive and may not satisfy the performance demand of users. Moreover, these techniques are not useful if the content type and format are not stated in the flow's application header. High-speed nature and content format identification can improve these techniques for higher accuracy and better performance. For instance, we can distinguish between encrypted messages, movies, images, executables, and chat messages in a fast and efficient way.
B. Technical Challenges
Although identifying flow nature is a fundamental net-
working problem, to our best knowledge, it has not been
investigated in prior literature. High-speed flow nature identi-
fication is a technically challenging problem. First, flow nature
identification cannot rely on well-known port numbers. Any
port number can be potentially customized for any application.
Thus, flow nature identification has to rely on examining
packet payload. Second, examining packet payload at high
speed is difficult for high bandwidth links.
C. Our Information Theoretical Approach
In this paper, we propose Iustitia1, a fast content-based flow
classifier that classifies flows into text, binary, and encrypted.
When a flow starts, Iustitia quickly identifies its nature based on a small amount of buffered packet payload and
forwards it to its corresponding buffer. The key observation
behind Iustitia is that text flows have the lowest entropy and
the encrypted flows have the highest entropy, while the entropy
of binary flows stands in between. The basic idea of Iustitia is
to classify flows using machine learning techniques where a
feature is the entropy of every certain number of consecutive
bytes.
The architecture of Iustitia is shown in Fig. 1. The classifier
has a Classification Table (CT) that contains flow IDs with
their corresponding class label. Upon arrival of a packet,
the classifier calculates the hash of the packet header to
identify its flow ID. If the flow ID is found in CT, the
flow splitter forwards the packet to its corresponding output
queue. Otherwise, the packet is buffered in its corresponding
queue. When the buffer of a flow is full, a set of entropy-
based classification features are extracted, where a feature is
the entropy of every certain number of consecutive bytes.
These extracted features are then fed into the classification
module. The output of the classification module is a label of
text, binary, or encrypted for coarse-grained classification, and
other binary sub-labels for finer-grained classification. When
the nature of a flow is identified, the flow ID along with its
corresponding label are stored in CT. Flows carried by TCP
are removed from CT when FIN or RST packets are seen.
Flows in CT are also removed when they are inactive for a
certain amount of time. The classification module is trained
offline using labeled data points extracted from known flows.
As Internet traffic evolves over time, it is important to retrain the classifier frequently to keep it up to date and maintain high classification accuracy.
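To make the online path concrete, the following is a minimal sketch (not the authors' implementation) of the Classification Table lookup and per-flow buffering described above; the names FlowClassifier, BUFFER_SIZE, and the classify callback are hypothetical stand-ins for the trained classification module.

```python
# Sketch of the online path in Iustitia's architecture: hash the packet
# header into a flow ID, consult the Classification Table (CT), and either
# forward labeled packets or buffer payload until enough bytes arrive.
import hashlib

BUFFER_SIZE = 1024  # bytes buffered per flow before classification

class FlowClassifier:
    def __init__(self, classify):
        self.ct = {}        # Classification Table: flow ID -> label
        self.buffers = {}   # per-flow payload buffers
        self.classify = classify  # trained model: bytes -> label

    @staticmethod
    def flow_id(src, dst, sport, dport, proto):
        # hash of the packet header fields identifies the flow
        key = f"{src}:{sport}-{dst}:{dport}/{proto}".encode()
        return hashlib.md5(key).hexdigest()

    def on_packet(self, header, payload):
        fid = self.flow_id(*header)
        if fid in self.ct:                  # known flow: use its label
            return self.ct[fid]
        buf = self.buffers.setdefault(fid, bytearray())
        buf.extend(payload)
        if len(buf) >= BUFFER_SIZE:         # buffer full: classify once
            self.ct[fid] = self.classify(bytes(buf[:BUFFER_SIZE]))
            del self.buffers[fid]
            return self.ct[fid]
        return None                         # still buffering

    def on_fin_or_rst(self, header):
        # TCP flows are removed from CT when FIN or RST packets are seen
        self.ct.pop(self.flow_id(*header), None)
```

In a full system the CT entries would also expire after a period of inactivity, as described above.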
We further use Iustitia for finer-grained classification of
binary and encrypted flows. We first classify binary flows
into executables, images, documents, compressed, video, and
audio. Second, we further classify binary flows based on
application-specific extensions. For instance, we classify im-
age flows further into the categories of JPEG, GIF, PNG, etc.
For encrypted flows, we further classify them based on their
encryption algorithms, namely AES, PGP, DES3, BF, DES,
and RC4.
1Iustitia (aka Lady Justice) is the Roman goddess of Justice, mostly depicted and sculpted blindfolded, holding a sword and measuring balances.
Fig. 1. Iustitia Architecture. (Online process: packets pass through application-layer header removal, a header hash calculator computing fid, a Classification Table (CT) lookup, a flow splitter, and an entropy vector calculator/estimator; classified flows are forwarded to text, binary, encrypted, and other-label output queues. Offline process: learning with decision tree/Support Vector Machine (SVM) classification produces a decision tree model and support vectors (SVs) from training and testing sets.)
II. RELATED WORK
To our best knowledge, there is no prior work on classifying
flows as text, binary, and encrypted. Prior work on traffic
classification for anomaly detection purposes mainly focuses
on the inspection of packet headers and the detection of
different network attacks using the statistical features of packet
header fields [2], [3], [4]. The anomalies that can be detected
and identified with such techniques are limited to attacks that
can be detected based on packet headers, such as port scans,
DOS, etc. These techniques are unable to discern attacks that
cannot be detected solely based on packet headers such as
malware, viruses, etc. Prior work on packet payload classifica-
tion using machine learning techniques focuses on identifying
what application the flow is associated with or whether a
flow contains code. Signature-based approaches include the
schemes for detecting well-known application protocols using
packet size, timing characteristics, and packet payload [5], [6],
[7], [8], [9]. TIE [7] and NeTraMark [8] are two examples
of traffic classification platforms and benchmarks for traffic
classification to classes such as web, FTP, DNS, games,
file sharing, etc. Machine learning and other classification
techniques have also been used to infer application protocols
from encrypted flows [10], [11], [12]. Yet, these techniques are orthogonal to Iustitia: they classify flows by their application protocol, whereas Iustitia uses the randomness of a flow's payload to identify the nature of the flow. Thus, the application domains of such work differ from Iustitia's.
Some work has been done on file type identification using file “fingerprints” (also called “fileprints”) [13], [14]. The fingerprint of a set of files with the same file type is the histogram of byte frequencies along with their variances. Given all file type fingerprints, an unknown file is identified based on its Mahalanobis distance to each file type fingerprint. These file type identification algorithms cannot solve our problem because we cannot buffer all packets of a flow to reassemble it into a file.
Recently, Finamore et al. proposed KISS, a method for identifying application protocol types based on packet payload [15]. KISS is designed specifically for UDP traffic and uses a chi-square-like test to extract payload statistical features. Wang et al. tested Iustitia on 8 different real-life datasets in [16] and obtained results similar to those reported in this paper. They further break each byte into two 4-bit halves and use Iustitia to improve the classification accuracy on encrypted traffic.
III. FILE CLASSIFICATION USING ENTROPY VECTOR
In this section, we first show how the randomness of a file can be represented by an entropy vector. Second, we show how we can change the entropy vector such that it can be computed on the fly. Third, we present two hypotheses, which will be used as the basis of Iustitia. Fourth, we show to what level of accuracy these hypotheses hold and how file classification can be useful for flow classification.
A. Measuring Randomness Using Entropy
According to information theory, the entropy of a sequence of m elements, where each element is in the set S = {e1, e2, . . . , en}, is defined as $-\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m}$, where mi is the number of occurrences of element ei. In this paper, we use normalized entropy, where the base of the logarithm is n and the unit of entropy is elements per symbol. Note that we assume 0 log 0 = 0. The entropy of a sequence of elements is a measure of the diversity or randomness of the sequence. The minimum entropy value is 0, attained when all m elements in the sequence are the same, and the maximum entropy value is 1, attained when all n elements occur the same number of times (i.e., m/n each) in the sequence. For simplicity, we use “entropy” to mean “normalized entropy” in the rest of the paper.
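As a concrete illustration, normalized entropy for a byte sequence (n = 256) can be computed as follows; this is a sketch for exposition, not the paper's implementation.

```python
import math
from collections import Counter

def normalized_entropy(data: bytes, n: int = 256) -> float:
    """Normalized entropy of a byte sequence: logarithms are taken
    base n, so the result lies in [0, 1]. Terms with zero count are
    simply absent, which realizes the convention 0 log 0 = 0."""
    m = len(data)
    counts = Counter(data)
    return -sum((c / m) * math.log(c / m, n) for c in counts.values())
```

A sequence of identical bytes yields entropy 0, and a sequence containing each of the 256 byte values equally often yields entropy 1, matching the bounds stated above.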
To compute the entropy of the content of a file F with size m, we first consider F as a Markov information source where each state (i.e., element) is a character. We then represent the entropy of the content of F by an entropy vector H(F) = ⟨ℏ1, · · · , ℏm⟩, where ℏi is the entropy of the file modeled as an (i−1)-order Markov information source. Hence, ℏi is calculated as follows:

$$\hbar_i = -\sum_{a_1 \in S} P(a_1) \sum_{a_2 \in S} P(a_2|a_1) \cdots \sum_{a_i \in S} P(a_i|a_1 \cdots a_{i-1}) \log P(a_i|a_1 \cdots a_{i-1}) \qquad (1)$$

where ai is a state (i.e., a specific character), P(a1) is the probability of occurrence of a1 (i.e., P(a1) = P(a1 = ej) = mj/m), and P(a2|a1) is the probability of occurrence of a2 given a1 as the preceding character (i.e., P(a2 = ej | a1 = el) = mjl/ml) [17].
Computing ℏi using Formula (1) has time and space complexity O(m^i); therefore, the complexity of computing H(F) is O(m^m). Due to this exponential complexity, it is very difficult to measure the randomness of a file (or a flow) using such an entropy vector on the fly.
B. Measuring Randomness on-the-fly
We measure the randomness of a file's content in quadratic time and space complexity (O(m^2)) by ignoring the conditional probabilities of elements and calculating entropy as if the elements were independent. Hence, given a file F, we can treat each byte as an element and F as a multiset of such elements, and then calculate the entropy of F over the set of all 2^8 possible bytes. In general, we can treat every k consecutive bytes in file F as an independent element and F as a multiset of such elements, then calculate the entropy of F over the set of all 2^{8k} possible sequences of k bytes. For example, given F = ⟨a, b, c, d⟩, we can treat every 2 consecutive bytes as an element and F as the multiset {ab, bc, cd}, and then calculate the entropy of F over the set of all 2^{16} possible two-byte sequences. Let fk denote the set of all 2^{8k} possible sequences of k bytes, and hk denote the entropy of a given file F of m bytes, deemed as a multiset of sequences of k bytes over the set fk. We calculate hk using Formula (2), where mik is the number of occurrences of fk's i-th element in F. Note that |F| = ∑_{i=1}^{|fk|} mik = m − k + 1 and |fk| = 2^{8k}.
$$h_k = -\sum_{i=1}^{|f_k|} \frac{m_{ik}}{m-k+1} \log_{|f_k|}\!\left(\frac{m_{ik}}{m-k+1}\right) = \frac{1}{m-k+1}\left[\sum_i m_{ik} \log(m-k+1) - \sum_i m_{ik} \log m_{ik}\right] = \log(m-k+1) - \frac{1}{m-k+1} \sum_i m_{ik} \log m_{ik} \qquad (2)$$

(Logarithms in the last two expressions are taken base |fk|.)
In this paper, we calculate hi using Formula (2) instead of ℏi using Formula (1) for entropy vectors. For classification, each hi is treated as a feature of F, where we call i the feature width of hi. Accordingly, the new entropy vector consists of features of F, each of which can be computed in linear time and space.
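A direct computation of hk per the final form of Formula (2) can be sketched as follows (illustrative only; the function name h_k is ours). It slides a window of k bytes over the buffer and counts the occurrences of each k-byte element.

```python
import math
from collections import Counter

def h_k(data: bytes, k: int) -> float:
    """Entropy feature h_k of Formula (2): treat every k consecutive
    bytes as one element over the alphabet f_k of size 2^(8k), and
    normalize by taking logarithms base |f_k|."""
    m = len(data)
    n_elems = m - k + 1                       # |F| = m - k + 1 windows
    counts = Counter(data[i:i + k] for i in range(n_elems))
    log_base = 8 * k * math.log(2)            # log |f_k| = log 2^(8k)
    s = sum(c * math.log(c) for c in counts.values())  # sum m_ik log m_ik
    return (math.log(n_elems) - s / n_elems) / log_base
```

For the example F = ⟨a, b, c, d⟩ with k = 2, the three windows ab, bc, cd each occur once, so h2 = log(3)/log(2^16).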
C. Hypotheses and Validation
We make the following two hypotheses, which will serve as the basis of our flow classification scheme: (1) Using file
entropy vectors, we can classify files into text, encrypted, and
binary files. (2) The randomness of the beginning portion of
a file represents the randomness of the entire file.
To validate these two hypotheses, we collected a pool of three classes of files: 52,273 binary files (including executables and JPG, GIF, AVI, MPG, PDF, and ZIP files), 24,985 text files (including text documents, manuals, txt files, log files, and HTML files), and 13,656 encrypted files (generated using PGP, AES, DES, etc.). We validated hypothesis 1 using 10 rounds of cross-validation over these files. Each cross-validation round uses 6,000 files equally drawn from each class. For each file, we calculated an entropy vector of size 10, from h1 to h10. Fig. 2 shows our dataset in the space of the three features h1, h2, and h3. As expected, most data points for text files have the lowest entropies, most data points for encrypted files have the highest entropy values, and most data points for binary files stand in between.
We then apply two classification methods, decision tree (CART) [18] and Support Vector Machines (SVM) [19], to each dataset. The results are shown in Fig. 3(a) and 3(b) and Table I. Classification accuracy in this paper is represented by the F1 score, which is a single-valued summary of precision and recall and enables simple comparison of classifier accuracy across classes. Using CART, we achieve a classification accuracy of about 79.19%. The classification accuracy for each class is very close to the total accuracy (i.e., the accuracy over all classes). The misclassification rate for each class is about 10% for CART. Using SVM, which is commonly used in machine learning for complex feature spaces, after model selection we achieve the best classification accuracy with a Radial Basis Function (RBF) kernel and parameters γ = 50 and C = 1000. For multi-class classification, we also use DAGSVM [20], which is the fastest among multi-class voting methods [21]. Using this classification model, the total accuracy is improved up to 86.51%.
Fig. 2. Dataset in the space of h1, h2, and h3 (classes: text, binary, encrypted).
Fig. 3. File classification accuracy for CART and SVM: (a) classification accuracy using CART; (b) classification accuracy using SVM with RBF kernel (C=1000, γ=50). Panels plot per-class and total accuracy, and misclassification rates (text→binary, text→encrypted, binary→text, binary→encrypted, encrypted→text, encrypted→binary), against the cross-validation index.
TABLE I
CLASSIFICATION ACCURACY FOR CART AND SVM

                 Decision Tree (CART)                          SVM - RBF Kernel (γ=50, C=1000)
                 Precision/Recall   Misclassification          Precision/Recall   Misclassification
                                    Text    Binary  Encrypted                     Text    Binary  Encrypted
Total Accuracy   79.19%                                        86.51%
Text File        80.90%/79.94%      -       8.09%   11.97%     95.69%/78.65%      -       3.18%   18.16%
Binary File      80.38%/79.34%      8.57%   -       12.05%     93.15%/84.11%      3.35%   -       12.56%
Encrypted File   76.52%/78.25%      10.46%  11.29%  -          75.92%/96.79%      0.20%   3.01%   -
Fig. 4. The JSD between the element frequency distribution of the first b bytes of each file and that of the entire file, for text, binary, and encrypted files: (a) single-byte probability distribution distance; (b) two-byte probability distribution distance; (c) three-byte probability distribution distance. The y-axis is JSD(P‖Q) in elements per symbol; the x-axis is the buffer size from the beginning of the file (KBytes).
Comparing CART and SVM, the results show that when we use SVM instead of CART, the classification precision for the text class increases by up to 14.79% while the recall decreases slightly. This indicates that with SVM we have significantly fewer false positives and almost the same number of false negatives. For the binary class, with SVM both precision and recall increase, by 12.87% and 4.77%, respectively. For the encrypted class, SVM decreases precision by 0.6% but increases recall significantly, by 21.54%. The reason is that the misclassification rates (i.e., the numbers of false positives and false negatives) for SVM with an RBF kernel are quite different from CART's due to the nonlinear nature of the RBF kernel. For instance, using SVM, the misclassification rate for text data points classified as encrypted increases by up to 8%, and the misclassification rate for encrypted data points classified as text decreases by up to 10%.
To validate hypothesis 2, we use the Jensen-Shannon divergence (JSD) measure (also known as the total divergence to the average) [22], a popular method for calculating the distance between two probability distributions. JSD is the average of the Kullback-Leibler (KL) distances of the two distributions to their average distribution. JSD is smooth, bounded within [0,1], and symmetric.
Let P and Q be two probability distributions. The KL distance (also known as relative entropy) between P and Q is defined as $KLD(P\|Q) = \sum_i p_i \log \frac{p_i}{q_i}$. Let M be the average probability distribution ($M = \frac{P+Q}{2}$). The Jensen-Shannon divergence, denoted JSD(P‖Q), is defined as follows. Note that JSD(P‖Q) = 0 if and only if P = Q.

$$JSD(P\|Q) = \frac{1}{2} KLD(P\|M) + \frac{1}{2} KLD(Q\|M) = \frac{1}{2}\left(\sum_i p_i \log \frac{p_i}{m_i} + \sum_i q_i \log \frac{q_i}{m_i}\right) = H(M) - \frac{1}{2} H(P) - \frac{1}{2} H(Q) \qquad (3)$$

where $H(M) = -\sum_i m_i \log m_i$.
In hypothesis 2, for a given file, P is the byte probability distribution of the first b bytes of the file and Q is the byte probability distribution of the entire file. We randomly chose 3,000 files from our dataset (1,000 files from each class) with an average file size of 153.93KB (90% of the files are less than 300KB and 10% are between 300KB and 1MB). Fig. 4(a), (b), and (c) show JSD(P‖Q) for the single-byte element set f1, the two-byte element set f2, and the three-byte element set f3, respectively. The results show that the distance between the randomness of the beginning byte string of a file and that of the entire file is relatively small. For example, Fig. 4(b) shows that the two-byte probability distribution of only an 8KB byte string from the beginning of a file represents the entire file with 96.57%, 89.26%, and 87.24% similarity for text, binary, and encrypted files, respectively. Therefore, we can deduce that the randomness of the beginning of a file represents the randomness of the entire file.
IV. FLOW CLASSIFICATION USING ENTROPY VECTORS
By treating a data flow as a sequence of bytes, we can classify flows based on our hypotheses in Section III. As no real-world packet traces with payload are publicly available for our experiments due to privacy concerns, we use real-world packet header traces with payload randomly chosen from our collected files. For better classification performance, we first use feature selection methods to shrink the feature space without considerable accuracy degradation. We then choose the buffer size b such that it is small enough to avoid long delays and large space usage, yet big enough to keep accuracy high. We further improve classification efficiency and reduce memory consumption using entropy vector estimation.
A. Feature Selection
Feature selection is necessary for many reasons, including improving computational efficiency and alleviating the curse of dimensionality to achieve higher accuracy. For CART, the higher a feature is in a tree, the more effective that feature is for classification. We use a voting scheme for feature selection over the 10 decision trees from the 10 cross-validation rounds. In order to find distinctive features, we prune the trees until we reach the threshold of a 2% decrease in accuracy. Then, the features that are most frequently used in the pruned trees are chosen as the selected features. This feature selection method yields the feature set {h1, h3, h4, h10}. For SVM, we use the Sequential Forward Searching (SFS) algorithm [23] for feature selection. Let n be the total number of features and n′ be the number of selected features. In SFS, we start with an empty set of features. On each iteration, we try adding one candidate feature and evaluate the classification accuracy using SVM; we then keep the candidate that yields the maximum accuracy among all remaining features. The search terminates when we have selected n′ features. As we apply this algorithm on all cross-validation sets, we use a voting mechanism to choose the best features. The feature set chosen by this process is {h1, h2, h3, h9}. For both CART and SVM, we prefer features hk with smaller k because they are much more memory-efficient to calculate. Factoring in this preference, the feature sets that we choose are φCART = {h1, h3, h4, h5} and φSVM = {h1, h2, h3, h5}.
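The SFS procedure described above can be sketched as follows; evaluate is a hypothetical stand-in for training and testing an SVM on a candidate feature subset.

```python
# Sketch of Sequential Forward Searching (SFS): starting from an empty
# set, greedily add the feature that maximizes accuracy until n_prime
# features are chosen.

def sfs(all_features, n_prime, evaluate):
    selected = []
    while len(selected) < n_prime:
        # try each remaining feature and keep the best-scoring one
        best = max((f for f in all_features if f not in selected),
                   key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected
```

Note that SFS is greedy: it evaluates at most n candidates per iteration rather than all 2^n subsets, which is why it is practical for choosing n′ out of the 10 entropy features.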
Table II shows that after feature selection, classification accuracy changes only slightly and in some cases even increases. It also shows that replacing h10 by h5 for CART and replacing h9 by h5 for SVM have little impact on classification accuracy.
TABLE II
CLASSIFICATION ACCURACY AFTER FEATURE SELECTION

Decision Tree (CART)
                 〈h1 . . . h10〉   〈h1, h3, h4, h10〉   〈h1, h3, h4, h5〉
Total            79.19%           79.20%              78.61%
Text File        80.34%           80.23%              79.90%
Binary File      79.88%           79.89%              79.05%
Encrypted File   77.36%           77.43%              76.87%

SVM - RBF Kernel (γ=50, C=1000)
                 〈h1 . . . h10〉   〈h1, h2, h3, h9〉    〈h1, h2, h3, h5〉
Total            86.51%           86.08%              85.41%
Text File        86.33%           86.29%              86.06%
Binary File      88.39%           87.69%              86.65%
Encrypted File   85.09%           84.56%              83.85%
B. Choosing Buffer Size
Next, we discuss how to determine b, the size of the buffer used to store packet payload for calculating entropy vectors. To achieve high time and space efficiency, this buffer needs to be small; to achieve high classification accuracy, it needs to be large enough. Finding a good tradeoff for the buffer size is therefore important. The optimal value of b depends on how the classifier was trained. If we train the classifier using the entire content of every training file and classify flows using their first b bytes, Fig. 5(a) shows that we can achieve an accuracy of 86% using a buffer of 1K bytes with SVM, based on the same classification model used for file classification. If we train the classifier using the first b bytes of every training file and classify flows using their first b bytes, Fig. 5(b) shows that we can achieve an accuracy of 86% using a buffer of only 32 bytes for both SVM and CART.
Fig. 5. Classification accuracy based on buffer size b (8 bytes to 8K bytes), for CART and SVM with RBF kernel: (a) training with entire file content; (b) training with the first b bytes of each file.
We choose the second training method because it uses a much smaller buffer to achieve the same accuracy: a 1KB buffer costs much more time and space than a 32-byte buffer. Fig. 6 shows that the time and space used by our classifier increase linearly with buffer size. Our classifier was implemented in C++, and the experiments were carried out on a Linux workstation with an AMD Athlon 64 dual-core processor and 2.5 GB of RAM. Based on Fig. 6(a), the entropy vector calculation time using the second training method with b = 32 is almost 10 times less than with the first training method with b = 1024. Based on Fig. 6(b), the space per flow using the second training method is 30 times less than with the first training method. The error bars show the max and min values.
Fig. 6. Entropy vector calculation as a function of buffer size b (log scale): (a) calculation time (ms); (b) calculation space (bytes).
C. Handling Application Layer Headers
Many application protocols have headers. For instance, a picture in an HTML page is fetched from a web server in a separate flow with an HTTP header, which consists of text. Such a binary flow with a textual application header could be misclassified if classification were based solely on the first few bytes of the flow. For headers of well-known application protocols, such as HTTP, SMTP, IMAP, and POP, which have a known format, our classifier strips them off using signature-based header detection techniques [5], [6] and uses only the application payload for classification. However, based on Internet flow statistics [1], known application flows and encrypted flows constitute around 50% and 10% of Internet traffic, respectively; thus, almost 40% of flows are unknown. Those include flows that have no application headers (such as FTP file transfer flows [24]) and flows with unknown application headers. As application header sizes vary across applications, we use a cut-off threshold T as the maximum number of header bytes. That is, to calculate an entropy vector, we treat the (T + 1)-th byte of a flow as the beginning of the flow.
To examine classification accuracy using the cut-off threshold T, for each experimented flow we randomly choose an application header length Y (Y ≤ T), assuming that the (Y + 1)-th byte is the true beginning of the flow. We now have three methods to train our classifier: (1) entire file: we use the entire file content; (2) static-b: we use the first b bytes of each file; (3) dynamic-b: we use b consecutive bytes starting at a random location ranging from the first byte to the (T + 1)-th byte. For testing, however, we use b consecutive bytes starting at the (T + 1)-th byte. Fig. 7 compares the classification accuracy achieved by these three training methods for different buffer sizes. The results show that the classification accuracies of the three training methods do not significantly differ. This implies that the randomness of a portion of a flow does not change significantly over the entire flow. The results also show that classification accuracy improves as buffer size increases. Comparing SVM and CART, SVM achieves up to 10% better accuracy than CART for most buffer sizes. Our experimental results show that with unknown application headers removed, our classifier achieves a classification accuracy of 80% with a buffer of 1024 bytes.
Fig. 7. Classification accuracy for the three training methods (entire, static, dynamic) at buffer sizes from 8 bytes to 8K bytes: (a) using SVM; (b) using CART.
D. Entropy Vector Estimation
1) Overview: Calculating exact entropy vectors for each
flow can be costly in time and space when we use a buffer
of 1K bytes. In Iustitia, we use a (δ, ε)-approximation
algorithm to estimate S_k ≡ Σ_i m_{ik} log m_{ik} in Formula (2).
This algorithm was proposed by Lall et al. [25] for estimating
the entropy of data streams. The basic idea is to use
frequency moment estimation techniques [26]. The estimation
algorithm has a relative error of at most ε with probability at
least 1 − δ. Thus, given X whose estimated value is X̃, we
have Pr(|X − X̃| ≤ Xε) ≥ 1 − δ. Note that the entropy
estimation algorithm is based on the assumption |f_i| ≫ b.
This implies that we cannot use the estimation algorithm for
h_1 because |f_1| = 256 is too small. However, the rest of the
features can use the estimation algorithm because |f_i| = 256^i is
big enough for i ≥ 2.
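To make the estimator concrete, here is a minimal Python sketch of this median-of-means scheme (our own simplification, not the authors' implementation; g groups of z counters each, single-byte elements by default):

```python
import math
import random

def estimate_S(buf, g, z, block=1, seed=0):
    """Median-of-means estimate of S = sum_i m_i log m_i over the
    `block`-byte elements of `buf`, using g groups of z counters each."""
    rng = random.Random(seed)
    b = len(buf) - block + 1           # number of elements in the stream
    group_means = []
    for _ in range(g):
        ests = []
        for _ in range(z):
            pos = rng.randrange(b)         # random location in the buffer
            elem = buf[pos:pos + block]    # the element at that location
            # count matches of elem from the location to the end of the buffer
            c = sum(1 for j in range(pos, b) if buf[j:j + block] == elem)
            # unbiased estimator b * (c log c - (c - 1) log (c - 1))
            ests.append(b * ((c * math.log(c) - (c - 1) * math.log(c - 1))
                             if c > 1 else 0.0))
        group_means.append(sum(ests) / z)      # average within each group
    group_means.sort()
    return group_means[len(group_means) // 2]  # median across groups
```

Averaging within groups controls the variance (the ε parameter) while taking the median across groups controls the failure probability (the δ parameter).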
In the (δ, ε)-approximation algorithm, the number of counters
required for entropy estimation is significantly less than
the number required for exact entropy calculation, because the
number of elements in f_i increases exponentially. The number
of counters for feature φ_i is g × z_i, where g and z_i are
calculated from δ and ε as follows:
z_i = ⌈32 log_{|f_i|} b / ε²⌉, and g = 2 log₂(1/δ).
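For concreteness, these counter counts can be computed directly (a hypothetical illustration; the choice of features h_2 through h_4 and of (ε, δ) = (0.25, 0.5) is ours, not fixed by the paper):

```python
import math

def counters_per_feature(b, eps, delta, indices):
    """Counters needed by the (delta, eps)-approximation for features h_i,
    using log_{|f_i|} b = log2(b) / (8i) since |f_i| = 2^(8i)."""
    g = math.ceil(2 * math.log2(1 / delta))
    return {i: g * math.ceil(32 * (math.log2(b) / (8 * i)) / eps ** 2)
            for i in indices}

# e.g. features h_2..h_4 with b = 1024 and (eps, delta) = (0.25, 0.5)
counters = counters_per_feature(1024, 0.25, 0.5, (2, 3, 4))
```

Compare these few hundred counters per feature with the |f_i| = 256^i counters a naive exact table would need.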
To save space in entropy estimation, we need to ensure that
the number of counters is less than the number of counters in
the exact entropy vector calculation. Theoretically, we assign one
counter to each element of a feature, so the total number
of counters is Σ_{∀φi≠h1} |f_i|. However, in practice, most of
the counters are 0 because |f_i| ≫ b. Let α be the total number of
counters in the exact calculation. We calculate the lower bound
for ε and δ as follows:
Σ_{∀φi≠h1} g × z_i = Σ_{∀φi≠h1} (32 log_{|f_i|} b · 2 log₂(1/δ)) / ε² < α    (4)
Replacing |f_i| by 2^{8i}, we obtain the following bound
for ε, where Kφ is the feature set coefficient, Kφ = 8 Σ_{∀φi≠h1} 1/i:
Σ_{∀φi≠h1} (8 log₂ b · log₂(1/δ)) / (ε² i) < α  ⇒  ε > √(Kφ log₂ b · log₂(1/δ) / α)    (5)
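This bound can be checked numerically (the Kφ and α values used here are the CART values from the surrounding text; δ = 0.5 is an arbitrary choice of ours):

```python
import math

def eps_lower_bound(k_phi, b, alpha, delta):
    """Smallest eps for which the (delta, eps) estimator uses fewer
    counters than exact calculation, per inequality (5)."""
    return math.sqrt(k_phi * math.log2(b) * math.log2(1 / delta) / alpha)

# With K_phi = 6.26, b = 1024, alpha = 1911, and, say, delta = 0.5:
bound = eps_lower_bound(6.26, 1024, 1911, 0.5)   # about 0.18
```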
For our feature sets, Kφ_SVM = 8.26 and Kφ_CART = 6.26. Also,
for our dataset with b = 1024, we estimate α ≃ 1911. Thus,
for our dataset, formula (5) reduces to ε > 0.18 √(log₂(1/δ)).
Our entropy vector estimation algorithm works in two
stages. In the pre-processing stage, for each feature h_i ∈ φ,
an array L_i is filled with g × z_i random locations in the buffer.
In the online stage, the estimator waits until the buffer is filled
or its timeout is reached. Then, for each flow, the entropy vector is
estimated as follows. First, for each feature h_i and for each
location L_ij in L_i, we pick the i consecutive bytes in the buffer
at that location as an element, then scan the buffer from the location
to the end and count the number of i-byte sequences that are
equal to that element. Second, we group these counters into
g groups, where each group has z_i counters. Third, for each
counter c_ij, we calculate b(c_ij log c_ij − (c_ij − 1) log(c_ij − 1))
as the unbiased estimator for S_k. Fourth, for each group, we
calculate the average of the z_i unbiased estimators. Fifth, we
calculate the median of the g average values as the estimated
[Figure: classification accuracy over different ε and δ values (each axis spanning 10^-2 to 10^0), (i) using SVM with RBF kernel and (ii) using decision tree (CART); panels: (a) text class accuracy rate, (b) binary class accuracy rate, (c) encrypted class accuracy rate, (d) total accuracy rate]
Fig. 8. Classification accuracy based on different g and z values
[Figure: total number of packets, total number of flows, and CT size with and without purging, over time units]
Fig. 9. Comparing CT size with total number of flows and packets
value of S_k. Finally, the estimated entropy for feature h_i is
calculated using Formula (2).
2) Identifying ε and δ: As our main objective is to achieve
higher classification accuracy with a smaller number of counters,
we examine classification accuracy for each possible value of ε
and δ. As shown in Fig. 2, the data points for different classes
are close to each other in the space and they are not well
segregated. This implies that the (δ, ε)-estimated entropy vector
data points induce more overlap than exact entropy
vector data points. Thus, we expect the misclassification rate
to increase. Based on the experimental results using the
(δ, ε)-approximation algorithm shown in Fig. 8, we make the
following observations: (1) Using SVM with the dynamic-b
training method and a 1K buffer size, we have a classification
accuracy of 81.3%, and the optimal values of ε and δ are
0.25 and 0.75, respectively. However, to improve classification
accuracy, we reapply the model selection on the new dataset,
and obtain a classification accuracy of 83% using γ = 10
and C = 1000. Fig. 8(i) shows the classification accuracy
for different classes using different ε and δ values classified
by SVM with the new model parameters. (2) Using CART with
the dynamic-b training method and a 1K buffer size, we have
a classification accuracy of 76.03%, and the optimal values
of ε and δ are 0.5 and 0.1, respectively. Fig. 8(ii) shows the
classification accuracy for different classes using different ε
and δ values. (3) The entropy vector estimation algorithm is
not effective for small buffer sizes such as 32 bytes.
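The tuned SVM uses an RBF kernel; as a generic sketch (not the authors' implementation), the kernel with the selected γ = 10 is simply:

```python
import math

def rbf_kernel(x, y, gamma=10.0):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2); gamma = 10 is the
    value obtained by the model selection above. (The C = 1000 parameter
    scales the SVM's misclassification penalty at training time and does
    not appear in the kernel itself.)"""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))
```

A large γ makes the kernel decay quickly, so only entropy vectors very close to a support vector influence its prediction.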
3) Time and Space Analysis: Table III compares the time
and space requirements for calculating and estimating entropy
vectors. Comparing the entropy vector exact calculation and
estimation, the estimation process requires almost three times
less memory, although it takes three times more time. In fact,
the buffer size is not big enough for this algorithm to be time
efficient; however, it reduces the required space significantly.
                 Calculation          Estimation
                 Time      Space     Time        Space
b=1024B  SVM     5,428µs   5.1KB     16,421µs    1.6KB
         CART    5,658µs   5.3KB     18,751µs    1.85KB
b=32B    SVM     326µs     195B      -           -
         CART    276µs     192B      -           -
TABLE III. TIME AND SPACE FOR CLASSIFICATION
E. Classifier Delay Analysis
As shown in Fig. 1, Iustitia assigns a buffer of size b for
each incoming traffic flow, and once the buffer is full, it
calculates/estimates the flow’s entropy vector for classification.
To analyze classifier delay, we use a gigabit gateway trace from
the UMASS Trace Repository [27], [28] that has 11,976,410
packet records where 41.16% of them are UDP or TCP data
packets. The packet rate in this trace is 146,714.38 packets per
second. This trace contains 299,564 data flows. Without appli-
cation header removal, we performed two sets of experiments
using buffer sizes 32 and 1024, respectively. With application
header removal, we performed another two sets of experiments using
476 and 976 as the maximum application header lengths,
respectively; both sets of experiments use buffer size 1024.
Because we do not know the exact application header length
for each flow, we uniformly cut off the same number of bytes,
given by the chosen maximum application header length, from
the beginning of every flow.
There are two types of classifier delays in Iustitia, post-
classification packet delay (PCPD) denoted by τ and pre-
classification flow delay (PCFD) denoted by τ ′. Post-
classification packet delay is the classifier delay for packets
that belong to a flow that has been already classified. Pre-
classification flow delay is the classifier delay for the buffered
packets that belong to a new flow. Based on Iustitia’s archi-
tecture shown in Fig. 1, PCPD comprises the packet header hash
calculation delay (τ_hash) and the classification table (CT) query
delay (τ_CT), whereas PCFD comprises the buffer filling delay
(τ_b), the entropy vector calculation delay (τ_EV), and the classification
delay (τ_c). PCPD and PCFD are calculated as follows:
τ = τ_hash + τ_CT    (6)
τ' = τ_b + τ_EV + τ_c    (7)
1) Post-Classification Packet Delay (PCPD): According to
Formula (6), PCPD is the sum of the hash function delay τ_hash and the CT search delay τ_CT. The hash calculation delay caused
by hash functions such as MD5 is relatively constant and less
than 15µs. The CT query delay is also constant as we use
hash tables to implement CT. Yet, to keep the CT memory
usage reasonable, we need to ensure that the CT does not
contain records for obsolete flows. In other words, the size of
CT (i.e., number of records) must be very close to the number
of actual flows passing through the classifier. To purge the CT
from obsolete TCP flows, once a packet with RST or FIN flag
is received, we need to remove its corresponding flow record
from CT. Fig. 9 shows that up to 46% of the TCP flows can
be removed using this technique. However, some applications
do not close their TCP sockets properly and furthermore UDP
flows do not have packets with FIN or RST flags. Hence,
we use the packet arrival time to detect and remove obsolete
flows. We assume that a flow Fi is obsolete if the following
condition holds: t − t_Fi ≥ n λ_Fi, where t is the current time,
t_Fi is the arrival time of the last packet of the flow F_i, n is an
arbitrary coefficient, and λ_Fi is the inter-arrival time between the
last two packets of the flow F_i. For implementation, we have
two options: (1) we can use a default λ = 0.5s that applies to
all the flows; (2) for each flow (i.e., record in CT), we can
keep λ_Fi. Hence, each record in CT contains 130 or 162 bits
using options (1) and (2), respectively: 128 bits for
the MD5 hash as the key, 32 bits for λ_Fi (for option (2) only), and
2 bits for the flow label as the value. Note that λ_Fi may vary
significantly across packet traces and time periods. This is
because packet inter-arrival time depends tightly on a wide
range of factors: the inter-departure times of packets for
different flow application protocols, network throughput and
available bandwidth, and network congestion. Therefore, it is
advisable to choose between options (1) and (2) based on the
dynamics of the network and its flow statistics, as well as
the Iustitia application. Fig. 10(a) and (b) show the cumulative
probability (CDF) of packet payload size and λ_F for our packet
trace, respectively.
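A minimal sketch of such a classification table with option (2) per-flow timeouts (the class and field names are illustrative assumptions, not the authors' implementation):

```python
import hashlib

class ClassificationTable:
    """Maps MD5(flow 5-tuple) -> (label, last arrival time, lambda_Fi)."""

    def __init__(self, n=4, default_lam=0.5):
        self.n = n                    # obsolescence coefficient
        self.default_lam = default_lam
        self.table = {}

    @staticmethod
    def key(five_tuple):
        return hashlib.md5(repr(five_tuple).encode()).digest()  # 128-bit key

    def update(self, five_tuple, label, now):
        k = self.key(five_tuple)
        if k in self.table:
            _, last, _ = self.table[k]
            lam = now - last          # inter-arrival of the last two packets
        else:
            lam = self.default_lam
        self.table[k] = (label, now, lam)

    def purge(self, now):
        """Drop flows satisfying t - t_Fi >= n * lambda_Fi."""
        stale = [k for k, (_, last, lam) in self.table.items()
                 if now - last >= self.n * lam]
        for k in stale:
            del self.table[k]
        return len(stale)
```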
[Figure: (a) CDF of packet payload size (0 to 1500 bytes); (b) CDF of packet inter-arrival time (0 to 1 s)]
Fig. 10. CDF of packet size and inter-arrival time
We also need to determine the parameter n based on the given
time and space requirements. While larger n values may
lead to a larger CT, as more obsolete records can remain in it, a
small n may lead to unnecessary reclassification of flows
that have been classified before but have been removed from
CT. Thus, considering the space required for classification shown
in Table III, we are interested in n values that are just
large enough to avoid reclassification of ongoing flows. Our
experimental results show that n = 4 is an optimal value for
the given traffic trace, though it could be tuned differently for
other packet traces. Fig. 9 illustrates that the CT size after
pruning stays fairly constant at 29,713 flows over time.
Fig. 11 shows the cumulative probability of the average PCPD
per flow in our trace. The figure shows that up to 98% of the
packets have a delay of less than 20µs, which is significantly less
than the average inter-arrival time for most flows.
[Figure: CDF of τ (microseconds), ranging 10 to 30µs]
Fig. 11. CDF of average post-classification packet delay per flow
2) Pre-Classification Flow Delay (PCFD): According to
Formula (7), PCFD is the sum of the buffering delay (τ_b), the
entropy vector calculation/estimation delay (τ_EV), and the
classification delay (τ_c). It is clear that τ_b can be calculated as
τ_b = Σ_{j=1}^{c} λ_j, where c is the number of packets required to
fill the buffer. For a fixed buffer size b, the number
of packets is inversely proportional to the packet payload
size. The cumulative distribution of the packet payload size in
the given packet trace (shown in Fig. 10(a)) follows a bimodal
packet size distribution, where up to 10% of the
packets have a payload size of 1480 bytes and more than 60% of
the packets have a payload size of less than 140 bytes. This
implies that the average number of packets required to fill
a buffer with b = 32 is almost 1, although it may increase
for larger buffers. Note that some buffers may not be filled
completely, because the buffer timeout can elapse, in which case
an incomplete buffer is sent for classification. To evaluate
PCFD and the total number of packets needed to fill a buffer, we
divide our packet trace into chunks of 1-second time units.
For each unit, we group the flows initiated in that time unit
and calculate the average number of required packets and the
PCFD for each group of flows. Fig. 12(a)
shows the average number of packets before classification,
where the buffering timeout is 2 seconds. According to the figure,
the average number of packets before classification is 1.1 for
b = 32 and between 1.4 and 1.6 for larger buffers. Note that
the percentage of buffers that are completely
filled before classification is 75% for b = 32, and less than
14% for larger buffers.
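The relationship between payload sizes and the number of packets c needed to fill a buffer can be illustrated with a toy computation (the payload sizes below are hypothetical, chosen to mimic the bimodal distribution described above):

```python
def packets_to_fill(buffer_size, payload_sizes):
    """Number of packets consumed before the buffer holds buffer_size
    bytes. Returns None if the flow ends first (a timeout would then
    flush the partial buffer for classification)."""
    filled = 0
    for count, size in enumerate(payload_sizes, start=1):
        filled += size
        if filled >= buffer_size:
            return count
    return None

# A small flow mimicking the bimodal distribution: mostly tiny payloads
# plus an occasional MTU-sized one.
flow = [120, 80, 1480, 140, 100]
```

With b = 32 the very first payload suffices, while b = 1024 must wait for a large packet to arrive.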
[Figure: (a) average number of packets c used for classification and (b) average total pre-classification delay τ' (seconds, log scale), over 1-second time units, for buffer size 32, buffer size 1024, and buffer size 1024 with MHL = 476 and MHL = 976]
Fig. 12. Number of required packets to fill the buffer and the total classification delay for different buffer sizes
The entropy vector calculation/estimation delay (τ_EV) was
shown in Table III. The classification delay (τ_c) using
SVM and CART is almost constant, at less than 2µs.
This is negligible compared to τ_b and τ_EV. Fig. 13 shows the CDF
of Iustitia's online classification delay (τ_EV + τ_c) over the average
inter-arrival time of a flow. This graph shows that for b = 32
and b = 1024, the online classification delay for 94.9%
and 91.2% of the flows, respectively, is less than 0.1 of the average
inter-arrival time of the flows. Therefore, the buffering delay
(τ_b) is the dominant factor in PCFD in Iustitia. Note that in
Fig. 13, the SVM and CART graphs are almost the same,
which suggests that they induce the same delay in the system.
Fig. 12(b) shows the pre-classification flow delay. This figure
clearly illustrates the domination of τ_b: for b = 32,
τ' = 44ms ± 13ms, while for larger buffer sizes that
require more packets, we have τ_b > 230ms. Note
that Fig. 12(b) does not include the buffer timeout delay.
[Figure: CDF of (τ_EV + τ_c) / average inter-arrival time, ranging 10^-6 to 10^4, for b = 32 and b = 1024 using SVM and CART]
Fig. 13. CDF of ratio of classification delay and inter-arrival delay
F. Discussion
1) Classifying Multi-nature Flows: In practice, a flow may
include contents of different natures. For instance, a Simple
Mail Transfer Protocol (SMTP) flow may include a text email
with an image attachment. The SMTP client can also turn on
SSL and transmit the email message encrypted.
Although such flows are considered binary flows based on our
flow nature definitions, they can be misclassified because only the
entropy vector of the beginning of the flow is calculated. Such
misclassification can also be abused by attackers to deceive
Iustitia when it is used for security applications. Attackers
can add deceptive padding at the beginning of a flow to
cause misclassification. For instance, if we prioritize different
output buffers such that binary flows have the highest priority
and encrypted flows have the lowest priority, an attacker may
prepend encrypted-looking padding to a flow
and send it over to bypass complex signature matching.
For security applications and other applications that are
concerned about misclassification of multi-nature flows, one
solution is to periodically delete the CT record of a flow, which in
turn causes the flow to be reclassified. Since flows can have
different rates, the period is based on a certain number of
payload bytes that have been received. We further examine
periodic classification of multi-nature flows and its accuracy
on real-life flows in Section VI.
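One way to realize this byte-count-driven reclassification is sketched below (the class name and the 64KB default period are illustrative assumptions of ours):

```python
class ReclassifyTrigger:
    """Forces a flow to be reclassified after every `period` payload
    bytes, so later portions of a multi-nature flow are inspected too."""

    def __init__(self, period=64 * 1024):
        self.period = period
        self.bytes_seen = {}   # flow key -> payload bytes since last check

    def on_payload(self, flow_key, nbytes, ct):
        seen = self.bytes_seen.get(flow_key, 0) + nbytes
        if seen >= self.period:
            ct.pop(flow_key, None)   # delete CT record -> reclassification
            seen = 0
        self.bytes_seen[flow_key] = seen
```

Counting payload bytes rather than wall-clock time makes the reclassification period rate-independent, as the text suggests.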
2) Network Tunneling: Network tunneling further complicates
flow classification. A tunnel may contain multiple
flows of different natures. If the tunnel is encrypted, we
classify the tunnel as an encrypted flow. If the tunnel is not
encrypted, it can be classified as either a binary or a multi-nature
flow. However, if we can strip off the tunnel header, we can
apply Iustitia to every flow inside the tunnel and classify each one
separately.
V. FINER GRAINED FLOW NATURE IDENTIFICATION
In this section, we investigate a finer-grained classification
of flows beyond the three broad categories of text, binary, and
encrypted. Specifically, for encrypted flows, we classify them
based on different encryption algorithms; for binary flows, we
first classify them into audio, video, image, executable,
compressed, and document flows. Going even further, for
binary flows, we classify them based on the binary content
format; for example, we classify audio flows into categories such
as MP3, WAV, and WMA.
A. Encrypted Flows
We created a dataset of 13,656 flows using 6 encryption
algorithms including Advanced Encryption Standard (AES),
Pretty Good Privacy (PGP), Data Encryption Standard (DES),
Blowfish (BF), Triple DES (DES3), and RC4. Fig. 14(a)
shows that the total classification accuracy based on the first
b bytes of each flow varies from 25% to 50%. If each of
these encryption algorithms produced totally random output for the
first b bytes, the classification accuracy would have been
1/6 = 16.6%. However, our classification accuracy is much
better than 16.6%. This implies that the distribution of byte
randomness in the beginning portion of encrypted flows is
not quite uniform. However, the descending trend of the graph
indicates that with larger buffers, the distribution of byte
randomness becomes more uniform. To better understand the
distribution of byte randomness in encrypted flows, we repeat
our experiment using a randomly chosen sequence of consecutive
bytes in the middle of each flow. Fig. 14(b) shows that the
classification accuracy then varies from 16% to 25%. In summary,
we find that the beginning portion of encrypted flows has
more information than the middle portion for identifying the
encryption algorithms. These findings are quite interesting and
surprising. They may have applications in cryptanalysis and
digital forensics.
We further study the effectiveness of our method for each
encryption method. Fig. 15 shows that, surprisingly, flows
encrypted by AES and PGP can be classified with up to
99.9% accuracy. However, the classification accuracy is not as
good for the other encryption methods, although it is still better
than random guessing, whose accuracy
is 1/6 = 16.6%. The classification accuracy for randomly
choosing b bytes in the flow, instead of from the beginning, is
shown in Fig. 16. As expected, the classification accuracy is
much lower. Surprisingly, the classification accuracy for AES
flows is better than others.
[Figure: total classification accuracy for SVM with RBF kernel and decision tree (CART) over buffer sizes 8 bytes to 64K; (a) using the first b bytes of the flow, (b) using random b' bytes of the flow]
Fig. 14. Total classification accuracy for encrypted flows
[Figure: classification accuracy for AES, PGP, DES3, BF, DES, and RC4 over buffer sizes 8 bytes to 64K; (a) using SVM, (b) using decision tree (CART)]
Fig. 15. Classification accuracy for each encryption method using first b bytes of each flow
B. Binary Flows
We created a dataset of 15,855 flows, the details of which
are shown in Table IV. For these binary flows, we first classify
them into audio, video, image, executable, compressed, and
document flows. We call this "binary flow type classification".
Then, we classify them based on the binary content
format. We call this "binary flow format classification".
1) Binary Flow Type Classification: Fig. 17(a) shows that
the total accuracy ranges from 87% to 96% for SVM and
from 77% to 96% for CART. This high accuracy indicates that
Iustitia can be indeed useful for finer grained classification of
binary flows. The reason behind high classification accuracy
on very small buffer sizes is that files of the same format often
have similar application headers and often contain the same
file magic numbers [29]. The accuracy on larger buffer sizes
[Figure: classification accuracy for AES, PGP, DES3, BF, DES, and RC4 over buffer sizes 8 bytes to 64K; (a) using SVM, (b) using decision tree (CART)]
Fig. 16. Classification accuracy for each encryption method using random b bytes of each flow
Class        Extension   Format
Executables  N/A         Linux binaries
             .exe        Windows executable files
             N/A         Packed executables
Image        .jpg        JPEG file
             .png        Portable Network Graphics
             .gif        Graphics Interchange Format
Documents    .pdf        Portable Document Format
Compressed   .zip        ZIP file
             .tar.gz     GNU zip
Video        .avi        Audio Video Interleave
             .mpg        MPEG-1 format
Audio        .mp3        MPEG-1 Audio Layer 3
             .wav        WAVE file (for Windows)
             .wma        Windows Media Audio
TABLE IV. DATASET DETAILS FOR BINARY FLOW CLASSIFICATION
is relatively higher. This implies that the beginning portion of
each binary format has its own almost unique byte randomness
distribution. Furthermore, we investigate the classification accuracy
using the entropy vector of random b bytes in the flow.
Fig. 17(b) shows that we can achieve up to 80% total classification
accuracy by using SVM with relatively large buffer sizes of
b > 16KB.
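The magic-number effect mentioned above can be illustrated with a trivial prefix check against a few well-known file signatures (a standalone illustration, not part of Iustitia's entropy-based classifier):

```python
# A few well-known file signatures (file magic numbers).
MAGIC = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF8": "gif",
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"%PDF": "pdf",
}

def sniff_format(first_bytes):
    """Return the format whose magic number prefixes the flow, if any."""
    for magic, fmt in MAGIC.items():
        if first_bytes.startswith(magic):
            return fmt
    return None
```

Since these signatures sit in the first few bytes, they shape the entropy features of small buffers and help explain the high accuracy at small b.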
[Figure: total classification accuracy for SVM with RBF kernel and decision tree (CART) over buffer sizes 32 bytes to 64K; (a) using the first b bytes of the flow, (b) using random b' bytes of the flow]
Fig. 17. Total classification accuracy for binary flows
The flow classification accuracy for each flow type is shown
in Fig. 18. We observe that some binary flow types such as
videos and executables have very high accuracy of 96% to 99%
and 92% to 96%, respectively. These results not only suggest
that Iustitia is very effective for extracting executable flows
for security purposes and video flows for copyright validation,
but also imply that other binary flows can be classified with
relatively high accuracy (Images: 70% to 86%, Documents:
81% to 99.88%, Compressed: 57% to 97%, Audio: 63% to
96%). Fig. 19 also shows that, using large buffer sizes, for
executable, document, compressed, and audio binary flows
we can achieve a classification accuracy of 80% to 95%.
2) Binary Flow Format Classification: Fig. 20 shows
classification accuracy results for popular formats of binary flows.
In this figure, the rows illustrate the accuracy results for each
specific type of flow, and the columns illustrate the accuracy
results of the classification using SVM and CART when the
entropy vector is calculated from the first b bytes and from
random b consecutive bytes of each flow.
The first row of the figure is for executable flows. Figures
(a) and (b) show that the first b bytes of each flow can
be used effectively for classification (especially for small b
values). This suggests that these flows may have specific and
well formatted header structures. Figures (c) and (d) also
show high accuracy results on random b bytes of each flow.
[Figure: panels (a)-(t); rows for executables (Linux, Windows, packed), images (JPEG, PNG, GIF), compressed (GZIP, ZIP), videos (AVI, MPEG), and audio (MP3, WAV, WMA); columns for SVM with first b bytes, CART with first b bytes, SVM with random b bytes, and CART with random b bytes, each over buffer sizes 32 bytes to 64K]
Fig. 20. Classification accuracy for different application-specific extensions for executable, image, compressed, video, and audio flows
[Figure: classification accuracy for Exec, Image, Doc, Comp, Video, and Audio flows over buffer sizes 32 bytes to 64K; (a) using SVM, (b) using decision tree (CART)]
Fig. 18. Classification accuracy for each binary flow type using first b bytes of the flow
[Figure: classification accuracy for Exec, Image, Doc, Comp, Video, and Audio flows over buffer sizes 32 bytes to 64K; (a) using SVM, (b) using decision tree (CART)]
Fig. 19. Classification accuracy for each binary flow type using random b bytes of the flow
This shows that Iustitia can be effectively used to distinguish
between different executable flows. This is very important
because executable flow classification can be very useful for
detecting exploits and packed executables.
The second row of the figure is for image flows. Looking at
the accuracy results in Figures (e-h), the first b bytes of each
flow are more informative than random b bytes.
They also suggest that JPEG and GIF files can be detected
with relatively higher accuracy than PNG files in most
cases.
The third row of the figure is for compressed flows. From
the results in Figures (i-l), it is clear that using the first b bytes
of each flow is preferred over using random b bytes of each
flow. We further observe that the distribution of byte
frequencies in the GZIP and ZIP formats is almost the same. This
explains the low accuracy results in Figures (k) and (l).
The fourth row of the figure is for video flows. Figures (m-p)
show that by using the first b bytes of each flow, we can achieve
almost 100% accuracy. However, the accuracy is not as high
if we use random b bytes of each flow. In cases where b > 8,
the accuracy can go up to 90% for both AVI and MPEG flows.
The last row of the figure is for audio flows. Figures (q-t)
show that the MP3 and WAV formats can be classified with more
than 95% accuracy in most cases. The WMA format, on the
other hand, requires relatively larger buffer sizes to achieve
high accuracy. Moreover, in general, the figures show that
using the first b bytes of each flow is preferable to using random
b bytes for audio flow formats.
VI. EVALUATION ON REAL-LIFE TRACES
In this section, we evaluate the accuracy of Iustitia on
real-life traffic traces. Note that application protocol information
is not used for classification. Although such information
could be used to increase the accuracy of Iustitia, doing so is out
of the scope of this paper.
A. Traffic Trace Collection
Using available traffic traces to evaluate Iustitia is not
practical due to obvious privacy issues. To have a labeled
dataset as the ground truth, for a given set of flows we are
required to look into packet payloads and label flows manually.
Since this clearly violates users privacy and it is not approved
by Institutional Review Board (IRB), no accredited traces
with payload are publicly available. Moreover, distinguishing
between binary and encrypted flows by looking at the payload
manually, if it is not infeasible it is extremely inaccurate.
Hence, to evaluate Iustitia on real-life traces, we first collected
our personal traces captured locally over one week. We also
built a testbed to collect traces for some less popular services
and protocols. For instance, for web protocols such as HTTP
and HTTPS, in addition to personal web browsing traces, we
used the WebSPHINX web crawler [30] to collect more traces.
For mail protocols such as IMAP, SMTP, and
IMAPS/POP3S/SMTPS, in addition to personal mail transactions,
we built a mail server to send and receive emails. For video
conferencing services such as Skype and ooVoo [31], we made
and received calls and collected the traces. For remote desktop
protocols such as RDP and VNC [32] (which uses the Remote
Frame Buffer (RFB) protocol), we installed a client and a
server on our testbed to generate traffic and collect traces.
For file transfer protocols such as TFTP, FTP, and SFTP/SCP,
we downloaded and uploaded some known files and collected
traces. Finally, for remote shell protocols such as telnet and
SSH, we installed a server and a client on the testbed and
collected traces. We then labeled known flows based on their
application layer protocol; for others, such as HTTP and mail
services, we used the Multipurpose Internet Mail Extensions
(MIME) type for labeling. Table V shows the statistics of the
6,409 flows collected from our traces, including 1,556 text,
2,425 binary, and 2,428 encrypted flows, along with their
corresponding labels. Note that, as mentioned before, multi-
nature flows are labeled as binary in coarse-grained
classification.
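The labeling step above can be sketched as follows. The protocol-to-nature and MIME-to-nature mappings below are illustrative assumptions (the paper does not give its exact tables), and `label_flow` is a hypothetical helper:

```python
# Hypothetical ground-truth labeling: flows from protocols with a fixed
# payload nature are labeled directly; HTTP and mail flows are labeled
# from their MIME type. Mappings are illustrative, not the paper's exact ones.

# Protocol -> nature for protocols whose payload nature is fixed.
PROTOCOL_NATURE = {
    "telnet": "text",
    "https": "encrypted", "imaps": "encrypted", "ssh": "encrypted",
    "sftp": "encrypted", "skype": "encrypted", "rdp": "encrypted",
}

# MIME top-level type -> nature for HTTP and mail flows.
MIME_NATURE = {
    "text": "text",
    "image": "binary", "video": "binary", "audio": "binary",
    "application": "binary",
}

def label_flow(protocol, mime_type=None):
    """Return 'text', 'binary', or 'encrypted', or None if unknown."""
    if protocol in PROTOCOL_NATURE:
        return PROTOCOL_NATURE[protocol]
    if mime_type:                      # e.g. "image/jpeg" -> "image"
        return MIME_NATURE.get(mime_type.split("/")[0])
    return None
```

For example, under these assumed mappings a telnet flow is labeled text, an HTTPS flow encrypted, and an HTTP flow carrying image/jpeg content binary.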
B. Evaluation Results
1) Classification Accuracy with No Application Header
Removal: We first evaluate Iustitia on real-life flows with
no application header removal. First, we train the classifier
with the first b bytes of files and use it to classify all single-
nature flows. Second, we use 10-fold cross-validation on
real-life flows to evaluate classification accuracy. Fig. 21 and
Fig. 22 show the classification accuracy for different buffer
sizes and for different protocols, respectively. Fig. 21(a) shows
that classification accuracy is relatively low, varying from
60% to 70%, when we train the classifier with the first b
bytes of files. Fig. 21(b) shows that classification accuracy is
substantially improved, varying from 75% to 90%, when we
train the classifier with the first b bytes of flows.
[Figure: classification accuracy vs. buffer size (8 bytes to 8 KB) for CART and SVM with RBF kernel; (a) training with the first b bytes of files, (b) training with the first b bytes of real-life flows.]
Fig. 21. Classification accuracy of real-life flows with no header removal
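The entropy features underlying these experiments can be sketched as follows. Per the framework's key idea, each feature is the entropy of a fixed number of consecutive bytes within the buffered prefix of a flow; the 64-byte chunk size and the function names below are assumptions for illustration, not the paper's exact parameters:

```python
import math
from collections import Counter

def shannon_entropy(chunk):
    """Shannon entropy of a byte string, in bits per byte (0 to 8)."""
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())

def flow_features(payload, buffer_size=1024, chunk_size=64):
    """Feature vector for a flow: the entropy of every chunk_size
    consecutive bytes within the first buffer_size bytes."""
    buf = payload[:buffer_size]
    return [shannon_entropy(buf[i:i + chunk_size])
            for i in range(0, len(buf), chunk_size)]
```

A constant byte stream yields entropy 0, a uniform byte distribution yields 8 bits per byte, and text, binary, and encrypted content fall at increasing points between these extremes, which is what the classifier exploits.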
Fig. 22 shows the classification accuracy per flow protocol
when the buffer size is 1KB. Fig. 22(a) indicates that while
some protocols, such as SMTP, RDP, Skype, and SFTP, are
classified with high accuracy when the classifier is trained
with the first 1K bytes of files, IMAPS/POP3S/SMTPS and
telnet have very low accuracy. On the other hand, Fig. 22(b)
shows that if we train the classifier with the first 1K bytes of
flows, the results improve significantly, yet telnet still has a
very low accuracy rate. Looking into the telnet traces, we
realized that even though telnet is labeled as text based on
our definition, telnet traces include many control characters
that cause them to be misclassified as binary flows. Because
of such encoding differences for some application protocols
(such as interactive text and binary sessions like telnet, and
video conferencing), to achieve the highest accuracy it is
very important to have enough training data points for a wide
range of application protocols.
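The telnet effect can be illustrated with a toy example (synthetic, not the paper's traces): interleaving telnet-style negotiation bytes into otherwise plain text widens the byte distribution and raises the measured entropy, nudging such flows toward the binary range:

```python
import math
from collections import Counter

def entropy(data):
    """Shannon entropy in bits per byte."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

# Plain printable text, as a clean text flow would carry.
text = b"free memory pages available on the remote host " * 8

# Interleave telnet-style negotiation bytes (IAC = 255 plus option
# codes) every 8 bytes, roughly mimicking an interactive session.
ctrl = bytes([255, 251, 1, 255, 253, 3])
mixed = b"".join(text[i:i + 8] + ctrl for i in range(0, len(text), 8))

# The control bytes add high-value symbols that never occur in the
# plain text, so entropy(mixed) exceeds entropy(text).
```

This is only a sketch of the mechanism; real telnet sessions mix control characters less regularly, but the direction of the effect is the same.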
[Figure: per-protocol classification accuracy (HTTPS, HTTP, IMAPS, IMAP, SMTP, ooVoo, RDP, VNC, Skype, TFTP, telnet, SFTP, FTP) for CART and SVM with RBF kernel; (a) training with the first 1K bytes of files, (b) training with the first 1K bytes of real-life flows.]
Fig. 22. Classification accuracy of real-life flows with no header removal
2) Classification Accuracy with Application Header Re-
moval: Next, we evaluate Iustitia on real-life flows when the
application header is cut off. As the application header size
varies across protocols, we evaluate Iustitia on real-life flows
with cut-off thresholds ranging from 8 bytes to 8K bytes.
Figure 23(a) shows the classification accuracy for a buffer
size of 1K bytes when the classifier is trained with random
1K bytes of files. Figure 23(b) shows the classification
TABLE V
COLLECTED REAL-LIFE FLOW TRACE DESCRIPTIONS AND STATISTICS

App. Layer Protocol   Transport   Total   Text    Binary   % Multi-nature   Encrypted   Service Description
HTTP                  TCP         1,459   1,006   453      50.33%           -           web
HTTPS                 TCP         238     -       -        -                238         secure web
IMAP                  TCP         66      48      18       99%              -           mail
SMTP                  TCP         220     43      177      98%              -           mail
IMAPS/POP3S/SMTPS     TCP         75      -       -        -                75          secure mail
ooVoo                 UDP         311     -       311      0%               -           voice/video chat
RDP                   TCP         1,366   -       -        -                1,366       remote desktop
VNC                   TCP         1,250   -       1,250    0%               -           remote desktop
Skype                 UDP         466     -       -        -                466         secure voice/video chat
TFTP                  UDP         19      14      5        0%               -           file transfer
FTP                   TCP         332     120     212      0%               -           file transfer
Telnet                TCP         325     325     -        -                -           shell
SSH                   TCP         33      -       -        -                33          secure shell
SFTP/SCP              TCP         250     -       -        -                250         secure file transfer

(The Total, Text, Binary, % Multi-nature, and Encrypted columns give the number of flows per label.)
accuracy when the classifier is trained with 1K bytes of flows
after the header cut-off threshold. The results indicate that
when we use real-life flows for training, the classification
accuracy increases by 31.01% and 23.9%, to 87.87% and
88.27% on average, for CART and SVM, respectively. The
results also indicate that there is no single cut-off threshold
that maximizes the accuracy for all flows. The reason is
threefold. First, the application header causes misclassification
not for all flows but only for the subset of flows that (1)
are binary or encrypted, (2) have an application header of
a different nature (e.g., text), and (3) have an application
header that is relatively large compared to the buffer size.
Second, the size of the application header varies widely across
applications; even for one specific application, the header size
can change dramatically. For instance, the size of a cookie in
an HTTP Set-Cookie header can be as large as 4KB per cookie.
Third, depending on the classification algorithm we use, the
application header can actually help classification: since the
headers of flows from the same application can be very similar,
the randomness at the beginning of those flows is almost the
same, which in turn can increase the classification accuracy.
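A minimal sketch of the header cut-off step, assuming the classifier simply skips the first `cutoff` bytes of the flow before buffering and extracting entropy features (the function names and the 64-byte chunk size are hypothetical):

```python
import math
from collections import Counter

def entropy(chunk):
    """Shannon entropy in bits per byte."""
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in Counter(chunk).values())

def features_after_cutoff(payload, cutoff, buffer_size=1024, chunk_size=64):
    """Entropy features over buffer_size bytes taken after skipping the
    first cutoff bytes (the assumed application header)."""
    buf = payload[cutoff:cutoff + buffer_size]
    return [entropy(buf[i:i + chunk_size])
            for i in range(0, len(buf), chunk_size)]

# Cut-off thresholds swept in the evaluation, from 8 bytes to 8K bytes.
CUTOFFS = [8, 32, 128, 512, 1024, 4096, 8192]
```

Because no single threshold suits every protocol, in practice one would sweep `CUTOFFS` and pick the threshold that performs best on the training data.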
[Figure: classification accuracy vs. header cut-off threshold (8 bytes to 8 KB) for CART and SVM with RBF kernel; (a) training with random 1K bytes of files, (b) training with 1K bytes after the threshold for real-life flows.]
Fig. 23. Classification accuracy of real-life flows with header removal
Figure 24 shows the per-protocol classification accuracy with
application header removal. The results in Fig. 24(a) and (b)
show that by training the classifier with the first 1K bytes of
files after a certain cut-off threshold, some flows, such as
SMTP (text flows), ooVoo, RDP, and Skype, are classified with
very high accuracy, but others, such as IMAPS/POP3S/SMTPS,
IMAP, and VNC, are classified with low accuracy. However,
the results in Fig. 24(c) and (d) show that by training the
classifier with the first 1K bytes of flows after the cut-off
threshold, the accuracy increases for almost all flows except
telnet. This means that the first classifier is better trained to
classify telnet flows. Apart from telnet, the remaining flows
have very high accuracy for all application header cut-off
thresholds using the second classifier. Although the 8K cut-off
threshold achieves the highest overall accuracy, other cut-off
thresholds lead to high accuracy as well.
[Figure: per-protocol classification accuracy vs. header cut-off threshold (8 bytes to 8 KB): (a) training with the first 1KB after the cut-off point of files, using CART; (b) the same training, using SVM; (c) training with the first 1KB after the cut-off point of real-life flows, using CART; (d) the same training, using SVM. Protocols shown: HTTPS, HTTP, IMAPS, IMAP, SMTP, ooVoo, RDP, VNC, Skype, TFTP, telnet, SFTP, FTP.]
Fig. 24. Classification accuracy of real-life flows with header removal
3) Classification Accuracy on Multi-Nature Flows: As ex-
plained in Section IV-F1, flows may include content of
different natures. To classify such flows based on their content
nature, we periodically buffer a flow's content and reclassify
the flow based on the buffered content. Hence, to evaluate
classification accuracy, we divide a given flow into windows
and consider each window a new flow. To evaluate Iustitia's
accuracy on multi-nature flows, we first label each window
based on its content to build our dataset. As each window
can include content of multiple natures, we label the window
based on the nature of the content that covers most of the
window. We then use cross-validation to evaluate Iustitia's
classification accuracy for different window sizes. The buffer
size in these experiments is 1K bytes.
Fig. 25(a) shows the accuracy results when the classifier is
trained with random 1K bytes of files. The accuracy for
different window sizes is, on average, 50.33% and 50.84%
for CART and SVM, respectively. Fig. 25(b) shows the
accuracy results when the classifier is trained with the first 1K
bytes of each window. The accuracy for different window sizes
improves significantly to 88.97% and 87.12% on average for
CART and SVM, respectively. The window size does not have
a significant impact on the accuracy results; it can be chosen
by the operator based on the Iustitia application, the
classification system load, and user demand to guarantee good
service and a reliable classifier.
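The windowing scheme above might be sketched as follows; `windows` and `majority_label` are hypothetical helpers illustrating, respectively, how a flow is split for periodic reclassification and how a window's ground-truth label is chosen when its content spans multiple natures:

```python
from collections import Counter

def windows(payload, window_size=4096):
    """Split a flow's content into fixed-size windows; each window is
    buffered and classified independently, so a multi-nature flow can
    change label as its content changes."""
    return [payload[i:i + window_size]
            for i in range(0, len(payload), window_size)]

def majority_label(byte_labels):
    """Ground-truth label for a window: the nature covering the largest
    share of it (per-byte labels are assumed known offline)."""
    return Counter(byte_labels).most_common(1)[0][0]
```

For instance, a 10,000-byte flow with a 4K window size yields three windows (two full, one partial), and a window whose content is mostly text is labeled text even if it contains some binary bytes.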
[Figure: classification accuracy vs. sample window size (1 KB to 16 KB) for CART and SVM with RBF kernel; (a) multi-nature flow classification accuracy by training with random 1K bytes of files, (b) by training with multi-nature real-life flows.]
Fig. 25. Classification accuracy of multi-nature real-life flows
VII. CONCLUSION
We make three major contributions in this paper. First,
we propose the first method for classifying flows as text,
binary, or encrypted at high speed with a small amount of
memory. Our method is mainly based on machine learning and
information theory techniques. Second, we extend our method
to perform finer-grained classification of network flows and
achieve high classification accuracy. Third, we implemented
our method and performed extensive experiments.
REFERENCES
[1] “Internet2 NetFlow Statistics,” http://netflow.internet2.edu/.
[2] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” SIGCOMM Comput. Commun. Rev., vol. 35, pp. 217–228, 2005.
[3] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing network-wide traffic anomalies,” in Proc. SIGCOMM, 2004.
[4] W. Lee and D. Xiang, “Information-theoretic measures for anomaly detection,” in Proc. IEEE S&P, 2001.
[5] Y. Zhang and V. Paxson, “Detecting backdoors,” in Proc. USENIX Security, 2000.
[6] H. Dreger, A. Feldmann, M. Mai, V. Paxson, and R. Sommer, “Dynamic application-layer protocol analysis for network intrusion detection,” in Proc. 15th USENIX Security Symposium, 2006.
[7] A. Dainotti, W. de Donato, and A. Pescape, “TIE: a community-oriented traffic classification platform,” in Proc. Traffic Monitoring and Analysis (TMA), 2009.
[8] S. Lee, S. Lee, H. Kim, D. Barman, C.-K. Kim, T. T. Kwon, and Y. Choi, “NeTraMark: a network traffic classification benchmark,” ACM SIGCOMM Comput. Commun. Rev., 2011.
[9] J. P. Early, C. E. Brodley, and C. Rosenberg, “Behavioral authentication of server flows,” in Proc. 19th Annual Computer Security Applications Conf. (ACSAC), 2003.
[10] C. V. Wright, F. Monrose, and G. M. Masson, “On inferring application protocol behaviors in encrypted network traffic,” Journal of Machine Learning Research, vol. 7, pp. 2745–2769, 2006.
[11] P. Dorfinger, G. Panholzer, and W. John, “Entropy estimation for real-time encrypted traffic identification,” in Proc. TMA, 2011.
[12] R. Alshammari and A. N. Zincir-Heywood, “Can encrypted traffic be identified without port numbers, IP addresses and payload inspection?” Computer Networks, vol. 55, no. 6, pp. 1326–1350, 2011.
[13] M. McDaniel and M. H. Heydari, “Content based file type detection algorithms,” in Proc. 36th Annual Hawaii Int. Conf. on System Sciences, 2003, pp. 323–333.
[14] W.-J. Li, K. Wang, S. J. Stolfo, and B. Herzog, “Fileprints: identifying file types by n-gram analysis,” in Proc. 2005 IEEE Workshop on Information Assurance, 2005.
[15] A. Finamore, M. Mellia, M. Meo, and D. Rossi, “KISS: stochastic packet inspection classifier for UDP traffic,” IEEE Transactions on Networking, to appear, 2010.
[16] Y. Wang, Z. Zhang, L. Guo, and S. Li, “Using entropy to classify traffic more deeply,” in Proc. IEEE Int. Conf. on Networking, Architecture, and Storage, 2011.
[17] K. Sayood, Introduction to Data Compression. Morgan Kaufmann, 1996.
[18] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Chapman & Hall, 1984.
[19] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[20] J. C. Platt, N. Cristianini, and J. Shawe-Taylor, “Large margin DAGs for multiclass classification,” in Advances in Neural Information Processing Systems. MIT Press, 2000.
[21] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.
[22] J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Trans. on Information Theory, vol. 37, pp. 145–151, 1991.
[23] P. Somol, P. Pudil, J. Novovicova, and P. Paclik, “Adaptive floating search methods in feature selection,” Pattern Recognition Letters, vol. 20, pp. 11–13, 1999.
[24] J. Postel and J. Reynolds, “File transfer protocol,” RFC 959, 1985.
[25] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang, “Data streaming algorithms for estimating entropy of network traffic,” in Proc. SIGMETRICS ’06, 2006, pp. 145–156.
[26] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating the frequency moments,” in Proc. ACM Symposium on Theory of Computing, 1996.
[27] “UMass Trace Repository,” http://traces.cs.umass.edu/.
[28] K. Suh, Y. Guo, J. Kurose, and D. Towsley, “Locating network monitors: complexity, heuristics, and coverage,” in Proc. INFOCOM 2005, vol. 1, pp. 351–361, 2005.
[29] “File Magic Number,” http://www.astro.keele.ac.uk/oldusers/rno/Computing/File magic.html.
[30] R. C. Miller and K. Bharat, “SPHINX: a framework for creating personal, site-specific web crawlers,” in Proc. WWW (http://www.cs.cmu.edu/rcm/websphinx/), 1998.
[31] “ooVoo LLC,” www.oovoo.com.
[32] “Virtual Network Computing (VNC),” www.realvnc.com.
Amir R. Khakpour received his BS in Electrical Engineering from the University of Tehran in 2005 and his MSc in Computer and Communication Networks from Telecom SudParis in 2007. He received a Ph.D. degree in Computer Science from Michigan State University in 2012. His research interests include the Internet, network security, and large-scale systems.
Alex X. Liu received his Ph.D. degree in computer science from the University of Texas at Austin in 2006. He received the IEEE & IFIP William C. Carter Award in 2004 and an NSF CAREER Award in 2009. He received the Withrow Distinguished Scholar Award in 2011 at Michigan State University. His research interests focus on networking, security, and dependable systems.