IDENTIFYING APPLICATION PROTOCOLS IN COMPUTER NETWORKS USING VERTEX PROFILES
By
Edward G. Allan, Jr.
A Thesis Submitted to the Graduate Faculty of
WAKE FOREST UNIVERSITY
in Partial Fulfillment of the Requirements
for the Degree of
MASTER OF SCIENCE
in the Department of Computer Science
December 2008
Winston-Salem, North Carolina
Approved By:
Errin W. Fulp, Ph.D., Advisor
Examining Committee:
David J. John, Ph.D., Chairperson
William H. Turkett, Jr., Ph.D.
Acknowledgements
This thesis is the product of many people’s labors, not just my own. The ideas contained in the pages that follow have been formulated and refined for over a year, with the guidance and support of several people, whose assistance I would be remiss not to mention. I would like to thank Wake Forest University and GreatWall Systems, Inc. for their support. This research was funded by GreatWall Systems, Inc. via the United States Department of Energy STTR grant DE-FG02-06ER86274. 1
I would also like to thank my parents for their support throughout my years at Wake Forest, both as an undergraduate and as a graduate student. Without their encouragement and financial assistance, none of this would have been possible. I also would not be where I am today without the help of my friends, who have made these past several years some of the most enjoyable and most memorable yet.
My thesis committee members, Dr. David John and Dr. William Turkett, Jr., were instrumental in providing me with feedback throughout the research and writing process. Their comments and criticism have undoubtedly enabled the success of this endeavor. I would especially like to thank Dr. Turkett for selflessly spending hours assisting me and stepping in as my “adopted advisor” during Dr. Errin Fulp’s sabbatical.
Last, but certainly not least, I must thank my advisor, Dr. Errin Fulp. I have been fortunate to work with him in a variety of contexts for more than five years now, and he has been a tremendous influence on both my personal and academic development. His relaxed personality and great sense of humor kept me off-task just enough to save my sanity, while his insight and guidance allowed me to complete my studies and be ready to move on to the next chapter in my life. Many thanks again to all who have helped me along the way — you are much appreciated.
1 The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DOE or the U.S. Government.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Issues in Network Management and Security . . . . . . . . . . . . . . 2
1.2 Current Methods of Network Analysis . . . . . . . . . . . . . . . . . . 2
1.2.1 Applications and Port Numbers . . . . . . . . . . . . . . . . . 3
1.2.2 Packet Inspection . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Interdisciplinary Study of Network Communications . . . . . . . . . . 4
1.3.1 Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Biological Networks and Motifs . . . . . . . . . . . . . . . . . 6
1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2 Computer Networks and Communications. . . . . . . . . . . . . . . 8
2.1 Network Topologies and Architectures . . . . . . . . . . . . . . . . . 8
2.2 Computer Network Reference Models . . . . . . . . . . . . . . . . . . 10
2.2.1 The OSI Model . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 The TCP/IP Model . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Layer 3: The Network Layer . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Layer 4: The Transport Layer . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Layer 7: The Application Layer . . . . . . . . . . . . . . . . . . . . . 14
Chapter 3 Graph Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Graph Terminology and Basic Properties . . . . . . . . . . . . . . . . 16
3.2 Types of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Traditional Graph Measures . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Distances and Path Lengths . . . . . . . . . . . . . . . . . . . 18
3.3.2 Centrality Measures . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.3 Clustering Coefficient . . . . . . . . . . . . . . . . . . . . . . . 21
3.3.4 Application of Traditional Graph Measures in Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4 Network Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 Definition of a Motif . . . . . . . . . . . . . . . . . . . . . . . 23
3.4.2 Function of Motifs . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Analysis of Application Graphs . . . . . . . . . . . . . . . . . . . . . 25
Chapter 4 Data Selection and Considerations . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Network Trace Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Challenges Associated with Network Data Collection . . . . . . . . . 26
4.2.1 Data Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Privacy and Sanitization of Data . . . . . . . . . . . . . . . . 28
4.2.3 Network and Data View . . . . . . . . . . . . . . . . . . . . . 29
4.3 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Dartmouth College Wireless Traces . . . . . . . . . . . . . . . 31
4.3.2 LBNL/ICSI Enterprise Tracing Program . . . . . . . . . . . . 31
4.3.3 OSDI Conference Network Traces . . . . . . . . . . . . . . . . 31
4.4 Protocol Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Chapter 5 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Hardware and Linux System . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Packet Capture and Storage . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Creation of Application Graphs . . . . . . . . . . . . . . . . . . . . . 37
5.4 Traditional Graph Measures . . . . . . . . . . . . . . . . . . . . . . . 39
5.5 Motif Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.6 Vertex Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.7 K-Nearest Neighbor Classification . . . . . . . . . . . . . . . . . . . . 44
5.7.1 Measuring Profile Separation . . . . . . . . . . . . . . . . . . 45
5.7.2 Cross Validation of Classification Results . . . . . . . . . . . . 46
5.8 Genetic Algorithm Feature Weighting . . . . . . . . . . . . . . . . . . 46
5.8.1 Overview of Genetic Algorithms . . . . . . . . . . . . . . . . . 47
5.8.2 Feature Weighting . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 6 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1 Preliminary Investigations . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 Initial Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2.1 Traditional Graph Measure Profiles . . . . . . . . . . . . . . . 51
6.2.2 Motif-based Profiles . . . . . . . . . . . . . . . . . . . . . . . . 54
6.3 Weighted Profiles and Key Attributes . . . . . . . . . . . . . . . . . . 57
6.3.1 Attribute Weights of Traditional Graph Measures . . . . . . . 58
6.3.2 Attribute Weights of Motif-based Measures . . . . . . . . . . . 59
6.4 Comparison of Profile Types . . . . . . . . . . . . . . . . . . . . . . . 61
6.5 Considerations for Optimizing Classifier Performance . . . . . . . . . 63
6.6 Limitations of Current Approach . . . . . . . . . . . . . . . . . . . . 66
Chapter 7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Appendix A Examples of Application Graphs . . . . . . . . . . . . . . . . . . . . . . 76
Appendix B Code Listings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Appendix C Test Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Appendix D Additional Classification Results . . . . . . . . . . . . . . . . . . . . . . 87
Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Illustrations
List of Tables
4.1 Summary statistics of three trace files examined . . . . . . . . . . . . 31
5.1 Graph orders for each application protocol . . . . . . . . . . . . . . . 38
6.1 Classification accuracy of 65 application graphs . . . . . . . . . . . . 50
6.2 An example confusion matrix with three classes . . . . . . . . . . . . 50
6.3 Confusion matrix of unweighted traditional graph measures . . . . . . 52
6.4 Number of single and multi-class ties for traditional graph measures 53
6.5 Confusion matrix of unweighted motif-based profiles . . . . . . . . . 55
6.6 Number of single and multi-class ties for motif-based profiles . . . . 55
6.7 Percentage of original data used in motif-based profiles . . . . . . . . 57
6.8 Attribute weights for traditional graph measures . . . . . . . . . . . 58
C.1 FANMOD test parameters . . . . . . . . . . . . . . . . . . . . . . . . 85
D.1 Confusion matrix of 65 application graphs using motif frequencies . . 87
D.2 Confusion matrix of weighted traditional graph measures . . . . . . . 87
D.3 Confusion matrix of weighted motif profiles . . . . . . . . . . . . . . 87
List of Figures
1.1 Example output from NetStat . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Graphical depiction of a social network with two distinctly visible clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 Four network topologies: bus, ring, star and mesh [1] . . . . . . . . . 9
2.2 The OSI and TCP/IP reference models [2] . . . . . . . . . . . . . . . 11
2.3 An IP datagram header [2] . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 UDP and TCP datagram headers [2] . . . . . . . . . . . . . . . . . . 14
2.5 Example communication between a client and a web server . . . . . 15
3.1 A graph with five nodes and five edges . . . . . . . . . . . . . . . . . 17
3.2 Schematic view of motif detection [3] . . . . . . . . . . . . . . . . . . 23
3.3 All 13 configurations of order 3 connected subgraphs [3] . . . . . . . 24
3.4 A feed-forward loop . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1 Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags 27
5.1 Overview of the proposed methodology and tools used . . . . . . . . 36
5.2 Storing packets from a pcap file into a MySQL database . . . . . . . 37
5.3 A motif with colored vertices . . . . . . . . . . . . . . . . . . . . . . 41
5.4 FANMOD edge-switching process for generating random networks [4] 42
5.5 Arrays representing vertex profiles . . . . . . . . . . . . . . . . . . . . 43
5.6 Single-point crossover of two binary strings . . . . . . . . . . . . . . 48
6.1 Profile collisions for traditional graph measures . . . . . . . . . . . . 54
6.2 Profile collisions for motif-based profiles . . . . . . . . . . . . . . . . 56
6.3 Depiction of three application graphs: HTTP, AIM and SSH . . . . . 57
6.4 Accuracy of unweighted vs. weighted traditional graph measure profiles 59
6.5 The ten highest-weighted motifs and their corresponding weights . . 60
6.6 Accuracy of unweighted vs. weighted motif-based profiles . . . . . . 61
6.7 Accuracy comparison of unweighted profile types . . . . . . . . . . . 62
6.8 Accuracy of single attribute classification . . . . . . . . . . . . . . . 64
6.9 Comparison of profile types as the size of the training set increases . 65
A.1 Application graphs depicting AIM communications . . . . . . . . . . 76
A.2 Application graphs depicting DNS communications . . . . . . . . . . 76
A.3 Application graphs depicting HTTP communications . . . . . . . . . 76
A.4 Application graphs depicting Kazaa communications . . . . . . . . . 77
A.5 Application graphs depicting MSDS communications . . . . . . . . . 77
A.6 Application graphs depicting Netbios communications . . . . . . . . 77
A.7 Application graphs depicting SSH communications . . . . . . . . . . 77
Abbreviations
Acronyms
AIM - AOL Instant Messenger™
API - Application Programming Interface
AUP - Acceptable Use Policy
DNS - Domain Name Service
FFL - Feed-forward loop
HTTP - HyperText Transfer Protocol
IANA - Internet Assigned Numbers Authority
IDS - Intrusion Detection System
IP - Internet Protocol
MSDS - Microsoft Directory Share
OSI - Open Systems Interconnection
P2P - Peer-to-peer
SANS™ - SysAdmin, Audit, Networking, and Security
SMTP - Simple Mail Transfer Protocol
SSH - Secure Shell
TCP - Transmission Control Protocol
UDP - User Datagram Protocol
VoIP - Voice over IP
Symbols
|V| is the number of vertices in a graph
eij is an edge from vertex i to vertex j
deg(v) is the degree of vertex v
id(v) is the indegree of vertex v
od(v) is the outdegree of vertex v
N(v) is the set of nodes in the neighborhood of vertex v
e(v) is the eccentricity of vertex v
rad(G) is the radius of graph G
diam(G) is the diameter of graph G
d(u, v) is the distance between vertex u and vertex v
CD(v) is the degree centrality of vertex v
CB(v) is the betweenness centrality of vertex v
CC(v) is the closeness centrality of vertex v
xi is the eigenvector centrality of vertex i
C(v) is the clustering coefficient of vertex v
φ is a port number associated with an application (e.g., 80 for HTTP)
Abstract
Edward G. Allan, Jr.
Identifying Application Protocols in Computer
Networks Using Vertex Profiles
Thesis under the direction of Errin W. Fulp, Ph.D., Associate Professor of Computer Science
Security and management of computer network resources exemplify two critical activities that challenge system administrators. They face potential threats from outside intruders as well as internal users who already have access to the organization’s assets. It is imperative that administrators are aware of what applications are being executed, but the use of data encryption techniques and non-standard port numbers presents difficulties that must be overcome.
To that end, this thesis introduces a novel method to identify application protocols based on the analysis of application graphs, which model application-level communications between computers. The performance of two types of node descriptions, called vertex profiles, is compared. “Traditional” vertex profiles characterize each node using several well-studied graph measures. Furthermore, this work uniquely applies motif-based analysis, which has previously been used primarily in systems biology, to the study of application graphs by creating a second type of vertex profile based on a node’s participation in statistically significant motifs. Machine learning techniques are employed to evaluate the importance of specific profile features. The experimental results, using a nearest-neighbor classifier, show that this type of analysis can correctly classify the applications observed with greater than 80% accuracy.
Chapter 1: Introduction
Managing and securing today’s critical data networks is a daunting and expensive
task. According to INPUT [5], demand for vendor-furnished information systems
and services by the U.S. government will increase from $71.9 billion in 2008 to $87.8
billion in 2013. This money funds such tasks as system modernization, information
sharing, IT management and information security. As computer networks increase in
size, speed and complexity, and malicious hackers develop more sophisticated attacks,
traditional methods of managing and securing these networks begin to break down.
This thesis proposes a novel approach to identifying the actions of hosts within a
network by examining the properties of application graphs, which model the social and
functional interactions of hosts with one another at the software application level (e.g.
HTTP, FTP, etc.). With the aid of machine learning techniques and algorithms, this
method exploits graph characteristics of each host in the application graph, such as its
connectedness, its position in the graph and the shapes of the subgraphs in which it is
found. One distinct advantage to this approach is that classification can be performed
“in the dark”, meaning that the packet payloads are either unavailable or have been
encrypted, rendering deep packet inspection futile. Knowing what activities users on
the network are participating in is crucial to network administrators who must manage
bandwidth allocations, network configurations, performance and security and access
policies. The following sections of this chapter provide background information and
motivation for the study.
1.1 Issues in Network Management and Security
To protect itself from litigation and to help ensure the integrity of its network, an
organization (such as a school, business, or government) will often develop an Accept-
able Use Policy, or AUP. An AUP defines what behaviors are acceptable for internet
browsing, what applications can be run by users and other relevant guidelines for
usage. The SANS Security Policy Project [6] provides several resources and tem-
plates for such policies. Take, for example, a policy that does not allow users to run
a personal web server using an organization’s computing resources. Identifying such
behavior can help to preserve network bandwidth that is otherwise used for legitimate
business activities.
Not only can failure to comply with an organization’s AUP waste computing
resources, it can also have serious security implications as well. Continuing with
the example above, running an improperly configured web server or hosting insecure
web application files gives an attacker an easy point of entry into the network. A
study performed by MITRE from 2001-2006 notes a sharp increase in the number of
public reports for vulnerabilities that are specific to web applications [7]. For several
years buffer overflow attacks had been the most common, but were overtaken in 2005
by web application vulnerabilities such as SQL injection, cross-site scripting (XSS)
and remote file inclusion. It is, therefore, in a network administrator’s best interest
to ensure that the network is properly utilized in accordance with the policies and
guidelines adopted by the organization.
1.2 Current Methods of Network Analysis
Several tools allow system administrators to determine which applications are being
used on a network. This information assists them in the maintenance and protection
of networked systems. Sophisticated users, however, are able to hide their activities,
which could potentially include actions that are against the organization’s AUP, or
worse yet, are illegal. This section examines a few of the tools used by administrators
and identifies some of their weaknesses.
1.2.1 Applications and Port Numbers
When data is sent to a computer over a network, the destination port number identifies
which application on the host computer should receive and process the data. Many
applications use port numbers specified by the Internet Assigned Numbers Authority
[8]. For example, FTP servers use ports 20 and 21, while web servers use port 80
by default. NetStat is a command line tool that shows information about network
connections, both incoming and outgoing [9]. Figure 1.1 demonstrates the output of
the NetStat command.
$ netstat -ta
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address    Foreign Address  State
tcp        0      0 localhost:2208   *:*              LISTEN
tcp        0      0 *:sunrpc         *:*              LISTEN
tcp        0      0 *:auth           *:*              LISTEN
tcp        0      0 *:35763          *:*              LISTEN
tcp        0      0 localhost:ipp    *:*              LISTEN
tcp        0      0 localhost:smtp   *:*              LISTEN
tcp        0      0 localhost:36699  *:*              LISTEN
tcp6       0      0 *:ssh            *:*              LISTEN
Figure 1.1: Example output from NetStat
Network administrators could look and see that a host on the network is listening
on port 80, indicating the presence of a web server. The administrator could then
shut down that service and take appropriate disciplinary action toward the user.
The problem with this method of detecting network applications is that while many
do run on a known port number, they do not necessarily have to. If a web server
were reconfigured to listen for connections on port 6000, clients could still connect
to it through their web browser by typing http://www.example.com:6000. A user
wishing to hide their activities might attempt to disguise an application by using such
a non-standard port number. Chapter 2 describes port numbers and other networking
concepts in more detail.
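The weakness of the port-number heuristic can be sketched in a few lines of Python. The port table and function below are illustrative only, not part of the thesis methodology: a lookup against a subset of the IANA well-known ports identifies a default-configured server, but fails as soon as a non-standard port is chosen.

```python
# Minimal sketch of port-based application identification.
# WELL_KNOWN_PORTS is a small illustrative subset of the IANA registry.
WELL_KNOWN_PORTS = {
    20: "ftp-data", 21: "ftp", 22: "ssh", 25: "smtp",
    53: "dns", 80: "http", 443: "https",
}

def identify_by_port(dst_port):
    """Guess the application from the destination port alone."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")

print(identify_by_port(80))    # "http": a default web server is identified
print(identify_by_port(6000))  # "unknown": the same server on port 6000 is not
```

The second call is exactly the evasion described above: moving the web server to port 6000 makes it invisible to any classifier that consults only the port table.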
1.2.2 Packet Inspection
Another method of detecting network applications is to scrutinize the data contained
in each packet as it traverses the network. Packets contain information such as HTTP
requests, email headers and MP3 filename searches, as well as protocol-specific session
initiations and version numbers that can be used to identify a particular application.
Wireshark is a popular network protocol analyzer that has several useful features for
viewing packet contents, reassembling sessions and gathering statistics about network
data [10]. Packet inspection is commonly used in intrusion detection systems (IDS)
such as Snort [11]. A rule-based engine searches packet data, compares it against
a list of known attacks and generates a predefined response (such as notifying an
administrator). The problem with packet inspection is that traffic is increasingly
encrypted. Data payloads that have been transformed into ciphertext are not human-
readable until they are decrypted with the appropriate key, nor do the payloads match
the known attack strings in the case of an IDS.
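This limitation can be caricatured with a toy signature matcher. The rules and payloads below are invented for illustration, and real IDS engines such as Snort are far more sophisticated, but the failure mode is the same: once the payload is encrypted, the substring a rule depends on no longer appears.

```python
# Toy signature-based inspection: flag payloads containing known attack strings.
SIGNATURES = [b"GET /admin.php", b"' OR '1'='1"]  # invented example rules

def inspect(payload: bytes) -> bool:
    """Return True if any known attack signature appears in the payload."""
    return any(sig in payload for sig in SIGNATURES)

plaintext = b"GET /admin.php HTTP/1.1"
ciphertext = bytes(b ^ 0x5A for b in plaintext)  # toy stand-in for encryption

print(inspect(plaintext))   # True: the attack string is visible in cleartext
print(inspect(ciphertext))  # False: the transformed bytes match no signature
```

The XOR here merely stands in for real encryption; the point is that any transformation of the payload defeats substring rules, which is precisely why this thesis looks to communication patterns instead of packet contents.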
1.3 Interdisciplinary Study of Network Communications
It is therefore the goal of this study to look beyond current methods for identifying
network behavior and propose a novel approach that relies upon high level commu-
nication patterns observed among hosts. To accomplish this goal, this study borrows
ideas and algorithms from several disciplines. Networks are not unique to computer
science; they exist in mathematics, sociology, biology, communications and other ar-
eas of study as well. A graph, a collection of objects (sometimes called nodes) linked
by edges, is the abstract model that allows for the analysis of any type of network.
They can represent relationships among friends, the interaction of biological entities
in a transcriptional regulation network, the collaboration between authors of research
papers [12], as well as a myriad of other problem spaces. Chapter 3 illustrates the
properties of graphs in more depth.
1.3.1 Social Networks
One key area of study that this thesis borrows from is social network analysis, which
focuses on relationships among social entities (also known as actors), and on the
patterns and implications of these relationships [13]. The properties of social graphs
reveal interesting information such as the spread of disease or material goods through
the network, as well as what actors are “influential” (politically, socially, etc.). Social
network analysis also has military and intelligence applications. Yang and Ng provide
visualizations and analysis of weblog social networks related to terrorism and other
crime-related matters [14].
To provide a simple working example of social network analysis, Figure 1.2 depicts
the author’s social network of friendships taken from the popular social networking
web site Facebook™. There are two clearly visible “clusters” of friends in
the graph, created by nodes in each cluster sharing many common links with other
nodes in the cluster. In the context of this social network, it means that many of
the author’s friends in each group are also friends with each other. The group on the
left is primarily composed of relationships formed during the author’s tenure at Wake
Forest University, while the cluster on the right is primarily composed of relationships
formed prior to and during high school.
Several concepts pertaining to social networks can be extended to the study of
application graphs performed in this work.

Figure 1.2: Graphical depiction of a social network with two distinctly visible clusters

Application graphs model the social relationships between clients and servers in a computer network by showing with which
web servers users choose to interact, with whom they communicate via instant mes-
saging clients and with whom they choose to share files. For example, the application
graph for AOL Instant Messenger™ might show several chat clients communicating
with a central chat server, which then passes messages along to the intended recipi-
ents. Characteristics of these high-level interactions are used to identify the software
application through which the communication occurs. Section 3.3 elaborates upon
the graph measures frequently used to quantify aspects of social networks.
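A rough sketch of how such an application graph might be assembled from flow records follows. The records and field names are hypothetical; the actual pipeline used in this thesis (trace files parsed into a MySQL database) is detailed in Chapter 5. Flows are filtered by a port number φ associated with the application, and each retained flow contributes a directed edge eij from initiator to responder.

```python
# Sketch: build the directed application graph for one protocol by keeping
# only flows whose destination port matches phi (e.g. 80 for HTTP).
flows = [  # (src_ip, dst_ip, dst_port) -- invented example records
    ("10.0.0.1", "10.0.0.9", 80),
    ("10.0.0.2", "10.0.0.9", 80),
    ("10.0.0.1", "10.0.0.5", 22),
]

def application_graph(flows, phi):
    """Return the edge set {e_ij} of the application graph for port phi."""
    return {(src, dst) for src, dst, port in flows if port == phi}

http_graph = application_graph(flows, 80)
print(sorted(http_graph))
# Both clients link to the same host: a small star around the server 10.0.0.9
```

Even this tiny example shows the social structure the vertex profiles later exploit: the HTTP graph forms a star around one server, while the lone SSH flow yields a single isolated edge.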
1.3.2 Biological Networks and Motifs
The study of biological networks is another key field from which ideas for this thesis
are borrowed. Cellular processes are regulated by the interactions of several molecules
such as proteins and DNA [15]. These complex interactions can be modeled as graphs.
One particular method used to analyze these graphs is to search within them for mo-
tifs: recurring, significant patterns of interconnections. Milo et al. find motifs in
several types of networks including biochemistry, neurobiology, ecology and engineer-
ing. They suggest that motifs are the basic structural elements capable of defining
broad classes of networks [3].
Motif analysis is often used in biology [3, 16, 17, 18], but has not yet been applied
to application graphs. One goal of this study is to determine if a motif or groups of
motifs can help identify what application a computer is using. It finds that several
protocols use similar motifs, partly due to the fact that many applications have a
client-server architecture (described in Section 2.1). However, there is still enough
distinction in how the applications are used at a social level to determine what they
are based on the models developed in this work. Chapter 6 discusses some of the
motifs found in application graphs.
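As a toy illustration of motif counting (a brute-force stand-in for the FANMOD tool used later in this thesis, not the thesis's actual procedure), the snippet below counts feed-forward loops, one of the 13 connected order-3 subgraph configurations, in a small invented directed graph.

```python
from itertools import permutations

# Directed edges of a small example graph (invented for illustration).
edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
nodes = {u for e in edges for u in e}

def count_ffl(nodes, edges):
    """Count feed-forward loops: x->y, y->z, x->z with no reverse edges
    among the three vertices (the induced subgraph is exactly the FFL)."""
    count = 0
    for x, y, z in permutations(nodes, 3):
        present = {(x, y), (y, z), (x, z)} <= edges
        extra = {(y, x), (z, y), (z, x)} & edges
        if present and not extra:
            count += 1
    return count

print(count_ffl(nodes, edges))  # 1: the single FFL a->b, b->c, a->c
```

This exhaustive enumeration is only feasible for tiny graphs; for the application graphs studied here, efficient samplers like FANMOD, along with randomized null models, are needed to judge which subgraph counts are statistically significant.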
1.4 Outline
The following is an outline of the remaining parts of this thesis. Chapter 2 covers
information regarding computer networks, the different reference models and details
the network layers used to create application graphs. Chapter 3 introduces several
concepts relating to graph theory, “traditional” measurement techniques of graphs
and provides more information about motifs. Data sources and application protocol
selection are covered in Chapter 4. Chapter 5 specifies the tools used in this thesis
and introduces machine learning techniques used for the modeling and classification
of application types. A discussion of the results obtained and an analysis of key
motifs and graph metrics is handled in Chapter 6, as well as a comparison between
traditional graph measures and a motif-based approach. Finally, Chapter 7 concludes
this study and explores possible topics for future research.
Chapter 2: Computer Networks and
Communications
Undoubtedly, the interconnection of computers and networks to the World Wide Web
has increased mankind’s ability to share information, perform research and become
more efficient at everyday tasks. However, not all users have benign intentions. Illegal
hacking, cyber terrorism and fraud wreak havoc on governments, corporations and
individuals alike. Data encryption is often used to shield malicious as well as
legitimate activity from observation. By exploring the communication patterns
found within networks, this study shows that it is still possible to gain some insight
into what applications are being utilized. The following sections introduce several
basic concepts related to network architectures, protocols and applications.
2.1 Network Topologies and Architectures
Network topologies describe the arrangement and mapping of networked elements,
such as computers, printers, wires and routers. Mappings can be physical or logical.
Physical topology describes where the elements are actually located and how they are
interconnected with wires. Logical topology, on the other hand, refers to the path
data appears to take when traveling from one network host to another [1]. A network’s
logical topology might be very different from the underlying physical topology, but it
is bound by the network protocols that direct how the data moves across the network.
Application graphs are a generalization of logical topologies in that they provide a
picture of how data moves between hosts, but from a very high-level view.
There are several shapes used to describe network topologies including bus, tree,
star, mesh and ring.

Figure 2.1: Four network topologies: bus, ring, star and mesh [1]

In the case of a physical network, these shapes have an impact on
network performance, reliability and ease of management. For example, a bus network
is cost-effective and easy to implement, but the architecture can only support a limited
number of hosts and a bad cable will bring down the entire network. A star network
allows for the isolation of the periphery nodes, but the central hub might be a single
point of failure for the network. Logical topologies show the exchange of information
between entities that are not physically connected by the network infrastructure. For
example, IBM’s Token Ring network technology is a logical ring but is physically
wired in a star topology.
In terms of software application models, two prevalent architectures are found
in computer networks: the client-server model and peer-to-peer (P2P) architectures.
In the client-server model, a client machine is responsible for initiating a request to
some application running on another computer. The server waits for an incoming
request from a client and then sends a response. Client-server architecture allows for
computing responsibilities to be divided up among servers in the network, where one
computer might act as a web server, another as an email server and so on. While the
data sent between the client and server might go through several network devices, the
logical data flow is a single link between the two nodes. A star network could then be
induced by several clients connecting to a common server (see Figure 2.1). In a P2P
network, nodes both initiate and respond to requests from other computers on the
network known as peers. Consequently, the logical topology of such interactions could
form a mesh network. This study examines the characteristics of logical topologies
extended to the application layer, modeled as application graphs.
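The structural contrast between the two architectures can be made concrete with in-degree counts on toy logical topologies (the graphs below are invented examples, with edges pointing from initiator to responder): a client-server star concentrates in-degree on the server, while a P2P mesh spreads it across the peers.

```python
from collections import Counter

# Toy logical topologies: edges point from the initiator to the responder.
client_server = [("c1", "s"), ("c2", "s"), ("c3", "s"), ("c4", "s")]
p2p = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p1", "p3")]

def indegrees(edges):
    """In-degree id(v) of every vertex that receives at least one edge."""
    return Counter(dst for _, dst in edges)

print(indegrees(client_server))  # the server 's' collects every edge
print(indegrees(p2p))            # in-degree is spread across the peers
```

This asymmetry in degree is one of the simplest per-vertex features distinguishing the two architectures, and it foreshadows the richer vertex profiles developed in Chapter 5.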
2.2 Computer Network Reference Models
Application graphs are created using information from several layers of the network
communication process. Data goes through a series of transformations before being
sent to its destination, including breaking the data into manageable fragment sizes,
adding quality of service information, specifying how the data should be transmitted
and converting it into the electrical pulses that traverse the wire. Three layers in
particular are of interest: the network, transport and application layers, described in
Sections 2.3–2.5.
There are two fundamental models referenced when describing network layers: the
OSI model and the TCP/IP model. The protocols (rules that govern the syntax and
meaning of data sent between entities) associated with the OSI model are rarely used,
but the features described at each layer are still important. In contrast, the TCP/IP
model is not as rigidly defined as the OSI model, but the protocols associated with
it are widely used [2]. This section provides an overview of these models, depicted in
Figure 2.2.
2.2.1 The OSI Model
The Open Systems Interconnection Basic Reference Model (OSI Model) was designed
to promote international standardization of the protocols used in communication
networks. There are seven layers in this model: the physical layer, data link layer,
network layer, transport layer, session layer, presentation layer and application layer
[19]. The physical layer deals with representing and transmitting raw bits over a
communication channel. Well known examples include Ethernet over twisted pair
(10BASE-T, 100BASE-TX) and the 802.11a/b/g wireless standards. The task of the
data link layer is to correct transmission errors from the physical layer and provide
the means to enable point-to-point communication between hosts within a local area
network. This layer arranges data into frames and also provides medium access control
to share communication channels between multiple users.
The network layer determines how packets are routed from the source to the desti-
nation, allows the interconnection of heterogeneous networks and provides congestion
control. The next layer in the model, the transport layer, provides logical commu-
nication between processes on the hosts and is the first true end-to-end layer in the
model. The session and presentation layers are not generally used; their intent is
to provide session management between hosts, synchronization, interruption recovery
and “on the wire” management of abstract data structures. The final layer in the
OSI model is the application layer. This is the layer at which a user directly interacts
with the program (a web browser, for example) that sends network data.
Figure 2.2: The OSI and TCP/IP reference models [2]
2.2.2 The TCP/IP Model
First proposed in 1974, the TCP/IP model [20] presents a slightly different view of
network communications with four layers that are not as strictly defined as those
in the OSI model. Whereas the OSI model was developed before the associated
protocols, the TCP/IP model was developed based on protocols that already existed,
taking its name from its two key protocols. The host-to-network layer is somewhat
ill-defined and does not specify the protocols necessary for a host to send packets to
the internet layer. It combines elements of the OSI model’s physical and data link
layers. The internet layer is analogous to layer 3 of the OSI model. Familiar protocols
like IP (Internet Protocol) and ICMP (Internet Control Message Protocol) are a part
of this layer.
The third layer of the TCP/IP model is the transport layer, which maps directly
to the transport layer of the OSI model. It allows for end-to-end communication of
hosts on a network, using the TCP (Transmission Control) and UDP (User Datagram)
protocols. A need for the session and presentation layers was not perceived, so the
TCP/IP reference model does not contain them explicitly. The fourth layer, the
application layer, will contain them if necessary. This layer contains all of the high
level protocols such as HTTP, SMTP and DNS.
Although there are certainly similarities between several layers of the two reference
models, this paper will use OSI model terminology, which allows a finer distinction
to be drawn between the network services offered at each layer. The important lower-level
protocols for application graphs, however, are those that were originally associated
with the TCP/IP model, namely TCP and UDP.
2.3 Layer 3: The Network Layer
The network layer is concerned primarily with delivering packets from one host to
another through a series of routers. It attempts to maintain some quality of service
for variables such as delay, transit time and jitter while forwarding packets along until
the destination is reached.
Figure 2.3: An IP datagram header [2]
Figure 2.3 shows all of the fields contained in the header of an IP data packet.
For modeling network communications, however, only two fields are of interest: the
source address and the destination address. Each IP address identifies a unique node
in an application graph. The protocol field tells the network layer which transport
process to give the data to. Two common options are TCP and UDP, described next.
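As a concrete illustration, the following minimal sketch (an assumption of this write-up, not code from the thesis) unpacks exactly these fields of interest from a raw IPv4 header using only the Python standard library:

```python
import socket
import struct

def parse_ipv4_header(raw: bytes):
    """Unpack the fixed 20-byte IPv4 header (RFC 791 field layout) and
    return only the fields needed for application graphs: the transport
    protocol number, the source address and the destination address."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return proto, socket.inet_ntoa(src), socket.inet_ntoa(dst)
```

Protocol number 6 denotes TCP and 17 denotes UDP; IP options, fragmentation and checksum validation are deliberately ignored in this sketch.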
2.4 Layer 4: The Transport Layer
The transport layer is responsible for getting data to and from applications running
on the host machine, providing logical end-to-end communication between the appli-
cations. There are two types of service available to the upper layers, connectionless or
connection-oriented. The simpler of the two is connectionless, implemented by UDP.
The delivery and ordering of UDP packets is unreliable, but there is less connection
overhead associated with the transfer. Connection-oriented service, provided by TCP,
establishes several properties of the transmission ahead of time, such as data window
sizes and congestion control mechanisms. TCP packets are given sequence numbers
that are kept in order. Although IP networks are still only “best-effort” as no re-
sources are reserved ahead of time, TCP provides reliable communication between
hosts.
(a) UDP header
(b) TCP header
Figure 2.4: UDP and TCP datagram headers [2]
TCP and UDP headers (Figure 2.4) contain fields for the source and destination
port numbers. Port numbers serve as numerical identifiers for processes. They are 16
bits in length, resulting in 2^16 = 65536 possible ports, numbered 0 through 65535. The Internet
Assigned Numbers Authority (IANA) is responsible for maintaining assignments of
port numbers for specific uses [8].
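Because both transport headers place the two 16-bit port fields first, in network byte order, a single unpack recovers them for either protocol. A small sketch under that observation:

```python
import struct

def parse_ports(segment: bytes):
    """Return (source port, destination port) from the first four bytes
    of a TCP or UDP header; both protocols lay out the two 16-bit port
    fields first, so one unpack handles either transport."""
    src_port, dst_port = struct.unpack("!HH", segment[:4])
    return src_port, dst_port
```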
2.5 Layer 7: The Application Layer
The primary objective of this thesis is to identify application usage via communication
patterns at the application layer. Although not 100% accurate, port numbers are used
as the application labeling scheme for training the application classifier, described in
Chapter 5. Some applications communicate on certain port numbers with a high
degree of reliability. For example, when a user opens a web browser and requests
a web page, a connection is established from a randomly assigned upper port number
on the user’s computer to port 80 of the web server hosting the page. In this
case, Hypertext Transfer Protocol (HTTP) is the layer 7 application protocol used,
with the web server listening for connections on port 80, the IANA official port for
the HTTP protocol. This process is depicted in Figure 2.5.
Source                        Destination
192.168.1.100:29985    →      208.122.19.56:80        User requests a web document
208.122.19.56:80       →      192.168.1.100:29985     Server responds to request
Figure 2.5: Example communication between a client and a web server
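The port-based labeling rule can be sketched as follows; the table and function names are this write-up's assumptions for illustration, not the thesis implementation. A flow is tagged with whichever endpoint's port appears in a table of IANA assignments, the other endpoint being treated as the client's ephemeral port:

```python
# A few IANA well-known port assignments (a small subset, for illustration).
IANA_PORTS = {21: "FTP", 22: "SSH", 25: "SMTP", 53: "DNS",
              80: "HTTP", 110: "POP3", 443: "HTTPS"}

def label_flow(src_port: int, dst_port: int) -> str:
    """Label a flow by the registered port of either endpoint; flows
    with no registered endpoint are left unlabeled."""
    for port in (dst_port, src_port):
        if port in IANA_PORTS:
            return IANA_PORTS[port]
    return "UNKNOWN"
```

For the exchange in Figure 2.5, both directions map to the same label: `label_flow(29985, 80)` and `label_flow(80, 29985)` both return `"HTTP"`.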
There is no shortage of application layer protocols. Common examples include
SMTP or POP3 for email services, DNS for domain name resolution, peer-to-peer
protocols like BitTorrent and many others. This study focuses on seven applications
that reflect a variety of application types and also have official port assignments from
the IANA. Protocol selection is detailed in Chapter 4, while the steps taken to create
application graphs based on the layer 3, 4 and 7 information are detailed in Section
5.3.
Chapter 3: Graph Analysis
Graphs are a well-studied concept in mathematics, dating back to Leonhard Euler's
1736 analysis of the Seven Bridges of Königsberg, which laid many of the foundations
of graph theory [21]. Simply put, graphs are a collection of objects with
connections between them. These abstract structures model problems in a variety
of areas, including logistics, communication systems, biological and chemical com-
pounds and social-group structures [22]. The first part of this chapter reviews the
basic concepts and terminology required by the study of application graphs and then
introduces several “traditional” measures used to describe graphs. In the latter half
of this chapter, network motifs are defined in terms of their graph characteristics and
are related to application graphs.
3.1 Graph Terminology and Basic Properties
Unfortunately, some of the mathematical notation used in graph theory tends to differ
from text to text. Many of the basic properties and definitions are standard, but for
those that are not, this thesis borrows notation primarily from two sources: Chartrand
and Zhang [23], and Busacker and Saaty [22]. Abbreviations and function-like syntax
replace many Greek letters in this style of notation to avoid confusion. For example,
x(G) indicates that x is a property of the entire graph, whereas y(v) indicates y is a
property local to a particular vertex.
Vertices (or nodes, as they are often called in computer science) are the funda-
mental units in a graph. They can represent any object, such as a person, process,
city, or a computer. Vertices are linked together by edges, which show a relationship
between the vertices they connect. Some examples include roads connecting cities,
social interactions between people, or physical links between computers in a network.
A graph is a collection of vertices and edges taken together. Formally, a graph G con-
sists of a finite, non-empty set of vertices V , connected by a set of edges E, written
as G = (V, E). This definition implies that a graph must have at least one vertex in
it, but it does not necessarily have to contain any edges.
Figure 3.1: A graph with five nodes and five edges
The set of vertices V is written V = {v0, v1, . . . , vk}. The cardinality of this set,
| V |, is the order, or number of nodes in the graph. A graph’s edge set is defined as
E ⊆ {{u, v} | u, v ∈ V }. For brevity, an edge can be written eij to mean an edge
linking node i to node j. |E | is the number of edges in the graph, known as its size.
The degree of a node, deg(v), is the number of nodes that v is adjacent to in the graph
(those that can be reached by traversing one edge). This set of nodes is known as
N(v), the neighborhood of v. In Figure 3.1, nodes 2 and 3 are adjacent to node 1, and
N(1) = {2, 3}.
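These definitions map directly onto an adjacency-set representation. The sketch below uses an edge set consistent with the text's description of Figure 3.1 (five nodes, five edges, N(1) = {2, 3}); the exact edges are an assumption, since the figure itself is not reproduced here:

```python
# Edge set consistent with the description of Figure 3.1 (an assumption;
# the figure itself is not reproduced in this transcript).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

def build_graph(edges):
    """Adjacency-set representation of an undirected graph G = (V, E)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

G = build_graph(EDGES)
# The neighborhood N(v) is G[v]; the degree deg(v) is its cardinality.
```

With this representation, `G[1]` yields `{2, 3}`, matching N(1) above, and the order and size of the graph are `len(G)` and the halved sum of the degrees.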
3.2 Types of Graphs
Modeling complex systems often requires more detail than just nodes and edges as
described in the previous section. One possible approach is to orient the graph to
show asymmetric relationships between objects. In an undirected graph, the edges
are pairs of unordered vertices, that is, eij = eji. The edges in a directed graph,
however, are ordered pairs, and eij ≠ eji. The degree measure can be extended to
include the indegree, id(v), and outdegree, od(v), of a vertex to describe the number
of vertices of G from which v is adjacent, and the number of vertices in G to which
v is adjacent, respectively. The associated undirected graph of a directed graph is
obtained by disregarding the ordering of the end points of each edge.
The assembly line process for building an automobile can be modeled as a directed
graph, where each stage of the process is represented by a node in the graph. The
directionality of the edges indicates that each step follows in a specified order and that
the process cannot happen in reverse. Edges of a graph can be weighted,
usually with an integer or real number, to imply a “cost” associated with traversing
an edge, or to further describe how the edge is used within the overall system. In
the auto assembly line graph, an edge weight could represent the amount of time a
particular step in the process takes.
3.3 Traditional Graph Measures
Several graph measures exist to describe the structure of a network, such as how
connected a vertex is, its distance from other vertices, and how it is positioned in
the graph. These measures have been used to characterize many different types of
networks and describe their growth patterns [24]. The following sections define the
measures selected for this study and provide examples of several of the concepts.
3.3.1 Distances and Path Lengths
The distance between two nodes u and v, written d(u, v), is the length of the shortest
path between them. In an unweighted graph, this is equal to the number of edges in
the path. In a weighted graph, the length of a path P is Σ w(e) for e ∈ P. Dijkstra's
algorithm [25] is one common method for determining this path through a network.
For a vertex v in a connected graph, the eccentricity of v, e(v), is the distance
between v and a vertex farthest from v in G. The radius of a graph rad(G) =
min{e(v) | ∀ v ∈ V } and the diameter diam(G) = max{e(v) | ∀ v ∈ V }. A vertex is
said to be central if e(v) = rad(G) and periphery if e(v) = diam(G).
In Figure 3.1 (reproduced above for convenience), e(1) = 3 because node 5 is the
node farthest away from node 1 in the graph and requires traversing three edges to
reach it. The radius of the graph rad(G) = 2 because e(1) = e(2) = e(5) = 3, but
e(3) = e(4) = 2. Also, diam(G) = 3, the maximal eccentricity value of all nodes in
the graph. According to the definitions above, nodes 3 and 4 are central, while nodes
1, 2 and 5 are said to be periphery nodes.
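The eccentricity calculation can be sketched with one breadth-first search per vertex; the edge set below is an assumption consistent with the text's description of Figure 3.1:

```python
from collections import deque

# Edge set consistent with the description of Figure 3.1 (an assumption).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def distances(adj, s):
    """Breadth-first search: unweighted shortest-path distances from s."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

ecc  = {v: max(distances(adj, v).values()) for v in adj}   # e(v)
rad  = min(ecc.values())                                   # rad(G)
diam = max(ecc.values())                                   # diam(G)
central   = {v for v, e in ecc.items() if e == rad}
periphery = {v for v, e in ecc.items() if e == diam}
```

This reproduces the worked values above: rad(G) = 2, diam(G) = 3, central vertices {3, 4} and periphery vertices {1, 2, 5}.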
3.3.2 Centrality Measures
It is helpful to describe the centrality measures of a graph in terms of social networks in
order to make an analogy: the centrality measures of a vertex indicate how important,
prominent, or powerful the vertex is in a graph. The following is a brief examination
of four common centrality measures proposed by Freeman and Bonacich [26, 27]. The
most basic of these is degree centrality, or CD(v), defined as deg(v)/(|V| − 1). This equation
can be modified for directed networks to produce CDin and CDout. In terms of social
network analysis, indegree is interpreted as a measure of popularity, while outdegree
is interpreted as gregariousness. In a dense adjacency matrix representation of a
graph, the time required to calculate the degree centrality for all nodes is O(|V|²),
since all combinations of vertices must be considered.
Betweenness centrality is the fraction of shortest paths between all pairs of vertices
that pass through a particular vertex v. This measure is given by the equation:
CB(v) = Σ_{s ≠ v ≠ t ∈ V} δst(v) / δst        (3.1)
where δst is the number of shortest paths from s to t, and δst(v) is the number of
shortest paths from s to t that pass through v. A vertex with a higher betweenness
centrality lies on more shortest paths than a vertex with a lower value. This measure can
indicate how “powerful” a vertex is, because it influences the spread of information
through a network. O(|V|³) calculations are required to determine betweenness and
closeness (described next) using the Floyd-Warshall algorithm to find all shortest
paths.
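One standard way to obtain the δst counts is Brandes's dependency-accumulation algorithm, sketched below for unweighted graphs (the algorithm is not named in the text; this is an illustrative assumption). Because Equation (3.1) sums over ordered pairs (s, t), no halving step is applied for undirected graphs:

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality via Brandes's accumulation (unweighted
    graphs); sums over ordered pairs (s, t), matching Equation (3.1)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # vertices in BFS visit order
        pred  = {v: [] for v in adj}      # shortest-path predecessors
        sigma = {v: 0.0 for v in adj}     # number of shortest s-v paths
        dist  = {v: -1 for v in adj}
        sigma[s], dist[s] = 1.0, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = {v: 0.0 for v in adj}     # dependency accumulation
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

On an edge set consistent with the description of Figure 3.1 (an assumption), vertex 3 lies on the most shortest paths, followed by vertex 4; the periphery vertices have betweenness zero.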
Closeness centrality is defined as the average shortest path length between a vertex
v and all other vertices reachable from it. In network theory it is regarded as a measure
of how long it will take information to spread from one vertex to the other reachable
vertices in the graph. Closeness centrality is given by:
CC(v) = Σ_{t∈V} d(v, t) / (n − 1)        (3.2)
where n ≥ 2 is the number of vertices reachable from v. Vertices in G with shorter
paths to the other vertices yield smaller values of this average and are therefore the
most central; some authors define closeness as the reciprocal of this average so that
central vertices receive the highest scores.
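Equation (3.2) as written can be computed directly with one breadth-first search; a sketch, again using an edge set consistent with the description of Figure 3.1 as an assumption:

```python
from collections import deque

def closeness(adj, v):
    """Average shortest-path length from v to all vertices reachable
    from it, per Equation (3.2)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    n = len(dist)             # v plus the vertices reachable from it
    return sum(dist.values()) / (n - 1)
```

For that five-node graph, `closeness(adj, 3)` evaluates to 5/4 = 1.25 and `closeness(adj, 1)` to 7/4 = 1.75.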
The eigenvector centrality is a more sophisticated version of the degree count of
a vertex, acknowledging that not all connections within a network are equal. The
eigenvector centrality score of a vertex i is proportional to the average degree of
i’s neighbors. In social networks, this reflects the idea that people connected to
influential people will themselves be more influential than if they were connected to
less influential people [28]. If the graph is represented as an adjacency matrix A where
Aij = 1 if node i is connected to node j, and Aij = 0 otherwise, eigenvector centrality
can be written:
xi = (1/λ) Σ_{j=1}^{|V|} Aij xj ,        (3.3)
where λ is a constant and xi and xj are the centrality scores of vertices i and j.
Defining the vector of centralities x = (x1, x2, . . . ), the previous
equation can be rewritten as
λx = A · x (3.4)
To force the centralities to be non-negative, it can be shown that λ must be the largest
eigenvalue of A, and x the corresponding eigenvector [28].
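Equation (3.4) suggests computing x by power iteration: repeatedly multiplying by A and renormalizing converges to the eigenvector of the largest eigenvalue for a connected, non-bipartite graph. A sketch (the edge set is an assumption consistent with the description of Figure 3.1):

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration: x <- A x, renormalized each step, converging to
    the eigenvector of the largest eigenvalue of the adjacency matrix."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: sum(x[u] for u in adj[v]) for v in adj}   # one A·x step
        norm = max(nxt.values())
        x = {v: score / norm for v, score in nxt.items()}
    return x

# Edge set consistent with the description of Figure 3.1 (an assumption).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
scores = eigenvector_centrality(adj)
```

Vertex 3 receives the highest score: it is both the best-connected vertex and adjacent to the other well-connected vertices, exactly the effect eigenvector centrality is meant to capture.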
3.3.3 Clustering Coefficient
Whereas many of the previous measures rely on information about paths and path
lengths between nodes, the clustering coefficient captures information about the shape
of local structures within the graph. The clustering coefficient of v is the number of
edges that exist among the neighbors of v, divided by the number of edges that could
possibly exist among them. For an undirected graph, the clustering coefficient is
defined by the following equation:

C(v) = 2 |{ejk}| / (deg(v)(deg(v) − 1)),   where vj , vk ∈ N(v), ejk ∈ E        (3.5)
Another way to view clustering is the ratio of triangles (three nodes connected by
three edges) to the number of triples (three nodes and two edges, both incident to
v) that exist in the neighborhood of v. It has been shown in some types of networks
that if v1 connects to v2 and v2 connects to v3, then there is a greater chance that v1
and v3 will be connected as well [29, 28].
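Equation (3.5) translates directly into code: count the edges among N(v) and divide by the number possible. A sketch, with the edge set again an assumption consistent with Figure 3.1's description:

```python
def clustering(adj, v):
    """Fraction of the possible edges among the neighbors of v that
    actually exist, per Equation (3.5); equivalently, the ratio of
    triangles to triples at v."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0                 # fewer than two neighbors: no pairs
    links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2.0 * links / (k * (k - 1))

# Edge set consistent with the description of Figure 3.1 (an assumption).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
```

Here `clustering(adj, 1)` is 1.0 (its two neighbors, 2 and 3, are adjacent) while `clustering(adj, 3)` is 1/3 (only one of the three possible edges among {1, 2, 4} exists).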
3.3.4 Application of Traditional Graph Measures in Com-
puter Networks
Past studies have looked at graph characteristics for the purpose of anomaly detec-
tion and traffic classification. Staniford et al.’s GrIDS system [30] generates graphs
describing communications between IP addresses and can generate alerts based on a
set of rules, such as a vertex degree count crossing some threshold value. The BLINC
traffic profiling system developed by Karagiannis et al. examines the interactions be-
tween hosts to identify an application, and utilizes measures including degree counts
and neighborhood information [31]. This thesis is similar to the BLINC study in
that they both evaluate interactions among hosts at the functional and social lev-
els in order to identify applications. The BLINC study, however, exploits additional
information such as the transport protocol and average packet size and attempts to
match network behavior to a library of empirically derived “graphlets”. In contrast,
this study examines a wider variety of graph measures, and also proposes the unique
approach of searching application graphs for motifs.
3.4 Network Motifs
A network motif is a pattern of interconnections that occurs in a graph significantly
more often than it does in randomized networks. Studies performed by Milo et al.
find motifs in several types of complex networks and observe that a small number of network
motifs occur repeatedly across network types. They describe motifs as fundamental
building blocks of networks, capable of defining universal classes of networks [3, 16].
Research suggests that some motifs can be associated with a particular function,
discussed in Section 3.4.2. The work performed in this thesis extends this idea to
application graphs to determine if particular motifs indicate what application protocol
a host is using.
3.4.1 Definition of a Motif
In mathematical terms, a graph G′ = (V ′, E ′) is a subgraph of G if V ′ ⊆ V and
E ′ ⊆ E. A motif, then, is any such subgraph that occurs significantly more often than
in random networks. The level of significance required depends on the problem, but
as an example Milo et al. consider those patterns with a p-value of 0.01, meaning
that there is only a 1% chance of seeing a particular pattern as many or more times
in random networks than is observed in the original network [3]. Motif detection is
depicted in Figure 3.2.
Figure 3.2: Schematic view of motif detection [3]
Generally speaking, motifs of order 3 or larger are considered when performing
motif searches. However, searching for large motifs can be prohibitively expensive
because of the computational complexity involved. Several algorithms [32, 33] have
been developed to increase the efficiency of these searches and allow for the analysis
of large networks containing thousands of edges and nodes. Figure 3.3 shows the
thirteen possible directed edge combinations for motifs of order 3. In application
graphs, the edge directionality indicates the flow of data between two hosts, such as
a request from a client to a server, or the response from the server back to the client.
Additional motif characteristics are described in Chapter 5.
Figure 3.3: All 13 configurations of order 3 connected subgraphs [3]
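A brute-force census of the order-3 connected subgraphs can be sketched as follows: every 3-node combination whose induced subgraph is weakly connected is reduced to a canonical signature (the minimum 6-bit edge encoding over all relabelings), so isomorphic triads fall into the same bucket. This naive enumeration is this write-up's illustration, not one of the optimized algorithms cited above, and the signature values are internal codes rather than the standard triad-census identifiers:

```python
from itertools import combinations, permutations

def triad_signature(trio, edges):
    """Canonical signature of the directed 3-node subgraph induced on
    `trio`: the minimum 6-bit edge encoding over all vertex relabelings,
    so isomorphic triads share a signature."""
    best = None
    for perm in permutations(trio):
        bits, pos = 0, 0
        for i in range(3):
            for j in range(3):
                if i == j:
                    continue
                if (perm[i], perm[j]) in edges:
                    bits |= 1 << pos
                pos += 1
        best = bits if best is None else min(best, bits)
    return best

def triad_census(edges):
    """Count the connected order-3 subgraph types of a directed graph."""
    edges = set(edges)
    und = {}
    for u, v in edges:
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)
    counts = {}
    for trio in combinations(sorted(und), 3):
        # three nodes are weakly connected iff at least two of the
        # three possible pairs are linked in the undirected view
        touching = sum(1 for a, b in combinations(trio, 2) if b in und[a])
        if touching >= 2:
            sig = triad_signature(trio, edges)
            counts[sig] = counts.get(sig, 0) + 1
    return counts
```

As expected for two of the thirteen configurations, a feed-forward loop (X→Y, X→Z, Y→Z) and a three-node feedback cycle (X→Y, Y→Z, Z→X) land in different buckets.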
3.4.2 Function of Motifs
Several studies suggest that motifs can be linked to specific functions within a network.
Milo et al. analyze the motifs found in the direct transcriptional interactions in
Escherichia coli and find three highly significant motifs [16]. Their study states that
the appearance of network motifs at high frequencies suggests that they may have
some specific functions in the information processing performed by the network.
A different study analyzes the feed-forward loop, or FFL (Figure 3.4). In a FFL,
X regulates transcription factor Y , and both jointly regulate gene Z. Mangan et
al. show that it acts as a sign-sensitive delay element, in that it responds rapidly to
step-like stimuli in one direction (ON to OFF) and at a delay to steps in the opposite
direction (OFF to ON). They argue that this type of control mechanism can filter
out fluctuations in input stimuli [34].
X → Y,   Y → Z,   X → Z
Figure 3.4: A feed-forward loop
3.5 Analysis of Application Graphs
The application graphs studied in this work are hybrid networks, reflecting a mix of
social interactions and computer network architectures. Although there are no genes
present that require precise regulation as in the biological networks discussed
previously, network functions are carried out in a controlled environment that must follow
a set of established protocols. For example, if a user wishes to talk to another user on
a network via the AIM instant messaging service, each user must first authenticate
and establish a connection to a central server; the computers do not simply send text
back and forth between the two. Protocol behaviors are described in Chapter 4.
In terms of graph properties, application graphs are modeled with unweighted,
directed edges and do not contain any self-loops. If a computer connects to a service
running locally, the connection goes over the loopback interface, and is not visible
on the network traces examined. The edge direction is set to match the observed
traffic flow, which may be either unidirectional or bidirectional. If two computers
communicate at any time during a period of monitoring, an edge is drawn between
them. Edge weights are not used in this study, but may be considered in the future
to provide further detail when determining the application type.
The traditional graph measures defined previously are appropriate for the study of
application graphs because of the social aspect of the communications. Application
graphs are formed through specific user actions, such as surfing the web, checking
email, and sharing music. It is also for this reason that the study of motifs within
application graphs is interesting. In systems biology, processes such as gene transcrip-
tion and regulation are not voluntary tasks; cell survival depends on them. Chapter
5 details the methodology employed to describe application graphs based on their
traditional and motif-based characteristics.
Chapter 4: Data Selection and Considerations
As is the case in any type of research, proper data selection is imperative for
producing accurate results and analysis. This chapter examines several of the issues
involved with the collection and sampling of computer network data in an effort
to build a baseline measure for “normal” network behavior, and concludes with an
overview of the seven application protocols selected for this study.
4.1 Network Trace Files
The pcap library provides the packet-capture and filtering engines of several popular
network analysis and monitoring tools [35]. Some examples include tcpdump, nmap,
Wireshark and the Snort IDS. Tcpdump in particular is a valuable tool for capturing
packets as they come across a network interface card, a process known as “sniffing”,
and logging them in a raw format which can then be analyzed by other tools as shown
in Figure 4.1. Although tcpdump is able to capture all of the data associated with
each network packet such as packet length, flags and checksum values, only a few of
the fields specified by the IP, TCP, and UDP RFC documents [36, 37, 38] are needed to
model application graphs: source IP, destination IP, source port and destination port.
These four pieces of information are enough to uniquely identify a process running
over a computer network between two hosts. The creation of application graphs is
discussed in Chapter 3 and the implementation detailed in Chapter 5.
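Given the four fields per packet, building one application graph per port can be sketched as below; the function and field names are this write-up's assumptions, not the thesis implementation:

```python
def application_graph(packets, app_port):
    """Directed edge set for one application, identified by a single
    IANA port: any packet to or from that port adds an edge from the
    packet's source IP to its destination IP."""
    edges = set()
    for src_ip, src_port, dst_ip, dst_port in packets:
        if app_port in (src_port, dst_port):
            # a single packet suffices for an edge; edges are unweighted
            edges.add((src_ip, dst_ip))
    return edges
```

For the request/response pair of Figure 2.5 this yields two directed edges, one per traffic direction.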
4.2 Challenges Associated with Network Data Collection
Pang et al. identify three key goals of sharing network data with other researchers:
verification of previous research, direct comparison of competing ideas on the same
Figure 4.1: Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags
data, and a broader view than a single investigator can obtain on their own [39].
Unfortunately there are several concerns that must be addressed such as the amount
of data collected, the accuracy of the data and protection of users’ privacy. This
section outlines a few of these issues.
4.2.1 Data Capture
Increased utilization and line speeds of today’s high speed, high capacity networks
present challenges for collecting network data in terms of data rate, storage and pro-
cessing [40]. A packet sniffer can easily log hundreds of gigabytes of data in a single
day, even on a moderately sized network. A study of traffic collected at Dartmouth
College shows a significant increase in peer-to-peer, streaming multimedia and VoIP
traffic, whereas initial network usage was dominated by web traffic [41]. Both static
and streaming multimedia applications require significantly more bandwidth than
simple web documents or other non-interactive file types. Research characterizing
YouTube™ traffic found that 90% of videos requested by University of Calgary campus
network users were larger than 21.9 MB [42], orders of magnitude larger than the
file sizes of other content types.
In addition to requiring a great deal of storage space, high speed packet capturing
also requires fast memory access and high disk speed so that packets can be written to
the disk before the capture buffer fills and drops packets. Although undesirable, this
behavior does not affect the analysis of application graphs proposed by this study, which
uses individual packets to establish a communication link instead of aggregated flows
(all packets associated with a particular origin and destination pair). Two nodes in an
application graph will be connected if any packets are sent between them, regardless
of which part of the flow they come from, beginning, middle, or end. Therefore,
partial flows are considered in these graphs.
Another advantage of using individual packets is that TCP and UDP sessions
don’t need to be defined. TCP connections are established by a three-way handshake
between the client and server, and are terminated by a FIN and FIN-ACK sequence.
The formal establishment or tear-down of a TCP session might not be correctly
logged for several reasons: the sniffer could be turned on or off in the middle of the
session, parts of the handshake could be dropped by the sniffer, or either the client or
server could disconnect without following the closing protocol. UDP does not establish
formal sessions like TCP does, so UDP flows are sometimes delimited by a timeout:
the flow is considered terminated if there is no activity within a set interval. The edges in
an application graph are binary in nature and only indicate whether or not host A
communicated with host B.
4.2.2 Privacy and Sanitization of Data
Monitoring network traffic may raise serious privacy concerns, as data sent in cleartext
(i.e. not encrypted) is easily read by sniffing. Data such as usernames and passwords
sent to web sites via the HTTP protocol instead of the encrypted HTTPS protocol can
be effortlessly obtained by an attacker on the network. Even if sensitive information
is not being sent, an attacker can log all text and images downloaded by a user as he
or she surfs the web, and reassemble the browser sessions later.
Not only do researchers who collect this kind of data need to be sure to sanitize
the resulting log files to ensure the privacy of users, but they must also disguise the
IP addresses of machines on the network so that an attacker does not have a map of
the network with which to launch an attack. Several methods and tools have been
developed to accomplish these tasks, such as [43, 39, 44, 45].
Oftentimes a network administrator or developer does not need to log the packet
payloads to perform tasks such as verifying routes or debugging programs that utilize
sockets. If this is the case, only the packet headers are logged and the rest of the
packet is discarded. Only storing packet headers also helps alleviate the issue of
storage space discussed in the previous section. This shortcut cannot be used in the
case of signature-based intrusion detection systems which rely on scanning the payload
of a packet for known signatures that indicate an attack. The methods proposed in
this thesis do not consider packet payloads, but only the information readily available
in the packet header.
4.2.3 Network and Data View
Ideally, a “God’s eye view” of a computer network would reveal all communication
links within the network as well as connections from within the network to other
networks outside of it. Unfortunately, many sniffers are placed at gateway nodes at
the edge of a network and only capture traffic leaving from and coming to the network.
As a result, traffic originating from within a network and destined for internal servers
(web and application servers, email servers, etc.) is not logged because it never reaches
the gateway. Some data collection projects such as [46] attempt to address the lack
of internal enterprise network traffic that is available for research.
One drawback of the research method proposed in this paper is that it currently
assumes network activity for a particular application is limited to a single port. This
is not true for out-of-band protocols such as FTP that send authentication and control
messages over one port but use another for data transfer. Even if provided with a
complete view of the network, the data is segmented into individual port numbers for
analysis. Therefore, network communications over multiple port numbers will not be
visible. If a client connects to a web server on port 80, and that web server requests
data from a MySQL™ server (default port 3306) or an IBM WebSphere® Application
server (default port 2809), only one part of the process is visible at a time, either client
to web server, web server to database, or web server to application server. Seeing all
components of a particular process would reveal interesting structural motifs, but the
motif and node properties examined in isolation still hint at the function of the nodes.
Possible techniques for aggregating data for different views are discussed in Chapter
7.
4.3 Data Sources
The data sets used in this study come from three different sources in an attempt to
show measureable differences in protocols and behavior, even across networks with
different underlying architectures and usage patterns. One data set often used in in-
trusion detection research is the 1998 & 1999 DARPA Intrusion Detection Evaluation
Data Set [47]. The primary reason this data was not selected, however, is because of
its age; as Henderson et al. point out, the type of traffic seen in computer networks
has changed [41]. This is not to imply that the approach described in this thesis would
not work with older data, but that newer network traces containing a wider variety of
application use might prove more interesting to examine. Additionally, traffic for the
DARPA initiative is synthetic, whereas the data sets described in this section contain
real network data that reflects current trends in network and protocol use. Table
4.1 provides overview statistics for the traces examined.
4.3.1 Dartmouth College Wireless Traces
The CRAWDAD project at Dartmouth College provides an archive of wireless network
data from several contributors around the globe. Included in the archive is 163 GB
of packet headers captured from eighteen buildings on the campus during the Fall
2003 semester [48]. Data collected is representative of traffic in residential buildings,
academic buildings, as well as the library. It has been sanitized in such a way that
the IP addresses are consistent across traces, allowing for a more complete picture of
network use. The campus wireless network contains several thousand users and over
450 wireless access points.
4.3.2 LBNL/ICSI Enterprise Tracing Program
The ICSI Enterprise Tracing Program hopes to provide a view into the internal traffic
for an entire enterprise site [46]. These traces, taken from the Lawrence Berkeley
National Laboratory (LBNL) in 2004 and 2005, span more than 100 hours of activity
and include traffic from several thousand internal hosts. The data is sanitized in
accordance with the methodologies described in [39]. Like the Dartmouth wireless
traces, only packet headers were captured and the payload discarded.
                           Dartmouth      LBNL           OSDI
Capture length (seconds)   21818.575      600.079        193.348
Number of packets          2023527        2261261        324116
Avg. packets/sec           92.743         3768.274       1676.335
Number of bytes            1092602793     778659304      94814149
Avg. bytes/sec             50076.726      1297595.353    490380.757
Table 4.1: Summary statistics of three trace files examined
4.3.3 OSDI Conference Network Traces
The last source of data used for analysis in this paper also comes from the CRAW-
DAD archive, and includes traces from ten sniffers at the 2006 Operating Systems
Design and Implementation (OSDI) Conference [49]. Researchers collected this data
to enable the analysis of the behavior of a heavily used wireless LAN. The data was
initially sanitized on-the-fly and then reprocessed off-line to further obfuscate the
MAC addresses as necessary. Although this data set does not have the “enterprise”
characteristics of the previous two, its inclusion helps to determine the generalizabil-
ity of the methods proposed in this work to different networks and network points of
view.
4.4 Protocol Selection
Several criteria were used to select the protocols examined in this paper, including
availability, popularity, and diversity. First and foremost, there must be enough data
samples of a particular protocol in the trace files to be able to perform the graph
characteristic and motif analysis. To achieve this goal, more well-known and widely
used protocols were chosen. Also, protocols that have different architectures (client-
server vs. peer-to-peer for example) were selected in order to highlight the differences
in node characteristics. Because packet payloads are not inspected, applications that
operate on official IANA port numbers and are in-band protocols are used so that
reasonable assumptions can be made about the data, and that the port numbers
accurately reflect the protocol being used. As a reminder, the port number is not
used to classify applications, but is only used to provide class labels.
AOL Instant Messenger (AIM)
AOL’s instant messaging client has been a popular application for users around the
world for over a decade. AIM uses a proprietary protocol called OSCAR to commu-
nicate with other clients [50]. Multicast routing architectures exist and are used by
some chat programs such as IRC, but all AIM connections go through a centralized
server. Users authenticate to the AIM login server on port 5190. Once the user’s
session has been established, all chat communications also go through central AIM
servers on port 5190. The exception to this is when a user attempts to establish a
direct connection to another user (such as when sending pictures or other files), in
which case the communication goes directly to the other user and bypasses the cen-
tral AIM servers. Therefore AIM is primarily a client-server application, with some
peer-to-peer capabilities as well. This study restricts itself to communications on port
5190, so any direct file transfers are ignored.
HyperText Transfer Protocol (HTTP)
The HTTP protocol is used to retrieve hyper-linked text documents from the World
Wide Web [51]. A client initiates an HTTP request by connecting to a web server,
typically on port 80. The web server then responds with a status line, as well as
another message including the contents requested, such as an HTML file or an im-
age. HTTP is a stateless protocol, which means no information is retained between
requests. This protocol falls directly into a client-server architecture model.
Domain Name System (DNS)
DNS is a hierarchical naming system that maps meaningful domain names to numer-
ical IP addresses [52]. If a DNS server does not know the correct mapping for a given
domain, it can instruct the DNS resolver on the client side of where to query next to
attempt to resolve the address. DNS primarily communicates via UDP on port 53,
and also follows a client-server architecture. Its hierarchical nature, however, makes it
an interesting selection for analysis.
Kazaa
Kazaa is a peer-to-peer file sharing application built on the FastTrack protocol that
operates on port 1214. This protocol employs the use of supernodes for scalability
purposes. A supernode is any node on the network that also acts as a proxy and
relay for the network, handling data flow and connections for other users. A
peer-to-peer network should be more highly connected than a client-server model
since all nodes in the network act as both clients and servers for each other.
Microsoft Active Directory (MSDS)
Microsoft Active Directory is a client-server protocol that provides a way to manage
objects and relationships across a network. Objects can be resources such as printers,
services such as email, or users (accounts and groups). It provides several services such
as DNS-based naming, authentication methods and LDAP-like directory services.
Active Directory Domain Services (MSDS) is the central location for configuration
information, authentication requests and information about network objects [53]. It
operates on port 445. Windows shares and Active Directory are commonly used
in Windows-based networks, and its inclusion for analysis provides an example of
platform-dependent network traffic.
NetBIOS Name Service
NetBIOS (Network Basic Input/Output System) is used to allow applications on sep-
arate computers to communicate over a local area network. It provides three main
services: (i) name service for name registration resolution, (ii) session service for
connection-oriented communication, and (iii) datagram distribution service for con-
nectionless communication. The name service communicates over port 137 with either
the TCP or UDP protocol. A computer, which has a unique host name, might have
multiple NetBIOS names. The inclusion of NetBIOS for analysis is interesting be-
cause it often receives port scans and is frequently the target of malicious attacks.
The architecture of NetBIOS communications is unusual in that it does not fall
cleanly into a client-server model, nor does it fit the P2P model. It will occasionally
use broadcast messages, and NetBIOS hosts can also be configured as peers.
Secure Shell (SSH)
Secure shell is a protocol that allows encrypted data to be sent between two com-
puters on a network. It is often used for remote administration of other computers,
creating secure tunnels for web browsing and securely copying files. SSH is primarily
used on UNIX/Linux environments and runs on port 22. SSH utilizes a client-server
architecture.
Chapter 5: Experimental Methodology
The analysis of application graphs involves several stages and requires the use of
many different software tools. The major tasks include parsing and storing network
data, creating graphs and vertex profiles, analyzing node properties, searching for
motifs, and building a classifier to predict application labels. Optimization of the classification
process via feature weighting is also considered. This chapter describes the process
as well as the tools used, which are open-source and freely available. For the reader’s
convenience, a summary diagram is given in Figure 5.1.
[Figure: flow diagram pairing each process stage with its tools. Parse and store data (Wireshark, Afterglow); construct application graphs (Python, NetworkX); traditional profile creation and analysis (Python, NetworkX); motif-based profile creation and analysis (Python, FANMOD); nearest neighbor classification and evolutionary attribute weighting (RapidMiner).]
Figure 5.1: Overview of the proposed methodology and tools used
5.1 Hardware and Linux System
All tests were run on a multi-core system running the Linux kernel version 2.6.22.
The system contains four dual-core AMD 64-bit processors running at 1.8 GHz each.
It uses a shared-memory architecture with 8 GB of memory. Although most of the
tools are not written to take advantage of multiple cores, the hardware architecture
allows for analysis of multiple network traces to happen simultaneously.
5.2 Packet Capture and Storage
The network traces are in the pcap format as described in Chapter 4. Modified
parsers based on those distributed as part of The Afterglow Project [54] were used to
parse pcap files. Additionally, Wireshark [10] was used to extract basic information
from the network trace files, including the source IP, destination IP, source port,
destination port, timestamp, protocol and packet length.
tshark -t e -r input.pcap tcp or udp | python tshark2mysql.py t
Figure 5.2: Storing packets from a pcap file into a MySQL database
Once the packets have been parsed, they are stored into a MySQLTM database
for later retrieval. This is done to facilitate later steps in the process so that packets
can be selected based on their source or destination port numbers, protocol type, or
other attribute. Figure 5.2 illustrates the process of parsing and storing information
from input.pcap into a MySQL database table t. Each network trace file is stored in
a unique table within the database.
5.3 Creation of Application Graphs
The next step in the process is to model the application graphs and analyze the
traditional measures as described in the first half of Chapter 3. NetworkX is a package
for the creation, manipulation and study of complex networks, written in the Python
programming language [55]. Graphs are created by querying a MySQL database
table for all entries for which either the source or destination port number matches
the port number of one of the seven application protocols. Although port numbers
do not always accurately reflect the application bound to them, they are generally
a strong indicator, especially for the well known port numbers 0-1023 (e.g., HTTP
servers typically listen on port 80 for connections). For the purposes of this work, the
applications tied to each port number are assumed to be correct; however, verification
is not possible because the packet payloads have been discarded.
Graph Size
There are two possible approaches to consider when creating and comparing appli-
cation graphs across different protocols. One approach is to collect network data
for a constant amount of time and then study the resulting communications that
occurred. For example, each application graph would represent ten hours of SSH
communications, ten hours of HTTP communications, and so on. This approach is
complicated for several reasons. The data collected for these experiments comes from
several different sources where the network monitors were run for variable lengths of
time. Because certain applications are much more heavily used than others, there
is no guarantee that there would be enough data, or conversely, too much, for each
protocol. Additional data pre-processing would be required to mitigate variables such
as these.
Instead, this study attempts to analyze application graphs that have a similar
number of participating nodes by allowing the network capture lengths to vary. By
doing so, the number of hosts in each application graph (Table 5.1) can be more easily
controlled, and the interaction patterns that form over a longer amount of time may
be viewed. The order of each application graph is consistent within examples of a
particular protocol, but not across protocols due to limited data availability.
Protocol AIM DNS HTTP Kazaa MSDS Netbios SSH
Order 50 68 80 80 40 76 40
Table 5.1: Graph orders for each application protocol
The graph orders serve as an upper bound for the number of nodes considered
in each application graph. For example, if ten network trace files are searched for
AIM traffic, the lowest number of hosts found communicating on port 5190 in any
of the files becomes a limiting factor for the other trace files. If another file were
to have 120 hosts using AIM, only the first fifty would be considered. However, all
communications among those fifty hosts over the duration of the network trace would
be added to the graph, not just the links created as each node is added.
Connected Components
It is natural to expect that not all nodes in an application graph are connected; groups
of nodes exist that communicate with one another, but there is no communication
that connects one group to the next. Scale-free networks, whose degree distributions
follow a power law, usually contain one large connected component and a few
smaller connected components [24]. Most of the protocols examined exhibited this
scale-free behavior; SSH was the exception, showing a high number of small connected
components. When calculating graph attributes for disconnected graphs, each con-
nected component is treated as a separate graph. This ensures that measures such as
distances between nodes, radius, diameter and others discussed in Chapter 3, are well
defined and don’t use infinite path lengths to represent nodes that are disconnected,
as is done in some graph algorithms.
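Splitting a disconnected graph into its weakly connected components can be sketched as follows. This is a dependency-free illustration using a plain edge list; the study itself performs this step with NetworkX:

```python
def weak_components(edges):
    """Split a directed edge list into weakly connected components
    (edge direction is ignored), so that distance-based measures can
    be computed separately on each component."""
    # Build an undirected adjacency list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two disconnected groups of hosts
comps = weak_components([("a", "b"), ("b", "c"), ("x", "y")])
```

Each returned set of vertices is then treated as a separate graph for the distance-based measures.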
5.4 Traditional Graph Measures
After the application graphs have been created, NetworkX is again used to perform
calculations on each connected component. There are eleven node characteristics
(described in Chapter 3) examined in this approach: indegree, outdegree, total degree,
clustering coefficient, betweenness centrality, degree centrality, closeness centrality,
eigenvector centrality, eccentricity, whether or not the node is a center node, and
whether or not the node is a periphery node. NetworkX provides functions to calculate
all of these values except for eigenvector centrality, whose implementation is listed in
Appendix B.
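Most of these measures map directly onto NetworkX functions, as in the following sketch (eigenvector centrality, which uses the custom implementation in Appendix B, is omitted here; the example graph is hypothetical):

```python
import networkx as nx

def traditional_profile(g, v):
    """Sketch of a vertex profile using ten of the eleven measures
    described in Chapter 3; assumes g is a connected directed graph."""
    ug = g.to_undirected()  # distance measures use the undirected view
    return [
        g.in_degree(v),
        g.out_degree(v),
        g.degree(v),
        nx.clustering(ug, v),
        nx.betweenness_centrality(g)[v],
        nx.degree_centrality(g)[v],
        nx.closeness_centrality(g)[v],
        nx.eccentricity(ug, v),
        1 if v in nx.center(ug) else 0,      # boolean encoded as 0/1
        1 if v in nx.periphery(ug) else 0,
    ]

g = nx.DiGraph([("a", "b"), ("b", "c"), ("a", "c")])
profile = traditional_profile(g, "b")
```

The resulting list forms one row of the vertex profile matrix described in Section 5.6.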
5.5 Motif Analysis
For the task of motif searching, several network analysis tools were considered: FAN-
MOD [32], mfinder [56], MAVisto [57] and Pajek [58]. FANMOD (Fast Network Motif
Detection) was selected for its rich feature set, including support for a graphical user
interface in addition to command line invocation, generation of motif images, the
ability to export results in several different formats and support for node and edge
colors. FANMOD employs algorithms [59] that allow it to search motifs faster and
with less memory usage than other motif searching tools such as MAVisto or mfinder.
FANMOD Parameters
FANMOD’s support for colored vertices and colored edges allows for the encoding
of additional information into a motif structure beyond its shape and the directionality
of its edges. This study assumes edges are directed, but further exploits the flow of
information between nodes by defining three classes (colors) of hosts: client, server,
and peer (see Figure 5.3). In computer networking these terms refer to nodes that are
consumers of a service, providers of a service, or nodes that act as both consumers
and providers of a service. Here, they take on a related meaning, but are defined
somewhat more generally based on the source IP, source port and destination port of
a packet.
Definition 1 Let φ be the port number associated with an application and v be a
node in Gφ, the application graph of φ. Also, let P be a packet sent by v over the
network, where Psp and Pdp are the source and destination ports of P , respectively.
Client, server and peer are defined as follows:
• If Pdp = φ then v is a client node, labeled vc
• If Psp = φ then v is a server node, labeled vs
• If vc and vs hold, then v is a peer node, labeled vp
As described in Chapter 2, a client computer will request a service by connecting
from a random upper port on its own machine to a particular port φ on a server.
Therefore, if the destination port of a node v is the port number φ, then v is consuming
the service provided on that port. For many protocols, the server will the send data
back to the client from port φ. In this instance, the server becomes the source IP,
sending data from φ, as specified in Definition 1. The third part of the definition
describes the behavior of peers, computers that act as both “clients” and “servers”.
Therefore any node that is found to both send and receive data on port φ is labeled
as a peer. Edge colors are not currently used, but are considered for future work.
Figure 5.3: A motif with colored vertices
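Definition 1 can be sketched as a small labeling routine. The packet-tuple format below is a simplifying assumption for illustration:

```python
def label_hosts(packets, phi):
    """Label hosts as client, server, or peer per Definition 1.

    packets: iterable of (src_ip, src_port, dst_ip, dst_port) tuples.
    A host that only sends to port phi is a client; one that only
    sends from port phi is a server; one that does both is a peer.
    """
    sends_to_phi, sends_from_phi = set(), set()
    for src_ip, src_port, dst_ip, dst_port in packets:
        if dst_port == phi:
            sends_to_phi.add(src_ip)
        if src_port == phi:
            sends_from_phi.add(src_ip)
    labels = {}
    for host in sends_to_phi | sends_from_phi:
        if host in sends_to_phi and host in sends_from_phi:
            labels[host] = "peer"
        elif host in sends_to_phi:
            labels[host] = "client"
        else:
            labels[host] = "server"
    return labels

# Hypothetical Kazaa (port 1214) packets
pkts = [
    ("10.0.0.1", 49152, "10.0.0.2", 1214),  # 10.0.0.1 acts as a client
    ("10.0.0.2", 1214, "10.0.0.1", 49152),  # 10.0.0.2 acts as a server
    ("10.0.0.1", 1214, "10.0.0.3", 49200),  # 10.0.0.1 also serves -> peer
]
labels = label_hosts(pkts, 1214)
```

These labels then become the vertex colors supplied to FANMOD.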
Random Graphs and Statistical Significance
Milo et al. define “network motifs” as patterns of interconnections that occur in
complex graphs at numbers that are significantly higher than those in randomized
networks [3]. To determine which motifs are statistically significant, network data
retrieved from the MySQL database for a particular application is converted into a
FANMOD input file, which describes an application graph. First, the input graph is
searched for all motifs of either order 3 or order 4. Next, a set of random graphs is
generated and the motif search is repeated for each. The frequency at which motifs
occur in the original input graph (the application graph), is compared to the frequency
of those same motifs in the random graphs. Motifs that are found significantly more
often in the original graph are then reported to the user.
Random graphs are created through a series of “edge switching operations” (Fig-
ure 5.4(a)), using the original input graph as a starting point. Several parameters
exist to control the randomization process. In this study, the “local constant” model
is selected, which means that unidirectional edges are only exchanged with other uni-
directional edges. As a result, the number of bidirectional edges incident upon each
vertex remains constant. Another option selected is to “regard vertex color” (Fig-
ure 5.4(b)), which indicates that edges should only be exchanged if their endpoints
have the same color. These options were enabled to create randomized networks that
are still structurally similar to the original network and allow for a more stringent
comparison [3].
(a) Edge-switching operation (b) Regard vertex color
Figure 5.4: FANMOD edge-switching process for generating random networks [4]
There is some variability in defining the phrase “statistically significant”, as dif-
ferent thresholds can be used. The mfinder Tool Guide suggests using 5,000+ random
graphs when searching for motifs of order 3, and 10,000+ random graphs when search-
ing for motifs of order 4, and suggests that ten occurrences of any individual motif is
a good starting point to measure the quality of a result [56]. For the motif analysis
performed in this work, similar parameters were used. To keep the problem size rea-
sonably small, 5,000 random networks were generated when searching for both order
3 and order 4 motifs; FANMOD supports sampling subgraphs for motif searching, but
an exhaustive enumeration of all subgraphs is currently used. The FANMOD output
files provide the user with several pieces of information, including the percentage of
subgraphs each motif was found in (for the original networks as well as in random
networks), and a p-value for each motif.
The p-value is a statistical measure that describes the probability of obtaining a
result at least as extreme as the result observed, given that the null hypothesis, or
expected outcome, is true [60]. If the p-value falls outside the range of the expected
outcome and is less than some threshold value α, the result is said to be statistically
significant at the α level. In practice, values of 5%, 2.5% and 1% for α are common.
For this study a “significant motif” is any motif that occurs in at least 1% of subgraphs
in the original graph and has a p-value of 0 — essentially those motifs that FANMOD
determines are the most significant results. By setting the threshold at this p-value,
the number of motifs considered for analysis can be limited to a more select group.
Experimentally, this results in a list of 130 significant motifs.
5.6 Vertex Profiles
The data collected through traditional graph analysis and motif analysis is used to
create “profiles” of each node, used in the classification algorithm described in Section
5.7. Each profile is a data point in d-dimensional space, where d is the number of
attributes a in the profile. A list of n vertices labeled v1 . . . vn, is written as follows:
v1 = [ a1, a2, a3, ..., ad ]
v2 = [ a1, a2, a3, ..., ad ]
    ...
vn = [ a1, a2, a3, ..., ad ]
Figure 5.5: Arrays representing vertex profiles
The attributes a1 through ad can be any numerical data type or numerical repre-
sentation of a data type. In the traditional graph analysis approach there are eleven
attributes (degree counts, centrality measures, etc.), so d = 11. These attributes
include integers, real numbers and boolean values represented as a 1 (true) or a 0
(false). The intent is to associate an application with a certain profile.
The idea of vertex profiles based on graph characteristics is adapted to the motif-
based approach. Instead of considering the percentage of subgraphs a motif occurs
in, however, a binary attribute is created that describes whether or not the vertex
participates in the motif. One of the files output by FANMOD motif searches is a
comma separated file with the following format:
adjacency matrix, <participating vertices>
After the significant motifs have been determined, the script in Listing B.5 parses
these files and creates the profiles for each node based on its participation in significant
motifs. The dimensionality d of the motif profiles is 130: 42 of these are significant
order 3 motifs, while the remaining 88 are significant order 4 motifs. The motif profiles
were built putting both order 3 and order 4 motifs together because preliminary
investigations indicated that the combination is more successful in separating and
identifying protocols than either can do alone.
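The construction of binary motif profiles can be sketched as below. The row format is a simplified stand-in for the FANMOD CSV output described above, not its exact syntax:

```python
def motif_profiles(motif_rows, vertices):
    """Build binary motif-participation profiles.

    motif_rows: list of (motif_id, participating_vertices) pairs,
    a simplified stand-in for the parsed FANMOD output. Returns a
    dict mapping each vertex to a 0/1 vector with one entry per
    significant motif.
    """
    motif_ids = sorted({m for m, _ in motif_rows})
    index = {m: i for i, m in enumerate(motif_ids)}
    profiles = {v: [0] * len(motif_ids) for v in vertices}
    for motif_id, members in motif_rows:
        for v in members:
            if v in profiles:
                profiles[v][index[motif_id]] = 1
    return profiles

# Two hypothetical significant motifs and their participants
rows = [("m3_12", ["a", "b", "c"]), ("m4_204", ["b", "d"])]
profiles = motif_profiles(rows, ["a", "b", "c", "d"])
```

In the actual study the vectors have 130 entries, one per significant motif.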
5.7 K-Nearest Neighbor Classification
The tasks of node classification and feature weighting (Section 5.8) are handled by
RapidMiner, an open source knowledge-discovery and data mining tool built on the
JavaTM platform [61]. RapidMiner allows for data mining experiments to be quickly
constructed through the use of hundreds of modular operators that handle data pre-
processing and post-processing, creation and storage of models, clustering and classi-
fication tasks as well as statistical analysis.
The k-nearest neighbor (k-NN) classification algorithm is a simple machine learn-
ing algorithm for classifying objects based on the closest training examples in a feature
space. First, the data is broken into a training set and a test set. The proximity of a
test point z to every point in the example set is then calculated.
Algorithm 1 The k-Nearest Neighbor classification algorithm [62]
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:     Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4:     Select Dz ⊆ D, the set of k closest training examples to z.
5:     y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
6: end for
After the nearest-neighbor list is obtained, the test example z is classified based
on a majority vote of the k nearest neighbors to z. In this study, k = 1, so a test
point z is given the same label as the label of its closest neighbor. In line 5 above,
yi is the class label for one of the nearest neighbors, and I() is an indicator function
that returns the value 1 if its argument is true and 0 otherwise.
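A minimal sketch of this procedure, using Euclidean distance as the proximity measure, is shown below with hypothetical two-attribute profiles:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, z, k=1):
    """Classify profile z by majority vote among its k nearest
    training examples; train is a list of (profile, label) pairs."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], z))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical two-dimensional profiles with application labels
train = [([0, 0], "SSH"), ([1, 1], "SSH"), ([9, 9], "HTTP")]
label = knn_classify(train, [8, 8], k=1)
```

With k = 1, as in this study, the test point simply inherits the label of its single closest neighbor.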
5.7.1 Measuring Profile Separation
A number of similarity measures can be used to determine the distance from one point
to another (line 3 of Algorithm 1), the selection of which depends on the type of data
being examined and its application [62]. For example, there is Euclidean distance,
Jaccard coefficient, cosine similarity and simple matching coefficient. Euclidean dis-
tance is often chosen for instances of dense continuous data such as that found in
the profiles for traditional graph analysis. Although the simple matching coefficient
is often applied to binary data such as the motif profiles, the Euclidean distance is
also suitable, and is selected for use in this study. Equation 5.1 defines this distance,
where n is the number of dimensions and xk and yk are the kth attributes of x and y.
d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)² )                    (5.1)
5.7.2 Cross Validation of Classification Results
Cross validation is the process of partitioning a data set into n subsets, training a
classifier with n − 1 subsets and using the remaining subset to test. The process is
then repeated n times with a different subset left out each time. In 10-fold cross
validation, for example, ten subsets are created, each containing 10% of the original
data set. In each iteration, 90% of the data is used for training and 10% is used for
testing. To avoid the possibility of a particular subset not containing any instances
(or very few) of a particular label, stratified sampling is used so that each subset
contains roughly the same proportion of labels.
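The stratified partitioning step can be sketched as follows. RapidMiner performs this internally in the study; the routine below is a dependency-free illustration:

```python
import random
from collections import defaultdict

def stratified_folds(examples, n_folds=10, seed=0):
    """Partition (profile, label) examples into n_folds subsets so
    that each fold holds roughly the same proportion of every class
    label, by dealing each class's examples round-robin."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    folds = [[] for _ in range(n_folds)]
    for label, group in by_label.items():
        rng.shuffle(group)
        for i, ex in enumerate(group):
            folds[i % n_folds].append(ex)
    return folds

# 20 examples of each of two hypothetical classes
data = [((i, i), "A") for i in range(20)] + [((i, -i), "B") for i in range(20)]
folds = stratified_folds(data, n_folds=10)
```

Each fold then serves once as the test set while the remaining nine train the classifier.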
5.8 Genetic Algorithm Feature Weighting
Genetic algorithms provide a unique way to investigate which attributes in the vertex
profiles more effectively classify application protocols, as well as increase the accuracy
of the nearest-neighbor classifier. This study utilizes a genetic algorithm to perform
evolutionary feature weighting, the results of which are applied to each profile and a
new classifier is built using the nearest neighbor algorithm as before. Alternatively, a
brute-force search of all attribute combinations (given by Equation 5.2) might be possible
for a small attribute set such as in the case of traditional graph analysis, but is not
feasible for motif analysis.
c = Σ_{n=1}^{d} (d choose n)                    (5.2)
Given that d = 11 for traditional graph analysis, applying the equation above
reveals that the number of possible attribute combinations c is 2,047. However when
d = 130 for motif analysis, c = 1.36 × 10^39. Genetic algorithms present one possible
way to explore this problem space within a reasonable amount of time.
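Equation 5.2 sums to 2^d − 1 (every non-empty subset of attributes), which can be checked directly:

```python
from math import comb  # binomial coefficient (Python 3.8+)

def n_combinations(d):
    """Number of non-empty attribute subsets per Equation 5.2;
    the sum of binomial coefficients equals 2**d - 1."""
    return sum(comb(d, n) for n in range(1, d + 1))

print(n_combinations(11))    # prints 2047
```

For d = 130 the same sum is 2^130 − 1, on the order of 10^39, confirming that exhaustive search is infeasible for the motif profiles.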
5.8.1 Overview of Genetic Algorithms
Genetic algorithms view learning as a competition among a population of evolving
candidate problem solutions [63]. During each generation, a fitness function (line 4 of
Algorithm 2 below) assesses each candidate to determine if it will contribute to the
next generation of solutions. Those solutions found to be the most “fit” are selected
for mating and mutation and shape the following generation of potential solutions.
The algorithm repeats until some termination condition is met, such as convergence
to a solution or a predefined number of generations have been tested.
Algorithm 2 General form of a genetic algorithm [63]
1: Set time t = 0
2: Initialize the population P(t)
3: while the termination condition is not met do
4:     Evaluate fitness of each member of the population P(t).
5:     Select members from population P(t) based on fitness.
6:     Produce the offspring of these pairs using genetic operators.
7:     Replace, based on fitness, candidates of P(t) with these offspring.
8:     Set time t = t + 1
9: end while
Before the algorithm can begin, candidate solutions must be transformed into
an appropriate representation for the problem space. Examples include binary, real
value, and tree encodings, the simplest and most studied of which is binary encoding
[64]. Initial populations of candidate solutions are usually chosen at random. The
population size depends on the problem space, but studies have shown a population
size of 20-30 generally yields good results [65, 66]. At this point, the fitness function
evaluates each member of the population, and selects the best candidates for mating.
Figure 5.6 shows what a simple crossover of two binary strings might look like.
Input Bit Strings          Output Bit Strings
    0011|0001       =⇒         0011|1011
    0100|1011                  0100|0001
Figure 5.6: Single-point crossover of two binary strings
Just like in evolutionary biology, there is a small chance for random genetic mu-
tation to occur. In a binary string, this would equate to one of the bits being flipped
from a 0 to a 1 or vice versa, allowing the algorithm to explore more of the problem
space and not settle on a local solution. Previous research suggests variable values
for mutation probability, such as 0.0001 [65] or 0.005 - 0.01 [66].
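The two genetic operators can be sketched on bit strings as follows; the crossover example reproduces the strings of Figure 5.6:

```python
import random

def crossover(a, b, point):
    """Single-point crossover of two equal-length bit strings:
    swap the tails after the crossover point."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if c == "0" else "0") if rng.random() < rate else c
        for c in bits
    )

# The crossover from Figure 5.6, with the point after the fourth bit
child1, child2 = crossover("00110001", "01001011", 4)
```

A low mutation rate, such as the 0.005 to 0.01 cited above, occasionally perturbs offspring so the search does not settle on a local solution.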
5.8.2 Feature Weighting
The RapidMiner distribution contains a prewritten test for evolutionary feature weight-
ing using genetic algorithms. In the context of application identification, the function
used to determine the fitness of candidate solutions is based upon whether or not the
potential solution increases the overall accuracy of the 1-NN classifier. Solutions that
do not increase the performance of the classifier are not selected to contribute to the
following generation of candidate solutions. The algorithm is run for thirty genera-
tions, by which time the system should stabilize and begin to converge to a solution
set of attribute weights. The full test parameters, including crossover probabilities,
mutation rates and candidate selection can be found in Appendix C.
Chapter 6: Results and Analysis
To test the accuracy and performance of the proposed approaches, several exper-
iments were run using the method described in Chapter 5. In total, 65 application
graphs were examined: ten AIM, ten DNS, ten HTTP, five Kazaa, ten MSDS, ten
Netbios and ten SSH, with the discrepancy resulting from fewer examples of peer-to-
peer Kazaa traffic being located in the data traces that were downloaded. Profiles
were classified using both traditional graph attributes and motif-based attributes.
Afterwards, profile attributes were weighted using a genetic algorithm. This step
aims to provide two important functions: to increase the accuracy of the classifiers
and to provide insight into which attributes are more effective for identifying network
applications. Analysis of several key attributes is provided in this chapter, as well as
a direct comparison between traditional and motif-based profiles.
6.1 Preliminary Investigations
Because motifs have not been applied in the realm of application identification, some
preliminary classification work was required to vet this approach. Profiles for each of
the 65 application graphs were created using a combination of significant order 3 and
order 4 motifs, where each attribute represents the frequency of a particular motif
within that graph. The results provided in Table 6.1 were encouraging (for the full
classification results see Appendix D). Perhaps a more interesting question, however,
is not if an entire graph of communications can be correctly classified, but instead
if the activities of a particular host can be identified. It is on this question that the
remainder of the chapter is focused.
Protocol AIM DNS HTTP Kazaa MSDS Netbios SSH
Accuracy 80% 80% 90% 40% 60% 100% 80%
Table 6.1: Classification accuracy of 65 application graphs
6.2 Initial Results
Classification results are presented as confusion matrices; each row of the table repre-
sents a predicted class label (an application in this case), while the columns represent
the true class label. The boldface numbers along the diagonal indicate correct clas-
sifications. Confusion matrices also show false positives and false negatives. Data
points that are predicted to have a certain class label but are incorrect are known
as false positives, found in the rows of the matrices. False negatives are examples of
a particular class that are incorrectly labeled, shown in the columns. For example,
given a set of data that is predicted to be hosts sharing files via Kazaa, true positives
would be those hosts that are actually using the P2P application while false positives
would be those hosts that are not. Conversely, given a set of data that is known to
be file-sharing hosts, false negatives would be those that are not labeled as using the
Kazaa application.
           True A   True B   True C   Precision
Pred. A       5        2        0       71.4%
Pred. B       3        3        2       37.5%
Pred. C       0        1       11       91.7%
Recall:     62.5%    50.0%    84.6%   Σ 70.4%
Table 6.2: An example confusion matrix with three classes
The performance of the nearest-neighbor classification models is described by
three different accuracy measures. The overall accuracy of a model (denoted by
Σ next to the number in the bottom-right corner) is simply the number of correct
classifications (true positives) divided by the total number of classifications. Given a set of predictions of a
particular label, class precision is a measure of the accuracy of those predicted labels.
It is the ratio of correct predictions of label l to all predictions of label l. It can be
written:
precision = true positives / (true positives + false positives)     (6.1)
Class recall (also called sensitivity) measures the accuracy of predicted labels if
provided a complete set of true labels. Recall is given by the following equation:
recall = true positives / (true positives + false negatives)     (6.2)
Table 6.2 displays the results of an example classification experiment, as well as
the accuracy, precision and recall measures. This confusion matrix shows that while
the classifier has some trouble distinguishing between class A and class B, it can
effectively detect examples of class C.
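Equations 6.1 and 6.2 can be checked directly against Table 6.2. A minimal sketch, with the matrix hard-coded for illustration:

```python
def per_class_metrics(matrix):
    """Compute per-class precision and recall from a confusion matrix whose
    rows are predicted labels and whose columns are true labels."""
    n = len(matrix)
    precision = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    recall = [matrix[i][i] / sum(row[i] for row in matrix) for i in range(n)]
    accuracy = sum(matrix[i][i] for i in range(n)) / sum(map(sum, matrix))
    return precision, recall, accuracy

# The example matrix from Table 6.2 (rows: predicted A, B, C).
table_6_2 = [[5, 2, 0],
             [3, 3, 2],
             [0, 1, 11]]
```

Running `per_class_metrics(table_6_2)` reproduces the 71.4% / 37.5% / 91.7% precisions, the 62.5% / 50.0% / 84.6% recalls, and the 70.4% overall accuracy shown in the table.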
6.2.1 Traditional Graph Measure Profiles
To remind the reader, traditional graph measure profiles have eleven attributes in-
cluding degree counts, centrality measures and clustering coefficient (see Section 5.4
for the full list). There are a total of 3,940 unique hosts found in the 65 application
graphs. Each line of the input file for the nearest neighbor algorithm contains the
true label assigned to the host and the eleven graph measures associated with that
host. Not all protocols have an equal number of training examples due to the popu-
larity and availability of certain applications in the trace files, but each protocol has
400–800 examples.
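A minimal sketch of this classification step, with toy (profile, label) pairs standing in for the real eleven-measure profiles; RapidMiner's implementation differs in its details, but the shape of the computation is the same:

```python
import random

def euclidean(a, b):
    # one distance computation is O(d) in the number of attributes d
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict_1nn(train, point):
    """Label a test profile with the label of its nearest training profile."""
    return min(train, key=lambda t: euclidean(t[0], point))[1]

def cross_validate(data, folds=10, seed=0):
    """k-fold cross-validated 1-NN accuracy over (profile, label) pairs."""
    data = data[:]
    random.Random(seed).shuffle(data)
    correct = 0
    for k in range(folds):
        train = [d for i, d in enumerate(data) if i % folds != k]
        test = [d for i, d in enumerate(data) if i % folds == k]
        correct += sum(predict_1nn(train, x) == y for x, y in test)
    return correct / len(data)
```

Each test point is compared against every training point, which is what gives the O(nd) per-point cost discussed below.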
The computational load of a single test point for the nearest neighbor algorithm
is O(nd) where n is the number of training samples and d is the number of attributes.
When using 10-fold cross validation, the test set contains n/10 of the points and ten iterations are run,
making the overall complexity of this process O(n²) if we absorb the constant d into
the expression. Although other methods exist to reduce the number of computations
necessary, RapidMiner is able to generate the model and accuracy measures for 3,940
data points in just a few seconds. Table 6.3 shows the resulting confusion matrix,
where each row is the predicted label and each column is the actual label.
          AIM      DNS      HTTP     Kazaa    MSDS     Netbios  SSH      Precision
AIM       417      29       89       37       91       41       320      40.72%
DNS       2        612      6        1        7        20       10       93.01%
HTTP      49       11       658      2        20       32       11       84.04%
Kazaa     1        1        3        355      1        1        0        98.07%
MSDS      10       5        13       1        255      10       29       78.95%
Netbios   13       19       24       4        9        655      2        90.22%
SSH       8        3        7        0        17       1        28       43.75%
Recall    83.40%   90.00%   82.25%   88.75%   63.75%   86.18%   7.00%    Σ 75.63%
Table 6.3: Confusion matrix of unweighted traditional graph measures
After this initial classification test there are five protocols that have greater than
80% of their profiles labeled correctly (class recall): AIM, DNS, HTTP, Kazaa and
Netbios. The class recall of SSH is strikingly low at 7%. SSH is a particularly
difficult application to classify because it is used for a variety of tasks, including
remote management of hosts, application tunneling and file transfers using the secure
copy program. The traditional measures of SSH application graphs often resemble
those of other applications, resulting in a low class recall value.
Despite the fact that about half of the eleven traditional attributes are real-valued,
several ties occur when the nearest-neighbor algorithm computes the Euclidean dis-
tance from a test point to the training points in the model. In this study, a tie
situation is termed a profile collision (defined formally below). This behavior is
desirable if the tie is between profiles of the same class, suggesting that certain profiles
are strongly indicative of a particular application. However, many examples also tie
with examples from several different classes. When this happens, RapidMiner naively
assigns the first label in the list of ties to the test point. The order of this comparison
is affected by the order of the input data. AIM is the first protocol in the list, so
many multi-class ties are labeled as AIM, which explains why 80% of SSH traffic is
classified as AIM. It also explains the low class precision for AIM, since any multi-
class tie involving AIM receives that label. Regardless of RapidMiner’s tie-breaking
algorithm, classification inaccuracies are caused in part by the high rate of overlap
among classes when using traditional graph measure profiles.
Profile Collisions
More often than not, the training profiles involved in a single-class tie share the
test point’s true label. With the method used by RapidMiner to break ties, multi-class ties are
more likely to be incorrectly labeled than single-class ties. Table 6.4 summarizes the
single and multi-class ties for each protocol. The three protocols involved in the most
multi-class ties (AIM, MSDS, SSH) also have the lowest class recalls.
AIM DNS HTTP Kazaa MSDS Netbios SSH
Single-class Ties 182 567 422 343 219 530 17
Multi-class Ties 110 29 49 46 83 33 331
Table 6.4: Number of single and multi-class ties for traditional graph measures
To explore the properties of vertex profiles in more detail, profile collisions are
introduced. A profile collision can occur in one of two ways: if the distance from a
test point to a training point is equal to zero, or, if a test point is equidistant from
two or more training points, written mathematically as follows:
d(z, t1) = 0, or
d(z, t1) = d(z, t2) [ = ... = d(z, tn) ]
where z is a test point and t1 – tn are training points. Note that the first type of
profile collision results from vertices having identical profiles. The collision graphs in
Figure 6.1 show the total number of collisions with other profiles for each protocol.
For example, Figure 6.1(a) shows that AIM profiles collide with SSH profiles more
frequently than they do with other AIM profiles.
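The two collision conditions can be checked with a direct scan. The helper below is a sketch (profile values are illustrative) that also reports whether a tie is single- or multi-class:

```python
def collision_type(test_point, train, eps=1e-12):
    """Classify the tie situation for one test profile.

    Returns None when the nearest training profile is unique (and a nonzero
    distance away), 'single-class' when all tied profiles share one label,
    and 'multi-class' otherwise.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    dists = sorted((dist(test_point, p), label) for p, label in train)
    d_min = dists[0][0]
    tied = [label for d, label in dists if abs(d - d_min) < eps]
    if d_min > eps and len(tied) < 2:
        return None                  # unique nearest neighbor, no collision
    return 'single-class' if len(set(tied)) == 1 else 'multi-class'
```

Tallying the return values over all test points yields counts like those in Table 6.4.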
[Figure: seven charts of profile collision counts, panels (a) AIM, (b) DNS, (c) HTTP, (d) Kazaa, (e) MSDS, (f) Netbios and (g) SSH, each broken down by colliding protocol.]
Figure 6.1: Profile collisions for traditional graph measures
6.2.2 Motif-based Profiles
Although profiles based on traditional graph measures can identify applications with
some success, there is certainly room for improvement. This section presents the
results of utilizing network motifs in a new way to characterize application graphs.
The result files from FANMOD were parsed for significant motifs (those with a p-value
of 0 and that occur in at least 1% of subgraphs), finding 130 motifs to be used as
profile attributes. Only those hosts that participate in at least one of the significant
motifs were considered in this part of the study. As a result, the total number of
profiles is 3,546 instead of 3,940 as in the traditional approach.
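A sketch of this filtering and profile-construction step; the `(motif_id, frequency, p_value)` record format is an assumption standing in for the parsed FANMOD output:

```python
def significant_motifs(records, min_frequency=0.01):
    """Keep motifs with a p-value of 0 that occur in at least 1% of subgraphs.

    Each record is assumed to be a (motif_id, frequency, p_value) tuple
    distilled from a FANMOD result file.
    """
    return [m for m, freq, p in records if p == 0 and freq >= min_frequency]

def motif_profile(host_motifs, significant):
    """Binary vertex profile: attribute i is 1 if the host participates in
    significant motif i at least once, and 0 otherwise."""
    return [1 if m in host_motifs else 0 for m in significant]
```

A host whose profile is all zeros participates in no significant motif and is dropped, which is why only 3,546 of the 3,940 hosts remain.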
Table 6.5 presents the classifier results using motif-based profiles. Although the
discussion comparing the performance of profile types is saved for later, the reader
may notice that the class recall has improved for six of the seven protocols measured
— all except for AIM. Only four protocols score greater than 80% in the motif-based
approach, but three of these four (DNS, Kazaa, Netbios) have improved well into the
90% range. Figure 6.2 provides the profile collisions for the motif approach, while the
numbers of single and multi-class ties are given in Table 6.6. Again, the protocols
involved in a higher percentage of multi-class ties generally score lower than those
that do not.
          AIM      DNS      HTTP     Kazaa    MSDS     Netbios  SSH      Precision
AIM       277      10       61       0        21       0        33       68.91%
DNS       9        630      5        8        6        0        3        95.31%
HTTP      136      13       665      1        21       6        29       76.35%
Kazaa     5        0        0        370      4        34       2        89.16%
MSDS      4        4        15       1        256      1        4        89.82%
Netbios   2        1        0        0        4        699      0        99.01%
SSH       35       1        24       2        60       0        84       40.78%
Recall:   59.19%   95.60%   86.36%   96.86%   68.82%   94.46%   54.19%   Σ 84.07%
Table 6.5: Confusion matrix of unweighted motif-based profiles
AIM DNS HTTP Kazaa MSDS Netbios SSH
Single-class Ties 93 576 446 195 231 611 17
Multi-class Ties 283 32 223 181 119 47 123
Table 6.6: Number of single and multi-class ties for motif-based profiles
Synthesizing the accuracy results and collision information, it can be seen that the
classifier confuses two protocol pairs in particular with one another: AIM with HTTP
and Kazaa with Netbios. In the case of Netbios name service, a broadcast message is
sent to the local network to locate a particular machine that has a registered name.
Somewhat similarly, if a Kazaa user wishes to locate a file, they contact an active
supernode, which then queries the ordinary nodes attached to it for the desired file.
Even though AIM and HTTP are both classified more accurately with motifs
than with traditional measures, the indistinctness of the boundary between the two
protocols is a bit surprising. The arrangement of nodes in Figure 6.3(a) reflects the
expected communication patterns given the functional and social characteristics of
the HTTP protocol. Some web servers are more popular than others and would
have a higher degree count than others. Additionally, web servers will often establish
[Figure: seven charts of profile collision counts, panels (a) AIM, (b) DNS, (c) HTTP, (d) Kazaa, (e) MSDS, (f) Netbios and (g) SSH, each broken down by colliding protocol.]
Figure 6.2: Profile collisions for motif-based profiles
communications with other web servers to pull content from RSS feeds, ad servers,
or other content providers. With the exception of direct connections for file transfers,
all AIM communications go through a central server, so one would expect to see
stronger influence of a star topology in the application graphs. Figure 6.3(b) shows
that this does not seem to be the case. There are several possible reasons for this.
For example, the actual IP address of the central AIM server will be anonymized
to a different random IP address across each of the network trace files. Given the
popularity of instant messaging and the prevalence of the AIM client, it is possible
that there are actually several servers that handle connections, and are load-balanced
as necessary. A more in-depth examination of some application protocols will be
required in future work.
The other protocol that needs further explanation is SSH. Figure 6.2(g) shows
that SSH has a high collision rate with other protocols, and points out an important
weakness in the current motif-based approach. The application graph in Figure 6.3(c)
[Figure: drawings of three application graphs, panels (a) HTTP, (b) AIM and (c) SSH.]
Figure 6.3: Depiction of three application graphs: HTTP, AIM and SSH
Protocol AIM DNS HTTP Kazaa MSDS Netbios SSH
Data Kept 93.6% 96.9% 96.3% 95.5% 93.0% 97.4% 38.8%
Table 6.7: Percentage of original data used in motif-based profiles
shows the tendency of SSH application graphs to be very disconnected, comprised of
several much smaller components instead of one larger connected component. Current
motif profiles are based on order 3 and order 4 network motifs, so these pairs of
connections are ignored. This is made evident by Table 6.7, which shows that less
than 40% of SSH traffic is included in the motif model, significantly less than the
other six protocols. This difficulty presents another interesting area of work to be
performed in the future.
6.3 Weighted Profiles and Key Attributes
In an effort to improve the performance of the two types of classifiers, the attributes
of each profile were weighted using a genetic algorithm. This process only increases
the accuracy of each model slightly, but it also allows for the investigation of a prob-
lem space that might otherwise be too computationally expensive to explore. This
section details the results of weighting the attributes and discusses some of the key
characteristics.
6.3.1 Attribute Weights of Traditional Graph Measures
After running the evolutionary feature weighting experiment for thirty generations,
a set of attribute weights was obtained that increased the overall accuracy of the
traditional graph measure classifier roughly 4%. The weights of the eleven attributes
are provided in Table 6.8.
Attribute                 Weight
Indegree                  0.259
Outdegree                 0.000
Total Degree              0.023
Clustering Coefficient    0.172
Betweenness Centrality    0.271
Degree Centrality         1.000
Closeness Centrality      0.257
Eigenvector Centrality    0.633
Eccentricity              0.596
Center                    0.393
Periphery                 0.096
Table 6.8: Attribute weights for traditional graph measures
The weights of the attributes reflect the interaction of all eleven graph measures
and are the values that maximize the accuracy of the classifier. They should there-
fore not be interpreted too literally in isolation. For example, degree centrality was
weighted with a 1.000, the highest possible weight. This does not mean that an accu-
rate classifier could be built on this attribute alone. Section 6.5 addresses this point
further. However, the table does still provide some insight into which attributes
might be more useful for classification tasks. It is not surprising that the
degree counts are not weighted especially highly, as they are very generic measures.
The “periphery” attribute has a low weight because it is not very discriminating; of
the 3,940 profiles, 2,132 are periphery nodes. In contrast, only 773 nodes are center
nodes, so the “center” attribute is more distinctive and receives a higher weight.
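Applying a learned weight vector at classification time is inexpensive. A sketch using the Table 6.8 weights, under the (assumed) convention that the weights scale the squared attribute differences:

```python
# Attribute weights from Table 6.8, in the order listed there.
WEIGHTS = [0.259, 0.000, 0.023, 0.172, 0.271, 1.000,
           0.257, 0.633, 0.596, 0.393, 0.096]

def weighted_euclidean(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance between two eleven-attribute profiles.
    A zero weight (here, outdegree) removes that attribute entirely."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5
```

Because outdegree's weight is exactly 0.000, two profiles that differ only in outdegree are treated as identical by the weighted classifier.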
Figure 6.4 shows the per-protocol accuracy of both unweighted and weighted profiles
based on traditional graph measures. The confusion matrix for weighted attributes
can be found in Appendix D.
[Figure: bar chart, x-axis protocols (AIM, DNS, HTTP, Kazaa, MSDS, Netbios, SSH, Overall), y-axis % correctly labeled, comparing unweighted and weighted profiles.]
Figure 6.4: Accuracy of unweighted vs. weighted traditional graph measure profiles
As one would hope, the weighted attribute
profiles perform slightly better than their unweighted counterpart for each protocol.
The class recall for SSH is again very low for the same reasons described previously.
6.3.2 Attribute Weights of Motif-based Measures
Because of the high dimensionality of motif-based profiles, it becomes more impor-
tant to take advantage of other methods such as genetic algorithms to explore the
attributes. Figure 6.5 depicts the ten most heavily weighted motifs and their corre-
sponding weighted values. In the figure, green nodes represent clients, black nodes
represent servers and red nodes represent peers, as specified in Definition 1. As with
the weights of traditional graph measures, these weights reflect the combined infor-
mation from all attributes.
[Figure: ten motif diagrams (drawings omitted); their weights are (a) 1.000, (b) 0.662, (c) 0.650, (d) 0.632, (e) 0.585, (f) 0.545, (g) 0.537, (h) 0.503, (i) 0.503, (j) 0.502.]
Figure 6.5: The ten highest-weighted motifs and their corresponding weights
Motif 6.5(a) is the most highly weighted of the significant motifs found in this study
and only occurs in two application graphs. There are 24 instances of it in a MSDS
graph and another 137 instances of it in a Netbios graph. Although weighted lower,
motif 6.5(b) occurs overwhelmingly more frequently in Netbios (1,007 instances) than
it does in MSDS (3 instances) or DNS (2 instances). If a node occurs in both of these
motifs, there is a good chance that the host is using the Netbios application.
Unfortunately, the weights do not indicate which particular application(s) a motif
helps to delineate, only which motifs increase the overall accuracy of the
classifier. Perusing the profile data reveals that instances of many motifs are found
in several or all of the applications studied. This is not to say motif profiles are
unsuitable for describing computer networks (as they have shown a great deal of
promise already), rather that no single motif is indicative of a particular application.
Given the complexity of the highly dynamic interactions that occur in computer
networks, this is not entirely surprising. It is possible that different types of motifs
(described in Chapter 7) could be even more beneficial than the current generation
of motifs and motif profiles.
One final point to address before moving on to a comparison of traditional and
motif-based profiles is the performance of unweighted vs. weighted motif profiles,
shown in Figure 6.6. There is a slight increase in classification accuracy in each of
the protocols except for Kazaa, which sees no additional gain from attribute weight-
ing. The overall accuracy of the model increases to 85.70%, a difference of 1.63%.
Appendix D contains the confusion matrix for weighted motif profiles.
[Figure: bar chart, x-axis protocols (AIM, DNS, HTTP, Kazaa, MSDS, Netbios, SSH, Overall), y-axis % correctly labeled, comparing unweighted and weighted motif-based profiles.]
Figure 6.6: Accuracy of unweighted vs. weighted motif-based profiles
6.4 Comparison of Profile Types
This section compares the two profile types side-by-side and discusses some of the
advantages and disadvantages of each approach. The motif-based model generally
outperforms traditional graph measures, though this is not always the case as shown
in Figure 6.7. Notably, the traditional profiles significantly outperform motif-based
profiles for classifying AIM traffic, while the reverse is true for SSH (again, the SSH
results should be taken with a grain of salt due to the fact that slightly less than 40%
of SSH traffic is classified by the second approach).
[Figure: two bar charts of per-protocol accuracy comparing traditional graph measures with motif-based profiles, panels (a) unweighted and (b) weighted.]
Figure 6.7: Accuracy comparison of profile types, (a) unweighted and (b) weighted
Weighting the profile attributes benefits traditional graph measures more than
motif descriptions. One reason for this might be the type of data used to describe
each profile. Traditional profiles are composed of a mixture of binary, real-valued
and integer data. In addition to being purely binary, motif profiles are also sparse;
most nodes only participate in very few of the 130 significant motifs. As a result,
many of the motif weights are multiplied by zero, resulting in no information gain.
Regardless, weighting the attributes does not change which type of classifier performs
better for a particular protocol with the exception of HTTP. Unweighted motifs have
a 4% accuracy advantage over traditional measures, but fall to a 1% disadvantage
when the profiles are weighted.
Advantages and Disadvantages of Profile Types
Motif-based profiles have a slight advantage over traditional measures in a few cate-
gories. The overall accuracy of the motif-based classifiers is higher than that of the
traditional classifiers, both unweighted and weighted. Also, motif profiles result in
more favorable overlap with other profiles. Only 10% of motif profiles do not match
another profile, and 61% match profiles of a single label (note that “match” means
a Euclidean distance of zero, not an identical profile). With traditional measures on
the other hand, 58% match a single label, and nearly 25% of profiles do not match
any other profile.
Traditional graph measures are less demanding to compute than their motif coun-
terparts. Even though some graph measures are O(n³) where n is the order of the
graph, calculations can be performed extremely quickly because n is small in the
application graphs examined: 40 ≤ n ≤ 80. Motif searches are computationally
expensive and can be prohibitively so when searching for large motifs. This study
found that an exhaustive search of order 3 motifs could be completed in roughly 7-8
minutes, while an exhaustive search of order 4 motifs took 6-8 hours to complete.
6.5 Considerations for Optimizing Classifier Performance
There are several ways in which the performance of application classifiers may be
improved. An “on the fly” traffic classification system would need to be as fast as
possible so that network latency is minimized. One way to achieve increased classifier
speed is to reduce the dimensionality of the data. Already the evolutionary feature
weighting performed by the genetic algorithm has indicated which attributes are more
valuable to the classifier. Attributes below a certain threshold value could be ignored,
at the expense of a little bit of accuracy. Figure 6.8 demonstrates the accuracy of
models based on a single traditional graph measure.
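Thresholding the learned weights, as suggested above, amounts to a simple filter; the cutoff value here is an arbitrary example:

```python
def prune_attributes(profiles, weights, threshold=0.1):
    """Drop attributes whose learned weight falls below the threshold,
    shrinking each profile from d to d' attributes and cutting the
    per-distance cost accordingly."""
    keep = [i for i, w in enumerate(weights) if w >= threshold]
    return [[p[i] for i in keep] for p in profiles], keep
```

Applied to the Table 6.8 weights with a 0.1 cutoff, this would discard outdegree, total degree, and periphery, leaving eight of the eleven attributes.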
By far, eigenvector centrality, closeness centrality and degree centrality provide
the most information to the classifier, each scoring better than 65% on its own. Most
of the attributes score no better than a random guess with a 1/7 chance of being correct,
shown as a vertical dotted line in the graph. Recall that eigenvector centrality assigns
a centrality score to a vertex proportional to that of its neighbors. This metric is more
“social” in nature than some of the others in that the centrality scores of neighboring
vertices are considered in the calculation.
[Figure: bar chart of classification accuracy (% correctly labeled) using a single attribute, for Center, Periphery, Indegree, Eccentricity, Total Degree, Outdegree, Clustering Coefficient, Betweenness Centrality, Degree Centrality, Closeness Centrality and Eigenvector Centrality.]
Figure 6.8: Accuracy of single attribute classification
The idea of “distance” in an application
graph is a bit tricky because it does not consider the number of hops data must go
through to reach its final destination nor the physical distance between hosts. There-
fore the “closeness” of closeness centrality describes the social usage of an application
and suggests that the average shortest path length between nodes differs somewhat
from application to application. The degree centrality is essentially a weighted degree
count, which again suggests that the sizes of connected components within application
graphs are important, influenced in part by the popularity of servers and services.
In addition to reducing the dimensionality of the attribute profiles, one can also
consider reducing the number of data points used in the training phase of the nearest
neighbor algorithm. Exploring the effectiveness of smaller classification models has
two important implications. First of all, it suggests that a more lightweight classifier
could be built when heading towards a real-time implementation. Secondly, it shows
that the methods proposed in this study can be used for smaller networks and not
just those containing thousands of nodes.
To test this hypothesis, several unweighted classifiers were built for each profile
type with an increasing number of nodes in each model. The data was selected at
random, while keeping the proportions of each class label the same as in the models
previously discussed. All of the test parameters are as they were before, including
the use of 10-fold cross validation to determine the accuracy. The results of this
experiment are illustrated in Figure 6.9.
[Figure: two line charts of per-protocol accuracy (% correctly labeled) as the number of profiles grows from 500 to 4,000, panels (a) traditional graph measures and (b) motif-based profiles.]
Figure 6.9: Comparison of profile types as the size of the training set increases
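The stratified draw used to build these smaller models can be sketched as follows (the helper name and proportional-allocation rule are illustrative):

```python
import random
from collections import defaultdict

def stratified_sample(data, n, seed=0):
    """Draw roughly n (profile, label) pairs at random while keeping each
    class's share of the sample equal to its share of the full data set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for point, label in data:
        by_label[label].append((point, label))
    sample = []
    for points in by_label.values():
        k = round(n * len(points) / len(data))   # proportional allocation
        sample.extend(rng.sample(points, min(k, len(points))))
    return sample
```

Repeating the cross-validation experiment on samples of increasing size produces curves like those in Figure 6.9.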
The classifiers tend to perform slightly better as the number of training data
points increases, but sometimes negligibly so. DNS, Kazaa and Netbios seem to
benefit the least from having additional training examples, while AIM and MSDS
fluctuate quite a bit more. It is interesting to note that applications which were previously
classified more accurately also exhibit more stable behavior in Figure 6.9. This is
true for both profile types. For example, AIM and MSDS were by far the least
accurately classified protocols using a motif-based approach, and their trend lines
exhibit the most volatility in 6.9(b). In contrast, DNS, Kazaa and Netbios were the
most accurately classified protocols, and their trend lines are nearly flat. This finding
suggests that the protocols which can be clearly described by a profile (traditional or
motif-based) can be learned with a relatively low number of training points. Further
investigation into the AIM and MSDS protocols is needed to understand why the
accuracy of AIM peaks at 2,500 nodes and then declines, while the accuracy of MSDS
peaks at 1,000 nodes and then drops significantly.
6.6 Limitations of Current Approach
This chapter has demonstrated the promise of using vertex profiles to identify appli-
cation usage across a computer network. A few of the shortcomings of the proposed
methodology have been touched upon already, but are summarized here. Graph size
is an important factor to consider, since more “interesting” vertex characterizations
arise from the complex interactions of hosts. Motif-based profiles become more de-
scriptive as hosts communicate with a larger number of other hosts. The current
generation of classification models suffers when there is heavy overlap among profiles,
resulting in a distance of zero. A more intelligent tie-breaking scheme could yield
better performance for those protocols that share application graph characteristics.
Currently, the motif-based approach only considers motifs of order 3 and order 4.
This causes a problem for protocols like SSH that tend to have a large number of
small connected components instead of fewer large connected components. Some of
the stages in the process are computationally expensive. The genetic algorithm used
for feature weighting is a very time-consuming endeavor and does not yield the de-
sired increase in performance. On the other hand, once a network is learned and a
classifier built, the attribute weights need only be computed once and can be applied
in O(n) time to the attributes collected for the test points. Additionally, the analysis
techniques put forth by this work require a view of the network that shows as many
of the interactions as possible.
Chapter 7: Conclusions and Future Work
The tasks of managing and securing computer networks are becoming increasingly
complicated due to the use of applications over non-standard port numbers as well as
the use of data encryption techniques. These practices subvert a network administra-
tor’s ability to provide quality of service to legitimate users, ensure compliance with
security policies, and prevent outside intruders from gaining access to a system.
Intrusion detection systems and network monitoring tools that rely on deep packet
inspection are ineffective when data transfers are encrypted. Several previous studies
have attempted to classify network application usage by examining flow characteris-
tics pertaining to a particular series of communications between two hosts, examining
attributes such as the size of the data packets being sent, packet inter-arrival time
and session lengths.
This thesis has proposed an interdisciplinary approach to the study of networks
through the characterization of application graphs. It is an “in the dark” methodology
that relies on the communication patterns found in a network, rather than the contents
of packet payloads or port numbers used by the application. A wide variety of graph
measures heavily borrowed from social network analysis are used to create vertex
profiles to determine the application in which the host participates.
Furthermore, this work has uniquely applied motif-based analysis, used almost
exclusively in systems biology, to the study of application graphs. This method of
detecting significant subgraph patterns has shown a great deal of promise for modeling
and classifying application protocols. It has been shown that motifs can not only be
used to express communication patterns, but also to indicate the functional role of a
host. In this study, nodes were labeled as either a client, server, or peer based upon
their interactions at the transport layer. This information was used to generate motifs.
A second type of vertex profile was defined, based upon a node’s participation (or lack
thereof) in the motifs that were found to be significant across all of the application
protocols examined.
Through empirical testing, this study has shown that both types of profiles can
determine what application a host is using with a reasonable amount of accuracy.
Although some protocols like SSH and AIM present difficulties, many of the others
can be classified with greater than 80% accuracy, and in the case of weighted motif
profiles, as high as 96% for the peer-to-peer application Kazaa. In general, a motif-
based approach out-performs traditional graph measures and seems to have more
potential for related work in the future.
One issue to consider is how to best manage connected components in application
graphs that only contain two nodes. This phenomenon was found to occur frequently
in SSH, contributing to the fact that less than 40% of SSH hosts were classified by
the motif-based approach. Ignoring vertex colors, there are three possible order 2
motifs: A → B, A ← B, and A ↔ B. Unfortunately, the edge-switching operations
for creating random graphs will not provide sufficiently randomized graphs, so it is
unlikely to find a particular pattern that is statistically significant.
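Ignoring vertex colors, the order-2 patterns can simply be counted directly rather than tested for significance. A sketch of such a dyad census over a directed edge list (host names are illustrative):

```python
def dyad_census(edges):
    """Count order-2 patterns in a directed edge list: asymmetric dyads
    (A->B only) and mutual dyads (A<->B). Without vertex colors, A->B and
    A<-B are the same pattern up to relabeling, so two counts suffice."""
    edge_set = set(edges)
    seen, asymmetric, mutual = set(), 0, 0
    for a, b in edge_set:
        pair = frozenset((a, b))
        if pair in seen:
            continue                 # already classified this host pair
        seen.add(pair)
        if (b, a) in edge_set:
            mutual += 1
        else:
            asymmetric += 1
    return asymmetric, mutual
```

Adding these raw counts as extra profile attributes could let the two-node components that dominate SSH graphs contribute to classification.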
Currently, the only information utilized in the creation of application graphs is the
source and destination IP addresses, and the source and destination port numbers.
The motif-based approach provides some additional information by using vertex colors
to represent node types, but other information could also be exploited to color the
edges. For example, colors could be used to denote the amount of data transferred
between two nodes. This would help create more detailed profiles that might be able
to distinguish between applications with similar connection patterns, but use network
bandwidth in ways that are distinct from one another. Also related to the creation of
application graphs, it would be interesting to observe the data flow through all nodes
involved in a particular activity and not just the flow on a particular port number.
For example, a web server might request content from an application or database
server in response to a client's request for a web document; these back-end
communications to related services occur on ports other than 80, the usual HTTP port.
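As a minimal sketch of the edge-coloring idea described above (the bin boundaries and names are illustrative assumptions, not part of the thesis implementation), total bytes exchanged by a host pair could be binned into a small set of integer edge colors, the form that FANMOD-style tools expect:

```python
def edge_color(total_bytes, bounds=(1000, 100000)):
    """Map a byte count to a discrete edge color:
    0 = light, 1 = medium, 2 = heavy traffic (bounds are illustrative)."""
    for color, bound in enumerate(bounds):
        if total_bytes < bound:
            return color
    return len(bounds)

# Color every host pair in a small flow summary.
flows = {("10.0.0.1", "10.0.0.2"): 512,      # light traffic
         ("10.0.0.1", "10.0.0.3"): 2500000}  # heavy traffic
colored = {pair: edge_color(b) for pair, b in flows.items()}
print(colored[("10.0.0.1", "10.0.0.2")])  # 0
```

Two applications with identical connection topology would then produce different colored motifs whenever their per-edge traffic volumes fall into different bins.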
Another area to explore is the different machine learning techniques that can be
applied to vertex profiles for classification and feature weighting; nearest-neighbor
and genetic algorithms are only two possibilities. The many parameters of these
algorithms require further tuning to optimize the classification accuracy of the models
built. This thesis describes a process that allows the substitution of particular
algorithms. For example, a Bayes classifier or support vector machine could be used
instead of nearest-neighbor, while principal component analysis could be used in place
of the genetic algorithm [62, 67].
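The substitution described here can be pictured as a pluggable classification step. The 1-NN below is a minimal pure-Python sketch with illustrative names; any model exposing the same interface could replace it without changing the rest of the pipeline:

```python
def nearest_neighbor(train, query):
    """Return the label of the training profile closest to query
    (squared Euclidean distance); train is a list of (vector, label)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: dist(item[0], query))[1]

def classify(profiles, query, model=nearest_neighbor):
    # Swapping classifiers means passing a different `model` function
    # (e.g. one wrapping a Bayes classifier or an SVM).
    return model(profiles, query)

train = [((0.0, 0.0), "DNS"), ((1.0, 1.0), "HTTP")]
print(classify(train, (0.9, 1.2)))  # HTTP
```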
Although not used in the current approach, temporal information could also prove
to be useful in classifying application protocols. One approach would be to encode
information such as session lengths or packet inter-arrival times into the edge colors.
Another use of time-based information would be to observe communication patterns
over a much smaller time window (on the order of seconds or minutes instead of hours)
and determine how a node’s participation in motifs changes over time.
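One way to realize the smaller-window idea is to bucket flow records by a fixed window width before building each application graph, so motif participation can be recomputed per window. The record format and 60-second default below are illustrative assumptions:

```python
def window_flows(flows, width=60.0):
    """Group (timestamp, src, dst) records into fixed-width time windows,
    returning {window_index: [(src, dst), ...]}; each window's edge list
    could then be fed to the motif-counting step separately."""
    windows = {}
    for ts, src, dst in flows:
        idx = int(ts // width)  # which window this record falls in
        windows.setdefault(idx, []).append((src, dst))
    return windows

flows = [(3.0, "A", "B"), (65.0, "A", "C"), (70.0, "B", "C")]
print(window_flows(flows))  # {0: [('A', 'B')], 1: [('A', 'C'), ('B', 'C')]}
```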
Moving away from implementation details and algorithm decisions, this type of
research can be expanded beyond application identification. Assuming the process
can be refined to achieve high accuracy in protocol recognition, this approach
could be used to detect anomalies in network behavior. Hosts that participate in
activities resembling a known application but differing by more than an established
threshold would be considered anomalous for that application and
trigger an alert. One final consideration is pushing this research further into the
realm of social network analysis, applying it to the detection of communities and
associations within a network, such as locating all hosts that are part of the same
online gaming community.
References
[1] P. Dyson, Dictionary of Networking. Sybex, 1999.
[2] A. S. Tanenbaum, Computer Networks. Prentice Hall, 2003.
[3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, October 2002. [Online]. Available: http://dx.doi.org/10.1126/science.298.5594.824
[4] F. Rasche and S. Wernicke, “Fanmod manual,” 2006.
[5] Input federal IT market forecast 2008–2013. [Online]. Available: http://www.input.com/corp/library/detail.cfm?itemid=5437&cmp=OTC-fedinfosecfcst08
[6] The SANS security policy project. [Online]. Available: http://www.sans.org/resources/policies/
[7] S. Christey and R. A. Martin, "CVE - vulnerability type distributions in CVE," 2007, technical white paper on the distribution of vulnerabilities reported to CVE.
[8] Internet Assigned Numbers Authority: assigned port numbers. [Online]. Available: http://iana.org/assignments/port-numbers
[9] Netstat. [Online]. Available: http://www.netstat.net/
[10] Wireshark: a network protocol analyzer. [Online]. Available: http://www.wireshark.org/
[11] Snort - the de facto standard for intrusion detection/prevention. [Online]. Available: http://www.snort.org/
[12] M. E. J. Newman, "Coauthorship networks and patterns of scientific collaboration," in Proceedings of the National Academy of Science, 2004, pp. 5200–5205.
[13] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[14] C. Yang and T. Ng, "Terrorism and crime related weblog social network: Link, content analysis and information visualization," Intelligence and Security Informatics, 2007 IEEE, pp. 55–58, May 2007.
[15] E. Yeger-Lotem, S. Sattath, N. Kashtan, S. Itzkovitz, R. Milo, R. Y. Pinter, U. Alon, and H. Margalit, "Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 16, pp. 5934–5939, April 2004. [Online]. Available: http://dx.doi.org/10.1073/pnas.0306752101
[16] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nat Genet, vol. 31, no. 1, pp. 64–68, May 2002. [Online]. Available: http://dx.doi.org/10.1038/ng881
[17] J. Grochow and M. Kellis, "Network motif discovery using subgraph enumeration and symmetry-breaking," 2007, pp. 92–106.
[18] U. Alon, "Network motifs: theory and experimental approaches," Nature Reviews Genetics, vol. 8, no. 6, pp. 450–461, Jun. 2007.
[19] J. Day and H. Zimmermann, "The OSI reference model," Proceedings of the IEEE, vol. 71, no. 12, pp. 1334–1340, Dec. 1983.
[20] V. Cerf and R. Kahn, "A protocol for packet network intercommunication," Communications, IEEE Transactions on [legacy, pre-1988], vol. 22, no. 5, pp. 637–648, May 1974.
[21] L. Euler, "Solutio problematis ad geometriam situs pertinentis," in Commentarii academiae scientiarum imperialis Petropolitanae. St. Petersburg Academy, 1736, vol. 8.
[22] R. G. Busacker and T. L. Saaty, Finite Graphs and Networks, ser. International Series in Pure and Applied Mathematics. McGraw-Hill, 1965.
[23] G. Chartrand and P. Zhang, Introduction to Graph Theory. McGraw-Hill, 2005.
[24] A. A. Nanavati, R. Singh, D. Chakraborty, K. Dasgupta, S. Mukherjea, G. Gurumurthy, and A. Joshi, "Analyzing the Structure and Evolution of Massive Telecom Graphs," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 703–718, March 2008.
[25] E. W. Dijkstra, "A note on two problems in connexion with graphs," Numerische Mathematik, vol. 1, pp. 269–271, 1959.
[26] L. C. Freeman, "Centrality in social networks: conceptual clarification," Social Networks, vol. 1, no. 3, pp. 215–239.
[27] P. Bonacich, "Technique for analyzing overlapping memberships," Sociological Methodology, 1972.
[28] M. E. J. Newman, Mathematics of Networks. Palgrave Macmillan, 2008.
[29] D. J. Watts and S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, vol. 393, 1998.
[30] S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagl, K. Levitt, J. Rowe, S. Staniford-Chen, R. Yip, and D. Zerkle, "GrIDS - a graph-based intrusion detection system for large networks," in Proceedings of the 19th National Information Systems Security Conference, 1996, pp. 361–370.
[31] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, "BLINC: multilevel traffic classification in the dark," in SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2005, pp. 229–240.
[32] S. Wernicke and F. Rasche, "FANMOD: a tool for fast network motif detection," Bioinformatics, vol. 22, no. 9, pp. 1152–1153, 2006.
[33] R. Itzhack, Y. Mogilevski, and Y. Louzoun, "An optimal algorithm for counting network motifs," Physica A, vol. 381, pp. 482–490, Jul. 2007.
[34] S. Mangan, A. Zaslaver, and U. Alon, "The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks," Journal of Molecular Biology, vol. 334, no. 2, pp. 197–204, November 2003.
[35] Tcpdump/libpcap public repository. [Online]. Available: http://www.tcpdump.org/
[36] J. Postel, "Internet Protocol," RFC 791 (Standard), Sep. 1981, updated by RFC 1349. [Online]. Available: http://www.ietf.org/rfc/rfc791.txt
[37] J. Postel, "Transmission Control Protocol," RFC 793 (Standard), Sep. 1981, updated by RFC 3168. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt
[38] J. Postel, "User Datagram Protocol," RFC 768 (Standard), Aug. 1980. [Online]. Available: http://www.ietf.org/rfc/rfc768.txt
[39] R. Pang, M. Allman, V. Paxson, and J. Lee, "The devil and packet trace anonymization," ACM Computer Communication Review, vol. 36, no. 1, pp. 29–38, January 2006. [Online]. Available: http://www.icir.org/mallman/papers/devil-ccr-jan06.pdf
[40] G. Iannaccone, C. Diot, I. Graham, and N. McKeown, "Monitoring very high speed links," in IMW '01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. New York, NY, USA: ACM, 2001, pp. 267–271.
[41] T. Henderson, D. Kotz, and I. Abyzov, "The changing usage of a mature campus-wide wireless network," Computer Networks, vol. In Press, Accepted Manuscript. [Online]. Available: http://dx.doi.org/10.1016/j.comnet.2008.05.003
[42] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, "YouTube traffic characterization: a view from the edge," in IMC '07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2007, pp. 15–28.
[43] E. Blanton. (2008, January) tcpurify. [Online]. Available: http://irg.cs.ohiou.edu/ eblanton/tcpurify/
[44] T. Gamer, C. P. Mayer, and M. Scholler, "PktAnon - A Generic Framework for Profile-based Traffic Anonymization," PIK Praxis der Informationsverarbeitung und Kommunikation, vol. 2, pp. 67–81, Jun. 2008.
[45] D. Koukis, S. Antonatos, D. Antoniades, E. P. Markatos, P. Trimintzios, and M. Fukarakis, "CRAWDAD tool tools/sanitize/generic/anontool (v. 2006-09-26)," downloaded from http://crawdad.cs.dartmouth.edu/tools/sanitize/generic/AnonTool, Sep. 2006.
[46] (2005) LBNL enterprise trace repository. [Online]. Available: http://www.icir.org/enterprise-tracing/
[47] MIT Lincoln Laboratory: 1999 DARPA Intrusion Detection Evaluation Data Set. [Online]. Available: http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1999data.html
[48] D. Kotz, T. Henderson, and I. Abyzov, "CRAWDAD trace dartmouth/campus/tcpdump/fall03 (v. 2004-11-09)," downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03, Nov. 2004.
[49] R. Chandra, R. Mahajan, V. Padmanabhan, and M. Zhang, "CRAWDAD data set microsoft/osdi2006 (v. 2007-05-23)," downloaded from http://crawdad.cs.dartmouth.edu/microsoft/osdi2006, May 2007.
[50] OSCAR protocol. [Online]. Available: http://dev.aol.com/aim/oscar/
[51] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, "Hypertext transfer protocol – HTTP/1.1," RFC 2616 (Standard), Jun. 1999. [Online]. Available: http://www.ietf.org/rfc/rfc2616.txt
[52] P. V. Mockapetris, "Domain names - implementation and specification," RFC 1035 (Standard), United States, 1987. [Online]. Available: http://www.ietf.org/rfc/rfc1035.txt
[53] Active Directory. [Online]. Available: http://www.microsoft.com/windowsserver2008/en/us/active-directory.aspx
[54] R. Marty. AfterGlow. [Online]. Available: http://www.afterglow.sourceforge.net/
[55] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA, USA, Aug. 2008, pp. 11–15.
[56] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, "mfinder tool guide," 2002.
[57] F. Schreiber and H. Schwobbermeyer, "MAVisto: a tool for the exploration of network motifs," 2005.
[58] W. de Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[59] S. Wernicke, "Efficient detection of network motifs," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 4, pp. 347–359, 2006.
[60] W. Mendenhall and R. J. Beaver, Introduction to Probability and Statistics, 8th ed. PWS-Kent Publishing Company, 1991.
[61] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, "YALE: rapid prototyping for complex data mining tasks." New York, NY, USA: ACM, 2006, pp. 935–940.
[62] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.
[63] G. F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 5th ed. Addison Wesley, 2005.
[64] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, 1998.
[65] J. J. Grefenstette, "Optimization of control parameters for genetic algorithms," IEEE Transactions on Systems, Man and Cybernetics, vol. 16, no. 1, pp. 122–128, Jan. 1986.
[66] J. D. Schaffer, R. A. Caruana, L. J. Eshelman, and R. Das, "A study of control parameters affecting online performance of genetic algorithms for function optimization," in Proceedings of the Third International Conference on Genetic Algorithms. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 51–60.
[67] D. Lay, Linear Algebra and Its Applications, 2nd ed. Addison Wesley, 2000.
Appendix A: Examples of Application Graphs
Figure A.1: Application graphs depicting AIM communications
Figure A.2: Application graphs depicting DNS communications
Figure A.3: Application graphs depicting HTTP communications
Figure A.4: Application graphs depicting Kazaa communications
Figure A.5: Application graphs depicting MSDS communications
Figure A.6: Application graphs depicting Netbios communications
Figure A.7: Application graphs depicting SSH communications
Appendix B: Code Listings
Listing B.1: tshark2mysql.py – stores pcap data into a MySQL database
#!/usr/bin/python
# This file parses tshark output from stdout and inserts it into a MySQL
# database. It assumes the database has already been created and
# will create the necessary table.
#
# The tshark command is:
#     tshark -t e -r <pcap file> tcp or udp

import sys
import MySQLdb

if len(sys.argv) != 2:
    sys.exit("Supply name of table to store data in\n")

try:
    conn = MySQLdb.connect(host="localhost",
                           user="root",
                           passwd="pass",
                           db="data")
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS %s" % sys.argv[1])
cursor.execute("""CREATE TABLE %s (
    id INT(11) NOT NULL AUTO_INCREMENT,
    ts DOUBLE NOT NULL DEFAULT '0.0',
    protocol VARCHAR(12) NOT NULL,
    sip VARCHAR(15) NOT NULL,
    sport INT(5) NOT NULL DEFAULT '0',
    dip VARCHAR(15) NOT NULL,
    dport INT(5) NOT NULL DEFAULT '0',
    length INT(11) NOT NULL DEFAULT '0',
    PRIMARY KEY id (id)
);""" % sys.argv[1])

rc = 0
while True:
    line = sys.stdin.readline()
    if not line: break
    v = line.split(' ')
    tmp = []
    for i in range(len(v)):
        if v[i] not in ('', '->', ' '):
            tmp.append(v[i])
    v = tmp
    if len(v) == 8:
        try:
            ts = float(v[1])
            sip = v[2]
            dip = v[3]
            sport = int(v[4])
            dport = int(v[5])
            proto = v[6]
            length = int(v[7][:-2])  # strip off the newline character
            sql = "INSERT INTO %s (ts, protocol, sip, sport, dip, dport, length) " \
                  "VALUES (%f, '%s', '%s', %d, '%s', %d, %d)" % (sys.argv[1], ts,
                  proto, sip, sport, dip, dport, length)
            try:
                cursor.execute(sql)
                rc += cursor.rowcount
            except MySQLdb.Error, e:
                print "Error [%d]: %d: %s" % (rc, e.args[0], e.args[1])
        except Exception, e:
            print "ERROR:", v

cursor.close()
conn.commit()
conn.close()

print("\n%d rows inserted into %s\n" % (rc, sys.argv[1]))
Listing B.2: graph utils.py – implementation of adjacency matrix conversion andeigenvector centrality using the NetworkX API
import networkx as NX
import math

def adj_matrix(G):
    """
    Function takes a networkx.Graph as an argument (undirected)
    and returns a list of lists representing the corresponding
    adjacency matrix. It can be referenced as you would
    a normal 2D matrix A[i][j]

    node IDs must be [1..G.order] (taken care of in eigenvector_centrality())
    """
    adj = []
    for n in G.nodes():
        row = []
        for m in range(len(G.nodes()) + 1): row.append(0)
        for m in NX.neighbors(G, n): row[m] = 1
        adj.append(row)
    # Get rid of first element of each row (nodes start at 1, adj is 0-based)
    for i in range(len(adj)): adj[i] = adj[i][1:]
    return adj

def eigenvector_centrality(G):
    """
    Function takes an undirected graph (Graph or XGraph) and
    returns a dictionary of eigenvector centralities, keyed
    by node ID (similar to centrality functions in networkx)

    Function will map node labels to integers [1..G.order]

    Algorithm adapted from: http://www.analytictech.com/networks/centaids.htm
    """
    eigenvector_centralities = {}
    evCentrality = []
    evUpdate = []
    maxValue = -1.0

    for i in range(G.order()):
        evCentrality.append(1.0)
        evUpdate.append(0.0)

    H = NX.convert_node_labels_to_integers(G, first_label=1, discard_old_labels=False)
    labels = {}
    for k, v in H.node_labels.iteritems(): labels[v] = k

    A = adj_matrix(H)

    # 30 iterations should be enough to converge to a solution
    for x in range(30):
        for i in range(G.order()):
            evUpdate[i] = 0.0
            for j in range(G.order()):
                if (A[i][j] != 0): evUpdate[i] += evCentrality[j]
        maxValue = 0
        for i in range(G.order()):
            maxValue += evUpdate[i] * evUpdate[i]
        maxValue = math.sqrt(maxValue)
        for i in range(G.order()):
            evCentrality[i] = evUpdate[i] / maxValue
    for i in range(1, G.order() + 1):
        eigenvector_centralities[labels[i]] = evCentrality[i-1]

    return eigenvector_centralities
Listing B.3: node props main.py – creates application graphs from MySQL databaseand computes traditional graph metrics using the NetworkX API
#!/usr/bin/python

"""
This program creates a DiGraph and calculates various graph metrics,
converting DiGraph to Graph as necessary for some metrics

Usage:
    arg1 = table name
    arg2 = port number
    arg3 = max # of nodes to consider
"""
import sys
import networkx as NX
import MySQLdb
from graph_utils import *

class Node:
    """Class to hold properties of nodes"""
    in_degree = 0
    out_degree = 0
    degree = 0
    clustering = 0
    betweenness_centrality = 0
    degree_centrality = 0
    closeness_centrality = 0
    eigenvector_centrality = 0
    eccentricity = 0
    is_center = 0
    is_periphery = 0

if len(sys.argv) != 4:
    sys.exit("Provide table name, port number, and # nodes at command line\n")

table = sys.argv[1]
port = sys.argv[2]
n_max = int(sys.argv[3])

# MySQL connection
try: conn = MySQLdb.connect(host="localhost", user="root", passwd="pass", db="data")
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))
cursor = conn.cursor()

sql = "SELECT sip, dip, sport, dport FROM %s WHERE sport=%s OR dport=%s" % (table, port, port)
try: cursor.execute(sql)
except MySQLdb.Error, e: sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

# Create a directed graph from SQL results
G = NX.DiGraph(name="%s_%s" % (port, n_max))
for i in range(cursor.rowcount):
    r = cursor.fetchone()
    if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])
    else:
        if G.order() < n_max: G.add_node(r[0])
        if G.order() < n_max: G.add_node(r[1])
        if r[0] in G and r[1] in G: G.add_edge(r[0], r[1])

# Calculate graph properties
myNodes = {}
for n in G.nodes():
    myN = Node()
    # Basic Properties
    myN.degree = G.degree(n)
    myN.out_degree = G.out_degree(n)
    myN.in_degree = G.in_degree(n)
    myNodes[n] = myN

"""
The following measures are all based on undirected graphs, and are
calculated for each of the connected components.
"""
H = G.to_undirected()
CCS = NX.connected_component_subgraphs(H)
for i in range(len(CCS)):
    if CCS[i].order() >= 2:
        cl = NX.clustering(CCS[i], with_labels=True)
        for k, v in cl.iteritems(): myNodes[k].clustering = v

        bc = NX.betweenness_centrality(CCS[i])
        for k, v in bc.iteritems(): myNodes[k].betweenness_centrality = v

        dc = NX.degree_centrality(CCS[i])
        for k, v in dc.iteritems(): myNodes[k].degree_centrality = v

        cc = NX.closeness_centrality(CCS[i])
        for k, v in cc.iteritems(): myNodes[k].closeness_centrality = v

        ec = eigenvector_centrality(CCS[i])
        for k, v in ec.iteritems(): myNodes[k].eigenvector_centrality = v

        d = NX.diameter(CCS[i])
        r = NX.radius(CCS[i])
        ecc = NX.eccentricity(CCS[i], with_labels=True)
        for k, v in ecc.iteritems():
            myNodes[k].eccentricity = v
            if v == d: myNodes[k].is_periphery = 1
            if v == r: myNodes[k].is_center = 1
    else: pass

# Print results
for k, v in myNodes.iteritems():
    s = ""
    s += "%s,%d,%d,%d,%f,%f,%f,%f,%f,%d,%d,%d" % (port, v.in_degree, v.out_degree,
        v.degree, v.clustering, v.betweenness_centrality, v.degree_centrality,
        v.closeness_centrality, v.eigenvector_centrality, v.eccentricity,
        v.is_periphery, v.is_center)
    print s
Listing B.4: motif results.py – parses FANMOD results for significant motifs
#!/usr/bin/python

"""
This program reads FANMOD result files and looks for significant motifs.
It associates each motif with an identifying integer ID and pickles
the results for later use
"""
import pickle
import glob
import pprint

file_dir = "/home/eddie/research/fanmod/res_csvs/"
files = glob.glob('/home/eddie/research/fanmod/res_csvs/*.txt')

size3 = {}      # mapping for size 3 motifs
id3 = 0         # first ID for size 3
size4 = {}      # mapping for size 4 motifs
id4 = 0         # first ID for size 4
p_thresh = 0.0  # get motifs with pvalue <= p_thresh
pct_occ = 1.0   # get motifs with frequency >= pct_occ

# Iterate through files and make ID associations
for i in range(len(files)):
    inFile = files[i]
    msize = int(inFile[-14])  # Motif size is stored in filename
    f = open(inFile, 'r')
    file = []
    for l in f:
        l = l[:-1]
        if len(l) > 1: file.append(l)

    # Ignore stuff at top of file
    file = file[24:]
    for j in range(0, len(file), msize):
        adjMatrix = ""
        l1 = file[j].split(',')
        if (float(l1[6]) <= p_thresh) and (float(l1[2][:-1]) >= pct_occ):
            # If this is a significant motif...
            adjMatrix += l1[1]
            for k in range(1, msize):
                adjMatrix += file[j+k].split(',')[1]
            if msize == 3 and adjMatrix not in size3.values():
                size3[id3] = adjMatrix
                id3 += 1
            if msize == 4 and adjMatrix not in size4.values():
                size4[id4] = adjMatrix
                id4 += 1

    f.close()  # Close file handle

# Pickle the resulting dictionaries
s3 = open('s3map.pkl', 'w')
pickle.dump(size3, s3)
s3.close()
s4 = open('s4map.pkl', 'w')
pickle.dump(size4, s4)
s4.close()
Listing B.5: motif profiles.py – creates motif profiles from FANMOD dump files
#!/usr/bin/python

"""
This file reads the pickled s3 and s4 maps and creates the binary
motif participation profiles for the NN clustering
"""
import sys
import pickle
from string import split
from pprint import pprint
from glob import glob

class profile:
    """Instances of profiles"""
    def __init__(self, id, l):
        self.ID = id
        self.label = l
        self.a = []
        for i in range(len(s3map) + len(s4map)):
            self.a.append(0)

    def mark(self, m):
        try: self.a[adjM[m]] = 1
        except KeyError: pass  # insignificant motif, not in our dict

# Unpickle adjMatrix mapping
s3 = open('s3map_1pct.pkl', 'r')
s3map = pickle.load(s3)
s3.close()
s4 = open('s4map_1pct.pkl', 'r')
s4map = pickle.load(s4)
s4.close()

# Create dictionary for adjMatrix mapping
adjM = {}
idx = 0
for k, v in s3map.iteritems():
    adjM[v] = idx
    idx += 1
for k, v in s4map.iteritems():
    adjM[v] = idx
    idx += 1

seen = {}  # dict for nodes
files = glob('/home/eddie/research/fanmod/data_new_lc/dumpfiles/*')
for i in range(len(files)):
#for i in range(3,4):
    # Open file for reading
    f = open(files[i], 'r')
    # need a unique prefix since we will have multiple node 0, 1, 2, etc...
    prefix = (files[i].split("/")[7]).split("_")[:-2]
    tmp = ""
    for j in range(len(prefix)): tmp += prefix[j] + "_"
    prefix = tmp
    label = prefix.split("_")[-2]
    for lines in f:
        l = lines.split(",")
        # ignore header lines in dump files
        if len(l) > 2:
            for j in range(1, len(l)):
                myNode = prefix + str(int(l[j]))
                if myNode not in seen:
                    seen[myNode] = profile(myNode, label)
                seen[myNode].mark(str(l[0]))
    f.close()  # close file handle

for k, v in seen.iteritems():
    print v.ID, v.label,
    for i in range(len(v.a)):
        if i < len(v.a)-1: print v.a[i],
        else: print v.a[i]
Appendix C: Test Parameters
Parameter [default] Value
subgraph (motif) size [default: 3] 3/4
# of samples used to determine approx. # of subgraphs [100000] 100000
full enumeration? 1(yes)/0(no) [1] 1 (yes)
directed? 1(yes)/0(no) [1] 1 (yes)
colored vertices? 1(yes)/0(no) [0] 1 (yes)
colored edges? 1(yes)/0(no) [0] 0 (no)
random type: 0(no regard)/1(global const)/2(local const) [2] 2
regard vertex colors? 1(yes)/0(no) [0] 1 (yes)
regard edge colors? 1(yes)/0(no) [0] 0 (no)
reestimate subgraph number? 1(yes)/0(no) [0] 0 (no)
# of random networks [1000] 5000
# of exchanges per edge [3] 5
# of exchange attempts per edge [3] 5
Table C.1: FANMOD test parameters
Listing C.1: GA weights.xml – RapidMiner process parameters for genetic algorithmand 1-NN classification
<?xml version="1.0" encoding="UTF-8"?>
<process version="4.2">
  <operator name="Root" class="Process" expanded="yes">
    <operator name="ExampleSource" class="ExampleSource">
      <parameter key="attributes" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.aml"/>
    </operator>
    <operator name="EvolutionaryWeighting" class="EvolutionaryWeighting" expanded="yes">
      <parameter key="crossover_type" value="shuffle"/>
      <parameter key="keep_best_individual" value="true"/>
      <parameter key="p_crossover" value="0.6"/>
      <parameter key="population_size" value="20"/>
      <parameter key="tournament_size" value="0.2"/>
      <operator name="WeightingChain" class="OperatorChain" expanded="yes">
        <operator name="XValidation" class="XValidation" expanded="yes">
          <parameter key="keep_example_set" value="true"/>
          <operator name="NearestNeighbors" class="NearestNeighbors">
            <parameter key="keep_example_set" value="true"/>
          </operator>
          <operator name="ApplierChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
              <list key="application_parameters">
              </list>
              <list key="prediction_parameters">
              </list>
            </operator>
            <operator name="Performance" class="Performance"></operator>
            <operator name="PerformanceWriter" class="PerformanceWriter">
              <parameter key="performance_file" value="/home/eddie/research/fanmod/motif_analysis/weighted/motifs.per"/>
            </operator>
          </operator>
        </operator>
      </operator>
    </operator>
  </operator>
</process>
Appendix D: Additional Classification Results
            AIM     DNS    HTTP   Kazaa    MSDS  Netbios     SSH   Precision
AIM           8       0       0       0       1        0       1      80.00%
DNS           0       8       0       2       2        0       0      66.67%
HTTP          2       0       9       0       1        0       1      69.23%
Kazaa         0       1       0       2       0        0       0      66.67%
MSDS          0       0       0       0       6        0       0     100.00%
Netbios       0       1       1       1       0       10       0      76.92%
SSH           0       0       0       0       0        0       8     100.00%
Recall:  80.00%  80.00%  90.00%  20.00%  60.00%  100.00%  80.00%   Overall: 78.46%
Table D.1: Confusion matrix of 65 application graphs using motif frequencies
             True 5190  True 53  True 80  True 1214  True 445  True 137  True 22  Precision
Pred. 5190:        453       27       63         36        84        41      316     44.41%
Pred. 53:            1      621        3          0         4        15        8     95.25%
Pred. 80:           23        8      712          1         9        26        4     90.93%
Pred. 1214:          1        1        2        361         3         0        0     98.10%
Pred. 445:          13        3        2          2       282         8       36     81.50%
Pred. 137:           2       19       15          0         3       669        0     94.49%
Pred. 22:            7        1        3          0        15         1       36     57.14%
Recall:         90.60%   91.32%   89.00%     90.25%    70.50%    88.03%    9.00%   Overall: 79.54%
Table D.2: Confusion matrix of weighted traditional graph measures
             True 5190  True 53  True 80  True 1214  True 445  True 137  True 22  Precision
Pred. 5190:        298        8       56          0        18         0       32     72.33%
Pred. 53:            7      632        3          9         2         0        4     96.19%
Pred. 80:          120       14      676          0        19         3       23     79.06%
Pred. 1214:          5        0        1        370         5        34        1     88.94%
Pred. 445:           2        4       15          2       269         1        1     91.50%
Pred. 137:           0        1        0          0         2       700        0     99.57%
Pred. 22:           36        0       19          1        57         2       94     44.98%
Recall:         63.68%   95.90%   87.97%     96.86%    72.31%    94.59%   60.65%   Overall: 85.70%
Table D.3: Confusion matrix of weighted motif profiles
Vita
Edward G. Allan, Jr.
Personal
• 4259 Cezanne Cir., Ellicott City, MD 21042
  email: [email protected]
  phone: (443) 812-6232
Education
• Master of Science, Computer Science
  Wake Forest University, Winston-Salem, NC
  December 2008
  Thesis: "Identifying Application Protocols in Computer Networks Using Vertex Profiles"
  GPA: 3.79
• Bachelor of Science, Computer Science
  Wake Forest University, Winston-Salem, NC
  December 2006
  GPA: 3.51
Publication
• Allan, Edward G., Horvath, Michael R., Kopek, Christopher V., Lamb, Brian T., Whaples, Thomas S., and Berry, Michael W.: Anomaly Detection Using Nonnegative Matrix Factorization, Survey of Text Mining II, Springer, 203–217, 2008
Experience
• Research Assistant
  Wake Forest University, Winston-Salem, NC
  August 2007 – December 2008
  Worked with Dr. Errin Fulp on various projects in computer security. Researched topics in computer networks leading to master's thesis. Assisted in classroom and lab duties for networking class.
• Software Development Intern
  GreatWall Systems, Inc., Winston-Salem, NC
  June 2007 – August 2007
  Designed and programmed a testing platform for new high-speed firewall product. Implemented portions of firewall software in the Python programming language to allow firewall policies to be swapped in place with no gap in coverage.
• Intern – R&D team
  Tenable Network Security, Columbia, MD
  June 2006 – August 2006
  Developed, implemented, and tested Nessus vulnerability scanner plugins. Implemented code for the Tenable Log Correlation Engine product using a proprietary language. Analyzed, assessed, and scored software vulnerabilities according to the Common Vulnerability Scoring System for use with the Nessus Vulnerability scanner.
Honors
• Inducted into the Upsilon Pi Epsilon honor society in 2005
• Graduated cum laude from Wake Forest University in 2006
• 2nd place in the 2007 SIAM text mining competition