
Using Honeypots to Analyse Anomalous Internet Activities

Saleh Ibrahim Bakr Almotairi

Bachelor of Science (Computer Science), KSU, Saudi Arabia 1992
Master of Engineering (Software Engineering), UQ, Australia 2004

Thesis submitted in accordance with the regulations for the Degree of Doctor of Philosophy

Information Security Institute

Faculty of Science and Technology

Queensland University of Technology

June 2009


Keywords

Internet traffic analysis, low-interaction honeypots, packet inter-arrival times, principal component analysis, square prediction error, residual space.


Abstract

Monitoring Internet traffic is critical to acquiring a good understanding of threats to computer and network security and to designing efficient computer security systems. Researchers and network administrators have applied several approaches to monitoring traffic for malicious content. These techniques include monitoring network components, aggregating IDS alerts, and monitoring unused IP address spaces. Another method for monitoring and analyzing malicious traffic, which has been widely tried and accepted, is the use of honeypots. Honeypots are very valuable security resources for gathering artefacts associated with a variety of Internet attack activities. As honeypots run no production services, any contact with them is considered potentially malicious or suspicious by definition. This unique characteristic of the honeypot reduces the amount of collected traffic and makes it a more valuable source of information than other existing techniques.

Currently, there is insufficient research in the field of honeypot data analysis. To date, most of the work on honeypots has been devoted to designing new honeypots or optimizing existing ones. Approaches for analyzing data collected from honeypots, especially low-interaction honeypots, are presently immature, while analysis techniques are manual and focus mainly on identifying existing attacks. This research addresses the need for developing more advanced techniques for analyzing Internet traffic data collected from low-interaction honeypots. We believe that characterizing honeypot traffic will improve the security of networks and, if the honeypot data is handled in time, give early signs of new vulnerabilities or outbreaks of new automated malicious code, such as worms.

The outcomes of this research include:

• Identification of repeated use of attack tools and attack processes through grouping activities that exhibit similar packet inter-arrival time distributions using the cliquing algorithm;

• Application of principal component analysis to detect the structure of attackers' activities present in low-interaction honeypots and to visualize attackers' behaviors;

• Detection of new attacks in low-interaction honeypot traffic through the use of the principal component's residual space and the square prediction error statistic;

• Real-time detection of new attacks using recursive principal component analysis;

• A proof of concept implementation for honeypot traffic analysis and real-time monitoring.


Dedication

This thesis is dedicated to my parents Ibrahim and Fatima, who have inspired and encouraged me throughout my life. To my wife Medawi for her understanding and constant support over all these years of my PhD study.


Contents

Keywords
Abstract
Dedication
Table of Contents
List of Figures
List of Tables
List of Abbreviations
Declaration
Previously Published Material
Acknowledgment

1 Introduction
  1.1 Motivation
  1.2 Research Outcomes
  1.3 Thesis Outline

2 Background
  2.1 Internet Protocols
    2.1.1 TCP/IP Suite
    2.1.2 Traffic Attacks
  2.2 Network Monitoring and Traffic Collection Techniques
    2.2.1 Network Firewall
    2.2.2 Intrusion Detection Systems
    2.2.3 Network Flow Monitoring
    2.2.4 Black Hole Monitoring
  2.3 Global Monitoring Projects
    2.3.1 DShield
    2.3.2 Network Telescopes
    2.3.3 The Internet Motion Sensor
    2.3.4 The Leurré.com Project
    2.3.5 SGNET
  2.4 Traffic Analysis Techniques
    2.4.1 Data Visualization
    2.4.2 Data Mining
    2.4.3 Statistical Techniques
  2.5 Honeypots
    2.5.1 Low-interaction vs High-interaction Honeypots
    2.5.2 Production vs Research Honeypots
    2.5.3 Physical vs Virtual Honeypots
    2.5.4 Server Side vs Client Side Honeypots
    2.5.5 Improving Honeypots While Lowering Their Risks
    2.5.6 Honeypot Traffic Anomalies
    2.5.7 Existing Honeypot Solutions
  2.6 Related Work
    2.6.1 Research Outcomes from the Leurré.com Project
    2.6.2 Application of Principal Component Analysis to Internet Traffic
    2.6.3 Research Challenges
  2.7 Summary

3 Traffic Analysis Using Packet Inter-arrival Times
  3.1 Information Source
    3.1.1 The Leurré.com Honeypot Platform
    3.1.2 Data Manipulation
  3.2 Preliminary Investigation of Packet Inter-arrival Times
  3.3 Cluster Correlation Using Packet Inter-arrival Times
    3.3.1 Data set
    3.3.2 Measuring Similarities
    3.3.3 Cliquing Algorithm
  3.4 Experimental Results
    3.4.1 Type I Cliques
    3.4.2 Type II Cliques
    3.4.3 Type III Cliques
    3.4.4 Supercliques
  3.5 Summary

4 Honeypot Traffic Structure
  4.1 Motivation
  4.2 Principal Component Analysis
  4.3 Data set and Pre-Processing
    4.3.1 Data set
    4.3.2 Pre-processing
    4.3.3 Candidate Feature Selection
  4.4 PCA on the Honeypot Data set
    4.4.1 Number of Principal Components to Retain
  4.5 Interpretation of the Results
  4.6 Interrelations Between Components
  4.7 Identification of Extreme Activities
    4.7.1 A Discussion of the Detected Outliers
  4.8 Summary

5 Detecting New Attacks
  5.1 Introduction
  5.2 Principal Component's Residual Space
    5.2.1 Square Prediction Error (SPE)
  5.3 Data set and Pre-Processing
    5.3.1 Data set
    5.3.2 Processing the Flow Traffic via PCA
    5.3.3 Robustness
    5.3.4 Setting up Model Parameters
    5.3.5 Model Architecture
  5.4 Illustrative Example
    5.4.1 PCA Model Construction
    5.4.2 Future Traffic Testing
  5.5 Results and Evaluation
    5.5.1 Detection and Identification
    5.5.2 Stability of the Monitoring Model Over Time
    5.5.3 Computational Requirements
    5.5.4 Evaluation
  5.6 Summary

6 Automatic Detection of New Attacks
  6.1 Introduction
  6.2 Principal Component Analysis Model
    6.2.1 Building the Initial PCA Detection Model
    6.2.2 Recursive Adaptation of the Detection Model
    6.2.3 Setting the Thresholds
  6.3 Model Architecture
    6.3.1 Detecting New Attacks and Updating the Model
    6.3.2 Model Sensitivity to New Attacks
  6.4 A Proof of Concept Implementation
    6.4.1 Flow Aggregator
    6.4.2 Monitoring Desktop: HoneyEye
    6.4.3 Deployment Scenario: Single Site
    6.4.4 Limitation
  6.5 Experimental Results
    6.5.1 Projection of the Testing Data: No Adaptation
    6.5.2 Projection of the Testing Data: With Adaptation
    6.5.3 The Effects of Adaptation on Threshold Values
    6.5.4 The Effect of Adaptation on Variables
  6.6 Summary

7 Conclusion and Future Work
  7.1 Improving the Leurré.com Clusters
  7.2 Structuring Honeypot Traffic
  7.3 Detecting New Attacks
  7.4 Conclusion

A Matlab Code
  A.1 Extracting the Principal Components
  A.2 Robustification Using the Squared Mahalanobis Distance (M^2)
  A.3 Estimate Parameters
  A.4 Recursive Mean
  A.5 Recursive Variance
  A.6 Recursive Normalization


List of Figures

2.1 Network telescope setup.
2.2 SGNET Architecture.
2.3 An example of a virtual honeypot setup that emulates two operating systems.
3.1 Leurré.com honeypot platform architecture.
3.2 Illustration of the port sequence of an attack.
3.3 Illustration of packet inter-arrival times.
3.4 A global distribution of all IAT distribution < 100000 seconds.
3.5 IAT distribution values that range from 0 to 10000 seconds.
3.6 A time series conversion using SAX.
3.7 An example of finding cliques.
3.8 The different steps of the cliquing algorithm.
4.1 Directions of maximal variance of principal components (Z1, Z2).
4.2 Scree plot of eigenvalues.
4.3 The scatter plot of TCP scan (PC2) vs live machine detection (PC5).
4.4 The scatter plot of the first two principal components.
4.5 The scatter plot of the last two components.
4.6 The ellipse of a constant distance.
4.7 The scatter plot of the statistics Di vs. (Mi^2 − Di).
5.1 Scree plot of eigenvalues.
5.2 Robustification of the correlation matrix through multivariate trimming.
5.3 Detection model architecture.
5.4 Steps for building the PCA model (Phase I).
5.5 Steps for detecting new attacks (Phase II).
5.6 Plot of SPE values of the training and testing traffic.
5.7 Plot of four-month attack data projected onto the residual space.
6.1 Adaptive detection model process flow.
6.2 Detecting new attacks.
6.3 Residual space sensitivity to new attacks.
6.4 HoneyEye interface.
6.5 Overview of a real-time deployment.
6.6 Detection charts, with no adaptation: using SPE statistics (upper chart) and using the T^2 statistic (lower chart).
6.7 Two detection charts with 14-day adaptation: using SPE statistics (upper chart) and using the T^2 statistic (lower chart).
6.8 SPE and T^2 limit evolution over time using 14-day adaptation.
6.9 SPE limit evolution over time using 14-day adaptation (top) along with the mean of six selected variables.


List of Tables

3.1 Distinct sources and destinations of the top ten IATs.
3.2 Bin values of IAT ranges.
3.3 A summary of Type I Cliques.
3.4 A summary of Type II Cliques.
3.5 A summary of Type III Cliques.
3.6 Representative properties of Supercliques.
4.1 Summary of the data set used in this study.
4.2 Variables used in the analysis.
4.3 The extracted principal components and their variances.
4.4 The extracted communalities of variables.
4.5 The Varimax rotation of principal components.
4.6 Interpretations of the first seven components.
4.7 The top five extreme observations.
5.1 Summary of the data sets used in the study.
5.2 Extracted principal components' variance.
5.3 Sample traffic matrix.
5.5 Standardized traffic matrix.
5.6 Eigenvectors.
5.7 Eigenvalues.
5.8 Scores of the residuals.
5.9 SPE values.
5.10 Future traffic matrix.
5.11 Standardized future traffic matrix.
5.12 New traffic PC scores.
5.13 Average execution times of the major tasks (seconds).
5.14 Classes of detected attack activities.
6.1 Summary of the data sets.


List of Abbreviations

CAIDA   Cooperative Association for Internet Data Analysis
CERT    Computer Emergency Response Team
CPU     Central Processing Unit
CUSUM   Cumulative Sum
DDoS    Distributed Denial of Service
DoS     Denial of Service
EWMA    Exponentially Weighted Moving Average
FTP     File Transfer Protocol
IAT     Packet Inter-arrival Time
ICMP    Internet Control Message Protocol
IDS     Intrusion Detection System
IMS     Internet Motion Sensor
IP      Internet Protocol
ISC     Internet Storm Center
KNN     K-Nearest Neighbors
LAN     Local Area Network
NIDS    Network Intrusion Detection System
OD      Origin Destination
OS      Operating System
OSI     Open Systems Interconnection
PAA     Piecewise Aggregate Approximation
PC      Principal Component
PCA     Principal Component Analysis
RPCA    Recursive Principal Component Analysis
SAX     Symbolic Aggregate Approximation
SMTP    Simple Mail Transfer Protocol
SPE     Square Prediction Error
SSH     Secure Shell
SVD     Singular Value Decomposition
TCP     Transmission Control Protocol
TCP/IP  Transmission Control Protocol / Internet Protocol
UCL     Upper Control Limit
UDP     User Datagram Protocol
UML     User Mode Linux


Declaration

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: . . . . . . . . . . . . . . . . . . . . .


Previously Published Material

The following papers have been published or presented, and contain material based on the content of this thesis.

• S. Almotairi, A. Clark, M. Dacier, C. Leita, G. Mohay, V. H. Pham, O. Thonnard, and J. Zimmermann, "Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques", in the 5th Australian Digital Forensics Conference, Perth, Australia, 2007.

• S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the 2008 IFIP International Conference on Network and Parallel Computing, Shanghai, China: IEEE Computer Society, 2008.

• S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society, 2009.


Acknowledgment

Praise and thanks be to Allah for his help in accomplishing this work.

This thesis would not have been successful without the assistance and support of the following individuals and organizations. I am grateful to them all.

I would like to thank my supervisors, Associate Professor Andrew Clark, Adjunct Professor George Mohay, and Dr. Jacob Zimmermann, for their patience, guidance, and support in the completion of this work. Thank you Andrew and George for all the help you have given me during my research. Indeed, your suggestions and advice made the completion of this work possible.

Acknowledgment is due to the National Information Center at the Ministry of Interior in Saudi Arabia for sponsoring my research. I am also very grateful to the Information Security Institute (ISI) at the Queensland University of Technology for providing the resources and environment for conducting this research. Additional thanks go to the Leurré.com honeypot project, led by Marc Dacier, for providing the honeypot data that made this work possible.

I would also like to thank my colleagues at the Information Security Institute for providing an environment in which I could learn and work as a researcher.


Chapter 1

Introduction

People and businesses alike depend on the Internet to communicate and to conduct their affairs. This growing dependence on the Internet is matched by a rising rate of attacks. Computers and networks connected to the Internet are vulnerable to a variety of threats that can compromise their intended operations, such as viruses, worms, and denial of service attacks. There are many reasons for the growing number and severity of attacks, including increased connectivity and the increased availability of vulnerability information and attack scripts via the Internet.

As the nature of Internet attacks is unpredictable, security managers need to implement multiple layers of security defence as part of a Defence-in-Depth protection strategy, such as firewalls, monitoring tools, vulnerability scanning tools, and intrusion detection systems. Firewalls are commonly used to protect local networks from the outside world by controlling traffic flow between the local network and the Internet. While firewalls protect local networks from the Internet, they have many limitations, which include:

• they cannot see local traffic;

• they are vulnerable to misconfiguration; and

• they stop only network-level attacks and are less effective in preventing application-level attacks that target open ports such as TCP port 80.

Network intrusion detection systems (NIDS) are another component in the Defence-in-Depth protection strategy. NIDS are used to detect malicious traffic within networks, based on predefined attack signatures or, less commonly, on anomaly-based methods. Network intrusion detection systems also have their own limitations, which include:

• the need for accurate signatures of attacks in order to work properly;

• the inability to detect, in the case of signature-based IDS, new and unseen attacks;

• the generation of a large number of alerts that need to be investigated; and

• the inability to handle encrypted traffic.

Recently, honeypots have gained popularity within the security community in providing an additional layer of network security. Honeypots are decoy computers that run no real services and can serve to complement other security systems through their ability to capture new attacks and to see encrypted traffic. In addition, honeypots, by definition, only collect malicious traffic, and therefore reduce the generation of false alarms. Honeypot applications for network security include the automatic collection of malware [12], detection of zero-day attacks [1], detection of worms [50] and the automatic generation of intrusion detection signatures [83].

Low-interaction honeypots are the simplest form of honeypot. They run no real operating system and usually offer only an emulated network stack with limited or no service interactions. The advantages of using low-interaction honeypots are their ease of deployment and their low level of risk. The Leurré.com project is a worldwide deployment of low-interaction honeypots [9] for collecting attack data that targets machines and networks that are connected to the Internet. The honeypot traffic data used in this thesis comes from the Leurré.com project. Analyzing this traffic has proved to be very useful in characterizing global malicious Internet activity. Various types of analysis have been carried out on honeypot traffic data collected from the project, to characterize different Internet attack activities and to unveil useful attack patterns [151, 112].

This research extends previous research on improving honeypot traffic analysis, introduces new techniques for the characterization of malicious Internet activities and automates the analysis and discovery of new attacks. The rest of this chapter is organized as follows. Section 1.1 identifies the motivation for this research. Outcomes achieved by this research are identified in Section 1.2. Finally, Section 1.3 presents the outline of the thesis.


1.1 Motivation

This thesis examines the problem of data analysis of traffic collected by low-interaction honeypots with the goal of identifying anomalous Internet traffic. This research is motivated by:

• the relative absence of research analyzing traffic data collected by honeypots in general and low-interaction honeypots in particular;

• the need for new detection techniques that suit the type of traffic data collected by low-interaction honeypots, which is considered suspicious by definition, and which is both multidimensional and sparse;

• the need for a real-time detection capability for new attacks with reduced or no human intervention that is:

– able to capture new trends and adapt to the dynamic nature of the Internet;

– low in computational resource requirements and thus suitable for real-time application.

1.2 Research Outcomes

The aim of this thesis is to research and develop advanced techniques for identifying and analyzing anomalous Internet activities in honeypot traffic. This research has resulted in a number of significant improvements to honeypot traffic analysis. The outcomes of this research are five-fold:

• improving the Leurré.com clusters through the use of packet inter-arrival time (IAT) distributions and the cliquing algorithm to group similar attack activities, or clusters of attacks, based on similar IAT behaviors. The research results were published in:

– S. Almotairi, A. Clark, M. Dacier, C. Leita, G. Mohay, V. H. Pham, O. Thonnard, and J. Zimmermann, "Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques", in the 5th Australian Digital Forensics Conference, Perth, Australia, 2007;


• the successful application of principal component analysis (PCA) in detecting the structure of attackers' activities in honeypot traffic, in visualizing these activities, and in identifying different types of outliers. The research findings were presented in:

– S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the 2008 IFIP International Conference on Network and Parallel Computing, Shanghai, China: IEEE Computer Society, 2008;

• the proposal of a detection technique that is capable of detecting new attacks in low-interaction honeypot traffic using the residuals of principal component analysis (PCA) and the square prediction error (SPE) statistic. The research results were published in:

– S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society, 2009;

• the design of an automatic detection model that is capable of detecting new attacks, capturing new changes, and updating its parameters automatically; and

• the implementation of a proof of concept system for analyzing honeypot traffic and providing a real-time monitoring application.

1.3 Thesis Outline

The rest of this thesis is organized as follows:

Chapter 2: Background This chapter provides an overview of honeypot concepts and technologies. It also explores existing data collection techniques and monitoring methods for anomalous Internet traffic, and identifies research in analyzing anomalous Internet traffic. In addition, this chapter discusses related work relevant to that described in Chapters 4-6.


Chapter 3: Traffic Analysis Using Packet Inter-Arrival Times This chapter gives a brief introduction to the Leurré.com project setup and its methodology for collecting and processing honeypot traffic. In addition, it details a methodology for improving the Leurré.com clusters by grouping clusters that share similar types of activities based on packet inter-arrival time (IAT) distributions. A number of cliques have been generated using the IATs of clusters that represent a variety of interesting activities targeting the Leurré.com environments.
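
To make the IAT-based representation concrete, the following minimal Matlab sketch (illustrative only, not the thesis' own code; the bin boundaries and variable names are assumptions) turns the sorted packet timestamps of one cluster into a normalized inter-arrival time distribution of the kind that the cliquing algorithm compares:

    % Illustrative sketch: building a normalized IAT distribution for one cluster.
    % 'timestamps' is assumed to be a sorted vector of packet arrival times in seconds.
    iat = diff(timestamps);                        % packet inter-arrival times
    edges = [0 1 10 60 600 3600 86400 Inf];        % assumed bin boundaries (seconds)
    counts = histc(iat, edges);                    % number of IATs falling in each range
    dist = counts(1:end-1) / max(sum(counts), 1);  % normalized IAT distribution vector

Clusters whose distribution vectors are sufficiently similar can then be grouped, which is the role the cliquing algorithm plays in Chapter 3.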

Chapter 4: Honeypot Traffic Structure This chapter introduces the concept of principal component analysis and presents a technique for characterizing attackers' activities in honeypot traffic using principal component analysis. Attackers' activities in honeypot traffic are decomposed into seven dominant clusters. In addition, a visualization technique, based on principal component plots, is presented to unveil the interrelationships between activities and to identify outliers. Finally, experimental results on real traffic data from the Leurré.com project are discussed.
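
As a hedged illustration of the kind of computation Chapter 4 relies on (the number of retained components follows the chapter summary, but the code itself is a sketch rather than the thesis' implementation), principal components can be extracted from a standardized traffic matrix as follows:

    % Illustrative sketch: PCA on an n-by-p honeypot traffic feature matrix X.
    Z = (X - repmat(mean(X), size(X,1), 1)) ./ repmat(std(X), size(X,1), 1);  % standardize
    R = corrcoef(Z);                            % correlation matrix of the features
    [V, D] = eig(R);                            % eigenvectors (loadings) and eigenvalues
    [lambda, order] = sort(diag(D), 'descend');
    V = V(:, order);                            % components ordered by explained variance
    k = 7;                                      % retain seven dominant components
    scores = Z * V(:, 1:k);                     % projection of observations onto the PCs
    explained = lambda / sum(lambda);           % proportion of variance per component

Scatter plots of pairs of columns of 'scores' give the principal component plots used for visualization and outlier identification.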

Chapter 5: Detecting New Attacks This chapter presents a technique for detecting new attacks in low-interaction honeypot traffic through the use of the principal component residual space and square prediction error (SPE) statistics. The effectiveness of the proposed technique is demonstrated and evaluated through the analysis of real traffic data from the Leurré.com project. Two data sets are used in this analysis: data set I to construct the PCA model, and data set II to test and evaluate the detection model.
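
In the standard formulation of this statistic (the thesis' notation may differ in detail), with P_k the matrix of the k retained eigenvectors, p the number of features, and x a standardized observation vector, the square prediction error is the squared norm of the projection of x onto the residual space:

    \mathrm{SPE} = \left\| \left( I - P_k P_k^{\top} \right) x \right\|^{2}
                 = \sum_{j=k+1}^{p} \left( p_j^{\top} x \right)^{2},

and an observation is flagged as a potential new attack when its SPE exceeds an upper control limit derived from the training data.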

Chapter 6: Automatic Detection of New Attacks This chapter addresses the challenges of real-time detection of new attacks and proposes an adaptive detection model that captures changes in Internet traffic and updates its parameters automatically. Moreover, a proof of concept implementation of the proposed detection system for real-time and offline applications is described.
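
One common way to realize such recursive adaptation (a sketch assuming an exponentially weighted update; the exact recursion and forgetting factor used in the thesis may differ) is to update the statistics used to normalize incoming traffic as each new observation arrives:

    % Illustrative sketch: recursive update of the normalization statistics.
    % mu and sig2 are initialized from the training data; beta is a forgetting factor.
    beta = 0.1;                                        % assumed forgetting factor
    for t = 1:size(Xnew, 1)
        x = Xnew(t, :);                                % next flow observation (1-by-p)
        mu   = (1 - beta) * mu   + beta * x;           % recursive mean
        sig2 = (1 - beta) * sig2 + beta * (x - mu).^2; % recursive variance
        z = (x - mu) ./ sqrt(sig2);                    % recursively normalized observation
    end

The detection thresholds and the PCA model itself can be refreshed in the same incremental fashion, which is what keeps the monitoring model current without manual retraining.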

Chapter 7: Conclusion and Future Work Conclusions and directions for future research are presented in this chapter.


Chapter 2

Background

The goals of the thesis, as described in Chapter 1, are to research and develop improved techniques for characterizing anomalous Internet activities present in low-interaction honeypot traffic. This chapter provides an overview of the honeypot concept and different types of honeypot technologies. This chapter also explores the data collection techniques and monitoring methods used in identifying anomalous Internet traffic. Research in analyzing anomalous Internet traffic is also identified, with a particular emphasis on honeypots.

This chapter is divided into seven sections. Internet protocols are discussed in Section 2.1. Existing methods for monitoring network traffic for malicious activities are discussed in Section 2.2. Section 2.3 highlights existing global monitoring systems. Section 2.4 provides an overview of traffic analysis techniques. The concept of honeypots is presented in Section 2.5. Section 2.6 presents previous research related to the work described in Chapters 3 to 6. Finally, Section 2.7 concludes the literature review.

2.1 Internet Protocols

All Internet traffic is handled by the Internet protocol suite, or what is commonly known as the TCP/IP (Transmission Control Protocol/Internet Protocol) protocol suite. This thesis aims to research and develop advanced techniques for identifying and analyzing anomalous Internet activities in honeypot traffic. This section provides a brief overview of the TCP/IP protocol suite, traffic anomalies, and traffic analysis techniques.

2.1.1 TCP/IP Suite

The TCP/IP (Transmission Control Protocol/Internet Protocol) protocol suite is a set of communication protocols for transmitting data over the Internet [59, 60], which is maintained by the Internet Engineering Task Force (IETF) [8]. The TCP/IP protocol suite is hierarchical and comprises four interactive layers: link, Internet, transport and application. Network communication is achieved through the interaction between different layers, where higher layers draw on the services of lower layers.

While there are many protocols defined by the TCP/IP suite to achieve its functionalities, the main protocols of the suite are:

• Internet Protocol (IP) [57]: IP is the main network layer protocol. It is a connectionless protocol and is mainly responsible for delivering data between source and destination devices. IP's main functionalities include: formatting data into packets or datagrams, IP addressing, handling fragmentation, and network routing.

• Internet Control Message Protocol (ICMP) [56]: ICMP is a network layer protocol that complements the IP protocol by providing error reporting and querying mechanisms for testing and diagnosing networks.

• Transmission Control Protocol (TCP) [58]: TCP is a connection-oriented transport layer protocol. TCP is responsible for providing reliable communications for application layer protocols and for ensuring end-to-end delivery of data between communicating application programs. As IP is a connectionless and unreliable protocol, TCP provides the necessary functionality enabling several applications to share the same IP address at the same time and to perform bi-directional communications over the network. TCP's main functions include providing connection control and error and flow control. Application layer protocols that utilize TCP include HTTP and SMTP.

• User Datagram Protocol (UDP) [55]: UDP is a connectionless and unreliable transport layer protocol. Like TCP, UDP is responsible for providing end-to-end communication for applications, but with a minimal level of error checking and no flow control mechanism. UDP provides applications with a lightweight method for sending small amounts of data where reliability is not important, such as broadcasting applications where a loss of some bytes will not be noticed, or for applications that send small amounts of data where requests are resent whenever the response is not received, for example, voice over IP.

2.1.2 Traffic Attacks

Vulnerabilities in the TCP/IP suite exist in almost all of its layers [37], as the TCP/IP suite was not designed with security in mind. These vulnerabilities have been utilized by attackers against networks and systems, using attacks such as TCP SYN and ICMP flooding. Statistics [131] show that the number of network attacks and the number of new vulnerabilities are both on the rise despite increased efforts in the areas of software engineering and security management practices.

Several factors have contributed to the rise of attacks, including increased connectivity, increased financial and other incentives to launch attacks, the availability of vulnerability information and attack tools, the high prevalence of exploitable vulnerabilities and the lack of patches from vendors or long delays before the patches are made available. The source of vulnerabilities can be attributed to many factors, including the design of the protocols themselves and the flawed implementation of these protocols.

Several studies have classified network anomalies and security threats in order to resolve confusion in describing particular attacks. Howard [71] developed a process-based taxonomy of computer and network attacks. His approach was intended to describe the process of attacks, rather than providing an attack classification, and to establish a link between attackers and objectives in computer and network attacks. This link is established through an operational sequence of tools, access and results.

Hansman's taxonomy [69] is aimed at classifying and grouping attacks based on their similarities rather than the attack process. Hansman proposed four dimensions for attack classification. The first dimension categorizes attacks based on their attack vector or method of propagation. The second dimension identifies attacks according to their targets. The third dimension deals with the vulnerabilities that the attack exploits, mainly based on the Common Vulnerabilities and Exposures (CVE) standardized names of vulnerabilities [3]. The fourth dimension deals with attacks that might have extra effects or which are able to launch other attacks, such as a worm carrying a Trojan in its payload. The next section examines methods for monitoring network traffic for detecting these malicious activities.

2.2 Network Monitoring and Traffic Collection Techniques

The Internet has become essential for governments, universities and businesses to conduct their affairs. The reliability and availability of networks and the security of the Internet are critical for organizations to conduct their daily work. These networks are under constantly increasing threats from different types of attacks, such as worms and denial of service attacks. Monitoring and characterizing these threats is crucial for protecting networks and guaranteeing smooth organizational activities.

Broadly speaking, two methods exist for monitoring network traffic for malicious activities: live network monitoring and unsolicited traffic monitoring. The live monitoring techniques include data collected by policy-enforcing systems such as firewalls, network intrusion detection system (NIDS) logs, and traffic from network management tools such as NetFlow [132]. Unsolicited traffic monitoring techniques include passive monitoring of unused IP spaces, such as darknets, and the use of active decoy services, such as honeypots.

2.2.1 Network Firewall

A network firewall comprises software and hardware that protect one network from another network. Firewalls are mainly used to filter incoming Internet traffic according to a predefined organizational security policy. A firewall can provide protection across different levels of the Open Systems Interconnection (OSI) networking model, such as the application and network layers. Firewalls deployed at the boundary of networks have a view of inbound and outbound traffic, and this makes them very useful in monitoring traffic. Firewall logs are a rich source of information about network traffic, including traffic volume, successful and rejected connections, traffic arrival times, IP addresses, ports and services.


2.2.2 Intrusion Detection Systems

Intrusion detection systems complement firewalls in monitoring network traffic and providing another level of protection for systems. Network intrusion detection systems (NIDS) are passive in their monitoring of network traffic. They detect attacks by capturing and analyzing traffic and generating alarms when the level of suspicion about the traffic is high. Two types of NIDS currently exist, based on their detection methodologies: signature-based and anomaly-based NIDS. Signature-based NIDS rely on a knowledge base of predefined patterns of attack for identifying attacks in the network traffic being monitored. In contrast, anomaly-based NIDS measure any deviation from normality and raise alarms whenever the predefined threshold level is exceeded. A normality profile is constructed by training the detection model on historical network traffic that is believed to be attack free, or normal, over a period of time. An alert is then signaled whenever a large deviation is encountered from the normality profile. While signature-based NIDS detect only known attacks, anomaly-based NIDS are capable of detecting zero-day attacks, but usually with a high level of false positive alarms. NIDS are passive systems that are capable of logging very detailed information regarding suspicious traffic.
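
As a toy illustration of the anomaly-based approach (not any particular NIDS product; the feature vector, profile statistics and threshold are assumptions made for the example), an alarm rule against a trained normality profile could look like:

    % Toy example: anomaly-based alerting against a normality profile.
    % mu and sigma are the per-feature mean and standard deviation measured on
    % historical traffic assumed to be attack free; x is the current feature vector.
    score = max(abs(x - mu) ./ sigma);   % largest standardized deviation from normality
    threshold = 3;                       % assumed alarm threshold (three sigma)
    if score > threshold
        disp('anomaly alarm raised');
    end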

2.2.3 Network Flow Monitoring

A network flow [132] is a unidirectional stream of packets between a given source and a destination. A network flow can be identified by seven key fields: source IP address, destination IP address, source port number, destination port number, protocol type, type of service, and the router input interface. If a packet differs from another packet by a single key field, it is considered to belong to another flow. While flow data was originally used for resource management and accounting, this data contains enough information for detecting a variety of network anomalies [33, 63]. Moreover, network flows excel in their performance for real-time analysis and detection of attacks. In this research, traffic flows were used to analyze honeypot traffic.
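
As a simple illustration of the seven-field flow definition above (the packet structure and field names are assumptions made for the example, not the thesis' code), packets can be grouped into unidirectional flows and a per-flow byte count accumulated as follows:

    % Illustrative sketch: aggregating packets into flows keyed by the seven fields.
    % 'pkts' is assumed to be a struct array with fields srcIP, dstIP, srcPort,
    % dstPort, proto, tos, ifIndex and bytes.
    flows = containers.Map('KeyType', 'char', 'ValueType', 'double');
    for i = 1:numel(pkts)
        p = pkts(i);
        key = sprintf('%s|%s|%d|%d|%d|%d|%d', p.srcIP, p.dstIP, ...
                      p.srcPort, p.dstPort, p.proto, p.tos, p.ifIndex);
        if isKey(flows, key)
            flows(key) = flows(key) + p.bytes;   % packet belongs to an existing flow
        else
            flows(key) = p.bytes;                % first packet of a new flow
        end
    end

A packet that differs in any one of the seven key fields produces a different key and therefore starts a new flow, matching the definition given above.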

Barford et al. [33] conducted a visual analysis of traffic flows and categorized traffic anomalies into three types based on statistical characteristics of their traffic features. These three types are: operational anomalies, which result from network device outages and misconfiguration; flash crowd anomalies, which are described as a sudden rise in traffic to a host for a short period of time; and network abuse anomalies, which result from malicious intent and include a variety of anomalous activities such as denial of service (DoS) attacks and worms.

Analysis of flow data has been widely used for real-time network monitoring [66, 63, 101, 33, 81]. Barford et al. [34] proposed an anomaly detection technique that uses network traffic flow as an input. Kim et al. [81] proposed a flow-based method to detect abnormal traffic using traffic patterns of flow header information. Munz et al. [101] proposed a framework for real-time detection of attacks using traffic flows.

2.2.4 Black Hole Monitoring

Black holes [48] or Darknets [19] are blocks of routable IP addresses which have no legitimate hosts deployed. Traffic targeting these blocks must be the result of misconfiguration, backscatter from spoofed source addresses, or port scanning. Thus, black hole networks provide an excellent method of studying Internet threats. The size of the unused address space has little effect on the usability of this method; however, more address space increases the visibility and accuracy of statistical inference.

Monitoring of a Darknet can include [102] monitoring backscatter, which helps in the analysis of denial of service attacks [100], or monitoring requests for access to unallocated spaces, which helps in the analysis of worms. Traffic data can be acquired using various approaches, such as exporting flow logs from routers and switch equipment, placing a passive network monitor at the entrance of the network, or listening on a router interface that serves the unallocated network space.

One drawback of the darkspace monitoring technique is the difficulty of hiding such deployments so that attackers are not able to avoid them. Another interesting approach is to monitor the grey IP address space, or IP addresses that are not active for a period of time in a large IP class. Jin et al. [76] proposed a correlation technique for identifying and tracking potentially suspicious hosts using grey IP space. In the following section, we review existing projects for monitoring Internet threats.


2.3 Global Monitoring Projects

This section highlights a number of projects whose aim is to monitor Internet threats, along with the techniques they utilize.

2.3.1 DShield

DShield.org is a non-profit organization that was launched at the end of 2000 and funded by the SANS Institute as part of its Internet Storm Center (ISC) [4, 73]. The goal of DShield.org was to collect malicious Internet activity reports from all over the world in order to analyze activity trends and improve firewall rules. DShield's data set consists of firewall logs and IDS alerts submitted by a variety of networks from around the world. Trends of attacks, such as top source IP addresses and destination ports, are published on a daily basis.

Yegneswaran et al. [149] presented an analysis of Internet intrusion activities based on data from DShield.org. In their study, they utilized packets rejected by firewalls and port scan logs recorded by network intrusion detection systems. They investigated several features of intrusion activity, including the daily volume of intrusion attempts, the sources and destinations of intrusion attempts, and specific types of intrusion attempts. They then used their results to predict intrusion activities for the entire Internet. However, research shows that attempts to extrapolate results from small network IP spaces in order to predict global Internet traffic lead to insignificant results, since different IP spaces observe different traffic patterns [48].

2.3.2 Network Telescopes

The Network Telescope is an initiative of the Cooperative Association for Internet Data Analysis (CAIDA) for monitoring routable but unused IP address space [14]. The Network Telescope assumes that no legitimate traffic should be sent to monitored space, which provides a unique basis for studying security events; any traffic that arrives at a network telescope is either a result of malicious activities (such as backscatter from denial of service attacks, Internet worms, and network scanning) or a misconfiguration. The network telescope has helped in studying worm activities, such as Slammer and Code Red II, and has assisted in the analysis of large scale denial of service attacks and backscatter [100]. Figure 2.1 depicts the network telescope setup used for backscatter collection.


Figure 2.1: Network telescope setup.

2.3.3 The Internet Motion Sensor

The Internet Motion Sensor (IMS) is a global threat monitoring system for measuring and tracking Internet threats, such as worms and denial of service attacks [20]. The IMS is managed by the University of Michigan and consists of over 28 sensors at 18 different locations. These sensors monitor blocks of routable but unused IP addresses. They are deployed across the globe at major Internet service providers, large organizations, and universities. The IP address spaces these sensors monitor range in size from class C (256 addresses) to class A (16,777,216 addresses) networks. Each sensor consists of an active and a passive component. The passive component collects received packets sent to the sensor's monitored address space and the active component manages replies to the source of received packets.

A study of traffic targeting ten IMS sensors shows that there were significant differences in the traffic observed between these sensors [30]. These differences have been observed over all protocols and services, over a specific protocol, and over a particular worm signature.

2.3.4 The Leurré.com Project

The Institute of Eurécom launched the Leurré.com project in 2004 for the purpose of collecting malicious traffic using globally distributed environments [9]. The Leurré.com environments consist of similar honeypot sensors deployed at different locations around the world; currently 40 platforms are deployed in 25 different countries. Data from these honeypots was used in this research.


On a daily basis, traffic logs are transferred to a centralized machine where raw traffic data is processed, enriched with external data, and inserted into relational database tables. The most important tables in the database are the following [109]:

• host table: this table contains all attributes which characterize one honeypot virtual machine;

• environment table: this table contains all attributes which characterize one honeypot platform; each platform consists of three hosts;

• source table: this table gathers all attributes required to characterize one attacking IP within one day;

• large_session table: this table contains all attributes required to characterize the activity of one source observed against one platform;

• tiny_session table: this table contains all attributes required to characterize one source observed against one host;

• hacker_honeypot_packets table: this table tracks all packets sent by hackers to honeypots;

• honeypot_hacker_packets table: this table tracks all packets sent by honeypots to hackers.

The Leurré.com project provides two types of interface for accessing the database: a protected web interface, which provides some useful queries for extracting data from the database, and direct access through secure shell (SSH). Leurré.com's platform architecture and data manipulation are explained in detail in Chapter 3, while research outcomes from the project are presented in Section 2.6.1.

2.3.5 SGNET

SGNET is another initiative from the Institute of Eurécom, creating an open distributed framework of honeypots for collecting suspicious Internet traffic data and analysing it for malicious activity [92]. The framework focuses mainly on self-propagating code and code injection attacks. The architecture of SGNET is divided into three parts (Figure 2.2):


1. SGNET sensor: the SGNET sensor is a low-interaction honeypot daemon that interacts with attackers by mimicking real services. The interactions are handled by a ScriptGen system [93]. ScriptGen is a system for emulating services with no prior knowledge of their behaviors, through incremental learning drawn from samples of previous interactions with high-interaction honeypots. ScriptGen uses state machines to replay previously learned responses in reply to attackers' requests. When a new request arrives that is not captured by the emulator knowledge base, ScriptGen proxies this request to a high-interaction honeypot to continue the conversation, which is later incorporated in refining the emulator's knowledge state machines.

2. Gateway: the gateway acts as an interface between the SGNET sensors and the service provider components. It also balances the load of the service provider components. The communication between the gateway and the SGNET components is achieved through Peiros, a TCP-based protocol [92].

3. Service provider: currently, the service provider has two components, the sample factories and the ShellCode handlers. The sample factory is a high-interaction honeypot that is implemented using a modified version of the Argos virtualization system [1]. Any requests that cannot be answered by the SGNET sensors, because they fall outside their state machines, are proxied to the sample factory to continue these conversations with the attackers; the responses are then pushed back by the gateway to refine the sensors' state machines. The ShellCode handler is a modified implementation of Nepenthes [12] for handling shell code behaviors and network interactions. When a shell code is detected, the payload is handed by the sensor to the ShellCode handler for further analysis.

SGNET collects different levels of attack information, including tcpdump [24] logs from the SGNET sensors and gateway logs. The information is then extracted and stored in a relational database. One disadvantage of the SGNET sensors is that they can only monitor a few IP addresses (currently four), which limits their view of Internet threats. Currently, only two sensors have been deployed across the globe. While SGNET is limited to code injection detection, it has the appealing feature of not depending on manually crafted static responses to retrieve the payload of the attack, but rather on valid responses extracted from real servers. There has not yet been any analysis of data collected by the project.


Figure 2.2: SGNET architecture. SGNET sensors connect over the Internet to a gateway, which forwards requests to the service provider components (sample factories and ShellCode handlers).

2.4 Traffic Analysis Techniques

Characterizing or analyzing anomalous network traffic is considered a first step in increasing our knowledge of attack threats and then protecting production networks from them. Traffic analysis is a wide research field, which can be roughly divided, based on the utilized technique, into three categories: data visualization, data mining, and statistical techniques. This section will review some of the previous research in each category.

2.4.1 Data Visualization

Traffic visualization can be very helpful in assisting administrators in making effective decisions. Complex attack patterns can be easily detected and interpreted by humans if they are represented properly in a visual form.

A number of researchers have investigated the applicability of visual techniques for the identification of attack tools, without relying on intrusion detection system signatures or statistical anomalies.


Abdullah et al. [47] used the Parallel Coordinate Plots visualization technique [141], a technique for displaying multi-dimensional data in one representation, to visualize captured packets in real time in order to fingerprint popular attack tools from the top 75 network security tools [61]. By focusing on complex patterns that are not easily automated by systems, they found that visual representation allows attack traffic to be more easily detected and interpreted by humans.

Different systems have been designed to help visualize network traffic for security analysis [90, 26]. Krasser et al. [82] designed a network traffic visualization system for real-time and forensic network data analysis. Their system supports a real-time monitoring mode and a playback mode for previously captured data, enabling forensic analysis. Archived traffic captured by honeypots was used to evaluate and test the system. Zhang et al. [150] explored the use of plots of singular value decomposition (SVD) and scatter plots between eigenvectors in detecting traffic patterns. Examples of traffic flow visualization for traffic anomaly detection include FlowScan and NVisionIP [106, 89].

2.4.2 Data Mining

Data mining refers to the extraction of knowledge from large amounts of data. This knowledge serves two main objectives [68]:

• Description: finding patterns that describe the current data, such as clustering;

• Prediction: predicting the behavior of new data sets given the current data set, such as classification.

Julisch [78] surveyed the most used data mining techniques in the field of intrusion detection and found that four types of data mining techniques have been widely used:

• Association rules, which search for interesting relationships among the items in large data sets and the frequency of items occurring together. This technique is widely used in market basket analysis to study customers' buying habits by finding associations among items placed in the shopping basket. This can help decision makers and analysts find sets of products that are frequently bought together and develop marketing strategies. From the presence of certain items in the shopping basket, one can infer with high probability the presence of other items.

• Frequent episode rules, which are similar to association rules, but take record order into account;

• Classification, which is a process of learning from given data to build classification models for each class of data based on features in the data. These classification models are then used to predict the classes of new data;

• Clustering, which is a method of partitioning data into groups (clusters) for the purpose of data simplification.

A number of researchers have applied data mining techniques to problems in intrusion detection. Lee [91] used data mining techniques to discover consistent and useful patterns of system features that describe program and user behaviors. They used the set of system features to develop classifiers that can recognize anomalies and known intrusions. Two data mining algorithms were implemented, association rules and frequent episode algorithms. The hierarchical tree classification clustering technique has also been used to eliminate intrusion detection false alarms and identify the root causes of attacks [79].
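To make the clustering category concrete in the honeypot setting, the following is a minimal sketch (not the Leurré.com clustering approach described later in this chapter) that partitions per-source traffic summaries with k-means; the feature vectors and their interpretation are hypothetical.

    # Minimal illustration of clustering per-source honeypot traffic summaries.
    # The feature values and their meaning are hypothetical examples.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # One row per attacking source: [packets sent, distinct ports targeted,
    # attack duration in seconds, number of virtual hosts contacted]
    X = np.array([
        [12,  2,   30, 1],
        [14,  2,   28, 1],
        [300, 50, 600, 3],   # scanning-like behaviour
        [290, 48, 580, 3],
        [5,   1,    2, 1],
    ])

    # Standardize features so no single feature dominates the distance metric.
    X_scaled = StandardScaler().fit_transform(X)

    # Partition the sources into k groups of similar behaviour.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
    print(kmeans.labels_)   # cluster label assigned to each source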

2.4.3 Statistical Techniques

Statistical analysis techniques have been widely used for characterizing and classifying network traffic and for detecting attack patterns. The basic concept of statistical techniques, in detecting anomalies, is to build a profile of normal behavior and then measure large deviations from the normal profile. Deviations from the normal profile are tested against a predefined threshold value, and anomalous behaviors are flagged once these deviations exceed the threshold.

Ye et al. [145] presented a host-based anomaly detection technique based on a chi-square (χ²) test, a statistical significance test. A system profile was built from events of normal system audit data, and the upper limit threshold was estimated from the empirical distribution of the normal event data using 3-sigma, the mean plus three times the standard deviation. New events were tested against the normal system profile, and large deviations were flagged as intrusions. The application of the chi-square (χ²) test to the identification of intrusions is detailed in [67].
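As a rough illustration of this style of test (a simplified sketch with hypothetical event counts, not the exact procedure of [145]), the following computes a chi-square distance between a new event-frequency vector and the normal profile and flags it when it exceeds a 3-sigma threshold estimated from the training data.

    import numpy as np

    def chi_square_distance(x, profile):
        # Sum of (observed - expected)^2 / expected over event categories.
        return np.sum((x - profile) ** 2 / profile)

    # Hypothetical normal event-frequency vectors (rows = observations,
    # columns = event types in the audit data).
    normal_data = np.random.default_rng(0).poisson(lam=[20, 5, 8, 2], size=(500, 4))
    profile = normal_data.mean(axis=0)            # expected frequencies (normal profile)

    # Empirical distribution of the distance on normal data -> 3-sigma threshold.
    distances = np.array([chi_square_distance(x, profile) for x in normal_data])
    threshold = distances.mean() + 3 * distances.std()

    new_event = np.array([60, 1, 0, 15])          # hypothetical new observation
    print("intrusion flagged:", chi_square_distance(new_event, profile) > threshold)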


The use of a k-nearest neighbor (KNN) classifier for detecting intrusive program behavior was presented by Liao et al. [95]. A program behavior vector is built from frequencies of system calls, and the KNN classifier was used to categorize a new program behavior as either normal or intrusive, based on its distance from the previous k normal profile vectors, using a threshold value. Barbara et al. [32] proposed the use of a Naive Bayes classifier, built from a training data set, to reduce the number of false alarms of ADAM [31], an anomaly detection system.

Change point detection, using the cumulative sum (CUSUM) algorithm, has been used to monitor statistical properties of network features and detect abrupt changes, i.e., deviations from normal behavior, resulting from anomalous traffic or attacks. Wang et al. [138] proposed the use of sequential change point detection for detecting TCP SYN flooding attacks. Attacks are detected as a violation of the normal behavior of the TCP SYN-FIN pair, that is, an abrupt change in the difference between the number of SYN packets and the number of FIN packets.
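The flavour of such a detector can be sketched as follows (an illustrative non-parametric CUSUM with arbitrary drift and threshold values, not the parameterisation of [138]): the statistic accumulates the positive excess of the per-interval SYN minus FIN difference and raises an alarm when it crosses a threshold.

    def cusum_alarm(syn_fin_diffs, drift=2.0, threshold=20.0):
        """Return the first interval at which the CUSUM statistic crosses `threshold`.

        syn_fin_diffs: per-interval (#SYN - #FIN) counts.
        drift: allowance subtracted each step so normal fluctuations decay to zero.
        """
        s = 0.0
        for i, d in enumerate(syn_fin_diffs):
            s = max(0.0, s + d - drift)   # accumulate only positive deviations
            if s > threshold:
                return i                  # change point (possible SYN flood) detected
        return None

    # Hypothetical traffic: balanced SYN/FIN counts, then a surge of unanswered SYNs.
    diffs = [1, 0, 2, 1, 0, 1, 15, 18, 20, 22]
    print(cusum_alarm(diffs))   # -> index where the alarm is raised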

Ahmed et al. [28] proposed a technique for detecting anomalous activities of a Darknet, a class C address block, using sliding windows and a non-parametric cumulative sum. In the context of worm detection, the change point detection technique has been used to detect Internet worms [44]. Yan et al. [143] used change point detection to detect two classes of worms that target Internet messaging systems through monitoring surges in file transfer requests or URL-embedded chat messages.

Feinstein et al. [52] proposed the use of chi-square and entropy statistics in detecting distributed denial of service (DDoS) attacks. Application of the exponentially weighted moving average (EWMA) to intrusion detection was explored by Ye et al. [147]. Finally, Barford et al. [34] proposed the use of wavelet analysis, a signal processing technique, for detecting network traffic anomalies.
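For reference, an EWMA-based detector of the kind explored in [147] can be sketched as follows; the smoothing factor, control limit, and traffic counts below are assumed for illustration only.

    import numpy as np

    def ewma_detect(series, alpha=0.2, k=3.0):
        """Return indices where an observation deviates from the EWMA by more than k sigma."""
        series = np.asarray(series, dtype=float)
        ewma = series[0]
        residuals = []                  # past deviations, used to estimate normal variability
        alarms = []
        for i, x in enumerate(series[1:], start=1):
            sigma = np.std(residuals) if len(residuals) > 1 else 1.0
            if abs(x - ewma) > k * max(sigma, 1e-9):
                alarms.append(i)
            residuals.append(x - ewma)
            ewma = alpha * x + (1 - alpha) * ewma   # update the smoothed statistic
        return alarms

    # Hypothetical per-minute event counts with a sudden surge at the end.
    counts = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 40]
    print(ewma_detect(counts))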

2.5 Honeypots

The first use of the honeypot concept was by Cliff Stoll in his book “The Cuckoo's Egg” [129]. This book describes the author's experience, over a ten-month period in 1986, with an attacker who succeeded in compromising his system. When the attack was discovered, the attacker was allowed to stay, while being monitored, in order to learn more about his tactics, interests and identity. Stoll used a production system in the process of creating the lure that was used to study the attacker's activities, unlike current honeypots, which are decoy computers that run no legitimate services.

Another early use of the honeypot concept was described by Bill Cheswick [45]. Cheswick discussed his several months' experience with an attacker who broke into his lure system, which had been initially built with several vulnerable services for the purpose of monitoring threats to his system. The paper is considered the first technical work in building and controlling a lure system, or what would later be called a honeypot.

The term “honeypot” was first introduced by Lance Spitzner [129]. Spitzner defines a honeypot as a security resource whose value lies in being probed, attacked, or compromised. Provos et al. [117] define a honeypot as a closely monitored computing resource that one wants to be probed, attacked, or compromised. These definitions imply that a honeypot can be any type of computer resource, such as a firewall, a web server, or even an entire site. Other properties of a honeypot, implied by its definition, include that it runs no real production services and that any contact with it is considered potentially malicious. Also, traffic sent to or from a honeypot is considered either an attack or a result of the honeypot being compromised.

Figure 2.3 shows an example of a virtual honeypot setup that emulates two operating systems, a Windows 2000 server and a Linux server. The whole honeypot setup, including the logging mechanism, is hosted on a single Linux machine. The open source daemon Honeyd [116] is used to emulate the two operating systems.

Honeypots are valuable security resources that are widely known and used. Several honeypot characteristics have contributed to their popularity in the security community. The most appealing characteristics of honeypots are the low rates of false positives and false negatives in their collected data. The low-noise nature of a honeypot's collected traffic results from its design concept of running no production services; therefore, all traffic is considered suspicious. Notable features include:

• honeypots collect small volumes of higher value traffic;

• honeypots are capable of observing previously unknown attacks;

• honeypots detect and capture all attackers' activities including encrypted traffic and commands; and


Figure 2.3: An example of a virtual honeypot setup that emulates two operating systems. A Linux host machine runs the Honeyd daemon and a tcpdump traffic logger, hosting virtual Windows 2000 Server and Linux honeypots behind a router connected to the Internet.

• honeypots require minimal resources and can be deployed on surplus machines.

There are several types of honeypots, which can be grouped into four broad categories [129, 117] based on:

• their level of interaction (low- and high-interaction honeypots);

• their intended use (production and research honeypots);

• their hardware deployment type (physical and virtual honeypots); and

• their attack role (server side and client side honeypots).

In the following subsections, we will discuss these categories in more detail. In addition, we will highlight some of the available honeypot technologies and solutions.


2.5.1 Low-interaction vs High-interaction Honeypots

A low-interaction honeypot is the simplest form of honeypot. It runs no real operating system and offers an emulated network stack with limited or no service interactions. Some of the advantages of using low-interaction honeypots are their ease of deployment and their low level of risk of being compromised by attackers. The main disadvantages of low-interaction honeypots are the limited amount of information in their collected traffic and the ability of attackers to detect their presence.

An example of a low-interaction honeypot is a port listener such as Netcat [5]. The Netcat command:

nc -v -l -p 445 > port445.log

opens a listener on TCP port 445 and accepts connections, logging all activity to the file “port445.log”. More sophisticated examples of low-interaction honeypots include Honeyd [116] and LaBrea [97]. Honeyd is capable of emulating different operating systems (OSs) at the same time and supports emulation scripts of basic protocol behaviors such as FTP (TCP port 21) and SMTP (TCP port 25).

In contrast to low-interaction honeypots, high-interaction honeypots are full systems with real operating systems and real applications. The main advantages of using high-interaction honeypots are the rich information collected from their attack traffic and the difficulty for attackers of detecting their presence. On the other hand, high-interaction honeypots are complex to deploy, overwhelm administrators with vast amounts of collected data, and introduce a high risk to networks in the case that they are compromised by attackers. Examples of high-interaction honeypots include Generation II honeynets [70] and Argos [1].

While high-interaction honeypots provide more information for studying attackers' activities through their full system functionality, they do not scale well for large-scale deployments in terms of hardware and software requirements and high cost of maintenance. In contrast, low-interaction honeypots scale very well, such that thousands of low-interaction honeypots can be run in parallel on a single machine. However, low-interaction honeypots suffer from limited (or absent) support for emulation scripts. A detailed discussion of this topic is presented in Section 2.5.5.


2.5.1.1 Honeyd

Honeyd, a honeypot daemon, is a low-interaction honeypot that was developed in 2002 by Niels Provos of the University of Michigan [116]. Honeyd is an open source distribution that was originally designed to run on UNIX systems and was later ported to the Windows environment. Honeyd is based on the open source packet capture library libpcap and the low-level networking library libdnet. It can detect and log connections to any TCP or UDP port and monitor up to 60,000 victim IP addresses at the same time. When an attacker tries to connect to a non-existent IP address of a computer system, Honeyd assumes the identity of the non-existent system and replies to the attacker's connection attempts.

Honeyd can emulate different operating systems at the same time, at both the application and IP stack levels. In emulating specific operating system (OS) TCP/IP stacks, it relies on the fingerprint database files of Nmap and Xprobe2, the most common tools used by hackers for fingerprinting OSs, when manipulating and creating traffic. Because these tools specialize in fingerprinting, Honeyd can fool many attackers.

One of Honeyd's most flexible features is the ability to add emulation scripts that mimic application and network services, either by creating custom scripts, using any scripting language, or by downloading ready-made scripts from the Honeyd project web page. Ready-made scripts include IIS, FTP, POP, SMTP, and telnet emulators. Moreover, as Honeyd relies on the work of other open source tools in emulating OS TCP/IP stacks, including Nmap [62] and Xprobe2 [144], Honeyd can be updated by refreshing the fingerprinting databases. Limitations of Honeyd include the difficulty of writing programs that completely emulate the behaviors and vulnerabilities of network services.

2.5.2 Production vs Research Honeypots

Honeypots can be divided further, based on their intended use, into production and research honeypots. Production honeypots are used by many organizations to protect their production services [70]. Production honeypots are usually deployed to mirror some or all of an organization's production services in order to study attackers' techniques and the tools that they use against the organization's real networks, to expose unknown vulnerabilities, and to assess security measures. Moreover, by analyzing data collected by production honeypots, organizations can build better systems, easily assess damage to compromised systems, and collect forensic evidence that is not mixed with production traffic. An example of a production honeypot is Honeynet [115].

Research honeypots are usually deployed by universities and research centers to collect information on threats. This information is then used for a variety of purposes, including studying attackers' motivations and tools, and researching better techniques to analyze honeypot traffic. The Leurré.com project is an example of a research honeypot (see Section 2.3).

2.5.2.1 Honeynets

Honeynets are high-interaction honeypots that were developed by the Honeynet project [70, 115]. The concept behind the Honeynet is to build a complete network of production systems where all activity is controlled, captured, and analyzed. The Honeynet controls the attacker's activity using a Honeywall gateway. This gateway allows inbound traffic to the victim systems, but controls the outbound traffic using intrusion prevention technologies and a connection limiting mechanism. Three Honeynet generations currently exist: Generation I, Generation II, and Generation III. Generation I was developed in 1999 to capture beginner-level attacker activities, with limited data control. The Generation II Honeynet is a derivative of Generation I, and was developed in 2002 to address several weaknesses of the Generation I Honeynet, to improve data control, and to make it more difficult to fingerprint, or determine the type of operating system used. Finally, the Generation III Honeynet was released in 2004 with further refinements.

2.5.3 Physical vs Virtual Honeypots

A physical honeypot is a single machine running a real OS and real services, where the honeypot is connected to a network and is accessible through a single IP address. Physical honeypots are always associated with the concept of high interaction. However, physical honeypots are less practical in real network environments due to the limited view of their single IP address and the higher cost of maintaining a farm of physical honeypots. Honeynets are examples of physical honeypots.

In contrast, virtual honeypots are more cost effective in monitoring large IP address spaces and in emulating different operating systems at the same time. Virtual honeypots are usually implemented using a single physical machine that hosts several virtual honeypots. User Mode Linux (UML) [21] is a well-known tool for deploying virtual honeypots in Unix environments. The commercial tool VMware [22] allows more flexibility by running different operating systems at the same time. An example of a virtual honeypot is Argos [1]. Other examples implemented using VMware include Collapsar [142], Potemkin [137], and HoneyStat [50].

2.5.3.1 Argos

Argos [1] presents a new method of deploying virtual honeypots, in which the emulating host monitors and detects attacks against the emulated guests, the honeypots. Argos was designed based on the open source emulator QEMU [17]. The QEMU capability of running multiple operating systems was extended by Argos to detect attacks targeting the emulated guests, without any modification of the guest operating systems, through dynamic taint analysis. In dynamic taint analysis, network data from an untrusted source is tagged and its propagation and execution are tracked. When execution of the tainted data leads to unexpected system behavior, such as a buffer overflow attack, Argos identifies and prevents the usage of the tainted data. It dumps the memory block, along with the tagged data and some extra information, for further analysis of the vulnerability. The usage of Argos to detect new attacks differs from traditional honeypots as it uses dynamic taint analysis of external traffic targeting the emulated hosts. Furthermore, the emulated hosts' IP addresses are advertised.

2.5.4 Server Side vs Client Side Honeypots

Conventional honeypots are server side honeypots, which are set up to lure attackers involved in malicious activities. These honeypots are passive by design and do not initiate any traffic unless they are compromised. An example of a server side honeypot is the low-interaction honeypot Honeyd [116] (see Section 2.5.1.1). Server side honeypots have proved to be useful in detecting new exploits, collecting malware, and enriching threat analysis research.

An emerging trend is the active, or client-side, honeypot, developed in response to client-side attacks [127]. Client-side attacks target vulnerable client applications, such as web browsers, when these applications interact with malicious servers. The aim of client-side honeypots is to search for and detect these malicious servers. An example of a client-side honeypot is Strider HoneyMonkey [140], a Microsoft project to detect and analyze web sites that host malicious code that exploits web browsers. Honeyclient [11] and HoneyC [6] are further examples.

2.5.4.1 HoneyMonkey

HoneyMonkey [140] is a client-side honeypot developed by Microsoft Research for detecting malicious web content that exploits vulnerabilities in Internet Explorer. HoneyMonkey works by using monkey programs that configure high-interaction honeypots with different configurations and patch levels, running under virtual machines, to mimic humans browsing the Internet. HoneyMonkey searches Internet web sites looking for malicious content that exploits browser vulnerabilities. When web content succeeds in exploiting the browser, HoneyMonkey generates a detailed report of the vulnerability, including the web site URL, the Windows registry, and a log of the infected virtual machine.

2.5.5 Improving Honeypots While Lowering Their Risks

One major drawback of low-interaction honeypots is their inability to interact with the attacker to the level needed to reveal the characteristics of the attack. In contrast, high-interaction honeypots have this capability, but carry an increased risk of being fully compromised. A high-interaction honeypot needs constant monitoring to decrease the legal risks associated with it either being used against other networks or exposing the local production system to attacks. Increasing honeypot interactivity, while reducing the risk associated with its deployment, is a useful but challenging task. Early attempts used scripts to mimic services [116]. However, this method proved time consuming and impractical in the case of complex protocols, as it requires a full understanding of the protocol concerned.

Several systems have been proposed to address these challenges and to raise the level of honeypot interactivity while keeping the risk low [49, 93]. ScriptGen [93] is a system that extends the capability of the well-known low-interaction honeypot Honeyd with the automatic generation of emulation scripts of protocol behaviors, without requiring prior knowledge. The system starts with a limited emulation capability, which is extracted by training the system via real interactions with a real server, using a high-interaction honeypot. A state machine is then built on these data in an incremental way to react to attackers' requests. When a request is not recognized by ScriptGen's state machine, the whole conversation is replayed against a high-interaction honeypot to extract the required response. The emulation script is then refined with this new data in order to respond to similar requests in the future. The use of high-interaction honeypots is limited to cases where the response is not present in ScriptGen's current state machine.

Another challenge with honeypots is to keep their deployment hidden so as to preserve their value and maximize their role in tracking attackers. When an attacker detects that he is dealing with a honeypot, he will generally try to avoid it or feed it with bogus data. Yegneswaran et al. [148] described a technique for defending against honeynet mapping by changing the locations of honeypots randomly in the address space whenever the number of probes exceeds a predefined threshold.

2.5.6 Honeypot Traffic Anomalies

Honeypots are passive machines which, by definition, run no production services, and their deployments are not advertised. These properties make any traffic targeting honeypots suspicious. Analysis of honeypot traffic reveals that honeypots may collect traffic related to other Internet phenomena, such as misconfigured servers and backscatter, and also some legitimate traffic, such as vulnerability scans from local administrators.

While it is easy to filter out backscatter [100] and to detect local administration scans (as the scanning IPs are known), detecting and eliminating misconfiguration traffic is very challenging [151]. Other types of traffic seen by honeypots are considered malicious and fall under the broad categories of threats facing computers and networks connected to the Internet, such as denial of service attacks, scans, and worms.

Other properties of anomalous traffic collected from different IP address spaces, including honeypots, show that the patterns and volumes of this traffic vary from one location to another [30]. These differences have been observed over all protocols and services and over specific protocols. The variability in traffic volumes and patterns can be attributed to many factors, including the filtering policy, the configuration of the monitored address space, the propagation strategy of the malicious code, and limited global reachability due to poor or absent routing.


2.5.7 Existing Honeypot Solutions

Honeypots are a relatively recent security technology, yet they have drawn the attention of significant numbers of researchers and network administrators. Increased interest in honeypots has led to the development of a variety of honeypot technologies. The following section reviews several existing honeypot-based solutions for countering different types of security threats.

2.5.7.1 Automatic Generation of IDS Signatures

Honeycomb [83] is a honeypot system that automatically generates intrusion detection signatures for unknown attacks. Honeycomb is built as an extension to the open source honeypot Honeyd. Honeycomb's integration with Honeyd enables it to see Honeyd's sent and received traffic and connection states. Honeycomb uses pattern-detection techniques and packet-header conformance tests in order to generate signatures for the two popular intrusion detection systems Bro and Snort. Honeycomb is one of the few systems that utilizes the honeypot's characteristic of collecting only malicious traffic and automates the current manual process of analyzing collected data. Honeycomb shares limitations inherited from Honeyd.

2.5.7.2 Worm Detection Systems

Computer worms are defined as “independent replicating and autonomous infection agents, capable of seeking out new host systems and infecting them via the network” [102]. Using worms, attackers can do massive damage to the Internet by compromising a vast number of hosts. Such damage includes distributed denial of service (DDoS) attacks, and access to and corruption of sensitive data [130].

Honeypot solutions aimed at detecting worms include HoneyStat [50] and SweetBait [107]. HoneyStat [50] is a worm detection system for local networks using honeypots. It is implemented using a VMware GSX Server running virtual machines of several operating systems, such as Windows and Linux. It represents the third generation of honeypot deployments at the Georgia Institute of Technology, which followed the deployment of Honeynets Gen I and Gen II.

HoneyStat was designed based on modeling worm infections in a honeypot. It monitors unused address spaces and generates three types of alerts: memory, disk and network alerts. These streams of alerts are automatically collected and statistically analyzed, using logistic regression, to detect worm outbreaks. HoneyStat uses data from the local network only, which limits the amount of traffic that can be observed by its nodes.

2.5.7.3 Malware Collection

Malware is malicious software for exploiting vulnerabilities in computer systems. Types of malware include viruses, worms, and trojan horses. There are several honeypot projects that collect malware, such as Nepenthes [12], Honeytrap [103], and IBM Billy Goat [124].

Nepenthes [12] is an open source low-interaction honeypot for collecting malware. As honeypots are the most effective tool for collecting malware, Nepenthes was developed specifically to fill a gap that existed in honeypot technology for collecting automated malicious software. Nepenthes inherits the main characteristics of low-interaction honeypots, emulating thousands of honeypots at the same time with low hardware requirements, and excels in its efficiency due to its approach of emulating only the vulnerable part of each service. Another appealing feature of Nepenthes with regard to capturing malware is the flexibility of its emulation process: Nepenthes can decide at run time the right configuration for the exploit to succeed, for example, whether Unix or Windows is required. Finally, the deployment and maintenance cost of Nepenthes is minimal, and it carries very low risk as all systems and services are emulated. However, the use of Nepenthes is limited to self-propagating malware that first scans for vulnerabilities and then attempts to exploit them.

2.6 Related Work

The previous sections have established the background for the thesis as a whole. This section reviews research outcomes from the Leurré.com project and provides other background and related material that is relevant to the work described in Chapters 3 to 6. Research challenges in honeypot traffic analysis are also identified.

2.6.1 Research Outcomes from the Leurré.com Project

Various types of analysis have been carried out on honeypot traffic obtained from the Leurré.com project. The aims of this analysis were to characterize different Internet attack activities and to unveil useful attack patterns [109, 114, 108, 113, 105].

Pouget et al. [112] applied association rule mining [27] to different features of low-interaction honeypot traffic, with the port sequence of a 'large session' as the main clustering feature. The aim of their study was to group traffic that shares similar activity fingerprints into clusters in order to find the root causes of attacks, or tools. In their research, each cluster is assumed to represent one attacking tool or its re-configuration.

The clustering of honeypot traffic was investigated further by Pouget et al. [110], and the notion of cliques was introduced to identify the inter-relationships between clusters, that is, clusters that share strong similarities in one or more dimensions, such as targeted environments and origin of attacks.

The use of packet Inter-Arrival Times (IAT) for characterizing anomalous honeypot traffic was introduced by Zimmermann et al. [151]. The study was conducted on six months of honeypot traffic data. The usefulness of the IAT in characterizing anomalous honeypot traffic was demonstrated through the discovery of several anomalous activities, from different IP sources, that share similar IAT peak distributions.

Thonnard et al. [134] proposed a framework for identifying attacks that share similar patterns based on the selection of different traffic features. The model was demonstrated using time signatures to find temporally correlated attacks. The framework utilizes a clique-based clustering algorithm to group pre-clustered honeypot traffic.

2.6.2 Application of Principal Component Analysis to Internet Traffic

Principal component analysis (PCA) is a statistical technique for reducing the dimensionality of data into a few uncorrelated variables that retain most of the variation in the original data. These newly derived variables are called principal components (PCs) and they can be used instead of the original variables. The reduction in the number of variables, the PCs, serves as a basis for many data analysis techniques, which include data reduction, data visualization, and outlier detection [75, 74].
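As a brief illustration of the mechanics (a generic sketch on hypothetical data, not any of the specific models reviewed below), the PCs can be obtained from the eigenvectors of the covariance matrix of the centred feature matrix, ordered by the variance they explain.

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical traffic feature matrix: rows = observations, columns = features
    # (e.g. packet counts, distinct ports, durations).
    X = rng.normal(size=(200, 5))

    # Centre the data and compute the covariance matrix.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)

    # Eigen-decomposition; eigenvectors are the principal components (PCs),
    # eigenvalues give the variance captured by each PC.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Project onto the first k PCs to obtain a reduced representation.
    k = 2
    scores = Xc @ eigvecs[:, :k]
    explained = eigvals[:k].sum() / eigvals.sum()
    print(f"variance retained by {k} PCs: {explained:.2%}")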

The applications of PCA to computer network traffic fall roughly into three categories: detecting the latent structure of the traffic data, reducing the dimension of the traffic data, and identifying anomalies. A number of researchers have used principal component analysis to reduce the dimensionality of variables and to detect anomalous network traffic. The use of PCA to structure network traffic flows was introduced by Lakhina [87], whereby principal component analysis is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts, and noise.

Labib et al. [86] utilized PCA for reducing the dimension of the traffic data and for visualizing and identifying attacks. For detecting different types of attacks, the loadings of attack features on the retained PCs were compared to a predefined threshold and visualized using Bi-Plots. Bouzida et al. [42] presented a performance study of two machine learning algorithms, namely nearest neighbors and decision trees, when used with traffic data with or without PCA. They found that when PCA is applied to the KDD 99 data set to reduce the dimension of the data, the learning speed improved while accuracy remained the same.

Terrell et al. [133] used principal component analysis on features of the aggregated network traffic of a link connecting a university campus to the Internet in order to detect anomalous traffic. Sastry et al. [126] proposed the use of singular value decomposition and the wavelet transform for detecting anomalies in self-similar network traffic data. Wang et al. [139] proposed an anomaly intrusion detection model for monitoring network behaviors based on principal component analysis. The model utilizes PCA to reduce the dimensions of the historical data and to build the normal profile, represented by the first few components that account for the most variation in the data. An anomaly is flagged when the distance between a new observation and the normal profile exceeds a predefined threshold.

Ye et al. [146] studied the performance of Hotelling's T² test, a multivariate statistical process control technique that is equivalent to retaining all components in the PCA model, against a chi-squared distance test for host-based anomaly detection. The study was conducted on two data sets of different sizes, and it was concluded that the chi-squared test scales well for real-time detection, while Hotelling's test detects counter-relationships or changes in the structure of the variables.

Shyu et al. [128] proposed an anomaly detection scheme based on robust principal component analysis. Two classifiers were implemented to detect anomalies: one was based on the major components that capture most of the variation in the data, and the second was based on the minor components, or residuals. A new observation is considered to be an outlier, or anomalous, when the sum of squares of the weighted principal components exceeds a threshold in either of the two classifiers.

Lakhina et al. [88] applied the principal component analysis technique to Origin-Destination (OD) flow traffic counts of link data bytes. The network traffic was separated into normal and anomalous spaces by projecting the data onto the resulting PCs one at a time, ordered from high to low. PCs are assigned to the normal space as long as a predefined threshold (3-sigma) is not exceeded. When the threshold is exceeded, that PC and the subsequent PCs are assigned to the anomalous space. New OD flow traffic is projected into the anomalous space, and an anomaly is flagged if the value of the square prediction error, or Q-statistic, exceeds a predefined limit. The subspace method was extended by the same authors [25] to detect anomalies in multivariate time series of OD flow traffic with three features (number of bytes, number of packets, and number of flows). Their new model tests for anomalies in both the normal space and the anomalous space.
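The subspace idea can be sketched as follows (a simplified illustration with hypothetical data and an empirically chosen limit, rather than the exact Q-statistic derivation of [88]): the first k PCs span the modelled subspace, and the squared norm of the residual after projection, the square prediction error (SPE), is compared against a threshold.

    import numpy as np

    rng = np.random.default_rng(2)
    # Hypothetical training matrix of traffic measurements (rows = time bins).
    X = rng.normal(size=(300, 6))
    mean = X.mean(axis=0)
    Xc = X - mean

    # Principal components from the covariance of the training data.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    P = eigvecs[:, np.argsort(eigvals)[::-1][:3]]   # first k=3 PCs span the model subspace

    def spe(x):
        """Square prediction error: squared norm of the residual after projection."""
        xc = x - mean
        residual = xc - P @ (P.T @ xc)              # part of x not explained by the PCs
        return float(residual @ residual)

    # Threshold chosen empirically here (a high percentile of training SPEs); the
    # formal Q-statistic limit is derived from the residual eigenvalues instead.
    limit = np.percentile([spe(x) for x in X], 99)

    new_obs = rng.normal(size=6) + np.array([0.0, 0.0, 0.0, 20.0, 0.0, 0.0])  # large shift in one feature
    print("SPE =", round(spe(new_obs), 2), "limit =", round(limit, 2),
          "anomalous:", spe(new_obs) > limit)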

Guangzhi et al. [119] proposed a real-time detection system based on multivariate statistics. The normal profile of the network system was built from attributes of the network hierarchy using Hotelling's T² statistic. New traffic triggers an alarm if its distance from the normal region exceeds the predefined upper and lower control limits.

Terrell et al. [133] have used singular value decomposition (SVD), a different method for extracting principal components, to detect attacks in near real-time Internet traffic collected every hour from a university's main link. Attack traffic is aggregated into bins of different sizes and features are extracted from these bins. Attack detection is achieved by measuring the weighted sum of squares of the least significant component scores against a predetermined threshold value extracted from a gamma distribution, under an assumption of normality of the network traffic.

2.6.3 Research Challenges

As discussed in Section 2.2, the amount of traffic collected by monitoring methods is tremendous, which makes it very difficult to handle, store and analyze. Adding to this challenge is the dynamic nature of anomalous traffic, which changes frequently due to factors including changes in software configurations and deployments of new protocols and services.


While honeypots excel in collecting smaller traffic volumes when compared to other traffic collection techniques, extracting useful anomalous patterns or detecting new attacks in honeypot traffic necessitates research into better analysis techniques to summarize and process this traffic. This study has identified several research challenges that need to be addressed for efficient detection of anomalous honeypot traffic:

• The proposed technique should be capable of extracting useful attack patterns from traffic gathered by low-interaction honeypots, which has a low level of detail;

• The proposed technique must have the capacity to handle and summarize traffic data sets with multiple features;

• The proposed technique must have low computational requirements and be suitable for real-time application;

• The proposed technique should adapt to the dynamic nature of Internet attacks and capture new trends with little or no human intervention or tuning.

Two aspects of traffic analysis need to be considered to address the research challenges: traffic representation and traffic analysis techniques. Broadly speaking, traffic features are extracted either from packet level or flow level data. While packet level analysis provides a wealth of information on all aspects of attacks, including access to payloads [51], it suffers from performance issues when handling large amounts of data. In contrast, network flows, such as NetFlow [132] and Argus [2], provide an aggregated summary of connection information [63]. Traffic flows provide enough information to characterize a variety of Internet anomalies [33], excel in their performance for real-time analysis, and are able to aggregate large amounts of traffic data in a summarized form.
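To illustrate the flow-level representation referred to here (a minimal sketch over hypothetical packet records, not the NetFlow or Argus implementations), packets sharing the same 5-tuple can be aggregated into a flow record summarizing packet and byte counts and duration.

    from collections import defaultdict

    # Hypothetical packet records: (timestamp, src_ip, dst_ip, proto, sport, dport, bytes)
    packets = [
        (0.00, "198.51.100.7", "203.0.113.2", "TCP", 3312, 445, 48),
        (0.02, "198.51.100.7", "203.0.113.2", "TCP", 3312, 445, 60),
        (0.10, "198.51.100.7", "203.0.113.3", "TCP", 3313, 139, 48),
    ]

    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})
    for ts, src, dst, proto, sport, dport, size in packets:
        key = (src, dst, proto, sport, dport)          # the 5-tuple identifying a flow
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += size
        f["start"] = ts if f["start"] is None else f["start"]
        f["end"] = ts

    for key, f in flows.items():
        duration = f["end"] - f["start"]
        print(key, f["packets"], f["bytes"], f"{duration:.2f}s")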

In our search for optimal analysis techniques that address the previously mentioned research challenges, we have selected data reduction techniques, a subset of multivariate statistical techniques, for analyzing honeypot traffic. Data reduction refers to the process of reducing the number of variables in data sets while retaining enough information for the intended type of analysis. Examples of linear dimension reduction techniques are principal component analysis, projection pursuit, factor analysis, and independent component analysis.


Principal component analysis (PCA), often computed using singular value decomposition (SVD), is the best known and most widely used linear dimension reduction technique [54, 36]. PCA is easy to implement and has low computational requirements. The basic idea of PCA is to reduce the dimensionality of a data set into a few uncorrelated variables, or principal components (PCs). The resulting principal components are linear combinations of the original variables and retain most of the variance in the original data. Principal component analysis is explained in detail in Chapter 4.

The aim of the Leurré.com project's clustering approach is to find the root causes of attacks, where each cluster is assumed to represent one attacking tool or its re-configuration. The aim of this research, in using PCA, is different in the following ways. It seeks to:

• determine factors that contribute to the variations in the traffic patterns, and

• detect new attacks in real-time applications.

Previous applications of PCA to network traffic treat the network traffic as either normal or anomalous, and the detection model is built on what is believed to be normal. The notions of normal and anomalous do not apply to honeypot traffic, where all traffic is suspicious. Thus, our technique is fundamentally different from previous applications of PCA techniques in the following ways. Firstly, traffic features are extracted from aggregated flows, where standard flows from a single IP address are grouped together to provide sufficient information on attack patterns. Secondly, PCA is used to build a model of existing attacks that have been seen in the past, rather than a model of normal behaviors. Any large deviation from the attack model is considered either a new attack vector or an attack that is not present in the model. Finally, the use of recursive principal component analysis (RPCA) is introduced in order to design a real-time adaptive detection model that both captures new changes in anomalous Internet traffic and updates its parameters automatically.

2.6.3.1 Honeypot Traffic Validation

Broadly speaking, there are two methods for validating an attack detection model. The first validation method consists of manually labeling attacks in a data set and then testing the model's performance, in terms of false positive and false negative criteria, against this labeled data. The second validation approach is based on testing the model's performance against synthetically crafted attacks that are manually injected into the data set.

These classical IDS validation methods cannot be applied to a low-interaction honeypot detection model, as the notions of normal and abnormal do not apply. The fact that the collected traffic is considered suspicious per se, together with the low level of detail available, makes labelling and categorizing the collected traffic difficult or impossible. To address these challenges, in the absence of a well-established methodology for validating a honeypot traffic detection model, manual inspection of the traffic flagged by the detection method is used. Although manual inspection of traffic is expensive, it allows a better understanding of the nature, significance, and classes of the flagged traffic. Manual inspection of traffic has been used previously to validate and inspect low-interaction honeypot traffic [111, 151, 134].

2.7 Summary

In this chapter, current research in monitoring anomalous network traffic has been presented, with emphasis on honeypots as essential tools for gathering useful information on a variety of malicious activities. Honeypots come in three basic varieties: fully dedicated real systems such as Honeynets, emulated-service honeypots such as Honeyd, and virtual honeypots such as Argos. Low-interaction honeypots pose minimal risks to the network and require low administration effort when compared with high-interaction honeypots, making them ideal for research.

Running honeypots results in the collection of a large amount of traffic data. Extracting useful information and knowledge from this data requires efficient techniques for discovering hidden attack patterns and detecting new attacks.

Three types of data analysis techniques have been widely used in traffic analysis: data mining, statistical analysis, and visualization. Research in applying these techniques to network traffic has been reviewed. In addition, research challenges in the field of honeypot traffic analysis have been identified.

The next chapter presents this study's first contribution to analyzing low-interaction honeypot traffic, using data from the Leurré.com project, and details the methodology for improving the Leurré.com clusters by grouping clusters that share similar types of activities based on packet inter-arrival time distributions. The use of principal component analysis to characterize and visualize low-interaction honeypot traffic is described in Chapter 4. Detecting new attacks in low-interaction honeypot traffic through the use of the principal component residual space and square prediction error (SPE) statistics is detailed in Chapter 5. Finally, Chapter 6 proposes an adaptive detection model that captures changes in Internet traffic and updates its parameters automatically.


Chapter 3

Traffic Analysis Using Packet Inter-arrival Times

The Leurré.com project is a world-wide deployment of identical low-interaction honeypot platforms. This chapter gives a brief introduction to the Leurré.com project setup and its methodology for collecting and processing honeypot traffic, with a focus on Leurré.com's attack clusters. Then, a new methodology is proposed that overcomes a limitation of Leurré.com's clustering technique, namely its production of a large number of clusters that share some similarities. The new method is based on grouping clusters that share similar packet inter-arrival time (IAT) distributions.

This chapter is organized as follows. Section 3.1 provides a brief overview of Leurré.com's terminologies, platform architecture, and data manipulation. Section 3.2 provides a preliminary investigation of packet inter-arrival times. Section 3.3 discusses our methodology for analyzing pre-clustered honeypot traffic using packet inter-arrival times. Experimental results are presented in Section 3.4. Finally, Section 3.5 summarizes the chapter.

The work described in this chapter is a joint project with the Leurré.com team and has led to the following publication:

S. Almotairi, A. Clark, G. Mohay, O. Thonnard, M. Dacier, C. Leita, V. Pham, J. Zimmermann, “Extracting Inter-arrival Time Based Behaviour from Honeypot Traffic using Cliques”, in Proceedings of the 5th Australian Digital Forensics Conference, Perth, Australia, December 2007.


3.1 Information Source

All analyses in this thesis use data that comes directly from the Leurré.com project, a world-wide deployment of low-interaction honeypots. This section gives a brief overview of Leurré.com's terminologies, platform architecture, and data manipulation.

3.1.1 The Leurré.com Honeypot Platform

The Leurré.com project is a world-wide deployment of identical low-interaction honeypot platforms. Each platform, which is based on the open source low-interaction honeypot Honeyd [116], runs on a single machine and emulates three operating systems at the same time: Windows 2000 Professional, Windows 2000 Server and Linux RedHat 7.3. The Windows virtual hosts have the following open services to provide interactions with attackers: TCP ports 21, 23, 80, 139, and 445, and UDP port 137. The UNIX virtual honeypot has the following open TCP ports: 21, 22, 25, 80, 111, 514, 515, and 8080. The platform architecture is presented in Figure 3.1.

Figure 3.1: Leurré.com honeypot platform architecture. A Unix host machine running Honeyd and a tcpdump traffic logger emulates three virtual honeypots on consecutive IP addresses behind a router to the Internet: Windows 2000 Professional and Windows 2000 Server (each with TCP ports 21, 23, 80, 139, 445 and UDP port 137 open) and Linux (TCP ports 21, 22, 25, 80, 111, 514, 515 and 8080 open).

3.1.2 Data Manipulation

The data manipulation of Leurré.com's honeypot traffic is done offline. On a daily basis, traffic logs from all platforms are transferred to a centralized machine where they are processed, enriched with external data, and inserted into relational database tables. In this section, some of the data manipulation tasks and terminologies are overviewed.

3.1.2.1 Port Sequences

The port sequence is the main feature of Leurré.com's clustering algorithm. A port sequence is a list of targeted honeypot ports generated by a single IP address during the attack period. For example, the port sequence of an attack generated by an IP address that targeted TCP ports 139, 445, 998, 139, 445, UDP port 137 and ICMP traffic would be: {T139, T445, T998, U137, I}. Figure 3.2 provides an illustration of the port sequence of an attack.

[Figure: an attacker on the Internet sends an attack vector of ICMP, TCP 139, TCP 445, TCP 998, TCP 139, TCP 445 and UDP 137 packets to the honeypot platform (Honeyd running on a Unix workstation emulating the Windows 2000 Professional, Windows 2000 Server and Unix hosts), yielding the port sequence I, T139, T445, T998, U137.]

Figure 3.2: Illustration of the port sequence of an attack.
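To make the notion concrete, the following minimal Python sketch derives a port sequence from an ordered list of packets, reproducing the example of Figure 3.2. The Packet record and the helper function are illustrative conveniences for this text, not the Leurré.com processing code.

from collections import namedtuple

# Hypothetical packet record: protocol ("TCP", "UDP" or "ICMP") and destination port.
Packet = namedtuple("Packet", ["proto", "dport"])

def port_sequence(packets):
    """Return the ordered list of distinct targeted ports, keeping first occurrences only."""
    seq = []
    for p in packets:
        label = "I" if p.proto == "ICMP" else p.proto[0] + str(p.dport)
        if label not in seq:
            seq.append(label)
    return seq

# The attack vector of Figure 3.2:
attack = [Packet("ICMP", None), Packet("TCP", 139), Packet("TCP", 445),
          Packet("TCP", 998), Packet("TCP", 139), Packet("TCP", 445),
          Packet("UDP", 137)]
print(port_sequence(attack))   # ['I', 'T139', 'T445', 'T998', 'U137']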

3.1.2.2 Large vs. Tiny Sessions

Two notions of sessions are currently used in the Leurré.com project: large and tiny sessions. While a large session is the set of all activities and packets exchanged by one source against one platform, a tiny session is a subset of the large session for the activities of one source against a single virtual host (every platform runs three virtual hosts). Accordingly, a large session comprises at most three tiny sessions. A large session is terminated when the next packet from the same source arrives more than 25 hours later.

3.1.2.3 Traffic Clusters

Large sessions that share similar traffic fingerprints are grouped together according to a hierarchy-based clustering approach [111]. The aim of the clustering algorithm is to discriminate between different attacking activities based on their distinct cluster signatures, where each cluster represents an attacking tool. Current features that are utilized by the clustering algorithm include:

• the number of targeted virtual machines on the honeypot platform;

• the sequence of ports;

• the number of packets sent by the attacking source;

• the number of packets sent to each honeypot virtual machine;

• the duration of the attack; and

• the ordering of attacks against the virtual machines.

A final refinement step of the incremental clustering approach is through payload validation. Payloads sent by an attacker within a large session are ordered according to their arrival and concatenated. Then, the Levenshtein-based distance [29] is used to check the consistency of clusters for a possible further split, if attack payloads are available.
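As an illustration of this payload validation step, the sketch below concatenates each large session's payloads in arrival order and compares sessions with a plain dynamic-programming Levenshtein distance; the relative threshold is a hypothetical value for illustration, not the one used by the Leurré.com project.

def levenshtein(a: bytes, b: bytes) -> int:
    """Classic edit distance between two byte strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cluster_is_consistent(sessions, threshold=0.2):
    """sessions: list of lists of payloads (bytes), one list per large session,
    already ordered by arrival.  Flags the cluster for a possible split when a
    session's concatenated payload is too far from the first session's payload."""
    reference = b"".join(sessions[0])
    for payloads in sessions[1:]:
        candidate = b"".join(payloads)
        limit = threshold * max(len(reference), len(candidate), 1)
        if levenshtein(reference, candidate) > limit:
            return False
    return True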

3.2 Preliminary Investigation of Packet Inter-arrival Times

Packet inter-arrival times (IATs) are the time intervals between packets arriving from the same attacker's IP address at a single honeypot machine; see Figure 3.3. The IAT has been widely used in network traffic analysis to infer denial of service (DoS) attacks [72], in studying network congestion [136], and in studying unsolicited Internet traffic [151].

While certain traffic features excel in characterizing particular types of attack activity, this study investigates the application of the IAT as a meaningful and discriminatory feature for identifying traffic that shares similarities, i.e. traffic caused by the same attacking tools or originating from the same sources, but which has been put in different attack clusters. Other cluster features, such as the geographical location of the attacker or the location of the targeted platform, might be relevant for studying other attack phenomena, such as the popularity of certain tools with certain IPs or the observation of specific tools being used against particular environments. The main focus of this work is the identification of the repeated use of attack tools that exhibit similar packet inter-arrival time distributions; however, the methodology can be applied to other types of analysis.

[Figure: an attacker sends six packets to a honeypot host with arrival times P1: 7:30:00, P2: 7:30:02, P3: 7:45:02, P4: 8:30:20, P5: 8:30:50 and P6: 12:15:50, giving the IAT vector {2, 900, 2718, 30, 13450} seconds.]

Figure 3.3: Illustration of packet inter-arrival times.
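The computation behind Figure 3.3 amounts to differencing consecutive packet arrival times. The short sketch below shows this; the timestamp format and helper name are illustrative.

from datetime import datetime

def iat_vector(timestamps, fmt="%H:%M:%S"):
    """Return the inter-arrival times, in seconds, between consecutive packets."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Arrival times of packets P1-P6 from Figure 3.3:
arrivals = ["7:30:00", "7:30:02", "7:45:02", "8:30:20", "8:30:50", "12:15:50"]
print(iat_vector(arrivals))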

3.2.0.4 Prevalence of IATs in Honeypot Traffic

We have carried out a simple frequency analysis of the IATs of honeypot traffic observed by Leurré.com for the period from January 2003 until June 2005. Traffic data was collected without the notion of large and tiny sessions, which are used to classify sessions, and all packets from one source to one destination were arranged in one vector. Figure 3.4 shows the global distribution of IATs across all platforms for IATs that are less than 100000 seconds. In contrast, Figure 3.5 provides a zoom into the IAT values that range from 0 to 10000 seconds.

Figure 3.4: Global distribution of IATs less than 100000 seconds across all platforms.

Figure 3.5: Distribution of IAT values ranging from 0 to 10000 seconds.

The figures show the prevalence of IAT peaks as multiple spikes of various heights, locations and spacings. Table 3.1 lists the top ten IAT peaks, sorted by number of packets in descending order.

IAT Peak   Distinct IP Address   Distinct Source Id   Distinct Host Id   Number of Packets
300        1457                  2220                 82                 345120
900        781                   1555                 74                 139370
1800       462                   761                  60                 18810
3600       1308                  4586                 74                 18302
600        1037                  1459                 79                 12150
28801      186                   1079                 18                 7357
1200       675                   969                  70                 4280
9754       59                    184                  32                 2813
57601      63                    439                  11                 1681
2400       339                   508                  55                 1480

Table 3.1: Distinct sources and destinations of the top ten IATs.

As Table 3.1 shows, IAT peaks are caused by different IP addresses (distinct IP addresses) that have repeatedly attacked the honeypot platforms (distinct source Ids) and targeted different hosts (distinct host Ids). When the same attacking IP returns after 25 hours, it is assigned a new source Id. While these attackers used different IP addresses and targeted different honeypot platforms, they generated similar IAT fingerprints in terms of IAT peaks.

3.3 Cluster Correlation Using Packet Inter-arrival Times

Preliminary analysis of the Leurré.com clusters showed that the clustering algorithm results in a large number of clusters, some of which share common attack features. Thus, this study was carried out to group clusters that share similar types of activities. The grouping method is based on finding clusters that share similar packet inter-arrival time (IAT) distributions.

All traffic collected by the distributed platforms of the Leurré.com project was classified into clusters according to the clustering approach utilized by the project. The IAT distribution of each cluster was represented by a vector in which every element corresponded to the IAT frequency of a pre-defined bin (range of time values). The ranges were chosen to be more fine-grained for the shorter IATs and for certain peak values. The IAT ranges corresponding to bin values (per cluster) are listed in Table 3.2. The result was an IAT vector of 152 bins, with the first bin grouping IATs falling in the interval 0-3 seconds and the last bin corresponding to IATs of 25 hours or more.

Bin    Start Time   Stop Time    Comment
1      0:00:00      0:00:03
2      0:00:04      0:00:08
       [ 5 second increments ]
7      0:00:29      0:00:33
8      0:00:34      0:00:43
       [ 10 second increments ]
17     0:02:04      0:02:13
18     0:02:14      0:02:43
       [ 30 second increments ]
43     0:14:44      0:14:57
44     0:14:58      0:15:02     15 minute peak
45     0:15:03      0:15:32
46     0:15:33      0:29:57
47     0:29:58      0:30:02     30 minute peak
48     0:30:03      0:45:02
49     0:45:03      0:59:57
50     0:59:58      1:00:02     1 hour peak
       [ 15 minute increments ]
54     1:45:03      1:59:57
55     1:59:58      2:00:02     2 hour peak
       [ 15 minute increments ]
63     3:45:03      3:59:57
64     3:59:58      4:00:02     4 hour peak
       [ 15 minute increments ]
80     7:45:03      7:59:57
81     7:59:58      8:00:02     8 hour peak
       [ 15 minute increments ]
97     11:45:03     11:59:57
98     11:59:58     12:00:02    12 hour peak
       [ 15 minute increments ]
114    15:45:03     15:59:57
115    15:59:58     16:00:02    16 hour peak
       [ 15 minute increments ]
152    25:00:03     -           25 hours or more

Table 3.2: Bin values of IAT ranges.
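A simplified sketch of the binning step follows. Only the first few upper bin edges from Table 3.2 are listed; the full implementation would enumerate all 152 edges, including the peak bins. The helper names and the truncated edge list are illustrative only.

import bisect

# Upper edges (seconds, inclusive) of the first ten bins of Table 3.2:
# 0-3, 4-8, then 5-second increments up to bin 7, then 10-second increments.
UPPER_EDGES = [3, 8, 13, 18, 23, 28, 33, 43, 53, 63]

def iat_bin(iat_seconds, edges=UPPER_EDGES):
    """Return the 1-based bin index; IATs beyond the last listed edge fall into
    the final, open-ended bin."""
    return min(bisect.bisect_left(edges, iat_seconds) + 1, len(edges) + 1)

def iat_histogram(iats, n_bins=len(UPPER_EDGES) + 1):
    """Accumulate a per-cluster IAT frequency vector from a list of IATs (seconds)."""
    vector = [0] * n_bins
    for t in iats:
        vector[iat_bin(t) - 1] += 1
    return vector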

3.3.1 Data set

The data set used in this study covers three months of traffic (March to May 2007) collected from all environments of the Leurré.com project. This was the most recent data available at the time this research was conducted, and a three-month period was chosen as it provided enough data to demonstrate the effectiveness of the proposed technique while being manageable in size. For each cluster, the IAT frequencies (from the tiny sessions which took place within that three-month period) were extracted and the values in the corresponding vector bins (as described above) were incremented. Only those clusters which had at least one bin after the 21st bin (the 22nd bin corresponds to around five minutes) with a count of more than 10 were considered. In other words, clusters which did not have more than 10 occurrences of a particular IAT value greater than five minutes were ignored. Small IAT values are less meaningful for this analysis because of network artefacts such as congestion, packet loss, and transmission latency. As might be expected, the vast majority of tiny sessions (and clusters) contain packets with only relatively small inter-arrival times. As a result, the cliquing algorithm focuses on differentiating the behavior exhibited by clusters which contain large, regular IATs.

3.3.2 Measuring Similarities

Most pattern matching and data mining techniques rely heavily on the concept of similarity, or closeness, when grouping objects into clusters. One common way of measuring similarity is through the use of distance (or dissimilarity) measures: as the similarity between two objects increases, the distance between them decreases, and two objects are considered identical when the distance between them becomes zero.

There are many types of distances, such as the Canberra and Manhattan distances, but by far the most commonly used is the Euclidean distance. In its simplest form, in two-dimensional space, the Euclidean distance is a single positive number that measures the length of the straight line between two points P_1 and P_2. The Euclidean distance between the two points P_1 = (x_1, y_1) and P_2 = (x_2, y_2) is:

d(P_1, P_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}    (3.1)

As the previous equation shows, the Euclidean distance treats coordinates equally and does not account for differences in the variability contributed by each coordinate, which makes it very sensitive to the scale of the variables. This limitation of the Euclidean distance, together with the nature of the task of clustering time series vectors of traffic inter-arrival times (IATs), necessitates an alternative distance that is better suited to finding temporal similarities between IAT vectors.

The Symbolic Aggregate Approximation (SAX) distance, introduced by Lin et al. [96], is applied to the IAT vectors with the aim of finding temporal similarities. The SAX representation of a time series is obtained by first converting the time values to a Piecewise Aggregate Approximation (PAA), i.e. dividing the signal into equal segments and calculating the mean value of each segment, and then transforming the PAA into symbols. Figure 3.6 illustrates the SAX technique of converting a time series into a word of 16 symbols (DDDDDCBABCDEDDEF).

[Figure: a time series of about 150 points is segmented, averaged (PAA) and discretized against the six-letter alphabet A-F, producing the SAX word DDDDDCBABCDEDDEF.]

Figure 3.6: A time series conversion using SAX.

To find the minimum distance between two time series Q and C of the same length n, the time series are transformed into their symbolic representations Q̂ and Ĉ of w symbols each. Then, the MINDIST() function is given by:

MINDIST(\hat{Q}, \hat{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \big(dist(\hat{q}_i, \hat{c}_i)\big)^2}    (3.2)

The dist() function returns the distance between two symbols and can be implemented using a table look-up for better computational efficiency. Details of the function implementation and source code can be found on the SAX home page [18]. The similarity matrix M of size m × m is constructed using the SAX distance described above, where m is the number of time series vectors. Given two clusters i and j, M(i, j) represents the similarity between these two clusters in a symmetrical way, where M(i, j) = M(j, i) and the diagonal elements are equal to zero.
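A minimal sketch of the SAX conversion and the MINDIST lower-bounding distance of Lin et al. is given below, assuming the input vectors are z-normalised before discretization. The NumPy/SciPy usage and the choice of segment count and alphabet size are illustrative and do not reproduce the code referenced on the SAX home page.

import numpy as np
from scipy.stats import norm

def znorm(x):
    """Z-normalise a series (zero mean, unit variance)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() or 1.0)

def paa(x, w):
    """Piecewise Aggregate Approximation: means of w roughly equal-length segments."""
    return np.array([seg.mean() for seg in np.array_split(x, w)])

def sax(x, w, a):
    """Convert a series into w integer symbols (0..a-1) using Gaussian breakpoints."""
    breakpoints = norm.ppf(np.arange(1, a) / a)
    return np.searchsorted(breakpoints, paa(znorm(x), w))

def mindist(q_sym, c_sym, n, a):
    """Lower bound on the Euclidean distance between the original series (Eq. 3.2)."""
    beta = np.concatenate(([-np.inf], norm.ppf(np.arange(1, a) / a), [np.inf]))
    def cell(r, c):
        # adjacent or identical symbols are at distance zero (a table look-up in practice)
        return 0.0 if abs(r - c) <= 1 else beta[max(r, c)] - beta[min(r, c) + 1]
    w = len(q_sym)
    return np.sqrt(n / w) * np.sqrt(sum(cell(r, c) ** 2 for r, c in zip(q_sym, c_sym)))

Applying mindist pairwise over the per-cluster IAT vectors would populate the m × m matrix M described above.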

3.3.3 Cliquing Algorithm

Due to the large quantity of data collected, it was necessary to rely on an automated methodology that was able to extract relevant information about the attack processes. The correlative analysis relies on concepts from graph and matrix theory. In this context, a clique is an induced subgraph of a directed or undirected graph in which the vertices are fully connected (i.e. a complete subgraph). In this case, each node represents a cluster, while an edge between a pair of nodes represents a similarity measure between two clusters. Figure 3.7 provides an illustrative example of finding cliques of a graph.

[Figure: a weighted similarity graph over six nodes; thresholding the edge weights yields Clique 1 = {1, 2, 4} (weights 50, 60, 70) and Clique 2 = {3, 5, 6} (weights 25, 30, 40), while the remaining weak edges (weights 3-15) are discarded.]

Figure 3.7: An example of finding cliques.

Determining the largest clique in a graph is often called the maximum clique problem, a classical NP-complete problem in graph theory [43]. Although numerous exact algorithms [84, 85, 40] and approximate methods [41, 104] have been proposed to solve this problem, this study addressed the computational complexity of the clique problem by applying our own heuristics to generate sets of cliques very efficiently. While this technique is relatively straightforward, it possesses two significant features. Firstly, it is able to deliver very coherent results with respect to the analyzed similarities. Secondly, regarding computational speed, this technique outperforms other algorithms by several orders of magnitude.

For example, we applied the approximate method proposed by [104], which consists of iteratively extracting dominant sets of maximally similar nodes from a similarity matrix. On our data set, the total computation was very expensive (several hours), whereas the custom cliquing algorithm only took a few minutes to generate the same cliques of clusters from the same data. On the other hand, our heuristic imposed a constraint on the similarity measure, namely that it has to be transitive. With this restriction, it was sufficient to compute the correlation between one specific node and all other nodes in order to find a maximal clique of similar nodes. This transitive property was achieved by carefully setting a global threshold on the measurement of similarities between clusters. The algorithm also takes advantage of the already created cliques to progressively decrease the search space, so in the average case the algorithmic complexity will be less than O(n^2), and a complexity of order O(n log n) would typically be expected. The clique algorithm is detailed in Figure 3.8.

3.4 Experimental Results

In this section, our analysis of the IAT-based cliques obtained by applying the above approach to the Leurré.com data set is described. A data set covering three months of traffic (from March to May 2007) collected from the Leurré.com environment was considered. As described in Section 3.3.1, only those clusters which had at least one bin after the 21st bin (the 22nd bin corresponds to around five minutes) with a count of more than 10 were considered; that is, clusters which did not have more than 10 occurrences of at least one IAT value greater than five minutes were ignored.

Input: C, a list of n IAT vectors of clusters; Q, a threshold value
Output: Cliques
1  i = 0                                % clique index
2  Cliques = {}                         % start with an empty list of cliques
3  while (C is not empty) do {
4      Move the first vector in the list C to V, V = C(0)
5      Remove V from the vector list C, C = C - V
6      Compute the similarities S between V and all vectors in C, S = similarity(V, C)
7      Find the vectors whose similarity exceeds the threshold and add them, together with V, to the clique: Clique(i) = {V} ∪ {c ∈ C : S(c) > Q}
8      Update the cluster list C by removing the clusters in Clique(i), C = C - Clique(i)
9      i = i + 1 }

Figure 3.8: The different steps of the cliquing algorithm.
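The steps of Figure 3.8 translate directly into the following Python sketch; the similarity function is a stand-in for whichever measure is used (for example, a score derived from the SAX distance of Section 3.3.2), and the data structures are illustrative.

def find_cliques(vectors, similarity, threshold):
    """Greedily group IAT vectors into cliques of mutually similar clusters.

    vectors:    dict mapping cluster id -> IAT frequency vector
    similarity: function (vector, vector) -> similarity score
    threshold:  Q, the global similarity threshold
    """
    remaining = dict(vectors)                              # working copy of the list C
    cliques = []
    while remaining:
        seed_id, seed_vec = next(iter(remaining.items()))  # V = C(0)
        del remaining[seed_id]                             # C = C - V
        clique = [seed_id]
        for cid, vec in list(remaining.items()):
            if similarity(seed_vec, vec) > threshold:      # S > Q
                clique.append(cid)
                del remaining[cid]                         # C = C - Clique(i)
        cliques.append(clique)
    return cliques

Cliques of size one returned by this sketch correspond to clusters that do not fall into any clique.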

After this filtering, 1475 vectors were obtained, representing the IAT frequency distributions of the corresponding clusters. The clique algorithm described above was then applied to these vectors, yielding 111 IAT-based cliques comprising 875 clusters. The remaining 600 clusters did not fall into any clique. Each clique contained a group of clusters which, based upon their IAT distributions (and the parameters of the cliquing algorithm), were similar.

Prior to describing the detailed analysis of the cliques obtained, three types of cliques that were expected to be represented in the results are presented:

Type I: Cliques which contain clusters of large sessions targeting the same port sequences. The difference between the various clusters contained within such a clique lies in the number of packets sent to the targeted ports. These cliques are mostly symptomatic of classes of attacks where the attacker repeatedly tries a given attack a varying number of times.

Type II: Cliques composed of clusters of large sessions targeting different port sequences but exhibiting the same IAT profile. These cliques are symptomatic of tools that send packets to their target according to a very specific timing and which have been used in several distinct campaigns targeting different ports.

Type III: Cliques which contain clusters grouped together based upon the presence of long IATs (longer than 25 hours), representing sources which are observed on one platform, then, within 25 hours, are detected on another platform, before again returning to the original platform. Such behavior would be indicative of a source which is scanning large numbers of devices across the Internet in a predictable manner, resulting in it repeatedly returning to the same platform.

We also found many similarities across the different cliques that were generated. A number of so-called 'supercliques' were identified as a result, which suggests that the IAT-based analysis focused on in this study is good at automatically identifying very specific types of activity within a very large data set. Analysis of these supercliques is presented below.

3.4.1 Type I Cliques

Type I cliques are expected to contain clusters which are very similar with respect to most traffic features, including port sequence, with the exception that the large and tiny sessions within the clusters have varying durations (both in terms of time and the number of packets sent by the source). The variation in the duration of the sessions accounts for such traffic being arranged in different clusters. Two particular cliques that are seen to fall clearly into the Type I category are Clique 7 and Clique 49, summarized in Table 3.3.

                                          Clique 7               Clique 49
Number of Clusters                        8                      11
Number of Large Sessions                  9                      285
Number of Packets                         821                    3274
Number of Platforms Targeted              5                      37
Number of Source IPs                      6                      248
Number of Countries                       4                      46
Targeted Port Sequence                    TCP/135                TCP/22
Peak IATs (bin)                           554-583 (32)           2703-3597 (49)
Min, Average, Max Durations (Seconds)     4657, 70491, 236350    1035, 9922, 137528
No. of Targeted Virtual Hosts             3                      3

Table 3.3: A summary of Type I Cliques.

Clique 7 is composed of 8 clusters, 9 large sessions and a total of 821 packets. In this clique, 5 platforms were targeted by 6 distinct IP addresses originating from 4 different countries (China, Germany, Japan, and France). The peak IAT bin was bin 32, with IAT values in the range 554-583 seconds, and the average duration was 70491 seconds, with a minimum duration of 4657 seconds and a maximum of 236350 seconds.

All three virtual hosts on each of the targeted platforms were hit with the same number of packets, with the average number of packets per session equal to 35. Also, several IP addresses were found to occur in multiple clusters within the clique. While these sources were grouped in different clusters due to their varying durations, there were strong similarities in terms of the IAT characteristics of the sessions, resulting in these clusters being grouped in the same clique.

Clique 49 contains 11 clusters, 285 large sessions, and 3274 packets, and the targeted port sequence is TCP/22. There were 248 distinct IP addresses which attacked 37 different platforms. The sources of the IPs were widely spread among 46 different countries. Despite the widespread location of the sources of the traffic in this clique, there were a number of similarities in the behavior observed. Firstly, large sessions in this clique always targeted all three virtual hosts on each platform, and the number of packets sent to each virtual host was similar in each case (one packet for the Windows hosts and an average of 10 packets for the UNIX host). The average duration of attacks was 9922 seconds, with minimum and maximum durations of 1035 and 137528 seconds. The IAT sequences of these clusters were similar, with all IATs in the session being short except one belonging to bin 49 (2703-3597 seconds).

Cliques 7 and 49 are typical examples of Type I cliques, where attack traffic ends up in different clusters due to variations in either the duration of the attack or the number of packets sent. In each case, the duration and number of packets varied significantly between the sessions, while the IAT behavior remained consistent. Also, a number of IP addresses were shared between clusters within each clique, with over 50% of the clusters sharing IP addresses or class C networks.

The identification of cliques of Type I addresses a weakness of the original clustering algorithm, which was, by design, unable to group together activities that clearly were related to each other and should, therefore, have been analyzed together.

3.4.2 Type II Cliques

Type II cliques are those which contain a large variety of targeted port sequences, yet each cluster exhibits similar IAT characteristics. It was hypothesized that clusters belonging to this type of clique correspond to the same attack tool using the same strategy to probe a variety of ports (such as a worm which targets multiple vulnerable services, or some other type of systematic scanner targeting a number of different ports). Two cliques which exhibit this type of behavior are Cliques 92 and 69 (see Table 3.4).

                                          Clique 92                      Clique 69
Number of Clusters                        40                             64
Number of Large Sessions                  502                            1336
Number of Packets                         4234                           17097
Number of Platforms Targeted              1                              2
Number of Source IPs                      502                            1300
Number of Countries                       25                             37
Targeted Port Sequence                    TCP {6769, 7690, 12293,        TCP {4662, 6769, 7690,
                                          18462, 29188, 64697, 64783}    12293, 29188, 38009,
                                                                         64697, 64783}
Peak IATs (bin)                           933-1797 (46),                 933-1797 (46)
                                          1803-2702 (48)
Min, Average, Max Durations (Seconds)     953, 9278, 53941               133, 44163, 225224
No. of Targeted Virtual Hosts             1                              1

Table 3.4: A summary of Type II Cliques.


Clique 92 consists of 40 clusters, 502 large sessions and 4234 packets in total. While a variety of ports were targeted by these clusters, traffic within each cluster only targeted a single port. The TCP ports targeted within this clique were: 6769, 7690, 12293, 18462, 29188, 64697, and 64783. This clique is the result of 502 distinct source IP addresses originating from 25 different countries and targeting only a single platform. Additionally, only one virtual host was targeted on this platform. The average number of packets per large session was 16 (minimum 3 and maximum 103), and the average duration was 9278 seconds. Clique 92 contains peak IAT bins of 46 (933-1797 seconds) and 48 (1803-2702 seconds), where the IAT sequences were repeated patterns of short and long IATs. A possible explanation for the traffic which constitutes this clique is that it corresponded to the same tool being used to scan for the existence of services on unusual ports (such as peer-to-peer related services), where the scan used a regular (long) delay between retransmissions.

Clique 69 is similar to Clique 92 in that it also contains a variety of clusters, where each cluster contains traffic targeting a single, unusual port. This clique contains 64 clusters, 1336 large sessions and 17097 packets. It was the result of 1300 distinct attacking IP addresses that originated from 37 different countries and targeted 2 platforms (all but one cluster targeted the same platform as the traffic in Clique 92). The targeted TCP ports were: 4662, 6769, 7690, 12293, 29188, 38009, 64697, and 64783. The durations of attacks ranged from 133 to 225224 seconds with an average of 44163 seconds. The number of packets sent in each large session was in the range 2 to 135 with an average of 25 packets. The IAT sequences were repeated patterns of short, short, and long IATs with a peak IAT bin of 46 (933-1797 seconds).

The traffic in Cliques 92 and 69 represents a large number of distinct sources from a variety of countries targeting a variety of ports, predominantly (with one cluster being the exception) targeting the same platform in China. These cliques represent very interesting activity which is difficult to characterize in further detail due to the lack of interactivity of the honeypots on these ports. The significance of the ports being targeted was unclear, but might be easier to determine if packet payloads were available. The fact that all of these sources exhibited a very distinct fingerprint in terms of their IAT characteristics made the activity all the more unusual. The identification of cliques of Type II enabled the highlighting, in a systematic way, of the existence of tools with a specific IAT profile that were reused to launch different attack campaigns against various targets. Without such analysis, the link that existed between the IPs belonging to different clusters in a given clique would have remained hidden.

3.4.3 Type III Cliques

Based upon observation of the Leurré.com data over a long period of time, it was found that there were a number of large sessions which continued for an extended duration (sometimes many weeks). Of these, there were a number which targeted multiple platforms within a 25 hour period, where the intervening time before returning to the same platform was more than 25 hours. These very long IATs were placed into bin 152 during the cliquing process. A number of cliques that resulted from the cliquing algorithm were characterized by these long IATs, and here two of them, Cliques 31 and 66, are investigated in detail (see Table 3.5).

Clique 31 is a large clique of 150 clusters, 3456 large sessions, and a total of 21422 packets. The port sequence for Clique 31 is the single port UDP/1434 (MS SQL).

                                          Clique 31                 Clique 66
Number of Clusters                        150                       3
Number of Large Sessions                  3456                      13
Number of Packets                         21422                     171
Number of Platforms Targeted              39                        12
Number of Source IPs                      277                       9
Number of Countries                       22                        2
Targeted Port Sequence                    UDP 1434                  UDP 1026, UDP 1027
Peak IATs (bin)                           >25 hours (152)           Very large (152)
Min, Average, Max Durations (Seconds)     132, 1142131, 7509849     1, 381408, 915002
No. of Targeted Virtual Hosts             varies                    3

Table 3.5: A summary of Type III Cliques.

In Clique 31, there were 277 distinct IP addresses originating from 22 different countries which targeted 39 different platforms. Characteristics of clusters in this clique include a varying number of hosts targeted, with the average number of packets sent per host equal to 12 (minimum 2 and maximum 85) and an average duration equal to 1142131 seconds. These sessions are indicative of a very slow scanner, seen on multiple platforms, returning to the same platform only after an extended delay of more than 25 hours.

Clique 66 contains 3 clusters, 13 large sessions and 171 packets. These sessions were characterized by sending multiple packets, alternating between UDP ports 1026 and 1027 repeatedly. In Clique 66, 12 platforms were targeted by 9 distinct IP addresses originating from 2 different countries. All clusters within this clique contained sessions which targeted all three virtual hosts on the target platforms, with only a small number of packets sent per session (on average 4, with a minimum of 3 and a maximum of 6). The average session duration was 381408 seconds.

Cliques 31 and 66 represent examples of activities where a source IP was scanning the globe, targeting different honeypot platforms in less than 25 hours. UDP port 1434 is used by the MS SQL Monitor service and is the target of several worms, such as W32.SQLExpWorm and Slammer. It is likely that traffic targeting this port is the result of worms that scan for vulnerable servers. UDP ports 1026 and 1027 are common targets for Windows Messenger spammers, who have been repeatedly targeting these ports since June 2003.

Superclique   Cliques   Clusters   Large Sessions   Distinct IPs   Peak Bins    Port Sequence
1             7         166        3505             277            152          1434U
2             5         12         22               12             152          1026U, 1027U
3             6         29         288              247            46, 48, 49   135T
4             4         21         541              429            46, 48, 49   22T
5             23        183        6313             6188           46, 48, 49   Unusual TCP ports
6             23        74         164              152            31, 32       135T

Table 3.6: Representative properties of Supercliques.

3.4.4 Supercliques

It was observed that across all of the obtained cliques, only a relatively small number of peak IAT bin values were represented. Indeed, from the point of view of the peak bin values, it was found that a limited number of combinations existed. This suggests that the cliques we obtained possessed a high level of uniformity in terms of the activities that they represent. Based upon the small set of common peak bins, and the dominant port sequences targeted within those cliques, the cliques were manually grouped together into 6 supercliques, which are summarized in Table 3.6.

As can be seen from the table, the supercliques accounted for just over half of the cliques generated. The cliques not represented within the supercliques were not considered in the remaining analysis. Representative examples of each of the first five supercliques have been presented in the previous three sections. The Type I Cliques 7 and 49 are examples of Supercliques 3 and 4, respectively. Superclique 6 contains Type I cliques which target port TCP/135, similar to Superclique 3, with the difference being that the dominant IATs for cliques from Superclique 6 are in bins 31 and 32, rather than 46, 48, and 49 (for Superclique 3). Cliques 92 and 69 (Type II) are examples of cliques from Superclique 5. The Type III Clique 31 is an example of a clique that belongs to Superclique 1, while Type III Clique 66 is an example of a clique from Superclique 2.

3.5 Summary

Due to the low-interaction nature of the honeypots used by the Leurré.com project, attempts to cluster the low-interaction honeypot traffic at the packet level result in a huge number of clusters. Consequently, it becomes very difficult to interpret these clusters or to reach accurate conclusions about the exact nature of the tools that generate them.

In this chapter, we have presented a methodology that overcomes the weaknesses in Leurré.com's clustering algorithm. The use of packet inter-arrival time distributions has generated a number of cliques that represent sets of clusters and a variety of interesting activities which target the Leurré.com environments. It was shown that more than half of the cliques can be easily characterized as one of the three major types. In accordance with the supercliques that were manually identified, there are six major classes of activity that the cliquing algorithm extracted for the time period that was examined. The strong similarities within the supercliques highlight the usefulness of the cliquing algorithm for identifying very particular kinds of traffic observed by the honeypots.

While the proposed method was effective in improving the existing clustering approach, there are several limitations to the work described in this chapter. The first limitation concerns the manual extraction of the IAT distributions of clusters and the manual grouping and interpretation of the results. Secondly, the methodology requires moderate to intensive computational power. Consequently, there is a need for a better traffic analysis technique that suits the nature of low-interaction honeypot traffic, which is sparse; that is capable of extracting useful attack patterns automatically; and that is suitable for real-time application.

In the next chapter, a new methodology will be presented which bypasses the traffic clustering imposed by the Leurré.com project. The new methodology overcomes the previous method's limitations and is capable of working on a massive data set. The method is based on a well-established data reduction technique and works on aggregated traffic flows rather than at the packet level.

Chapter 4

Honeypot Traffic Structure

Running a honeypot that is connected to the Internet results in the collection of a massive amount of malicious Internet traffic. The collected traffic is very representative of global Internet artefacts, such as different types of attacks, traces of misconfigured traffic, and backscatter. The first step towards better understanding the nature of this traffic is to detect the different classes of activity and measure their prevalence.

However, the analysis of honeypot traffic comes with several challenges, which include: the high dimensionality of the data, resulting in a large number of features; the large amounts of collected traffic, resulting in high storage and computational requirements; and Internet noise, such as scans, which obfuscates useful attack patterns.

In the previous chapter, the Leurré.com methodology of manipulating honeypot traffic was explored, and it was demonstrated that its clustering algorithm has resulted in a large number of clusters, over 27,000. Furthermore, a new approach was proposed for improving cluster interpretation by grouping clusters that share similar IAT distributions.

In this chapter, the study seeks to explore questions related to deploying a honeypot from a strategic viewpoint, namely: What types of activities can be detected with a low-interaction honeypot? What are the relative frequencies of the detected activities? What are the interrelationships between these activities? To answer these questions, we embarked upon an analysis of the use of a multivariate statistical technique, principal component analysis (PCA), for characterizing attackers' activities present in honeypot traffic data in terms of structure and size. The use of PCA in this study is motivated by:

• the popularity of PCA as one of the best exploratory and data reduction techniques [54];

• the fact that the extracted principal components are uncorrelated and that the first few principal components retain most of the variation in the original data;

• the ease of implementation and the low computational requirement; and

• the lack of any distributional assumptions, which makes PCA suitable for many types of data.

The main contribution of this chapter is the application of principal component analysis (PCA) in three areas: in detecting the structure of attackers' activities in low-interaction honeypot traffic; in visualizing these activities; and in identifying different types of outliers. The following chapters, Chapters 5 and 6, build on this work for further applications of PCA to analyze honeypot traffic. The research findings were presented in:

• S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of Attackers' Activities in Honeypot Traffic Using Principal Component Analysis", in Proceedings of the International Conference on Network and Parallel Computing (IFIP), Shanghai, China, 2008.

This chapter is organized as follows. Section 4.1 provides an introduction and discusses the motivation behind the work. Section 4.2 introduces the concept of principal component analysis. Section 4.3 describes the data set used and the pre-processing that has been applied to the traffic data. Principal component analysis of the honeypot data set is described in Section 4.4. Interpretations of the major principal components are presented in Section 4.5. The interrelationships between components are presented in Section 4.6. The identification of extreme activities is discussed in Section 4.7. Finally, Section 4.8 summarizes the chapter.

4.1 Motivation

Monitoring and characterizing Internet threats is critical in order to better protect production systems by gaining an understanding of how attacks work and, consequently, protecting systems from them. Honeypots are a valuable tool for collecting different types of attack traffic. However, characterizing attackers' activities present in honeypot traffic data can be challenging due to the high dimensionality of the data (i.e. the large number of variables) and the large volumes of traffic data collected. The large amount of background noise, such as scans and backscatter, adds to the challenge by hiding interesting abnormal activities that require immediate attention from security personnel. Detecting these activities can potentially be of high value and give early signs of new vulnerabilities or breakouts of new automated malicious codes, such as worms, but only if the honeypot data is handled in time.

Principal component analysis (PCA) is a widely used multivariate statistical technique for reducing the dimensionality of variables, unveiling latent structures and detecting outliers in data sets [74, 75]. In this research, PCA is used to detect the structure of attackers' activities in honeypot traffic, to visualize these activities, and to identify different types of outliers.

4.2 Principal Component Analysis

Principal component analysis (PCA) is a multivariate statistical technique that has been widely used in multi-disciplinary research areas such as Internet traffic analysis, economics, image processing, and genetics, to name only a few. PCA is mainly used to reduce the dimensionality of a data set into a few uncorrelated variables, the principal components (PCs), which retain most of the variation in the original data. The resulting principal components are linear combinations of the original variables, are orthogonal, and are ordered with the first principal component having the largest variance. Although the number of resulting principal components is equal to the number of original variables, much of the variance in the original set of p variables can be retained by the first k PCs, where k < p. Thus, the original p variables can be replaced by the new k principal components.

Let X = (X_1, \ldots, X_p)^T be a matrix of p-dimensional data variables, and let C be the covariance or correlation matrix of X. We seek the matrix A = (A_1, \ldots, A_p) that solves the following equation:

A^{-1} C A = L    (4.1)

where A is the matrix of eigenvectors of C and is orthogonal (A^{-1}A = I), and L is the diagonal matrix of eigenvalues of C, whose entries are greater than or equal to zero. Then, the principal component transformation becomes

Z = A^T X    (4.2)

Then, the first linear combination Z_1 of X, having maximum variance, becomes:

Z_1 = A_1^T X = a_{11}X_1 + a_{12}X_2 + a_{13}X_3 + \cdots + a_{1p}X_p    (4.3)

The second linear combination of X, Z_2, is uncorrelated with Z_1 and has the second largest variance, and so on until the kth combination, Z_k, is found, which is uncorrelated with Z_1, \ldots, Z_{k-1}:

Z_2 = A_2^T X = a_{21}X_1 + a_{22}X_2 + a_{23}X_3 + \cdots + a_{2p}X_p
\vdots
Z_k = A_k^T X = a_{k1}X_1 + a_{k2}X_2 + a_{k3}X_3 + \cdots + a_{kp}X_p    (4.4)

Geometrically, principal component analysis represents a shift of the origin of the original coordinates (X_1, \ldots, X_p) to the variable means (\bar{X}_1, \ldots, \bar{X}_p), followed by a rotation of the original coordinate axes into new axes (Z_1, \ldots, Z_p) that are orthogonal and which represent the directions of maximum variability. Figure 4.1 illustrates a linear transformation of two random vectors X_1 and X_2 into the directions of maximum variance, Z_1 and Z_2.

[Figure: a scatter of points in the (X_1, X_2) plane with the new orthogonal axes Z_1 and Z_2 drawn through the mean along the directions of maximum variance.]

Figure 4.1: Directions of maximal variance of principal components (Z1, Z2).

In addition, the principal components define the axes of a p-dimensional ellipsoid that is centered at the mean and has semi-major and semi-minor axes whose lengths are proportional to the square roots of the eigenvalues (l_1^{1/2}, l_2^{1/2}, \ldots, l_p^{1/2}). The equation of the ellipsoid becomes:

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X})    (4.5)

where Z_{ik} is the score of the kth PC of the ith observation and l_k is the kth eigenvalue. It will be seen later in this chapter how the previous equations can be utilized in setting up the threshold for detecting outliers.

In the following sections, PCA will be utilized to detect the different groups of activities found in the honeypot traffic, without any assumptions about these groups or the interrelationships between them. In addition, some of our objectives in this study are to explore the usefulness of PCA in visualizing honeypot traffic and in detecting outliers. While the application of PCA to network traffic analysis is not new, to the best of our knowledge this is the first time that PCA has been used to analyze low-interaction honeypot traffic. As will be demonstrated, the technique shows much promise in extracting value from this sparse data and paves the way for further applications of the technique in the following chapters.

4.3 Data set and Pre-Processing

In this section, the data set used in this study and the pre-processing that has been applied to the data are described. The traffic features used in the analysis and the steps for applying PCA to these traffic features are also discussed in the following subsections.

4.3.1 Data set

The honeypot traffic data used in this analysis came from the Leurré.com project [9]. For the purpose of this study, only one low-interaction honeypot sensor's data was used, due to the availability of log files. Traffic data for the period of September 15 until November 30, 2007, for two of the honeypot environments was included, namely Windows 2000 Professional and Windows 2000 Server. Both environments are identical in terms of open TCP and UDP ports. The traffic traces consisted of 839663 packets in total, which were the result of attacks from over 5400 different IP addresses (see Table 4.1). This data set was the most recent honeypot traffic data set available from the Leurré.com project when this study was conducted, and it represented the current patterns of attackers' behaviors.

Start Date    End Date      Packets   Standard Flows   Activity Flows
15/09/2007    30/11/2007    839663    562470           5401

Table 4.1: Summary of the data set used in this study.

4.3.2 Pre-processing

Before applying PCA to the traffic data, the following steps were performed to process the raw traffic data. First, raw TCPDump [24] files of daily honeypot data were collected and merged into a single traffic file. Packets were then grouped together into basic flows (according to the notion of a flow [46]). The basic flow conformed to the standard definition of an IP flow as a set of packets that share the five keys: source IP address, destination IP address, source port, destination port, and protocol type. If a packet differed from another packet in any key field, it was considered to belong to another flow [2, 132].

Other features associated with flows were also extracted to enrich the analysis. These features include the number of packets, the number of bytes, total activities, and durations. For the purpose of this study, the timeout of basic flows was set to a maximum of five minutes. The five-minute timeout parameter was selected based on our experiments and on the nature of low-interaction honeypots, where the majority of flows last less than 300 seconds. A higher timeout value has little influence on the final results.

The second step was to group the basic flows into activity flows, where the newly generated flows were combined based upon the source IP address of the attacker, with a maximum of sixty minutes inter-arrival time between basic flows. The aggregation of basic flows into activity flows was necessary to overcome the low level of detail in the traffic collected by low-interaction honeypots, to be representative of the behavior of the three monitored protocols (TCP, UDP and ICMP), and to account for a variety of network anomalies.
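The two-level aggregation can be sketched as follows, treating the five-minute value as an idle timeout between packets of the same five-tuple and the sixty-minute value as the maximum gap between basic flows of the same source; the record layout is illustrative and does not reflect the Leurré.com log format.

BASIC_TIMEOUT = 300      # seconds between packets of the same 5-tuple
ACTIVITY_GAP = 3600      # seconds between basic flows of the same source IP

def basic_flows(packets):
    """packets: iterable of dicts with 'time', 'src', 'dst', 'sport', 'dport',
    'proto' and 'length', sorted by 'time'.  Returns a list of basic flows."""
    flows, open_flows = [], {}
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flow = open_flows.get(key)
        if flow is None or p["time"] - flow["end"] > BASIC_TIMEOUT:
            flow = {"key": key, "start": p["time"], "end": p["time"],
                    "packets": 0, "bytes": 0}
            flows.append(flow)
            open_flows[key] = flow
        flow["end"] = p["time"]
        flow["packets"] += 1
        flow["bytes"] += p.get("length", 0)
    return flows

def activity_flows(flows):
    """Merge basic flows of the same source IP separated by at most sixty minutes."""
    activities, open_acts = [], {}
    for f in sorted(flows, key=lambda f: f["start"]):
        src = f["key"][0]
        act = open_acts.get(src)
        if act is None or f["start"] - act["end"] > ACTIVITY_GAP:
            act = {"src": src, "start": f["start"], "end": f["end"], "flows": []}
            activities.append(act)
            open_acts[src] = act
        act["flows"].append(f)
        act["end"] = max(act["end"], f["end"])
    return activities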

Finally, the data was filtered to remove Internet backscatter by examining each flow individually against common backscatter flags, such as TCP RST and TCP SYN/ACK [100]. The Leurré.com platform is less effective in collecting backscatter because of its limited view of the Internet: the number of monitored IP addresses is three in total.

4.3.3 Candidate Feature Selection

Eighteen features were extracted from the activity flows. These traffic features were selected as being representative of the behavior of the three protocols that are monitored by the honeypot, namely TCP, UDP and ICMP. Table 4.2 lists the selected variables and their descriptions.

The traffic features were selected to account for a variety of network anomalies [33]. Since these features were extracted from aggregated traffic flows, they are effective in detecting different types of attacks that use single or multiple connections, as well as attacks spanning different protocols. For example, the total number of flows, source packets, and source bytes allows the detection of anomalies in traffic volume; total activities and IAT help in detecting denial of service attacks [72] and misconfigurations [151]; and TCP and UDP ports allow the characterization of a wide range of network anomalies such as scans and worms. Finally, since no work is performed directly on these traffic features, principal component analysis is relied upon to remove redundancies and correlations among these variables.

4.4 PCA on the Honeypot Data set

Principal component analysis can be calculated using either the covariance matrix or the correlation matrix. However, PCs defined using the covariance matrix are very sensitive to the units of measurement of the variables. In addition, when the variances of the variables differ widely, which is the case for the honeypot data, the first few PCs will be dominated by the variables with the highest variances, even though these may contribute little information about the structure of the data set. Moreover, one drawback of PCs derived from covariance matrices with different units of measurement is the difficulty in interpreting the PC scores. Thus, the use of the correlation rather than the covariance matrix for deriving the PCs was preferred in this analysis.

Calculating the PCA based on the correlation matrix involves the following steps (a code sketch of these steps is given after the list):

1. Arrange the n extracted traffic vectors of p features into a data matrix X, where each observation is represented by a single column of p rows.

No.   Variable        Description
1     TF              Total number of basic flows generated by individual IPs and aggregated based on sixty minutes
2     TCP_O           Total number of open TCP ports targeted
3     D_TCP_O         Total number of distinct open TCP ports targeted
4     TCP_C           Total number of closed TCP ports targeted
5     D_TCP_C         Total number of distinct closed TCP ports targeted
6     UDP_O           Total number of open UDP ports targeted
7     D_UDP_O         Total number of distinct open UDP ports targeted
8     UDP_C           Total number of closed UDP ports targeted
9     D_UDP_C         Total number of distinct closed UDP ports targeted
10    ICMP            Total number of ICMP flows
11    TM              Total number of machines targeted
12    DUR             Total duration of basic flows
13    SPKTS           Total number of source packets
14    SBYTES          Total number of source bytes
15    SRATE           Total of the source rates of basic flows, where a source rate is the number of source packets in a basic flow divided by the duration of that flow
16    AVG_PK_SIZE     Sum of average packet sizes
17    T_ACT           Total activities, the summation of source and destination rates
18    IAT             Total inter-arrival times between basic flows

Table 4.2: Variables used in the analysis.

Then the traffic matrix X becomes:

X_{(p \times n)} = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,n} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,n} \\ \vdots & & & \vdots \\ X_{p,1} & X_{p,2} & \cdots & X_{p,n} \end{pmatrix}    (4.6)

2. Standardize the p-dimensional matrix X = (X_1, \ldots, X_p)^T to have zero mean and unit variance:

Y_{ij} = \frac{X_{ij} - \bar{X}_i}{\sqrt{\sigma_i}}    (4.7)

where \bar{X}_i is the sample mean of X_i, for i = 1, \ldots, p, and \sigma_i is the sample variance of X_i.

3. Compute the sample correlation matrix R_{p \times p} of Y:

R_{(p \times p)} = \begin{pmatrix} 1 & r_{1,2} & \cdots & r_{1,p} \\ r_{2,1} & 1 & \cdots & r_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p,1} & r_{p,2} & \cdots & 1 \end{pmatrix}    (4.8)

4. Find the matrix of eigenvectors A = (A_1, \ldots, A_p) and the eigenvalue vector L = (l_1, \ldots, l_p) of R:

A = \begin{pmatrix} A_1 & A_2 & \cdots & A_p \end{pmatrix} = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\ \vdots & \vdots & & \vdots \\ a_{p,1} & a_{p,2} & \cdots & a_{p,p} \end{pmatrix}    (4.9)

L = \begin{pmatrix} l_1 & l_2 & \cdots & l_p \end{pmatrix}    (4.10)

The eigenvectors are arranged in a matrix where each column represents an eigenvector, and they are ordered to match their corresponding eigenvalues, which are in decreasing order, l_1 > l_2 > \cdots > l_p > 0.

5. Select a subset E = (A_1, \ldots, A_k) of A, where k < p, according to criteria that will be discussed later in this section:

E = \begin{pmatrix} E_1 & E_2 & \cdots & E_k \end{pmatrix} = \begin{pmatrix} e_{1,1} & e_{1,2} & \cdots & e_{1,k} \\ e_{2,1} & e_{2,2} & \cdots & e_{2,k} \\ \vdots & \vdots & & \vdots \\ e_{p,1} & e_{p,2} & \cdots & e_{p,k} \end{pmatrix}    (4.11)

6. Calculate the principal component scores by projecting the standardized data Y onto the reduced matrix E:

Z = E^T Y    (4.12)

The first principal component then equals:

Z_1 = E_1^T Y = e_{11}Y_1 + e_{12}Y_2 + \cdots + e_{1p}Y_p    (4.13)

and the kth principal component becomes:

Z_k = E_k^T Y = e_{k1}Y_1 + e_{k2}Y_2 + \cdots + e_{kp}Y_p    (4.14)
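A compact NumPy sketch of steps 1 to 6 is given below; the component retention rule shown is Kaiser's rule (discussed in the next subsection), and the function name and layout are illustrative rather than the exact code used in this study.

import numpy as np

def pca_correlation(X, kaiser_threshold=1.0):
    """X: p x n matrix of traffic features, one observation per column.
    Returns (scores Z of shape k x n, retained eigenvectors E of shape p x k,
    all eigenvalues in decreasing order)."""
    # Step 2: standardise each feature (row) to zero mean and unit variance.
    Y = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    # Step 3: sample correlation matrix of the standardised data.
    R = np.corrcoef(Y)
    # Step 4: eigenvectors and eigenvalues of R, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: retain the leading components (here, eigenvalue >= 1, i.e. Kaiser's rule).
    k = max(int(np.sum(eigvals >= kaiser_threshold)), 1)
    E = eigvecs[:, :k]
    # Step 6: principal component scores Z = E^T Y.
    Z = E.T @ Y
    return Z, E, eigvals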

4.4.1 Number of Principal Components to Retain

In PCA, the components are in decreasing order of importance, where the most important component, listed first, has the highest variance. Consequently, only the first few PCs are retained, as they explain most of the variance in the data. There exist several methods for deciding how many PCs to retain. Kaiser's rule [120], which eliminates PCs with an eigenvalue less than one, suggests retaining the first six components (see Table 4.3 for the eigenvalues). The cumulative variance of these components is 80% of the total variance of the original data, which means that the majority of the variance in the data has been accounted for by the extracted components.

Principal Component   Eigenvalues   % of Variance   Cumulative %
1                     5.410         30.054          30.054
2                     2.374         13.190          43.244
3                     2.153         11.959          55.204
4                     1.681          9.339          64.543
5                     1.432          7.954          72.497
6                     1.362          7.567          80.064
7                     0.959          5.329          85.393
8                     0.777          4.318          89.711
9                     0.687          3.817          93.528
10                    0.568          3.156          96.683
11                    0.284          1.579          98.262
12                    0.176          0.976          99.238
13                    0.121          0.673          99.911
14                    0.011          0.063          99.974
15                    0.005          0.026          99.9998
16                    0.3e-4         1.5e-6         99.99997
17                    0.6e-5         3.1e-8         99.99999
18                    0.1e-7         1.8e-16        100.000

Table 4.3: The extracted principal components and their variances.

The Scree plot of the energy contributed by each component is summarized in Figure 4.2. This plot suggests that six components can be retained, as a sharp drop occurs between the sixth and seventh components and the eigenvalues of the first six are greater than or equal to 1. The sharp drop in the curve indicates a typical cut-off for selecting the correct number of components to be considered in the analysis.

Figure 4.2: Scree plot of eigenvalues (x-axis: Component Number; y-axis: Eigenvalue).

However, the extracted communality of the variables, i.e. the amount of variance within each variable accounted for by the components, indicates that one of the variables, namely the total number of distinct open TCP ports targeted, has a low extraction value (see Table 4.4). This suggests the inclusion of more components. After the inclusion of the seventh component, all the communalities are high, which indicates that the extracted components represent the variables well.

All of the above supports the decision to retain seven components, with the rest of the components being eliminated. Table 4.3 shows the accumulated percentages of the total variances of the 18 extracted components. The first seven components contribute over 85% of the total variance in the original data, which suggests that the extracted components are very representative of the data.
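These retention criteria are easy to automate. The sketch below is a hedged illustration (not the thesis code) that assumes the eigenvalues and eigenvectors produced by the correlation-based PCA above; it computes Kaiser's rule, a cumulative-variance cut-off, and the communalities used in Table 4.4:

    import numpy as np

    def retention_diagnostics(l, A, var_threshold=0.80):
        """Retention heuristics for correlation-based PCA.

        l : eigenvalues in decreasing order; A : matching eigenvectors (columns).
        """
        k_kaiser = int(np.sum(l > 1.0))              # Kaiser's rule: eigenvalue > 1
        cum_var = np.cumsum(l) / np.sum(l)           # cumulative proportion of variance
        k_var = int(np.searchsorted(cum_var, var_threshold)) + 1
        def communalities(k):
            # Loadings are eigenvectors scaled by the square roots of the eigenvalues;
            # a variable's communality is the sum of its squared loadings over the
            # first k components.
            loadings = A[:, :k] * np.sqrt(l[:k])
            return (loadings ** 2).sum(axis=1)
        return k_kaiser, k_var, communalities

For the honeypot data, Kaiser's rule gives six components, and the low communality of D_TCP_O under six components motivates retaining a seventh.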

4.5 Interpretation of the Results

As mentioned earlier, components are listed in decreasing order of importance, where components with larger variances are more important and give more information about the data. The components were rotated to simplify the analysis and make the interpretation easier.


Variable       Communalities of Extraction 1   Communalities of Extraction 2
               (Six Components)                (Seven Components)
TF             .998                            .999
TCP_O          .998                            .999
D_TCP_O        .337                            .794
TCP_C          .861                            .882
D_TCP_C        .906                            .927
UDP_O          .637                            .815
D_UDP_O        .625                            .912
UDP_C          .574                            .703
D_UDP_C        .892                            .898
ICMP           .749                            .814
TM             .717                            .892
DUR            .996                            .997
SPKTS          .998                            .998
SBYTES         .993                            .994
SRATE          .991                            .991
AVG_PK_SIZE    .561                            .778
T_ACT          .995                            .995
IAT            .585                            .752

Table 4.4: The extracted communalities of variables.

Interpretation of the components is achieved by examining the loadings of the variables for each component, as variables with high loadings are of high significance in the interpretation. Each PC's interpretation was then validated by inspecting sample traffic against the original data. For this study, variables with a loading value over 0.6 were selected as they are the most significant in the analysis. Table 4.5 shows the Varimax rotation of the principal components. The goal of the rotation is to ease interpretation by increasing the contrasts between the variables' loadings.

Interpretation of the first seven PCs (PC1-PC7) for the honeypot data is given in Table 4.6. The first component (PC1) is highly correlated with the total number of basic flows, total number of TCP ports targeted, total duration of basic flows, total number of source packets, and total number of source bytes. The first component indicates high interaction between attackers and the honeypot on open ports and, as the variance suggests, is the most important component. PC2 is highly correlated with closed TCP ports. This component suggests vertical and horizontal scan activities which focus on very specific ports. In PC3, activities target closed UDP ports and could be interpreted as spam, worm activities, or mis-configured servers.


Variable       PC1     PC2     PC3     PC4     PC5     PC6     PC7
TF             .995    -.009   .000    .096    -.014   .001    -.004
TCP_O          .995    -.009   .000    .096    -.014   .001    -.004
D_TCP_O        .077    .310    -.065   .050    .004    -.019   .827
TCP_C          -.020   .882    -.013   .283    .066    -.028   .135
D_TCP_C        .011    .950    -.017   -.049   -.017   -.016   .148
UDP_O          .005    -.027   -.005   -.030   .131    .800    .045
D_UDP_O        -.012   .016    -.012   .063    -.009   .790    -.042
UDP_C          .006    .063    .797    -.003   .022    -.046   -.229
D_UDP_C        -.006   -.009   .945    -.004   -.050   .000    .044
ICMP           -.046   -.469   -.198   .036    .654    -.269   .191
TM             -.010   .242    -.043   -.008   .841    .013    -.076
DUR            .994    -.015   .001    .095    -.016   .002    -.015
SPKTS          .992    .018    -.005   .101    -.005   -.005   .050
SBYTES         .989    .029    -.005   .102    -.002   -.008   .075
SRATE          .073    .102    .000    .986    .014    .027    .035
AVG_PK_SIZE    -.012   -.115   .682    .010    -.082   .060    .512
T_ACT          .434    .088    .000    .893    .005    .022    .020
IAT            -.005   -.072   .046    .011    .728    .224    -.016

Table 4.5: The Varimax rotation of principal components.

PC4 is related to repeated activities over a short period of time; this is explained by the high correlations between the total activities and the variables of the first PC, such as SPKTS, SBYTES, DUR, TCP_O, and TF. PC5 represents the total machines targeted and ICMP traffic. It can be inferred that these activities are IPs sweeping the globe seeking live machines. PC6 represents activities that target open UDP ports. PC7 is a subset of the first component and represents short attacks against specific open ports, mainly ports 80, 139, and 445.

The principal component analysis of the data shows that there are at least seven clusters of activities represented in the data. These clusters of activities can be separated, and PCA can then be applied further to find new sub-clusters of activities, with the process repeated.
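As a concrete illustration of this interpretation step, the sketch below applies a generic compact Varimax routine (an assumption; it is not the rotation code used for the thesis) to the loadings and lists, for each retained component, the variables whose absolute rotated loadings exceed 0.6. The variable-name list passed in is likewise illustrative:

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        """A common compact Varimax rotation of a (p x k) loading matrix."""
        p, k = loadings.shape
        R, d = np.eye(k), 0.0
        for _ in range(max_iter):
            Lam = loadings @ R
            u, s, vt = np.linalg.svd(
                loadings.T @ (Lam ** 3
                              - (gamma / p) * Lam @ np.diag((Lam ** 2).sum(axis=0))))
            R = u @ vt
            d_new = s.sum()
            if d_new < d * (1.0 + tol):
                break
            d = d_new
        return loadings @ R

    def dominant_variables(l, A, names, k, threshold=0.6):
        """Variables whose rotated loadings exceed the threshold, per component."""
        rotated = varimax(A[:, :k] * np.sqrt(l[:k]))
        return {"PC%d" % (j + 1):
                [names[i] for i in np.where(np.abs(rotated[:, j]) > threshold)[0]]
                for j in range(k)}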

4.6 Interrelations Between Components

Plots of PCs can serve two main purposes: to define the interrelations between components and to identify outliers.


Principal Component   Percentage of Variation   Interpretation
1                     30.054 %                  Targeted attacks against open ports
2                     13.190 %                  Scan activities
3                     11.959 %                  Spam or mis-configuration
4                     9.339 %                   Repeated short activities
5                     7.954 %                   Detection activities
6                     7.567 %                   Targeted attacks against open UDP ports
7                     5.329 %                   Short attacks

Table 4.6: Interpretations of the first seven components.

As discussed in Section 4.5 (Interpretation of the Results), the two components PC2 and PC5 represent two types of activities: TCP scanning and live machine detection, respectively. The interrelationships between these two components are presented in Figure 4.3.

Figure 4.3: The scatter plot of TCP scan (PC2) vs live machine detection (PC5).

The figure shows that there are at least two clusters of activities: detection with very few scans, on the upper left side of the figure along the PC5 axis; and scans with very few machine detection activities, at the bottom of the figure along the PC2 axis. Mixed activities, with a moderate rate of scans and moderate live machine detection, are located in the middle part of the figure. Extreme scanning and live machine detection activities are also visible as far points along both PC axes.


An example of scan-only activities is observation 4253, which originated in Germany. The IP scanned all machines for closed port 2967 and then, two weeks later, scanned closed port 5904. Observation 304 is an example of the second type, a live machine detector. The IP originated in Japan and was involved only in detection activities.

4.7 Identification of Extreme Activities

Outliers, in statistics, can be defined as observations that deviate significantly from the rest of the data [35, 135]. In honeypot traffic, outliers are extreme activities that are distant from the p-dimensional hyperspace defined by the variables. Detecting extreme activities in honeypot traffic is analogous to outlier detection in multivariate statistics. In this study, we are concerned with two types of extreme activities (outliers) in low-interaction honeypot traffic: model and structure extreme activities. The first type (type I), model extreme activities, represents activities that have high values across some or all of the variables. In contrast, the second type (type II), structure extreme activities, represents traffic activities that violate the structure of the data represented by the main principal components.

The aim of detecting these extreme activities is to help in searching for the root causes of variations in the patterns of the defined structures and to take measures to protect production networks against them. Extreme activities in honeypot traffic might arise from the introduction of new malicious network activities or from intensive existing activities, such as releases of new automated malicious code (worms), the discovery of new vulnerabilities, or even mis-configured servers.

One of the challenges in detecting outliers in high-dimensional data, such as honeypot traffic, is the difficulty of inspecting large numbers of variables in the data set simultaneously. In addition, inspecting each variable by itself, or even inspecting plots of pairs of variables, might not reveal any extreme behavior when it is the combination of multiple variables that makes an observation an outlier. This study provides a preliminary investigation of utilizing principal component analysis in detecting extreme observations, through graphical inspection of plots of the first few and last few principal components, and through the statistics of the squares of the weighted principal component scores against the squared Mahalanobis distance.

Inspecting two- and three-dimensional scatter plots of the first few and last few PCs for detecting outlying observations was suggested by Gnanadesikan [65]. This was justified, since the first few PCs are good at detecting outliers that inflate the correlations (model outliers), while the last few PCs are useful in detecting outliers that add unimportant dimensions to the data and which are hard to distinguish from the original variables (structure outliers).

Figure 4.4: The scatter plot of the first two principal components.

The scatter plot of the first two principal components is illustrated in Figure 4.4. These two components, PC1 and PC2, account for 43% of the total variance in the data. The first component has high loading values on multiple variables: total number of basic flows, total number of open TCP ports targeted, total duration of basic flows, total number of source packets, and total number of source bytes. The second component has high loadings on two variables: total number of closed TCP ports and distinct closed TCP ports targeted. Outlying observations (circled on the plot) can be spotted in Figure 4.4 as points that have extreme values along the principal component axes near the edges, far from the body of the data. Observations 4124, 4900, 3929, 4892, 4131, 3720, 426, 4890 and 428 are extreme on the first principal component (PC1), while observations 426 and 428 appear extreme on the second principal component (PC2). Specific outliers are discussed in Section 4.7.1.

The scatter plot of the last two principal components, PC17 and PC18, which account for less than 1% of the total variance, is illustrated in Figure 4.5.


Figure 4.5: The scatter plot of the last two components.

There are two observations, 4131 and 1193, that appear extreme on PC17 and PC18, near the edges of the graph.

Although scatter plots of principal components are very useful for spotting outlying observations visually, automatic detection of outlying observations can be achieved through the construction of a control ellipse. As the contours of constant probability for a p-dimensional normal distribution are ellipsoids [74], the ellipsoid defined by the random vector X has the following characteristics:

• The constant probability contour for the distribution of X is defined by

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} = \text{Const} \qquad (4.15)

where Z_{ik} is the score of the kth PC of the ith observation and l_k is the kth eigenvalue. The ellipsoid is centered at the mean and its axes lie along the principal components, where half the square roots of the eigenvalues (l_1^{1/2}, l_2^{1/2}, .., l_p^{1/2}) give the lengths of its semi-major and semi-minor axes.

• The ellipsoid in p-dimensional space containing a value x satisfies

\sum_{k=1}^{p} \frac{Z_{ik}^2}{l_k} \le \chi^2_p(\alpha) \qquad (4.16)

where \chi^2_p(\alpha) is the percentile of a chi-square distribution with p degrees of freedom.

Setting a threshold for detecting outlying observations based on \chi^2_p(\alpha) requires the distribution of X to be multivariate normal. However, since we do not make any assumptions about the distribution of our data, the population ellipsoid in Equation 4.15 remains valid without any normality assumption, although the ellipsoid loses its interpretation as a contour of constant probability [74], and the threshold can instead be computed from the empirical distribution of the principal components.
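Computationally, this check amounts to comparing a weighted sum of squared PC scores with a limit. A minimal sketch, assuming the scores and eigenvalues from the earlier analysis and an empirical three-standard-deviation limit (the exact limit used for Figure 4.6 is not reproduced here):

    import numpy as np

    def ellipse_statistic(Z, l, dims=(0, 1)):
        """Eigenvalue-weighted sum of squared PC scores (Eq. 4.15) over `dims`."""
        idx = list(dims)
        return (Z[:, idx] ** 2 / l[idx]).sum(axis=1)

    def outside_control_ellipse(Z, l, n_std=3.0):
        """Flag observations outside an empirical limit on the PC1-PC2 ellipse."""
        t = ellipse_statistic(Z, l)
        limit = t.mean() + n_std * t.std(ddof=1)   # empirical threshold, no normality assumed
        return np.where(t > limit)[0], limit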

Figure 4.6 provides a zoomed view of Figure 4.4, omitting the most obvious outliers, and sketches the control ellipse for the first two principal components.

Figure 4.6: The ellipse of a constant distance.

Jolliffe [74] discussed the use of the sum of the squares of the weighted principal component scores of the last principal components in detecting outliers that are hard to distinguish from the original variables, which is given by:

D_i = \sum_{k=p-q+1}^{p} \frac{Z_{ik}^2}{l_k}

where q < p, Z_{ik} is the score of the kth PC of the ith observation, and l_k is the kth eigenvalue. When q = p, the equation represents the squared Mahalanobis distance of the ith observation from the mean of the data, which is given by:

M_i^2(x_i) = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \qquad (4.17)

Figure 4.7: The scatter plot of the statistics D_i vs. (M_i^2 - D_i) for q = 7 (x-axis: sum of (PC score)^2 / eigenvalue for q = 7; y-axis: squared Mahalanobis distance minus that sum; labelled extreme points include observations 1980, 426, 614, 4124, 428, 4900, 4917 and 3620).

Figure 4.7 provides a scatter plot of the statistics D_i vs. (M_i^2 - D_i) for detecting outliers that are different from those of the first p - q components [53]. Finally, most of the detected outlying observations were identified by more than one statistic, but with different ordering. Table 4.7 lists the top five outliers, ordered according to their significance (from high to low).

PC1    PC2    PC17   PC18   M^2    D_i    M_i^2 - D_i
4124   426    4131   1193   1980   614    1980
4900   428    5094   4900   426    426    426
3929   2228   5386   4426   614    4917   4124
4892   3222   4124   4882   4124   3620   428
4131   362    4915   419    428    4131   4900

Table 4.7: The top five extreme observations.
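The statistics behind Table 4.7 follow directly from the PC scores and eigenvalues. A hedged sketch (the value q = 7 matches Figure 4.7; the ranking step is illustrative):

    import numpy as np

    def outlier_statistics(Z, l, q=7):
        """M^2, D_i and their difference from full PC scores Z (n x p).

        Assumes the columns of Z are ordered by decreasing eigenvalue, so the
        last q columns are the q least significant components.
        """
        weighted = Z ** 2 / l                  # per-component contributions Z_ik^2 / l_k
        M2 = weighted.sum(axis=1)              # squared Mahalanobis distance (q = p)
        Di = weighted[:, -q:].sum(axis=1)      # sum over the last q components only
        return M2, Di, M2 - Di

    def top_outliers(stat, top=5):
        """Indices of the `top` most extreme observations for a given statistic."""
        return np.argsort(stat)[::-1][:top]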


4.7.1 A Discussion of the Detected Outliers

To judge the significance of the detected outliers, sample points of the outlying observations were manually inspected against the original data set to explain the reasons these points were selected as outliers.

Observation 4124, which is extreme on PC1 in Figure 4.4, was the result of an attack from an IP in the USA targeting one machine on a single open TCP port, port 80. The attack started on Wednesday, 21 November 2007 at 06:18:44 GMT and ended on Friday, 23 November 2007 at 08:01:08 GMT. The attack generated over 150,062 packets. Observation 4124 was also extreme on M^2, (M_i^2 - D_i), and PC17. While it is very hard to reach a definitive conclusion about its exact nature, due to the low level of detail in the low-interaction honeypot traffic, this observation resembles a denial of service attack and falls under the first type, extreme existing activities.

Observation 2228 is extreme on both the PC2 and (M_i^2 - D_i) statistics. The attacking IP originated in China and its activity lasted less than 10 seconds. It was a combination of ICMPs and moderate scans of seven unusual closed TCP ports and one open TCP port, port 80. The attacker targeted all machines in the honeypot environment. This observation represents an intensive version of existing scanning activities.

Observation 1193 is extreme on both PC17 and PC18. The IP address originated in Thailand and the activity lasted for 40 minutes. It consisted mainly of alternating connections to two open TCP ports, 445 and 139, and one closed TCP port, 9988. This observation has a large value for the TCP_O variable, moderate values for the TF, DUR, and SPKTS variables, and low values for the TCP_C and ICMP variables.

Observation 614 is an outlier on the M^2 and D_i statistics. It was caused by an attack from an IP address that originated in Romania and lasted for 40 minutes. It consisted mainly of connections to two open TCP ports (445 and 139) and one closed TCP port (9988). This observation shows similar behavior to observation 1193, with the same duration, but from a different IP in a different country a week later. Observations 1193 and 614 represent structure extreme activities (type II). The pattern of these activities is the same as that of a class of worms that targets the Microsoft LSASS vulnerability.

Observation 1980 generated a sustained stream of UDP traffic (two packets every 30 minutes) against port 137. The attack took place between Thursday, 18 Oct 2007 and Wednesday, 24 Oct 2007, and has large UDP_O and IAT values. This observation is on the top lists of outliers on both M^2 and (M_i^2 - D_i). Observation 1980 is most likely caused by worms or other malicious activities that scan and exploit the NetBIOS Name Service. Observation 1980 represents a type II, or structure, extreme activity.

The main source of difference between the two statistics M^2 and (M_i^2 - D_i) was the value of q in the D_i statistic. More experiments are needed to select an appropriate value of q for the current data set. Moreover, setting a higher value for the activity flow time-out, currently 60 minutes, would improve the detection of attacks that propagate slowly over an extended period, such as observations 1980 and 3105.

4.8 Summary

In this chapter, the use of principal component analysis (PCA) on the traffic flows of low-interaction honeypots has been proposed. PCA proved to be a very powerful tool for detecting the structure of attackers' activities and for decomposing the traffic into dominant clusters. Moreover, scatter plots of the PCs were very efficient for examining the interrelationships between components, or groups of activities, and for identifying outliers. Experimental results on real traffic data showed that principal component analysis provides very simple and efficient visual summaries of honeypot traffic and attackers' activities.

In the next chapters, it will be shown how PCA can be used to detect new attacks that have not previously been seen by the honeypot.


Chapter 5

Detecting New Attacks

The previous chapter proposed the use of principal component analysis (PCA) in characterizing honeypot traffic. The strength of principal component analysis has been shown in unveiling honeypot traffic structure, visualizing attackers' activities, and detecting outliers. The analysis is extended further to benefit from PCA's strength in detecting different types of outliers. This chapter presents a technique for detecting new attacks in low-interaction honeypot traffic through the use of the principal components' residual space.

The main contribution of this chapter is the detection of new attacks using the residuals of principal component analysis (PCA) and the square prediction error (SPE) statistic. The research work described in this chapter has led to the publication of the following paper:

• S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "A Technique for Detecting New Attacks in Low-Interaction Honeypot Traffic", in the Proceedings of the Fourth International Conference on Internet Monitoring and Protection, Venice, Italy: IEEE Computer Society, 2009.

The chapter is structured as follows. Section 5.1 provides a brief introduction to the methodology of detecting attacks in honeypot traffic. Section 5.2 introduces the principal components' residual space and the square prediction error (SPE) statistic. The data set used in this study, the pre-processing, and the detection model architecture are described in Section 5.3. A practical step-by-step illustrative example is demonstrated in Section 5.4. The results and the evaluation of the detection technique are discussed in Section 5.5. Finally, the chapter is summarized in Section 5.6.

5.1 Introduction

The method presented in this chapter for detecting attacks draws its roots from anomaly intrusion detection, through building a model of the honeypot's profile and using the capabilities of a multivariate statistical technique, namely principal component analysis, in detecting different types of outliers. New observations are projected onto the residual space of the least significant principal components, and their distances from the main PCA hyperspace, defined by the first k principal components, are measured using the square prediction error (SPE) statistic. A higher SPE value indicates that the new observation represents a new direction that has not been captured by the PCA model of attacks seen in the historical honeypot traffic.

A number of researchers have used principal component analysis (PCA) to identify attacks [88, 25, 128, 42, 86]. However, previous applications of PCA treat the network traffic as a composition of normal and anomalous activities, and the detection model is then built from the normal part. The notion of normal and anomalous does not apply to honeypot traffic, where all traffic is potentially malicious. Thus, our technique differs from those techniques in using PCA to build a model of existing attacks that have been seen in the past, and then using the residual space to detect any large deviation from the attack model as either a new attack vector or an attack that is not present in the model's historical data.

5.2 Principal Component’s Residual Space

There are two different types of spaces defined by principal component analysis:

• the main PC space, which captures most of the variation in the original data and is defined by the first k PCs. Most applications of PCA, such as data reduction, are based on utilizing this space; and

• the residual space, which represents the insignificant variation and is defined by the last (p - k) principal components with the smallest eigenvalues.


Principal component analysis can be expressed as:

Z = A^T X \qquad (5.1)

where Z contains the principal component scores obtained by projecting the observations in X onto the eigenvector matrix A. The previous equation can be represented in the original coordinates, by projecting Z back onto A; X then becomes:

X = \sum_{i=1}^{k} A_i Z_i + \sum_{j=k+1}^{p} A_j Z_j \qquad (5.2)

X = \sum_{i=1}^{k} A_i Z_i + E = \hat{X} + E \qquad (5.3)

and the residual matrix E represents the difference between X and \hat{X}:

E = X - \hat{X} \qquad (5.4)

Outliers detected by principal component analysis are divided into two categories, based on the principal component space [75]:

• General outliers (type I in the previous chapter), which inflate the variance. PCA is very effective in detecting outliers of this type, but they are also detectable by inspecting variables individually or by using other multivariate techniques such as the Mahalanobis distance.

• Specific outliers (type II). These outliers represent new directions that are not captured by the PC model and can be detected using the sum of squares of the residuals, or Q statistic.

5.2.1 Square Prediction Error (SPE)

Outliers that represent new directions relative to the PCA model can be tested using the Q statistic, or the square prediction error (SPE), which is defined as [75]:

Q = E^T E \qquad (5.5)

The square prediction error (SPE) measures the sum of squares of the distance of E from the main space defined by the PCA model. Alternatively, the square prediction error (SPE) can be calculated as:


Q = \sum_{i=k+1}^{p} \frac{Z_i^2}{l_i} \qquad (5.6)

The new observation is considered an outlier in the model if its Q statistic exceeds a predefined threshold.
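A minimal sketch of this test, assuming scores computed from the full set of eigenvectors ordered by decreasing eigenvalue and a threshold chosen as described later in Section 5.3.4:

    import numpy as np

    def spe(Z, l, k):
        """Square prediction error (Q statistic, Eq. 5.6) from full PC scores.

        Z : (n x p) scores ordered by decreasing eigenvalue; l : eigenvalues;
        k : number of components retained in the main space.
        """
        return (Z[:, k:] ** 2 / l[k:]).sum(axis=1)

    # Example use: flag observations whose Q value exceeds a previously chosen limit
    # suspicious = np.where(spe(Z_new, l, k) > threshold)[0]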

5.3 Data set and Pre-Processing

This section describes the data set used in this study, the pre-processing that has been applied to the data, and the architecture of the detection model.

5.3.1 Data set

The honeypot traffic data used in the analysis in this chapter also comes from the Leurré.com project [9]. Two data sets were extracted for the purpose of this study: Data set I for constructing the PCA model (the same data set used in Chapter 4), and Data set II for evaluating the model. Table 5.1 gives a brief summary of the data sets used.

Data Set   Start Date    End Date      Packets    Standard Flows   Activity Flows
I          15/09/2007    30/11/2007    839663     562470           5401
II         01/12/2007    31/03/2008    2231245    1586715          7343

Table 5.1: Summary of the data sets used in the study.

The first data set was used in the previous chapter to study the structure of attackers' activities in low-interaction honeypot traffic. The reliability and accuracy of the data set encouraged further use of the same data set in building the PCA profile of honeypot traffic. The data set was adequate in size for building the PCA model [23] and represented a trade-off between performance and size. The second data set, which was used to test the model, was the most recent data at the time of the study.


5.3.2 Processing the Flow Traffic via PCA

The processing of the raw traffic data and the extraction of traffic features were described in Chapter 4 (Sections 4.2.2-4.2.3). In this section, we describe the methodology for performing principal component analysis on the correlation matrix of honeypot activity flows.

Principal Component   Eigenvalues    % of Variance   Cumulative %
1                     5.5825         31.0139         31.0139
2                     2.6255         14.5861         45.6000
3                     2.4126         13.4033         59.0033
4                     1.8860         10.4778         69.4811
5                     1.6941          9.4117         78.8928
6                     1.2803          7.1128         86.0056
7                     0.7635          4.2414         90.2470
8                     0.6804          3.7799         94.0269
9                     0.3351          1.8614         95.8883
10                    0.2855          1.5862         97.4745
11                    0.1657          0.9203         98.3948
12                    0.1145          0.6362         99.0310
13                    0.1090          0.6056         99.6366
14                    0.0645          0.3581         99.9947
15                    0.0007          0.0041         99.9988
16                    0.0002          0.0009         99.9997
17                    0.0001          0.0003         100.0000
18                    7.3812E-15      4.10067E-14    100.0000

Table 5.2: Extracted principal components' variance.

To calculate the PCs from the correlation matrix, the p-dimensional matrix X = (X_1, .., X_p)^T is first standardized by:

C_{ij} = \frac{X_{ij} - \bar{X}_i}{\sqrt{\sigma_i}} \qquad (5.7)

for i = 1, ..., p, where \bar{X}_i is the sample mean and \sigma_i is the sample variance of X_i. Let R be the sample correlation matrix of C; the principal component scores are then

Z = A^T C \qquad (5.8)

where A = (A_1, .., A_k) is the matrix of eigenvectors of R, with the first component equal to:

Z_1 = a_{11} C_1 + a_{12} C_2 + \cdots + a_{1p} C_p \qquad (5.9)

Several factors were considered for selecting the number of principal components (PCs) that are representative of the variables. First, Kaiser's rule [120, 74] for eliminating PCs with an eigenvalue less than one (see Table 5.2 for the eigenvalues) was considered. Second, an inspection was made of the Scree plot of the energy contributed by each PC (see Figure 5.1), where a sharp drop in the curve indicates a typical cut-off for selecting the correct number of components (between six and seven PCs). Finally, consideration was given to adding the seventh component to achieve 90% of the total variance of the original data for representing the main space (90% reflects the total variance after the robustification process, described in the following section, where extreme activities are eliminated, compared to 85% in the previous chapter).

5.3.3 Robustness

Extracting principal components from a standard correlation matrix is very sensitive to outliers, since the directions of the resulting principal components may be dominated by those outliers. An effective technique for improving the principal component analysis and reducing the effect of these outliers is the robustification of the correlation matrix during the model building phase, Phase I. The robustification works by eliminating observations with a large squared Mahalanobis distance M^2 in an iterative process, until the data is believed to be clean or the given number of iterations is reached [77, 65].

Given a p-dimensional random matrix X = (X_1, .., X_p)^T of n samples, where \bar{X}_i is the sample mean and S_i is the sample variance of X_i, then M^2, given by Equation (5.10), defines an ellipsoid in p-dimensional space that is centered at the mean \bar{X}, with the distance to its surface given by the M^2 values (a constant probability density contour) [77]:

M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \qquad (5.10)


Figure 5.1: Scree plot of eigenvalues (x-axis: Component Number; y-axis: Eigenvalue).

The constant probability contour for the distribution of X satisfies M^2 \le \chi^2_p(\alpha), where \chi^2_p(\alpha) is the percentile of a chi-square distribution with p degrees of freedom. There is a probability \alpha that x_i falls outside the ellipsoid defined by M^2. Setting a threshold for detecting outlying observations based on \chi^2_p(\alpha) requires the distribution of X to be multivariate normal. However, since we do not make any assumptions about the distribution of our data, the threshold for the robustification process can be determined from the empirical distribution of M^2. The robustification algorithm is detailed in Figure 5.2 and the corresponding Matlab code is listed in the Appendices.

5.3.4 Setting up Model Parameters

A critical step in designing a detection approach is setting the limit for judging new observations, since this has a dramatic effect on the quality of the detection. When the limit value is very small, it will frequently be exceeded, resulting in a high rate of false positive alarms; when the limit is very large, it will rarely be exceeded, resulting in many false negative alarms.

Input: X_{p×n} data matrix of p variables and n observations; N number of iterations
Output: X data matrix
1   c = 0                      % iteration counter
    while (c < N)
2   {
3     c = c + 1
4     Estimate the sample mean X̄ of X
5     Estimate the sample variance S of X
6     Calculate the Mahalanobis distance M^2, Eq. 5.10, for all n observations
7     Calculate the threshold UCL, based on Eq. 5.14
8     Find O, the observations with M_i^2 > UCL, for i = 1, 2, 3, .., n
9     Update the data matrix X by trimming the observations in O: new X = X - O
10  }

Figure 5.2: Robustification of the correlation matrix through multivariate trimming.
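For illustration, the trimming loop of Figure 5.2 can be sketched as follows (this is not the thesis's Matlab implementation from the Appendices; the fixed iteration count and the early stop are assumptions):

    import numpy as np

    def robustify(X, n_iter=3, n_std=3.0):
        """Iterative multivariate trimming of an (n x p) data matrix (cf. Figure 5.2).

        Repeatedly removes observations whose squared Mahalanobis distance M^2
        exceeds the empirical limit UCL = mean(M^2) + n_std * std(M^2), and stops
        early once no observation is trimmed.
        """
        for _ in range(n_iter):
            mean = X.mean(axis=0)
            S_inv = np.linalg.pinv(np.cov(X, rowvar=False))   # pinv guards against a near-singular S
            diff = X - mean
            m2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # M^2 for every observation (Eq. 5.10)
            ucl = m2.mean() + n_std * m2.std(ddof=1)          # empirical threshold (Eq. 5.14)
            keep = m2 <= ucl
            if keep.all():
                break
            X = X[keep]
        return X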

Let X be a data matrix of n samples of p-dimensional data, where \bar{X} = (\bar{X}_1, .., \bar{X}_p)^T is the sample mean vector of X and R is the sample correlation matrix. The sum of the squares of the weighted principal component scores of the last q principal components (the residual space), used in detecting outliers, is given by:

Q_i = \sum_{k=p-q+1}^{p} \frac{Z_{ki}^2}{l_k} \qquad (5.11)

where q < p, Z_{ki} is the score of the kth PC of the ith observation, and l_k is the kth eigenvalue. When q = p, the previous equation can be represented by the squared Mahalanobis distance of the ith observation from the mean of the data, which is given by:

M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \le UCL \qquad (5.12)

M^2 then follows a chi-square distribution \chi^2, for larger sample sizes, with p degrees of freedom [38]. Thus, the upper control limit becomes:

UCL = \chi^2_{1-\alpha, p} \qquad (5.13)

where \chi^2_p(\alpha) is the percentile of a chi-square distribution with p degrees of freedom.


Although we make no assumption about the exact distribution of each of the p variables, and we are only interested in large values of M^2, the upper limit can be computed from the empirical distribution of the M^2 population as follows:

UCL = u + 3s \qquad (5.14)

where u is the sample mean and s is the standard deviation.

Even if X is not normally distributed, setting the control limit at a multiple of the standard deviation, usually 3, is an acceptable practice and gives good practical results [99, 15]. If the data is normally distributed, then over 99.7 percent of the data will fall within the control limits, and the probability of an observation exceeding the upper limit is roughly one in a thousand (0.001). Alternatively, Chebyshev's inequality states that, regardless of the distribution of the data X, at least 89 percent of the observations fall within 3 standard deviations of the mean [98].

The M^2 test is equivalent to using all principal components in Equation 5.11. The M^2 test was used in Phase I to clean the data and to reduce the effect of extreme observations before conducting the analysis and estimating the model parameters.

The square prediction error (SPE), or Q-statistic, is a test of how well a particular observation fits the principal component model. SPE is calculated from the sum of squares of the residuals, and it measures the distance from the observation to the k-dimensional hyperspace defined by the PCA model. A high value of SPE indicates that the new observation represents a new direction not included in the PCA model. The Q-statistic of the residual space can be represented by the sum of the squares of the weighted principal component scores of the residual-space principal components in Equation 5.11. The upper limit for Q is given by [75]:

Q_\alpha = \theta_1 \left[ \frac{C_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \qquad (5.15)

where C_\alpha is the normal deviate corresponding to the upper (1 - \alpha) percentile, \theta_j = \sum_{i=k+1}^{p} l_i^j for j = 1, 2, 3, and h_0 = 1 - \frac{2\theta_1\theta_3}{3\theta_2^2}.
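A sketch of this limit computed from the residual eigenvalues is given below (illustrative only; it relies on SciPy's normal quantile function and is not part of the original analysis code):

    import numpy as np
    from scipy.stats import norm

    def q_alpha(residual_eigvals, alpha=0.01):
        """Upper control limit for the Q (SPE) statistic, following Eq. 5.15.

        residual_eigvals : eigenvalues l_{k+1}, ..., l_p of the residual space.
        """
        l = np.asarray(residual_eigvals, dtype=float)
        t1, t2, t3 = (np.sum(l ** j) for j in (1, 2, 3))
        h0 = 1.0 - (2.0 * t1 * t3) / (3.0 * t2 ** 2)
        c_alpha = norm.ppf(1.0 - alpha)                  # upper (1 - alpha) normal deviate
        term = (c_alpha * np.sqrt(2.0 * t2 * h0 ** 2) / t1
                + 1.0
                + t2 * h0 * (h0 - 1.0) / t1 ** 2)
        return t1 * term ** (1.0 / h0)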

The use of the upper limit in Equation 5.15 assumes that the data is normally distributed. Alternatively, the upper limit can be set based on the empirical distribution of Q.


Figure 5.3: Detection model architecture. Phase I (historical honeypot traffic): basic flow extraction, filtering, aggregated flow extraction and feature extraction, followed by standardization of observations, robustification, extraction of PCs, and generation of the PCA model parameters. Phase II (test traffic): flow aggregation and feature extraction, standardization of the new observation, and detection of new attacks against the PCA model parameters.

The Q statistic is used in Phase II to detect new attacks in the detection model.

5.3.5 Model Architecture

The architecture of the detection model is depicted in Figure 5.3. As the figure shows, the model consists of three main components:

• Traffic Flow Aggregator: The traffic flow aggregator accepts Argus traffic flows [2], set to a 5-minute maximum expiration, and then groups the traffic flows into activity flows. The newly generated flows, activity flows, are combined by the source IP address of the attacker with a maximum of 60 minutes inter-arrival time between original flows (a sketch of this aggregation step is given after this list). Internet noise, such as backscatter, is filtered out in this model.

• PCA Model Extraction: Here, the PCA profile is built from historical honeypot data. This includes the calculation of the correlation matrix, the extraction of the eigenvectors and eigenvalues, and the generation of principal components.

• Detection: In the detection model, new observations are tested against the predefined PCA model parameters for detecting new attacks.
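The aggregation rule of the first component reduces to grouping basic flows by source IP with a 60-minute inter-arrival cut-off. The sketch below is illustrative only; the field names src_ip and start_time are assumptions and do not correspond to the Argus record format:

    from collections import defaultdict

    def aggregate_activity_flows(basic_flows, timeout=3600):
        """Group basic flows into per-source activity flows (illustrative sketch).

        basic_flows : iterable of dicts with 'src_ip' and 'start_time' (seconds),
        assumed to be sorted by start_time. A gap of more than `timeout` seconds
        between consecutive flows from the same source starts a new activity flow.
        """
        activity = defaultdict(list)   # src_ip -> list of activity flows (lists of basic flows)
        last_seen = {}                 # src_ip -> start time of that source's previous basic flow
        for flow in basic_flows:
            ip, t = flow['src_ip'], flow['start_time']
            if ip not in last_seen or t - last_seen[ip] > timeout:
                activity[ip].append([])            # start a new activity flow for this source
            activity[ip][-1].append(flow)
            last_seen[ip] = t
        return activity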


The methodology of detecting new attacks in low-interaction honeypot traffic is adapted from multivariate statistical process control (MSPC), a statistical technique widely used for monitoring production processes in industry (e.g. the chemical industry) to detect manufacturing process faults [99]. The proposed detection model operates in two phases:

• Phase I: Building a PCA profile of the honeypot traffic from historical data over a defined period of time. This includes the calculation of the correlation matrix, the extraction of the eigenvectors and eigenvalues, and the generation of principal component scores. Figure 5.4 illustrates the steps required to construct the detection model.

Input: X_{p×n} data matrix of aggregated flows of p variables and n observations; N number of iterations
Output: X̄ mean, S variance, E residual space, UCL upper control limit
1   Clean the data matrix of extreme observations, as described in Section 5.3.3
2   Compute the mean vector X̄ of X
3   Compute the variance S of X
4   Compute Y, the standardized observation vectors of X, using Eq. 5.7
5   Estimate the correlation matrix from X
6   Calculate the eigenvalues (L) and eigenvectors (A)
7   Extract the PC scores using Eq. 5.8
8   Find the number of significant PCs, k, and the residual space E, using the criteria described in Section 5.3.2
9   Compute the Q statistics using Eq. 5.11
10  Compute the upper control limit, UCL, for judging future attacks using Eq. 5.14

Figure 5.4: Steps for building the PCA model (Phase I).

• Phase II: Detecting new attacks, where new observations are standardized and projected onto the residuals of the predefined PCA model, and their SPE values are then tested against a predefined threshold. Figure 5.5 illustrates the steps required to apply the detection model.

Input: X_{p×1} attack vector; model parameters from Phase I (X̄ mean, S variance, E residual space, UCL upper control limit)
1   Standardize the new attack vector using Eq. 5.7
2   Project the new attack vector Y onto E, using Eq. 5.8
3   Compute the Q statistic from Eq. 5.11
4   If Q > threshold (UCL), then investigate X as a possible new attack

Figure 5.5: Steps for detecting new attacks (Phase II).
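Taken together, Figures 5.4 and 5.5 amount to the following compact sketch (illustrative only, assuming the data has already been robustified and the number of retained components k has been chosen as described above):

    import numpy as np

    def build_model(X, k):
        """Phase I (cf. Figure 5.4): PCA profile from historical features (n x p)."""
        mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
        Y = (X - mean) / std                            # standardize (Eq. 5.7)
        l, A = np.linalg.eigh(np.corrcoef(Y, rowvar=False))
        order = np.argsort(l)[::-1]
        l, A = l[order], A[:, order]
        E, l_res = A[:, k:], l[k:]                      # residual space and its eigenvalues
        q = ((Y @ E) ** 2 / l_res).sum(axis=1)          # Q statistic of the training data (Eq. 5.11)
        ucl = q.mean() + 3.0 * q.std(ddof=1)            # empirical upper control limit (Eq. 5.14)
        return {'mean': mean, 'std': std, 'E': E, 'l_res': l_res, 'ucl': ucl}

    def is_new_attack(x, model):
        """Phase II (cf. Figure 5.5): test one new attack vector x (length p)."""
        y = (x - model['mean']) / model['std']          # standardize with Phase I parameters
        z = model['E'].T @ y                            # project onto the residual space
        q = float((z ** 2 / model['l_res']).sum())      # square prediction error (SPE)
        return q > model['ucl'], q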

To test the model's ability to detect traffic that is not present in its training data set, proof-of-concept testing was conducted. This testing consisted of two parts: manual generation of new traffic not present in the training data set, and the testing of the detection model against this new traffic. A set of new traffic was generated, consisting of SYN-flooding attacks using Hping3 [7], Nmap SYN scans [62], operating system identification using Xprobe2 [144], and Nessus vulnerability scans [13]. The testing results showed that all of the new traffic was detected by our technique as not having been seen in the training data and was assigned high SPE values. To confirm that the new traffic did not exist in the original training data set, a new attack model was constructed which included the newly generated traffic. When the generated traffic was projected onto the residual space of the new model, the Q values were small and below the threshold value. While it is not claimed that the technique is capable of detecting all new attacks, the testing results confirm that this detection model is capable of detecting traffic that is either new or not present in the training data set.

5.4 Illustrative Example

To illustrate this methodology of modeling low-interaction honeypot traffic and utilizing the principal component residual space in detecting new attacks, the following example provides a practical step-by-step demonstration on a sample of honeypot traffic. The sample traffic has been reduced in its feature space from 18 to only 10 variables, to simplify both the calculation and the demonstration.

First, let X be the data matrix of the historical honeypot traffic flows (see Table 5.3). The features used are: the total number of basic flows generated by individual IPs (V1); total number of open TCP ports targeted (V2); total number of distinct open TCP ports targeted (V3); total number of closed TCP ports targeted (V4); total number of distinct closed TCP ports targeted (V5); total number of ICMP flows (V6); total number of machines targeted (V7); total duration of basic flows (V8); total number of source packets (V9); and summation of inter-arrival times between basic flows (V10).

No.   V1   V2   V3   V4   V5   V6   V7   V8        V9    V10
1     1    1    1    0    0    0    1    0.27      2     0.00
2     2    0    0    0    0    1    1    3.12      4     1.71
3     46   2    1    42   7    2    2    41.46     52    0.27
4     24   2    1    22   9    0    2    41.28     30    0.13
5     6    2    1    4    2    0    2    21.27     14    1107.05
6     87   66   2    21   1    0    2    1472.31   257   329.59
7     54   47   2    3    1    4    2    458.22    205   18.80
8     51   45   2    2    1    4    2    282.44    191   4.53
9     47   42   2    3    1    2    1    284.62    174   6.53
10    12   8    1    0    0    4    2    44.78     35    11.13
11    18   16   2    0    0    2    1    91.97     64    5.44
12    17   15   2    0    0    2    1    83.39     60    5.24
13    39   34   2    3    1    2    1    278.89    143   7.45
14    31   26   1    3    1    2    1    285.90    135   14.82
15    71   63   2    4    1    4    2    387.55    264   3.77
16    15   0    0    15   1    0    1    0.24      15    865.32
17    7    4    2    3    1    0    1    147.01    25    137.61
18    40   2    1    36   6    2    2    41.76     46    0.09
19    4    0    0    4    1    0    2    0.00      4     0.99
20    29   22   2    3    1    4    2    136.35    97    5.18

Table 5.3: Sample traffic matrix.

The example is divided into two parts: the first part demonstrates the construction of the PCA model, which includes data cleaning, standard principal component analysis extraction, and isolation of the principal component residual space; the second part shows how new traffic vectors are manipulated and then tested against the principal component residual space for new attack detection.

5.4.1 PCA Model Construction

First, the data set is arranged into a data matrix X (where the rows represent variables and the columns constitute observations):

X_{(10 \times 20)} = (X_1, .., X_{10})^T =
\begin{pmatrix}
X_{1,1}  & X_{1,2}  & \cdots & X_{1,20}  \\
X_{2,1}  & X_{2,2}  & \cdots & X_{2,20}  \\
\vdots   & \vdots   &        & \vdots    \\
X_{10,1} & X_{10,2} & \cdots & X_{10,20}
\end{pmatrix}
\qquad (5.16)


Then the mean vector \bar{X} = (\bar{X}_1, .., \bar{X}_{10})^T is computed as:

\bar{X}_i = \frac{\sum_{j=1}^{n} X_{ij}}{n}, \quad \text{for } i = 1, .., 10, \ j = 1, .., 20

\bar{X} = (30.0500, 19.8500, 1.3500, 8.4000, 1.7500, 1.7500, 1.5500, 205.1457, 90.8500, 126.2825)^T \qquad (5.17)

Before conducting any analysis on the data set, a cleaning process is needed to robustify the analysis and to reduce the effect of extreme values. Using the algorithm described in Figure 5.2, an iterative ellipsoidal trimming is applied to the data set. The iterative trimming uses the Mahalanobis distance to measure the distance of observations from the center of the data; extreme observations are those that exceed a predefined limit. Let S be the covariance matrix of the sample data:

S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T

S =
\begin{pmatrix}
577.84  & 461.17  & \cdots & 1901.1  & -1240.4 \\
461.17  & 491.4   & \cdots & 1903.2  & -1211   \\
9.9289  & 11.318  & \cdots & 43.213  & -64.624 \\
102.66  & -45.989 & \cdots & -68.095 & 196.35  \\
9.1184  & -15.671 & \cdots & -38.513 & -42.505 \\
15.487  & 16.803  & \cdots & 70.539  & -219.2  \\
4.9184  & 2.1395  & \cdots & 10.297  & 4.8643  \\
6204.2  & 5801.1  & \cdots & 22043   & 1727.6  \\
1901.1  & 1903.2  & \cdots & 7539.9  & -5333   \\
-1240.4 & -1211   & \cdots & -5333   & 93888
\end{pmatrix}
\qquad (5.18)

(only the first two and last two columns of S are shown).


The squared Mahalanobis distance M^2 of the data set is computed as

M_i^2 = (X_i - \bar{X})^T S^{-1} (X_i - \bar{X}) \le UCL, \quad i = 1, .., 20 \qquad (5.19)

The upper control limit UCL for testing extreme values needs to be estimated. First, the mean (\bar{M}) and the standard deviation (S_M) of the M^2 values are calculated. Then the upper control limit UCL is derived from the empirical distribution of M^2 using the following formula:

UCL = \bar{M} + 3 S_M \qquad (5.20)

Testing the M^2 values in Vector 5.21 against the threshold value reveals that none of the observations qualify for elimination. Had some observations been eliminated, the mean \bar{X} would need to be recalculated on the newly reduced data set.

M^2 = (5.19, 18.05, 9.55, 17.14, 14.52, 17.82, 4.52, 3.33, 5.12, 11.89, 3.57, 3.69, 2.83, 14.83, 9.31, 13.26, 9.97, 6.87, 14.02, 4.48)^T \le 25.5 \qquad (5.21)


The next step is to standardize the data matrix to have zero mean and unit variance. The vector of standard deviations σ = (σ_1, .., σ_{10})^T of the data set (the square roots of the sample variances, i.e. of the diagonal of S) is calculated:

σ = (24.0383, 22.1675, 0.7452, 12.3646, 2.5105, 1.5853, 0.5104, 330.0507, 86.8327, 306.4119)^T \qquad (5.22)

Then the standardized data matrix Y = (Y_1, ..., Y_{10}) is calculated as

Y = \left( \frac{X_1 - \bar{X}_1}{\sigma_1}, \ \frac{X_2 - \bar{X}_2}{\sigma_2}, \ \cdots, \ \frac{X_{10} - \bar{X}_{10}}{\sigma_{10}} \right)^T \qquad (5.23)

Table 5.5 depicts the standardized values of the data set. The correlation matrix R of the data is then computed:

R =
\begin{pmatrix}
1 & 0.87 & 0.55 & 0.35  & 0.15  & 0.41  & 0.40 & 0.78  & 0.91  & -0.17 \\
  & 1    & 0.69 & -0.17 & -0.28 & 0.48  & 0.19 & 0.79  & 0.99  & -0.18 \\
  &      & 1    & -0.20 & -0.18 & 0.48  & 0.02 & 0.47  & 0.67  & -0.28 \\
  &      &      & 1     & 0.82  & -0.19 & 0.41 & 0.10  & -0.06 & 0.05  \\
  &      &      &       & 1     & -0.18 & 0.44 & -0.16 & -0.18 & -0.06 \\
  &      &      &       &       & 1     & 0.31 & 0.03  & 0.51  & -0.45 \\
  &      &      &       &       &       & 1    & 0.21  & 0.23  & 0.03  \\
  &      &      &       &       &       &      & 1     & 0.77  & 0.02  \\
  &      &      &       &       &       &      &       & 1     & -0.20 \\
  &      &      &       &       &       &      &       &       & 1
\end{pmatrix}

(the matrix is symmetric; only the upper triangle is shown).


Observations 1-10:

Var.   1       2       3       4       5       6       7       8       9       10
Y1     -1.209  -1.167  0.664   -0.252  -1.001  2.369   0.996   0.872   0.705   -0.751
Y2     -0.850  -0.896  -0.805  -0.805  -0.805  2.082   1.225   1.135   0.999   -0.535
Y3     -0.470  -1.812  -0.470  -0.470  -0.470  0.872   0.872   0.872   0.872   -0.470
Y4     -0.679  -0.679  2.717   1.100   -0.356  1.019   -0.437  -0.518  -0.437  -0.679
Y5     -0.697  -0.697  2.091   2.888   0.100   -0.299  -0.299  -0.299  -0.299  -0.697
Y6     -1.104  -0.473  0.158   -1.104  -1.104  -1.104  1.419   1.419   0.158   1.419
Y7     -1.078  -1.078  0.882   0.882   0.882   0.882   0.882   0.882   -1.078  0.882
Y8     -0.621  -0.612  -0.496  -0.497  -0.557  3.839   0.767   0.234   0.241   -0.486
Y9     -1.023  -1.000  -0.447  -0.701  -0.885  1.913   1.315   1.153   0.958   -0.643
Y10    -0.412  -0.407  -0.411  -0.412  3.201   0.664   -0.351  -0.397  -0.391  -0.376

Observations 11-20 (continued):

Var.   11      12      13      14      15      16      17      18      19      20
Y1     -0.501  -0.543  0.372   0.040   1.704   -0.626  -0.959  0.414   -1.084  -0.044
Y2     -0.174  -0.219  0.638   0.277   1.947   -0.896  -0.715  -0.805  -0.896  0.097
Y3     0.872   0.872   0.872   -0.470  0.872   -1.812  0.872   -0.470  -1.812  0.872
Y4     -0.679  -0.679  -0.437  -0.437  -0.356  0.534   -0.437  2.232   -0.356  -0.437
Y5     -0.697  -0.697  -0.299  -0.299  -0.299  -0.299  -0.299  1.693   -0.299  -0.299
Y6     0.158   0.158   0.158   0.158   1.419   -1.104  -1.104  0.158   -1.104  1.419
Y7     -1.078  -1.078  -1.078  -1.078  0.882   -1.078  -1.078  0.882   0.882   0.882
Y8     -0.343  -0.369  0.223   0.245   0.553   -0.621  -0.176  -0.495  -0.622  -0.208
Y9     -0.309  -0.355  0.601   0.508   1.994   -0.874  -0.758  -0.517  -1.000  0.071
Y10    -0.394  -0.395  -0.388  -0.364  -0.400  2.412   0.037   -0.412  -0.409  -0.395

Table 5.5: Standardized traffic matrix.

The eigendecomposition is then performed on R, and the results are rearranged so that the eigenvalue and eigenvector pairs are in descending order of variance. The extracted eigenvectors and eigenvalues of R are depicted in Tables 5.6 and 5.7, respectively.


       PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9     PC10
V1     -0.433  0.232   0.087   0.083   -0.046  -0.272  0.016   0.233   -0.393  0.678
V2     -0.461  -0.081  0.122   -0.016  -0.077  -0.086  0.319   0.148   -0.461  -0.646
V3     -0.358  -0.144  -0.114  0.188   0.758   0.408   -0.209  0.131   0.033   0.006
V4     0.017   0.611   0.023   0.238   0.048   -0.314  -0.507  0.264   0.145   -0.347
V5     0.076   0.588   -0.173  0.204   0.267   0.061   0.527   -0.468  -0.068  -0.004
V6     -0.273  -0.103  -0.562  -0.383  0.092   -0.423  -0.270  -0.434  -0.037  -0.049
V7     -0.144  0.424   -0.149  -0.654  -0.171  0.517   -0.016  0.226   0.041   0.004
V8     -0.372  0.069   0.427   0.145   -0.291  0.297   -0.354  -0.596  0.028   -0.004
V9     -0.464  -0.016  0.078   0.000   -0.055  -0.205  0.349   0.098   0.776   0.022
V10    0.134   0.060   0.639   -0.514  0.468   -0.271  0.013   -0.115  -0.013  0.001

Table 5.6: Eigenvectors.

Eigenvalues   Percentage   Cumulative Percentage
4.42783       44.27835     44.27835
2.32673       23.26732     67.54567
1.38682       13.86817     81.41384
0.82176       8.21757      89.63140
0.47378       4.73782      94.36922
0.33999       3.39988      97.76910
0.17495       1.74950      99.51860
0.04514       0.45142      99.97002
0.00298       0.02976      99.99977
0.00002       0.00023      100.00000

Table 5.7: Eigenvalues.

E =
\begin{pmatrix}
-0.046 & -0.272 & 0.016  & 0.233  & -0.393 & 0.678  \\
-0.077 & -0.086 & 0.319  & 0.148  & -0.461 & -0.646 \\
0.758  & 0.408  & -0.209 & 0.131  & 0.033  & 0.006  \\
0.048  & -0.314 & -0.507 & 0.264  & 0.145  & -0.347 \\
0.267  & 0.061  & 0.527  & -0.468 & -0.068 & -0.004 \\
0.092  & -0.423 & -0.270 & -0.434 & -0.037 & -0.049 \\
-0.171 & 0.517  & -0.016 & 0.226  & 0.041  & 0.004  \\
-0.291 & 0.297  & -0.354 & -0.596 & 0.028  & -0.004 \\
-0.055 & -0.205 & 0.349  & 0.098  & 0.776  & 0.022  \\
0.468  & -0.271 & 0.013  & -0.115 & -0.013 & 0.001
\end{pmatrix}
\qquad (5.24)


The values in Table 5.7 suggest that the main PCA space is represented by the first four PCs, as their eigenvalues are high, or close to one, and their cumulative percentage of the total variance is close to 90%. Thus, the residual space E, Vector 5.24, is represented by the last six PCs.

To calculate the SPE values of the first phase, which are necessary to derive the detection threshold, we first need to project the standardized data onto the residual space. The resulting PC scores Z_R of the projection become:

Z_R = E^T Y \qquad (5.25)

Table 5.8 depicts the resulting new PC scores ZR of the projection.

Observations 1-10:

       1       2       3       4       5       6       7       8       9       10
Z1     -0.329  -1.287  0.205   0.280   1.217   -0.799  -0.058  0.093   0.345   -0.520
Z2     0.427   -0.398  -0.582  0.809   0.337   0.687   0.008   -0.038  -0.446  0.174
Z3     -0.042  0.061   -0.467  1.011   0.271   -0.555  0.073   0.215   0.430   -0.561
Z4     0.232   -0.218  0.142   -0.349  0.000   -0.116  -0.153  0.090   0.134   -0.315
Z5     -0.010  -0.054  0.021   -0.058  0.023   -0.032  0.060   -0.001  -0.047  -0.049
Z6     -0.004  0.015   0.002   -0.003  0.004   0.000   0.001   0.002   -0.001  -0.007

Observations 11-20 (continued):

       11      12      13      14      15      16      17      18      19      20
Z1     0.609   0.624   0.414   -0.550  -0.139  0.110   0.851   0.090   -1.560  0.406
Z2     0.122   0.139   -0.257  -0.664  -0.462  -1.428  0.800   -0.372  0.780   0.364
Z3     -0.285  -0.307  0.191   0.312   0.586   -0.084  -0.248  -0.459  0.250   -0.392
Z4     0.027   0.021   -0.022  -0.352  0.340   0.008   0.072   0.135   0.422   -0.098
Z5     -0.040  -0.039  -0.027  0.155   -0.017  -0.034  0.095   0.023   0.036   -0.004
Z6     -0.001  -0.001  -0.002  -0.004  0.002   -0.006  0.004   0.002   -0.004  0.001

Table 5.8: Scores of the residuals.

The calculation of the SPE statistics is performed using Equation 5.11 as follows:

SPE_i = \sum_{k=1}^{6} \frac{Z_{ki}^2}{l_k} \qquad (5.26)

where Z_{ki} and l_k here denote the score and eigenvalue of the kth residual-space component (the last six PCs).

Table 5.9 shows the resulting SPEs.


Observation No.   SPE Value
1                 2.6320
2                 15.9837
3                 3.1290
4                 12.1283
5                 4.8937
6                 5.1457
7                 1.8511
8                 0.5737
9                 3.0946
10                7.6549
11                1.9072
12                2.0082
13                1.1289
14                14.0203
15                5.4960
16                8.0783
17                7.7120
18                2.3092
19                12.3713
20                1.8821

Total                        114
Mean (M)                     5.7000
Standard Deviation (STD)     4.6692

Table 5.9: SPE values.

Finally, the upper control limit (UCL), or threshold, for the SPE values becomes:

UCL = M(SPE) + 3 \cdot STD(SPE) = 19.7075 \qquad (5.27)

5.4.2 Future Traffic Testing

Suppose that new traffic has been collected (Table 5.10) and that there is a need to test whether this traffic is new or has been seen before by the honeypot.


No.   V1    V2    V3   V4   V5   V6   V7   V8        V9      V10
1     87    66    2    21   1    0    2    1472.31   257     329.59
2     54    47    2    3    1    4    2    458.22    205     18.80
3     37    21    3    15   1    0    2    575.42    88.00   140.34
4     139   105   2    34   1    0    2    2911.23   473     323.84

Table 5.10: Future traffic matrix.

First, the new traffic is standardized using the mean \bar{X} and the standard deviations σ from the first phase (Vectors 5.17 and 5.22). Using Equation 5.23, the standardized version of the first new observation, Y_1, becomes:

Y_1 = \frac{X_1 - \bar{X}}{\sigma} =
\begin{pmatrix}
(87 - 30.05)/24.0383 \\
(66 - 19.85)/22.1675 \\
(2 - 1.35)/0.7452 \\
(21 - 8.4)/12.3646 \\
(1 - 1.75)/2.5105 \\
(0 - 1.75)/1.5853 \\
(2 - 1.55)/0.5104 \\
(1472.31 - 205.15)/330.0507 \\
(257 - 90.85)/86.8327 \\
(329.59 - 126.28)/306.4119
\end{pmatrix}
=
\begin{pmatrix}
2.369 \\ 2.082 \\ 0.872 \\ 1.019 \\ -0.299 \\ -1.104 \\ 0.882 \\ 3.839 \\ 1.913 \\ 0.664
\end{pmatrix}
\qquad (5.28)

Table 5.11 illustrates the standardized new traffic vectors.The next step is to project Y onto the residual space E, see Equation 5.24.

The new PC scores Z_F of the projections then become:

Z_F = E^T Y

Table 5.12 shows the PC scores of projecting the new observations onto the residual space.

Then, the new SPE values of the projected traffic are computed using Equation 5.11 as:

SPE_i = \sum_{k=1}^{6} \frac{Z_{ki}^2}{l_k}    (5.29)


No.       1       2       3       4
Y1    2.369   0.996   0.289   4.532
Y2    2.082   1.225   0.052   3.841
Y3    0.872   0.872   2.214   0.872
Y4    1.019  -0.437   0.534   2.070
Y5   -0.299  -0.299  -0.299  -0.299
Y6   -1.104   1.419  -1.104  -1.104
Y7    0.882   0.882   0.882   0.882
Y8    3.839   0.767   1.122   8.199
Y9    1.913   1.315  -0.033   4.401
Y10   0.664  -0.351   0.046   0.645

Table 5.11: Standardized future traffic matrix.

        Z1        Z2        Z3        Z4
1    -0.7987   -0.0580    1.0517   -2.3925
2     0.6866    0.0076    1.8852    0.4087
3    -0.5546    0.0734   -0.9945   -1.1682
4    -0.1163   -0.1529    0.6462   -1.4274
5    -0.0324    0.0601    0.1153    0.5134
6    -0.0002    0.0013    0.0443    0.0029

Table 5.12: New traffic PC scores (columns Z1-Z4 are the residual-space score vectors of the four new observations; rows are the six residual components).

SPE = [ 5.1457, 1.8511, 119.3752, 154.4763 ]^T    (5.30)

Finally, the SPE values of the new observations are compared with the threshold value computed using Equation 5.9 in stage one. The SPE values of observations 3 and 4 exceed the threshold of 19.7075 and are considered new attacks. As can be seen from Table 5.3, the first two observations of the traffic set have been seen before and are included in the training data, while the last two observations are new.
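Continuing the earlier sketch, the second-phase test could be expressed as follows (again Python/NumPy, an assumption; mean_train, std_train, E_res and eigvals_res are hypothetical names for the Phase I mean vector, standard deviation vector, residual-space eigenvector matrix and residual eigenvalues, most of which are not reproduced in this excerpt).

import numpy as np

def flag_new_attacks(X_new, mean_train, std_train, E_res, eigvals_res, ucl):
    # Standardize the new observations with the Phase I mean and standard
    # deviation (Equations 5.23 and 5.28).
    Y = (X_new - mean_train) / std_train
    # Project onto the residual space, Z_F = E^T Y (rows of Y are observations).
    Z = Y @ E_res                       # shape (n_new, q)
    # Equation 5.29: SPE_i = sum_k Z_{ki}^2 / l_k.
    spe = np.sum(Z**2 / eigvals_res, axis=1)
    return spe, spe > ucl               # flag observations exceeding the threshold

# Hypothetical usage: X_new would be the 4 x 10 matrix of Table 5.10 and the other
# arguments the Phase I parameters; observations whose SPE exceeds 19.7075
# (here, observations 3 and 4) would be reported as new attacks.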


Figure 5.6: Plot of SPE values of the training and testing traffic (x-axis: observations; y-axis: SPE magnitude; the SPE limit and the training and testing data sets are marked).

5.5 Results and Evaluation

Two and a half months of real attack data were used to build the detection model in Phase I, using Data Set I in Section 5.3.1. Parameters from this phase were then used to detect future attacks. The following subsections analyse the detection technique in terms of results, stability and performance, and evaluate the detected results.

5.5.1 Detection and Identification

Four months of real attack evaluation data were extracted from the honeypot environment, Data Set II in Section 5.3.1, and projected onto the residual space of the detection model. Figure 5.7 illustrates the SPE values of the projection.

The projection shows observations with high SPE values as spikes that rise above the threshold value. These observations were possibly new attacks and required further investigation. As the figure shows, 81 observations that violated the structure of the attack model were flagged by our detection algorithm. Moreover, the figure shows intense attack activity along the X axis around observation 5000. These activities reflect a single class of attack that


Figure 5.7: Plot of four-month attack data projected onto the residual space (x-axis: observations; y-axis: SPE magnitude).

one of our honeypot sensors experienced in late February and early March of 2008. Details of these attacks and the rest of the attack activities are discussed in Section 5.5.4.

5.5.2 Stability of the Monitoring Model Over Time

As our technique used a historical block of data to construct the PCA detection model, it is very important to evaluate the stability of the PCA model over time. A preliminary investigation, which included the stability of the estimated mean vectors and the correlation matrix, suggested that there was a slight variation in some of the variables' means and also in the amount of variance contained in the first seven components. The number of significant components was the same for both data sets; however, the amount of variation in the first seven components had increased slightly in Data Set II. These changes were not significant and had little effect on the residual space. Nevertheless, in the next chapter, Chapter 6, the current design is developed further, and an adaptive real-time detection model is devised that updates its parameters automatically over time to incorporate any changes in the traffic.


5.5.3 Computational Requirements

The detection model was developed using a combination of the open-source programming language Python [16] and the high-level scientific computing environment Matlab [10]. The Python language was utilized in developing the flow aggregator, while Matlab was used in the development of the detection engine. Tasks required by the detection model include: standardizing the data, calculating the correlation matrix, finding the eigenvectors and the eigenvalues, extracting the PC scores, and computing the M² and Q statistics.

If X is a data matrix of n samples of p-dimensional random variables, then the computational cost of computing the correlation matrix of X is O(np²), and the extraction of the eigenvectors and eigenvalues is O(p³) [125].
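As a rough illustration of these two costs (a sketch only; the timings in Table 5.13 were obtained with the thesis' own Matlab/Python implementation, not with this snippet), the operations can be reproduced and timed in Python/NumPy on synthetic data:

import time
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 18))          # synthetic data: n = 5000 samples of p = 18 features

t0 = time.perf_counter()
R = np.corrcoef(X, rowvar=False)         # correlation matrix of X: O(n p^2)
t1 = time.perf_counter()
eigvals, eigvecs = np.linalg.eigh(R)     # eigen-decomposition of R: O(p^3)
t2 = time.perf_counter()

print(f"correlation: {t1 - t0:.6f} s, eigendecomposition: {t2 - t1:.6f} s")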

Task involved in the detection                                         Time
Calculating the correlation matrix for all data                        0.070073
Finding the eigenvectors and eigenvalues of the correlation matrix     0.001322
Calculating the PC scores of X                                         0.003049
M² test of one vector for all PCs (18 PCs)                             0.000142
Q test of one vector for the residual (11 PCs)                         0.000088

Table 5.13: Average execution times of the major tasks (seconds).

The computational requirements are mainly matrix manipulations and are not considered expensive when taking into account the massive reduction in data records achieved by our flow aggregation technique. Moreover, to detect new attacks, only the Q statistic needs to be calculated, using parameters from the detection model (Phase I). Table 5.13 shows the empirical execution times required by a number of components of the detection model. The detection model was tested on a personal computer with a 2.0 GHz Intel dual-core processor and 2 GB of RAM.

5.5.4 Evaluation

In this section, we detail our evaluation methodology for the detection model using the data set described in Section 5.3.1. To help better understand the nature of the detected observations and judge their significance, a manual inspection was carried out for every observation that was


Activities Class     Distinct Behaviors                              Possible Type                          No.
Worm Activity I      Moderate: TF, TCP_O; Low: TCP_C, IAT            W32.Rahack.W worm                      26
Worm Activity II     Moderate: TF, TCP_C; Low: TCP_O                 Mydoom worm family                      6
Worm Activity III    Moderate: TF, TCP_C; High: AVG_PK_SIZE          Bobax worm family                       1
Worm Activity V      Low: TF; High: IAT                              Backdoor.Evivinc                        2
Denial of Service    High: TF, Dur, TCP_O, SPackets, T_ACT;          Distributed DOS or DOS                 21
                     Short: IAT
Scan Activities      Large: TF, TCP_C, SPackets;                     Horizontal scan or machine detection    2
                     Moderate: TCP_O, ICMP
Misconfiguration     Low: TF, UDP_C                                  DHCP request                            8
Miscellaneous        Low: TF, TCP_C, UDP_C                           Unknown                                15

Table 5.14: Classes of detected attack activities.

flagged by our detection algorithm, 81 observations in total. The aim was to explain the reasons for these observations being flagged by the detection algorithm and to group them according to their similarities into different classes. The process consisted of manual inspection and manual classification of these detected attack observations. Firstly, we examined all of the 18 traffic features that had originally been used by the algorithm, and then went further by checking the basic flows for other patterns of attack, such as destination ports, protocols, and flags. Moreover, flagged observations were also checked against the original traffic logs. Secondly, observations were grouped together into different types of attack clusters, based on their attack port similarities, or port sequences. The port sequence is a list of targeted honeypot ports that are generated by a single IP address during the attack period (see Section 3.1.2.1).

The manual inspection of the detected traffic found eight clusters of attack activities. Table 5.14 provides a brief summary of the clusters.

As Table 5.14 shows, there are four types of activities that were classified as worm attacks. The first class, Worm Activity I, was the largest, with a port


sequence (T139, T445, T9988, ICMP). This class represents repeated attempts targeting two open TCP ports, 139 and 445, and a single closed TCP port, 9988. These activities resemble the well-known Rahack worm [123], which targets Microsoft operating systems.

The second class of worm activities was distinguished by its port sequence (T1080, T3128, T80, T8080, T10080, ICMP). The pattern of these activities' port sequence is the same as for the Mydoom worm family [121]. The third class of worm activity, with port sequence (T445, T135, T1433, T139, T5000), is another type of automated exploit that targets a Microsoft Windows LSASS vulnerability [39]. The last worm class of activity targeted TCP port 5900. This class of activities mainly comprises scans for Trojans that listen for remote connections on TCP port 5900, such as Backdoor.Evivinc [122].

The denial of service activities class came second in terms of number of observations. The attacking IPs targeted a single machine on a single open TCP port, port 80, with very short times between packets. These attacks were detected by our algorithm because the total activity of each source IP was very large, in addition to other parameters such as the number of source packets sent. The attack was mainly caused by a few IP addresses over the period from 20/02/2008 to 04/03/2008.

The third class of activities detected by our model was scan activities. While low to moderate scanning activities were very common in our log files, these activities were flagged by our algorithm because they generated large values on single or multiple features.

The misconfiguration class of activities was mainly DHCP requests on UDP port 53. The last class of activities, miscellaneous, consisted of all observations that we were not able to explain and which did not fit in any other class. This class of activities represents short attacks on non-standard single TCP ports, single UDP ports, or both.

5.6 Summary

This chapter has presented a technique for detecting new attacks in low-interaction honeypot traffic. The proposed detection is performed in two phases. Firstly, an attack model is constructed and model parameters are estimated using principal component analysis from historical honeypot traffic. Secondly, new traffic vectors are projected onto the residual space of the PCA model, from the first phase, and


their square prediction error (SPE) statistics are computed. Traffic vectors are flagged as being new attacks if their SPE values exceed a predefined threshold. Traffic that has a large SPE value represents a new direction that has not been captured by the PCA attack model and which needs further investigation.

The effectiveness of the proposed technique is demonstrated through the analysis of real traffic data from the Leurré.com project. Results of the evaluation show that this technique is capable of detecting different types of attacks with no prior knowledge of these attacks. In addition, the technique has a low computational requirement, which makes it suitable for on-line detection systems.

The promising capability of the proposed technique, both in detecting new attacks and in requiring low computational resources, motivated our investigations into improving the technique further to suit an on-line monitoring system. Further investigation was required in order to overcome some of the limitations identified in the work described in this chapter, namely:

• the need for manual extraction of the model parameters, such as the number of PCs required by the main PCA model and the residual space;

• the use of a static PCA attack model to detect new attacks, since it is built from a fixed block of historical data; and

• the lack of a mechanism for improving the detection technique by inspecting traffic that exhibits high SPE values, either because it is new to the PCA model or because it is an extreme example of traffic that has been previously observed by our honeypot.

The next chapter describes how these limitations are overcome, how the detection model is automated to adapt to the dynamically changing nature of Internet attack traffic, and how a proof of concept detection system was implemented.


Chapter 6

Automatic Detection of New Attacks

Detecting emerging Internet threats in real time presents several challenges, which include the high volume of traffic and the difficulty of isolating legitimate from malicious traffic. As previously noted, an efficient way of collecting and detecting these threats is through deploying honeypots, since they are decoy computers that run no legitimate services and any contact with them can therefore be considered suspicious.

In the previous chapter, a technique was proposed for detecting new attacks in low-interaction honeypots using principal component analysis (PCA). The technique flags new attacks by detecting changes in the residual space of the PCA model. While the technique is very efficient in detecting these changes, it suffers from several limitations that make it inefficient for the real-time detection of anomalous honeypot traffic. In addition, Internet traffic is very dynamic and changes very rapidly, which necessitates a real-time detection model capable of capturing these changes automatically.

In this chapter, these limitations are addressed, and a real-time adaptive detection model is proposed that captures new changes and updates its parameters automatically. The main contributions of this chapter include:

• a method for automatic extraction of model parameters, such as the number of components that are representative of the main PCA space and the residual space, and threshold values;


• a method for automatic differentiation between two types of activities that exhibit high SPE values, as either genuinely new activities or extreme examples of some of the existing activities that have been observed by our honeypot before;

• a method for automatic update of the model correlation structure without the need to retain the old traffic data, based on the work of Li et al. [94]; and

• a proof of concept implementation of the proposed detection system for real-time and offline applications.

The remainder of the chapter is structured as follows. Section 6.1 provides an introduction and discusses the motivation behind this work. Section 6.2 details the methodology of constructing the attack model using principal component analysis. The model architecture is described in Section 6.3. Section 6.4 presents a proof of concept implementation of the proposed attack detection model. Experimental results are discussed in Section 6.5. Finally, Section 6.6 summarizes the chapter.

6.1 Introduction

No matter how extensive the training data used to extract the detection model, this data covers only a limited portion of the attack space and cannot be considered representative of the entire attack space. In addition, the attack space is very dynamic and changing, with new attacks reported every day. A major limitation of building an attack model based on historical data is that it produces a fixed model. One consequence of this limitation is a high number of false alarms, where previously seen attacks continue to be identified as new.

A reliable traffic detection model is required to capture new changes in Internet threats and to adapt to these changes automatically, which, in the context of our proposal, involves the following:

• automatic update of the mean and standard deviation vectors, and the correlation matrix;

• automatic extraction of the model parameters, such as the number of PCs that are representative of the main PCA model and the residual space; and


• automatic adjustment of the threshold values for flagging new attacks and eliminating extreme observations.

When a new block of traffic data becomes available, the PCA model needs to be updated using all accumulated traffic data up to that point in time, which requires storing all historical traffic data. Although this methodology is correct and works for small data sets or for short intervals, accumulating and handling large amounts of traffic is very difficult in terms of storage and computational requirements. Different methods exist for updating models over time, such as the exponentially weighted moving average (EWMA) and moving window schemes. However, EWMA uses a forgetting factor, or decay factor, that gives more weight to recent data while the weight of old data declines over time [99]. While this method works for many applications, it does not suit our need of detecting new attacks, because old attack data is neglected over time.

Alternatively, the detection model could be updated recursively, giving equal weight to old and new data. This method accounts for all data and only requires that the most recent block of data be retained. A complete algorithm for updating a generic PCA model recursively is described by Li et al. [94] and will be utilized to update the proposed detection model.

6.2 Principal Component Analysis Model

The proposed detection model operates in two stages. The initial stage involves standard PCA model extraction as described in the previous chapter, and the second stage performs recursive PCA for adaptive real-time monitoring. In the next sections, these two stages are described in detail.

6.2.1 Building the Initial PCA Detection Model

The methodology for building the attack model includes accumulating historical traffic data and then building the initial detection model out of this block of traffic data. The main steps (from Chapter 5) are briefly summarized here. First, historical honeypot traffic is grouped as the initial traffic data set block. Let X be this initial data matrix of n observations of p variables, where X̄ = (X̄_1, .., X̄_p)^T is the sample mean vector of X, σ = (σ_1, .., σ_p) is the sample standard deviation vector of X, and R is the sample correlation matrix. Then, principal component analysis can be expressed as:


Z = A^T X    (6.1)

where Z contains the principal component scores obtained by projecting the observations in X onto the eigenvector matrix A.

The previous equation can be expressed in the original coordinates by projecting Z back onto A, so that X becomes:

X = \sum_{i=1}^{k} A_i Z_i + E = \hat{X} + E    (6.2)

where the residual matrix E represents the difference between X and \hat{X}:

E = X - \hat{X}    (6.3)

The Q-statistic, or the square prediction error, is defined as [75]:

Q = E^T E    (6.4)

where the square prediction error measures the sum of squares of the distance of E from the main space defined by the PCA model.

Alternatively, the square prediction error can be calculated as:

SPE = \sum_{i=k+1}^{p} \frac{Z_i^2}{l_i}    (6.5)

An observation is considered new to the PCA model if its Q-statistic exceeds a predefined threshold limit.

Finally, the upper control limit (UCL), or threshold, for the SPE value is computed from the empirical distribution of the historical data set as:

UCL = M(SPE) + 3 * STD(SPE)    (6.6)

where M is the mean and STD is the standard deviation.
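A minimal sketch of this initial stage in Python/NumPy (an assumption; the HoneyEye implementation described in Section 6.4.2 is written in Matlab). The 90% cumulative-variance criterion for splitting the main and residual spaces follows the choice made in Chapter 5; all function and variable names are illustrative.

import numpy as np

def build_initial_model(X, var_explained=0.90):
    # Standardize the historical block X (n observations x p features).
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Y = (X - mean) / std
    # Correlation matrix and its eigendecomposition (Equations 6.1 and 6.10).
    R = np.corrcoef(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                 # descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Main PCA space: first k PCs reaching the cumulative-variance target;
    # the remaining p - k PCs form the residual space (Equation 6.5).
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_explained)) + 1
    Z = Y @ eigvecs
    spe = np.sum(Z[:, k:]**2 / eigvals[k:], axis=1)
    # 3-sigma empirical control limit (Equation 6.6).
    ucl_spe = spe.mean() + 3 * spe.std(ddof=1)
    return dict(mean=mean, std=std, eigvecs=eigvecs, eigvals=eigvals,
                k=k, ucl_spe=ucl_spe)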

6.2.2 Recursive Adaptation of the Detection Model

In the previous section, it was shown how the initial detection model was constructed from a block of historical data X. To update the PCA model when a new block of data becomes available, the following model parameters need to be


recursively updated: the mean vector, the standard deviation vector, the correlation matrix, and the threshold. Our notation for the recursive updating of the PCA detection model conforms to the work described by Li et al. [94], who proposed a recursive PCA algorithm for adaptive process monitoring.

Let X_1 be the first block of data, of n_1 observations, that is used to build the detection model; then the sample mean vector X̄_1 becomes:

\bar{X}_1 = \frac{1}{n_1} (X_1)^T I_1    (6.7)

where I_1(n_1, 1) = (1, 1, .., 1)^T.

Then, the standardized data matrix Y_1 becomes:

Y_1 = (X_1 - I_1 \bar{X}_1^T) \Sigma_1^{-1}    (6.8)

where \Sigma_1 = diag(\sigma_{1.1}, .., \sigma_{1.p}) of the sample standard deviation vector \sigma_1 = (\sigma_{1.1}, .., \sigma_{1.p}), which can be estimated as:

\sigma_{1.i}^2 = \frac{1}{n_1 - 1} \| X_1(:, i) - I_1 \bar{X}_1(i) \|^2    (6.9)

where X_1(:, i) is the ith column of the matrix X_1.

The correlation matrix R_1 is calculated using:

R_1 = \frac{1}{n_1 - 1} Y_1^T Y_1    (6.10)

Let X_k be the current data block, of n_k observations, which has been used to estimate the detection model, with sample mean vector X̄_k, sample standard deviation σ_k, standardized data matrix Y_k, and sample correlation matrix R_k. To augment a new block of data X_{k+1} to the current model, the recursive calculations of the new model parameters X̄_{k+1}, σ_{k+1}, Y_{k+1}, R_{k+1} become:

\bar{X}_{k+1} = \frac{N_k}{N_{k+1}} \bar{X}_k + \frac{1}{N_{k+1}} (X_{k+1})^T I_{k+1}    (6.11)

where I_{k+1}(n_{k+1}, 1) = (1, 1, .., 1)^T and N_k = \sum_{i=1}^{k} n_i.

The standard deviation vector \sigma_{k+1} = (\sigma_{k+1.1}, .., \sigma_{k+1.p}) is estimated as:


\sigma_{k+1.i}^2 = \sigma_{k.i}^2 + \frac{N_k}{N_{k+1} - 1} \Delta\bar{X}_{k+1}^2(i) + \frac{1}{N_{k+1} - 1} \| X_{k+1}(:, i) - I_{k+1} \bar{X}_{k+1}(i) \|^2    (6.12)

where \Delta\bar{X}_{k+1} = \bar{X}_{k+1} - \bar{X}_k.

The standardized data matrix Y_{k+1} becomes:

Y_{k+1} = (X_{k+1} - I_{k+1} \bar{X}_{k+1}^T) \Sigma_{k+1}^{-1}    (6.13)

The correlation matrix R_{k+1} is computed as:

R_{k+1} = \frac{N_k - 1}{N_{k+1} - 1} \Sigma_{k+1}^{-1} \Sigma_k R_k \Sigma_k \Sigma_{k+1}^{-1} + \frac{N_k}{N_{k+1} - 1} \Sigma_{k+1}^{-1} \Delta\bar{X}_{k+1} \Delta\bar{X}_{k+1}^T \Sigma_{k+1}^{-1} + \frac{1}{N_{k+1} - 1} Y_{k+1}^T Y_{k+1}    (6.14)
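The following sketch (Python/NumPy, an assumption) shows how the mean and standard deviation vectors can be carried forward when a new block is augmented, without retaining the old observations. The mean update follows Equation 6.11 directly; the variance is pooled with the standard exact two-block identity, which serves the same purpose as Equation 6.12; the correlation matrix would be updated analogously using Equation 6.14. Names are illustrative.

import numpy as np

def update_mean_std(mean_k, std_k, N_k, X_new):
    # Augment a new block X_new (n_{k+1} x p) into the running mean/std over
    # N_k previous observations, without retaining the old data.
    n_new = X_new.shape[0]
    N_next = N_k + n_new
    mean_new_block = X_new.mean(axis=0)
    # Equation 6.11: weighted combination of the old mean and the new block mean.
    mean_next = (N_k / N_next) * mean_k + (n_new / N_next) * mean_new_block
    # Exact pooling of the sums of squared deviations about the updated mean.
    delta = mean_next - mean_k
    ss_old = (N_k - 1) * std_k**2 + N_k * delta**2
    ss_new = ((X_new - mean_next)**2).sum(axis=0)
    std_next = np.sqrt((ss_old + ss_new) / (N_next - 1))
    return mean_next, std_next, N_next

# Quick check against a direct computation on the concatenated data:
rng = np.random.default_rng(2)
A, B = rng.normal(size=(100, 3)), rng.normal(size=(40, 3))
m, s, N = update_mean_std(A.mean(0), A.std(0, ddof=1), len(A), B)
full = np.vstack([A, B])
assert np.allclose(m, full.mean(0)) and np.allclose(s, full.std(0, ddof=1))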

6.2.3 Setting the Thresholds

Two types of limits have been utilized in building the detection model: a Mahalanobis distance limit for eliminating extreme observations and robustifying the analysis, and an SPE limit for detecting new attacks. As mentioned previously in Chapter 5, no assumptions are made here about honeypot traffic distributions. This leads us to derive the robustification and detection limits from the empirical distribution of the historical data set. The SPE and Mahalanobis upper control limits were derived according to the following equations (3-sigma):

UCL(SPE) = M(SPE) + 3 * STD(SPE)
UCL(T^2) = M(T^2) + 3 * STD(T^2)    (6.15)

where T² is equivalent to the Mahalanobis distance for n = 1 and will henceforth be used to denote this statistic, M is the mean, and STD is the standard deviation.

When new traffic data becomes available and the criteria for updating the PCA model are met, based on number of days or number of packets, the PCA model is recursively updated. As a result, the control limits change, and this necessitates the update of these control limits every time the PCA model is updated.


Figure 6.1: Adaptive detection model process flow (components: real-time honeypot traffic, flow aggregation and feature extraction, PCA model generation, detection parameters, new attack detection, and a new attacks repository).

The recursive calculation of these control limits conforms to Equation 6.11 for the recursive calculation of the mean and Equation 6.12 for the recursive calculation of the variance.

6.3 Model Architecture

Detecting new attacks in low-interaction honeypot traffic is achieved by detecting changes in the PCA model. The detection of these changes is achieved by a statistical test of the new traffic's projection onto the predefined PCs' residuals using the square prediction error (SPE). Traffic that violates the structure of the PCA model produces high SPE values and represents a new direction that has not been captured by the detection model.

In the previous chapter, a model was presented that utilizes PCA and the SPE in detecting new attacks. However, the model is static and does not anticipate new changes in the monitored traffic. For real-time monitoring of honeypot traffic, the model needs to feed back any new changes that might occur and recalculate the model parameters accordingly. Figure 6.1 shows the process flow of the proposed adaptive attack model.

As the figure shows, the system consists of three main functions:

• Traffic Flow Aggregator: the traffic flow aggregator accepts Argus traffic flows, set to a 5-minute maximum expiration, and groups them into what we call activity flows. The newly generated flows are combined by the source IP address of the attacker, with a maximum of 60 minutes inter-arrival time between the original flows as generated by Argus.

• PCA Model Generator: the PCA model generator consists of two components, the initial PCA and the recursive PCA model generators. It provides tools for the analysis of honeypot traffic and for the generation and update of the principal component analysis model.

• New Attack Detection Engine: the detection engine works in two modes, a live mode where traffic flows arrive in real time from the honeypot and an offline mode where traffic flows are read from a file.

6.3.1 Detecting New Attacks and Updating the Model

While our main detection criterion is observations with high SPE values, our experiments show that some of the extreme observations also have high SPE values. Since the aim is to detect only new attacks, traffic is verified using the T² statistic, defined as follows:

T^2 = n (X_i - \bar{X})^T S^{-1} (X_i - \bar{X})    (6.16)

where n = 1 for testing a single observation. Traffic that has a high T² value is considered extreme, that is, existing traffic that has high values across some or all of the variables, and is eliminated.

New traffic is first tested using Equation 6.16 and is discarded if its T² statistic exceeds the predefined threshold (see Section 6.2.3). Figure 6.2 illustrates the new attack detection steps. As the figure shows, only attack traffic with a low T² statistic is retained to further update the model, while traffic with a high SPE value is flagged as being a new attack.

Figure 6.2: Detecting new attacks (the T² statistic of the new traffic is computed first; traffic with a high T² is discarded, otherwise its SPE is computed; a high SPE flags a new attack, otherwise the traffic is accumulated for the model update).
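A sketch of the decision logic shown in Figure 6.2, in Python/NumPy (an assumption; model is a hypothetical dictionary of Phase I parameters like the one produced by the earlier sketch, extended with the inverse sample covariance matrix S_inv and the two control limits ucl_t2 and ucl_spe).

import numpy as np

def classify(x, model):
    # Decision logic of Figure 6.2 for a single activity-flow vector x.
    d = x - model["mean"]
    t2 = d @ model["S_inv"] @ d              # Equation 6.16 with n = 1
    if t2 > model["ucl_t2"]:
        return "extreme"                     # existing traffic, discarded
    y = d / model["std"]                     # standardized observation
    z_res = y @ model["eigvecs"][:, model["k"]:]
    spe = np.sum(z_res**2 / model["eigvals"][model["k"]:])
    if spe > model["ucl_spe"]:
        return "new attack"                  # flagged for further investigation
    return "known"                           # accumulated for the next model update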


6.3.2 Model Sensitivity to New Attacks

The residual space is very sensitive to new traffic not present in the main PCA model during the model building phase. As a result, this new traffic results in the production of high SPE values. However, when these attacks are considered during the model update stage and the model parameters are recalculated, the residual space's sensitivity decreases dramatically and eventually approaches zero over time as the frequency of inclusion of these attacks in the main model increases.

Figure 6.3: Residual space sensitivity to new attacks (x-axis: number of occurrences of the observation; y-axis: SPE magnitude; the SPE value and the SPE limit are plotted).

Figure 6.3 illustrates the sensitivity of the residual space to a single new attack vector (see Section 6.6 for more details of this example). The first projection of the new observation produced a high SPE value (119.4). Then, to test the sensitivity of the residual space to a new attack, the new attack was included in the training data for the first time and the PCA model was recalculated. The same attack's projection onto the updated residual space generated a very low SPE value (14.6). Further inclusion of the same attack into the PCA model resulted in declining SPE values, which reached the value of 0.26 after the 10th inclusion.

The previous trial indicates that the inclusion of the new attack in the model for the first time would lower the SPE values of similar attacks. However, as the previous illustration shows, the new attack's SPE value is still high after the first inclusion, but it decreases further after several inclusions in the PCA model. In conclusion, this trial supports our decision to adapt the detection model recursively


and periodically in a way that accounts for historical data.

6.4 A Proof of Concept Implementation

A proof-of-concept system was implemented for aggregating flows, and for analyzing, visualizing, and monitoring honeypot activities. The following sections present different aspects of this system in more detail.

6.4.1 Flow Aggregator

The flow aggregator is a utility, developed using the Python language, that filters and aggregates basic flows into activity flows by combining basic flows by the source IP address of the attacker, with a maximum of 60 minutes inter-arrival time between basic flows. The flow aggregator works in two modes: offline aggregation for use with traffic log files, and live aggregation to support on-line monitoring of honeypots.

The flow aggregator accepts Argus Client flows [2] and then processes them to produce activity flows. In the live mode, the flow aggregator creates a socket and listens on TCP port 6000 for connections from the monitoring station, where flows are exported in real time as they arrive from the Argus client.

In addition, in offline mode, the flow aggregator generates detailed information to help the system operator interpret the detected observations. The detailed file contains additional information for every generated activity flow, which includes a list of the original basic flows, the start and end times of the attack, and the protocols and ports that are targeted.
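A condensed sketch of the aggregation rule in Python (the language of the actual utility, although the record fields, function name and data structures shown here are illustrative and not the utility's real interface): basic flows are grouped per attacking source IP, and a gap of more than 60 minutes between consecutive basic flows closes the current activity flow.

from collections import defaultdict
from datetime import timedelta

GAP = timedelta(minutes=60)   # maximum inter-arrival time between basic flows

def aggregate(basic_flows):
    # Group basic (Argus) flows into activity flows per attacking source IP.
    # basic_flows: iterable of dicts with at least 'src_ip', 'start' and 'end'
    # (datetime objects), assumed sorted by start time.
    open_flows, activity_flows = {}, defaultdict(list)
    for flow in basic_flows:
        ip = flow["src_ip"]
        current = open_flows.get(ip)
        if current and flow["start"] - current[-1]["end"] > GAP:
            activity_flows[ip].append(current)      # close the activity flow
            current = None
        if current is None:
            current = []
            open_flows[ip] = current
        current.append(flow)
    for ip, current in open_flows.items():          # flush the remaining open flows
        activity_flows[ip].append(current)
    return activity_flows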

6.4.2 Monitoring Desktop: HoneyEye

The HoneyEye monitoring system provides a complete system for analyzing and detecting attacks in a low-interaction honeypot environment. The HoneyEye system was developed using the high-level scientific computing environment Matlab [10]. Matlab was selected for building the system because of its capabilities in matrix manipulation, statistics, and general mathematics. The system implements three models: PCA extraction, residual analysis, and monitoring (see Figure 6.4).

The PCA extraction model provides the following functionality:


• it imports activity files into the system;

• it provides a robustification mechanism to clean the log files of extreme activities that might mislead the analysis;

• it performs standard PCA extractions with the capability to inspect the extracted eigenvalues, eigenvectors, and principal component scores; and

• it provides visual tools to inspect the scree plot and pairs of principal components.

The residual analysis model contributes the following to the analysis:

• it provides a means to inspect and test the automatically generated number of components that constitutes the main PCA model and eventually the residual space;

• it provides visual inspection of the SPE values and the threshold;

• it provides the capability to select different threshold values, based on a three-sigma, Chi-square, or user-defined threshold; and

• it saves the results in a file to be used by the monitoring model.

The monitoring model has the following capabilities. It:

• Imports saved detection parameters, log traffic to be tested, and an investigation file for further interpretation of the detected attacks;

• Provides offline detection of attacks from log files;

• Monitors remote honeypots in a real-time mode through a TCP/IP network;

• Adapts to new changes in traffic, based on the number of packets or the number of days;

• Provides visual projection of the attack that shows the SPE values and the threshold limit; and

• Provides an investigation panel that depicts detailed information of the detected attacks.


Figure 6.4: HoneyEye interface.

6.4.3 Deployment Scenario: Single Site

The detection system works in two modes: real-time monitoring of traffic, and offline analysis based on logs of collected traffic data. The system architecture consists of two parts: the honeypot sensor and the monitoring station. Figure 6.5 illustrates an overview of the system components of the proposed deployment architecture.

The honeypot sensor is based on the open-source low-interaction honeypot Honeyd [22]. It runs on a single Unix host and emulates three operating systems at the same time. Also, on the same machine, Argus [2] is configured to capture all packets sent to and from the honeypots when there is interaction with attackers targeting its IP addresses.

The monitoring station (HoneyEye) provides a complete system for analyzing and detecting attacks in a low-interaction honeypot environment (see Section 6.4.2). It connects to a remote honeypot sensor in real-time mode through a TCP/IP network and detects new attacks.


Figure 6.5: Overview of a real-time deployment (components shown: Internet, router, firewall, switch, internal network; the honeypot sensor, a Unix host running Honeyd virtual honeypots emulating Windows 2000 Professional, Windows 2000 Server and Unix, together with an Argus client/server; and the monitoring station with its flow aggregator, PCA model generation and detection functions, fed by the real-time honeypot flows).

6.4.4 Limitation

There are some limitations to this approach. Firstly, the current implementation of the system only monitors one honeypot environment at a time in real-time mode. However, the system can accept more than one log file and integrate the results automatically in offline mode. Secondly, the performance of manually inspecting individual observations becomes slow as the investigation file grows. A further improvement would be to store these details in a local database, which would improve the system performance significantly. Thirdly, the current implementation provides two ways of updating the model, based on the number of days or the number of packets. While these two criteria are useful in adapting the model, a new and robust means of updating the model needs to be developed to determine the right point in time to do this. Finally, the current system does not provide any classification mechanism for the detected attacks.

6.5 Experimental Results

To test the proposed detection method, two data sets were used: a training data set to establish the initial principal component analysis model and extract the initial detection parameters, and a testing data set to illustrate the efficiency of the adaptation methodology (see Table 6.1 for more details of these data sets). Both of the data sets used in this analysis came from the Leurré.com project.


The raw traffic data was then processed, according to the steps described in Section 5.3.2, to extract activity flows of honeypot traffic, since activity flows are the basic input to our analysis. The training data set used in this chapter is identical to the training data used in the previous chapter. This data set proved to be reliable, as the sensor had been up and running during the whole period, and it was adequate in size [23]. Since our aim was to test the detection model's capability to capture and adapt to changes in honeypot traffic, the use of this older data set was justified for initializing the PCA model. The testing data was a combination of the testing data used in Chapter 5 and the most recent data available when this research was conducted. The purpose of the testing in this chapter is to discover whether attack activity patterns change over time. Therefore, the continuity of the testing data set was the primary concern.

Data Set         Start Date   End Date     Packets   Standard Flows   Activity Flows
Model Building   15/09/2007   30/11/2007    839663          562470             5401
Model Testing    01/12/2007   31/07/2008   2892182         1914719            13293

Table 6.1: Summary of the data sets.

The initial PCA model was constructed according to the procedure discussed in Section 6.2.1. The initial extraction of the eigenvectors and eigenvalues of the correlation matrix R identified the two PCA spaces: the main PCA space, which consisted of the first seven PCs with 90% cumulative percentage of the total variance, and the residual space E, which was represented by the last eleven PCs. The threshold limit for the SPE was set to 243.296, while the T² limit was set to 387.3372, both obtained according to Equation 6.15.

In the following subsections, we present a discussion of the results obtained by projecting eight months of real traffic data onto the initial detection model. At first, the data was projected without updating the model (no adaptation), and then the data was projected with the model being updated every 14 days (14-day adaptation). The purpose of doing this was to test the model's capabilities in capturing and adapting to changes in traffic data to eventually reduce the number


of false alarms. The evaluation of the actual detection technique is discussed in Chapter 5.

6.5.1 Projection of the Testing Data: No Adaptation

Projection of the testing traffic data with no adaptation is shown in Figure 6.6. The upper part of the figure shows the SPE values of projecting the traffic flows onto the residual space, where 51 observations that penetrate the upper control limit show high SPE values while at the same time having low T² values.

Figure 6.6: Detection charts with no adaptation: using the SPE statistic (upper chart) and using the T² statistic (lower chart); both charts plot the statistic's magnitude (log scale) against the observations, with the control limit, attacks, and new attacks or extremes marked.

The lower part of the figure shows the T² statistic for examining extreme observations. As the chart shows, there are 111 observations with high T² values. The introduction of the T² statistic, as shown in the lower part of the figure, demonstrates its importance in improving the detection technique's capability and in differentiating attacks that exhibit high SPE values while at the same time having high T² values. The T² statistic has helped in isolating traffic that is genuinely new to the PCA model from existing traffic with extreme values.


6.5.2 Projection of the Testing Data: With Adaptation

Projection of the same data once more, but with 14-day adaptation, is shown in Figure 6.7. The selection of 14 days for adapting the detection model was experimental, in the absence of other information. However, it gave good adaptation results and represented a trade-off between traffic size and the computational power required to handle it. Determination of criteria for selecting an optimal value for adapting the model is left for future work.

As the figure shows, observations with high SPE values are reduced and are evenly distributed over the monitored period, with a total of 30 observations (reduced from 51). Also, Figure 6.7 shows the elimination of several clusters of activities that appear along the X axis in Figure 6.6 around observations 4,000, 7,000, 10,000 and 13,000, as they are caused by fewer attacks, which had been accounted for by the model after the adaptation process.

Figure 6.7: Two detection charts with 14-day adaptation: using the SPE statistic (upper chart) and using the T² statistic (lower chart); both charts plot the statistic's magnitude (log scale) against the observations, with the control limit, attacks, and new attacks or extremes marked.

In terms of the T² statistic, the number of extreme observations, with high T² values, increased from 111 before the adaptation to 182 observations after the adaptation. The increase in the number of observations with high T² values is mainly caused by the decrease in the T² threshold value, and partially by the changes in the PCA correlation structure as a result of augmenting more data, which includes new attacks, and by the decrease in most of the variable means.

The adaptation process shows the importance of augmenting new data in reducing the number of new attacks, as represented by the SPE statistic. The increase in the number of extreme activities, as represented by the T² statistic, points to an Internet phenomenon: most of the attacks against systems are not new, but rather repetitions of old attacks.

6.5.3 The Effects of Adaptation on Threshold Values

The effect of adaptation on the SPE and T² thresholds is shown in Figure 6.8. The figure shows a trend of decreasing thresholds for both statistics. The decreasing SPE threshold is mainly caused by the augmentation of detected attacks, which is reflected in the correlation structure of the data. The decrease in the T² threshold is also a result of the change in the correlation structure and the reduction in the variables' means.

Figure 6.8: SPE and T² limit evolution over time using 14-day adaptation (threshold value plotted against the update number for both limits).

6.5.4 The Effect of Adaptation on Variables

The effect of adaptation on the variables is depicted in Figure 6.9. As the figure shows, all variables show declining trends in their mean values, except for the average packet size. This declining trend in most of the variables reflects the


growing size of the data set, where most of the extreme values are systematically reduced, which is desirable for the stability of the detection model. This trend is expected to flatten as the number of augmented blocks of data increases.

Figure 6.9: SPE limit evolution over time using 14-day adaptation (top) along with the mean values over time of six selected variables: Total Flows, Duration, Source Packets, Source Bytes, SRATE, and Average Packet Size.

6.6 Summary

While the residual space of principal component analysis is very efficient in detecting new attacks, it suffers from many limitations when used for a real-time monitoring system. These include the limited scope of the fixed historical data sets used to train the model, which may therefore not be representative of the attack space, and the ever-changing nature of attack traffic. Such limitations hinder the reliability of the technique in monitoring honeypots in real-time mode, as the number of new attacks increases gradually. They were overcome by updating the detection model gradually, through augmenting the newly collected traffic into the detection model.

Experimental results show that the proposed method is very effective in capturing changes in attack behaviors, in reducing the number of traffic flows with high SPE values, and in adjusting the detection thresholds. In addition, the method is very


practical and has low storage requirements, as only the most recent block of traffic data needs to be retained for updating the model. The computational requirement of updating the model is not a critical factor, as the time interval between updates is large (it can be days) and the amount of accumulated traffic in each interval is small.


Chapter 7

Conclusion and Future Work

Extracting useful attack patterns from low-interaction honeypot traffic is a challenging research problem. Several factors contribute to this challenge, which include: the small amount of fine-grained information in the collected traffic, large traffic volumes, and Internet noise. Added to these challenges is the lack of research in low-interaction honeypot traffic analysis, which the literature review has indicated is a predominantly manual, ad hoc activity.

This research aimed to improve traffic analysis of low-interaction honeypots for better detection of anomalous Internet traffic. The research sought to:

• Improve the current analysis of honeypot traffic for better extraction of attack patterns;

• Research better analysis techniques that:

– suit the type of data that is collected by low-interaction honeypots, which is considered suspicious by definition and contains less detail;

– are able to handle multidimensional data, or data with a large number of variables;

– detect new attacks automatically, with reduced human intervention;

– capture new trends and adapt to the dynamic nature of Internet attacks; and

– have computational requirements that are suitable for real-time applications.


This research has resulted in a number of significant contributions in each of these directions. Section 7.1 reviews contributions and research directions for using packet inter-arrival time (IAT) distributions to improve traffic clusters. Section 7.2 reviews contributions and research directions for structuring attackers' activities in low-interaction honeypot traffic using principal component analysis. Section 7.3 reviews contributions and research directions for detecting new attacks in honeypot traffic.

7.1 Improving the Leurré.com Clusters

The Leurré.com project's approach to clustering honeypot traffic has generated a large number of clusters, with some of these clusters sharing common attack features. The large number of clusters is due to several factors, which include the low-interaction nature of the Leurré.com project platforms and the specific approach that has been adopted for handling and analyzing the honeypot traffic. The large number of clusters has made it very difficult to reach accurate conclusions about the exact nature of these clusters and to identify the tools that generated them.

To address these issues, a study was carried out to improve the interpretation of these clusters by grouping clusters that share similar types of activities. In Chapter 3, a methodology was presented that overcomes the weaknesses in Leurré.com's clustering algorithm by grouping clusters that share similar packet inter-arrival time (IAT) distributions. Using these IAT distributions, a number of cliques, or groups of clusters, were generated, which represent a variety of interesting activities that target the Leurré.com platforms.

The cliques were characterized into three major types: Type I cliques, which represent classes of attacks where the attacker tries a given attack repeatedly; Type II cliques of attacks exhibiting similar IAT characteristics but targeting different ports; and Type III cliques that describe behaviors of sources which scan a large number of devices across the Internet and repeatedly return to the same platforms.

Areas for future work include developing a method to automate the process of extracting these IATs and generating the cliques. These results could be incorporated to improve the clustering algorithm in a systematic way. Additional work is also required to investigate the evolution of the discovered cliques over time. Also, further research is needed to improve the analysis and increase our insight into Internet threats through studying ways of constructing different types of cliques


based on other cluster’s features, such as the geographical location of the honeypotplatform and the attacker’s source IP, and then correlating these cliques together.

7.2 Structuring Honeypot Traffic

In Chapter 3, it was shown that the Leurré.com project's approach to handling and analyzing honeypot traffic has resulted in a large number of clusters, indeed over 27,000. Chapter 4 presented a new approach for manipulating honeypot traffic that bypasses the Leurré.com project clusters. The new approach is based on principal component analysis (PCA), a multivariate statistical technique.

This study has demonstrated the power of principal component analysis (PCA) in detecting the structure of attackers' activities in low-interaction honeypot traffic and in decomposing the traffic into seven dominant clusters. These dominant clusters comprise targeted attacks against open ports, scan activities, spam or misconfiguration, repeated short activities, detection activities, targeted attacks against open UDP ports, and short attacks.

In addition, scatter plots of the PCs proved to be very efficient in visualizing the interrelationships between the detected activities and in spotting their clusters. This study has also shown the usefulness of principal component analysis in detecting different types of extreme attack activities through graphical inspection of the first few and last few principal components' plots.

Future work should investigate the global structure of all honeypot traffic, as the traffic used in the study came from a single honeypot platform, the Australian platform, due to the availability of those traffic logs. While all the platforms are identical in their configurations, different types of activities might be observed. It would be of high research value to investigate and compare the structures of data collected from different platforms in different geographical locations. Another area for future research is to investigate the stability of the detected structure over time, since Internet traffic is very dynamic and changes rapidly.

7.3 Detecting New Attacks

In Chapter 5, a technique was presented for detecting new attacks in low-interaction honeypot traffic. This detection methodology utilized the residual space of principal components in detecting new attacks through measuring the square prediction


error (SPE) values of their projections onto the residual space. The SPE is a statistical test that measures the distance of an observation's residual from the main PCA space defined by the model. Traffic that has a large SPE value represents a new direction that has not been captured by the PCA model and therefore needs further investigation. The technique is capable of detecting different types of attacks with no prior knowledge of these attacks and has a low computational requirement.

While the residual space of principal component analysis is very efficient in detecting new attacks, it suffers from some limitations when used for a real-time monitoring system, namely the high number of false alarms, or previously seen attacks recurring as being new, and the lack of an automated method of extracting the model parameters. In Chapter 6, these limitations were addressed and an automatic detection model was proposed, based on the general algorithm proposed by Li et al. [94], that adapts recursively to changes in traffic through augmenting the newly collected traffic into the detection model. In addition, a proof of concept implementation of the proposed detection system for real-time and offline applications was demonstrated.

One area for future research would be to improve the interpretation of the detected traffic by modeling the residual space and eventually identifying the correct class of the attack [64]. Another area would be to develop a methodology for building a global PCA model for monitoring honeypot traffic, where the model incorporates traffic data from geographically distributed honeypot sensors in an adaptive way [118, 80].

Further work is also required to extend the model's capability, such as integrating the detection model with a high-interaction honeypot to enable the creation of attack signatures and to perform further monitoring and logging of the detected attacks. Work is also needed on integrating the detection model's results into network security devices, such as firewalls, to protect production networks and block active attacks.

7.4 Conclusion

This research has highlighted the challenges of analyzing Internet traffic collected by low-interaction honeypots to detect anomalies. New methods for improving honeypot traffic analysis to detect new attacks were proposed. The contributions of the proposed methods to honeypot traffic analysis, and to security more generally, are improved techniques for detecting and characterizing anomalous Internet traffic and, consequently, better protection of production networks.


Appendix A

Matlab Code

A.1 Extracting the Principal Components

This code computes the principal components from the correlation matrix and generates the principal scores, eigenvectors, and eigenvalues.

function [PCS, EigVec, EigVal] = R_PCA(X1, R1)
%R_PCA Compute the principal components from the correlation matrix R1 and
% return the principal scores, eigenvectors, and eigenvalues.
p = size(X1,2);
[EigVec, EigVal] = eig(single(R1));
EigVal = diag(EigVal);           % extract the diagonal elements
% Order the eigenvalues and eigenvectors in descending order
EigVal = flipud(EigVal);
EigVec = EigVec(:,p:-1:1);
PCS = X1*EigVec;                 % principal component scores
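A minimal usage sketch, assuming X is a traffic matrix that has already been normalized (for example with R_Normalize from Section A.6); the placeholder data and dimensions are illustrative only.

% Usage sketch with placeholder data (assumed already normalized)
X = randn(500, 10);
R = corrcoef(X);                        % sample correlation matrix
[PCS, EigVec, EigVal] = R_PCA(X, R);    % scores, eigenvectors, eigenvalues
explained = EigVal / sum(EigVal);       % proportion of variance per component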

A.2 Robustification Using Squared Mahalanobis Distance M^2

This Matlab code accepts a traffic matrix and removes outliers based on the algorithm described in Chapter 5.


function [data, Outliers] = R_Robust_Mah(data, N_Iterations)
%R_ROBUST_MAH Remove outliers from a traffic matrix using the squared
% Mahalanobis distance (algorithm described in Chapter 5).
[n,p] = size(data);      % n: number of observations (rows), p: number of variables (columns)
if nargin == 1
    N_Iterations = 1;    % default number of iterations
end
i = 0;                   % iteration counter
while (i < N_Iterations)
    Outliers = [];       % outliers to be trimmed
    i = i + 1;
    [n,p] = size(data);
    m = mean(data,1);
    S = inv(cov(data));
    d = zeros(n,1);
    for j = 1:n
        d(j) = (data(j,:)-m)*S*(data(j,:)-m)';
    end
    Outliers = find(d > (mean(d)+3*std(d)));
    if isempty(Outliers) || (n < p), break, end
    data(Outliers,:) = [];   % discard the outliers and repeat
end
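A minimal usage sketch with placeholder data; the choice of three trimming iterations is an assumption for illustration.

% Usage sketch: trim outliers before estimating the PCA model parameters
X = randn(500, 10);                        % placeholder traffic matrix
[X_clean, out_idx] = R_Robust_Mah(X, 3);   % X_clean is trimmed; out_idx holds the last pass's outliers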

A.3 Estimate Parameters

This code estimates the mean, standard deviation, and correlation matrix. It also normalizes the data matrix.


function [X1, Bk1, Nk1, Sk1, R1] = R_Estimate_Parmeters(Xk1, Bk0, Nk0, Sk0, R0)
%R_ESTIMATE_PARMETERS Estimate the mean, standard deviation, and correlation
% matrix of a data block and return the normalized data.
if nargin == 1
    % First block: no previous statistics available
    [Bk1, Nk1] = R_Mean(Xk1);                              % mean
    Sk1 = R_Variance(Bk1,Nk1,Xk1);                         % std
    X1  = R_Normalize(Xk1,Bk1,Sk1);                        % normalization
    R1  = R_Correlation(X1);                               % correlation
else
    % Subsequent blocks: update the statistics recursively
    [Bk1, Nk1] = R_Mean(Xk1,Bk0,Nk0);                      % mean
    Sk1 = R_Variance(Bk1,Nk1,Xk1,Bk0,Nk0,Sk0);             % std
    X1  = R_Normalize(Xk1,Bk1,Sk1);                        % normalization
    R1  = R_Correlation(X1,Nk0,Nk1,R0,Sk0,Sk1,Bk0,Bk1);    % correlation
end

A.4 Recursive Mean

This Matlab code computes the mean of the data matrix recursively.


function [Bk1, Nk1] = R_Mean(Xk1, Bk0, Nk0)
%R_MEAN Recursive calculation of the mean over blocks of data
% Xk1: new block of data
% Bk0: cumulative mean of the previous blocks
% Nk0: cumulative size of the previous blocks
if nargin == 1
    Nk0 = 0;
    Bk0 = zeros(size(Xk1,2),1);
end
Nk1 = Nk0 + size(Xk1,1);                    % updated cumulative size
Lk1 = ones(size(Xk1,1),1);
Bk1 = (Nk0/Nk1)*Bk0 + (1/Nk1)*Xk1'*Lk1;     % updated cumulative mean
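As a quick sanity check of the recursion (placeholder data, for illustration only), the recursively updated mean over two blocks should agree with the batch mean of the concatenated data.

% Sanity-check sketch with placeholder blocks
Xa = randn(100, 10);  Xb = randn(250, 10);
[B1, N1] = R_Mean(Xa);            % mean of the first block
[B2, N2] = R_Mean(Xb, B1, N1);    % fold in the second block
max(abs(B2 - mean([Xa; Xb], 1)')) % should be close to zero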

A.5 Recursive Variance

This Matlab code computes the variance of the data matrix recursively; its square root, the standard deviation, is returned.

function [Sk1] = R_Variance(Bk1, Nk1, Xk1, Bk0, Nk0, Sk0)
%R_VARIANCE Recursive calculation of the variance; the square root
% (the standard deviation) is returned.
% Sk0: standard deviation accumulated over the previous blocks
N = size(Xk1,2);
L = ones(size(Xk1,1),1);
if nargin == 3
    % First block: no previous statistics available
    Bk0 = zeros(size(Xk1,2),1);
    Sk0 = zeros(size(Xk1,2),1)';
    Nk0 = 0;
end
D = Bk1 - Bk0;
Sk1 = [];
for i = 1:N
    a = (Nk0-1)*Sk0(i)^2;
    b = Nk0*D(i)^2;
    c = norm(Xk1(:,i) - L*Bk1(i))^2;
    Sk1 = [Sk1, sqrt((a + b + c)/(Nk1-1))];
end


A.6 Recursive Normalization

This Matlab code normalizes the data matrix using the recursively updated mean and standard deviation.

function [X1] = R_Normalize(Xk1, Bk1, Sk1)
%R_NORMALIZE Subtract the mean and divide by the standard deviation.
Lk1 = ones(size(Xk1,1),1);
X1  = (Xk1 - Lk1*Bk1') * inv(diag(Sk1));


Bibliography

[1] Argos emulator home page. http://www.few.vu.nl/argos/, Last Visited June 2009.

[2] Argus client. http://www.qosient.com/argus/, Last Visited Feb. 2008.

[3] The Common Vulnerabilities and Exposures (CVE) home page. http://cve.mitre.org/, Last Visited Feb. 2009.

[4] DShield home page. http://www.dshield.org/index.php, Last Visited Feb. 2009.

[5] The GNU Netcat home project. http://netcat.sourceforge.net/, Last Visited June 2009.

[6] HoneyC web site. http://sourceforge.net/projects/honeyc, Last Visited June 2009.

[7] Hping home page. http://www.hping.org/, Last Visited Oct. 2008.

[8] The Internet Engineering Task Force. http://www.ietf.org/, Last Visited June 2009.

[9] The Leurré.com project home page. http://www.leurrecom.org, Last Visited Oct. 2008.

[10] Matlab R2007b. http://www.mathworks.com, Last Visited Feb. 2009.

[11] The MITRE HoneyClient project team. http://www.honeyclient.org/trac, Last Visited June 2009.

[12] Nepenthes - finest collection. http://nepenthes.carnivore.it/, Last Visited June 2009.


[13] Nessus vulnerability scanner home page. http://www.nessus.org/nessus/, Last Visited Oct. 2008.

[14] Network telescope home page. http://www.caida.org/analysis/security/telescope/, Last Visited Feb. 2009.

[15] NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/, Last Visited June 2009.

[16] Python programming language 2.5.1. http://www.python.org/.

[17] QEMU. http://www.qemu.org/, Last Visited June 2009.

[18] SAX home page. http://www.cs.ucr.edu/~eamonn/SAX.htm, Last Visited Dec. 2009.

[19] Team Cymru darknet home page. http://www.cymru.com/Darknet/, Last Visited May 2009.

[20] University of Michigan: Internet Motion Sensor home page. http://ims.eecs.umich.edu/, Last Visited Feb. 2006.

[21] The User-Mode Linux kernel home page. http://user-mode-linux.sourceforge.net, Last Visited Feb. 2009.

[22] VMware Server 2. http://www.vmware.com/, Last Visited Feb. 2008.

[23] Determining the minimum sample size of audit data required to profile user behavior and detect anomaly intrusion. International Journal of Business Data Communications and Networking, 2(3):31–45, July-Sept 2006.

[24] Tcpdump home page, 2007. http://www.tcpdump.org, Last Visited April 2007.

[25] A. Lakhina, M. Crovella, and C. Diot. Characterization of network-wide anomalies in traffic flows. In ACM SIGCOMM Internet Measurement Conference, 2004.

[26] Kulsoom Abdullah, Chris Lee, Greg Conti, John A. Copeland, and John Stasko. IDS RainStorm: Visualizing IDS alarms. In VizSEC 2005, October 2005.


[27] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C., USA, 1993.

[28] Ejaz Ahmed, Andrew Clark, and George Mohay. A novel sliding window based change detection algorithm for asymmetric traffic. In Proceedings of the 2008 IFIP International Conference on Network and Parallel Computing, Shanghai, China, 2008.

[29] Levenshtein Association. The Levenshtein algorithm. http://www.levenshtein.net/, Last Visited June 2009.

[30] Michael Bailey, Evan Cooke, Farnam Jahanian, Jose Nazario, and David Watson. The Internet Motion Sensor: A distributed blackhole monitoring system. In The 12th Annual Network and Distributed System Security Symposium, San Diego, California, 2005.

[31] D. Barbará, J. Couto, S. Jajodia, L. Popyack, and N. Wu. ADAM: Detecting intrusions by data mining. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, USA, June 2001.

[32] D. Barbará, N. Wu, and S. Jajodia. Detecting novel network intrusions using Bayes estimators. In Proceedings of the 1st SIAM International Conference on Data Mining (SDM'01), 2001.

[33] P. Barford and D. Plonka. Characteristics of network traffic flow anomalies. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, USA, November 2001.

[34] Paul Barford, Jeffery Kline, David Plonka, and Amos Ron. A signal analysis of network traffic anomalies. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop, 2002.

[35] Vic Barnett and Toby Lewis. Outliers in Statistical Data. Wiley, 3rd edition, 1994.


[36] Robert M. Bell, Yehuda Koren, and Chris Volinsky. The BellKor 2008 solution to the Netflix Prize. Statistics Research Department at AT&T Research, 2008.

[37] Steven M. Bellovin. A look back at "Security problems in the TCP/IP protocol suite". In 20th Annual Computer Security Applications Conference (ACSAC), 2004.

[38] S. Bersimis, S. Psarakis, and J. Panaretos. Multivariate statistical process control charts: an overview. Quality and Reliability Engineering International, 23(5):517–543, 2006.

[39] Bitdefender. Win32.Worm.Bobax, May 2004.

[40] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo. The maximum clique problem. In Handbook of Combinatorial Optimization, volume 4, pages 1–74. Kluwer Academic Publishers, Boston, MA, 1999.

[41] I. M. Bomze, M. Pelillo, and V. Stix. Approximating the maximum weight clique using replicator dynamics. IEEE Transactions on Neural Networks, 11:1228–1241, 2000.

[42] Yacine Bouzida, Frederic Cuppens, Nora Cuppens-Boulahia, and Sylvain Gombault. Efficient intrusion detection using principal component analysis. In 3ème Conférence sur la Sécurité et Architectures Réseaux (SAR), La Londe, France, June 2004.

[43] Coen Bron and Joep Kerbosch. Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM, 16(9):575–577, 1973.

[44] Tian Bu, Aiyou Chen, Scott Vander Wiel, and Thomas Woo. Design and evaluation of a fast and robust worm detection algorithm. In Proceedings of IEEE INFOCOM'06, 2006.

[45] Bill Cheswick. An evening with Berferd: In which a cracker is lured, endured, and studied. In USENIX, Jan. 1990.

[46] Kimberly C. Claffy, Hans-Werner Braun, and George C. Polyzos. A parameterizable methodology for internet traffic flow profiling. IEEE Journal on Selected Areas in Communications, 13(8):1481–1494, 1995.


[47] Gregory Conti and Kulsoom Abdullah. Passive visual fingerprinting of network attack tools. In Proceedings of the ACM Workshop on Visualization and Data Mining for Computer Security, pages 45–54, Washington DC, USA, 2004. ACM.

[48] Evan Cooke, Michael Bailey, Z. Morley Mao, David Watson, Farnam Jahanian, and Danny McPherson. Toward understanding distributed blackhole placement. In Proceedings of the 2004 ACM Workshop on Rapid Malcode, pages 54–64, Washington DC, USA, 2004. ACM Press.

[49] Weidong Cui, Vern Paxson, Nicholas C. Weaver, and R. H. Katz. Protocol-independent adaptive replay of application dialog. In The 13th Annual Network and Distributed System Security Symposium (NDSS), 2006.

[50] David Dagon, Xinzhou Qin, Guofei Gu, Wenke Lee, Julian Grizzard, John Levine, and Henry Owen. HoneyStat: Local worm detection using honeypots. In Recent Advances in Intrusion Detection (RAID), pages 39–58, France, 2004.

[51] Esphion. Packet vs flow-based anomaly detection. White Paper, 2005. Available from: www.esphion.com/.

[52] Laura Feinstein, Dan Schnackenberg, Ravindra Balupari, and Darrell Kindred. Statistical approaches to DDoS attack detection and response. In Proceedings of the DARPA Information Survivability Conference and Exposition (DISCEX'03), 2003.

[53] Bernhard Flury. A First Course in Multivariate Statistics. Springer Texts in Statistics. Springer, New York, 1997.

[54] Imola Fodor. A survey of dimension reduction techniques. Lawrence Livermore National Laboratory, Technical Report UCRL-ID-148494, 2002.

[55] The Internet Engineering Task Force. User datagram protocol (RFC 768), August 1980. http://tools.ietf.org/html/rfc768, Last Visited June 2009.

[56] The Internet Engineering Task Force. Internet control message protocol (RFC 792), September 1981. http://tools.ietf.org/html/rfc792, Last Visited June 2009.

[57] The Internet Engineering Task Force. Internet protocol (RFC 791), September 1981. http://tools.ietf.org/html/rfc791, Last Visited June 2009.


[58] The Internet Engineering Task Force. Transmission control protocol (RFC 793), September 1981. http://tools.ietf.org/html/rfc793, Last Visited June 2009.

[59] The Internet Engineering Task Force. Requirements for internet hosts – communication layers (RFC 1122), October 1989. http://tools.ietf.org/html/rfc1122, Last Visited June 2009.

[60] Behrouz A. Forouzan. TCP/IP Protocol Suite. McGraw-Hill, 2nd edition, 2002.

[61] Fyodor. Top 75 security tools. http://www.insecure.org/tools.html, Last Visited Feb. 2006.

[62] Fyodor. Nmap security scanner, 2008. http://nmap.org/, Last Visited June 2009.

[63] Carrie Gates, Michael Collins, Michael Duggan, Andrew Kompanek, and Mark Thomas. More NetFlow tools: For performance and security. In Proceedings of the 18th Large Installation Systems Administration Conference (LISA 2004), 2004.

[64] Janos Gertler, Weihua Li, Yunbing Huang, and Thomas McAvoy. Isolation enhanced principal component analysis. AIChE Journal, 45(2):323–334, February 1999.

[65] R. Gnanadesikan. Methods for Statistical Data Analysis of Multivariate Observations. Wiley-Interscience Publication, New York, 2nd edition, 1997.

[66] Yiming Gong. Detecting worms and abnormal activities with NetFlow, September 2004. http://www.securityfocus.com/infocus/1796, Last Visited June 2009.

[67] Rohitha Goonatilake, Ajantha Herath, Suvineetha Herath, Susantha Herath, and Jayantha Herath. Intrusion detection using the chi-square goodness-of-fit test for information assurance, network, forensics and software security. J. Comput. Small Coll., 23(1):255–263, 2007.

[68] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 1st edition, 2000.

[69] Simon Hansman and Ray Hunt. A taxonomy of network and computer attacks. Computers and Security, 24(1):31–43, 2005.


[70] Honeynet Project. Know your enemy: Gen II honeynets, May 2005. http://old.honeynet.org/papers/gen2/, Last Visited June 2009.

[71] John D. Howard. An Analysis of Security Incidents on the Internet 1989–1995. Doctor of Philosophy thesis, Carnegie Mellon University, 1997.

[72] Alefiya Hussain, John Heidemann, and Christos Papadopoulos. A framework for classifying denial of service attacks. In SIGCOMM '03: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 99–110, New York, NY, USA, 2003. ACM.

[73] The SANS Institute. Internet Storm Center. http://dshield.org/, Last Visited May 2009.

[74] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer, New York, 2nd edition, 2002.

[75] J. Edward Jackson. A User's Guide to Principal Components. Wiley-Interscience, 1st edition, 2003.

[76] Yu Jin, Zhi-Li Zhang, Kuai Xu, Feng Cao, and Sambit Sahu. Identifying and tracking suspicious activities through IP gray space analysis. In Proceedings of the 3rd Annual ACM Workshop on Mining Network Data, San Diego, California, USA, 2007.

[77] J. D. Jobson. Applied Multivariate Data Analysis Volume 1: Regression and Experimental Design. Springer-Verlag, 1991.

[78] Klaus Julisch. Data mining for intrusion detection: A critical review. In Daniel Barbara and Sushil Jajodia, editors, Applications of Data Mining in Computer Security. Kluwer Academic Publisher, Boston, 2002.

[79] Klaus Julisch. Clustering intrusion detection alarms to support root cause analysis. ACM Transactions on Information and System Security (TISSEC), 6(4):443–471, November 2003.

[80] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 3(4):422–448, November 2001.


[81] Myung-Sup Kim, Hun-Jeong Kong, Seong-Cheol Hong, Seung-Hwa Chung, and J. W. Hong. A flow-based method for abnormal network traffic detection. In Proceedings of the IEEE/IFIP Network Operations and Management Symposium (NOMS 2004), Seoul, Korea, April 2004.

[82] Sven Krasser, Gregory Conti, Julian Grizzard, Jeff Gribschaw, and Henry Owen. Real-time and forensic network data analysis using animated and coordinated visualization. In The 2005 IEEE Workshop on Information Assurance, United States Military Academy, West Point, NY, 2005.

[83] Christian Kreibich and Jon Crowcroft. Honeycomb - creating intrusion detection signatures using honeypots. In Proceedings of the Second Workshop on Hot Topics in Networks, 2003.

[84] D. Kumlander. An exact algorithm for the maximum-weight clique problem based on a heuristic vertex-coloring. In Computation and Optimization in Information Systems and Management Sciences, pages 202–208. Hermes Science Publishing, 2004.

[85] D. Kumlander. A new exact algorithm for the maximum-weight clique problem based on a heuristic vertex-coloring and a backtrack search. In Proceedings of the 4th International Conference on Engineering Computational Technology, pages 137–138. Civil-Comp Press, 2004.

[86] Khaled Labib and V. Rao Vemuri. An application of principal component analysis to the detection and visualization of computer network attacks. Annals of Telecommunications, pages 218–234, Nov-Dec 2005.

[87] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. In ACM SIGMETRICS, 2004.

[88] Anukool Lakhina, Mark Crovella, and Christophe Diot. Diagnosing network-wide traffic anomalies. In SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. ACM, 2004.

[89] Kiran Lakkaraju, William Yurcik, and Adam J. Lee. NVisionIP: NetFlow visualizations of system state for security situational awareness. In VizSEC/DMSEC '04: Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security, pages 65–72, New York, NY, USA, 2004. ACM.

[90] Chris Lee, Jason Trost, Nick Gibbs, Raheem Beyah, and John Copeland. VisualFirewall: A firewall visualization tool for network management and security analysis. In VizSEC 2005, October 2005.

[91] Wenke Lee and Salvatore J. Stolfo. Data mining approaches for intrusion detection. Computer Science Department, Columbia University, July 2001.

[92] Corrado Leita and Marc Dacier. SGNET: a worldwide deployable framework to support the analysis of malware threat models. In 7th European Dependable Computing Conference (EDCC 2008), Lithuania, May 7-9 2008.

[93] Corrado Leita, Ken Mermoud, and Marc Dacier. ScriptGen: An automated script generation tool for honeyd. In 21st Annual Computer Security Applications Conference (ACSAC), Dec 2005.

[94] Weihua Li, H. Henry Yue, Sergio Valle-Cervantes, and S. Joe Qin. Recursive PCA for adaptive process monitoring. Journal of Process Control, 10(5):471–486, 2000.

[95] Yihua Liao and V. Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Computers and Security, 21(5):439–448, 2002.

[96] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, USA, 2003.

[97] Tom Liston. LaBrea: "sticky" honeypot and IDS. http://labrea.sourceforge.net/labrea-info.html, Last Visited June 2009.

[98] Robert L. Mason and John C. Young. Multivariate Statistical Process Control with Industrial Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.

[99] Douglas C. Montgomery. Introduction to Statistical Quality Control. John Wiley & Sons, Inc, New York, 4th edition, 2004.


[100] David Moore, Colleen Shannon, Douglas J. Brown, Geoffrey M. Voelker, and Stefan Savage. Inferring Internet denial-of-service activity. ACM Transactions on Computer Systems, 24(2):115–139, 2006.

[101] Gerhard Munz and Georg Carle. Real-time analysis of flow data for network attack detection. In Proceedings of the IFIP/IEEE Symposium on Integrated Management 2007 (IM 2007), Munich, Germany, 2007.

[102] Jose Nazario. Defense and Detection Strategies against Internet Worms. Artech House, Norwood, MA, 2004.

[103] Honeytrap project home page. http://sourceforge.net/projects/honeytrap/, Last Visited June 2009.

[104] M. Pavan and M. Pelillo. A new graph-theoretic approach to clustering and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.

[105] Van-Hau Pham, Marc Dacier, Guillaume Urvoy-Keller, and Taoufik En-Najjary. The quest for multi-headed worms. In 5th Conference on Detection of Intrusions and Malware & Vulnerability Assessment, Paris, France, July 10-11 2008.

[106] Dave Plonka. FlowScan: A network traffic flow reporting and visualization tool. In LISA '00: Proceedings of the 14th USENIX Conference on System Administration, pages 305–318, Berkeley, CA, USA, 2000. USENIX Association.

[107] Georgios Portokalidis and Herbert Bos. SweetBait: Zero-hour worm detection and containment using low- and high-interaction honeypots. Computer Networks, 51(5):1256–1274, 2007.

[108] F. Pouget, M. Dacier, and V. H. Pham. Towards a better understanding of internet threats to enhance survivability. In International Infrastructure Survivability Workshop 2004 (IISW'04), Lisbon, Portugal, 2004.

[109] F. Pouget, M. Dacier, V. H. Pham, and H. Debar. Honeynets: Foundations for the development of early warning information systems. In The Cyberspace Security and Defense: Research Issues. Springer-Verlag, 2005.


[110] F. Pouget, M. Dacier, J. Zimmerman, A. Clark, and G. Mohay. Internet attack knowledge discovery via clusters and cliques of attack traces. Journal of Information Assurance and Security, 1(1):21–32, 2006.

[111] Fabien Pouget. Distributed System of Honeypot Sensors: Discrimination and Correlative Analysis of Attack Processes. Thesis, the Ecole Nationale Supérieure des Télécommunications, January 2006. Available through the Eurécom Institute library (www.eurecom.fr).

[112] Fabien Pouget and Marc Dacier. Honeypot-based forensics. In AusCERT Asia Pacific Information Technology Security Conference, 2004.

[113] Fabien Pouget, Marc Dacier, and Hervé Debar. Honeypots, a practical means to validate malicious fault assumptions. In 10th Pacific Rim International Symposium on Dependable Computing, Tahiti, French Polynesia, March 3-5 2004.

[114] Fabien Pouget, Guillaume Urvoy-Keller, and Marc Dacier. Time signatures to detect multi-headed stealthy attack tools. In 18th Annual FIRST Conference, June 25-30 2006.

[115] The Honeynet Project. Know your enemy: Honeynets, May 2006. http://old.honeynet.org/papers/honeynet/index.html, Last Visited Feb. 2009.

[116] Niels Provos. A virtual honeypot framework. In 13th USENIX Security Symposium, Aug 2004.

[117] Niels Provos and Thorsten Holz. Virtual Honeypots: From Botnet Tracking to Intrusion Detection. Addison Wesley Professional, 2007.

[118] H. Qi, T. Wang, and D. Birdwell. Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining, chapter 19, pages 327–342. CRC Press, 2004.

[119] Guangzhi Qu, S. Hariri, and M. Yousif. Multivariate statistical analysis for network attacks detection. In AICCSA '05: Proceedings of the ACS/IEEE 2005 International Conference on Computer Systems and Applications, page 9, Washington, DC, USA, 2005. IEEE Computer Society.


[120] Alvin Rencher. Methods of Multivariate Analysis. 2nd edition, 2002.

[121] Symantec Security Response. W32.mydoom.b, Jan. 2004.

[122] Symantec Security Response. Backdoor.evivinc, Feb. 2007.

[123] Symantec Security Response. W32.rahack.w, Jan. 2007.

[124] James Riordan, Diego Zamboni, and Yann Duponchel. Building and deploying Billy Goat, a worm-detection system. In Proceedings of the 18th FIRST Conference, Baltimore, United States, June 2006.

[125] Sam Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, volume 10, pages 626–632. MIT Press, 1998.

[126] Challa S. Sastry, Sanjay Rawat, Arun K. Pujari, and V. P. Gulati. Network traffic analysis using singular value decomposition and multiscale transforms. Information Sciences: an International Journal, 177(23):5275–5291, 2007.

[127] Christian Seifert, Ramon Steenson, Thorsten Holz, Bing Yuan, and Michael Davis. Know your enemy: Malicious web servers, August 2007.

[128] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM'03), pages 172–179, 2003.

[129] Lance Spitzner. Honeypots: Tracking Hackers. Addison-Wesley, 2003.

[130] Stuart Staniford, Vern Paxson, and Nicholas Weaver. How to 0wn the Internet in your spare time. In The 11th USENIX Security Symposium, 2002.

[131] Symantec. Symantec global Internet security threat report: Trends for July-December 07. Technical report, April 2008.

[132] Cisco Systems. Cisco IOS NetFlow, 2007. http://www.cisco.com, Last Visited Feb. 2008.


[133] J. Terrell, K. Jeffay, L. Zhang, H. Shen, Z. Zhu, and A. Nobel. Multivariate SVD analyses for network anomaly detection. In Proceedings of the ACM SIGCOMM Conference, Poster Session, Philadelphia, PA, USA, 2005.

[134] Olivier Thonnard and Marc Dacier. A framework for attack patterns discovery in honeynet data. In The 8th Digital Forensics Research Conference (DFRWS'08), Baltimore, MD, USA, 2008.

[135] Ruey Tsay, Daniel Pena, and Alan Pankratz. Outliers in multivariate time series. Biometrika, 87(4):789–804, Dec. 2000.

[136] Pal Varga. Analyzing packet interarrival times distribution to detect network bottlenecks. In EUNICE 2005: Networks and Applications Towards a Ubiquitously Connected World, pages 17–29. IFIP International Federation for Information Processing, 2006.

[137] Michael Vrable, Justin Ma, Jay Chen, David Moore, Erik Vandekieft, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Scalability, fidelity, and containment in the Potemkin virtual honeyfarm. SIGOPS Operating Systems Review, 39(5):148–162, 2005.

[138] Haining Wang, Danlu Zhang, and Kang G. Shin. Detecting SYN flooding attacks. In Proceedings of the IEEE INFOCOM, New York, NY, USA, 2002.

[139] Wei Wang, Xiaohong Guan, and Xiangliang Zhang. Processing of massive audit data streams for real-time anomaly intrusion detection. Computer Communications, 31(1):58–72, 2008.

[140] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Sam King. Automated web patrol with Strider HoneyMonkeys: Finding web sites that exploit browser vulnerabilities. In The 13th Annual Network and Distributed System Security Symposium (NDSS'06), San Diego, California, USA, 2006.

[141] Edward Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85(411):664–675, 1990.

[142] Xuxian Jiang and Dongyan Xu. Collapsar: A VM-based architecture for network attack detention center. In Proceedings of the 13th USENIX Security Symposium (Security 2004), San Diego, CA, USA, August 2004.


[143] Guanhua Yan, Zhen Xiao, and Stephan Eidenbenz. Catching instant messaging worms with change-point detection techniques. In Proceedings of the USENIX Workshop on Large-Scale Exploits and Emergent Threats, 2008.

[144] Fyodor Yarochkin, Ofir Arkin, and Meder Kydyraliev. Xprobe2 - active OS fingerprinting tool, 2008.

[145] Nong Ye and Qiang Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 17:105–112, 2001.

[146] Nong Ye, Syed Masum Emran, Qiang Chen, and Sean Vilbert. Multivariate statistical analysis of audit trails for host-based intrusion detection. IEEE Transactions on Computers, 51(7):810–820, 2002.

[147] Nong Ye, Sean Vilbert, and Qiang Chen. Computer intrusion detection through EWMA for autocorrelated and uncorrelated data. IEEE Transactions on Reliability, 52:75–82, 2003.

[148] Vinod Yegneswaran, Chris Alfeld, Paul Barford, and Jin-Yi Cai. Camouflaging honeynets. In Proceedings of the IEEE Global Internet Symposium, 2007.

[149] Vinod Yegneswaran, Paul Barford, and Johannes Ullrich. Internet intrusions: Global characteristics and prevalence. In ACM SIGMETRICS, 2003.

[150] L. Zhang, J. S. Marron, H. Shen, and Z. Zhu. Singular value decomposition and its visualization. Journal of Computational and Graphical Statistics, 16(4):833–854, 2007.

[151] Jacob Zimmermann, Andrew Clark, George Mohay, Fabien Pouget, and Marc Dacier. The use of packet inter-arrival times for investigating unsolicited internet traffic. In First International Workshop on Systematic Approaches to Digital Forensic Engineering (SADFE'05), 2005.