SEVENTH FRAMEWORK PROGRAMME
Area ICT-2009.1.4 (Trustworthy ICT)

Visual Analytic Representation of Large Datasets for Enhancing Network Security

D2.2 Data collection infrastructure

Contract No. FP7-ICT-257495-VIS-SENSE

Workpackage: WP 2 - Network Data Collection Infrastructure
Author: Olivier Thonnard
Version: 1
Date of delivery: M18
Actual Date of Delivery: M18
Dissemination level: Restricted
Responsible: SYMANTEC
Data included from: SYM, EUR, IGD

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°257495.

The VIS-SENSE Consortium consists of:

Fraunhofer IGD (Project coordinator), Germany
Institut Eurecom, France
Institut Telecom, France
Centre for Research and Technology Hellas, Greece
Symantec Ltd., Ireland
Universitat Konstanz, Germany

Contact information:
Dr Jorn Kohlhammer
Fraunhofer IGD
Fraunhoferstraße 5
64283 Darmstadt
Germany

e-mail: [email protected]: +49 6151 155 646

Contents

1 Introduction
1.1 Data collection infrastructure
1.2 Interfaces and interactions with upper layers
2 Interfaces - WAPI v2
2.1 WAPI refresher
2.1.1 WAPI concepts
2.1.2 Design and rationales
2.2 Improvements introduced in WAPI v2
2.2.1 Code maintainability
2.2.2 Long-lived interactions
2.2.3 WAPI 2.0
2.3 WAPI over JAVA
3 VIS-SENSE Data Sets
3.1 Honeypot traces (SGNET)
3.1.1 Overview
3.1.2 Data schema exposed
3.1.3 Objects, Methods and References
3.2 Client-side threats (HARMUR)
3.2.1 Overview
3.3 Spamtrap data (SpamCloud)
3.3.1 Overview
3.4 BGP data sets (Spamtracer - BGPDB)
3.4.1 Control-plane Data - BGPDB
3.4.2 Forwarding-plane Data - SpamTracer
4 Interaction with Upper Layers (Preview)
4.1 Introduction
4.2 Two-Level Cluster/Prototype Representation
4.3 TRIAGE-as-a-Service
5 Conclusions

Abstract

The deliverable D2.2 describes the data collection infrastructure developed by the VIS-SENSE partners in the context of Work Package 2. WP2 aims at building a unified measurement platform to enable the correlation of several heterogeneous sources of information, and to enable the analyses defined in the two user scenarios (as defined in D1.2 - Use case analysis and user requirements). The integrated information sources fall into two different categories. The first family corresponds to infrastructure-related information that pertains to the Internet routing protocol and will be used primarily in the BGP analysis scenario. The second set of information sources relates to various threats observed in the Internet (e.g., client-side threats, server code injections and spam data), which will serve as input to the user scenario on the visualization of the Internet threat landscape.

In this deliverable, we describe the design and implementation of the VIS-SENSE database infrastructure. More specifically, we describe the information available in each data set, how the selected data relates to the previously defined user scenarios, and which specific attributes and methods are being exposed over a remotely accessible interface called WAPI. Finally, we explain how the raw data was enriched, clustered and indexed based on the way it will be eventually manipulated by the upper layers of the VIS-SENSE framework.

Deliverable D2.2 and the associated software prototypes contribute to reaching the milestone “M4: VIS-SENSE data collection infrastructure”.

1 Introduction

The main objective of Work Package WP2 is to define and collect the appropriate inputs representing both normal and malicious Internet activity. These inputs will be used as the fundamental representation of information to be fed into the final visualization framework. The work performed in D2.1, in which we reviewed all relevant sources of information, was an essential first step towards this objective. This preliminary work has enabled us to select the most appropriate data sets to build our infrastructure, develop the associated software prototypes and thus achieve the intermediate project objectives.

As described in the DoW, WP2 aims at building a unified measurement platform that will enable the correlation of several heterogeneous sources of information for the different use cases defined in deliverable D1.2, with respect to attack attribution scenarios and attacks against the control plane (BGP). Those sources of information fall into two main categories. The first family corresponds to infrastructure-related information that pertains to the core routing protocol of the Internet, and will be used primarily in the BGP analysis scenario. The second set of information sources relates to various threats observed in the Internet (e.g., client-side threats, server code injections and spam data), which will serve as input to the user scenario on the visualization of the Internet threat landscape.

In this deliverable, we describe the design and implementation of the VIS-SENSE database infrastructure. More specifically, we describe the information available in each data set, how the selected data relates to or will serve the previously defined user scenarios (D1.2), and which attributes and methods are being exposed over a remotely accessible programming interface. Finally, we explain how the raw data is enriched, clustered and indexed based on the way it will be eventually manipulated by the upper layers of the VIS-SENSE framework.

The rest of this document is organized as follows: Section 1.1 gives an overview of the data collection infrastructure developed by the VIS-SENSE partners, while Section 1.2 introduces the external interfaces and offered services, thus enabling future interactions with the upper layers of the visualization framework. In Section 2, we describe the WAPI technology that we have further developed to provide a remotely accessible programming interface for querying each data set over the network, as well as all improvements and new features brought to this remote API. Section 3 details each data set that has been integrated into the database infrastructure, and describes all specific attributes and methods that are being exposed over WAPI. Finally, in Section 4 we explain how the raw data was enriched, clustered, indexed and analyzed using specific data analytics techniques (namely the triage analysis framework [11, 14]), in order to enable more advanced processing and visualization tasks in the upper layers of the VIS-SENSE framework, such as visual attack attribution and advanced network correlation based on visual analytics.

1.1 Data collection infrastructure

Fig. 1.1 gives an overview of the VIS-SENSE data collection infrastructure, which comprises different data sets that were specifically selected to provide the data required for the previously defined analysis scenarios. For the goal pursued in Scenario 1 (Visualization of the Internet threat landscape), the following data sets have been selected and integrated into the database infrastructure:

• sgnet, a distributed honeypot deployment aiming at collecting network information on malicious Internet activity, in particular code injection attacks and self-propagating malware;

• harmur (the Historical ARchive of Malicious URLs), a comprehensive data set providing historical data on client-side threats, such as malicious domains, websites distributing fake software products (e.g., fake anti-virus), etc.;

• SpamCloud, a large repository of spam emails collected through spam traps distributed worldwide, and enriched with various contextual information (bot signature, IP geolocation, OS fingerprinting, embedded URIs, SMTP or header fields, etc.).

Some of these data sets, such as sgnet and harmur, were initially developed in the context of the WOMBAT EU-FP7 project¹. As described in the next Sections, we have built upon those wombat development efforts to integrate and maintain those data sets within a more comprehensive infrastructure, enriching them with new features when necessary. SpamCloud, on the other hand, is a completely new data set developed during VIS-SENSE that aims at giving access to extensive spam email data.

¹ The Worldwide Observatory of Malicious Behaviors and Threats (WOMBAT EU-FP7 Project), http://www.wombat-project.eu

For Scenario 2 (Visual analysis of BGP attacks), the following data sets have been developed and are now being maintained in the current infrastructure:

• SpamTracer, a system for collecting and storing traceroutes and IP/AS path information towards spamming IP addresses;

• BGPDB, a system for collecting and parsing BGP update messages from RIPE and RouteViews data repositories, and structuring them in an efficient database system for easy query access.

The two BGP-related data sets are completely new and have been specifically developed in VIS-SENSE to enable the analysis and visualization of attacks against the control plane of the Internet, such as BGP hijacking attacks. In particular, those data sets will be instrumental in the project to confirm or reject the open conjecture that spammers may abuse the routing infrastructure by hijacking unused IP blocks in order to send spam in a stealthy way (the so-called fly-by spammers phenomenon).

[Figure 1.1 placeholder: the VIS-SENSE platform (Scenario 1 and Scenario 2) connected through WAPI to the data infrastructure, which contains SGNET, HARMUR, SpamCloud, the BGP data sets (data plane and control plane) and TRIAGE.]

Figure 1.1: Overview of the VIS-SENSE data infrastructure


1.2 Interfaces and interactions with upper layers

Since data sets may be maintained by different VIS-SENSE partners during (or even after) the project, a decision was made to decouple the visualization framework and the different network analytics algorithms from the data collection infrastructure. Remote access mechanisms were therefore developed to provide an external data interface and to offer various services to the upper layer of the framework, i.e., the visual analytics layer. This was achieved through the development of a remote API that opens programmatic access to each data set. This VIS-SENSE API middleware was developed by taking advantage of the wombat API² (or WAPI). This is represented in Fig. 1.1 by the dashed lines connecting the VIS-SENSE framework with every data set of the data collection infrastructure.

The WAPI is a remote API built on top of SOAP that was initially developed in the wombat project, and which allows data consumers to retrieve data remotely from information sources according to a given communication protocol and through a uniform set of objects and primitives. As described in Section 2, the WAPI middleware provides a standard API mechanism that enables a data owner to easily share a data set or any subset of it. WAPI was also designed to alleviate common data sharing problems such as data control, access control, extensibility and client updates resulting from data schema modifications on the server side.

Opening access to the raw data is essential but certainly not sufficient to enable the more advanced data processing and visualization tasks that will be performed in the upper layer of the VIS-SENSE framework. Indeed, for visual attack attribution and to enable advanced network correlation based on visual analytics, the data needs to be preprocessed, enriched, clustered, indexed and analyzed using specific data analytics algorithms. Therefore, we have started integrating the triage data analytics framework [11] into the VIS-SENSE infrastructure in order to automatically preprocess the raw data as soon as new data is inserted into any of the datasets.

² WAPI: http://sourceforge.net/projects/wombat-api/

As previously described in D1.2 (Use case analysis and user requirements), triage is an attack attribution software module that relies on data fusion techniques and leverages multi-criteria decision algorithms to cluster security or attack events. Thanks to this data triage processing, virtually any type of security event can be automatically grouped with others based upon a number of common elements (or features) that are likely due to the same root cause. As a result, triage can identify more complex patterns showing various types of relationships among series of attacks or groups of disparate events, thus giving insights into the manner in which attack campaigns and large-scale attack phenomena are being orchestrated by cyber criminals and, more importantly, revealing the modus operandi of their presumed authors.
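The idea of grouping events that share common features can be illustrated with a deliberately simplified sketch. It is not the triage algorithm: a plain weighted mean of per-feature Jaccard similarities is substituted for the multi-criteria aggregation described in [11], and the feature names, weights and threshold are assumptions chosen for the example.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Similarity of two feature sets (1.0 = identical, 0.0 = disjoint)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster_events(events: dict, weights: dict, threshold: float = 0.5) -> list:
    """Group events whose weighted feature similarity reaches the threshold."""
    parent = {name: name for name in events}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = sum(weights.values())
    for left, right in combinations(events, 2):
        # Weighted mean stands in for a multi-criteria aggregation function.
        score = sum(
            w * jaccard(events[left][feature], events[right][feature])
            for feature, w in weights.items()
        ) / total
        if score >= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for name in events:
        groups.setdefault(find(name), set()).add(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical events described by two features.
EVENTS = {
    "e1": {"ips": {"192.0.2.1", "192.0.2.2"}, "domains": {"a.example"}},
    "e2": {"ips": {"192.0.2.1", "192.0.2.2"}, "domains": {"a.example", "b.example"}},
    "e3": {"ips": {"198.51.100.9"}, "domains": {"z.example"}},
}
CLUSTERS = cluster_events(EVENTS, {"ips": 1.0, "domains": 1.0})
```

Here e1 and e2 share most of their features and end up in the same group, while e3 stays alone.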

As depicted in Fig. 1.1, triage will be integrated as a central element of the data infrastructure and will also provide various analytical services to the VIS-SENSE framework. While each data set is transformed into a data provider, triage uses the same WAPI mechanism to become a service and meta-data provider for the visual analytics layer, which is considered as a service/data consumer. A preview of this interaction with the visualization layer is briefly described in Section 4. A more complete description of the (visual) interactions that will be developed and implemented in the visual analytics modules will be provided in Deliverables D3.3 (Attack Attribution Module) and D4.x, which are dedicated to studying in depth the integration of new visualization techniques with the network data analytics that will be specifically developed towards improving attack attribution and security analysis.


2 Interfaces - WAPI v2

2.1 WAPI refresher

Many security datasets allow researchers and threat analysts to access the collected information. However, each dataset often adopts a very different solution to regulate access to the collected data. In practice, a data consumer interested in accessing information from different sources is often forced to develop ad-hoc plugins by studying in depth the characteristics of each dataset, and to maintain specific parsers whenever a dataset API is updated, at a non-negligible cost in time and resources. In the context of the FP7 WOMBAT project, researchers proposed an alternative to this scenario: a generic interface between data consumers and data providers called WAPI, the WOMBAT API.

WAPI allows access to any type of security dataset by means of a standard web service based on the SOAP protocol. While full flexibility is left to the data provider to decide the dataset structure and functionalities, WAPI allows a data consumer to interact with a variety of WAPI-enabled datasets using a single, unified communication protocol.

The WAPI design takes into account a set of constraints commonly imposed by data sources:

• Data control. The WAPI architecture must allow each data source to control the nature of the shared information. It is likely that some sources will be willing to share only a portion of the information stored in their datasets. Also, sources must be able to dynamically pre-process the information provided to data consumers, for instance by applying anonymization algorithms.

• Access control. The confidentiality requirements of the various sources can lead to the definition of different trust levels for the data consumers. Data consumers belonging to different trust levels will be allowed to access information of a different nature. While the practical implementation of these trust levels is not considered of prime importance in the short term, the WAPI architecture must be easily extensible to implement more sophisticated types of access control.

• Extensibility. Many sources in WOMBAT are still in an experimental phase and they are likely to evolve in the next years. This evolution may consist, for instance, in applying new analysis techniques to the dataset, enriching it with new types of information. Therefore the WAPI specification is not bound to a specific set of primitives defined a priori for each type of source.

• Client flexibility. The technologies used for the practical implementation of the WAPI must not bind the data consumer to the usage of any specific programming language.
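As a rough illustration of the data control and access control constraints, the sketch below filters and anonymizes a record on the provider side according to the trust level of the consumer. The trust levels, field names and prefix-truncation scheme are assumptions made for this example; they are not part of the WAPI specification.

```python
import ipaddress

# Hypothetical trust levels and the fields each one is allowed to see.
VISIBLE_FIELDS = {
    "public": {"timestamp", "source_ip"},
    "partner": {"timestamp", "source_ip", "payload_md5"},
}


def anonymize_ip(address: str, keep_bits: int = 24) -> str:
    """Zero the host part of an IPv4 address, keeping the first keep_bits bits."""
    network = ipaddress.ip_network(f"{address}/{keep_bits}", strict=False)
    return str(network.network_address)


def share_record(record: dict, trust_level: str) -> dict:
    """Return the view of a record that a consumer at trust_level may receive."""
    allowed = VISIBLE_FIELDS[trust_level]
    view = {key: value for key, value in record.items() if key in allowed}
    # In this sketch, "public" consumers only ever see a truncated address.
    if trust_level == "public" and "source_ip" in view:
        view["source_ip"] = anonymize_ip(view["source_ip"])
    return view
```

The pre-processing happens before the record leaves the data source, so the provider keeps full control over what each class of consumer can observe.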

A first prototype of the WOMBAT API was implemented in Python by the WOMBAT project, and released as an open-source project on SourceForge (http://wombat-api.sf.net).

2.1.1 WAPI concepts

The WAPI specification is based on a set of high-level concepts that are borrowed from the standard object-oriented programming paradigm. More specifically, WAPI models a dataset as a set of object instances, identified by a type and an identifier, and characterized by a set of attributes, methods and references.

Attributes are basic information items that are provided by a certain source upon instantiation of a WAPI object. For instance, the object modeling an attacker can be associated with an attribute “IP address”.

Methods are queries associated with a given WAPI object to retrieve additional information about it. While the attributes are computed and provided to the WAPI client at every instantiation of a WAPI object, methods make it possible to retrieve more complex information on demand (i.e., information requiring more expensive processing). For instance, a geolocation() method can be associated with an attacker object to retrieve information about the geographical location of the corresponding IP.

References are “special methods” that return lists of object instances. A WAPI-enabled data set usually provides different types of WAPI objects which are linked by relationships. References allow the user to explore the different objects by following those relationships, which can be seen as traversals of a directed graph. For instance, a reference method can be provided on a domain name object to obtain a list of all DNS relations that are known for that specific domain name.
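The three concepts can be sketched in a few lines of Python. The class below is an illustrative stand-in, not the class shipped with the WAPI client; the object types, identifiers and lookup values are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WapiObject:
    """Illustrative model of a WAPI object instance (hypothetical class)."""
    type_name: str
    identifier: str
    # Attributes are delivered eagerly when the object is instantiated.
    attributes: Dict[str, object] = field(default_factory=dict)
    # Methods are evaluated on demand and may be expensive.
    methods: Dict[str, Callable[..., object]] = field(default_factory=dict)
    # References are methods that return lists of other objects.
    references: Dict[str, Callable[..., List["WapiObject"]]] = field(default_factory=dict)

    def call_method(self, name: str, **kw) -> object:
        return self.methods[name](self, **kw)

    def call_reference(self, name: str, **kw) -> List["WapiObject"]:
        return self.references[name](self, **kw)


def geolocation(attacker: WapiObject, **kw) -> str:
    # Made-up lookup table standing in for a real geolocation query.
    return {"192.0.2.10": "FR"}.get(attacker.attributes["ip_address"], "unknown")


def dns_relations(domain: WapiObject, **kw) -> List[WapiObject]:
    # Follows a relationship edge from a domain to its known DNS relations.
    return [WapiObject("dns_relation", f"{domain.identifier}->192.0.2.10")]


attacker = WapiObject("attacker", "a-1", {"ip_address": "192.0.2.10"},
                      {"geolocation": geolocation})
domain = WapiObject("domain", "example.org",
                    references={"dns_relations": dns_relations})
```

Calling `attacker.call_method("geolocation")` triggers the on-demand query, while `domain.call_reference("dns_relations")` returns new objects that can be explored in turn.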

The WAPI SOAP API (see Table 2.1) allows a client to “discover” the characteristics of a dataset by means of a set of reflective methods providing a list of defined object types, the methods and references defined for each type, and the attributes associated with a specific object instance as well as their values. This allows clients to be completely independent from the definition of a WAPI dataset, which can be refined over time by adding more functionality.

Method | Use
get_objects() | List all object types defined in a specific dataset.
get_methods(object) | List all the methods defined for a specific object type.
get_references(object) | List all the references defined for a specific object type.
get_attributes(object,identifier) | List all the attribute names and associated values for a specific object instance.
exists(object,identifier) | Check for the existence of a specific object instance in the dataset.
call_method(object,identifier,method,**kw) | Call a method on a specific object instance.
call_reference(object,identifier,method,**kw) | Call a reference on a specific object instance.
get_documentation(object) | List all the documentation strings for the object, as well as its methods and references.

Table 2.1: WAPI SOAP API
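To make the reflective discovery concrete, the following sketch shows a generic client routine that learns the schema of a dataset using only the calls listed in Table 2.1. `ToyDataset` is a hypothetical in-memory stand-in for a remote WAPI server (no SOAP transport is involved), and the method names are written with underscores, which is an assumption about their exact spelling.

```python
class ToyDataset:
    """In-memory stand-in for a WAPI-enabled dataset (hypothetical content)."""

    def __init__(self):
        # object type -> identifier -> attributes
        self._instances = {"attacker": {"a-1": {"ip_address": "192.0.2.10"}}}
        # object type -> method name -> callable(attributes, **kw)
        self._methods = {"attacker": {"geolocation": lambda attrs, **kw: "FR"}}
        self._references = {"attacker": {}}

    def get_objects(self):
        return sorted(self._instances)

    def get_methods(self, obj):
        return sorted(self._methods[obj])

    def get_references(self, obj):
        return sorted(self._references[obj])

    def exists(self, obj, identifier):
        return identifier in self._instances.get(obj, {})

    def get_attributes(self, obj, identifier):
        return dict(self._instances[obj][identifier])

    def call_method(self, obj, identifier, method, **kw):
        return self._methods[obj][method](self._instances[obj][identifier], **kw)


def describe(dataset):
    """Discover a dataset's schema without any prior knowledge of it."""
    return {
        obj: {
            "methods": dataset.get_methods(obj),
            "references": dataset.get_references(obj),
        }
        for obj in dataset.get_objects()
    }
```

Because `describe` relies only on the reflective calls, it keeps working when the provider later adds object types or methods, which is the independence property described above.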

2.1.2 Design and rationales

When designing the data collection infrastructure, the adoption of the WOMBAT API in the context of the VIS-SENSE data collection framework was a straightforward choice for a number of reasons.

Firstly, datasets such as HARMUR and SGNET were originally developed in the context of WOMBAT, and have offered experimental WAPI access since the beginning of the project. Using these datasets in the context of VIS-SENSE was instrumental in detecting a number of problems in the original dataset implementations, as discussed in the next Section.

Secondly, WAPI can be considered as a network-based decoupling between the different functional components of the VIS-SENSE architecture. Based on a widely supported data exchange protocol such as SOAP, WAPI allows the co-existence of a multiplicity of clients developed in different programming languages, and of datasets based on different database technologies (ranging from standard SQL DBMSs such as MySQL, to distributed databases such as Greenplum, to NoSQL datasets based on MongoDB).

Finally, being a network-based communication protocol, WAPI allows the data collection infrastructure to be distributed, and allows the different components to operate without any constraint on their physical location.

2.2 Improvements introduced in WAPI v2

The WOMBAT project released a first prototype of the WAPI client and server logic as an open-source project under the terms of the BSD license. The prototype was fully functional, and had been employed by all WOMBAT providers and demonstrated on a number of occasions such as the WOMBAT Workshop¹ or BlackHat DC 2010 [33]. However, while certain WOMBAT partners made occasional use of the API, the prototype had never been properly tested under more significant user loads. Partner interaction in the context of the VIS-SENSE project has allowed us to pinpoint many problems of the current implementation, mostly in terms of code maintainability and long-lived interactions.

¹ http://wombat-project.eu/2009/09/wombat-2nd-open-workshop-progr.html

2.2.1 Code maintainability

The original WAPI implementation was based on a heavily edited and customized version of SOAPpy, part of the Python Web Services project². The customizations stem mostly from the access control requirements of the WAPI. Data owners typically want to control access to their datasets by denying access to unauthorized clients. The WOMBAT project decided to address the access control requirements by building directly upon the SSL protocol, using client certificates to verify the legitimacy of the client. Each WAPI server (possibly hosting multiple datasets) is associated with a certification authority that issues signed certificates to all authorized clients. The server therefore rejects any incoming connection using a certificate that has not been signed by its own certification authority.

Implementing such an access control mechanism is straightforward when using the standard openssl libraries. However, it requires direct interaction with the SSL context of the connection, both on the server (to verify the client certificate signature) and on the client (to provide a specific certificate during the SSL handshake). The SOAPpy API allows the SSL context to be manipulated on the server, but not on the client. The WOMBAT participants were therefore forced to extensively modify the client-side API of the library, but failed to propagate this modification back to the maintainer. As a result, the WAPI distribution shipped with a custom version of the SOAPpy library that is not easily maintained.
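The two sides of this certificate-based scheme can be sketched with the `ssl` module of the Python standard library. This is not the WAPI code, which relied on SOAPpy and the openssl libraries; the function names and file paths are hypothetical and serve only to show where each SSL context has to be touched.

```python
import ssl


def make_server_context() -> ssl.SSLContext:
    """Server side: demand a client certificate during the handshake."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Connections without a certificate signed by a trusted CA are rejected.
    context.verify_mode = ssl.CERT_REQUIRED
    return context


def load_server_material(context: ssl.SSLContext, ca_file: str,
                         cert_file: str, key_file: str) -> None:
    """Trust only the server's own CA and present the server certificate."""
    context.load_verify_locations(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)


def make_client_context(ca_file: str, cert_file: str,
                        key_file: str) -> ssl.SSLContext:
    """Client side: present the certificate issued by the server's CA."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.load_verify_locations(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

The client-side call to `load_cert_chain` corresponds to the step that could not be expressed through the unmodified SOAPpy client API.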

2.2.2 Long-lived interactions

While the WAPI was originally conceived mostly for short, interactive sessions in which a user queried a dataset for information on a specific data point, new types of interactions have been explored in the context of the VIS-SENSE project. This includes, for instance, longer sessions associated with the use of WAPI as a data retrieval method in the context of web-based visualization tools. Using Python-based web frameworks such as web2py, we have used WAPI to retrieve on demand the information required to render a web page providing statistics on the current state of a dataset. Similarly, project participants have experimented with data-intensive scripts aiming at correlating local information with remote information available in the WAPI datasets. All these experiments have had two main consequences on the WAPI server operation: 1) the duration of the WAPI session has increased; 2) the number of concurrent sessions running on a server at a given point in time has increased.

This has allowed us to pinpoint two important limitations of the WAPI architecture:

Persistent connections. In the original WAPI implementation, WAPI servers allowed persistent connections, and WAPI clients typically established a single TCP connection at startup and reused it throughout the interaction with a dataset. While this caused no significant problem when running clients for short durations, clients running over several hours or days failed to keep the TCP connection alive. Upon connection drop, the client state could not be recovered due to bad coding practices within the original SOAPpy library.

Multithreading. The original WAPI implementation leveraged SOAPpy’s support for multi-threaded servers to deal with multiple incoming connections at once. The server spawned a new thread for every newly accepted connection, and joined the thread whenever the connection was dropped. Since many database connectors are not thread-reentrant, every newly generated thread required the creation of a new connection to the database. Many DBMSs do not handle operation over multiple connections well, and this led to a variety of problems and to very frequent database downtimes.

² http://pywebsvcs.sourceforge.net/

2.2.3 WAPI 2.0

Addressing the issues described above has required us to revisit part of the transport logic employed by the current WAPI prototype. Given the limitations of the SOAPpy library (originally selected by the WOMBAT project, and currently the only SOAP parsing library for Python), we have decided to write a new transport implementation for the WAPI server and client. We have selected Twisted Python³, an event-driven networking engine employed in many research projects as well as enterprise-level products. The transition of both the client and the server logic to Twisted has brought a series of major improvements to WAPI:

Concurrent connections. Twisted is an asynchronous framework that multiplexes multiple connections within a single thread. While intensive database processing in WAPI is still deferred to a separate thread, the server is now able to handle multiple concurrent clients without any significant impact on the DBMS and on the number of connections to it.

Cleaner codebase. While SOAPpy is still used within the Twisted library as a parser for SOAP envelopes, the transport logic is completely decoupled and based on the Twisted programming paradigm. This allows full flexibility in the handling of the SSL contexts on the client and the server, without any modifications to the standard libraries. WAPI 2.0 no longer requires the use of customized libraries, and depends only on standard components available on any Linux distribution.

Non-persistent connections. Support for persistent connections in WAPI has been deprecated. The new WAPI client no longer keeps the connection with the server open throughout the duration of the WAPI session, which simplifies connection management and complies better with the behavior of standard SOAP web services. While this leads to a decrease in efficiency (the SSL handshake needs to be repeated for every SOAP method call), the extensive use of caching in the client implementation minimizes its practical impact on client performance.

³ http://twistedmatrix.com/trac/
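The concurrency model described under "Concurrent connections" can be sketched as follows. The example deliberately uses `asyncio` from the Python standard library instead of Twisted, so it illustrates the pattern rather than the WAPI 2.0 code; the function names and the single-worker pool size are assumptions made for the example.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# A single worker thread owns the (hypothetical) database connection, so the
# number of DBMS connections no longer grows with the number of clients.
DB_EXECUTOR = ThreadPoolExecutor(max_workers=1, thread_name_prefix="db")


def blocking_query(identifier: str) -> tuple:
    """Stand-in for a blocking DBMS call; reports which thread executed it."""
    return identifier, threading.current_thread().name


async def handle_request(identifier: str) -> tuple:
    """Serve one client request without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(DB_EXECUTOR, blocking_query, identifier)


async def serve_all(identifiers):
    """Multiplex many requests inside the single event-loop thread."""
    return await asyncio.gather(*(handle_request(i) for i in identifiers))


def run_demo(count: int = 5):
    return asyncio.run(serve_all([f"req-{i}" for i in range(count)]))
```

All requests are accepted concurrently by the event loop, yet every database call runs on the same worker thread, in contrast with the thread-per-connection design of the original server.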

All the modifications to the WAPI codebase have already been propagated to the WAPI SourceForge project, and the new prototype can be downloaded from http://wombat-api.sf.net.
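The role of client-side caching with non-persistent connections can be illustrated with a minimal sketch. The source does not detail what the WAPI client caches or for how long, so the class below, its cache key and its lack of invalidation are assumptions; `remote_call` stands for any function that performs one full request, including the SSL handshake.

```python
class CachingClient:
    """Wraps a remote-call function and memoizes results per argument tuple."""

    def __init__(self, remote_call):
        self._remote_call = remote_call  # performs one request and one handshake
        self._cache = {}
        self.remote_calls = 0  # counts how many requests actually left the client

    def get_attributes(self, obj, identifier):
        key = ("get_attributes", obj, identifier)
        if key not in self._cache:
            self.remote_calls += 1
            self._cache[key] = self._remote_call(*key)
        return self._cache[key]


def fake_remote_call(operation, obj, identifier):
    # Hypothetical server answer, used instead of a real SOAP request.
    return {"operation": operation, "type": obj, "id": identifier}


client = CachingClient(fake_remote_call)
first = client.get_attributes("attacker", "a-1")
second = client.get_attributes("attacker", "a-1")
```

The second lookup is answered locally, so only one handshake is paid for the two calls.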

2.3 WAPI over JAVA

Within the VIS-SENSE project, an extensible framework will be developed to pool project results and enable the collaborative use of VIS-SENSE technologies. The framework will be a software application that will serve as the entry point for the use of VIS-SENSE results. The application framework chosen for the implementation of its core components is the Java-based Eclipse Rich Client Platform⁴.

The framework will enable the communication with and orchestration of external services. WAPI will be used for the communication with data servers. To enable the smooth exploration of data sources and the fast acquisition of data, a port of the WAPI client from Python to Java was carried out. The Java-based client offers the same functionality as the Python client, packaged as an Eclipse plugin.

The client will be released as open source at the end of the VIS-SENSE project. It will serve as a contribution to the broader usability of the Wombat API.

⁴ http://www.eclipse.org/rcp

3 VIS-SENSE Data Sets

VIS-SENSE will leverage the data collected during the WOMBAT project (EU-FP7) [9] to further enrich the previous data and analysis results with new content, meta-data and additional APIs and services. The WOMBAT partners have collected diverse sets of security-related raw data, enriched this input by means of various analysis techniques, and developed approaches to get a better understanding of the root causes of the Internet attack phenomena under scrutiny. In VIS-SENSE, we will build upon those previous development efforts to integrate the existing data sets within a more comprehensive visual analytics framework, and enrich them further when necessary.

In this Section, we start by describing two WOMBAT datasets (sgnet and harmur), which will be integrated in the VIS-SENSE framework and further enriched and analyzed, before moving to the description of three completely new datasets (SpamCloud, SpamTracer and BGPDB) that were specifically developed to enable the VIS-SENSE analysis scenarios defined previously in D1.2.

3.1 Honeypot traces (SGNET)

3.1.1 Overview

sgnet [20], introduced in Deliverable 2.1, is a distributed honeypot deployment aimingat collecting information on the Internet malicious activity. sgnet is the most recentevolution of the research work performed within the Leurre.com project [22]. sgnethoneypots are deployed on low-end hosts provided by volunteering partners interestedin exploiting the data collected by the project.

sgnet integrates different tools, namely ScriptGen [21], Argos [25] and Nepenthes [12]and exploits their characteristics to emulate code injection attacks and collect malware.sgnet benefits from a set of properties that enable it to gather a very peculiar view onInternet attacks and malware.

Firstly, sgnet is protocol agnostic. Following the idea initially proposed in ScriptGen, no assumption is made on the structure of network protocols and on their interaction. Through the usage of bioinformatics techniques, sgnet is able to learn the behavior of network protocols and thus handle exploits without an a-priori assumption on their behavior. This potentially allows sgnet to handle new or rare exploits that may not be supported by other malware collection solutions such as Nepenthes.

Secondly, sgnet retrieves in-depth information on the structure of the observed attacks. This information is collected in a central database and presented at different aggregation levels. Such information is then enriched through a number of analysis tools organized in an easily extensible framework.

The information enrichment properties of sgnet make it possible to correlate the observations with the output of a large variety of tools that are automatically run on the collected data: geolocation information on the origin of the attackers, DNS information, and much more. In this context, the ability of sgnet to emulate code injection attacks up to the point of the download of malware samples is extremely valuable.

We refer the interested reader to Deliverable 2.1 for more details on the sgnet architecture as it was conceived in the context of the wombat project. The work on sgnet has continued beyond the wombat project, and we have recently released a major revision of the original deployment. Leaving aside the code engineering improvements, the new revision leads to a number of tangible improvements for the operation of the VIS-SENSE project:

Real time storage. In its original conception, all the logs generated by the honeypots were stored locally on each machine, and collected on a daily basis for being stored in the central sgnet database. In practice, a delay of approximately 30 hours existed between the time an event was observed by a honeypot sensor and the time at which the information was stored in the database. We have revised this operational model to enable real-time storage of all the events observed by the sensors: whenever a new event is observed by a sensor, its characteristics are immediately pushed to the central database by means of a distributed protocol.

Better shellcode emulation. The original deployment achieved on average a 20% success rate in the emulation of code injection attacks and download of the associated malware. The success rate has been decreasing in recent years due to the increase in sophistication of the shellcodes. Work has been done and is currently in progress to improve this success rate by applying more sophisticated code emulation techniques.

UDP support. While the original deployment focused on TCP emulation, we have enabled sgnet to also handle UDP protocols correctly, so that it can be used effectively for the emulation of UDP protocols such as SIP.



3.1.2 Data schema exposed

Figure 3.1: The sgnet data schema exposed over WAPI. The figure shows the objects Dataset, Set, Source, Destination, Environment, Session, InjectionAttack, Shellcode and Malware with their attributes and methods, linked by references such as set, environment, events, sessions, sources, malware, shellcode and split_by_{country,environment,cidr}.

Figure 3.1 graphically represents the main objects defined in the sgnet WAPI dataset, together with their interrelations by means of references. The central point for the analysis of the sgnet dataset is the set object, which was introduced in the context of the VIS-SENSE project in an attempt to provide a more effective analysis of large amounts of data. The set object makes it possible to create a “meta-object” that defines a set of constraints for the sgnet events. The constraints are used to select only those events in the sgnet dataset that correspond to specific characteristics. The methods and references of the set object can then be used to run aggregate statistics on the selected data, or look at the details of each instance by following the reference to the respective objects.

The current implementation of the sgnet set object allows the definition of the following constraints on the observed events:



• environment: constrain the set to the events observed on a specific environment name

• start_at, end_at: define a timespan and select only events observed within such a timespan

• path_id: constrain the set only to events associated to an FSM traversal. An FSM traversal corresponds to a specific interaction of an attacking source with the internal representation of the protocol knowledge (expressed through a Finite State Machine). Thanks to the characteristics of the protocol learning techniques employed in sgnet, an FSM traversal is an accurate representation of a specific network interaction, likely to be associated to a specific exploit implementation.

• protocol: constrain the events to a specific transport protocol (e.g. TCP/UDP).

• port: consider only the events associated to a specific TCP/UDP port.

• saddr, saddr_prefix: specify an address or a CIDR range, and consider only events generated by attacking sources within that range.

• country: consider only activities originated from a specific country.
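The way such constraints narrow down the event set can be illustrated with a minimal, self-contained sketch. It does not use the real WAPI client: the Event class, the matches and select helpers and the sample data are hypothetical, and treating saddr_prefix as a CIDR string is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address, ip_network


@dataclass
class Event:
    """Hypothetical in-memory stand-in for one sgnet event."""
    environment: str
    timestamp: datetime
    path_id: int
    protocol: str   # "TCP" or "UDP"
    port: int
    saddr: str      # address of the attacking source
    country: str    # country code of the attacking source


def matches(event, environment=None, start_at=None, end_at=None, path_id=None,
            protocol=None, port=None, saddr=None, saddr_prefix=None, country=None):
    """Return True when the event satisfies every constraint that is set."""
    if environment is not None and event.environment != environment:
        return False
    if start_at is not None and event.timestamp < start_at:
        return False
    if end_at is not None and event.timestamp > end_at:
        return False
    if path_id is not None and event.path_id != path_id:
        return False
    if protocol is not None and event.protocol != protocol:
        return False
    if port is not None and event.port != port:
        return False
    if saddr is not None and event.saddr != saddr:
        return False
    # Assumption: saddr_prefix is given in CIDR notation, e.g. "10.1.0.0/16".
    if saddr_prefix is not None and ip_address(event.saddr) not in ip_network(saddr_prefix):
        return False
    if country is not None and event.country != country:
        return False
    return True


def select(events, **constraints):
    """Build the list of events matching all the given constraints."""
    return [e for e in events if matches(e, **constraints)]


events = [
    Event("env-a", datetime(2011, 5, 1, 10), 7, "TCP", 445, "10.1.2.3", "FR"),
    Event("env-a", datetime(2011, 5, 2, 11), 7, "TCP", 139, "10.9.9.9", "DE"),
    Event("env-b", datetime(2011, 5, 3, 12), 9, "UDP", 5060, "192.0.2.7", "FR"),
]
subset = select(events, protocol="TCP", saddr_prefix="10.1.0.0/16")
```

Constraints left unset are simply ignored, so combining several of them yields the intersection of the individual selections.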

3.1.3 Objects, Methods and References

Table 3.1 provides a comprehensive view of all the WAPI objects defined in the sgnet WAPI dataset, as well as their interconnections by means of references. Each of the objects is briefly characterized in the following paragraphs.

Dataset

The dataset object is the starting point for the exploration of the sgnet WAPI dataset, and is characterized by a very simple layout. The method list_addresses provides a synthetic overview of the state of all the honeypot addresses defined in the deployment: their name, the party responsible for them, and their activity statistics (e.g. the last time a specific honeypot address was seen active in the deployment). Only two references are defined: a reference listing the currently active environments (honeypot installations) and a reference providing access to the Set object.



WAPI object | Description | References
Dataset | The sgnet dataset object. | environments, set
Set | A set of events matching specific characteristics. | split_by_environment, split_by_country, split_by_path, sessions, events, sources, malware
Source | An attacking source. In sgnet, a source identifies an IP address whose activity is never separated by more than 24 hours of silence. An IP address whose activity resumes after more than 24 hours of silence is considered a different source, to model the effects of dynamic addressing. | sameaddress, sessions, events, activities
Destination | A honeypot IP address. | environment
Environment | A honeypot installation. A honeypot environment typically comprises three honeypot IPs, thus three Destination objects. | honeys
Session | All the network traffic generated by an attacking source towards a specific destination. | events, source, destination, environment
InjectionAttack | A code injection attack detected by sgnet between an attacking source and a destination. | next, activity, source, destination, environment, session, malware, shellcode
Shellcode | A binary code injected into a victim system by a code injection attack as part of the propagation vector of a malware sample. If correctly analyzed by the shellcode handler, a shellcode is associated to high level information on its behavior. A shellcode can be chained with other shellcodes. | malware, shellcode, next
Malware | A malware sample, injected into a victim by means of a code injection attack. | events
ActivityClass | A class of activities as identified by the EPM model [18]. | subactivities, malware, sources, events

Table 3.1: Summary of sgnet WAPI objects.



Set

As previously explained, the Set object is at the core of the sgnet data analysis. It allows the user to select a portion of the dataset events corresponding to specific constraints, and perform aggregate analyses on top of them. The Set object is characterized by a single attribute (constraints) describing the nature of the constraints defined for the object. A number of methods are defined to compute high-level statistics on the event set:

• count_injections: count the total number of code injection events included in the set.

• count_samples: count the total number of malware samples (distinct MD5 hashes) included in the set.

• count_sources: count the total number of sources included in the set (refer to the Source description for the formal definition of attacking source in the sgnet dataset).

• count_destinations: count the total number of honeypot IPs involved in the activities included in the set.

• split_by_environment: return a number of smaller subsets of the original set, where the activities have been split according to the honeypot environment in which they have been observed.

• split_by_country: return a number of smaller subsets of the original set, where the activities have been split according to the country of origin of the attacking sources.

• split_by_path: return a number of smaller subsets of the original set, where the activities have been split according to the way they have interacted with the ScriptGen FSM objects.

• geo_locations: return geographical information (latitude, longitude) on each of the attacking sources included in the set.

• group_by: count the number of attacking sources belonging to the set, by grouping them according to combinations of specific criteria. One or more of the following criteria can be set:

– day: the day in which the activity took place.

– port_sequence_simple: the set of TCP ports hit by the attacker in its interaction with a destination.

– port_sequence_extended: the sequence of TCP/UDP interactions performed by the attacker in its interaction with the destination.

– path: the interaction of the attacker with the ScriptGen FSM objects, a more fine-grained identifier of the type of network activity.

– country_code: the country of origin of the attacking source

– environment: the honeypot environment hit by the activity

For instance, invoking group_by(day=True, country_code=True) will return the count of sources belonging to the set per observation day, per country.

• list_feature_names: lists the features currently implemented in the sgnet dataset.

• get_feature_vector: given a feature name (among the list returned by list_feature_names), the method returns the value of the feature for every event belonging to the set.
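The semantics of group_by described above can be approximated with the following sketch, which counts distinct sources per combination of the enabled criteria. The record layout, the FIELDS mapping and the sample data are hypothetical; only the keyword-argument style of the call is taken from the text.

```python
from collections import Counter

# Hypothetical records: (source_id, day, country_code, environment) per activity.
activities = [
    ("src-1", "2011-05-01", "FR", "env-a"),
    ("src-1", "2011-05-01", "FR", "env-a"),   # same source seen twice that day
    ("src-2", "2011-05-01", "FR", "env-b"),
    ("src-3", "2011-05-01", "DE", "env-a"),
    ("src-1", "2011-05-02", "FR", "env-a"),
]

# Position of each supported criterion inside a record.
FIELDS = {"day": 1, "country_code": 2, "environment": 3}


def group_by(records, **criteria):
    """Count distinct sources for each combination of the enabled criteria."""
    enabled = [name for name in FIELDS if criteria.get(name)]
    seen = set()
    for rec in records:
        key = tuple(rec[FIELDS[name]] for name in enabled)
        seen.add((key, rec[0]))   # a source is counted once per group
    return Counter(key for key, _source in seen)


counts = group_by(activities, day=True, country_code=True)
```

Each key of the resulting counter is a tuple of criterion values in the order day, country_code, environment, restricted to the criteria that were enabled.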

Source

In sgnet the definition of source goes beyond the simple IP address. In order to take into account the bias introduced by dynamic addressing, the dataset considers the activity of the same address reappearing after a period of silence as the activity of a different source. In practice, an sgnet source is defined as a specific IP address whose activity is not separated by more than 24 hours of inactivity.
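The 24-hour rule can be sketched as follows for the activity timestamps of a single IP address. The function name and the sample timestamps are illustrative assumptions, not part of the sgnet code base.

```python
from datetime import datetime, timedelta

SILENCE = timedelta(hours=24)


def split_into_sources(timestamps, silence=SILENCE):
    """Group the activity times of one IP address into sgnet-style sources.

    A new source starts whenever the gap since the previous activity
    exceeds the silence threshold.
    """
    sources = []
    for ts in sorted(timestamps):
        if sources and ts - sources[-1][-1] <= silence:
            sources[-1].append(ts)
        else:
            sources.append([ts])
    return sources


seen = [
    datetime(2011, 5, 1, 8, 0),
    datetime(2011, 5, 1, 20, 0),   # 12 hours later: same source
    datetime(2011, 5, 2, 19, 0),   # 23 hours later: same source
    datetime(2011, 5, 4, 9, 0),    # 38 hours later: new source
]
sources = split_into_sources(seen)
```

With these timestamps the address is split into two sources: the first covers the three activities up to 2 May, the second only the activity on 4 May.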

The Source object in the sgnet WAPI dataset is characterized by the following attributes:

• address: the address represented in string form (e.g. “10.0.0.0”)

• address_numeric: the address represented in numerical form (e.g. 167772160)

• host_name: the hostname, obtained through reverse resolution

• first_seen: the timestamp of the first activity generated by the source

• last_seen: the timestamp of the last activity generated by the source

• geo_country: the country of origin of the attacker determined through geolocation libraries

• geo_region: the region of origin of the attacker

• geo_city: the city of origin of the attacker

• geo_latitude, geo_longitude: latitude and longitude of the attacking source

Finally, the get_os method offers statistics on the operating system of the attacking host determined by means of passive OS fingerprinting techniques.
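The relation between the string and numerical address forms listed above is the usual 32-bit big-endian encoding of an IPv4 address. A short sketch using the Python standard library (the helper names are illustrative):

```python
from ipaddress import IPv4Address


def to_numeric(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer form."""
    return int(IPv4Address(address))


def to_string(numeric):
    """Convert a 32-bit integer back to the dotted-quad string form."""
    return str(IPv4Address(numeric))


numeric = to_numeric("10.0.0.0")   # 10 * 2**24 = 167772160
```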

Destination

The Destination object models a destination IP address for an attack, thus a honeypot IP. A Destination is characterized by the following attributes:

• name: the hostname of the honeypot address

• address: the address represented in string form

• address_numeric: the address represented in numerical form

• profile: the identifier of the sample factory OS configuration currently emulated by the address

• first_seen: the timestamp of the first activity received by the destination

• last_seen: the timestamp of the last activity received by the destination

Environment

An Environment object models a honeypot installation. In the current sgnet deployment, every honeypot installation is associated to 3 distinct honeypot IP addresses. Consequently, an Environment object is always associated to three Destination objects (they can be retrieved through the honeys reference).

An Environment object is characterized by the following attributes:

• name: the name of the installation (typically, the name of the organization hosting it)

• address: IP address of the management host (used for honeypot maintenance)

• timezone: timezone the honeypot is located in

• first_seen: the timestamp of the first activity received by the honeypot

• last_seen: the timestamp of the last activity received by the honeypot



For system management purposes, the Environment object also provides access to a number of statistics on the current operation of the system by means of the following methods:

• cpu_info: return information on the number of CPUs, architecture and model available to the honeypot platform

• load_stats: information on the CPU load of the system over time

• mem_stats: information on the RAM/swap usage of the system over time

• uptime_stats: information on the honeypot daemon uptime

Session

The Session object models all the network traffic exchanges between a Source and a Destination. This can encompass multiple TCP sessions and UDP flows, and can be associated to a duration of a few minutes or several days depending on the type of activity under consideration.

A Session object is characterized by the following attributes:

• source_addr: the source IP address

• destination_addr: the destination IP address

• start_at: timestamp of the first packet exchanged in the Session

• end_at: timestamp of the last packet exchanged in the Session

• duration: duration in seconds of the Session

• number_packets: total number of packets exchanged in the Session

• av_interreq_time: average time in seconds between two consecutive packets

• port_sequence_simple: the set of TCP ports contacted by the Source during the Session activity, where every port appears only once.

• port_sequence_extended: a more fine-grained characterization of the activity: it corresponds to an ordered list of all the TCP session and UDP flow destination ports generated within the session, as well as ICMP exchanges. If, for instance, an attacking source generates an ICMP request, followed by a first connection to port TCP 139, an exchange on port UDP 137, and a third connection again to port TCP 139, its port_sequence_simple will correspond to: —139 while its port_sequence_extended will correspond to: I80—T139—U137—T139.
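The derivation of the two port sequences from an ordered list of exchanges can be sketched as follows. Two points are assumptions made for the example, since they cannot be recovered with certainty from the text above: the separator character (here "|") and the encoding of an ICMP exchange as the letter I followed by the ICMP type and code (8 and 0 for an echo request). The flow representation is hypothetical as well.

```python
def port_sequences(flows, sep="|"):
    """Compute (port_sequence_simple, port_sequence_extended).

    flows is an ordered list of (kind, value) tuples, where kind is
    "TCP", "UDP" or "ICMP". For TCP and UDP the value is the destination
    port; for ICMP it is a (type, code) pair.
    """
    simple = []     # TCP ports, each kept once, in order of first appearance
    extended = []   # every exchange, in order, prefixed by its protocol letter
    for kind, value in flows:
        if kind == "TCP":
            if value not in simple:
                simple.append(value)
            extended.append(f"T{value}")
        elif kind == "UDP":
            extended.append(f"U{value}")
        elif kind == "ICMP":
            icmp_type, icmp_code = value
            extended.append(f"I{icmp_type}{icmp_code}")
    return sep.join(str(port) for port in simple), sep.join(extended)


# ICMP echo request, then TCP 139, UDP 137 and TCP 139 again.
flows = [("ICMP", (8, 0)), ("TCP", 139), ("UDP", 137), ("TCP", 139)]
simple, extended = port_sequences(flows)
```

The simple form discards ordering and repetition of TCP ports, whereas the extended form keeps every exchange and can therefore distinguish activities that hit the same ports in a different order.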

Finally, a method traversal_sequence is defined in the Session object to retrieve information on the interaction of the activity with the protocol learning techniques employed by sgnet. The output of the method provides a very fine-grained way to characterize the activities, and to discern for instance different exploit implementations targeting the same port.

InjectionAttack

An InjectionAttack object models the detection of a successful code injection attack against one of the sgnet honeypots. Whenever a code injection attack is detected, the sgnet deployment attempts to identify the shellcode meant to be injected in the hijacked control flow of the victim, and tries to understand its behavior. If successful, the shellcode is emulated by the honeypots and a malware sample is ultimately downloaded. The object is therefore characterized by the following attributes:

• transport: the transport protocol involved (UDP/TCP)

• src_address, src_port: source IP address and port

• dst_address, dst_port: destination IP address and port

• ts_start: timestamp of the first packet of the flow involved in the code injection

• ts_end: timestamp of the last packet of the flow involved in the code injection

• path_name: name of the FSM traversal associated to the code injection attack. As previously explained, the FSM traversal uniquely identifies the network interaction, and possibly the exploit implementation involved.

• path_profile: the OS profile of the sample factory that contributed to the generation of the protocol model

• path_creation: date on which the specific FSM traversal was created

• path_active: boolean flag to express whether the FSM traversal is currently active



It should be noted that a code injection attack can be modeled as a sequence of objects, starting from the InjectionAttack, to the injected Shellcode, to the Malware sample downloaded as a consequence of the Shellcode emulation. In certain “multi-stage” attacks, multiple shellcodes may even be chained one to the other. This is modeled in WAPI through the next reference, which is implemented in the InjectionAttack and Shellcode objects and aims at representing this sequence.
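Following such a chain amounts to walking a linked list. The sketch below uses a single hypothetical Stage class for all three object types and, as a simplification, also links the final malware sample through next; in the dataset itself the sample is reached through the malware reference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stage:
    """Hypothetical stand-in for an InjectionAttack, Shellcode or Malware object."""
    kind: str
    label: str
    next: Optional["Stage"] = None


def walk(stage):
    """Follow the next reference until the chain ends."""
    chain = []
    while stage is not None:
        chain.append(stage)
        stage = stage.next
    return chain


# A multi-stage attack: two chained shellcodes before the final download.
sample = Stage("Malware", "downloaded sample")
second = Stage("Shellcode", "stage 2", next=sample)
first = Stage("Shellcode", "stage 1", next=second)
attack = Stage("InjectionAttack", "attack", next=first)

kinds = [stage.kind for stage in walk(attack)]
```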

Shellcode

A Shellcode object models a binary injected into sgnet by a successful injection attack, together with its associated analysis by means of the shellcode handler. It is characterized by the following attributes:

• run_at: timestamp associated to the moment in which the shellcode was analyzed

• sc_packer: if the analysis was successful, provides information on the type of packer identified in the shellcode by the shellcode handler.

• sc_type: if the analysis was successful, provides information on the type of shellcode identified.

• dl_uri: URI representing the intended behavior of the shellcode. For instance, a shellcode aiming at opening a listening port on the TCP port 9988 will be represented as bind://0.0.0.0:9988. Similarly, a shellcode aiming at downloading a sample through the ftp protocol may be represented as ftp://10.2.3.1/get/malware.exe.

• dl_protocol: the type of download protocol used by the shellcode.

• dl_port: the port involved in the malware download

• dl_host: the hostname involved in the malware download

• dl_filename: the remote file name of the shellcode (when applicable).
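The dl_protocol, dl_host, dl_port and dl_filename attributes correspond to the components of dl_uri. How sgnet derives them internally is not described here; the sketch below merely shows one way to split the two example URIs with the Python standard library, and the helper name is an assumption.

```python
from posixpath import basename
from urllib.parse import urlsplit


def split_dl_uri(dl_uri):
    """Split a shellcode URI into protocol, host, port and file name."""
    parts = urlsplit(dl_uri)
    return {
        "dl_protocol": parts.scheme,
        "dl_host": parts.hostname,
        "dl_port": parts.port,                        # None when no explicit port is given
        "dl_filename": basename(parts.path) or None,  # None when the URI has no path
    }


bind = split_dl_uri("bind://0.0.0.0:9988")
ftp = split_dl_uri("ftp://10.2.3.1/get/malware.exe")
```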

Malware

A Malware object models a sample successfully downloaded by sgnet as a result of the correct emulation of the previous stages of a code injection attack. It is characterized by the following attributes:

• md5: the MD5 hash of the binary

• file_size: the size of the binary

• file_type: the file type as identified by libmagic.

• first_appeared: the timestamp of the first successful download of the sample.

3.2 Client-side threats (HARMUR)


3.2.1 Overview

harmur [19], the Historical ARchive of Malicious URLs, is a repository of information on the characteristics and the dynamics associated to web-related threats. harmur collects information on domains that are believed to be suspicious and malicious by a variety of different security sources. For each suspicious domain, harmur tries to look at the characteristics of the hosting infrastructure (web servers, DNS information, geographical location of the servers and hosting Autonomous System), at the domain registration information (WHOIS data) and at the security information (retrieved from Norton Safeweb and Google SafeBrowsing). These information sources have been recently extended with information believed to be useful in the analysis of drive-by-downloads, namely basic information on the hosted content (HTML content and referenced javascript) and redirection chains. All the harmur information sources are queried repeatedly over time for each tracked domain, giving priority to those domains that are believed to be “most interesting” according to a set of heuristics. This allows harmur to build a timeline of events for each domain of interest, a timeline that we believe to be extremely helpful to characterize threats and the modus operandi of the individuals behind them.

The detailed architecture of the harmur data collection framework is shown in Figure 3.2. The architecture is composed of two basic types of components: URL feeds and analysis modules.

URL feeds

URL feeds regularly generate new, possibly interesting URLs that should be analyzed by the harmur framework. URL feeds may have a different level of confidence in the maliciousness of the URL: at the time of writing, 17% of the tracked domains have never been proved to be malicious. Non-malicious domains get a very low priority in the harmur internal scheduler for analysis, and therefore do not significantly impact the performance of the framework. The following URL feeds are currently implemented in harmur:

Norton Safeweb: lists of domains generated on a daily basis by Symantec operations, and that are considered as malicious with very high confidence.



Figure 3.2: Overview of harmur data collection. The URL feeds (Norton Safeweb, MDLs, Phishtank, Exposure, WAPI submissions) supply harmur, which runs the analysis modules DNS (NS/MX/A/PTR/CNAME resource records), WHOIS (registrant/registrar information), ADDRESS (geolocation, autonomous system), SERVER (HTTP/HTTPS port reachability, server version), CONTENT (collection of inlined/referenced js, redirection chains) and SECURITY (threat info from SafeWeb and Google SafeBrowsing).

MDLs: Various well-known Malware Domain Lists (http://malwaredomainlist.com, http://www.malwareurl.com, http://www.hosts-file.net). The quality of this feed is typically lower, since it is mostly based on user contributions or on heuristics based, for instance, on the presence of keywords in the domain name being registered.

Phishing domains: phishing domains reported by PhishTank on a daily basis.

Exposure: lists of malicious domains identified by Exposure (http://exposure.iseclab.org). Exposure analyzes DNS dynamics to spot cases that are indicative of the use of a domain for malicious purposes.

WAPI: thanks to the harmur WAPI interface, it is now possible for any data consumer to directly upload URLs of interest to harmur.




Analysis modules

DNS module. The DNS module is in charge of collecting as much information as possible on the DNS infrastructure underlying each analyzed domain. The information includes:

• A Resource Records. Mapping a hostname to the list of IP addresses associated to it.

• CNAME Resource Records. Defining aliases between names.

• NS Resource Records. Mapping a domain to the list of its authoritative nameservers.

• MX Resource Records. Mapping a domain to the list of its mail servers.

• PTR Resource Records. Mapping an IP address to its reverse resolution.

WHOIS module. The WHOIS module attempts to query the WHOIS database to retrieve information on the domain registration. More specifically, harmur tries to extract from the WHOIS record information on the registrant (name and email), the registrar, the registration date and the registration expiry date. However, as explained in Deliverable 2.1, RFC 3912 states that the WHOIS protocol delivers its content in a human-readable format, rendering automated parsing of WHOIS messages problematic. As of now, only 40% of the harmur domains are associated to fully parsed WHOIS data.
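The difficulty can be illustrated with a minimal sketch of a pattern-based WHOIS parser. It is not the harmur parser: the field labels, the regular expressions and the sample record are assumptions, and real WHOIS servers use many other layouts, which is precisely why a large share of records cannot be fully parsed.

```python
import re

# Hypothetical field labels; actual labels vary from one WHOIS server to another.
FIELDS = {
    "registrar": r"^\s*Registrar:\s*(.+)$",
    "registrant_name": r"^\s*Registrant Name:\s*(.+)$",
    "registrant_email": r"^\s*Registrant Email:\s*(.+)$",
    "created": r"^\s*Creation Date:\s*(.+)$",
    "expires": r"^\s*(?:Registry Expiry Date|Expiration Date):\s*(.+)$",
}


def parse_whois(text):
    """Extract the fields of interest and flag whether all of them were found."""
    record = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        record[name] = match.group(1).strip() if match else None
    record["fully_parsed"] = all(value is not None for value in record.values())
    return record


SAMPLE = """\
Domain Name: EXAMPLE.COM
Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2030-08-13T04:00:00Z
Registrant Name: Jane Doe
Registrant Email: jane@example.com
"""

full = parse_whois(SAMPLE)
partial = parse_whois("Domain Name: EXAMPLE.ORG\nRegistrar: Example Registrar, Inc.\n")
```

A record that uses different labels for any of the fields is reported with fully_parsed set to False, mirroring the partially parsed records mentioned above.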

ADDRESS module. Each IP address observed by harmur is enriched by this module with geolocation information (generated thanks to the Maxmind geolocation library, see footnote 1) and information on the Autonomous System number the address currently resides in. The latter information is generated by querying Team Cymru’s IP-to-ASN mapping service (see footnote 2).

SERVER module. For each IP address observed by the system, the Server module tries to understand the characteristics of the underlying physical system. In practice, the module tries to interact on the HTTP/HTTPS ports with simple HEAD messages, and in case of reply from the server it collects server version information from the HTTP headers.

1 http://www.maxmind.com
2 http://www.team-cymru.org/Services/ip-to-asn.html

CONTENT module. The content module is a recently added module aiming at collecting partial information on the content hosted on a specific web server. For any URL associated to “high relevance” threats according to the harmur information, the module visits the page, and logs the following information:

• Any HTTP redirect encountered during the crawl process (the crawler will then follow the redirect).

• Any IFRAME present in the HTML content (the crawler will recurse to the referenced page).

• Any SCRIPT tag present in the HTML content (recursing to the referenced script if the javascript is not inlined).

• The HTML content of the page(s) visited during the crawl activity.
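The extraction of IFRAME and SCRIPT references from a fetched page can be sketched with the Python standard library HTML parser. This is an illustration only, not the harmur crawler: the class name, the sample page and the URLs are hypothetical, and fetching pages or following HTTP redirects is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the pages and scripts a crawler would have to recurse into."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.iframes = []          # absolute URLs of referenced IFRAME pages
        self.scripts = []          # absolute URLs of referenced scripts
        self.inline_scripts = 0    # number of SCRIPT tags with inlined code

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe" and attrs.get("src"):
            self.iframes.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "script":
            if attrs.get("src"):
                self.scripts.append(urljoin(self.base_url, attrs["src"]))
            else:
                self.inline_scripts += 1


PAGE = """<html><body>
<iframe src="http://198.51.100.5/landing"></iframe>
<script src="/js/loader.js"></script>
<script>var x = 1;</script>
</body></html>"""

collector = LinkCollector("http://example.com/index.html")
collector.feed(PAGE)
```

Relative references are resolved against the URL of the visited page, so that each collected entry can be queued directly as the next crawl step.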

SECURITY module. The security module is in charge of analyzing the security level of each URL and domain analyzed by harmur. Currently, the security module interfaces itself to the Norton SafeWeb API and to the Google SafeBrowsing one, although further extensions are in the works.


Figure 3.3 graphically represents the structure of the harmur WAPI dataset by representing its main objects. The harmur WAPI dataset provides complete access to all information collected by the analysis modules, and makes it possible to browse the timeline of each domain.

Before diving into the details of the dataset, some high level design constraints should be taken into consideration:

1. Similarly to other WAPI datasets such as sgnet, the harmur dataset object offers a reference to a central starting point for any aggregate data analysis: the set object. The set object makes it possible to define a “view” of the dataset by specifying a number of constraints. Only the events matching the defined constraints will be included in the set. The available constraints are:

• source: the name of the URL feed that first included the domain into the analysis

• as_number/as_name: the autonomous system in which the web hosting infrastructure is located

• registrant_name: the name of the registrant indicated in the WHOIS database



Figure 3.3: The harmur data schema exposed over WAPI. The figure shows the objects Dataset, Set, Domain, SecurityState, URL, ThreatClass, Host, Address and AutonomousSystem with their attributes and methods, linked by references such as set, domains, urls, addresses, threatclasses, security_states, hosts, mailservers, nameservers, threats, threats_found, autonomous_system and split_by_{threatclasses}.

• registrar: the name of the registrar

• cidr_network/cidr_prefix: specifies a CIDR IP range in which the hosting infrastructure should be contained

• domain_keyword: a specific keyword to be contained in all the domain names

• hostname: a specific hostname of interest

• version: a specific version string

• threat_type: a specific type of threat (e.g. ‘BREXP’ for browser exploits; the dataset method list_threat_types() lists all the types currently defined)

• threat_id: a specific threat identifier (the dataset method list_threat_ids() lists all the IDs currently defined)

• seenred_ts_start/seenred_ts_end: only the domains witnessed as malicious in the specified timespan will be considered

• ts_start/ts_end: only the domains analyzed within the defined timespan will be considered

Multiple constraints can of course be combined when creating a set. The set then implements a number of aggregate methods (to compute aggregate statistics on the events considered) and references (to access the included objects) to explore the content of the view defined by the above constraints.

2. Whenever applicable and supported by the underlying database schema, harmur methods and references accept two optional arguments: ts_start and ts_end. These two arguments select solely the results generated by harmur in the timespan [ts_start, ts_end], thus providing access to the dynamic connotations of the dataset. Through these two arguments, it is possible for instance to analyze how the DNS resolution of a specific hostname has evolved over time by consecutively calling the same reference while sliding the time window.
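The sliding-window usage of ts_start and ts_end can be sketched as follows. The dns_a function below is a local stand-in operating on hypothetical in-memory observations, not the actual WAPI reference; only the two optional arguments mirror the behavior described above.

```python
from datetime import datetime, timedelta

# Hypothetical observations: (timestamp, hostname, resolved IP address).
observations = [
    (datetime(2011, 5, 1), "www.example.com", "192.0.2.1"),
    (datetime(2011, 5, 3), "www.example.com", "192.0.2.1"),
    (datetime(2011, 5, 9), "www.example.com", "198.51.100.7"),
    (datetime(2011, 5, 16), "www.example.com", "203.0.113.9"),
]


def dns_a(hostname, ts_start=None, ts_end=None):
    """Return the addresses observed for hostname within [ts_start, ts_end]."""
    return sorted({
        ip for ts, host, ip in observations
        if host == hostname
        and (ts_start is None or ts >= ts_start)
        and (ts_end is None or ts <= ts_end)
    })


def sliding(hostname, first, last, width):
    """Call the same reference repeatedly while sliding the time window."""
    start = first
    while start <= last:
        yield start, dns_a(hostname, start, start + width)
        start += width


timeline = list(sliding("www.example.com", datetime(2011, 5, 1),
                        datetime(2011, 5, 21), timedelta(days=7)))
```

Each entry of timeline pairs the start of a one-week window with the addresses seen in that window, which makes changes in the hosting infrastructure directly visible.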


Table 3.2 provides a summary of all the WAPI objects defined in the harmur dataset, together with the references currently implemented on each of the objects.

Dataset

The dataset object is the starting point for traversing the harmur dataset. The dataset object provides a set of commodity methods to analyze the current status of the dataset. More specifically:

• count_domains: simply provides the current count of domains being tracked by harmur.

• list_threat_types: lists the threat types (the high level threat categories) currently known to harmur.

• list_threat_ids: lists the threat IDs currently known to harmur.

• list_sources: lists the different URL feeds that contributed to the harmur URL collection.

As shown in Table 3.2, the dataset provides references to access the main WAPI objects.




WAPI object | Description | References
Dataset | The harmur dataset object. | domain, threatclass, server, url, content, submission, set
Set | A set of harmur events matching given constraints. | autonomous_systems, urls, domains, addresses, threatclasses, content, split_into_threatclasses
Domain | A Fully Qualified Domain Name (FQDN) tracked by harmur. The harmur framework bases all its analysis and scheduling decisions on the notion of domain. | same_registrant, security_states, urls, threats, hosts, mailservers, nameservers, crawl_roots, crawl_children, crawl_parents, content
SecurityState | Represents the security state of a domain according to a specific security information source (e.g. Norton Safeweb) at a given point in time. | threats_found, domain
URL | Represents a URL, which may or may not be associated to a set of threats. Every URL is associated to one and only one domain and to one host. | domain, host, threats, crawlactivities, content, crawlchildren, crawlparents
ThreatClass | Represents the high level description of a threat and is always associated to a threat and to a unique identifier (threat ID). | instances, autonomous_systems
Host | Represents the information collected by harmur on a specific host name. | dns_a, dns_ptr, dns_cname_to, dns_cname_from
Address | Represents the information collected by harmur on a specific IP address. | dns_ptr, dns_a, autonomous_system
AutonomousSystem | Represents the association of a specific IP block to an Autonomous System. Large Autonomous Systems may span several disjoint IP ranges, and may therefore be associated to multiple objects in harmur. | threats
CrawlRoot | Represents the starting URL of a crawl activity. | next, content, all_content
CrawlStep | Represents an edge of the crawling tree, and associates a source URL to a destination URL, together with a type of redirection (e.g. iframe redirection, or HTTP redirection). | url, site, root, parent, next, content
Content | A portion of the website content collected during the crawling process, identified by its MD5 hash. | downloaded_from, roots
Submission | A meta-object allowing the client to submit batches of URLs by uploading compressed CSV files. | (none)

Table 3.2: Summary of harmur WAPI objects.

Set

As previously explained, the set is the equivalent of a “view” on the dataset, defined through a set of constraints that identify the object. While a set is an abstract object and is therefore not associated to any WAPI attribute, it provides a number of methods to compute aggregate statistics on the selected information:

• as ranking: computes a ranking of the top n Autonomous Systems hosting the currently selected domains. When creating a set of domains associated to the BREXP (browser exploit) threat ID, this method provides a high-level overview of the most affected Autonomous Systems.

• count servers: counts the number of servers (IP addresses) belonging to the set.

• count domains: counts the number of domains belonging to the set.

• count analyzed: counts the number of domains belonging to the set that have been analyzed by harmur at least once (newly inserted domains may not yet have been taken into account by harmur, which would skew the analysis).

• color stats: computes the number of domains in the set currently having a specific “color” in harmur. This method is useful to understand the proportion of domains currently infected (red), known to be benign (green), previously infected but now cleaned (orange) and with unknown security state (gray).

• geolocation: returns geolocation information (latitude, longitude and country) for all the servers belonging to the set.

• list feature names: lists the features currently available in harmur.

• get feature vector: given a feature name (as returned by list feature names), returns the value of the feature for every domain belonging to the set.
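The aggregate methods above can be pictured with a small local sketch. It does not call WAPI: the records, the field names (`name`, `color`, `asn`) and the underscore spelling of the function names are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical local stand-in for the domains selected by a harmur set.
# Field names are illustrative and are not the actual WAPI schema.
DOMAINS = [
    {"name": "a.example", "color": "red", "asn": 64500},
    {"name": "b.example", "color": "green", "asn": 64500},
    {"name": "c.example", "color": "red", "asn": 64501},
    {"name": "d.example", "color": "gray", "asn": 64502},
]


def color_stats(selected):
    """Count the selected domains per security color."""
    return Counter(d["color"] for d in selected)


def as_ranking(selected, n):
    """Return the n Autonomous Systems hosting the most selected domains."""
    return Counter(d["asn"] for d in selected).most_common(n)


stats = color_stats(DOMAINS)
infected_share = stats["red"] / len(DOMAINS)
top_as = as_ranking(DOMAINS, 1)
```

In this sketch the share of infected domains is simply the "red" count divided by the set size, which is the kind of proportion the color stats method is meant to support.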

Domain

A Fully Qualified Domain Name tracked by harmur. Every domain object is characterized by the following attributes:

• name: the domain name.

• current color: the color of the domain as derived in the last security analysis.

• first seen: timestamp of the first time the domain was observed by harmur.

• last analyzed: timestamp of the last time the domain was analyzed by harmur.

• whois created at: timestamp of the registration date of the domain according to the WHOIS records.

• whois last updated at: timestamp of the last time the WHOIS record was updated in the registrar.

• whois registrant: the registrant name.

• whois registrant email: the registrant email.

• whois registrant first seen: the first date on which the registrant registered a domain according to the harmur records.

• whois registrant last seen: the last date on which the registrant registered a domain according to the harmur records.

• whois registrar: the registrar name.

Additionally, the following methods are provided by the domain object:

• hashes: returns the list of all the MD5 hashes of content retrieved by the crawling activity while visiting any of the domain URLs.

• summary: returns structured information on the overall domain status with respect to its hosting infrastructure (DNS records, Autonomous System information, geolocation information, ...).

• security checks: returns the timestamps of all the security analyses that harmur has ever performed on the domain.

While most of the references listed in Table 3.2 have a straightforward meaning, a few of them need special attention:

• same registrant: returns a list of domains registered by the same registrant as the current domain.

• crawl roots, crawl children, crawl parents: while the first reference returns a list of CrawlRoot objects associated to URLs belonging to the domain, the latter two references return objects of type Domain. Modeling the crawling activity as the traversal of a crawl tree, the children of a domain are all those domains that the crawler has been redirected to after visiting the current domain. Conversely, the parents of a domain are all those domains that have redirected the crawler to the current domain.

SecurityState

This object simply models the security state of a domain at a given point in time according to a specific source of security information. It is therefore characterized by the following attributes:

• run at: timestamp of the instant at which the security information was generated.

• analyzer: the name of the security information source involved in the analysis.

• color: the color associated to the domain.



URL

A URL tracked by harmur. In the harmur data model, a URL is uniquely associated to a specific domain and to a specific hostname. For data analysis convenience, all harmur URLs are canonicalized and stored in the database in parsed format. A URL therefore has the following attributes:

• url: the full URL string.

• tags: keywords associated to the URL.

• source: the name of the URL feed that first introduced the URL into the dataset.

• url scheme: the scheme of the URL (e.g. ‘http’).

• url netloc: the network location of the URL (e.g. ‘www.google.com:80’)

• url hostname: the hostname associated to the URL (e.g. ‘www.google.com’)

• url username,url password: username and password encoded in the URL (if any)

• url path: the path of the URL
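As an illustration of the parsed format, the sketch below splits a URL into the components listed above using Python's standard `urllib.parse`. The dictionary keys reuse the attribute names with underscores, which is an assumption; the canonicalization that harmur actually applies is not described here and is not reproduced.

```python
from urllib.parse import urlsplit


def parse_url(url):
    """Split a URL into components mirroring the attributes listed above."""
    parts = urlsplit(url)
    return {
        "url_scheme": parts.scheme,
        "url_netloc": parts.netloc,
        "url_hostname": parts.hostname,
        "url_username": parts.username,  # None when no credentials are encoded
        "url_password": parts.password,
        "url_path": parts.path,
    }


fields = parse_url("http://www.google.com:80/search")
with_credentials = parse_url("http://user:secret@www.example.com/index.html")
```

Note that the network location keeps the port (and any credentials), while the hostname does not.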

The structure of the URL methods and references is somewhat similar to what we have previously seen for the Domain object. The method hashes retrieves the list of all the MD5 hashes of content downloaded from a specific URL, while references exist to retrieve all the crawl roots for a URL (crawlactivities) or to retrieve all the parents/children of a URL with respect to all the crawl trees ever generated by harmur.

ThreatClass

A ThreatClass models a high-level threat as defined by the different security sources currently integrated in harmur. A ThreatClass object is currently associated to the following attributes:

• id: the specific threat ID

• type: the high level type of the threat (e.g. BREXP, FAKEAV, ...)

• type description: a more verbose description of the threat type (when available)

• help: when possible, a URL link to a web page describing the threat



On top of the references described in Table 3.2, the following methods are defined:

• count domains: count the total number of domains ever associated to this threat.

• locate: returns the geographical coordinates of all the IP addresses which have ever been associated to this threat.

Host

A Host object models a hostname tracked in the harmur framework. A Host object is typically associated to a number of URLs but always belongs to a single domain. The Host object is characterized by the following attributes:

• name: the hostname.

• first seen: timestamp of the first time the hostname appeared in the context of a harmur analysis.

• last seen: timestamp of the last time the hostname appeared in the context of a harmur analysis.

The Host is associated to a single method, locate, which returns geographical information on the physical location of the host. Several of the references described in Table 3.2, however, need further explanation:

• dns a: follows all the DNS A resource records defined for the host and returns a list of Address objects.

• dns ptr: uses all the DNS PTR resource records to return the list of Address objects that point to the current hostname with PTR records.

• dns cname to: follows all the DNS CNAME records having the current hostname as source name and returns a list of Host objects that the current object points to.

• dns cname from: follows all the DNS CNAME records having the current hostname as destination name and returns a list of Host objects that point to the current object.



Address

An Address object represents an IPv4 address in the harmur framework. It is characterized by the following attributes:

• address: the address represented in string form (e.g. “10.0.0.0”)

• address numeric: the address represented in numerical form (e.g. 167772160)

• first seen: timestamp of the first time the address appeared in the context of a harmur analysis.

• last seen: timestamp of the last time the address appeared in the context of a harmur analysis.

• geo country: country in which the address was localized at the time of the first analysis

• geo city: city in which the address was localized

• geo timezone: timezone information for the address location

• geo latitude: latitude of the address location

• geo longitude: longitude of the address location
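The relation between the string and numerical forms of an address can be checked with Python's standard `ipaddress` module. The helper names below are hypothetical; only the conversion itself is taken from the attribute descriptions above.

```python
import ipaddress


def to_numeric(address):
    """Convert a dotted-quad IPv4 string to its 32-bit integer form."""
    return int(ipaddress.IPv4Address(address))


def to_string(value):
    """Convert a 32-bit integer back to dotted-quad form."""
    return str(ipaddress.IPv4Address(value))


numeric = to_numeric("10.0.0.0")
text = to_string(167772160)
```

The value 167772160 quoted above is 10 × 2^24, i.e. the first octet shifted into the most significant byte.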

Similarly to the Host object, the Address object implements a set of DNS-based references:

• dns ptr: returns a list of Host objects pointed to by DNS PTR records (in simple words, the reverse resolution of the address).

• dns a: returns a list of Host objects that point to the address with DNS A resource records. The returned information generally differs from that returned by dns ptr: the reverse resolution of an address typically returns a single hostname, but many hostnames generally resolve to the same address. For instance, the reverse resolution of 173.194.34.49 points to the name “par03s03-in-f17.1e100.net”, while hosts such as “www.google.com”, “www.picasa.com” and “scholar.google.com” all resolve to that address.



AutonomousSystem

An AutonomousSystem object represents the association of an IP range to a specific Autonomous System number (ASN). In many cases, an ASN is associated to a single IP range, and a single AutonomousSystem object exists in harmur to represent it. For large Autonomous Systems, however, multiple IP ranges are often defined, and thus multiple objects exist for the same ASN. An AutonomousSystem object is characterized by the following attributes:

• prefix: the IP prefix (in CIDR notation)

• as number: the Autonomous System number

• as name: the Autonomous System name

• registry: the registry responsible for the AS

• allocated: the date on which the AS was first allocated

• country: the country the AS is registered with

• first seen: timestamp of the first time the AS appeared in the context of a harmur analysis.

• last seen: timestamp of the last time the AS appeared in the context of a harmur analysis.
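To make the one-ASN, many-prefixes relationship concrete, here is a minimal sketch using the standard `ipaddress` module. The records, the key names and the longest-prefix tie-break are assumptions for illustration; the prefixes and AS numbers come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical AutonomousSystem records: ASN 64500 owns two disjoint prefixes,
# so it is represented by two separate objects.
AS_OBJECTS = [
    {"prefix": "192.0.2.0/24", "as_number": 64500},
    {"prefix": "198.51.100.0/24", "as_number": 64500},
    {"prefix": "203.0.113.0/24", "as_number": 64501},
]


def lookup(address, records):
    """Return the most specific record whose prefix contains the address."""
    ip = ipaddress.ip_address(address)
    matches = [r for r in records if ip in ipaddress.ip_network(r["prefix"])]
    return max(
        matches,
        key=lambda r: ipaddress.ip_network(r["prefix"]).prefixlen,
        default=None,
    )


def prefixes_of(asn, records):
    """List every prefix associated to one AS number."""
    return [r["prefix"] for r in records if r["as_number"] == asn]


owner = lookup("198.51.100.7", AS_OBJECTS)
unknown = lookup("10.0.0.1", AS_OBJECTS)
ranges = prefixes_of(64500, AS_OBJECTS)
```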

CrawlRoot

A CrawlRoot object represents the starting point of a crawling activity. Whenever the harmur analysis modules identify a URL associated to a threat of interest and therefore initiate a crawling activity, that URL is associated to a CrawlRoot. The CrawlRoot is the root of the crawling tree, whose edges are redirects of different kinds and whose nodes are intermediate URLs hosting portions of the content of the page. A CrawlRoot is therefore characterized by the following attributes:

• url: the starting URL

• run at: the timestamp in which the crawl was initiated

• user agent: the HTTP user agent advertised by the crawler

• referrer: the content of the HTTP referrer field used by the crawler



Additionally, the following references are defined:

• next: returns the first CrawlStep object of the crawl tree (if any)

• content: returns the Content object directly associated to the URL (if any)

• all content: returns all the Content objects downloaded as a result of the crawling activity

CrawlStep

A CrawlStep object represents an edge of the crawl tree. A CrawlStep object is characterized by:

• src url: the origin URL

• dst url: the destination URL

• type: the type of redirection/reference (e.g. ‘iframe’, ‘http’, ...)

Additionally, among all the defined references the following ones are worth mentioning:

• parent: returns the previous step of the crawling activity. The returned object can be a CrawlStep object or a CrawlRoot.

• next: returns the next step of the crawling activity.

• content: returns the content object retrieved from the destination URL.
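The parent/next references can be read as links in a chain of redirections. The sketch below models them with plain Python dataclasses; it is a simplification that follows a single path from the root (the text describes a tree), and the class layout and underscore attribute names are assumptions, not the WAPI objects themselves.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(eq=False)
class CrawlRoot:
    url: str
    next: Optional["CrawlStep"] = None


@dataclass(eq=False)
class CrawlStep:
    src_url: str
    dst_url: str
    type: str
    parent: Union[CrawlRoot, "CrawlStep", None] = None
    next: Optional["CrawlStep"] = None


def redirect_chain(root):
    """Follow the next references from a root and list each edge."""
    chain = []
    step = root.next
    while step is not None:
        chain.append((step.src_url, step.type, step.dst_url))
        step = step.next
    return chain


# Hypothetical crawl: a landing page embeds an iframe, which redirects over HTTP.
root = CrawlRoot("http://a.example/")
first = CrawlStep("http://a.example/", "http://b.example/frame", "iframe", parent=root)
second = CrawlStep("http://b.example/frame", "http://c.example/payload", "http", parent=first)
root.next = first
first.next = second

chain = redirect_chain(root)
```

Walking parent from the last step leads back to the CrawlRoot, which matches the statement that parent may return either kind of object.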

Content

A Content object models any type of binary or text content retrieved by harmur in the context of the crawling activities. It is characterized by the following attributes:

• hash md5: the MD5 hash of the binary content

• file size: its length

• file type: the file type identification as returned by libmagic.

• first seen: timestamp of the first time the content appeared in the context of a harmur analysis.

• last seen: timestamp of the last time the content appeared in the context of a harmur analysis.

A Content object provides references to retrieve the list of URLs from which it was downloaded, as well as the list of crawl roots that led to its download. It should also be noted that, through the download method, it is possible to download the binary content directly from harmur; the content is returned HEX-encoded.
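A client receiving HEX-encoded content would typically decode it and check it against the MD5 hash that identifies the object. The following sketch shows that step with the standard library only; the function name and the sample payload are hypothetical, and no call to the actual download method is made.

```python
import hashlib


def decode_download(hex_payload, expected_md5):
    """Decode a HEX-encoded payload and verify it against its MD5 hash."""
    data = bytes.fromhex(hex_payload)
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise ValueError("payload does not match the expected MD5 hash")
    return data


# Simulate what a download might return for a small piece of content.
sample = b"<html>hello</html>"
hex_payload = sample.hex()
md5 = hashlib.md5(sample).hexdigest()

recovered = decode_download(hex_payload, md5)
```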

Submission

A special role is given to the Submission object. Unlike all the other objects, Submission objects do not provide any primitives to retrieve data from the dataset. Instead, they are used to submit batches of relevant content to harmur. At the core of the Submission object is the upload method, which receives as argument a data buffer encoding the URLs to be uploaded to harmur. The data buffer is required to be a gzipped file containing one URL per line, encoded with one of the currently available encoders (at the time of writing, base64, UU and hex encoding are supported). The method returns once all the URLs have been successfully submitted to the harmur dataset.
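A data buffer of the kind described above could be prepared as follows. This is a sketch under assumptions: it gzips first and base64-encodes second, which is one reading of the description, and the function names are hypothetical. The upload call itself is not shown.

```python
import base64
import gzip


def build_submission(urls):
    """Gzip a one-URL-per-line text and encode the result with base64."""
    text = "\n".join(urls) + "\n"
    compressed = gzip.compress(text.encode("utf-8"))
    return base64.b64encode(compressed)


def read_submission(buffer):
    """Inverse of build_submission, useful to check a buffer before upload."""
    text = gzip.decompress(base64.b64decode(buffer)).decode("utf-8")
    return text.splitlines()


urls = ["http://a.example/", "http://b.example/page?id=1"]
buffer = build_submission(urls)
round_trip = read_submission(buffer)
```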

3.3 Spamtrap data (SpamCloud)

3.3.1 Overview

In the third use case of the scenario “Visualization of the threat landscape”, the VIS-SENSE project aims at investigating the global behavior of spam botnets from a strategic perspective, by analyzing and correlating spam campaigns performed through spamming botnets across different email features. The main goal is to leverage visual analytics to help security analysts understand the modus operandi of spammers controlling those botnets and how these are used for spam campaign operations [29].

The primary data used for this use case is provided by Symantec.cloud (formerly known as MessageLabs). As part of their ongoing business, Symantec.cloud sets up and maintains a very large number of spam traps all around the world. All email traffic sent to those spam traps is analyzed by honeypots that extract various features from the emails, including headers, message content, sender’s IP address, name of the bot (if available from CBL [13] rules), embedded URIs, etc. For analysis purposes (e.g., general trends, global spam statistics), the spamtrap traffic is sampled on a daily basis, with about 10,000 random samples stored every day in a SQL database. Sampling is required because the actual spam volume intercepted globally and blocked by the company is overwhelming (several billion messages a day) and thus impossible to store entirely. From January 2012 onwards, the data collection infrastructure and analytics platform of Symantec.cloud was completely re-engineered to significantly increase the number of email samples stored and analyzed for intelligence and trend analysis, with approximately 4 million spam messages collected and analyzed on a daily basis.

For the VIS-SENSE project, we have leveraged this valuable source of information to build a new representative spam dataset called SpamCloud, which is automatically fed by the spamtrap data source maintained by Symantec.cloud. The SpamCloud data collection process was started in October 2010. Until January 2012, about 10,000 spam samples were automatically copied every day into the data set. Due to the modifications recently made by Symantec.cloud, the collection process had to be suspended for a few weeks in January 2012. Since March 2012, SpamCloud has again been fed by the .cloud spam data source, with about 2,000 spam samples inserted every hour, i.e. approximately 50,000 new spam messages per day. Fig. 3.4 illustrates the SpamCloud data collection process.

For every spam message, a number of spam characteristics are collected and inserted into the SpamCloud database. Those spam features can be classified into the following categories according to the aspect of the spam activity they represent:

• bot-related features: these features relate to the type of bot that has likely sent the spam message, such as the bot name or peculiarities in the SMTP dialog (e.g., a specific HELO string). This information is retrieved using the CBL rules [13] during the SMTP session. Although these rules are able to identify many bots, they sometimes fail, either because the bot exhibits a new pattern or because the host is not a bot. Nevertheless, this feature turns out to be of great help in studying spamming botnets.

• host-related features: these features characterize the machine that has sent the spam in terms of its intrinsic properties, which shouldn’t normally be altered by the spammer. Examples include the machine IP address, geo-location, host name (retrieved via reverse DNS lookup), and operating system (obtained through passive OS fingerprinting).

• message-related features: these features describe the characteristics of the spam message. Unlike host-related features, they cannot be used to identify precisely the machines that send spam. However, they can be leveraged to study spam campaigns and the inter-relationships of spam botnets. Examples of such features include specific fields from the message header (From and To domains, character set, content-type, subject line), the main topics of the email, the message size, embedded URI’s, attached filenames, or the language and content of the message.

Most of those features have been described previously in the VIS-SENSE Deliverable D3.1 (Specifications of the Network Analytics Algorithms) - Section 2.1 (Analysis of attack features). Note that some of those fields (e.g., spam trap domains appearing in certain header fields) had to be anonymized before being provided to VIS-SENSE partners.

[Figure 3.4 diagram: spam feeds (about 50,000 spam emails per day) are stored in SpamCloud and made available for analysis. Stored features: Bot (bot signature, OS details); Sending host (IP address / subnets, hostname, country, ISP, ASN); Email (from/to domains, timestamp, full header + body, subject line, embedded URI’s, attachments, SMTP commands, character set, encoding, message-id, language).]

Figure 3.4: Overview of SpamCloud data collection.


Figure 3.5 represents the WAPI-enabled SpamCloud data set, showing all objects and methods that are accessible through WAPI together with their interrelations by means of references. A central point for starting a spam analysis or retrieving new spam samples is the SpamSet object. As explained before, the set object was introduced in the context of VIS-SENSE to provide a more effective way of retrieving and analyzing large amounts of data.

The set object makes it possible to create a “meta-object” that defines constraints for retrieving a set of events from a WAPI dataset. The constraints are similar to “WHERE” clauses in SQL and are used to select only those events (i.e., spam messages in the case of SpamCloud) that correspond to specific characteristics. The methods and references of the set object can then be used to compute aggregate statistics on the selected data, or to retrieve the large number of data samples needed for running a batch analysis, by looping over all spam objects and retrieving the details of each message.

The current implementation of the SpamSet object allows the definition of the following constraints on the collected spam messages:

• start at, end at: define a timespan (using timestamps) and select only spam collected within this time period;

• from: constrain the set to spam messages sent using a specific From domain;

• uri, uri domain: consider only spam messages having a given set of URI’s (or URI domains) embedded in their body;

• saddr, saddr classA, saddr classB, saddr classC: specify an address, or a class A/B/C IP subnet, and consider only events sent by spamming sources within that IP range;

• bot: consider only spam originating from a specific botnet;

• charset: constrain the set to spam messages encoded with a specific character set;

• subject, subj keywords: constrain the set to spam messages having a given subject line or specific keywords in it;

• host: constrain the set to spam messages sent from a machine having a specific host name;

• lang: consider only spam written in a specific language;

• country: consider only spam originating from a given country.
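Since the constraints behave like SQL WHERE clauses, their effect can be sketched as a filter over in-memory records. Everything below is illustrative: the records, the `select` helper and its keyword arguments are assumptions, and the real SpamSet evaluates constraints on the server side through WAPI.

```python
from datetime import datetime

# Hypothetical in-memory spam records; the keys mirror some constraint names
# from the list above but are not the actual SpamCloud schema.
SPAM = [
    {"date": datetime(2012, 3, 15, 10, 0), "bot": "botA", "country": "FR", "lang": "en"},
    {"date": datetime(2012, 3, 15, 18, 30), "bot": "botB", "country": "DE", "lang": "de"},
    {"date": datetime(2012, 3, 16, 9, 0), "bot": "botA", "country": "FR", "lang": "fr"},
]


def select(messages, start_at=None, end_at=None, **equals):
    """Keep messages inside the timespan that match every equality constraint."""
    selected = []
    for message in messages:
        if start_at is not None and message["date"] < start_at:
            continue
        if end_at is not None and message["date"] > end_at:
            continue
        if all(message.get(key) == value for key, value in equals.items()):
            selected.append(message)
    return selected


one_day_bot_a = select(
    SPAM,
    start_at=datetime(2012, 3, 15),
    end_at=datetime(2012, 3, 16),
    bot="botA",
)
from_france = select(SPAM, country="FR")
```

Constraints combine conjunctively in this sketch, as WHERE clauses joined by AND would.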

Besides the Dataset and SpamSet objects, the SpamCloud WAPI data set also provides two other types of objects, Email and Sender, which are explained below together with the associated methods and references for browsing the data set.


Dataset

As for all other WAPI datasets, the Dataset object is the starting point for traversing the SpamCloud dataset. This object provides a set of utility methods to get information on the current status of the dataset:

• count messages: returns the current number of spam messages in the dataset;

• list botnets: returns a list of spamming botnets currently known in SpamCloud;

• list charsets: lists all charsets that have been observed so far;

• list languages: lists all languages that have been observed so far;

• last insert: returns the date of the last inserted spam.

As shown in Fig. 3.5, the dataset provides references to access the main WAPI objects.

[Figure 3.5 diagram: the Dataset, SpamSet, Email and Sender objects, linked by the references set, email and sender.
SpamSet: constraints, count_sources(), count_emails(), count_rcpt(), count_from(), group_by(), geo_locations(), list_feature_names(), get_feature_vector()
Email: date / day, from, rcpt_to, subject, uri, uri_domain, uri_tld, attach, lang, smtp-helo
Sender: ip_addr, classC / B / A, hostname, country, country_code, lat / long, bot, x_p0f_detail, x_p0f_signature]

Figure 3.5: The SpamCloud data schema exposed via WAPI.

Email

An Email object represents a spam email of the SpamCloud data set. It can be instantiated through the email reference method, starting from the dataset or from any other object. Any Email object is characterized by the following attributes:

• date: the sending date of the spam email (in date-time representation);

• day: a string representing only the sending date (e.g., “2012-03-15”);

• from: the From domain of the email;

• rcpt to: the domain of the recipient (To domain);

• subject: the subject line of the email;

• uri, uri domain, uri tld: the set of embedded URI’s (or URI’s domains or URI’s TLD’s);

• attach: the set of attachment filenames;

• lang: the language used for the email;

• smtp helo: the HELO string used in the SMTP dialog.

No particular method is currently implemented on the Email object, apart from the reference sender method, which allows one to retrieve the Sender object corresponding to a given spam email.

Sender

A Sender object represents a spamming machine and provides the following attributes:

• ip addr: the source IP address of the spamming machine;

• classC, classB, classA: only the class C/B/A part of the source IP address;

• hostname: the host name of the spamming machine;

• country, country code: the country (code) of origin of the source IP address;

• lat, long: the latitude and longitude of the source IP address;

• bot: the name of the bot that has sent spam (according to the CBL rules);

• x p0f detail: the full label of the operating system of the spamming machine, as obtained with P0f;

• x p0f signature: the full P0f OS signature associated to the network behavior of the spamming machine.

Any Sender object provides a reference method email that allows one to retrieve all Email objects sent by a particular spamming machine (based on its IP address, bot and host information).
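The class A/B/C parts of a source address correspond to its /8, /16 and /24 networks. The sketch below derives them with the standard `ipaddress` module; representing them in CIDR notation is an assumption, since the exact format stored in SpamCloud is not specified here.

```python
import ipaddress


def class_subnets(ip_addr):
    """Return the /8, /16 and /24 networks that contain an IPv4 address."""
    lengths = (("classA", 8), ("classB", 16), ("classC", 24))
    return {
        name: str(ipaddress.ip_network(f"{ip_addr}/{bits}", strict=False))
        for name, bits in lengths
    }


subnets = class_subnets("192.0.2.10")
```

Grouping senders by the classC value is a simple way to spot several spamming machines in the same /24 range.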

SpamSet

As previously explained, a SpamSet object is equivalent to a “view” on the SpamCloud dataset and is defined through a set of constraints (or WHERE clauses) that match a set of spam emails. While a set is an abstract object and is therefore not associated to any WAPI attribute, it provides a number of methods to compute aggregate statistics on the selected information:

• count sources: counts the number of spamming sources included in the set;

• count emails: counts the number of spam emails included in the set;

• count rcpt: counts the number of (distinct) recipients included in the set;

• count from: counts the number of (distinct) From domains included in the set;

• group by: returns a distribution that counts the number of emails by grouping them according to combinations of specific criteria. One or more of the Email object attributes can be set as grouping criteria;

• geo locations: returns geographical information (latitude, longitude) of all spamming sources included in the set;

• list feature names: lists the features currently available in SpamCloud for this particular spam set;

• get feature vectors: given a feature name (as returned by list feature names), returns the value of the feature for every spam email belonging to the set.
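The distribution returned by group by can be pictured as a count per combination of attribute values. The sketch below reproduces that idea locally with `collections.Counter`; the records and the function signature are assumptions, not the WAPI method.

```python
from collections import Counter

# Hypothetical Email records restricted to a few attributes.
EMAILS = [
    {"day": "2012-03-15", "lang": "en", "from": "example.com"},
    {"day": "2012-03-15", "lang": "en", "from": "example.org"},
    {"day": "2012-03-15", "lang": "fr", "from": "example.com"},
    {"day": "2012-03-16", "lang": "en", "from": "example.com"},
]


def group_by(emails, *criteria):
    """Count emails for each combination of the given attribute values."""
    return Counter(tuple(email[c] for c in criteria) for email in emails)


per_day_and_lang = group_by(EMAILS, "day", "lang")
per_from = group_by(EMAILS, "from")
```

Each key of the result is a tuple with one value per grouping criterion, in the order the criteria were given.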

3.4 BGP data sets (Spamtracer - BGPDB)

3.4.1 Control-plane Data - BGPDB

The goal of BGPDB is to gather BGP routing messages in a single, high-availability, searchable place. This section introduces the data collection process for BGPDB, as well as its WAPI interface.



Overview

The BGP control plane has to be observed from inside a BGP-enabled router. Since the infrastructure necessary to access such a router is quite large, some organizations, such as the Routing Information Service (RIS) branch of RIPE NCC, provide access to so-called looking glasses, which enable anyone to execute a defined set of commands on a router. These commands are usually limited to the following:

• display the current routing table, either in its entirety, or by filtering on a given set of prefixes;

• display various information about the state of BGP on the router: e.g. peering routers, number of exchanged messages per peer and per message type, connection settings (e.g. timeout, . . . );

• display various information about the router: hardware information, version, machine load, . . . ;

• traceroute to a given IP address;

• AS-level traceroute to a given AS;

• ping to a given IP address.

This set of commands permits the observation of the current network state from a given vantage point. On top of this access, RIS provides dump files of every message exchanged with one of those routers, as well as snapshots of their routing table.

The message dump files are created every five minutes and made available in compressed form at [23]. The routing table files are dumped every eight hours and are available from the same place. This network of routers from RIPE is composed of 13 geographically diverse routers. Their locations are pinpointed in figure 3.6.

This geographical diversity is important because it brings a more localized view of the network. Indeed, an observer too far away from the source might not witness some routing events because they have been filtered out or aggregated by a router on the way. This diversity is thus desirable in a prefix hijacking detection situation in order to be able to detect even the smallest events. Moreover, this geographical segmentation helps to assess the range of impact of an event.

Figure 3.6: Location of the 13 RIPE RIS routers.

As mentioned earlier, two types of archives are provided by RIPE: one containing the exchanged BGP messages, referred to as update files, and another one containing the routing table, known as bview files. The update files contain every kind of BGP message [31]:



• open: this message is exchanged between two BGP-enabled routers right after the TCP connection between them is established. It contains BGP status information such as the AS number of each party, the timeout value, . . . If a router accepts the BGP peering, it confirms so by sending a keep-alive message, then starts sending its routing table content using update messages.

• keep-alive: this message is used to avoid connection teardown due to an expiring timeout.

• update: this message is used to announce and withdraw routes among peers. It contains so-called network-layer reachability information (NLRI), i.e. IP prefixes, and a set of attributes applied to the related routes (e.g. AS path, local preference, BGP community, . . . ).

• notification: this message indicates a fatal error, which leads to the end of the current BGP session.

In real BGP operation, the routing information base (RIB), also known as the routing table, is built from the received BGP messages: whenever a router accepts a route from an incoming update message (i.e. ingress filtering does not block the message), it inserts that route in its RIB. A router’s RIB therefore only reflects the contents of the update messages that the router has received. Because of this, storing the bview files inside BGPDB is not strictly necessary. These files are kindly provided by RIPE to make it easier to reproduce a router’s state, without the need to download and parse all the update messages prior to the time period of interest. However, since BGPDB makes access to the data much easier, this is not much of an issue. As an added bonus, only the messages related to the prefixes of interest need to be retrieved, so a lighter version of the RIB can be built, containing only the prefixes relevant to the current analysis. (Of course, building the whole RIB is still possible.) For the update files, only the update messages are inserted into BGPDB, because all other message types only serve to keep BGP operating, not to exchange routing information.
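The idea of rebuilding a (possibly lighter) RIB by replaying update messages can be sketched as follows. This is a simplified illustration, not BGPDB code: the message dictionaries, the `"announce"`/`"withdraw"` type values and the per-peer keying are assumptions, and route selection, filtering policies and most BGP attributes are ignored.

```python
import ipaddress

# Hypothetical update messages, reduced to a few fields.
UPDATES = [
    {"time": 1, "type": "announce", "fromIP": "192.0.2.1",
     "nlri": ["198.51.100.0/24"], "asPath": [64500, 64501]},
    {"time": 2, "type": "announce", "fromIP": "192.0.2.1",
     "nlri": ["203.0.113.0/24"], "asPath": [64500, 64502]},
    {"time": 3, "type": "withdraw", "fromIP": "192.0.2.1",
     "nlri": ["198.51.100.0/24"], "asPath": []},
]


def build_rib(updates, prefixes_of_interest=None):
    """Replay updates in time order; keep the last announced path per (peer, prefix)."""
    wanted = None
    if prefixes_of_interest is not None:
        wanted = {ipaddress.ip_network(p) for p in prefixes_of_interest}
    rib = {}
    for message in sorted(updates, key=lambda m: m["time"]):
        for prefix in message["nlri"]:
            network = ipaddress.ip_network(prefix)
            if wanted is not None and network not in wanted:
                continue  # light RIB: ignore prefixes outside the analysis
            key = (message["fromIP"], network)
            if message["type"] == "announce":
                rib[key] = message["asPath"]
            else:
                rib.pop(key, None)  # a withdrawal removes the route if present
    return rib


full_rib = build_rib(UPDATES)
light_rib = build_rib(UPDATES[:2], ["198.51.100.0/24"])
```

Restricting the replay to the prefixes of interest gives the lighter RIB mentioned above, while passing no restriction rebuilds the whole table.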

On average, the update messages occupy around 110 GB per month and per collector router once inserted in the database.

Figure 3.7 shows the overall BGPDB infrastructure. BGPDB imports data from the RIPE RIS update files, and makes it available to the world through a WAPI server.

Data schema exposed

The data schema offered by BGPDB via WAPI is shown in figure 3.8. Its objects are detailed below.



[Figure 3.7 diagram labels: RIPE RIS raw data, Update messages, Routing Information Base, BGPDB, WAPI server.]

Figure 3.7: Overview of BGPDB data collection.

Information

The object Information contains information about the dataset. Its properties are:

• startTime: a timestamp value indicating the earliest message entry.

• endTime: a timestamp value indicating the latest message entry.

Set

The object Set is the usual WAPI entry point into the dataset. Its properties are:

• constraints: a set of constraints which limits the range of the results of the requests made to the database. These constraints can be of multiple types:

– temporal constraints: define a time window of interest for database requests.

– spatial constraints: limit the IP space of interest for the requests. A limitation on collector routers can also be set.

Its functions are:

• countMessages(): approximates the number of messages returned in the list referenced by messages, according to the constraints specified in constraints.

• countPrefixes(): approximates the number of prefixes returned in the list referenced by prefixes, according to the constraints specified in constraints.

• countRouters(): approximates the number of collector routers returned in the list referenced by routers, according to the constraints specified in constraints.

RouterThe object Router contains information about a collector router. This router is one ofthe 13 pinpointed in figure 3.6. Its properties are:

• hostname: the hostname of the router.

• country: the country in which the router is located.

• city: the city, within the country, where the router is located.

• constraints: a set of constraints which limits the range of answers of the requestsmade to the database. These constraints are similar to those of object Set, exceptfor the restriction on collector routers. Please note that this set of constraints doesnot inherit from the one defined in Set.

PrefixThe object Prefix contains information about a prefix. This prefix has been announcedat least once on the Internet. Its properties are:

• ipAddress: the IP address of the prefix, i.e. the (base) network address.

• mask: the mask for the prefix.

• maskLength: the mask length for the prefix, i.e., the number of leading 1 bits in the mask.

Its functions are:

• lessSpecific(): returns the set of (at-least-once announced) prefixes strictly less specific than the current one.

• moreSpecific(): returns the set of (at-least-once announced) prefixes strictly more specific than the current one.
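The relation between mask and maskLength, and the meaning of "strictly less specific" and "strictly more specific", can be illustrated with a minimal sketch based on the Python standard library. This is not the BGPDB implementation: the helper names and the sample list of announced prefixes are assumptions made for illustration, and the sketch requires Python 3.7 or later for subnet_of().

```python
import ipaddress


def mask_length(mask):
    """Return the number of leading 1 bits in a dotted-quad netmask."""
    bits = format(int(ipaddress.IPv4Address(mask)), "032b")
    ones = bits.rstrip("0")
    if "0" in ones:
        raise ValueError("non-contiguous netmask: %s" % mask)
    return len(ones)


def less_specific(prefix, announced):
    """Announced prefixes that strictly contain `prefix`."""
    p = ipaddress.ip_network(prefix)
    result = []
    for candidate in announced:
        c = ipaddress.ip_network(candidate)
        if c != p and p.subnet_of(c):
            result.append(candidate)
    return result


def more_specific(prefix, announced):
    """Announced prefixes that are strictly contained in `prefix`."""
    p = ipaddress.ip_network(prefix)
    result = []
    for candidate in announced:
        c = ipaddress.ip_network(candidate)
        if c != p and c.subnet_of(p):
            result.append(candidate)
    return result


# Hypothetical list of prefixes announced at least once.
announced = ["10.0.0.0/8", "10.1.0.0/16", "10.1.2.0/24", "192.0.2.0/24"]
covering = less_specific("10.1.0.0/16", announced)
covered = more_specific("10.1.0.0/16", announced)
```

Here covering holds the single less specific prefix 10.0.0.0/8 and covered holds the single more specific prefix 10.1.2.0/24; the prefix itself is excluded from both results because the relations are strict.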



Message

The object Message contains a BGP update message and its BGP properties. Its WAPI properties are:

• type: indicates if the update message withdraws or updates routes.

• time: timestamp at which the message was received by the collector router.

• fromIP: IP address of the router that sent the update to the collector router.

• fromASN: ASN of the router that sent the update to the collector router.

• toIP: IP address of the collector router.

• toASN: ASN of the collector router.

• nextHop: IP address of the next hop router for the route.

• localPreference: the local preference value for the route.

• origin: the origin of the route: IGP, EGP, INCOMPLETE, or NONE. (The last value is not part of the BGP standard, and is only used in case of a withdraw message, which does not have an origin attribute in BGP.)

• med: the value for the multi-exit discriminator.

Its functions are:

• aggregated(): this function returns true if the current route has been aggregated.

• aggregator(): this function returns the IP address of the route aggregator.

• communityCount(): returns the number of entries in the community attribute.

• community(): returns the BGP-community attribute.

• asPathCount(): returns the number of ASNs in the AS path.

• asPath(): returns the AS path for the route.

• nlriCount(): returns the number of prefixes this route applies to.

• nlri(): returns the prefixes affected by this message.


Figure 3.8: The BGPDB data schema exposed over WAPI.

3.4.2 Forwarding-plane Data - SpamTracer

Manipulating the Internet routing infrastructure to hijack an IP prefix automatically modifies the route taken by data packets so that they reach the physical network of the attacker. Based on this assumption, a tool called SpamTracer has been developed to monitor the routes towards malicious hosts by performing traceroute measurements repeatedly for a certain period of time. IP-level routes are also translated into AS-level routes using live BGP feeds. Routing anomalies can then be extracted from the routes and analyzed using the different features available, e.g., the owner of the ASes, the country of the IP hops, the length of the traced routes, etc.

The first motivation for monitoring data-plane routes towards specific malicious hosts is to collect the exact route towards them as soon as malicious activity is observed from them. Then, by repeating the measurements on consecutive days for a certain period of time, routing anomalies can be uncovered in case an attacker releases a previously hijacked network. Such a model basically only allows observing the change from the hijacked state to the normal state, as it is infeasible to monitor the whole IP space in advance in anticipation of a future hijack. However, if candidate networks that are likely to be hijacked in the future can be identified, future hijacks may be caught when they start as well as when they stop.

Overview

The data collection framework SpamTracer is based on a simple linear data flow where a feed of IP addresses to monitor is given as input and a series of enriched traceroute paths with uncovered anomalies is produced as output. The complete data collection process is illustrated in Figure 3.9. The different modules are described below in order to explain the choice of the features offered by the SpamTracer dataset, which is also described later in this document.

IP address feeds.

The only input of the SpamTracer framework is lists of IP addresses or domain names that should be monitored. For each feed, the duration of the monitoring period in number of days can be set depending on the feed profile, e.g., networks likely to be hijacked in the future would be assigned long monitoring periods. Currently the IP address feeds included in SpamTracer are:

Symantec.cloud: IP addresses of spammers sending spam to Symantec.cloud spamtraps. The complete spam dataset is described in Section 3.3.

Alexa.com: Top 500 web sites ranked by alexa.com [1].

Shadowserver: IP addresses of C&C servers uncovered by Shadowserver [6].

Spamhaus DROP: IP prefixes from the blacklist of networks allegedly hijacked by cybercriminals [7].

DShield: IP addresses of malicious hosts identified by DShield [2].

Russian Business Network: IP addresses of hosts identified as belonging to the RBN cyber-criminal organization provided by emergingthreats.com [3].

Malware Domain List: IP addresses of hosts considered dangerous by Malware Domain List [4].




High-profile websites: Websites of possible targets of a malicious hijack, e.g., miscellaneous governmental institutions and some universities and companies.

The IP address feeds currently in SpamTracer are meant to study the correlation between IP prefix hijacking and various malicious activities. However, any type of IP address feed can be given as input to study IP prefix hijacks in another context.

IP traceroute.

A customized version of the classic traceroute function is used; it is implemented in Python using the packet manipulation library Scapy. For each destination host, 30 probe packets with TTLs incremented from 1 up to 30 are sent. Probe packets are sent in parallel to speed up the process. The base probe packet type is ICMP, but when no reply is received for a given TTL, a second round is performed using UDP probe packets. For TTLs from which still no reply was received in the second round, TCP probe packets are used for a third round. In each round, three probe packets are sent for a given TTL before moving on to the next packet type or giving up on that TTL.
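The probing strategy described above can be sketched as follows. This is only an illustration of the fallback logic (ICMP, then UDP, then TCP, with three attempts per TTL in each round), not the SpamTracer code: it does not use Scapy, it probes sequentially rather than in parallel, and send_probe is a hypothetical callable standing in for the real packet sender.

```python
PROBE_TYPES = ("ICMP", "UDP", "TCP")  # order of the three rounds
MAX_TTL = 30
ATTEMPTS_PER_ROUND = 3


def trace(dst, send_probe):
    """Return {ttl: (probe_type, hop_ip)} for the TTLs that answered.

    send_probe(dst, ttl, probe_type) is a hypothetical callable that
    returns the IP address of the replying hop, or None on timeout.
    """
    replies = {}
    pending = list(range(1, MAX_TTL + 1))
    for probe_type in PROBE_TYPES:
        still_silent = []
        for ttl in pending:
            reply = None
            for _ in range(ATTEMPTS_PER_ROUND):
                reply = send_probe(dst, ttl, probe_type)
                if reply is not None:
                    break
            if reply is None:
                still_silent.append(ttl)
            else:
                replies[ttl] = (probe_type, reply)
        # Only the TTLs that stayed silent go to the next round.
        pending = still_silent
    return replies


def fake_send(dst, ttl, probe_type):
    """Stand-in sender: hop 1 answers ICMP, hop 2 only UDP, hop 3 only TCP."""
    if ttl == 1:
        return "192.0.2.1"
    if ttl == 2 and probe_type == "UDP":
        return "192.0.2.2"
    if ttl == 3 and probe_type == "TCP":
        return "192.0.2.3"
    return None


replies = trace("198.51.100.7", fake_send)
```

With the stand-in sender, hop 1 is recorded in the ICMP round, hop 2 in the UDP round and hop 3 in the TCP round, while all other TTLs remain holes in the path.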

Paths uncovered using traceroute may have holes where no ICMP reply was received for some TTLs. To deal with this issue, when no reply is received from a destination host, several IP addresses in the destination IP prefix are pinged to find a reachable host in the same network. This technique allows recording a traceroute path that is more complete than the previous one and that still reaches the same AS.

SpamTracer currently runs with a single traceroute vantage point. As a next step, SpamTracer is going to be deployed in several locations to increase the geographic diversity of the measurements. The way traceroute uncovers the route to a host, by sending a series of independent probe packets, may introduce wrong links due to load balancing at routers. SpamTracer will also integrate the Paris traceroute [5] algorithm, which addresses such route deficiencies.
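The search for a reachable neighbour can be sketched as below. The text does not say which addresses of the destination prefix are pinged, so taking the first few host addresses is an assumption of this sketch, as are the function names and the is_reachable callable, which stands in for a real ping.

```python
import ipaddress
from itertools import islice


def neighbour_candidates(dst_ip, dst_prefix, limit=5):
    """Return up to `limit` other host addresses of the destination prefix."""
    dst = ipaddress.ip_address(dst_ip)
    net = ipaddress.ip_network(dst_prefix)
    if dst not in net:
        raise ValueError("%s is not inside %s" % (dst_ip, dst_prefix))
    others = (str(host) for host in net.hosts() if host != dst)
    return list(islice(others, limit))


def first_reachable(candidates, is_reachable):
    """Return the first candidate for which is_reachable(ip) is true, else None."""
    for ip in candidates:
        if is_reachable(ip):
            return ip
    return None


candidates = neighbour_candidates("192.0.2.1", "192.0.2.0/24", limit=3)
# Hypothetical reachability test: only 192.0.2.3 answers.
neighbour = first_reachable(candidates, lambda ip: ip == "192.0.2.3")
```

The traceroute would then be repeated towards neighbour instead of the original, unreachable destination.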

IP-to-AS mapping.

Due to the many artifacts that can be found in IP-level routes uncovered using traceroute, studying anomalies in the Internet routing infrastructure using only such routes is a complicated task. Looking at the AS level (i) allows looking at network routes from the same perspective as BGP, which matters when studying IP prefix hijacking, and (ii) hides some artifacts of IP-level routes, e.g., load balancing inside ASes, by looking at the network from a higher-level view.

The IP-to-AS mapping is performed using live BGP data queried from multiple RouteViews [10] route servers spread worldwide. Because traceroute is a live measurement, and to allow the AS-level path to be as accurate as possible, it is important that each IP host is mapped to the AS announcing its IP prefix at that moment. Also, the view of the routing in the Internet can differ from one location to another, so the geographic distribution of BGP collectors is important. SpamTracer currently queries seven BGP collectors located on six different continents. Each IP hop is mapped to its IP prefix and the origin AS of this prefix as seen by the different BGP collectors. For traceroute destinations, the BGP AS path as well as other BGP-related information is collected.
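The mapping of an IP hop to (prefix, origin AS) pairs across collectors can be sketched with a longest-prefix match. This is an illustrative stand-in, not the SpamTracer implementation: the real system queries live route servers, whereas the sketch uses small in-memory tables, and the collector names, prefixes and AS numbers (taken from the documentation ranges) are made up.

```python
import ipaddress


def longest_match(ip, table):
    """Return the most specific (prefix, origin_asn) covering `ip`, or None.

    `table` is an iterable of (prefix, origin_asn) pairs as seen by one
    collector. A linear scan keeps the sketch short.
    """
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    if best is None:
        return None
    return (str(best[0]), best[1])


def map_ip_to_as(ip, collectors):
    """Return the set of (prefix, origin_asn) pairs seen across all collectors."""
    pairs = set()
    for table in collectors.values():
        match = longest_match(ip, table)
        if match is not None:
            pairs.add(match)
    return pairs


# Hypothetical views of two collectors.
collectors = {
    "collector-a": [("198.51.100.0/24", 64496), ("198.51.0.0/16", 64500)],
    "collector-b": [("198.51.100.0/24", 64497)],
}
pairs = map_ip_to_as("198.51.100.7", collectors)
```

In this example the two collectors disagree on the origin AS of the same prefix, which is why each IP hop may end up mapped to several (prefix, ASN) pairs.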

Route enrichment.

In the analysis phase of the monitoring process, further information is collected on the traceroute destination host and the different IP hops and ASes traversed.

IP hops information: Information about the IP hops traversed by traceroute paths, including the domain name and the geolocation [?].

ASes information: Information about the ASes traversed by traceroute paths, including the ASN, the IP prefix, the country code, the Regional Internet Registry (RIR), the AS allocation date and the AS owner.

Target network information: Information about the destination network and host of the traceroute, including the presence of the IP prefix in the Team Cymru Bogon list (reserved or unallocated IP blocks) [8] and the Spamhaus DROP list [7].

Data schema exposed

The SpamTracer dataset is available through WAPI, the API developed in the context of the wombat project. The dataset schema is illustrated in Figure 3.10.

The SpamTracer dataset provides two objects available at the beginning of a WAPI session: (i) the Information object and (ii) the Set object. While the Information object is only meant to provide general information about the current dataset, the Set object allows browsing the dataset and retrieving enriched traceroute paths that match a given set of constraints. From a Set instance, it is then possible to explore the dataset by retrieving traceroute (Traceroute) and BGP (BGP Route) routes, traversed IP hops (Host) and ASes (Network) and finally routing anomalies (Routing Anomalies) uncovered from the analysis of the collected routing information. Uncovered routing anomalies consist of abnormal changes in the traceroute or BGP routes towards a given target network that indicate that an IP prefix hijack might have occurred. A target (Target) in the WAPI schema represents a host monitored by SpamTracer, to which multiple traceroute paths have thus been uncovered and from which possible routing anomalies have been extracted. More details on the exploration of the dataset through the different WAPI objects available are provided in the next part of this document. Some scripts also present examples of typical WAPI sessions, for instance, to retrieve traceroute paths towards a given destination IP address or routing anomalies related to a given ASN.

Figure 3.9: Overview of the SpamTracer data collection.

Figure 3.11 illustrates the hierarchy of WAPI objects in the SpamTracer dataset. From the figure we can see that a Target in SpamTracer is defined as a collection of T Traceroute paths to a given IP host. Each Traceroute consists of a series of H Hosts and each Host can be mapped to N Networks, i.e., each IP hop can be mapped to multiple (prefix, ASN) tuples. The BGP Routes collected from C BGP collectors are also available for the last Host of each Traceroute. Finally, R possible Routing Anomalies are uncovered for every Target.



Figure 3.10: WAPI interface to the SpamTracer dataset.

Objects, Methods and References

From the data schemas described above, methods and references provided by the different WAPI objects naturally reflect the information they store and the links to other objects they offer. Table 3.3 provides a summary of the different WAPI objects with their available references. Below we provide a more detailed description of each object in terms of its attributes, its methods and its references.

Information.

The Information object provides general information about the current dataset. Currently, it provides the start and end time of the current dataset. It does not provide any method or reference.

| Object | Description | References |
|---|---|---|
| Data Set | The entry point of every WAPI session. | information, set |
| Information | General information about the current dataset. | / |
| Set | A set of SpamTracer objects matching given constraints (start time, end time, destination IP address, destination ASN and destination country). | traceroutes, bgp_routes, targets, routing_anomalies |
| Target | A target represents a specific host monitored by SpamTracer. | traceroutes, destination_hosts, destination_networks |
| Traceroute | A sequence of IP hops (Hosts), each of them mapped to one or more ASes (Networks), representing a traceroute path. | hosts, networks, target, bgp_routes |
| Host | An IP hop (intermediate router or destination host) traversed by a traceroute. | traceroute, traceroutes_ip_dest, traceroutes_ip_in_path, hosting_networks, target, bgp_routes |
| Network | An AS traversed by a traceroute. | traceroute, traceroute_asn_dest, traceroute_asn_in_path, bgp_routes_asn_origin, bgp_routes_asn_in_path |
| BGP Route | The AS path for the announcement of an IP prefix by an ASN to a specific BGP collector. | / |
| Routing Anomaly (abstract) | A general routing anomaly uncovered from a set of traceroute paths, i.e., a target. | / |
| BGP Origin Anomaly | An anomaly in the ownership of an IP prefix in BGP. | target |
| BGP Path Anomaly | An anomaly in AS paths in BGP. | target |
| Traceroute Destination Anomaly | An anomaly related to the destination host of a traceroute. | target |
| Traceroute Path Anomaly | An anomaly related to a traceroute path. | target |

Table 3.3: Summary of SpamTracer WAPI objects.

Set.

The Set is a special WAPI object whose only purpose is, in this case, to represent a collection of SpamTracer objects that match the constraints given at object creation. These constraints restrict the set of manipulated objects to only the ones of interest. The Set object is the only starting point of a WAPI session on the SpamTracer dataset. The references allow retrieving the Traceroutes, BGP Routes, Targets and Routing Anomalies matching the given constraints. The currently available constraints for a Set are:

• a starting date;

• an end date;

• a destination IP address;

• a destination ASN;

• a destination IP address country.

Its methods are:

• count_traceroutes(): returns the number of traceroute paths matching the constraints

• count_targets(): returns the number of target hosts matching the constraints

• count_bgp_routes(): returns the number of BGP routes matching the constraints

• count_routing_anomalies(): returns the number of routing anomalies extracted from the paths and matching the constraints

As an example, with the Set object and the destination ASN constraint, we can easily query all the traceroute paths to destination hosts located in the given AS.
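The effect of the Set constraints can be mimicked on plain in-memory records, as in the sketch below. It does not use WAPI: the record layout (the keys time, dst_ip, dst_asns and dst_country), the sample values and the inclusive treatment of both time bounds are assumptions made for illustration.

```python
def matches(record, start=None, end=None, dst_ip=None,
            dst_asn=None, dst_country=None):
    """Return True if a traceroute record satisfies every given constraint.

    A constraint left to None is ignored.
    """
    if start is not None and record["time"] < start:
        return False
    if end is not None and record["time"] > end:
        return False
    if dst_ip is not None and record["dst_ip"] != dst_ip:
        return False
    if dst_asn is not None and dst_asn not in record["dst_asns"]:
        return False
    if dst_country is not None and record["dst_country"] != dst_country:
        return False
    return True


def count_traceroutes(records, **constraints):
    """Count the records matching the constraints."""
    return sum(1 for record in records if matches(record, **constraints))


# Hypothetical traceroute records; times are day numbers.
records = [
    {"time": 1, "dst_ip": "192.0.2.10", "dst_asns": [64496], "dst_country": "US"},
    {"time": 2, "dst_ip": "192.0.2.10", "dst_asns": [64496, 64497], "dst_country": "US"},
    {"time": 3, "dst_ip": "198.51.100.5", "dst_asns": [64500], "dst_country": "FR"},
]
n_for_asn = count_traceroutes(records, dst_asn=64496)
n_recent_us = count_traceroutes(records, start=2, dst_country="US")
```

Constraints combine conjunctively, so adding a constraint can only shrink the set of matching objects.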

Target.

The Target object represents a specific host monitored by SpamTracer. It is associated with the set of traceroute paths towards that host. A Target is usually associated with a single IP address. However, when a host is not reachable by traceroute on a given day, SpamTracer tries to traceroute another host in the same network. In this case, a Target is associated with the IP address of the main host and also the IP addresses of possible neighbours. No attribute is provided by that object. Its methods are:

• ip_address(): the IP address of the main host represented by this target

• neighbours_ip_address(): the IP addresses of possible reachable neighbours tracerouted when the main host was not reachable

Finally, routing anomalies are uncovered and investigated for each target individually (see Figure 3.11). The references allow retrieving the destination Hosts (IP hosts), the destination Networks (ASes) and the Traceroutes associated with the Target.

Traceroute.

The Traceroute object represents a traceroute path towards a specific host. Its attributes are:

• src_ip: the source IP address

• dst_ip: the destination IP address

• dst_asns: the list of ASNs the destination host was mapped to

• hop_count: the length of the route in number of IP hosts

• time: the time at which the route was collected

• dst_host_reachable: True if the destination host was reachable, False otherwise

• dst_network_reachable: True if the destination AS was reachable, False otherwise

• unreachable_neighbour: the IP address of the normal destination host, within the same IP block, when that host is unreachable

• spamhaus_drop_hit: True if the destination IP address belongs to an IP prefix listed in the Spamhaus DROP list, False otherwise

• teamcymru_bogon_hit: True if the destination IP address belongs to an IP prefix listed in the Team Cymru Bogon list, False otherwise

• origin: the name of the feed the IP address originates from

The methods return the traceroute path from different perspectives, i.e., at different levels of abstraction. The methods are:

• ip_path(): the sequence of IP hosts traversed



• ip_latency_path(): the sequence of latencies of the IP path

• ip_domain_path(): the sequence of domain names of the IP path

• ip_country_path(): the sequence of the country names of the IP path

• ip_latlon_path(): the sequence of (latitude, longitude) coordinates of the IP path

• ip_prefix_path(): the sequence of IP prefixes (in CIDR notation) of the IP path retrieved from the IP-to-AS mapping

• as_path(): the sequence of ASNs of the IP path

• as_owner_path(): the sequence of AS owner names of the IP path

• as_country_path(): the sequence of the AS country codes of the IP path

• as_registry_path(): the sequence of the Regional Internet Registries (RIRs) of the IP path

• as_allocationdate_path(): the sequence of the AS allocation dates of the IP path

The references allow retrieving the Hosts, the Networks, the Target and the BGP Routes associated with the Traceroute.

Host.

The Host object represents an IP host traversed by a traceroute. It can be either an intermediate router or a destination host. Its attributes are:

• ip: the IP address of the host

• domain_name: the domain name of the host retrieved from the reverse DNS resolution

• geo_countrycode: the country code retrieved from the geolocation of the IP host

• geo_countryname: the country name retrieved from the geolocation of the IP host

• geo_city: the city retrieved from the geolocation of the IP host

• geo_region: the region name retrieved from the geolocation of the IP host

• geo_area: the area retrieved from the geolocation of the IP host

• geo_postcode: the post code retrieved from the geolocation of the IP host

• geo_latitude: the latitude retrieved from the geolocation of the IP host

• geo_longitude: the longitude retrieved from the geolocation of the IP host

• first_seen: the first time the host was seen in the context of SpamTracer

• last_seen: the last time the host was seen in the context of SpamTracer

No method is provided by that object. From a Host it is possible to retrieve, through the references, the Traceroute the Host belongs to, the Traceroutes towards the Host IP address, the Traceroutes traversing the Host IP address, the hosting Networks, i.e., the (ASN, IP prefix) pairs resulting from the IP-to-AS mapping, the Target and the BGP routes related to the Host.

Network.

The Network object represents a network traversed by a traceroute and is characterized by an (ASN, IP prefix) pair. Its attributes are:

• asn: the autonomous system number

• ip_prefix: the IP prefix (in CIDR notation)

• geo_cc: the country code retrieved from the geolocation of the AS

• registry: the Regional Internet Registry (RIR)

• date_allocated: the date of allocation of the AS

• owner: the name of the owner of the AS

• first_seen: the first time the network was seen in the context of SpamTracer

• last_seen: the last time the network was seen in the context of SpamTracer

No method is provided by that object. References offered by the Network object allow retrieving the Host hosted by this Network, i.e., the IP hop this AS was mapped to, the BGP routes with this ASN as the origin ASN and the BGP routes with this ASN in the AS path.



BGP Route.

The BGP Route object represents the AS path as well as other BGP information collected from different BGP collectors while performing the IP-to-AS mapping. A route associates a specific IP address with an IP prefix, an origin AS and the complete AS path to one BGP collector. Its attributes are:

• ip_prefix: the IP prefix (in CIDR notation)

• origin_as: the origin AS

• target_addr: the IP address that was queried

• bgp_collector: the queried BGP collector

• bgp_community: the community of the retrieved BGP entry

• bgp_other_fields: the other fields of the retrieved BGP entry

• first_seen: the first time the BGP route was seen in the context of SpamTracer

• last_seen: the last time the BGP route was seen in the context of SpamTracer

Its only method is:

• as_path(): the AS path of the retrieved BGP entry

A BGP Route is only available for the destination host of a traceroute. No reference is provided by this object.

Routing Anomaly.

The Routing Anomaly object represents a general routing anomaly extracted from the collected traceroute paths and BGP routes. This general routing anomaly object is not meant to be used without being specialized into one of the four types of Routing Anomalies. It provides a generic way to access the common features of all routing anomalies. Its attributes are:

• type: the name of the specialized routing anomaly actually referenced

• subtype: the name of the further classified routing anomaly within each type of routing anomalies

• is_benign: the result of the investigation of extracted Routing Anomalies in order to determine if they result from a benign BGP practice or if they are likely malicious

• explanation: the reason why this routing anomaly has been classified as benign

• first_seen: the first time the routing anomaly was seen in the context of SpamTracer

• last_seen: the last time the routing anomaly was seen in the context of SpamTracer

No method is provided by that object. From a Routing Anomaly object it is possible to retrieve the Target from which the anomaly was extracted.

BGP Origin Anomaly.

The BGP Origin Anomaly object represents an anomaly in the ownership of an IP prefix that is uncovered from BGP data. The anomaly consists in observing multiple conflicting ASes announcing the same or similar IP space. The anomalies considered are referred to as Multiple Origin AS (MOAS) and Sub Multiple Origin AS (SubMOAS) conflicts. The MOAS conflict for a traceroute target host corresponds to a mapping of the host IP address to a single IP prefix that is advertised by multiple ASes. The SubMOAS conflict for a traceroute target host corresponds to a mapping of the host IP address to multiple IP prefixes advertised by multiple ASes. Such conflicts can occur within a single traceroute path when the destination IP address is mapped by the different BGP collectors to multiple ASes and/or IP prefixes. They can also occur across multiple traceroute paths when a routing change occurs between consecutive traceroute measurements. Its specific attributes are:

• asns: the ASNs involved in prefix ownership conflicts

• prefixes: the IP prefixes involved in prefix ownership conflicts

• times: the list of times at which individual conflicts were observed

No method is provided by that object.
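The distinction between MOAS and SubMOAS conflicts given above can be expressed as a small classification function. This is a sketch following the definitions in the text, not the SpamTracer detection code; the function name and the sample (prefix, origin ASN) pairs are illustrative assumptions.

```python
def classify_origin_conflict(mappings):
    """Classify the (prefix, origin_asn) pairs observed for one IP address.

    Returns "MOAS" when a single prefix is advertised by several ASes,
    "SubMOAS" when several prefixes are advertised by several ASes,
    and None when only one origin AS is seen.
    """
    mappings = list(mappings)
    prefixes = set(prefix for prefix, _ in mappings)
    asns = set(asn for _, asn in mappings)
    if len(asns) < 2:
        return None
    if len(prefixes) == 1:
        return "MOAS"
    return "SubMOAS"


# Hypothetical observations reported by different BGP collectors.
moas = classify_origin_conflict([("198.51.100.0/24", 64496),
                                 ("198.51.100.0/24", 64497)])
submoas = classify_origin_conflict([("198.51.0.0/16", 64496),
                                    ("198.51.100.0/24", 64497)])
no_conflict = classify_origin_conflict([("198.51.0.0/16", 64496),
                                        ("198.51.100.0/24", 64496)])
```

The same function applies whether the pairs come from the collectors queried for a single traceroute or are accumulated over consecutive measurements of the same target.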

BGP Path Anomaly.

The BGP Path Anomaly object represents an anomaly in the AS path that is uncovered from BGP data. The anomaly consists in observing a change in next-hop ASes, i.e., the AS just after the origin AS, or a significant change in the complete AS paths. Its only specific attribute is:

• time: the time at which the anomalous routing change occurred
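A change in the next-hop AS can be detected from a time-ordered series of AS paths, as sketched below. The sketch assumes that each AS path is a list of ASNs ordered from the collector side to the origin AS (so the origin is the last element), collapses repeated ASNs caused by path prepending, and uses made-up AS numbers; the helper names are not part of WAPI.

```python
def collapse(as_path):
    """Remove consecutive duplicate ASNs (AS path prepending)."""
    collapsed = []
    for asn in as_path:
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)
    return collapsed


def next_hop_as(as_path):
    """Return the AS adjacent to the origin AS, or None for short paths."""
    path = collapse(as_path)
    return path[-2] if len(path) >= 2 else None


def next_hop_changes(timed_paths):
    """Return (time, old_next_hop, new_next_hop) for every observed change.

    `timed_paths` is a list of (time, as_path) pairs sorted by time.
    """
    changes = []
    previous = None
    for time, path in timed_paths:
        current = next_hop_as(path)
        if previous is not None and current is not None and current != previous:
            changes.append((time, previous, current))
        if current is not None:
            previous = current
    return changes


# Hypothetical AS paths towards origin AS 64496 on three consecutive days.
observed = [
    (1, [64500, 64501, 64496]),
    (2, [64500, 64501, 64496, 64496]),  # prepending only, no real change
    (3, [64500, 64502, 64496]),         # next-hop AS changes
]
changes = next_hop_changes(observed)
```

Only the third observation is reported, since the second path differs from the first by prepending alone.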


Traceroute Destination Anomaly.

The Traceroute Destination Anomaly object represents an anomaly related to the destination host of a traceroute. This type of anomaly currently includes (i) hop count anomalies, i.e., a significant change in the length of traceroute paths, (ii and iii) IP (AS) reachability anomalies, i.e., a significant change in the IP (AS) reachability of the destination host, and (iv) destination AS country anomalies, i.e., the destination host being mapped to multiple ASes located in multiple countries. Its only specific attribute is:

• time: the time at which the anomaly was observed



Traceroute Path Anomaly.

The Traceroute Path Anomaly object represents an anomaly uncovered in traceroute paths. The anomaly consists in observing a significant change in the AS- or country-level paths. Its only specific attribute is:

• time: the time at which the anomaly was observed



Investigation of a BGP hijack case

In this section we illustrate how an analysis of traceroute paths and BGP data using the WAPI interface can be carried out to investigate a BGP hijack case.

Q We start from AS 31733, which is suspected of having been hijacked. How can we retrieve the collected data about this AS?

A We can query the SpamTracer dataset for a given ASN:

    # Let's build a Set object with a destination AS constraint
    s = spamtracer.set(dst_asn=31733)[0]

Q SpamTracer automatically extracts some types of routing anomalies from the collected data. How can we retrieve the routing anomalies extracted for our ASN and what information can we have about them?

A We can query the routing anomalies matching our previously created constrained dataset:

    # Let's directly look at the routing anomalies extracted
    anomalies = s.routing_anomalies()

    # Now let's print some information about the anomalies
    for anomaly in anomalies:
        print anomaly.type, anomaly.subtype, anomaly.first_seen

Q From the extracted routing anomalies, we can in fact see that a significant change was observed in the AS paths, the country paths and in the next-hop AS. Where can we observe these routing changes in the data?

A We can retrieve the traceroute paths and look at the AS and country paths:

    # Let's first retrieve the traceroute paths
    tr = s.traceroutes()

    # Now we can query and print the AS- and country-level paths
    # before and after the first anomaly
    tr_before = [t for t in tr if t.time < anomalies[0].time]
    tr_after = [t for t in tr if t.time >= anomalies[0].time]

    print "Before the anomaly:"
    for t in tr_before:
        print t.as_path()
        print t.ip_country_path()

    print "After the anomaly:"
    for t in tr_after:
        print t.as_path()
        print t.ip_country_path()

Q We can indeed observe a significant change in the traceroute AS paths and the country paths. Before the anomaly, we can see that the paths go to the US. After the anomaly, the paths go to Russia. We can see in the AS and country paths that the origin AS and the country of the last IP hop are always correct, but the change of next-hop AS and country looks suspicious. Can we get more information about this next-hop?

A We can look at the owners of the next-hop ASes and at the next-hop IP hosts:

    # Let's consider the traceroute paths one more time
    traceroutes = s.traceroutes()

    # Now let's retrieve the owners of the next-hop ASes
    nexthop_ases_owner = set([traceroute.networks[-2].owner
                              for traceroute in traceroutes])

    # We can also retrieve the next-hop IP hosts
    nexthop_hosts = [traceroute.hosts[-2] for traceroute in traceroutes]

    # We can print the set of next-hop AS owners
    print nexthop_ases_owner

    # We can also print information about the next-hop IP hosts
    for host in nexthop_hosts:
        print host.domain_name, host.geo_countryname
        print host.geo_latitude, host.geo_longitude

Q Can we observe a similar routing change in the AS paths in the collected BGP data for all the BGP collectors used?

A We can retrieve the BGP routes associated with the traceroute paths:

    # Let's consider the traceroute paths again
    traceroutes = s.traceroutes()

    # Let's retrieve the BGP routes from the traceroute paths
    bgproutes = list()
    for traceroute in traceroutes:
        for bgproute in traceroute.bgp_routes:
            bgproutes.append(bgproute)

    # We can now print some features of the BGP routes
    for bgproute in bgproutes:
        # the IP prefix, the origin AS and the BGP collector
        print bgproute.ip_prefix, bgproute.origin_as, bgproute.bgp_collector
        # the AS path
        print bgproute.as_path()

    # We can see that all BGP collectors observed the routing change.



Figure 3.11: Hierarchy of SpamTracer WAPI objects.


4 Interaction with Upper Layers (Preview)

4.1 Introduction

In the previous sections we have shown how we have opened access to the raw data sets of the VIS-SENSE data infrastructure. While this aspect is essential to enable partners to retrieve the data required for developing new analysis and visualization techniques, it is not sufficient for enabling the more advanced data processing and visualization tasks that will be performed in the upper layer of the VIS-SENSE framework. Indeed, for visual attack attribution, but also to enable advanced network correlation based on visual analytics, the raw data needs to be preprocessed, enriched, clustered, indexed and analyzed using specific data analytics algorithms. Therefore, we have started integrating the triage data analytics framework into the VIS-SENSE infrastructure in order to automatically pre-process the raw data as soon as new events are inserted into any of the datasets.

As previously described in D1.2 (Use case analysis and user requirements), triage is an attack attribution software module that relies on data fusion techniques and leverages multi-criteria decision algorithms to cluster security or attack events. Thanks to this data triage processing, virtually any type of security event can be automatically grouped together based upon a number of common elements (or features) likely due to the same root cause. As a result, triage can identify more complex patterns showing various types of relationships among series of attacks or groups of disparate events, thus giving insights into the manner in which attack campaigns and large-scale attack phenomena are being orchestrated by cyber criminals and, more importantly, also revealing the modus operandi of their presumed authors.

The triage workflow, as well as the clustering algorithms and multi-criteria decision analysis (MCDA) techniques used within the framework, have already been introduced in D1.2 (Use case analysis and user requirements) and specified in D3.1 (Specifications of the Network Analytics Algorithms) respectively. To make this deliverable as self-contained as possible, we briefly recall below the main concepts underlying triage and, more importantly, highlight the recent improvements made regarding the scalability and usability of the framework. In Section 4.3, we then describe how the triage framework has been WAPI-enabled in order to provide analytical services that are remotely accessible. By doing so, we enable triage to become a meta-data and service provider within the overall VIS-SENSE framework.

Fig. 4.1 depicts the general triage approach as it was originally conceived and developed by Thonnard in [28], and applied to experimental data in the wombat project to address the attack attribution problem [15, 30]. As shown in Fig. 4.1, a typical triage analysis relies on three steps or components:

(i) Feature selection: we determine which features to include in the multi-criteria analysis. Each element of the data set (termed event in the triage jargon) is then represented by a set of feature vectors, one extracted for each of these features;

(ii) Single-feature clustering: an undirected edge-weighted graph is built for each feature selected at step (i) using an appropriate distance metric. A clustering algorithm is then applied in order to find all significant patterns for each characteristic;

(iii) MCDA data fusion: the different cluster patterns uncovered at step (ii) are combined using an aggregation function that models some expert knowledge regarding the expected behavior of the phenomena under study.

Figure 4.1: Overview of the original triage approach. Events go through feature selection and a per-feature graph-based analysis; a multi-criteria aggregation step (data fusion) then yields Multi-Dimensional Clusters (MDC’s) for visualization. The aggregation can model “vague statements” on the number of criteria (OWA) and interactions among criteria (Choquet integral).

As the outcome of this triage processing, we obtain so-called Multi-Dimensional Clusters (or MDC’s for short), which group security events that are correlated by a number of common traits (i.e., a certain combination of features). Such a combination is likely to reflect a common underlying root cause or a common modus operandi.
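The three steps and the resulting MDC’s can be illustrated with a deliberately small sketch. It is not the triage implementation: it skips the per-feature clustering algorithm of step (ii) and fuses the per-feature edge weights directly, it uses an OWA operator as the aggregation function, and it treats the connected components of the fused graph as MDC’s. The events, feature names, similarity functions, weights and threshold are all made-up assumptions.

```python
from itertools import combinations

# Toy events; feature names and values are illustrative assumptions.
EVENTS = {
    "e1": {"subject": {"cheap", "pills", "now"}, "sender": "botA", "url": "x.example"},
    "e2": {"subject": {"cheap", "pills"}, "sender": "botA", "url": "y.example"},
    "e3": {"subject": {"win", "lottery"}, "sender": "botB", "url": "z.example"},
    "e4": {"subject": {"win", "lottery", "today"}, "sender": "botC", "url": "z.example"},
}


def jaccard(a, b):
    """Similarity between two token sets (1.0 means identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def equal(a, b):
    """Crisp similarity for categorical values."""
    return 1.0 if a == b else 0.0


def owa(scores, weights):
    """Ordered Weighted Average: weights apply to the scores sorted in descending order."""
    ranked = sorted(scores, reverse=True)
    return sum(w * s for w, s in zip(weights, ranked))


def feature_graphs(events, features):
    """Step (ii), simplified: one edge-weighted graph per feature, as {(i, j): similarity}."""
    return {
        name: {
            (i, j): sim(events[i][name], events[j][name])
            for i, j in combinations(sorted(events), 2)
        }
        for name, sim in features.items()
    }


def fuse(graphs, weights, threshold):
    """Step (iii): aggregate the per-feature edge weights and keep the strong edges."""
    pairs = next(iter(graphs.values())).keys()
    return {
        pair for pair in pairs
        if owa([g[pair] for g in graphs.values()], weights) >= threshold
    }


def components(nodes, edges):
    """Connected components of the fused graph, standing in for MDC's."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=sorted)


# Step (i): choose the features and how to compare them.
FEATURES = {"subject": jaccard, "sender": equal, "url": equal}
graphs = feature_graphs(EVENTS, FEATURES)
# Weights (0.5, 0.5, 0) only reward the two best-matching features of each pair.
strong = fuse(graphs, weights=[0.5, 0.5, 0.0], threshold=0.75)
mdcs = components(sorted(EVENTS), strong)
print(mdcs)
```

With these OWA weights, two events are linked as soon as any two of the three features agree strongly, whichever two they are. This is the kind of “vague statement” on the number of criteria that an analyst may want to express without naming specific features.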

Note that this kind of multi-dimensional clustering, which may involve different combinations of correlated characteristics, is conceptually close to what is referred to as subspace clustering for high-dimensional data [32], an extension of traditional clustering techniques that seeks to find clusters in different subspaces within a dataset. In high-dimensional data, many dimensions are often irrelevant and can mask existing clusters due to noisy data. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces [17, 24]. However, while the two approaches (MCDA and subspace clustering) share a fairly similar goal, namely finding relevant clusters in high-dimensional space, the techniques they use to achieve it are very different.

Improvements

In the user requirements described in VIS-SENSE Deliverable D1.2, we identified some important requirements regarding the triage algorithms, in particular for improving the following aspects:

• scalability of the clustering algorithms, which is required to apply the triage analytics to larger data sets, like those available in the VIS-SENSE data collection infrastructure;

• usability of the framework, to help define appropriate MCDA data fusion models that truly reflect a high-level behavioral model that security analysts can only define more “vaguely” (e.g., defining importances and pairwise interactions, like synergies or redundancies, among different features or combinations thereof).

Much R&D work has already been done during VIS-SENSE towards these goals. As a result, substantial improvements in terms of scalability have been achieved thanks to a re-engineering of the triage clustering approach, and also thanks to a new design of the internal data structure for representing clusters. triage now leverages a new two-level architecture based on prototypes, which provides an automatic compression of the raw data sets and thus enables a more scalable clustering approach applicable to large data sets. This two-level cluster/prototype architecture is briefly explained below, before we move to the description of the new triage service available over WAPI.

A more detailed description of the new prototype-based extraction and clustering algorithm is provided in Deliverable D3.3 (Attack Attribution Module). The idea that we have developed for improving the scalability of the triage framework consists in pre-processing the raw data using a fast and lightweight prototype extraction algorithm to automatically compress the data set. That is, we try to exploit the intrinsic properties of the data by summarizing groups of very similar data objects by so-called prototypes: feature vectors or patterns that are typical for a group of homogeneous events. By restricting subsequent computations to a reduced set of prototypes and later propagating those results to the complete data set, we are not only able to accelerate the clustering, but we can also apply this method to much larger data sets. Prototype-based clustering allows for run-time improvements over exact methods while inducing a minimal approximation error. Furthermore, the extracted prototypes correspond to typical patterns found in the data set and can thus be easily inspected by a human analyst or be used later for visualization.

Extracting a small yet representative set of prototypes from a data set is not a trivial task. Recently, Rieck et al. [26] developed a method based on machine learning (combining both clustering and classification) for the automatic analysis of malware behavior, adapting a linear-time algorithm by Gonzalez [16] to provably determine a set of prototypes which is only twice as large as the optimal solution. Our prototype-based clustering method is a variant inspired by those previously developed algorithms and is further described in Deliverable D3.3.
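To make the idea of prototype extraction concrete, the following sketch implements a greedy farthest-first traversal in the spirit of the Gonzalez algorithm [16]. It is not the triage variant (which is specified in D3.3): the function name, the Euclidean distance, the `max_dist` stopping radius and the sample points are assumptions chosen for illustration only.

```python
import math


def extract_prototypes(points, max_dist, dist=math.dist):
    """Greedy farthest-first traversal.

    Keeps promoting the point that is farthest from the current prototypes
    until every point lies within max_dist of some prototype.
    Returns (prototype indices, index of the nearest prototype for each point).
    """
    if not points:
        return [], []
    prototypes = [0]                              # arbitrary first prototype
    nearest = [0] * len(points)                   # closest prototype so far
    gap = [dist(points[0], p) for p in points]    # distance to that prototype
    while max(gap) > max_dist:
        far = max(range(len(points)), key=gap.__getitem__)
        prototypes.append(far)
        for i, p in enumerate(points):
            d = dist(points[far], p)
            if d < gap[i]:
                gap[i], nearest[i] = d, far
    return prototypes, nearest


# Hypothetical 2-D feature vectors: two tight groups and one isolated event.
points = [(0, 0), (0.1, 0), (5, 5), (5, 5.2), (0, 0.2), (9, 0)]
prototypes, nearest = extract_prototypes(points, max_dist=1.0)
print(prototypes, nearest)
```

Each pass over the data costs one distance computation per point, so the total cost grows with the number of points times the number of prototypes. The points assigned to a prototype play the role of what the text calls its neighbors, and later computations can be restricted to the prototypes alone.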

Regarding the usability of the framework, we also made good progress on developing methods to assist the user in defining MCDA parameters. Different approaches have already been tested, such as the novel idea of using optimization to obtain parameters for the aggregation functions from vague statements given as input by the user (e.g., combining linguistic operators and high-level expressions of pairwise interactions among features, like synergies and redundancies). A first step towards this goal was made in 2011 in the context of an industrial project in collaboration with the Institute for Pure and Applied Mathematics (UCLA) [27]. More work is ongoing to further improve the usability of the MCDA aggregation functions used in triage, through the development of methods that assist with the definition and fine-tuning of aggregation parameters (see Deliverable D3.3).
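As an illustration of what such aggregation parameters encode, the sketch below evaluates a discrete Choquet integral, the aggregation function that Fig. 4.1 associates with interactions among criteria. The feature names and every value of the fuzzy measure are invented for the example; they are not parameters taken from triage.

```python
def make_measure(weights):
    """Turn {("a", "b"): w} into a lookup keyed by frozenset of criteria."""
    return {frozenset(names): w for names, w in weights.items()}


def choquet(scores, measure):
    """Discrete Choquet integral of per-criterion scores in [0, 1].

    scores:  {criterion: value}
    measure: {frozenset of criteria: weight}, monotone, with weight 1 for the full set.
    """
    order = sorted(scores, key=scores.get)        # criteria by ascending score
    total, previous = 0.0, 0.0
    for k, name in enumerate(order):
        coalition = frozenset(order[k:])          # criteria scoring at least this much
        total += (scores[name] - previous) * measure[coalition]
        previous = scores[name]
    return total


# Hypothetical measure: sender and url are redundant (0.5 < 0.4 + 0.4),
# while subject acts in synergy with either of them (0.9 > 0.3 + 0.4).
MEASURE = make_measure({
    (): 0.0,
    ("subject",): 0.3,
    ("sender",): 0.4,
    ("url",): 0.4,
    ("sender", "url"): 0.5,
    ("subject", "sender"): 0.9,
    ("subject", "url"): 0.9,
    ("subject", "sender", "url"): 1.0,
})

redundant_pair = choquet({"subject": 0.0, "sender": 1.0, "url": 1.0}, MEASURE)
synergy_pair = choquet({"subject": 1.0, "sender": 1.0, "url": 0.0}, MEASURE)
print(redundant_pair, synergy_pair)
```

Both event pairs agree on exactly two features, yet the second one receives a much higher aggregate score, because the measure treats agreement on sender and url as partly redundant evidence. Choosing such measure values by hand is difficult, which is why assisting the user with their definition matters.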

4.2 Two-Level Cluster/Prototype Representation

Figure 4.2 represents the new two-level cluster/prototype data architecture used internally in the triage system for multi-dimensional analysis.

Figure 4.2: triage internal data architecture for multi-dimensional analysis, which is based on a two-level cluster/prototype representation. The diagram places prototypes (data compression) on a grid of features (Feat 1, Feat 2, Feat 3, …) against time (Day 1, Day 2, Day 3, …). A Cluster is a single-feature analysis across multiple days; an MDC is a multi-feature analysis on a single day.

• Prototype points are extracted for each feature (or characteristic) of the data set, in the context of a specific clustering analysis, usually on a daily basis. The time span can be adjusted (e.g., to a weekly or monthly basis) depending on the size of the data set and its growth.

• Cluster data structures group prototypes of a given feature across multiple analyses (e.g., across multiple days). Clusters may thus grow over time, and new prototypes are added incrementally to an existing cluster when needed (i.e., when new prototypes are identical or very close to previous ones).

Finally, MD Clusters are multi-dimensional clusters (or MDC’s) that are generated across all features by linking all prototypes identified previously using data fusion, i.e., by aggregating all feature similarities (MCDA fusion). An MDC is defined for a specific analysis date. However, new MD Clusters can still be compared globally to previous ones and linked to the closest MDC’s, so that we can follow a phenomenon matching an MDC that possibly spans multiple days (e.g., a spam campaign lasting for one week, an attack campaign lasting for several months, etc.).
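The two levels can be pictured with a small data-structure sketch. The class and field names, the `add_incrementally` helper and the distance threshold are hypothetical and do not reflect the internal triage schema. The sketch only shows how per-day prototypes of one feature may be attached to a cluster that grows across days.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Prototype:
    feature: str          # feature this prototype was extracted for
    day: int              # analysis date (here a simple day counter)
    pattern: tuple        # typical feature vector
    neighbors: list = field(default_factory=list)   # ids of the events it summarizes


@dataclass
class Cluster:
    feature: str
    prototypes: list = field(default_factory=list)

    def size(self, count_events=False):
        """Number of prototypes, or total number of neighbor events."""
        if count_events:
            return sum(len(p.neighbors) for p in self.prototypes)
        return len(self.prototypes)


def add_incrementally(clusters, proto, max_dist, dist=math.dist):
    """Attach proto to the first same-feature cluster that already holds a
    prototype within max_dist; otherwise open a new cluster for it."""
    for cluster in clusters:
        if cluster.feature == proto.feature and any(
            dist(proto.pattern, q.pattern) <= max_dist for q in cluster.prototypes
        ):
            cluster.prototypes.append(proto)
            return cluster
    cluster = Cluster(proto.feature, [proto])
    clusters.append(cluster)
    return cluster


clusters = []
add_incrementally(clusters, Prototype("subject", 1, (0.0, 0.0), ["e1", "e2"]), 0.5)
add_incrementally(clusters, Prototype("subject", 2, (0.1, 0.0), ["e5"]), 0.5)
add_incrementally(clusters, Prototype("subject", 2, (5.0, 5.0), ["e6", "e7", "e8"]), 0.5)
print([(c.size(), c.size(count_events=True)) for c in clusters])
```

The first cluster ends up spanning two days, which is how a single-feature pattern can be followed over time without re-clustering the raw events.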


4.3 TRIAGE-as-a-Service


As introduced in Section 1 (Fig. 1.1), triage is integrated as a central element of the data infrastructure and will thus also provide meta-data (e.g., clustering results) as well as various analytical services to the VIS-SENSE framework. Just as each data set is transformed into a data provider, triage uses the very same WAPI mechanism to become a service and meta-data provider for the visual analytics layer.

Since this deliverable is about the data collection infrastructure, we describe here how triage interacts with the other components in its role of meta-data provider, and more particularly the triage data schema and the various methods that are exposed over WAPI. A more complete description of the (visual) interactions that will be developed and implemented in the visual analytics modules will be provided in Deliverables D3.3 (Attack Attribution Module) and D4.x, which are dedicated to studying in depth how to integrate new visualization techniques with the network data analytics techniques that are specifically developed in this project to improve attack attribution and security analysis.

Figure 4.3 represents the WAPI-enabled triage data set, showing all objects and methods that are accessible through WAPI, together with their interrelations by means of references. As for all other WAPI datasets, the triage Dataset object is the starting point for traversing the dataset and provides some utility methods to get general information on the dataset:

• list_datasets: lists all datasets available to a given user;

• list_analyses: lists all available MD analyses performed by a given user.

Note that two particular classes of objects (TR_RandSet and TR_FMTool) were defined as meta-objects for providing various analytical services, and will thus be further explained in the appropriate deliverables (D3.3 and D4.x). The other classes of objects were defined to provide access to triage data structures and are described hereafter.

Figure 4.3: Schema of triage as meta-data and service provider (accessible over WAPI). The schema shows the following objects:

• TR_Dataset: attributes id, label, size, last_update; method get_sample().

• TR_Analysis: attributes id, label, dataset; methods get/set_params(), get_ops_status(), run_clustering(), run_MCDA(), get_matrix(), delete_all().

• TR_Event: attributes id, dataset; methods get_features(), compare().

• TR_Prototype: attributes id, dim, created_on; methods get_size(), get_compactness(), get_patterns(), get_summary(), delete().

• TR_Cluster: attributes id, dim, last_update; methods get_size(), get_compactness(), get_patterns(), get_summary(), add_prototypes(), del_prototypes(), delete().

• TR_MDCluster: attributes id, created_on; methods get_size(), get_compactness(), get_patterns(), get_summary(), add_prototypes(), del_prototypes(), delete(), merge().

• TR_FMTool: attributes label, feature_list; methods set_values(), set_importances(), set_interactions(), get_aggregation(), set_threshold().

• TR_RandSet: attributes constraints, size, params; methods get_feature_vector(), get_similarity(), get_protos(), get_sim_mat(), get/set_params().

The references drawn between these objects are labeled analysis, Tr_Dataset, Tr_Analysis, events, neighbors, prototype, prototypes, (md)clusters, FMTool and randset.

TR_Dataset

The TR_Dataset object represents a VIS-SENSE dataset and thus wraps one of the five datasets defined previously. It can be instantiated from the main triage object by following the appropriate reference method and by providing the required id or label. This object has two other attributes: size, which gives the total number of events analyzed by triage for the instantiated dataset, and last_update, which gives the date of the most recently analyzed event. The TR_Dataset currently provides a method get_sample, which returns a random sample event and all its characteristics. It is worth noting that the TR_Dataset object is not a set meta-object providing an aggregate view on a dataset (as defined previously for other WAPI datasets), but is instead an ordinary WAPI object.

TR_Analysis

The TR_Analysis object represents an MCDA analysis performed by triage on a given dataset. This object can be instantiated either from a TR_Dataset object or directly from the main triage object by following the appropriate reference method and providing the required arguments (id or label). A TR_Analysis object also provides the following set of methods:

• get_params and set_params: used to retrieve or to set analysis parameters, such as the list of feature names included in the analysis, the clustering parameters, and the MCDA weighting vectors, feature interactions and thresholds.

• run_clustering: starts a clustering analysis using the parameters defined previously with set_params. A clustering analysis must be run for a specific feature and date (i.e., all events observed on that date will be clustered), and can be set to run incrementally or not. Calling this method creates new Prototypes and Clusters.

• run_MCDA: starts an MCDA (Multi-Criteria Decision Analysis) run using the parameters (weights, thresholds) defined previously. Calling this method creates new MDClusters (MDC’s).

• get_ops_status: utility method to retrieve the current operational status of the analysis. This method reports the progress of any running task (e.g., a clustering task that was started).

• get_matrix: retrieves a similarity matrix calculated for a given feature, or an aggregate matrix, as computed on a specific analysis date.

• delete_all: deletes all Prototypes, Clusters and MDClusters that have been created in this analysis on a specific analysis date.
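The order in which these methods are meant to be called can be sketched with an in-memory stand-in. This is not the WAPI client and makes no claim about the real signatures: the class `FakeAnalysis`, all argument names, the status strings and the rule that every feature must be clustered before the fusion step are assumptions made for illustration. Only the method names come from the list above.

```python
class FakeAnalysis:
    """In-memory stand-in for a TR_Analysis object (hypothetical, not WAPI)."""

    def __init__(self, label):
        self.label = label
        self.params = {}
        self.clustered = set()     # (feature, date) pairs already clustered
        self.mdc_dates = set()     # dates for which MDC's exist
        self.status = "idle"

    def set_params(self, **params):
        self.params.update(params)

    def get_params(self):
        return dict(self.params)

    def run_clustering(self, feature, date, incremental=True):
        # A clustering run targets one feature and one analysis date.
        if feature not in self.params.get("features", ()):
            raise ValueError(f"unknown feature: {feature}")
        self.clustered.add((feature, date))
        self.status = f"clustering done: {feature} on {date}"

    def run_MCDA(self, date):
        # Assumed precondition: prototypes must exist for every feature.
        missing = [f for f in self.params["features"] if (f, date) not in self.clustered]
        if missing:
            raise RuntimeError(f"cluster these features first: {missing}")
        self.mdc_dates.add(date)
        self.status = f"MCDA done on {date}"

    def get_ops_status(self):
        return self.status

    def delete_all(self, date):
        # Drop everything created on the given analysis date.
        self.clustered = {(f, d) for f, d in self.clustered if d != date}
        self.mdc_dates.discard(date)
        self.status = "idle"


analysis = FakeAnalysis("demo")
analysis.set_params(features=["subject", "sender"], weights=[0.5, 0.5], threshold=0.75)
for feature in analysis.get_params()["features"]:
    analysis.run_clustering(feature, "2012-03-01")
analysis.run_MCDA("2012-03-01")
print(analysis.get_ops_status())
```

In a remote setting the clustering call would presumably return before the task completes, which is where polling get_ops_status becomes useful; the stand-in runs everything synchronously.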

TR_Prototype

The TR_Prototype object represents a Prototype as created during a clustering analysis. Recall that a prototype is simply an ordinary event that has been selected during the prototype extraction phase to be representative of a larger set of events, called its neighbors, because its feature vector is deemed typical for a group of homogeneous events. Recall also that a prototype is defined for a specific analysis date, and thus multiple prototypes having the same typical pattern can coexist on different dates. As one can see in Figure 4.3, this type of object can be accessed through various references starting from other objects, such as TR_Analysis, TR_Cluster and TR_MDCluster.

The TR_Prototype object also provides various utility methods to retrieve high-level information on a given prototype:

• get_size: returns the size of a prototype object, i.e., the number of neighbor events,

• get_compactness: returns the compactness of a prototype object (which can be viewed as a very compact cluster),

• get_patterns: can be used to retrieve all patterns of a prototype object, including all patterns of the neighbors,

• get_summary: returns a high-level summary of a prototype object,

• delete: can be used to delete a prototype object extracted in a given triage analysis.

TR_Cluster

The TR_Cluster object represents a Cluster as created during a clustering analysis. Recall that a cluster is defined as a group of prototypes that are similar with respect to a specific feature and similarity metric. A cluster object is created during a triage clustering analysis but can group prototypes across multiple analysis days (as illustrated in Figure 4.2) if the clustering is run incrementally. This type of object can be accessed through references starting from a TR_Analysis object or a TR_Prototype object.

The TR_Cluster object also provides various utility methods to retrieve high-level information on a given cluster or to change its composition:

• get_size: returns the size of a cluster object, either the number of prototypes or the total number of events (i.e., all neighbors of the prototypes belonging to the cluster),

• get_compactness: returns the compactness of a cluster object (for a specific feature),

• get_patterns: can be used to retrieve all patterns of a cluster object, including the patterns of all prototypes and their respective neighbors,

• get_summary: returns a high-level summary of a cluster object,

• add_prototypes: can be used to associate new prototypes with a cluster object,

• del_prototypes: can be used to remove prototypes from a cluster object,

• delete: can be used to completely delete a cluster object from a given triage analysis.



TR_MDCluster

The TR_MDCluster object represents an MDCluster (or MDC) as created during an MCDA analysis. Recall that MDC’s are multi-dimensional clusters generated across all features by linking prototypes identified during a specific analysis using an MCDA fusion method, i.e., by aggregating all feature similarities. While an MDC is defined for a specific analysis date, new MDC’s can be compared to previously identified MDC’s and linked with the closest ones (i.e., the neighbor MDC’s). The incremental flag must be enabled during the MCDA analysis in order to link MDC’s.

This type of object can be accessed through references starting from a TR_Analysis object or a TR_Prototype object. The TR_MDCluster object also provides various utility methods to retrieve high-level information on a given MD Cluster or to change its composition. These methods are fairly similar to those provided by the TR_Cluster object described above (with the important difference that an MDC is defined with respect to a number of features, and not a single one).

TR_Event

The TR_Event object can be seen as a wrapper object for an event of a VIS-SENSE dataset, as analyzed by triage. This object can be accessed through various reference methods starting from other objects, such as TR_MDCluster (through the events reference), TR_Prototype (through the neighbors reference) or TR_RandSet (through the events reference). Similarly, two other references give access, starting from a given TR_Event, to the Prototype or MDCluster that it belongs to (as resulting from an analysis).

Two particular methods are also provided to retrieve additional information on such an object:

• get_features: provides the complete set of features of the event,

• compare: a utility method for computing similarities with any other TR_Event object.

TR_RandSet and TR_FMTool

TR_RandSet and TR_FMTool are two special meta-objects that were defined to provide various analytical services to the visual analytics layer, e.g., running different clustering methods with various parameters on a limited, random set of events in order to evaluate the most appropriate set of parameters, weights and thresholds based on the visualized patterns and using feedback provided by the user. As explained before, those objects will be further detailed in the appropriate deliverables (D3.3 and D4.x).


5 Conclusions

This deliverable has provided a detailed description of the design and implementation of the VIS-SENSE data collection infrastructure. To enable partners to build on appropriate datasets and reach the goals set in the previously defined user scenarios, we have integrated different information sources which fall into two categories: (i) data sets related to Internet routing protocols (SpamTracer and BGPDB), which will be used primarily in the BGP analysis scenario; and (ii) data sets related to various types of Internet threats (e.g., harmur for client-side threats, sgnet for server code injections and SpamCloud for spam data), which will help achieve the goals set in the scenario on the visualization of the Internet threat landscape.

For each dataset, we have described in detail which information is available and how the selected data set relates to the previously defined user scenarios (D1.2). We have also provided a detailed description of the data schemas that are remotely accessible through the WAPI programming interface, in terms of the attributes and methods exposed through the WOMBAT API.

Finally, we started to explain how the raw data will be enriched, clustered, indexed and analyzed using specific data analytics techniques (namely triage analytics), in order to later enable more advanced processing and visualization tasks in the visual analytics layer of the VIS-SENSE framework, such as visual attack attribution and advanced network correlation.


Bibliography

[1] Alexa the web information company. http://www.alexa.com/.

[2] Dshield: Cooperative network security community. http://www.dshield.org/.

[3] Emerging threats. http://www.emergingthreats.net/.

[4] Malware domain list. http://www.malwaredomainlist.com/.

[5] Paris traceroute. http://www.paris-traceroute.net/.

[6] Shadowserver. http://www.shadowserver.org/.

[7] Spamhaus drop list (don’t route or peer). http://www.spamhaus.org/drop.

[8] Team cymru ipv4 fullbogons. http://www.team-cymru.org/Services/Bogons/fullbogons-ipv4.txt.

[9] The WOMBAT Project. http://www.wombat-project.eu.

[10] University of oregon route views project. http://www.routeviews.org/.

[11] Wombat Deliverable D22 (D5.2) Root Causes Analysis: Experimental Report. http://wombat-project.eu/deliverables/.

[12] P. Baecher, M. Koetter, T. Holz, M. Dornseif, and F. Freiling. The Nepenthes Platform: An Efficient Approach to Collect Malware. In 9th International Symposium on Recent Advances in Intrusion Detection (RAID), September 2006.

[13] Composite Blocking List. http://cbl.abuseat.org.

[14] M. Dacier, V. Pham, and O. Thonnard. The WOMBAT Attack Attribution method: some results. In 5th International Conference on Information Systems Security (ICISS 2009), 14-18 December 2009, Kolkata, India, Dec 2009.

[15] M. Dacier, V. Pham, and O. Thonnard. The WOMBAT Attack Attribution method: some results. In 5th International Conference on Information Systems Security (ICISS 2009), 14-18 December 2009, Kolkata, India, Dec 2009.


[16] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 1985.

[17] H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data, 3(1):1:1–1:58, Mar. 2009.

[18] C. Leita, U. Bayer, and E. Kirda. Exploiting diverse observation perspectives to get insights on the malware landscape. In DSN 2010, 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, June 2010.

[19] C. Leita and M. Cova. HARMUR: Storing and analyzing historic data on malicious domains. In The BADGERS workshop, Building Analysis Datasets and Gathering Experience Returns for Security, April 2011.

[20] C. Leita and M. Dacier. SGNET: a worldwide deployable framework to support the analysis of malware threat models. In 7th European Dependable Computing Conference (EDCC 2008), May 2008.

[21] C. Leita, M. Dacier, and F. Massicotte. Automatic handling of protocol dependencies and reaction to 0-day attacks with ScriptGen based honeypots. In 9th International Symposium on Recent Advances in Intrusion Detection (RAID), Sep 2006.

[22] C. Leita, V. H. Pham, O. Thonnard, E. Ramirez-Silva, F. Pouget, E. Kirda, and M. Dacier. The Leurre.com Project: Collecting Internet Threats Information using a Worldwide Distributed Honeynet. In 1st Wombat Workshop, 2008.

[23] RIPE NCC. Routing Information Service, Raw Data. http://www.ripe.net/data-tools/stats/ris/ris-raw-data. [Online; accessed 29-Mar-2012].

[24] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90–105, June 2004.

[25] G. Portokalidis, A. Slowinska, and H. Bos. Argos: an emulator for fingerprinting zero-day attacks. In ACM Sigops EuroSys, 2006.

[26] K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. J. Comput. Secur., 19(4):639–668, Dec. 2011.

[27] K. N. D. W. Sandra Rankovic, Evgeni Dimitrov. Optimization of multi-criteria decision analysis methods. Technical report, Institute for Pure and Applied Mathematics, UCLA, 2011.



[28] O. Thonnard. A multi-criteria clustering approach to support attack attribution in cyberspace. PhD thesis, Ecole Doctorale d’Informatique, Telecommunications et Electronique de Paris, March 2010.

[29] O. Thonnard and M. Dacier. A strategic analysis of spam botnets operations. In Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, CEAS ’11, pages 162–171, New York, NY, USA, 2011. ACM.

[30] O. Thonnard, W. Mees, and M. Dacier. Addressing the attack attribution problem using knowledge discovery and multi-criteria fuzzy decision-making. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics, CSI-KDD ’09, pages 11–21, New York, NY, USA, 2009. ACM.

[31] I. van Beijnum. BGP. O’Reilly Media, Inc., Sebastopol, CA, USA, September 2002.

[32] R. Vidal. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52–68, March 2011.

[33] S. Zanero and P. M. Comparetti. The WOMBAT API: querying a global network of advanced honeypots. In BlackHat DC, 2010.

