

ECAI 2016 - International Conference – 8th Edition

Electronics, Computers and Artificial Intelligence

30 June -02 July, 2016, Ploiesti, ROMÂNIA

Lessons learned from correlation of honeypots' data and spatial data

Pavol Sokol

Institute of Computer Science, Faculty of Science

Pavol Jozef Šafárik University in Košice

Košice, Slovakia

[email protected]

Veronika Kopčová

Institute of Mathematics, Faculty of Science

Pavol Jozef Šafárik University in Košice

Košice, Slovakia

[email protected]

Abstract – Honeypots and honeynets are unconventional security tools for the purpose of studying the techniques, methods, tools, and goals of attackers. Analysis of data collected by these security tools is important for network security. In this paper, we focus on information about the locations, shapes of geographic features and the relationships between them, usually stored as coordinates and topology (spatial data). We discuss specific spatial data related to countries and analyse them in relation to the number of attempted attacks collected by honeypots. We analyse the relationship between the spatial data and the number of attempted attacks, and the properties of the countries from which attackers attack.

Keywords - honeypot; honeynet; spatial data; attack; GIS

I. INTRODUCTION

Cyberspace offers new opportunities, but it is also a source of new threats for both individuals and organizations. Therefore, network security has become an increasingly important part of modern society. Traditionally, information security is primarily defensive and uses conventional tools to protect information (e.g. firewalls). For this purpose, it is necessary to collect and investigate as much information about attacker communities as possible. From this point of view, a honeypot seems to be a very useful tool. It can be defined as “a computing resource, whose value is in being attacked” [1]. Lance Spitzner defines a honeypot as “an information system resource whose value lies in unauthorized or illicit use of that resource” [2].

The most widespread classification of honeypots is based on the level of interaction. The level of interaction can be defined as the range of possibilities the attacker is given after penetrating the system. There are low-interaction and high-interaction honeypots. On the one hand, low-interaction honeypots emulate the characteristics of network services or a particular operating system (e.g. Dionaea [3]). On the other hand, a complete operating system with all services can be used to get more accurate information about attacks and attackers [4]. This type of honeypot is called a high-interaction honeypot (e.g. HonSSH [5]).

The concept of a honeypot is extended by a special kind of high-interaction honeypot, the honeynet. A honeynet can also be referred to as "a virtual environment, consisting of multiple honeypots, designed to deceive an intruder into thinking that he or she has located a network of computing devices of targeting value" [6]. A honeynet consists of four parts, namely data control, data capture, data collection and data analysis [1,6].

The collection and analysis of data captured by honeypots and honeynets is the main purpose of using these tools. Learning new, unconventional information about attacks, attackers and tools helps to protect the network services and computer networks of organizations. Each honeypot collects the IP addresses of attackers. Several interesting pieces of information can be obtained from an IP address, for example the name of the Internet provider, the location of the computer or server, and the time zone of the honeypot. The geographic coordinates of the attacker's IP address allow for extracting subsidiary data, such as country, region, city and time zone.

The above-mentioned data can be referred to as spatial data, as their location within geographical space can be extracted from the IP address. They enable individuals or devices to be located anywhere in the world [7]. Spatial data is also known as geospatial data, spatial information or geographic information. From the perspective of research, the geographical location of the attackers may be useful for identifying attacks. This paper is a sequel to the analysis of data collected from honeypots and honeynets. In [8] we focused on time-oriented data and discussed the relationship between time and data captured by honeypots. The main aim of this paper, in contrast, is to obtain information about attackers by analysing spatially oriented data. This paper is interdisciplinary and combines geoinformatics, information security and mathematical statistics.

To formalize the scope of our work, we state two research questions:

What is the relationship between the spatial data and number of attempted attacks?

From which countries do the attackers attack?

This paper is organized into eight sections. Section II reviews papers related to lessons learned from the analysis of data from honeypots and honeynets. Section III outlines the dataset and the research method of the experiment. Section IV is an introduction to spatial analysis. Sections V, VI and VII focus on specific spatial data from various aspects: Internet users, population and economic aspects. The last section contains conclusions and the authors' suggestions for future research.

II. RELATED WORKS

As mentioned before, the main role of honeypots and honeynets is the analysis of captured data and the search for new knowledge about attacks and attackers. This section provides a survey of papers that focus on lessons learned from deploying honeypots.

Canto et al. [9] used SGNET as a distributed system of honeypots. They question whether representative datasets of malware samples can be created. They also claim that false negative alerts differ from what they are commonly considered to be and that false positive alerts occur in unexpected places. In another paper [10], the authors focus on clustering attack patterns with an appropriate similarity measure. The results of that paper enable identification of the activities of several worms and botnets in the collected traffic.

Nicomette et al. [11] and Alata et al. [12] focus on the analysis of data collected by high-interaction honeypots. In the first paper [11], the authors discuss the attacks performed via the SSH service and the activities performed after attackers gain access to the honeypot. In [12], the authors focus on attackers and their activities after logging in. The authors correlated their findings with results from distributed low-interaction honeypots.

Sochor and Zuzcak [13,14], on the other hand, focus on low-interaction honeypots. In [13], the authors claim that the data show currently spreading threats caught by honeypots, and they outline a thorough interpretation of the lessons learned from honeypots. In the second paper [14], the authors highlight as the most important result the fact that the differentiation among honeypots according to their IP address is relatively rough (e.g. differentiation between academic and commercial networks).

In [8], the authors focus on time-oriented data and outline the visualization of this kind of data in honeypots and honeynets. They also provide results from a honeynet based on a special visualization, heatmaps. The analysis showed that time is an important aspect of attacks. Attackers are mostly active at night (according to the honeynet's time zone).

A further example of the use of low-interaction honeypots (Dionaea) for the purpose of study is paper [15]. The author presents the results of nearly two years of operation of honeypot systems installed on an unprotected research network. The paper focuses on information about the lifetime of malware programs and long-term malware activity.

The above-mentioned papers and research groups focus on the analysis of data captured by honeypots and honeynets. They analyse the data from various aspects and discuss the lessons learned from honeypots. In this paper, we focus on several types of spatial data and correlate the data obtained from honeypots with the respective spatial data related to countries.

III. EXPERIMENTAL DESIGN

The data were collected from a honeynet consisting of both high-interaction and low-interaction honeypots. The honeynet is located in a campus network. The basis of the honeynet is an instance of HoneyD [16], which populates the majority of the IP space and emulates various services. The high-interaction honeypots enhance the capabilities of the honeynet by hosting services that are either hard to emulate via HoneyD or not suitable for emulation. We used the honeynet monitoring tool Honeyscan [17] to evaluate the activity of attackers in the honeynet using network flow monitoring [18]. Each day, around 15,000 network flows incoming to the honeynet were observed. A significant portion of the network flows was generated by TCP SYN scanning, mostly on ports 22, 80, and 3389. We used emulated services on the most frequently scanned ports to capture data from the interaction between attackers and honeypots. Logs generated by these services were used for further analysis.

We also evaluated authentication attempts (attempted attacks) on five selected password-protected services: SSH, FTP, Telnet, POP3, and IMAP. Although those services were deployed on frequently scanned ports, only SSH, POP3, and IMAP provided a significant number of authentication attempts. In total, we obtained 3,317,597 records from SSH, 660,475 records from POP3, and 81,923 records from IMAP. The number of records from the FTP and Telnet protocols was negligible. Each record contains the username and password used in an attempt, as well as a timestamp and the IP addresses of the adversary and the honeypot. Time series can be built upon events as well as upon individual elements of a record.
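The record layout described above can be sketched as a small data structure. This is only an illustration: the class name, field names and sample values are hypothetical and are not taken from the honeynet's actual log format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuthAttempt:
    """One authentication attempt; all field names are illustrative."""
    service: str         # e.g. "ssh", "pop3", "imap"
    username: str
    password: str
    timestamp: datetime
    source_ip: str       # address of the adversary
    honeypot_ip: str     # address of the attacked honeypot


def attempts_per_day(records):
    """Event-based time series: number of attempts per calendar day."""
    return Counter(r.timestamp.date().isoformat() for r in records)


def attempts_per_source(records):
    """Number of attempts per adversary IP address."""
    return Counter(r.source_ip for r in records)


# Hypothetical sample records using documentation-range IP addresses.
sample = [
    AuthAttempt("ssh", "root", "123456", datetime(2015, 3, 1, 2, 15),
                "203.0.113.7", "192.0.2.10"),
    AuthAttempt("ssh", "admin", "admin", datetime(2015, 3, 1, 2, 16),
                "203.0.113.7", "192.0.2.10"),
    AuthAttempt("pop3", "info", "info", datetime(2015, 3, 2, 23, 40),
                "198.51.100.4", "192.0.2.11"),
]
daily = attempts_per_day(sample)
by_source = attempts_per_source(sample)
```

Counting per source address is the step that later allows attempts to be attributed to countries.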

For the purposes of this paper, each record was supplemented with spatial data using the IP-API.com service [19]. This service provides free use of its Geo IP API through multiple response formats. Each record was supplemented with time zone, country, region, city, Internet service provider (ISP), and Global Positioning System (GPS) coordinates. We also used data from the World DataBank [20] to analyse the correlation between attempted attacks and spatial data. We focus on 130 selected types of spatial data. Considering the large number of attempted attacks from China, we split the results of the analyses into two groups: with and without China.
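A minimal sketch of this enrichment step follows. It assumes the service's public JSON endpoint and the response field names (`status`, `country`, `regionName`, `city`, `timezone`, `isp`, `lat`, `lon`) as commonly documented; these should be checked against the current service documentation. The helper names and the canned responses are hypothetical, and the paper does not describe how the authors actually queried the service.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_geo(ip, timeout=10):
    """Query the JSON endpoint for one IP address (performs a network call)."""
    with urlopen(f"http://ip-api.com/json/{ip}", timeout=timeout) as resp:
        return json.load(resp)


def spatial_fields(geo):
    """Keep the fields listed in the text; return None for a failed lookup."""
    if geo.get("status") != "success":
        return None
    return {
        "country": geo.get("country"),
        "region": geo.get("regionName"),
        "city": geo.get("city"),
        "timezone": geo.get("timezone"),
        "isp": geo.get("isp"),
        "lat": geo.get("lat"),
        "lon": geo.get("lon"),
    }


def attempts_by_country(source_ips, lookup):
    """Count attempts per country; `lookup` maps an IP to a geo response."""
    counts = Counter()
    for ip in source_ips:
        fields = spatial_fields(lookup(ip))
        if fields is not None:
            counts[fields["country"]] += 1
    return counts


def without(counts, *countries):
    """Copy of the counts with the given countries left out."""
    return {c: n for c, n in counts.items() if c not in countries}


# Hypothetical canned responses stand in for the network call.
canned = {
    "203.0.113.7": {"status": "success", "country": "China",
                    "regionName": "Beijing", "city": "Beijing",
                    "timezone": "Asia/Shanghai", "isp": "Example ISP",
                    "lat": 39.9, "lon": 116.4},
    "198.51.100.4": {"status": "success", "country": "Slovakia",
                     "regionName": "Kosice Region", "city": "Kosice",
                     "timezone": "Europe/Bratislava", "isp": "Example ISP",
                     "lat": 48.7, "lon": 21.3},
    "192.0.2.99": {"status": "fail", "message": "reserved range"},
}
ips = ["203.0.113.7", "203.0.113.7", "198.51.100.4", "192.0.2.99"]
counts = attempts_by_country(ips, canned.__getitem__)
counts_without_china = without(counts, "China")
```

Passing the lookup as a parameter keeps the aggregation testable offline; in a live run `fetch_geo` could be passed instead, subject to the service's usage limits.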

Data cleaning and analysis were performed using the HoneyLog framework [21] and a geographic information system, ArcGIS [22]. HoneyLog is a framework for analysing data from honeypots and honeynets; it is based on the PHP framework FuelPHP and JavaScript libraries and consists of a client part and a server part. ArcGIS is a geographic information system (GIS) used to visualize, query, analyse, and interpret data in order to understand relationships, patterns, and trends.

To verify the results generated by HoneyLog and the GIS, we use statistical correlation. The most familiar measure of dependence between two quantities (in our case, attempted attacks and spatial data) is Pearson's correlation coefficient [23].

The correlation is +1 in the case of a perfect increasing linear relationship and −1 in the case of a perfect decreasing linear relationship. The strength of the dependence is interpreted as follows [24]:

very strong increasing correlation in (0.9, 1];

strong increasing correlation in (0.67, 0.9];

moderate increasing correlation in (0.35, 0.67];

weak increasing correlation in (0.1, 0.35] and

no correlation in (0, 0.1].
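The coefficient and the interval labels above can be sketched as follows. The function names are hypothetical; the thresholds are the ones listed in the text, and values outside (0, 1] are reported as not covered because the list only defines increasing correlations.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def strength(r):
    """Label a positive coefficient with the intervals listed in the text."""
    if not 0 < r <= 1:
        return "outside the listed intervals"
    if r > 0.9:
        return "very strong"
    if r > 0.67:
        return "strong"
    if r > 0.35:
        return "moderate"
    if r > 0.1:
        return "weak"
    return "none"
```

For example, `strength(0.8945)` returns `"strong"` and `strength(0.5675)` returns `"moderate"`.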

Multiple R is Pearson's correlation coefficient, and R-squared is the percentage of the response variable variation that is explained by a linear model. It is a statistical measure of how close the data are to the fitted regression line. In general, the higher the R-squared, the better the model fits the data. However, the level of R-squared considered satisfactory differs between fields: in psychology 20% may be enough, whereas in physics more than 80% is necessary. For our data, about 50% is a very good level of R-squared.
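As an illustration of how R-squared relates to the fitted regression line, the following sketch fits y = a·x + b by ordinary least squares and computes the explained share of variance. The function name and the sample values are hypothetical; the paper's own figures come from the authors' tooling, not from this code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    # Residual and total sums of squares give the explained variance share.
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot


# Hypothetical data: x is a country indicator, y the number of attempts.
slope, intercept, r_squared = fit_line([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

For a line with an intercept, this R-squared equals the square of Pearson's coefficient, which is why Multiple R and R-squared are reported together.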

In the table of analysis of variance (ANOVA) [25] we test the null hypothesis that the linear coefficient is 0 against the alternative that it is non-zero. Significance F is the associated P-value. When significance F is lower than 0.05, we reject the null hypothesis and conclude that the linear model fits our data. The last table shows the coefficients of the regression model; the P-values show the significance of these coefficients. When a P-value is less than 0.05, we reject the null hypothesis that the two variables are unrelated. In other words, there is a relation between the two variables and the coefficient is statistically significant. In some cases, a nonlinear dependence fits better. Such a model can be analysed in the same way as in the linear case.
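For a regression with one explanatory variable, the F statistic behind significance F can be written in terms of R-squared and the number of observations n. The sketch below uses that standard identity; the function name and the sample numbers are hypothetical, and the number of countries in each of the paper's models is not stated.

```python
def regression_f(r_squared, n):
    """F statistic of a simple linear regression with n observations.

    Uses F = (R^2 / 1) / ((1 - R^2) / (n - 2)), i.e. 1 and n - 2
    degrees of freedom.
    """
    if n < 3:
        raise ValueError("need at least three observations")
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return r_squared / (1 - r_squared) * (n - 2)


# Hypothetical example: R^2 = 0.64 from five observations.
f_value = regression_f(0.64, 5)

# Significance F is the upper-tail probability of f_value under the
# F(1, n - 2) distribution. With SciPy installed this could be obtained
# as scipy.stats.f.sf(f_value, 1, n - 2); it is left out here to keep
# the sketch free of third-party dependencies.
```

The larger R-squared and n are, the larger F becomes, which is consistent with very small significance F values for strong correlations over many countries.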

IV. LESSONS FROM SPATIAL DATA

We analysed 130 selected types of spatial data from the World DataBank. Most of them have no significant correlation with the number of attempted attacks, for example countries' employment, electricity production, energy use, military expenditure, number of secure Internet servers and literacy rate.

In this paper, we focus on the types of spatial data that have a significant correlation with the number of attempted attacks (about 18 types of spatial data from the World DataBank). The types of spatial data without significant correlation are only mentioned. The types of spatial data with significant correlation with the number of attempted attacks are split into three categories according to three aspects (spatial data related to users, spatial data related to population and spatial data related to economic value).

In the following sections we discuss each of the above-mentioned aspects. The sections contain graphs and maps for better representation of several results. Graphs show the correlation between the number of attempted attacks and countries' spatial data. In each graph, the x-axis represents the countries' spatial data and the y-axis represents the number of attempted attacks. Maps show the origin of attempted attacks. The size of a point represents the number of attempted attacks from that place. The spatial data of a country are illustrated by shading the country (a higher value is darker).

V. USERS´ ASPECTS

The first is the category of users' aspects. In this category we discuss two types of spatial data (Internet users and subscriptions), which have significant correlation values with attempted attacks (Tab. 1).

Tab. 1 shows the polynomial correlation value of the spatial data. In the case of mobile cellular subscriptions, we also show the quadratic correlation value, because this type of spatial data shows better values of quadratic correlation.

TABLE I. TABLE OF CORRELATIONS - USERS' ASPECTS

| Spatial data of countries | Correlation (without China) | Correlation (with China) |
|---|---|---|
| Internet users | 0.7255 | 0.8945 |
| Fixed broadband subscriptions | 0.6769 | 0.8979 |
| Fixed telephone subscriptions | 0.7127 | 0.8702 |
| Mobile cellular subscriptions | 0.5675 | 0.7972 |

Figure 1. Graph of correlation between Internet users and number of attempted attacks (without China)

A. Internet users

The first investigated criterion is the number of Internet users in a country. Internet users are individuals who have used the Internet (from any location) in the year 2014. The Internet can be used via a computer, digital TV, mobile phone etc. [20]. The correlation between this type of spatial data and the number of attempted attacks (without China) is shown in Fig. 1. The map of Internet users and the number of attempted attacks is shown in Fig. 2.

The dependence between Internet users and the number of attempted attacks is strong in both cases, with China and without China, as the correlation coefficient lies in the interval (0.67, 0.9]. The value of R-squared in the case with China means that 80% of the variance is explained by our regression line. ANOVA shows a significance F of $9.67 \times 10^{-37}$, so we can conclude that our model is good enough. The last part of the analysis shows the coefficients of the regression line. The P-values again show that both coefficients, linear and absolute, are significant, so the regression line is $y = 0.002827x - 36944$. In the model without China, 52% of the variance is explained and the model is significant according to significance F. Here the only significant coefficient is the linear one, so the regression line is $y = 0.000664x$.
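The reported percentages can be cross-checked against Table I, since R-squared is the square of the correlation coefficient. The short sketch below does this and evaluates the reported regression line for a hypothetical country; the variable names and the country size of 50 million Internet users are illustrative assumptions.

```python
# Correlation coefficients for Internet users, taken from Table I.
r_with_china = 0.8945
r_without_china = 0.7255

# Squaring gives the explained share of variance (roughly 0.80 and 0.53),
# in line with the reported 80% and 52%.
r2_with = r_with_china ** 2
r2_without = r_without_china ** 2


def predicted_attempts(internet_users):
    """Regression line reported in the text for the model with China."""
    return 0.002827 * internet_users - 36944


# Hypothetical country with 50 million Internet users.
example = predicted_attempts(50_000_000)
```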

Figure 2. Map of Internet users and number of attempted attacks

B. Subscriptions

Fixed broadband subscriptions refers to fixed subscriptions to high-speed access to a TCP/IP connection, at downstream speeds equal to, or greater than, 256 kbit/s. This includes cable modem, DSL, fiber-to-the-home/building, other fixed (wired)-broadband subscriptions, satellite broadband and terrestrial fixed wireless broadband. The correlation between this type of spatial data and number of attempted attacks (without China) is shown in Fig. 3. Fixed telephone subscriptions refers to the sum of active number of analogue fixed telephone lines, voice-over-IP (VoIP) subscriptions, fixed wireless local loop subscriptions, ISDN voice-channel equivalents and fixed public payphones.

Figure 3. Graph of correlation between fixed broadband subscriptions and number of attempted attacks (without China)

The last type of subscriptions are mobile cellular telephone subscriptions, which are subscriptions to a public mobile telephone service. The indicator includes the number of post-paid subscriptions, and the number of active prepaid accounts. The map of this type of spatial data and number of attempted attacks is shown in Fig. 4.

The linear dependence between fixed broadband subscriptions, mobile cellular subscriptions and the number of attempted attacks is strong in the case with China and moderate without China. For fixed telephone subscriptions it is strong in both cases. In all cases ANOVA shows a P-value less than 0.05, so we can conclude that our models are good enough. The value of R-squared in the case with China is greater than 60% in all cases. The P-values for the regression coefficients show that both coefficients are significant for all groups with China, so the regression lines are $y = 0.009582x - 29797$ for fixed broadband subscriptions, $y = 0.007173x - 33546$ for fixed telephone subscriptions and $y = 0.001176x - 32826$ for mobile cellular subscriptions. For the models without China, more than 32% of the variance is explained by each model. Here the only significant coefficient is the linear one, so the regression lines are $y = 0.002141x$ for fixed broadband subscriptions, $y = 0.001576x$ for fixed telephone subscriptions and $y = 0.000195x$ for mobile cellular subscriptions. Mobile cellular subscriptions are the only case in which a nonlinear dependence is better than a linear one. This can be seen from the value of R-squared for the quadratic dependence, which is 48%, whereas for the linear dependence it is only 32%. The correlation coefficient is higher too, the model is good enough, and the linear and quadratic coefficients are significant, so $y = -3.9 \times 10^{-13}x^2 + 0.00049x$. For the model with China it is $y = 1.7 \times 10^{-12}x^2 - 0.0065x + 26761$.
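A quadratic model of the form y = a·x² + b·x, as reported for mobile cellular subscriptions without China, can be fitted by least squares by solving the two normal equations directly. The sketch below assumes a fit without an intercept; the text does not say whether the authors fitted it this way or dropped a non-significant intercept afterwards. The function name and sample data are hypothetical.

```python
def fit_quadratic_through_origin(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x; returns (a, b).

    Normal equations:
        a*sum(x^4) + b*sum(x^3) = sum(x^2 * y)
        a*sum(x^3) + b*sum(x^2) = sum(x * y)
    """
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))
    det = s4 * s2 - s3 * s3
    if det == 0:
        raise ValueError("x values do not determine a unique fit")
    a = (sx2y * s2 - s3 * sxy) / det
    b = (s4 * sxy - s3 * sx2y) / det
    return a, b


# Hypothetical data generated exactly from y = -0.5*x**2 + 3*x.
a, b = fit_quadratic_through_origin([1, 2, 3, 4], [2.5, 4.0, 4.5, 4.0])
```

A negative quadratic coefficient, as in the reported model without China, means the fitted number of attempts grows more slowly as the number of subscriptions increases.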

Figure 4. Map of mobile cellular subscriptions and number of attempted attacks

Based on the above, it can be concluded that there is a linear relationship between users' aspects (the number of Internet users, subscriptions) and the number of attempted attacks. This could be explained by two facts: first, the more malicious users, the more infected devices; second, users are claimed to be the weakest part of information security. Moreover, we found no correlation between the number of secure Internet servers and the number of attempted attacks, whereas the above-mentioned linear relationship holds for users' aspects. On that basis, it can be inferred that the originators of attacks are client devices (computers, laptops, mobile phones etc.) and not service devices (routers, switches, servers).

VI. POPULATION ASPECTS

The second category is aimed at population aspects. In this category we discuss four types of spatial data (total population, age of population, urban and rural population, and population in urban agglomerations and in the largest city), which have significant correlation values with attempted attacks (Tab. 2 and Tab. 3). In the case of total population and age of population we consider three cases:

all countries are relevant (case 1);

China is omitted (case 2) and

China with India are omitted (case 3).

TABLE II. POPULATION ASPECTS

| Spatial data of countries | Correlation (all countries) | Correlation (without China and India) |
|---|---|---|
| Total population | 0.7336 | 0.6405 |
| Population (ages 0-14) | 0.5319 | 0.4712 |
| Population (ages 15-64) | 0.7744 | 0.6652 |

TABLE III. POPULATION ASPECTS II.

| Spatial data of countries | Correlation (all countries) | Correlation (without China) |
|---|---|---|
| Urban population | 0.8436 | 0.6723 |
| Population in urban agglomerations of more than 1 million | 0.8172 | 0.6269 |
| Population in largest city | 0.3462 | 0.4728 |

The reason for the distortion of the results is that those countries are very populous (China: 18.8% of world population; India: 17.6% of world population [20]).
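The effect of a single very large country on the correlation coefficient can be illustrated with synthetic data. The values below are invented for the illustration and do not come from the honeynet or the World DataBank; the last pair plays the role of one very populous country.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient (no input validation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (population, attempted attacks) pairs in arbitrary units.
population = [1, 2, 3, 4, 5, 100]
attempts = [2, 1, 4, 3, 5, 90]

r_all = pearson_r(population, attempts)
r_without_largest = pearson_r(population[:-1], attempts[:-1])
```

In this constructed example the coefficient over all points is close to 1, while the remaining points alone give 0.8. One extreme point can therefore dominate the result, which is why the cases with and without the largest countries are reported separately.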

Figure 5. Graph of total population of countries and number of attempted attacks (without China and India)

A. Total population

Total population is based on the definition of population, which counts all residents regardless of legal status or citizenship except for refugees not permanently settled in the country of asylum, who are generally considered part of the population of their country of origin. The values shown are midyear estimates [20]. The correlation between this type of spatial data and the number of attempted attacks (without China and India) is shown in Fig. 5. The map of total population and the number of attempted attacks is shown in Fig. 6.

Figure 6. Map of total population and attempted attacks

The linear dependence between total population and the number of attempted attacks is strong in case (1) and moderate in cases (2) and (3). In all cases the models are significant according to significance F. The value of R-squared in case (1) means that 54% of the variance is explained by our regression line. The linear coefficient is significant, so the regression line is $y = 0.000939x$. For model (2), 21% of the variance is explained by the model. Again the only significant coefficient is the linear one, so the regression line is $y = 0.000126x$. For model (3), 41% of the variance is explained by the model. Again the only significant coefficient is the linear one, so the regression line is $y = 0.000419x$.

B. Age of population

Population aged 0-14 is the population between the ages 0 and 14 as a percentage of the total population. Population aged 15-64 is the number of people who could potentially be economically active [20]. The correlation between the population of countries aged 15-64 (without China and India) and the number of attempted attacks is shown in Fig. 7.

Figure 7. Graph of correlation between age of population of countries (15-64) and number of attempted attacks (without China and India)

The linear dependence between the population aged 0-14 and the number of attempted attacks is moderate in all cases; for the population aged 15-64 it is strong in case (1) and moderate in cases (2) and (3). In all cases the models are significant according to significance F. The value of R-squared in case (1) is only 28% for the population aged 0-14, whereas for the population aged 15-64 it is 60%. The linear coefficient is significant, so the regression lines are $y = 0.002837x$ for the population aged 0-14 and $y = 0.001424x$ for the population aged 15-64. For model (2), only about 20% of the variance is explained by the model for both age groups. Here both coefficients are significant, so the regression lines are $y = 0.000364x + 10557$ for the population aged 0-14 and $y = 0.000198x + 8994$ for the population aged 15-64. Finally, for model (3), between 20% and 40% of the variance is explained by the model. Again the only significant coefficient is the linear one, so the regression lines are $y = 0.001133x$ for the population aged 0-14 and $y = 0.000665x$ for the population aged 15-64.

C. Urban population

Urban population refers to people living in urban areas as defined by national statistical offices [20]. The correlation between this type of spatial data and number of attempted attacks (without China) is shown in Fig. 8.

Figure 8. Graph of correlation between urban population of countries and number of attempted attacks (without China)

The linear dependence between urban population and the number of attempted attacks is strong in the case with China and moderate without China. Both models are significant according to significance F. The value of R-squared in the case with China means that 71% of the variance is explained by our regression line. The P-values for the regression coefficients show that both coefficients are significant, so the regression line is $y = 0.00227x - 37534$. For the model without China, 45% of the variance is explained by the model. Here only the linear coefficient is significant, so the regression line is $y = 0.000457x$.

D. Population in urban agglomerations and in largest city

Population in urban agglomerations of more than one million is the country's population living in metropolitan areas that in 2000 had a population of more than one million people. On the other hand, population in largest city is the urban population living in the country's largest metropolitan area [20]. Correlations between population in urban agglomerations and number of attempted attacks (without China) are shown in Fig. 9.

The linear dependence between population in urban agglomerations and the number of attempted attacks is strong in the case with China and moderate without China; for population in the largest city, on the other hand, it is moderate without China and weak with China. The value of R-squared in the case with China is 67% for population in urban agglomerations, but only 12% for population in the largest city. Both models are significant according to significance F. Both coefficients are significant for population in urban agglomerations, so the regression line is $y = 0.005047x - 42621$, but for population in the largest city only the linear coefficient is significant, so $y = 0.013331x$. For the models without China, the value of R-squared is between 22% and 39% for both groups and the only significant coefficient is the linear one, so the regression line is $y = 0.000913x$ for population in urban agglomerations and $y = 0.002915x$ for population in the largest city.

Based on the above, it can be concluded that there is a linear relationship between the population aspects of countries (total population, age, urban population, and population in agglomerations and in the largest city) and the number of attempted attacks. The population aspects of countries confirm the above-mentioned findings. In addition, active population and population in agglomerations have an impact on the number of attempted attacks.

Figure 9. Graph of correlation between population in urban agglomerations and number of attempted attacks (without China)

VII. ECONOMIC ASPECTS

The third category covers economic aspects. In this category we discuss four types of spatial data (gross national income, high-technology exports, ICT goods exports and imports, and service exports and imports) that have significant correlation values with the number of attempted attacks (Tab. 4). Values of some of these spatial data (ICT goods exports, ICT goods imports, service exports and service imports) in the World DataBank (year 2014) are not available for China.

TABLE IV. ECONOMIC ASPECTS

| Spatial data of countries | Correlation (without China) | Correlation (with China) |
|---|---|---|
| High-technology exports | 0.5129 | 0.8792 |
| Gross national income | 0.5688 | 0.7131 |
| ICT goods exports | 0.3989 | N/A |
| ICT goods imports | 0.5872 | N/A |
| ICT service exports | 0.5741 | N/A |
| Service exports | 0.5579 | N/A |
| Service imports | 0.6324 | N/A |
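The gap between the two correlation columns reflects how strongly a single extreme country can drive a Pearson coefficient. The sketch below computes the coefficient with and without one record; the records are made-up toy values with hypothetical labels, not the data analysed in the paper.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation(records, exclude=()):
    """Correlation over (label, indicator, attacks) records, skipping labels in `exclude`."""
    kept = [(x, y) for label, x, y in records if label not in exclude]
    xs, ys = zip(*kept)
    return pearson(xs, ys)


# Hypothetical records: (label, indicator value, attempted attacks).
records = [
    ("A", 10, 40), ("B", 20, 15), ("C", 30, 55),
    ("D", 40, 30), ("E", 50, 60), ("Outlier", 500, 900),
]

r_with = correlation(records)
r_without = correlation(records, exclude={"Outlier"})
print(f"with outlier: {r_with:.4f}, without outlier: {r_without:.4f}")
```

Reporting both values, as the table does, makes it visible how much of the relationship rests on one observation.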

Figure 10. Map of GNI and attempted attacks

A. Gross national income

Gross national income (GNI) is the sum of values added by all resident producers and any product taxes (less subsidies) not included in the valuation of output plus net receipts of primary income (compensation of employees and property income) from abroad. Data are in constant 2005 U.S. dollars [20]. The map of GNI and number of attempted attacks is shown in Fig. 10.

The linear dependence between GNI and the number of attempted attacks is strong with China and moderate without China. Both models are significant according to significance F. With China, the value of R square means that 51% of the variance is explained by the regression line. The p-values of the regression coefficients show that both coefficients are significant, so the regression line is $y = 2.4 \times 10^{-7}x - 56050$. For the model without China, 32% of the variance is explained by the model. Here the only significant coefficient is the linear one, so the regression line is $y = 3.3 \times 10^{-8}x$.
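To make the reported GNI lines easier to read, the following sketch evaluates them for a given GNI and computes where the line with the negative intercept crosses zero. The coefficients are the ones quoted above; the helper names and the example GNI of 10^12 dollars are assumptions made for illustration.

```python
def predict(x, slope, intercept=0.0):
    """Value of the regression line y = slope*x + intercept at x."""
    return slope * x + intercept


def zero_crossing(slope, intercept):
    """x at which the regression line predicts zero attempted attacks."""
    return -intercept / slope


# Coefficients quoted in the text: (slope, intercept).
GNI_WITH_CHINA = (2.4e-7, -56050.0)
GNI_WITHOUT_CHINA = (3.3e-8, 0.0)

example_gni = 1.0e12  # hypothetical GNI in U.S. dollars
print(predict(example_gni, *GNI_WITH_CHINA))
print(predict(example_gni, *GNI_WITHOUT_CHINA))
print(zero_crossing(*GNI_WITH_CHINA))
```

Below the zero crossing, the model with China would predict a negative number of attacks, so that line is only meaningful for sufficiently large economies.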

B. High-technology exports

High-technology exports are products with high research and development intensity, such as in computers, aerospace, scientific instruments, and electrical machinery. Data are in 2014 U.S. dollars [20]. The correlation between this type of spatial data and number of attempted attacks is shown in Fig. 11.

Figure 11. Graph of correlation between high-technology exports and number of attempted attacks (without China)

The linear dependence between high-technology exports and the number of attempted attacks is strong with China and moderate without China. Both models are significant according to significance F. With China, the value of R square means that 77% of the variance is explained by the regression line. The p-values of the regression coefficients show that both coefficients are significant, so the regression line is $y = 3.3 \times 10^{-6}x - 33383$. For the model without China, 26% of the variance is explained by the model. Here the only significant coefficient is the linear one, so the regression line is $y = 5.6 \times 10^{-7}x$.
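For a linear regression with a single predictor, R square equals the squared Pearson correlation coefficient, so the percentages quoted for GNI and high-technology exports can be cross-checked against the coefficients in Table IV. The sketch below does this; the dictionary and function names are assumptions made for the example.

```python
# Correlation coefficients from Table IV: (without China, with China).
TABLE_IV = {
    "High-technology exports": (0.5129, 0.8792),
    "Gross national income": (0.5688, 0.7131),
}


def variance_explained_percent(r):
    """Share of variance explained (R square, in percent) for correlation r."""
    return round(100 * r * r)


for name, (r_without, r_with) in TABLE_IV.items():
    print(
        f"{name}: {variance_explained_percent(r_with)}% with China, "
        f"{variance_explained_percent(r_without)}% without China"
    )
```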

C. ICT goods exports and imports

Information and communication technology goods exports and imports include telecommunications, computer and related equipment, audio and video and other information and communication technology goods. Software is excluded [20]. The correlation between ICT goods imports of countries and number of attempted attacks (without China) is shown in Fig. 12.

The linear dependence between ICT goods imports and the number of attempted attacks is moderate, with an R square of 34%; for ICT goods exports it is also moderate, with an R square of only 16%. Both models are significant according to significance F. Only the linear coefficient is significant for imports, so the regression line is $y = 6.2 \times 10^{-7}x$; for exports both coefficients are significant, so the regression line is $y = 5.9 \times 10^{-7}x + 13761$.

Figure 12. Graph of correlation between ICT goods imports of countries and number of attempted attacks (without China)

D. Service exports and imports

Services refer to economic output of intangible commodities that may be produced, transferred, and consumed at the same time. Data are in 2014 U.S. dollars [20].

The linear dependence between service imports, service exports and the number of attempted attacks is moderate. The value of R square is between 31% and 40%. The p-values of the regression coefficients show that only the linear coefficient is significant, so the regression lines are $y = 3.3 \times 10^{-7}x$ for imports and $y = 2.3 \times 10^{-7}x$ for exports. Both models are significant according to significance F.
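Significance F is the p-value of the overall F-test of the regression. For a model with one predictor, the underlying F statistic follows directly from R square and the number of observations, as sketched below. The number of countries is not given in this section, so the value n = 50 is purely hypothetical, as is the function name.

```python
def f_statistic(r_squared, n_observations):
    """Overall F statistic of a one-predictor linear regression.

    Under the null hypothesis of no linear relationship, it follows an
    F distribution with (1, n_observations - 2) degrees of freedom.
    """
    if n_observations < 3:
        raise ValueError("need at least three observations")
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return r_squared / (1.0 - r_squared) * (n_observations - 2)


n = 50  # hypothetical number of countries, NOT taken from the paper
for r2 in (0.31, 0.40):
    print(f"R^2 = {r2:.2f}: F = {f_statistic(r2, n):.1f}")
```

A larger F statistic corresponds to a smaller significance F, so a moderate R square can still be significant when enough countries are included.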

Based on the above, it can be concluded that there is a linear relationship between the economic aspects of countries (GNI, high-technology exports, ICT exports and imports) and the number of attempted attacks. In our opinion, the correlations with GNI and high-technology exports are closely related to a higher level of infrastructure: in these countries there is a greater number of devices to abuse. The correlation with ICT goods imports supports this claim. The correlations with ICT service exports and service exports may indicate that the number of attempted attacks is related to visits to web portals, e-shops and other Internet services.

VIII. CONCLUSION

Spatial data are a source of interesting information. In this paper we focus on the relationship between spatial data and attempted attacks collected by honeypots. Using GIS and mathematical statistics, we analysed the data collected by HoneyD honeypots over the year 2014. In Sections V, VI and VII we answer the questions stated in the introduction. The first research question concerned the relationship between the spatial data related to countries and the number of attempted attacks. As the results suggest, there is a relationship: there is moderate and strong correlation between some types of spatial data (related to countries) and the number of attempted attacks (Tab. 5). The second research question concerned the countries from which attackers attack. Based on the above findings, the number of attacks is related to the active population that uses the Internet and to the level of infrastructure and service provision of a country. The lessons learned from analysing spatial data outline a number of questions that should be answered in further research.

TABLE V. SUMMARY RESULTS

| Spatial data of countries | Correlation (with/without China) | % of variance explained (with/without China) |
|---|---|---|
| Internet users | strong / strong | 80% / 52% |
| Fixed broadband and mobile cellular subscriptions | strong / moderate | 60% / 32% |
| Mobile cellular subscriptions | nonlinear dependence | 48% |
| Total population | strong / moderate | 54% / 21% |
| Age of population between 15 and 64 | strong / moderate | 60% / 20% |
| Urban population | strong / moderate | 71% / 45% |
| Population in urban agglomerations | strong / moderate | 67% / 22% |
| Population in largest city | weak / moderate | 12% / 39% |
| Gross national income | strong / moderate | 51% / 32% |
| High-technology exports | strong / moderate | 77% / 26% |
| ICT goods imports | N/A / moderate | N/A / 34% |
| ICT goods exports | N/A / moderate | N/A / 16% |
| Service imports | N/A / moderate | N/A / 31% |
| Service exports | N/A / moderate | N/A / 40% |
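The labels weak, moderate and strong in Table V can be reproduced from a correlation coefficient with a simple threshold rule. The cut-offs of 0.36 and 0.68 used below are an assumption: they are values commonly attributed to Taylor [24] and are not stated in this section. The function name is hypothetical.

```python
def strength(r, moderate_from=0.36, strong_from=0.68):
    """Verbal label for a correlation coefficient (assumed cut-offs)."""
    magnitude = abs(r)
    if magnitude < moderate_from:
        return "weak"
    if magnitude < strong_from:
        return "moderate"
    return "strong"


# Coefficients from Table IV, plus one recovered from an R square of 12%.
print(strength(0.8792))       # high-technology exports, with China
print(strength(0.5129))       # high-technology exports, without China
print(strength(0.12 ** 0.5))  # population in largest city, with China
```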

In the future, we will conduct research on spatial data of regions and cities, time zones and Internet service providers. We will also focus on spatial analysis that takes into account anonymization systems (e.g. Tor) used to redirect attackers' traffic.

ACKNOWLEDGMENT

We would like to thank colleagues from the Czech chapter of The Honeynet Project for their comments and valuable input. The authors would like to thank Martin Husák from Masaryk University for providing a valuable data set. This paper is funded by the Slovak Grant Agency for Science (VEGA) grants under contract No. 1/0142/15 and No. 1/0344/14, VVGS projects under contract No. VVGS-PF-2015-472 and No. VVGS-PF-2016-72616 and Slovak APVV project under contract No. APVV-14-0598.

REFERENCES

[1] L. Spitzner, “The Honeynet Project: Trapping the Hackers”, IEEE Security & Privacy, pp. 15-23. 2004.

[2] L. Spitzner, “Honeypots: Tracking Hackers”, Addison Wesley, pp. 1-430. 2002.

[3] Dionaea honeypot project, http://dionaea.carnivore.it/

[4] R. C. Joshi, A. Sardana, "Honeypots: A New Paradigm to Information Security", CRC Press, 2011.

[5] HonSSH honeypot project, https://github.com/tnich/honssh

[6] F. H. Abbasi and R. J. Harris, “Experiences with a Generation III virtual Honeynet”, in Australasian Telecommunication Networks and Applications Conference (ATNAC). IEEE, 2009.

[7] P. Haining, "Spatial data analysis: theory and practice," Cambridge University Press, 2003.

[8] P. Sokol, L. Kleinova, M. Husak, "Study of attack using honeypots and honeynets lessons learned from time-oriented visualization," in EUROCON 2015 - International Conference on Computer as a Tool (EUROCON). IEEE, pp. 1-6. 2015.

[9] J. Canto, M. Dacier, E. Kirda, and C. Leita, “Large scale malware collection: lessons learned,” in IEEE SRDS Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems, 2008.

[10] O. Thonnard and M. Dacier, “A framework for attack patterns’ discovery in honeynet data,” Digital Investigation, vol. 5, pp. S128–S139, 2008.

[11] V. Nicomette, M. Kaâniche, E. Alata, and M. Herrb, “Set-up and deployment of a high-interaction honeypot: experiment and lessons learned,” Journal in Computer Virology, vol. 7, no. 2, pp. 143–157, 2011.

[12] E. Alata, V. Nicomette, M. Kaaniche, M. Dacier, and M. Herrb, "Lessons learned from the deployment of a high-interaction honeypot," in Proc. Dependable Computing Conference (EDCC06), pp. 39-46, 2006.

[13] T. Sochor, M. Zuzcak, "Study of Internet Threats and Attack Methods Using Honeypots and Honeynets." Computer Networks. Springer International Publishing, pp. 118-127. 2014.

[14] T. Sochor, M. Zuzcak, "Attractiveness Study of Honeypots and Honeynets in Internet Threat Detection." Computer Networks. Springer International Publishing, pp. 69-81. 2015.

[15] M. Skrzewski, "Network Malware Activity–A View from Honeypot Systems," in Computer Networks. Springer Berlin Heidelberg, pp. 198-206. 2012.

[16] HoneyD honeypot project, http://www.honeyd.org/

[17] M. Husák and M. Drašar, “Flow-based Monitoring of Honeypots,” in Security and Protection of Information 2013. Brno: Univerzita obrany, pp. 63–70. 2013.

[18] R. Hofstede, P. Čeleda, B. Trammell, I. Drago, R. Sadre, A. Sperotto, and A. Pras, “Flow Monitoring Explained: From Packet Capture to Data Analysis with NetFlow and IPFIX,” Communications Surveys Tutorials, IEEE, vol. 16, no. 4, pp. 2037–2064. 2014.

[19] IP-API.com service, http://ip-api.com/

[20] The World Bank, World Development Indicators. 2014.

[21] P. Sokol, P. Pekarčík, and T. Bajtoš, "Data collection and data analysis in honeypots and honeynets." Proceedings of the Security and Protection of Information. University of Defence. Brno. 2015.

[22] ArcGIS project, https://www.arcgis.com

[23] K. Pearson, "Notes on regression and inheritance in the case of two parents," in Proceedings of the Royal Society of London 58, pp. 240–242. 1895.

[24] R. Taylor, "Interpretation of the correlation coefficient: a basic review." Journal of diagnostic medical sonography 6.1, pp. 35-39. 1990.

[25] D. Freedman, R. Pisani, R. Purves, "Statistics, 4th Edition," W. W. Norton & Company. 2007.