malware propagation in large-scale networksfaculty.cse.tamu.edu/guofei/paper/malsize-tkde15.pdf ·...

1

Malware Propagation in Large-Scale NetworksShui Yu, Senior Member, IEEE, Guofei Gu, Member, IEEE, Ahmed Barnawi, Mem-

ber, IEEE, Song Guo, SeniorMember, IEEE, and Ivan Stojmenovic, Fellow, IEEE

Abstract—Malware is pervasive in networks, and poses a critical threat to network security. However, we have verylimited understanding of malware behavior in networks to date. In this paper, we investigate how malware propagate innetworks from a global perspective. We formulate the problem, and establish a rigorous two layer epidemic model formalware propagation from network to network. Based on the proposed model, our analysis indicates that the distributionof a given malware follows exponential distribution, power law distribution with a short exponential tail, and power lawdistribution at its early, late and final stages, respectively. Extensive experiments have been performed through tworeal-world global scale malware data sets, and the results confirm our theoretical findings.

Index Terms—Malware, Propagation, Modelling, Power law.

F

1 INTRODUCTION

Malware are malicious software programs de-ployed by cyber attackers to compromise com-puter systems by exploiting their security vul-nerabilities. Motivated by extraordinary finan-cial or political rewards, malware owners areexhausting their energy to compromise asmany networked computers as they can inorder to achieve their malicious goals. A com-promised computer is called a bot, and allbots compromised by a malware form a botnet.Botnets have become the attack engine of cyberattackers, and they pose critical challenges tocyber defenders. In order to fight against cy-ber criminals, it is important for defenders tounderstand malware behavior, such as propa-gation or membership recruitment patterns, thesize of botnets, and distribution of bots.

S. Yu is with the School of Information Technology, Deakin Univer-sity, Burwood, VIC 3125, Australia. Email: [email protected]. Gu is with the Department of Computer Science and Engineering,Texas A&M University, Texas, USA. Email: [email protected]. Barnawi is with Faculty of Computing and IT, King AbdulazizUniversity, Jeddah, Saudi Arabia. Email:[email protected]. Guo is with the School of Computer Science and Engineering,The University of Aizu, Aizuwakamatsu, Japan. Email: [email protected]. Stojmenovic is with the School of Information Technology, DeakinUniversity, Australia; King Abdulaziz University, Jeddah, SaudiArabia; and the School of EECS, University of Ottawa, Ontario,Canada. Email: [email protected].

To date, we do not have a solid understand-ing about the size and distribution of malwareor botnets. Researchers have employed variousmethods to measure the size of botnets, suchas botnet infiltration [1], DNS redirection [2],external information [3]. These efforts indicatethat the size of botnets varies from millionsto a few thousand. There are no dominantprinciples to explain these variations. As aresult, researchers desperately desire effectivemodels and explanations for the chaos. Dagon,Zou and Lee [4] revealed that time zone hasan obvious impact on the number of availablebots. Mieghem et al. [5] indicated that networktopology has an important impact on malwarespreading through their rigorous mathematicalanalysis. Recently, the emergence of mobilemalware, such as Cabir [6], Ikee [7] , andBrador [8] , further increases the difficulty levelof our understanding on how they propagate.More details about mobile malware can befound at a recent survey paper [9]. To thebest of our knowledge, the best finding aboutmalware distribution in large-scale networkscomes from Chen and Ji [10]: the distributionis non-uniform. All this indicates that the re-search in this field is in its early stage.

The epidemic theory plays a leading role inmalware propagation modelling. The currentmodels for malware spread fall in two cate-

2

gories: the epidemiology model and the con-trol theoretic model. The control system theorybased models try to detect and contain thespread of malware [11] [12]. The epidemiologymodels are more focused on the number ofcompromised hosts and their distributions, andthey have been explored extensively in thecomputer science community [13] [14] [15]. Zouet al. [16] used a susceptible-infected (SI) modelto predict the growth of Internet worms atthe early stage. Gao and Liu [17] recently em-ployed a susceptible-infected-recovered (SIR)model to describe mobile virus propagation.One critical condition for the epidemic modelsis a large vulnerable population because theirprinciple is based on differential equations.More details of epidemic modelling can befind in [18]. As pointed by Willinger et al.[19], the findings, which we extract from a setof observed data, usually reflect parts of thestudied objects. It is more reliable to extract the-oretical results from appropriate models withconfirmation from sufficient real world data setexperiments. We practice this principle in thisstudy.

In this paper, we study the distributionof malware in terms of networks (e.g., au-tonomous systems, ISP domains, abstract net-works of smartphones who share the samevulnerabilities) at large scales. In this kind ofsetting, we have a sufficient volume of data ata large enough scale to meet the requirementsof the SI model. Different from the traditionalepidemic models, we break our model into twolayers. First of all, for a given time since thebreakout of a malware, we calculate how manynetworks have been compromised based on theSI model. Secondly, for a compromised net-work, we calculate how many hosts have beencompromised since the time that the networkwas compromised. With this two layer modelin place, we can determine the total numberof compromised hosts and their distributionin terms of networks. Through our rigorousanalysis, we find that the distribution of agiven malware follows an exponential distri-bution at its early stage, and obeys a powerlaw distribution with a short exponential tailat its late stage, and finally converges to apower law distribution. We examine our the-

oretical findings through two large-scale real-world data sets: the Android based malware[20] and the Conficker [21]. The experimentalresults strongly support our theoretical claims.To the best of our knowledge, the proposed twolayer epidemic model and the findings are thefirst work in the field.

Our contributions are summarized as fol-lows.• We propose a two layer malware propaga-

tion model to describe the development ofa given malware at the Internet level. Com-pared with the existing single layer epi-demic models, the proposed model repre-sents malware propagation better in large-scale networks.

• We find the malware distribution in termsof networks varies from exponential topower law with a short exponential tail,and to power law distribution at its early,late, and final stage, respectively. Thesefindings are firstly theoretically provedbased on the proposed model, and thenconfirmed by the experiments through thetwo large-scale real-world data sets.

The rest of the paper is structured as follows.Related work is briefly listed in Section II.We present the preliminaries for the proposedmodel in Section III. The studied problem isdiscussed in Section IV. A two layer malwarepropagation model is established in SectionV, and followed by a rigorous mathematicalanalysis in Section VI. Experiments are con-ducted to confirm our findings in Section VII.In Section VIII, we provide a further discussionabout the study. Finally, we summarize thepaper and present future work in Section IX.

2 RELATED WORK

The basic story of malware is as follows. Amalware programer writes a program, calledbot or agent, and then installs the bots atcompromised computers on the Internet usingvarious network virus-like techniques. All ofhis bots form a botnet, which is controlledby its owners to commit illegal tasks, such aslaunching DDoS attacks, sending spam emails,performing phishing activities, and collectingsensitive information. There is a command and

3

control (C&C) server(s) to communicate withthe bots and collect data from bots. In order todisguise himself from legal forces, the botmas-ter changes the url of his C&C frequently, e.g.,weekly. An excellent explanation about this canbe found in [1].

With the significant growing of smartphones,we have witnessed an increasing number ofmobile malware. Malware writers have de-velop many mobile malware in recent years.Cabir [6] was developed in 2004, and wasthe first malware targeting on the Symbianoperating system for mobile devices. Moreover,it was also the first malware propagating viaBluetooth. Ikee [7] was the first mobile malwareagainst Apple iPhones, while Brador [8] wasdeveloped against Windows CE operating sys-tems. The attack victors for mobile malware arediverse, such as SMS, MMS, Bluetooth, WiFi,and Web browsing. Peng et al. [9] presentedthe short history of mobile malware since 2004,and surveyed their propagation models.

A direct method to count the number of botsis to use botnet infiltration to count the bot IDsor IP addresses. Stone-Gross et al. [1] registeredthe URL of the Torpig botnet before the bot-master, and therefore were able to hijack theC&C server for ten days, and collect about 70Gdata from the bots of the Torpig botnet. Theyreported that the footprint of the Torpig botnetwas 182,800, and the median and average sizeof the Torpig’s live population was 49,272 and48,532, respectively. They found 49,294 newinfections during the ten days takeover. Theirresearch also indicated that the live populationfluctuates periodically as users switch betweenbeing online and offline. This issue was alsotacked by Dagon, Zou and Lee in [2].

Another method is to use DNS redirection.Dagon, Zou and Lee [2] analyzed captured botsby honypot, and then identified the C&C serverusing source code reverse engineering tools.They then manipulated the DNS entry which isrelated to a botnet’s IRC server, and redirectedthe DNS requests to a local sinkhole. Theytherefore could count the number of bots in thebotnet. As discussed previously, their methodcounts the footprint of the botnet, which was350,000 in their report.

In this paper, we use two large scale mal-

ware data sets for our experiments. Confickeris a well-known and one of the most recentlywidespread malware. Shin et al. [21] collecteda data set about 25 million Conficker victimsfrom all over the world at different levels. Atthe same time, malware targeting on Androidbased mobile systems are developing quicklyin recent years. Zhou and Zhang [20] collecteda large data set of Android based malware.

In [3], Rajab et al. pointed out that it isinaccurate to count the unique IP addresses ofbots because DHCP and NAT techniques areemployed extensively on the Internet ( [1] con-firms this by their observation that 78.9% of theinfected machines were behind a NAT, VPN,proxy, or firewall). They therefore proposed toexamine the hits of DNS caches to find thelower bound of the size of a given botnet.

Rajab et al. [22] reported that botnets canbe categorized into two major genres in termsof membership recruitment: worm-like bot-nets and variable scanning botnets. The latterweights about 82% in the 192 IRC bots that theyinvestigated, and is the more prevalent classseen currently. Such botnets usually performlocalized and non-uniform scanning, and aredifficult to track due to their intermittent andcontinuously changing behavior. The statisticson the lifetime of bots are also reported as 25minutes on average with 90% of them stayingfor less than 50 minutes.

Malware propagation modelling has beenextensively explored. Based on epidemiologyresearch, Zou et al. [16] proposed a numberof models for malware monitoring at the earlystage. They pointed out that these kinds ofmodel are appropriate for a system that con-sists of a large number of vulnerable hosts;in other words, the model is effective at theearly stage of the outbreak of malware, andthe accuracy of the model drops when themalware develops further. As a variant of theepidemic category, Sellke, Shroff and Bagchi[13] proposed a stochastic branching processmodel for characterizing the propagation ofInternet worms, which especially focuses onthe number of compromised computers againstthe number of worm scans, and presented aclosed form expression for the relationship.Dagon, Zou and Lee [4] extended the model

4

TABLE 1Notations of symbols in this paper

Notation DescriptionI(t) Number of infected hosts at time tR(t) Number of recovered hosts at time tN The total number of vulnerable hostsβ(t) The infection rate at time t

S(Li, t) Number of infected hosts of network Li at time tLj

kiThe jth network compromised at round i

of [16] by introducing time zone informationα(t), and built a model to describe the impacton the number of live members of botnets withdiurnal effect.

The impact of side information on thespreading behavior of network viruses has alsobeen explored. Ganesh et al [23] thoroughlyinvestigated the effect of network topology onthe spead of epidemics. By combining Graphtheory and a SIS (susceptible - infective - sus-ceptible) model, they found that if the ratioof cure to infection rates is smaller than thespectral radius of the graph of the studiednetwork, then the average epidemic lifetimeis of order log n, where n is the number ofnodes. On the other hand, if the ratio is largerthan a generalization of the isoperimetric con-stant of the graph, then the average epidemiclifetime is of order ena , where a is a positiveconstant. Similarly, Mieghem et al. [5] appliedthe N -intertwined Markov chain model, anapplication of mean field theory, to analyze thespread of viruses in networks. They found thatτc = 1

λmax(A), where τc is the sharp epidemic

threshold, and λmax(A) is the largest eigenvalueof the adjacency matrix A of the studied net-work. Moreover, there have been many othermethodologies to tackle the problem, such asgame theory [24].

3 PRELIMINARIES

Preliminaries of epidemic modelling and com-plex networks are presented in this section asthis work is mainly based on the two fields.

For the sake of convenience, we summarizethe symbols that we use in this paper in Table1.

3.1 Deterministic Epidemic Models

After nearly 100 years development, the epi-demic models [18] have proved effective andappropriate for a system that possesses a largenumber of vulnerable hosts. In other words,they are suitable at a macro level. Zou et al.[16] demonstrated that they were suitable forthe studies of Internet based virus propagationat the early stage.

We note that there are many factors that im-pact the malware propagation or botnet mem-bership recruitment, such as network topology,recruitment frequency, and connection status ofvulnerable hosts. All these factors contributeto the speed of malware propagation. Fortu-nately, we can include all these factors intoone parameter as infection rate β in epidemictheory. Therefore, in our study, let N be thetotal number of vulnerable hosts of a large-scale network (e.g., the Internet) for a givenmalware. There are two statuses for any one ofthe N hosts, either infected or susceptible. LetI(t) be the number of infected hosts at time t,then we have

dI(t)

dt= β(t) [N −R(t)− I(t)−Q(t)] I(t)−dR(t)

dt,

(1)where R(t), and Q(t) represent the number ofremoved hosts from the infected population,and the number of removed hosts from thesusceptible population at time t. The variableβ(t) is the infection rate at time t.

For our study, model (1) is too detailed andnot necessary as we expect to know the propa-gation and distribution of a given malware. Asa result, we employ the following susceptible-infected model.

dI(t)

dt= βI(t) [N − I(t)] (2)

where the infection rate β is a constant for agiven malware for any network.

We note that the variable t is continuous inmodel (2) and (1). In practice, we measure I(t)at discrete time points. Therefore, t = 0, 1, 2, . . ..We can interpret each time point as a newround of malware membership recruitment,such as vulnerable host scanning. As a result,we can transform model (2) into the discrete

5

form as follows.

I(t) = (1 + α∆)I(t− 1)− β∆I(t− 1)2, (3)

where t = 0, 1, 2, . . ., ∆ is the unit of time, I(0) isthe initial number of infected hosts (we also callthem seeds in this paper), and α = βN , whichrepresents the average number of vulnerablehosts that can be infected by one infected hostper time unit.

In order to simplify our analysis, let ∆ = 1, itcould be one second, one minute, one day, orone month, even one year, depending on thetime scale in a given context. Hence, we havea simpler discrete form given by

I(t) = (1 + α)I(t− 1)− β (I(t− 1))2 . (4)

Based on equation (4), we define the increaseof infected hosts for each time unit as follows.

∆I(t) , I(t)− I(t− 1), t = 1, 2, . . . (5)

To date, many researches are confined tothe “early stage” of an epidemic, such as [16].Under the early stage condition, I(t) << N ,therefore, N − I(t) ≈ N . As a result, a closedform solution is obtained as follows.

I(t) = I(0)eβNt. (6)

When we take the ln operation on both sidesof equation (6), we have

ln I(t) = βNt+ ln I(0). (7)

For a given vulnerable network, β, N andI(0) are constants, therefore, the graphical rep-resentation of equation (7) is a straight line.

Based on the definition of equation (5), weobtain the increase of new members of a mal-ware at the early stage as

∆I(t) = (eβN − 1)I(t− 1)

= (eβN − 1)I(0)eβN(t−1). (8)

Taking the ln operation on both side of (8),we have

ln ∆I(t) = βN(t− 1) + ln((eβN − 1)I(0)

). (9)

Similar to equation (7), the graphical rep-resentation of equation (9) is also a straightline. In other words, the number of recruitedmembers for each round follows an exponen-tial distribution at the early stage.

Fig. 1. The impact from infection rate β onthe recruitment progress for a given vulnerablenetwork with N = 10,000.

We have to note that it is hard for us to knowwhether an epidemic is at its early stage or notin practice. Moreover, there is no mathematicaldefinition about the term early stage.

In epidemic models, the infection rate β has acritical impact on the membership recruitmentprogress, and β is usually a small positive num-ber, such as 0.00084 for worm Code Red [13].For example, for a network with N = 10, 000vulnerable hosts, we show the recruited mem-bers under different infection rates in Figure 1.From this diagram, we can see that the recruit-ment goes slowly when β = 0.0001, however,all vulnerable hosts have been compromised inless than 7 time units when β = 0.0003, andthe recruitment progresses in an exponentialfashion.

This reflects the malware propagation stylesin practice. For malware based on “contact”,such as blue tooth contacts, or viruses depend-ing on emails to propagate, the infection rateis usually small, and it takes a long time tocompromise a large number of vulnerable hostsin a given network. On the other hand, forsome malware, which take active actions forrecruitment, such as vulnerable host scanning,it may take one or a few rounds of scanningto recruit all or a majority of the vulnerablehosts in a given network. We will apply this inthe following analysis and performance evalu-ation.

6

3.2 Complex Networks

Research on complex networks have demon-strated that the number of hosts of networksfollows the power law. People found that thesize distribution usually follows the power law,such as population in cities in a country orpersonal income in a nation [25]. In terms ofthe Internet, researchers have also discoveredmany power law phenomenon, such as the sizedistribution of web files [26]. Recent progressesreported in [27] further demonstrated that thesize of networks follows the power law.

The power law has two expression forms: thePareto distribution and the Zipf distribution.For the same objects of the power law, we canuse any one of them to represent it. However,the Zipf distributions are tidier than the expres-sion of the Pareto distributions. In this paper,we will use Zipf distributions to represent thepower law. The Zipf expression is as follows.

Pr{x = i} =C

iα, (10)

where C is a constant, α is a positive parameter,called the Zipf index, Pr{x = i} represents theprobability of the ith (i = 1, 2, . . .) largest objectin terms of size, and

∑i Pr{x = i} = 1.

A more general form of the distribution iscalled the Zipf-Mandelbrot distribution [28],which is defined as follows.

Pr{x = i} =C

(i+ q)α, (11)

where the additional constant q (q ≥ 0) is calledthe plateau factor, which makes the probabilityof the highest ranked objects flat. The Zipf-Mandelbrot distribution becomes the Zipf dis-tribution when q = 0.

Currently, the metric to say a distribution isa power law is to take the loglog plot of thedata, and we usually say it is a power lawif the result shows a straight line. We haveto note that this is not a rigorous method,however, it is widely applied in practice. Powerlaw distributions enjoy one important property,scale free. We refer interested readers to [29]about the power law and its properties.

Fig. 2. The system architecture of the studiedmalware propagation.

4 PROBLEM DESCRIPTION

In this section, we describe the malware prop-agation problem in general.

As shown in Figure 2, we study the malwarepropagation issue at two levels, the Internetlevel and the network level. We note that atthe network level, a network could be definedin many different ways, it could be an ISPdomain, a country network, the group of a spe-cific mobile devices, and so on. At the Internetlevel, we treat every network of the networklevel as one element.

At the Internet level, we suppose, there areM networks, each network is denoted as Li(1 ≤i ≤ M). For any network Li, we suppose itphysically possesses Ni hosts. Moreover, wesuppose the possibility of vulnerable hosts ofLi is denoted as pi(0 ≤ pi ≤ 1). In general, itis highly possible that Ni 6= Nj , and pi 6= pj fori 6= j, 1 ≤ i, j ≤M . Moreover, due to differencesin network topology, operating system, securityinvestment and so on, the infection rates aredifferent from network to network. We denoteit as βi for Li. Similarly, it is highly possiblethat βi 6= βj for i 6= j, 1 ≤ i, j ≤M .

For any given network Li with pi ·Ni vulner-able hosts and infection rate βi. We suppose themalware propagation starts at time 0. Based onequation (4), we obtain the number of infectedhosts, Ii(t), of Li at time t as follows.

Ii(t) = (1 + αi)Ii(t− 1)− βi(Ii(t− 1))2 (12)= (1 + βipiNi)Ii(t− 1)− βi(Ii(t− 1))2

In this paper, we are interested in a global

7

sense of malware propagation. We study thefollowing question.

For a given time t since the outbreak of amalware, what are the characteristics of thenumber of compromised hosts for each net-work in the view of the whole Internet. In otherwords, to find a function F about Ii(t)(1 ≤ i ≤M). Namely, the pattern of

F (I1(t), I2(t), . . . , IM(t)) . (13)

For simplicity of presentation, we use S(Li, t)to replace Ii(t) at the network level, and I(t)is dedicated for the Internet level. Followingequation (13), for any network Li(1 ≤ i ≤ M),we have

S(Li, t) = (1+βipiNi)S(Li, t−1)−βi (S(Li, t− 1))2 .(14)

At the Internet level, we suppose there arek1, k2, . . . , kt networks that have been compro-mised at each round for each time unit from 1to t. Any ki(1 ≤ i ≤ t) is decided by equation(4) as follows

ki = (1 + βnM)I(i− 1)− βn (I(i− 1))2 , (15)

where M is the total number of networks overthe Internet, and βn is the infection rate amongnetworks. Moreover, suppose the number ofseeds, k0, is known.

At this time point t, the landscape of thecompromised hosts in terms of networks is asfollows.

S(L1k1, t), S(L2

k1, t), . . . , S(Lk1k1 , t)︸︷︷︸k1

S(L1k2, t− 1), S(L2

k2, t− 1), . . . , S(Lk2k2 , t− 1)︸︷︷︸k2. . .

S(L1kt , 1), S(L2

kt , 1), . . . , S(Lktkt , 1)︸︷︷︸kt

, (16)

where Ljki represents the jth network that wascompromised at round i. In other words, thereare k1 compromised networks, and each ofthem have progressed t time units; k2 com-promised networks, and each of them has pro-gressed t − 1 time units; and kt compromisednetworks, and each of them have progressed 1time unit.

It is natural to have the total number ofcompromised hosts at the Internet level as

I(t) = S(L1k1, t) + S(L2

k1, t) + . . .+ S(Lk1k1 , t)︸︷︷︸k1

+ S(L1k2, t− 1) + . . .+ S(Lk2k2 , t− 1)︸︷︷︸

k2+ . . . (17)+ S(L1

kt , 1) + S(L2kt , 1) + . . .+ S(Lktkt , 1)︸︷︷︸kt

Suppose ki(i = 1, 2, . . .) follows one distri-bution with a probability distribution of pn (nstands for number), and the size of a compro-mised network, S(Li, t), follows another prob-ability distribution of ps (s stands for size). LetpI be the probability distribution of I(t)(t =0, 1, . . .). Based on equation (18), we find pI isexactly the convolution of pn and ps.

pI = pn ~ ps, (18)

where ~ is the convolution operation.Our goal is to find a pattern of pI of equation

(18).

5 MALWARE PROPAGATIONMODELLING

As shown in Figure 2, we abstract the M net-works of the Internet into M basic elements inour model. As a result, any two large networks,Li and Lj (i 6= j), are similar to each other atthis level. Therefore, we can model the studiedproblem as a homogeneous system. Namely, allthe M networks share the same vulnerabilityprobability (denoted as p), and the same infec-tion rate (denoted as β). A simple way to obtainthese two parameters is to use the means. p = 1

M

∑Mi=1 pi

β = 1M

∑Mi=1 βi

(19)

For any network Li, let Ni be the total num-ber of vulnerable hosts, then we have

Ni = p · Ni, i = 1, 2, . . . ,M, (20)

where Ni is the total number of computers ofnetwork Li.

8

As discussed in Section III, we know thatNi(i = 1, 2, . . . ,M) follows the power law. Asp is a constant in equation (20), then Ni(i =1, 2, . . . ,M) follows the power law as well.Without loss of generality, let Li represent theith network in terms of total vulnerable hosts(Ni). Based on the Zipf distribution, if we ran-domly choose a network X , the probability thatit is network Lj is

Pr{X = Lj} = pz(j) =Nj∑Mi=1Ni

=C

jα(21)

Equation (21) shows clearly that a networkwith a larger number of vulnerable hosts has ahigher probability to be compromised.

Following equation (18), at time t, we havek1 + k2 + . . .+ kt networks that have been com-promised. Combining with equation (21), ingeneral, we know the first round of recruitmenttakes the largest k1 networks, and the secondround takes the k2 largest networks among theremaining networks, and so on. We thereforecan simplify equation (18) as

I(t) =

k1∑j=1

S(Nj, t)pz(j)

+

k2∑j=1

S(Nk1+j, t− 1)pz(k1 + j)

+ . . . (22)

+kt∑j=1

S(Nk1+...+kt−1+j, 1)

·pz(k1 + . . .+ kt−1 + j).

From equation (22), we know the total num-ber of compromised hosts and their distribu-tion in terms of networks for a given time pointt.

6 ANALYSIS ON THE PROPOSED MAL-WARE PROPAGATION MODEL

In this section, we try to extract the pattern ofI(t) in terms of S(Li, t

′), or pI of equation (18).

We make the following definitions before weprogress for the analysis.

1) Early stage. An early stage of the breakoutof a malware means only a small percent-age of vulnerable hosts have been compro-mised, and the propagation follows expo-nential distributions.

2) Final stage. The final stage of the propaga-tion of a malware means that all vulner-able hosts of a given network have beencompromised.

3) Late stage. A late stage means the timeinterval between the early stage and thefinal stage.

We note that many researches are focused onthe early stage, and we define the early stageto meet the pervasively accepted condition, wecoin the other two terms for the convenienceof our following discussion. Moreover, we setvariable Te as the time point that a malware’sprogress transfers from its early stage to latestage. In terms of mathematical expressions, weexpress the early, late and final stage as 0 ≤ t <Te, Te ≤ t <∞, and t =∞, respectively.

Due to the complexity of equation (22), itis difficult to obtain conclusions in a dynamicstyle. However, we are able to extract someconclusions under some special conditions.

Lemma 1. If distributions p(x) and q(x) followexponential distributions, then p(x) ~ q(x) followsan exponential distribution as well.

Due to the space limitation, we skip the proofand refer interested readers to [30].

At the early stage of a malware breakout, wehave advantages to obtain a clear conclusion.

Theorem 1. For large scale networks, such as theInternet, at the early stage of a malware propaga-tion, the malware distribution in terms of networksfollows exponential distributions.

Proof: At a time point of the early stage(0 ≤ t < Te) of a malware breakout, followingequation (6), we obtain the number of compro-mised networks as

I(t) = I(0)eβnMt. (23)

It is clear that I(t) follows an exponential dis-tribution.

For any of the compromised networks, wesuppose it has progressed t

′(0 < t

′ ≤ t < Te)

9

TABLE 2The number of different Android malware against time (months) in 2010-2011

Time point 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Variants 13 26 39 53 71 94 127 193 259 374 583 986 1,513 2,191 3,451

time units, and its size is

S(Li, t′) = Ii(0)eβNit

′

. (24)

Based on equation (24), we find that the sizeof any compromised network follows an expo-nential distribution. As a result, all the sizesof compromised networks follow exponentialdistributions at the early stage.

Based on Lemma 1, we obtain that the mal-ware distribution in terms of network followsexponential distributions at its early stage.

Moreover, we can obtain concrete conclusionof the propagation of malware at the finalstage.

Theorem 2. For large scale networks, such as theInternet, at the final stage (t = ∞) of a malwarepropagation, the malware distribution in terms ofnetworks follows the power law distribution.

Proof: At the final stage, all vulnerablehosts have been compromised, namely,

S(Li,∞) = Ni, i = 1, 2, . . . ,M

Based on our previous discussion, we knowNi(i = 1, 2, . . . ,M) follows the power law. As aresult, the theorem holds.

Now, we move our study to the late stage ofmalware propagation.

Theorem 3. For large scale networks, such as theInternet, at the late stage (Te ≤ t < ∞) of amalware breakout, the malware distribution includetwo parts: a dominant power law body and a shortexponential tail.

Proof: Suppose a malware propagation hasprogressed for t(t >> Te) time units. Let t′ =t− Te. If we separate all the compromised I(t)hosts by time point t′, we have two groups ofcompromised hosts.

Following Theorem 2, as t′ >> Te, the com-promised hosts before t′ follows the power law.

At the same time, all the compromised net-works after t′ are still in their early stage. There-fore, these recently compromised networks fol-low exponential distributions.

Now, we need to prove that the networkscompromised after time point t′ are at the tail ofthe distribution. First of all, for a given networkLi, for t1 > t2, we have

S(Li, t1) ≥ S(Li, t2) (25)

For two networks, Li and Lj , if Ni ≥ Nj ,then Li should be compromised earlier than Lj .Combining this with (25), we know the latercompromised networks usually lie at the tailof the distribution.

Due to the fact that t′ >> Te, the length ofthe exponential tail is much shorter than thelength of the main body of the distribution.

7 PERFORMANCE EVALUATION

In this section, we examine our theoreticalanalysis through two well-known large-scalemalware: Android malware and Conficker. An-droid malware is a recent fast developing anddominant smartphone based malware [20]. Dif-ferent from Android malware, the Confickerworm is an Internet based state-of-the-art bot-net [21]. Both the data sets have been widelyused by the community.

From the Android malware data set, we havean overview of the malware development fromAugust 2010 to October 2011. There are 1260samples in total from 49 different Android mal-ware in the data set. For a given Android mal-ware program, it only focuses on one or a num-ber of specific vulnerabilities. Therefore, allsmartphones share these vulnerabilities form aspecific network for that Android malware. Inother words, there are 49 networks in the dataset, and it is reasonable that the population ofeach network is huge. We sort the malwaresubclasses according to their size (number ofsamples in the data set), and present them in

10

Fig. 3. The probability distribution of Androidmalware in terms of networks.

a loglog format in Figure 3, the diagram isroughly a straight line. In other words, we cansay that the Android malware distribution interms of networks follows the power law.

We now examine the growth pattern of to-tal number of compromised hosts of Androidmalware against time, namely, the pattern ofI(t). We extract the data from the data set andpresent it in Table 2. We further transform thedata into a graph as shown in Figure 4. Itshows that the member recruitment of Androidmalware follows an exponential distributionnicely during the 15 months time interval. Wehave to note that our experiments also indicatethat this data does not fit the power law (we donot show them here due to space limitation).

In Figure 4, we match a straight line to thereal data through the least squares method.Based on the data, we can estimate that thenumber of seeds (I(0)) is 10, and α = 0.2349.Following our previous discussion, we inferthat the propagation of Android malware wasin its early stage. It is reasonable as the size ofeach Android vulnerable network is huge andthe infection rate is quite low (the infection isbasically based on contacts).

We also collected a large data set of Confickerfrom various aspects. Due to the space limita-tion, we can only present a few of them hereto examine our theoretical analysis.

First of all, we treat Autonomous Systems (AS)as networks in the Internet. In general, ASs are

Fig. 4. The growth of total compromised hostsby Android malware against time from August2010 to October 2011.

TABLE 3Statistics for Conficker distribution in terms of

ASs

Number of ASes Largest botnet Smallest botnet1,0048 2,825,403 1

large scale elements of the Internet. A few keystatistics from the data set are listed in Table3. We present the data in a loglog format inFigure 5, which indicates that the distributiondoes follow the power law.

A unique feature of the power law is the

Fig. 5. Power law distribution of Conficker interms of autonomous networks.

11

TABLE 4Statistics for Conficker distribution in terms of

domain names at the three top levels

Number of botnets Largest botnet Smallest botnettop level 462 2,201,183 1level 1 20,104 1,718,306 1level 2 96,756 1,714,283 1

TABLE 5The last six elements of Conficker botnet from

the top three domain name levels.

t=1 t=2 t=3 t=4 t=5 t=6top level 9 14 18 15 22 68level 1 543 686 924 1,534 2,972 7,898level 2 3,461 4,085 5,234 7,451 13,002 33,522

scale free property. In order to examine thisfeature, we measure the compromised hostsin terms of domain names at three differentdomain levels: the top level, level 1, and level 2,respectively. Some statistics of this experimentare listed in Table 4.

Once again, we present the data in a loglogformat in Figure 6 (a), (b) and (c), respectively.The diagrams show that the main body ofthe three scale measures are roughly straightlines. In other words, they all fall into powerlaw distributions. We note that the flat headin Figure 6 can be explained through a Zipf-Mandelbrot distribution. Therefore, Theorem 2holds.

In order to examine whether the tails areexponential, we take the smallest 6 data fromeach tail of the three levels. It is reasonable tosay that they are the networks compromised atthe last 6 time units, the details are listed inTable 5 (we note that t = 1 is the sixth last timepoint, and t = 6 is the last time point).

When we present the data of Table 5 into agraph as shown in Figure 7, we find that theyfit an exponential distribution very well, espe-cially for the level 2 and level 3 domain namecases. This experiment confirms our claim inTheorem 3.

8 FURTHER DISCUSSION

In this paper, we have explored the problem ofmalware distribution in large-scale networks.

Fig. 7. The three tails from the three domainname levels fit exponential distributions.

There are many directions that could be fur-ther explored. We list some important ones asfollows.

1) The dynamics of the late stage. We havefound that the main body of malware dis-tribution follows the power law with ashort exponential tail at the late stage. It isvery attractive to explore the mathematicalmechanism of how the propagation leadsto such kinds of mixed distributions.

2) The transition from exponential distribu-tion to power law distribution. It is nec-essary to investigate when and how amalware distribution moves from an ex-ponential distribution to the power law. Inother words, how can we clearly define thetransition point between the early stageand the late stage.

3) Multiple layer modelling. We hire the fluidmodel in both of the two layers in ourstudy as both layers are sufficiently largeand meet the conditions for the modellingmethods. In order to improve the accuracyof malware propagation, we may extendour work to n(n > 2) layers. In another sce-nario, we may expect to model a malwaredistribution for middle size networks, e.g.an ISP network with many subnetworks.In these cases, the conditions for the fluidmodel may not hold. Therefore, we needto seek suitable models to address the

12

Fig. 6. Power law distribution of Conficker botnet in the top three levels of domain names.

problem.4) Epidemic model for the proposed two

layer model. In this paper, we use the SImodel, which is the simplest for epidemicanalysis. More practical models, e.g. SISor SIR, could be chosen to serve the sameproblem.

5) Distribution of coexist multiple malwarein networks. In reality, multiple malwaremay coexist at the same networks. Due tothe fact that different malware focus ondifferent vulnerabilities, the distributionsof different malware should not be thesame. It is challenging and interesting toestablish mathematical models for multi-ple malware distribution in terms of net-works.

9 SUMMARY AND FUTURE WORK

In this paper, we thoroughly explore the prob-lem of malware distribution at large-scale net-works. The solution to this problem is desper-ately desired by cyber defenders as the networksecurity community does not yet have solidanswers. Different from previous modellingmethods, we propose a two layer epidemicmodel: the upper layer focuses on networks ofa large scale networks, for example, domainsof the Internet; the lower layer focuses onthe hosts of a given network. This two layermodel improves the accuracy compared withthe available single layer epidemic models inmalware modelling. Moreover, the proposed

two layer model offers us the distribution ofmalware in terms of the low layer networks.

We perform a restricted analysis based on theproposed model, and obtain three conclusions:The distribution for a given malware in termsof networks follows exponential distribution,power law distribution with a short exponen-tial tail, and power law distribution, at itsearly, late, and final stage, respectively. In orderto examine our theoretical findings, we haveconducted extensive experiments based on tworeal-world large-scale malware, and the resultsconfirm our theoretical claims.

In regards to future work, we will firstlyfurther investigate the dynamics of the latestage. More details of the findings are expectedto be further studied, such as the length of theexponential tail of a power law distribution atthe late stage. Secondly, defenders may caremore about their own network, e.g., the distri-bution of a given malware at their ISP domains,where the conditions for the two layer modelmay not hold. We need to seek appropriatemodels to address this problem. Finally, weare interested in studying the distribution ofmultiple malware on large-scale networks aswe only focus on one malware in this paper.We believe it is not a simple linear relationshipin the multiple malware case compared to thesingle malware one.

13

ACKNOWLEDGMENT

Dr Yu’s work is partially supported by theNational Natural Science Foundation of China(grant No. 61379041), Prof. Stojmenovic’s workis partially supported by NSERC Canada Dis-covery grant (grant No. 41801-2010 ), and KAUDistinguished Scientists Program.

REFERENCES

[1] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szyd-lowski, R. Kemmerer, C. Kruegel, and G. Vigna, “Yourbotnet is my botnet: Analysis of a botnet takeover,” inCCS ’09: Proceedings of the 2009 ACM conference on computercommunication security, 2009.

[2] D. Dagon, C. Zou, and W. Lee, “Modeling botnet propaga-tion using time zones,” in Proceedings of the 13 th Networkand Distributed System Security Symposium NDSS, 2006.

[3] M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, “Mybotnet is bigger than yours (maybe, better than yours):why size estimates remain challenging,” in Proceedingsof the first conference on First Workshop on Hot Topics inUnderstanding Botnets, 2007.

[4] D. Dagon, C. C. Zou, and W. Lee, “Modeling botnetpropagation using time zones,” in NDSS, 2006.

[5] P. V. Mieghem, J. Omic, and R. Kooij, “Virus spread innetworks,” IEEE/ACM Transactions on Networking, vol. 17,no. 1, pp. 1–14, 2009.

[6] Cabir, http://www.f-secure.com/en/web/labs global/2004-threat-summary.

[7] Ikee, http://www.f-secure.com/v-descs/worm iphoneos ikee b.shtml.

[8] Brador, http://www.f-secure.com/v-descs/brador.shtml.[9] S. Peng, S. Yu, and A. Yang, “Smartphone malware and its

propagation modeling: A survey,” IEEE CommunicationsSurveys and Tutorials, in press, 2014.

[10] Z. Chen and C. Ji, “An information-theoretic view ofnetwork-aware malware attacks,” IEEE Transactions onInformation Forensics and Security, vol. 4, no. 3, pp. 530–541, 2009.

[11] A. M. Jeffrey, xiaohua Xia, and I. K. Craig, “When toinitiate hiv therapy: A control theoretic approach,” IEEETransactions on Biomedical Engineering, vol. 50, no. 11, pp.1213–1220, 2003.

[12] R. Dantu, J. W. Cangussu, and S. Patwardhan, “Fast wormcontainment using feedback control,” IEEE Transactions onDependable and Secure Computing, vol. 4, no. 2, pp. 119–136,2007.

[13] S. H. Sellke, N. B. Shroff, and S. Bagchi, “Modeling andautomated containment of worms,” IEEE Trans. Depend-able Sec. Comput., vol. 5, no. 2, pp. 71–86, 2008.

[14] P. De, Y. Liu, and S. K. Das, “An epidemic theoretic frame-work for vulnerability analysis of broadcast protocolsin wireless sensor networks,” IEEE Trans. Mob. Comput.,vol. 8, no. 3, pp. 413–425, 2009.

[15] G. Yan and S. Eidenbenz, “Modeling propagation dynam-ics of bluetooth worms (extended version),” IEEE Trans.Mob. Comput., vol. 8, no. 3, pp. 353–368, 2009.

[16] C. C. Zou, W. Gong, D. Towsley, and L. Gao, “The moni-toring and early detection of internet worms,” IEEE/ACMTrans. Netw., vol. 13, no. 5, pp. 961–974, 2005.

[17] C. Gao and J. Liu, “Modeling and restraining mobile viruspropagation,” IEEE Trans. Mob. Comput., vol. 12, no. 3, pp.529–541, 2013.

[18] D. J. Daley and J. Gani, Epidemic Modelling: An Introduc-tion. Cambridge University, 1999.

[19] W. Willinger, D. Alderson, and J. C. Doyle, “Mathematicsand the internet: A source of enormous confusion andgreat potential,” Notices of the Ameriacan MathematicalSocieity, vol. 56, no. 5, pp. 586–599, 2009.

[20] Y. Zhou and X. Jiang, “Dissecting android malware:Characterization and evolution,” in IEEE Symposium onSecurity and Privacy, 2012, pp. 95–109.

[21] S. Shin, G. Gu, A. L. N. Reddy, and C. P. Lee, “A large-scale empirical study of conficker,” IEEE Transactions onInformation Forensics and Security, vol. 7, no. 2, pp. 676–690,2012.

[22] M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, “Amultifaceted approach to understanding the botnet phe-nomenon,” in Internet Measurement Conference, 2006, pp.41–52.

[23] A. J. Ganesh, L. Massoulie, and D. F. Towsley, “The effectof network topology on the spread of epidemics,” inINFOCOM, 2005, pp. 1455–1466.

[24] J. Omic, A. Orda, and P. V. Mieghem, “Protecting againstnetwork infections: A game theoretic perspective,” inINFOCOM’09, 2009.

[25] R. L. Axtell, “Zipf distribution of u.s. firm sizes,” Science,vol. 293, 2001.

[26] M. Mitzenmacher, “A brief history of generative modelsfor power law and lognornal distributions,” Internet Math-ematics, vol. 1, 2004.

[27] M. Newman, Networks, An Introduction. Oxford Univer-sity Press, 2010.

[28] Z. K. Silagadze, “Citations and the zipf-mandelbrot’slaw,” Complex Systems, vol. 11, pp. 487–499, 1997.

[29] M. E. J. Newman, “Power laws, pareto distributions andzipf’s law,” Contemporary Physics, vol. 46, pp. 323–351,December 2005.

[30] L. Kleinrock, Queueing Systems. Wiley Interscience, 1975,vol. I: Theory.

14

Shui Yu (M’05-SM’12) received the B.Eng.and M.Eng. degrees from University ofElectronic Science and Technology ofChina, Chengdu, P. R. China, in 1993 and1999, respectively, and the Ph.D. degreefrom Deakin University, Victoria, Australia,in 2004. He is currently a Senior Lecturerwith the School of Information Technology,Deakin University, Victoria, Australia. He

has published nearly 100 peer review papers, including topjournals and top conferences, such as IEEE TPDS, IEEE TIFS,IEEE TFS, IEEE TMC, and IEEE INFOCOM. His researchinterests include networking theory, network security, and math-ematical modeling.Dr. Yu actively servers his research communities in variousroles, which include the editorial boards of IEEE Transactionson Parallel and Distributed Systems, IEEE CommunicationsSurveys and Tutorials, and IEEE Access, IEEE INFOCOM TPCmembers 2012-2015, symposium co-chairs of IEEE ICC 2014,IEEE ICNC 2013-2015, and many different roles of internationalconference organizing committees. He is a senior member ofIEEE, and a member of AAAS.

Guofei Gu (S’06-M’08) received the Ph.D.degree in computer science from the Col-lege of Computing, Georgia Institute ofTechnology. He is an assistant professorin the Department of Computer Scienceand Engineering, Texas A&M University(TAMU), College Station, TX. His researchinterests are in network and system se-curity,such as malware analysis, detection,

defense, intrusion and anomaly detection, and web and socialnetworking security. He is currently directing the Secure Com-munication and Computer Systems (SUCCESS) Laboratory atTAMU. Dr. Gu is a recipient of the 2010 NSF CAREER awardand a corecipient of the 2010 IEEE Symposium on Security andPrivacy (Oakland10) best student paper award.

Ahmed Barnawi is an associate profes-sor at the Faculty of Computing and IT,King Abdulaziz University, Jeddah, SaudiArabia, where he works since 2007. Hereceived PhD at the University of Bradford,UK in 2006. He was visiting professor at theUniversity of Calgary in 2009. His researchareas are cellular and mobile communica-tions, mobile ad hoc and sensor networks,

cognitive radio networks and security. He received three strate-gic research grants and registered two patents in the US. He isa member of IEEE.

Song Guo (M’02-SM’11) received the PhDdegree in computer science from the Uni-versity of Ottawa, Canada in 2005. Heis currently a Senior Associate Professorat School of Computer Science and En-gineering, the University of Aizu, Japan.His research interests are mainly in theareas of protocol design and performanceanalysis for reliable, energy-efficient, and

cost effective communications in wireless networks. Dr. Guo isan associate editor of the IEEE Transactions on Parallel andDistributed Systems and an editor of Wireless Communicationsand Mobile Computing. He is a senior member of the IEEE andthe ACM.

Ivan Stojmenovic was editor-in-chief ofIEEE Transactions on Parallel and Dis-tributed Systems (2010-3), and is founderof three journals. He is editor of IEEETransactions on Computers, IEEE Net-work, IEEE Transactions on Cloud Com-puting and ACM Wireless Networks andsteering committee member of IEEE Trans-actions on Emergent Topics in Computing.

Stojmenovic is on Thomson Reuters list of Highly Cited Re-searchers from 2013, has top h-index in Canada for mathe-matics and statistics, and has more than 15000 citations. Hereceived five best paper awards. He is Fellow of IEEE, CanadianAcademy of Engineering and Academia Europaea. He receivedHumboldt Research Award.

malware propagation in large-scale networksfaculty.cse.tamu.edu/guofei/paper/malsize-tkde15.pdf ·...

Documents