A Multi-faceted Approach to Countering Internet Threats
by
Sandeep Sarat
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
May, 2008
© Sandeep Sarat 2008
All rights reserved
Abstract
While the Internet has revolutionized the communication landscape, it continues to
be plagued by issues of robustness and security that threaten the network's operation. In
this dissertation, we employ a multi-faceted approach to tackle these issues. Specifically,
measurement studies are performed to experimentally quantify the ground truth related to
the robustness of the Internet infrastructure and security threats in the wild. This work is
complemented by research on future Internet threats and mechanisms to contain them. The
sources of these threats are novel classes of malware which exploit the increasing attack
surface presented by rapidly evolving Internet technologies.
As an instance of quantifying the emerging security threats in the wild, we performed
a wide-scale measurement study of the Storm botnet. This botnet, which represents the
leading edge in botnet technology, uses a distributed peer-to-peer (P2P) architecture and
aggressively defends itself. We developed a crawler which actively scoured the P2P botnet
to determine its size and other structural properties. This study shows how traditional
P2P distributed hash tables (DHTs) differ from botnet DHTs, even though they use the same
underlying P2P protocol, mainly due to the diverse interests of the participating entities.
In particular, we could precisely identify nodes which poison the DHT index, using simple
heuristics. Such a capability is particularly alarming to the security community, considering
that the Storm botnet aggressively defends itself by carrying out DDoS attacks. The
bigger implication of this study is that current botnet monitoring techniques easily lend
themselves to miscreant counter-intelligence, thereby motivating the need for
stealthier monitoring techniques.
Since network defense is essentially an arms race, we not only address current threats,
but also look ahead to threats that are likely to arise in the future. We show that, as mo-
bile devices become pervasive and more powerful, malware can exploit their mobility pat-
terns to trivially propagate around perimeter defenses such as firewalls. Using an analytical
model, we estimate the speed with which such infections can propagate over a population of
nomadic users. We validate our results using realistic mobility traces from a campus-wide
wireless network with hundreds of access points and thousands of mobile users. We show
that, while the speed of propagation of mobile malware is slower when compared to tradi-
tional Internet worms, it is still fast enough to render manual countermeasures implausible.
Furthermore, we develop a novel probabilistic technique which uses a modified
version of random moonwalks to provide early detection of such mobile malware. The
proposed technique can reliably detect and pinpoint the origin of a mobile infection in the
early stages of its evolution.
As another direction in countering threats, we address vulnerabilities of the web
browser. The browser is the single most widely used application on the Internet today.
However, its security policies are largely antiquated in today's increasingly multi-principal,
asynchronous web programming model. We develop two novel abstractions, one
for sandboxing untrusted third-party content and another which enables controlled sharing
between domains, to transform the browser into a truly multi-principal platform.
Given that worms pose a global-scale threat to the Internet infrastructure, an evaluation
of robustness, quantifying the ability of the Internet to withstand attacks, is essential.
Specifically, we measured the performance of anycast on four top-level Domain Name
System (DNS) zones, which allowed us to quantify the reliability and resiliency of the DNS
zones against large-scale distributed denial of service (DDoS) attacks. We showed that outages in
DNS service are indeed rare. However, when they do occur, outages can last up to multiple
minutes, mainly due to slow Border Gateway Protocol (BGP) convergence.
At the other end of the spectrum, Internet routers can also be the subjects of DDoS
attacks. One such attack, known as the shrew attack, consists of small traffic bursts that
can temporarily inundate a router's queue while, at the same time, evading detection due to
their low average transmission rate. Using simple mathematical analysis and simulation,
we show that a relatively small buffer, combined with a fair queue management scheme, is
sufficient to detect and thwart low-rate TCP attacks against routers.
Advisor: Dr. Andreas Terzis, Department of Computer Science, Johns Hopkins University
Primary Reader: Dr. Gerald M. Masson, Department of Computer Science, Johns Hop-
kins University
Secondary Reader: Dr. Cristina Nita-Rotaru, Department of Computer Science, Purdue
University
Acknowledgements
My first, and most earnest, acknowledgements must go to my advisor, Prof. Andreas
Terzis. He took me under his wing when I was at a crossroads in the pursuit of the doctorate
degree in 2004. Thereafter, he has been one of the most affable advisors I have ever known.
I shall remain grateful to him for granting me freedom and guiding me in the research topics
of my fancy.
I wish to thank Professor Jonathan Shapiro, for his extremely insightful rants and intro-
ducing me to the world of systems and security in the early days of my PhD. I also wish to
thank Professor Rao Kosaraju, for providing me with an opportunity to TA the Randomized
Algorithms course, multiple times. My interactions with Prof. Shapiro and Prof. Kosaraju
have taught me the importance of maintaining a balanced perspective, involving both a
theoretical and a systems standpoint. I also wish to thank Dr. Gerald M. Masson and Dr.
Cristina Nita-Rotaru for their valuable critiques while serving on my defense committee.
I have been part of the Hopkins Internetworking Group (HiNRG) during the past five
years. The meticulously maintained schedule of each of its inhabitants will remain etched
in my brain forever. I thank Razvan Musaloiu-E. for all his technical help, the monthly
photo shoots, and the chocolates; Moheeb Abu Rajab for being the only other colleague in
HiNRG pursuing network security; Chieh-Lan Mike Liang for adhering to my dark room
policy and listening to my nonstop drivel; and Jeongil (John) Ko, Yin Chen and Sam Small for
putting up with me.
The time spent outside of the department, be it on the lush meadows of the upper quad,
in #343, in the JHUCC van or the basement music room, has been memorable owing
largely to: Amit Paliwal for his witty remarks; Puneet Bajpai, Utkarsh Sharma and Paritosh
Shroff for their camaraderie; Dheeraj Singaraju, Ashley Fernandes, Supratim Ray, and
Purshottam Dixit for jamming along with Hallowed Be Thy Name ever so often; Kishore
Kothapalli, Anshumal Sinha, Sridhar Swaroop and Pramod Singh Thakur for all the
discussions on diverse subjects; and Ranganath Teki, Santosh Vijaykumar, Piyush Jain and
Saurabh Paliwal for being my roommates. Finally, a big thanks to Harris (1806-1860), who
egged me on my way to the lab daily.
The acknowledgment of the long-term support of my parents, Saratchandran and
Meera, and my sister, Sapna, may be ritual in a pursuit of this sort, but is nonetheless
necessary, apt and heartfelt.
Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 Motivation and a Brief Chronology
  1.2 Thesis Contribution and Outline
    1.2.1 Tracking P2P Botnets
    1.2.2 Using Mobility to Propagate Malware
    1.2.3 Isolation and Sharing in the Web Browser
    1.2.4 Anycast in DNS
    1.2.5 Router Buffer and Robustness

2 On Tracking P2P Botnets
  2.1 Background
    2.1.1 Command and Control
    2.1.2 Encrypted Protocol
  2.2 Measurement Methodology
  2.3 Results
    2.3.1 Node ID distribution
    2.3.2 Population Estimates
    2.3.3 Relationship between peer addresses and identifiers
    2.3.4 Data from Spam Block Lists
    2.3.5 Discussion
  2.4 Related Work
  2.5 Summary

3 On Using Mobility to Propagate Malware
  3.1 Worm Model
    3.1.1 Mobility Model
  3.2 Evaluation
    3.2.1 Mobile node infection
    3.2.2 Mixing mobile and static nodes
  3.3 Detection
    3.3.1 Detection Speed
  3.4 Spatial evolution
    3.4.1 Popularity
  3.5 Discussion
    3.5.1 Popularity is dynamic
    3.5.2 Evasive worms
  3.6 Background and Related Work
  3.7 Summary and Future Directions

4 On the Detection and Origin Identification of Mobile Worms
  4.1 Background
    4.1.1 Random Moonwalks
  4.2 Mobile Worm Detection
    4.2.1 Random Moonwalks and Mobile Worms
    4.2.2 Proposed approach
    4.2.3 Effect of infection on moonwalk length
  4.3 Worm Identification
    4.3.1 Discussion
  4.4 Related Work
  4.5 Summary and Future Work

5 On Web Browser Protection
  5.1 Background
    5.1.1 Same Origin Policy
    5.1.2 XSS attacks
  5.2 Trust Model
  5.3 Konqueror Implementation
  5.4 Related Work
  5.5 Summary and Future Work

6 On the Use of Anycast in DNS
  6.1 Background
  6.2 Measurement Methodology
  6.3 Anycast Deployment Strategies
    6.3.1 Multiple Instances, One Site: B-Root
    6.3.2 Multiple Instances, Multiple Heterogeneous Sites: F-, K-root
    6.3.3 Multiple Instances, Multiple Homogeneous Sites: UltraDNS
  6.4 Evaluation
    6.4.1 Response times
    6.4.2 Availability
    6.4.3 Constancy
    6.4.4 Effectiveness of Localization
    6.4.5 Comparison of Deployment Strategies
  6.5 Effect of Advertisement Radius
  6.6 Related Work
  6.7 Summary

7 On the Effect of Router Buffer Sizes on Low-Rate Denial of Service Attacks
  7.1 The Shrew Attack
  7.2 Mathematical Analysis
  7.3 Evaluation
    7.3.1 Low Speed Link
    7.3.2 High Speed Link
  7.4 Related Work
  7.5 Summary

8 Future Work
  8.0.1 Botnets
  8.0.2 Mobile Malware
  8.0.3 Web based Malware

Bibliography
Vita
List of Tables

4.1 Simulation Parameters.
6.1 Distribution of used PlanetLab nodes around the world.
6.2 List of the 26 F-root sites. The last column shows the percentage of PlanetLab nodes served by each F-root cluster. An example of an F-root server is SFO2a.f-rootservers.net.
6.3 List of the 7 K-root sites.
6.4 The list of the 8 UltraDNS clusters reachable from PlanetLab.
6.5 Statistics of DNS response times.
6.6 Percentage of flips due to outages.
7.1 Notation used in the mathematical analysis of the shrew attack.
7.2 Aggregate link utilization from 20 TCP flows.
7.3 Aggregate TCP link utilization for 250 flows.
List of Figures

2.1 The distribution of Storm bot IDs over the 128-bit hash space for (a) the original Storm botnet (b) the encrypted Storm botnet. The results in this figure are based on data collected on 11/19/07.
2.2 Population estimates of the botnets from 11/09/2007 - 1/29/2008 for the (a) older Storm botnet (b) encrypted Storm botnet.
2.3 Top 15 countries in which peers are located, percentage-wise. The last bar, NA (Not Available), comprises non-publicly-routable IP addresses. (a) Older Storm botnet (b) Encrypted Storm botnet.
2.4 (a) Distribution of IDs attributed to unusable IP addresses in the original Storm network. (b) The distribution of IDs attributed to valid IP addresses in the same network. The x-axis represents the 128-bit hash space.
2.5 Cumulative density function of the number of IDs associated with a single IP address and port. Unreachable/non-routable IP addresses were not included in this distribution.
2.6 Correspondence between the number of IDs published by an IP address and occurrence in spam black lists.
3.1 Percentage of infected users as a function of time as predicted by the analytical model and as demonstrated by simulation.
3.2 (a) Rate of domain infections as a function of time with the total mobile population (b) Rate of infection with only 25% of the mobile nodes.
3.3 The first time an infected node is seen at a network domain as a function of the domain's popularity, defined as the cumulative node-hours of occupancy of a domain.
3.4 Detection time when monitors are deployed in the top x% of the domains.
3.5 (a) Similarity between the popularity of the top 50 domains on a weekly basis for 2004 (b) Median detection time if the monitors are deployed statically.
3.6 Worm evolution when the worm is inactive in the top 50 domains.
4.1 (a) Random moonwalk on a network with no malicious traffic. (b) Random moonwalk on the same network when a worm is injected at t ~ 167 min. The y-axis represents the frequency with which flows starting at a particular time appear in the set of paths traversed by the moonwalks.
4.2 (a) Average moonwalk length for a network with no malicious traffic and a network in which a worm is injected at t ~ 420 min. Graphs are shown when 100% and 75% of the population is vulnerable. (b) Percentage of infected nodes as a function of time for the same worm.
4.3 Average moonwalk length for a network with different volumes of normal traffic. The curve labelled 'High' corresponds to double the volume of traffic in 'Norm', while the curve labelled 'Low' represents a scenario in which the traffic is halved.
4.4 Scatter plot of walk length versus root node frequency. The square dot indicates the actual patient zero.
4.5 Candidate infection trees reconstructed using a BFS search. The tree rooted at 5344 is the actual infection tree; all nodes in this tree were indeed infected by the worm. The trees rooted at 1167 and 2148 are benign. A directed edge between nodes X and Y indicates that X initiated at least one flow to Y.
4.6 Percentage of mobile nodes that need to be inspected for signs of infection as a function of the normal traffic intensity.
5.1 The proxy extension overlaid on top of a simplified JavaScript call graph.
6.1 Sample anycast configuration.
6.2 Histogram of correspondence between TLD1 vs TLD2 clusters contacted by PlanetLab nodes.
6.3 Response time CDF.
6.4 Percentage of unanswered queries by various servers.
6.5 CDF of outage duration.
6.6 CDF of inter-outage duration.
6.7 Number of outages observed by various servers.
6.8 Number of flips observed as a percentage of the total number of queries sent to each nameserver.
6.9 Period of time that PlanetLab nodes query the same server for the monitored servers.
6.10 CDF of the cluster stability of F-root and K-root.
6.11 Correlation of outages and flips for the F-root server. A similar correlation was observed for the K-root server.
6.12 Additional round-trip time for client queries to the anycast-selected F-root and TLD2 servers over the closest servers.
6.13 Additional distance over the optimal traveled by anycast queries to contact their F-root, K-root, TLD1 and TLD2 server.
6.14 Variation of server load with varying server advertisement radius for a random distribution of 200 clients. Redundancy is denoted by R.
6.15 Variation of average AS path length with change in the radii of the server for a random distribution of 200 clients.
Dedication
To all those authors who begin with
“The Internet has witnessed an explosive growth”.
xvii
Chapter 1
Introduction
The Internet has witnessed an explosive growth over the past decade. Such growth has
meant that the Internet is now ubiquitous and used by a large set of entities with diverse
and possibly conflicting interests. As a result of this increased global scope, security is
essential for any Internet-enabled system. Over the years, the form and character of security
threats to network users and the network itself have evolved significantly. The motive of
the attacker has also seen a decisive shift from fun towards profit. Furthermore, attackers
have become increasingly sophisticated, exploiting vulnerabilities in multiple layers of
existing technologies as well as those in emerging, less mature technologies.
1.1 Motivation and a Brief Chronology
Early worms (e.g., Code Red I, 2001 [1]) used naive random scanning approaches to
infect new victims. However, they could be easily detected owing to the large amounts of
noisy scans generated as a side-effect. Evidence from recent malware, such as the Agobot
virus [2], shows attackers increasingly employing measures to evade detection and monitoring
infrastructures. Furthermore, the phenomenon of botnets is now commonplace,
whereby infected machines (bots) are unwittingly drafted into a network, called the
botnet [3]. A botnet can then be engineered to carry out a host of secondary malicious
activities, ranging from spamming and phishing to denial-of-service (DoS) attacks. The
controllers of this network (a.k.a. botmasters) communicate with the bots using a command
and control (C&C) channel, typically IRC [4]. While IRC channels remain the predominant
C&C method even today, new botnets are emerging which use a decentralized
communication mechanism, e.g. P2P, for reasons of increased robustness [5]. Finally,
the delivery mechanisms for malware have been increasing in sophistication and number.
Targeted attacks using fingerprinted web browsers and operating systems are increasingly
commonplace. Botnet sizes typically run from the tens of thousands to millions of bots.
A DoS attack from such botnets can cause great collateral damage to organizations
and to the infrastructure of the Internet itself. There have been numerous instances
of such events in the past. For example, in 2007, an attack on the DNS root servers nearly
took down three root zones [6]. Consequently, it is essential to quantify the ability of
the Internet infrastructure to withstand such large-scale DoS attacks.
1.2 Thesis Contribution and Outline
As the brief chronology in the previous section suggests, the malware ecosystem is
constantly evolving. In accordance, this thesis follows a multifaceted approach towards
addressing some of the security issues facing the Internet. First, measurement studies are
performed to experimentally quantify the ground truth related to the security threats in the
wild. Specifically, we study the Storm Worm [7], which represents the leading edge in
botnet technology. Since security research is essentially an arms race, we develop
countermeasures for existing threats and also look ahead to threats likely to occur in the future. We
develop novel abstractions to counter web browser vulnerabilities, e.g., cross site scripting
(XSS) [8]. These abstractions replace the antiquated browser security policies and can be
used to securely sandbox web content while allowing controlled sharing. Advancement
in communications technology presents new avenues for malware propagation. One such
emerging phenomenon is mobility. As mobile devices become pervasive and more powerful,
malware can exploit mobility as a vector for propagation. We analyze this phenomenon
and devise a novel technique to detect and contain the spread of a mobile worm.

While the above research deals mainly with end-users, at the other end of the spectrum
is the core of the Internet. Given that botnet sizes can run into the millions, and thus their
potentially lethal DDoS capabilities, we study whether the Internet is engineered for robustness.
Specifically, we study whether the DNS fabric of the Internet is robust in terms of availability
and resiliency. Finally, we conduct a short experiment on router buffer sizes and their
effect on DoS attacks. Of late, there has been a renewed interest towards reducing the size
of buffers in Internet routers. We study whether this reduction would enhance the lethality of
DoS attacks by making it easier for them to camouflage themselves as normal traffic.
Specifically, we look at shrew attacks, which are low-volume DoS attacks. We now briefly
describe the upcoming chapters and their corresponding contributions.
1.2.1 Tracking P2P Botnets
P2P botnets, which use DHTs for their C&C channels, are a relatively new entrant in
the botnet ecosystem. We track and study one such botnet, the Storm worm. The Storm
botnet, also known as Trojan.Peacomm [9, 10], made its first appearance in January 2007.
Storm is notable for its use of Kademlia [11] to coordinate the infected hosts and its use
of fast-flux DNS services to distribute binary updates [12]. Moreover, Storm aggressively
defends itself by resisting reverse engineering attempts and executing DDoS attacks against
external hosts that attempt to probe its operations [13]. Due to these aggressive mechanisms
and its distributed nature, little is known about the network of Storm-infected hosts. We
developed a crawler based on the Overnet protocol, and used it to crawl the Storm DHT.
Using this crawler, we estimate that approximately 300,000 end-hosts were members of
the Storm botnet during November 2007. Perhaps more important than the size estimates
are the anomalies we discovered during this process. First, unlike traditional DHTs, the
distribution of keys stored in the DHT is not uniform over the hash space as it is in other P2P
systems. Furthermore, we found a small percentage of nodes that publish an abnormally
large number of IDs. We provide evidence that these findings are the side-effects of actions
by entities external to the Storm network, meant to track and interfere with its operations.
Unfortunately, the fact that we were able to detect these activities also suggests that
sophisticated botmasters can also discover the "good" nodes in the botnet (i.e., those that monitor
its operation and do not participate in malicious activities). In essence, this study shows the
weakness of current botnet monitoring technology and serves as a call to arms for the
development of more stealthy methods to monitor P2P botnets. Chapter 2 of this thesis deals
with tracking P2P botnets.
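The anomaly described above, a few endpoints publishing abnormally many DHT IDs, lends itself to a very simple detection heuristic. The following is only an illustrative sketch, not the actual crawler code: the record format, the threshold of 10 IDs per endpoint, and the addresses are all assumptions made here for the example.

```python
from collections import defaultdict

def flag_poisoners(observations, threshold=10):
    """Group crawled (node_id, ip, port) records by endpoint and flag
    endpoints that publish at least `threshold` distinct DHT IDs."""
    ids_per_endpoint = defaultdict(set)
    for node_id, ip, port in observations:
        ids_per_endpoint[(ip, port)].add(node_id)
    return {ep: len(ids) for ep, ids in ids_per_endpoint.items()
            if len(ids) >= threshold}

# Hypothetical crawl records: one endpoint publishes 50 distinct IDs,
# while ordinary peers publish one each.
records = [("id%d" % i, "198.51.100.7", 4000) for i in range(50)]
records += [("idA", "203.0.113.5", 4001), ("idB", "203.0.113.9", 4002)]
print(flag_poisoners(records))  # -> {('198.51.100.7', 4000): 50}
```

A real deployment would of course tune the threshold against the observed CDF of IDs per address (cf. Figure 2.5) rather than fix it a priori.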
1.2.2 Using Mobility to Propagate Malware
New advancements in technology provide new avenues for malware propagation. One
such development, which has become a pervasive feature among computing devices, is the
ability to remain connected while being mobile. According to industry reports, 812 million
mobile terminals were sold in 2005 and sales of new devices are expected to top 1
billion in 2008 [14]. Looking forward, we expect widespread adoption of technologies
such as WiMAX, mesh networks and even vehicular wireless networks. This increase in
connectivity, however, comes at a high price: failure to properly secure these media will
provide new avenues for malicious behaviour. As a matter of fact, the exploitation of these
media is not just our speculation: variants of the Zotob/Mytob worm are suspected to have
used "physical" transfer as a propagation strategy [15]. More recently, a series of malware
that attempt to exploit Bluetooth connections as a medium for spreading were reported in
the media.
To better understand this impending threat, we investigate how mobility can be
exploited across a large number of end-hosts. Using an analytical model, we estimate the
speed with which such mobile contagion can propagate over a population of nomadic users.
We validate our results using realistic mobility traces from a campus-wide wireless network
deployment with hundreds of access points and thousands of mobile users. We show that,
while the speed of propagation of mobile malware is slower when compared to traditional
Internet worms, it is still fast enough to render manual countermeasures implausible.
Furthermore, given this sort of mobile contagion, we devise a novel technique using random
moonwalks to provide early detection of such mobile malware. We show that the proposed
mechanism can reliably detect the spread of a mobile worm in the early stages of its
evolution. We also devise techniques to conduct post-mortem forensic analysis of an infection,
whereby the originator of an infection (patient zero) can be identified with reasonable
accuracy. Chapter 3 deals with modelling the evolution and understanding the properties
of mobile worms. The detection and forensic methods for mobile contagion are presented in
Chapter 4.
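To give a flavor of the kind of analytical model used for propagation speed, a minimal susceptible-infected (SI) epidemic sketch is shown below. This is not the model of Chapter 3, which is driven by real mobility traces; the population size, contact rate, and time step here are assumptions chosen purely for illustration.

```python
def si_fraction(beta, n, i0, steps, dt=1.0):
    """Discrete-time SI epidemic: each infected node makes `beta`
    effective contacts per time step; a contact with a susceptible
    node infects it.  Returns the infected fraction after each step."""
    i = float(i0)
    curve = []
    for _ in range(steps):
        i += dt * beta * i * (n - i) / n  # logistic growth term
        i = min(i, n)                     # cannot exceed the population
        curve.append(i / n)
    return curve

# Assumed parameters: 1000 nomadic users, one initially infected device,
# 0.5 effective contacts per infected device per time step.
curve = si_fraction(beta=0.5, n=1000, i0=1, steps=40)
print("infected fraction after 40 steps: %.3f" % curve[-1])
```

Under these assumptions the infection follows the familiar S-shaped logistic curve: slow initial growth, a sharp rise once a critical mass is infected, then saturation, which is exactly why early detection (before the knee of the curve) matters.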
1.2.3 Isolation and Sharing in the Web Browser
Web browsers have relied on the Same Origin Policy (SOP) [16] to dictate trust
relationships between content loaded from websites. Content includes both data and code, in
the form of scripts. The SOP prevents a document or script loaded from one site of origin
from manipulating the properties of, or communicating with, a document loaded from
another site of origin. However, this policy provides absolutely no granularity and, as we
shall discuss, its all-or-nothing approach is the source of many browser vulnerabilities. For
example, the cross site scripting (XSS) attack is a confused-deputy problem which exploits
the browser's trust in executing all the scripts presented on a webpage. We present two
abstractions to address this issue. First, we develop an isolation abstraction which isolates all
unauthorized content. Second, we develop a sharing abstraction, which enables controlled
communication between entities. In Chapter 5 we show that, using the abstractions presented
above, attacks such as XSS can be countered.
1.2.4 Anycast in DNS
While the previous topics mainly address the security of end-hosts, the other key component in dealing with the robustness of the Internet is the study of its core components. The DNS is one such integral component and is used to resolve names to IP addresses. Anycast is a routing mechanism used in DNS zones [17]. We evaluate the reliability and resiliency of anycast. In this study, we use results from four top-level DNS servers to evaluate whether anycast indeed improves DNS service and compare different anycast configurations. Increased availability is one of the supposed advantages of anycast, and we found that the number of observed outages was indeed smaller for anycast, suggesting that it provides a mostly stable service. On the other hand, outages can last up to multiple minutes, mainly due to slow BGP convergence [18]. We also found that anycast indeed reduces query latency. Furthermore, depending on the anycast configuration used, 37% to 80% of the queries are directed to the closest anycast instance. Our measurements revealed an inherent trade-off between increasing the percentage of queries answered by the closest server and the stability of the DNS zone, measured by the number of query failures and server switches. Chapter 6 presents the results of our study on anycast.
1.2.5 Router Buffer and Robustness
Of late, there has been renewed interest in reducing router buffer sizes. This research is driven by increasing bandwidth speeds pushing up the cost of expensive buffer memory and the power consumption of memory chips. Router queues buffer packets during congestion epochs. A recent result by Appenzeller et al. [19] showed that the size of FIFO queues can be reduced considerably without sacrificing utilization. While Appenzeller showed that link utilization is not affected, the impact of this reduction on other aspects of queue management, such as fairness, is unclear. We investigate whether the reduction of buffer size renders DoS attacks more effective. While brute-force DoS attacks can be easily detected and contained, low-rate DoS attacks, called shrews, can throttle TCP connections by causing periodic packet drops [20]. Unfortunately, smaller buffer sizes make shrew attacks more effective and harder to detect, since shrews need to overflow a smaller buffer to cause drops. We show that a relatively small increase in the buffer size over the value proposed by Appenzeller is sufficient to render the shrew attack ineffective. Intuitively, bigger buffers require the shrews to transmit at much higher rates to fill the router queue. However, by doing so, shrews are no longer low-rate attacks and can be detected by Active Queue Management (AQM) techniques such as RED-PD [21]. The results from this experiment are presented in Chapter 7.
Chapter 2
On Tracking P2P Botnets
Botnets, networks of compromised machines under the control of botmasters, represent a significant threat to the Internet today. While traditionally using centralized command and control (C&C) architectures (e.g., IRC servers) to control their bots, botmasters have recently begun to employ P2P protocols for these tasks [4]. P2P architectures avoid the single points of failure inherent to IRC-based botnets, thus rendering them less vulnerable to takedown attacks that target the C&C servers.
The Storm botnet, also known as Trojan.Peacomm [9, 10], is one such P2P botnet which made its first appearance in January 2007. Storm is notable in its use of a DHT protocol, Kademlia [11], to coordinate the infected hosts and its use of fast-flux DNS services to distribute binary updates [12]. Storm aggressively defends itself by resisting reverse engineering attempts and executing DDoS attacks against external hosts that attempt to probe its operations [13]. Furthermore, an encrypted version of Storm emerged in the latter half of 2007, which used the XOR operation to encrypt its P2P network traffic. Due to such aggressive defense mechanisms and its distributed nature, little is known about the network of Storm-infected end-hosts.
Using these two versions of the Storm binary as examples of P2P botnets, we present estimates of their size and properties, using a custom-made crawler. Furthermore, we show that current techniques for tracking P2P botnets lend themselves to counter-intelligence employed by the botmasters. Such counter-intelligence can reveal the identities of the botnet trackers and thus subject them to attacks by these powerful botnets.
Using the crawler, we estimate that the population sizes of the older and encrypted versions of the Storm botnet were around 300,000 and 30,000, respectively, in November 2007.
Perhaps more important than the size estimates are the anomalies unearthed during our crawling process. Even though both versions use the same underlying Kademlia-based protocol, we discovered that the P2P keys stored in the older Storm botnet are not uniformly distributed over the hash space, as is typical of DHTs. This non-uniformity is primarily due to keys which point to unreachable and non-routable IP addresses (e.g., private, multicast, loopback, and unallocated IP addresses). While this irregularity is largely absent in the latter botnet, we found other atypical artifacts common to both DHTs. For example, we found a small percentage (<1%) of routable IPs that publish thousands of IDs, while the vast majority of Storm nodes publish only a small set of IDs.
We provide evidence that these findings are the side-effects of actions by entities external to the Storm network, meant to track and interfere with its operations. In that respect, they represent practical applications of the index poisoning attacks previously theorized by a number of researchers [22, 23]. Unfortunately, the fact that we were able to detect these activities also suggests that sophisticated botmasters can discover the "good" nodes in the botnet (i.e., those that monitor its operation and do not participate in malicious activities). This capability can subsequently be used by the botmaster to launch a DDoS attack against these nodes. Furthermore, even our crawler can be easily detected, due to the abnormally high number of queries it generates. While distributing the tracking task among multiple machines can alleviate the detection problem, we argue that doing so requires a large distributed network of monitors due to the Storm network's properties. Taken as a whole, our results should serve as a call to arms for the development of stealthier methods to monitor P2P botnets.
The rest of the chapter is divided into six sections. The section that follows outlines
Storm’s P2P architecture, while Section 2.2 describes our measurement methodology. We
present our findings from these measurements in Section 2.3 and cover related work in
Section 2.4. Finally, we close in Section 2.5 with a summary and future research directions.
2.1 Background
We summarize the functionality of the Storm network. Given the focus of our work, we
outline Storm’s Command and Control (C&C) protocol rather than presenting its different
infection vectors and malicious activities. Readers interested in these aspects are directed
to [7, 9, 10].
2.1.1 Command and Control
Storm uses the Overnet protocol, which in turn is based on the Kademlia Distributed Hash Table (DHT) [11]. Each peer, as well as each object stored in an Overnet network, is associated with a 128-bit identifier (ID). Peer identifiers are randomly generated using the MD4 cryptographic hash function [24]. Routing in Overnet is based on prefix matching, whereby the distance between two IDs is equal to the XOR of the two identifiers. For example, the distance between a = 0001 and b = 1110 is d(a, b) = a ⊕ b = 0001 ⊕ 1110 = 1111.
Overnet nodes organize their routing tables as lists of k-buckets. Specifically, for each 0 ≤ i < 128, the corresponding k-bucket holds up to k (= 20) <IP address, UDP port, Node ID> tuples corresponding to nodes whose distance from the current node falls within the [2^i, 2^(i+1)) range. This routing table resembles an unbalanced routing tree in which a node maintains only a few contacts to peers that are far away (i.e., corresponding to large values of i) and increasingly more contacts to nodes within shorter distance.
When an Overnet node receives any message (request or reply) from another node, it updates the appropriate k-bucket with the sender's node ID. The k-buckets effectively implement a least-recently-seen eviction policy, except that live nodes are never removed from the list. An Overnet node that receives a request for an ID returns the tuples of the k nodes it knows about that are closest to the requested ID. These tuples can come from a single k-bucket, or from multiple k-buckets if the closest k-bucket is not full. Routing then proceeds iteratively, by querying each successive peer on the route to the destination.
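The XOR distance metric and bucket selection described above can be sketched as follows (an illustrative sketch, not code from this work; the function names are ours):

```python
# Sketch of Kademlia/Overnet XOR distance and k-bucket selection.

def xor_distance(a: int, b: int) -> int:
    """Distance between two node IDs is their bitwise XOR."""
    return a ^ b

def bucket_index(own_id: int, other_id: int) -> int:
    """Return i such that the distance falls in [2^i, 2^(i+1))."""
    d = xor_distance(own_id, other_id)
    if d == 0:
        raise ValueError("a node does not place itself in a bucket")
    # The bucket index is the position of the highest set bit of d.
    return d.bit_length() - 1

# The 4-bit example from the text: d(0001, 1110) = 1111.
assert xor_distance(0b0001, 0b1110) == 0b1111
# That distance lies in [2^3, 2^4), i.e., k-bucket i = 3.
assert bucket_index(0b0001, 0b1110) == 3
```

Because the bucket index is the position of the highest differing bit, far-away peers (large i) share one bucket per distance octave, giving the unbalanced tree described above.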
When a new peer joins the network, it inserts itself into the contact lists of other nodes by performing a lookup for its own ID. Moreover, peers periodically query their own ID in an effort to keep their own as well as their peers' routing information fresh. When a peer wants to store an object corresponding to an <ID, value> pair, it locates the k nodes closest to the ID and asks them to store it. Finally, to limit the stale information in the system, publishers and peers periodically republish the <ID, value> pairs that they hold.
While Overnet suggests that nodes have persistent IDs, we observed that Storm-infected hosts choose different IDs every time they reboot and also when their DHT searches fail. Furthermore, the Storm binary has a hard-coded list of over 400 initial peers which it uses to attach itself to the network. Considering the large percentage of end-hosts residing in private address space, Overnet includes a special NAT discovery mechanism. Bots use this mechanism to detect whether they reside behind a NAT device and, if so, to advertise their globally visible IP address (rather than their private address) when they join the network.
In addition to performing periodic queries for their own IDs, Storm bots periodically search for a set of keys stored in the Overnet network. According to Stewart [7], bots generate these search keys through a built-in algorithm that uses the current date and a random number uniformly selected from [0...31]. The values associated with those keys contain an encrypted URL that the bots decrypt and retrieve using HTTP. We noticed that nodes change their own IDs when key searches for these URLs fail. A bot will then rejoin the DHT with its new ID and restart its search for the URL.
We note that because Storm uses the same Overnet protocol that popular file-sharing networks use (e.g., eDonkey and eMule), non-infected hosts can be used to store keys for Storm. The botnet was probably designed this way to leverage these P2P networks as a bootstrap mechanism during its early stages [5].
2.1.2 Encrypted Protocol
An encrypted version of the Storm C&C protocol emerged in October 2007. While based on the same Overnet protocol described above, the field types and values contained in the messages exchanged using this version of the protocol are encrypted using the XOR operation. For example, we observed that all Overnet message types were XOR'ed with the 0xAA key.
Given this weak encryption mechanism, we were able to discover the keys that Storm uses by observing the traffic between a machine running the encrypted Storm binary and a custom Overnet client that we developed. Specifically, we employed a setup similar to a virtual playground [25], in which we record and redirect all the traffic generated by the bot binary to the custom Overnet client. The client replies to the bot's queries with pre-defined responses.
We provide two examples of the methods used to retrieve the botnet's keys. First, the bot uses a .ini file which contains the list of peers it contacts to bootstrap itself to the DHT. This list of peers is stored in the clear and is under our control. Then, by XOR-ing the (encrypted) IP address included in an Overnet message with the IP address listed in the file, one can extract the key used to encrypt IP addresses. Likewise, our Overnet client replies to the bot's search queries with known IP addresses. Then, by observing the IP addresses that the bot attempts to connect to after receiving the client's response, we were able to derive the keys used in search queries.
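The known-plaintext key recovery described above can be sketched as follows (an illustrative sketch; the function names and byte values are our own examples, not Storm's actual keys):

```python
# Sketch of known-plaintext key recovery for a XOR cipher: since
# ciphertext = plaintext XOR key, we have key = ciphertext XOR plaintext.

def recover_key(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """Recover a XOR keystream from a ciphertext/plaintext pair."""
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))

def apply_key(data: bytes, key: bytes) -> bytes:
    """XOR is its own inverse: this both encrypts and decrypts."""
    return bytes(d ^ k for d, k in zip(data, key))

# Suppose the cleartext .ini file lists peer IP 10.0.0.1, and the field
# observed on the wire is that IP XOR'ed with a (hypothetical) key.
plain_ip = bytes([10, 0, 0, 1])
key = bytes([0xAA, 0xAA, 0xAA, 0xAA])   # single-byte key, repeated
cipher_ip = apply_key(plain_ip, key)

# XOR-ing the encrypted field with the known cleartext yields the key,
# which then decrypts any other field encrypted the same way.
assert recover_key(cipher_ip, plain_ip) == key
assert apply_key(cipher_ip, key) == plain_ip
```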
Using similar techniques we were able to retrieve all the keys necessary to traverse
Storm’s P2P network using the crawler presented in the section that follows.
2.2 Measurement Methodology
We leverage the Overnet protocol described above to discover properties of the Storm infection. Specifically, we developed a crawler that queries the network for randomly generated keys and records the node IDs, IP addresses, and port numbers that the peers return. Whenever the crawler receives a new usable (i.e., routable) IP address, it sends a query for another random ID to the corresponding node. The query process continues until the crawler's rate of discovery of new IPs becomes effectively zero. We seed the crawler's search with a list of peer IPs we collected by executing an instance of the Storm binary within a Qemu honeypot [26] running Windows XP. Through this process, we gathered a set of ~4,000 initial IPs for the older Storm botnet and ~2,200 IPs for the encrypted version.
We perform two types of crawls: a full crawl and a zone crawl. The IDs that the crawler queries during a full crawl are selected from the entire 128-bit space. On the other hand, the IDs queried during a zone crawl share the same prefix. For example, the 0x0A 8-bit prefix contains all 128-bit IDs whose most significant eight bits have value 0x0A.
While full crawls provide a more complete view of the network, they are resource-intensive in terms of the network traffic they generate and the amount of storage required for the results. Moreover, they require hours to complete, during which time the network's membership might change. On the other hand, 8-bit zone crawls typically finish in 10 minutes and generate significantly fewer queries (the reduction is proportional to the size of the zone queried).
2.3 Results
We present results derived from a measurement study conducted over a period of about three months (11/09/2007 - 1/29/2008), using the methodology described above.
2.3.1 Node ID distribution
As previously explained, full crawls are slow, resource-intensive, and potentially inaccurate. Therefore, we prefer to use zone crawls to estimate the number of peers within these zones and extrapolate the results to the full population. However, in order to do so, we need to ensure that the measured zones are representative of the larger network. Given that Storm node IDs are generated by a cryptographically secure hash function (i.e., MD4), they should be uniformly distributed over the 128-bit address space. Furthermore, results from other Kademlia-based networks have experimentally verified the existence of this uniform distribution [27]. Nonetheless, we conducted full crawls of both versions of the Storm network (i.e., the one with no encryption and the latter one which uses XOR encryption) to verify this conjecture.

Figure 2.1: The distribution of Storm bot IDs over the 128-bit hash space (percentage of peers per zone, 0x00 through 0xe0) for (a) the original Storm botnet and (b) the encrypted Storm botnet. The results in this figure are based on data collected on 11/19/07.
Figure 2.1 presents the results of two such full crawls performed on 11/19/07 (we observed similar patterns during other dates). Specifically, the left chart presents the distribution of IDs in the original Storm network, while the chart to the right presents the ID distribution for the network that uses encryption. While the distribution of IDs in the encrypted botnet is approximately uniform, IDs in the original network display marked non-uniformities, with recurring ramp-like structures that repeat at the beginning of each 3-bit zone. We defer the discussion about the underlying causes of this surprising non-uniformity until Section 2.3.3. For now, we use this result to select the length of the zones to crawl.
2.3.2 Population Estimates
For the older botnet, because the ID distribution has a regular pattern which recurs in every 3-bit zone, we infer the size of the overall population by crawling 3-bit zones. Specifically, we count the number of IPs whose corresponding IDs are within the zone and then multiply this count by the total number of zones. We chose two 3-bit zones: 0x00/3 and 0x80/3. We selected these two zones because, in Figure 2.1, the 0x00/3 zone accounts for the highest percentage of Storm peers, while 0x80/3 represents the average case. We followed a similar procedure for the encrypted version. Figure 2.2 presents the population estimates we derived by crawling both botnets. We see that the average population is about 400,000 for the older botnet and about 50,000 for the encrypted version. These values are in accordance with estimates from Microsoft (~500,000) [28] for the older botnet.
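The zone-crawl extrapolation above amounts to a simple back-of-the-envelope calculation, sketched below (the zone count is illustrative, not a measured figure):

```python
# Sketch of the zone-crawl extrapolation: count distinct usable IPs in an
# n-bit zone (which covers 1/2^n of the hash space), then multiply by the
# number of such zones, 2^n.

def extrapolate_population(ips_in_zone: int, zone_prefix_bits: int) -> int:
    """Scale a per-zone IP count up to a full-network estimate."""
    return ips_in_zone * (2 ** zone_prefix_bits)

# e.g., ~50,000 usable IPs found in one 3-bit zone (1/8 of the space)
# would imply a network-wide population of ~400,000.
assert extrapolate_population(50_000, 3) == 400_000
```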
We further categorized the end-hosts discovered during the zone crawls based on their country of origin. To do so, we utilized the MaxMind database [29] to map IP addresses to countries. Figure 2.3 presents the 15 countries with the highest number of infected nodes. For the older botnet, the continent with the highest percentage of infected hosts was North America, with the United States contributing approximately 30% of the peers. This population distribution deviates from the one observed for the popular KAD P2P file-sharing network, which is also based on Kademlia [27]. For the encrypted version of the botnet, the proportion was more uniform, with approximately 10% of the peers in North America. Furthermore, Figure 2.3 also indicates that, in both cases, <1% of the IP addresses we encountered could not be resolved by the database (listed as NA). This set consisted of private, multicast, loopback, and unallocated/reserved IP addresses. The existence of such addresses was unexpected because Storm uses an IP query-response mechanism to traverse NAT boxes and cannot communicate with unreachable/private IP addresses. Nonetheless, we do not include these addresses in the population estimates presented above.

Figure 2.2: Population estimates of the botnets from 11/09/2007 - 1/29/2008 for (a) the older Storm botnet and (b) the encrypted Storm botnet.
2.3.3 Relationship between peer addresses and identifiers
The unexpected non-uniformity of the ID distribution led us to investigate the IDs associated with each IP address. We found that approximately 1% of the IP addresses we encountered in the original Storm botnet consisted of private, multicast, loopback, and unallocated/reserved IP addresses. The existence of such addresses is unexpected because Storm, as described in Section 2.1.1, uses an IP query-response mechanism to detect the existence of NAT gateways and thus does not normally announce unreachable/private IP addresses.

Figure 2.3: Top 15 countries in which peers are located, by percentage. The last bar, NA (Not Available), comprises non-publicly-routable IP addresses. (a) Older Storm botnet. (b) Encrypted Storm botnet.
To our surprise, we found that this group of IP addresses is associated with 45% of the unique peer IDs we recorded, even though it accounts for <1% of the total population of nodes. In other words, 45% of the <IP address, UDP port, Node ID> tuples stored in the Storm network point to unusable IP addresses. Furthermore, as Figure 2.4(a) illustrates, the IDs associated with these IP addresses are the main contributors to the non-uniformity shown in Figure 2.1(a). As a matter of fact, the overall identifier distribution becomes considerably more uniform after removing these IDs (see Fig. 2.4(b)).
The most plausible explanation for the existence of these IDs is that some peers are deliberately injecting 'bogus' identifiers that point to unusable addresses in an attempt to poison the DHT. First studied by Liang et al. [23], index poisoning refers to the process of inserting a massive number of invalid records into a DHT's index, in an attempt to slow down lookups. In fact, other researchers have previously speculated that index poisoning could be an effective attack against the Storm botnet [22].

Figure 2.4: (a) Distribution of IDs attributed to unusable IP addresses in the original Storm network. (b) The distribution of IDs attributed to valid IP addresses in the same network. The x-axis represents the 128-bit hash space.
While the ramp-like structure is largely absent from the encrypted Storm botnet, we discovered other anomalous patterns common to both botnets which indicate possible instances of trackers/monitors. Specifically, we investigated whether IP addresses were associated with multiple IDs. As Figure 2.5 illustrates, for the original Storm botnet approximately 85% of the addresses are associated with a single ID. In the case of the encrypted version, 85% of the addresses are associated with <10 IDs. However, in both cases a very small percentage of IP addresses (<1%) are associated with a large number of IDs, some of them in the thousands.
We note that while multiple infected hosts behind the same NAT device would advertise the same IP address, they would use different ports and therefore do not contribute to the phenomenon shown in Figure 2.5. Moreover, while we noticed particular periods during which Storm specimens published different IDs every 10 minutes (typically when their searches failed), that behavior cannot account for the very large number of IDs shown in Figure 2.5. The reason is that stale IDs are removed from the Overnet DHT after 24 hours, while consecutive crawls executed 24 hours apart consistently registered the same IP addresses publishing large numbers of IDs. We are thus left with the only explanation that some nodes deliberately inject multiple IDs in an effort to interfere with or to monitor the Storm network.

Figure 2.5: Cumulative distribution function of the number of IDs associated with a single IP address and port, for the unencrypted and encrypted botnets. Unreachable/non-routable IP addresses were not included in this distribution.
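The anomaly described above suggests a simple detection heuristic, sketched below (a minimal sketch in our own naming; the threshold and sample data are illustrative, not measured values):

```python
# Sketch of a tracker-detection heuristic: group observed
# <IP, port, Node ID> tuples by endpoint and flag endpoints that publish
# an anomalously large number of distinct IDs.

from collections import defaultdict

def flag_suspected_trackers(tuples, threshold=100):
    """tuples: iterable of (ip, port, node_id). Returns flagged endpoints."""
    ids_per_endpoint = defaultdict(set)
    for ip, port, node_id in tuples:
        ids_per_endpoint[(ip, port)].add(node_id)
    return {endpoint for endpoint, ids in ids_per_endpoint.items()
            if len(ids) > threshold}

# A typical bot publishes a handful of IDs; a tracker publishes thousands.
observed = [("198.51.100.7", 4000, i) for i in range(5000)]   # suspected tracker
observed += [("203.0.113.9", 6881, i) for i in range(3)]      # ordinary bot
assert flag_suspected_trackers(observed) == {("198.51.100.7", 4000)}
```

Grouping by (IP, port) rather than IP alone avoids flagging multiple infected hosts behind the same NAT device, consistent with the observation above.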
2.3.4 Data from Spam Block Lists
To buttress our claim that nodes publishing a large number of IDs are not legitimate Storm nodes, we use an out-of-band mechanism to verify whether a given IP address truly hosts an instance of Storm. We leverage the fact that Storm sends gigantic amounts of unsolicited email [30]. Therefore, one would expect that active Storm nodes will be listed in spammer blacklists. This intuition was verified by Ramachandran et al., who found that presence in a blacklist is reasonably accurate in predicting whether a host is sending unsolicited email [31]. Given this result, we use the simple heuristic of classifying the IP addresses found in a zone crawl that are also listed in spammer blacklists1 as infected hosts. We then plot the percentage of blacklist-identified hosts versus the number of IDs associated with each host.
As Figure 2.6 suggests, the percentage of blacklisted nodes decreases with the number of IDs the nodes publish. This trend suggests that nodes which publish numerous IDs do not seem to participate in the malicious activities of the botnet and therefore are not likely to be infected hosts. Further confirmation of the conjecture that the IPs advertising a large number of IDs actually host trackers is the observation that these IPs belong to organizations which have traditionally been involved in security research. In fact, some of these organizations have even published media reports regarding the Storm phenomenon. Finally, a few of the nodes associated with large numbers of IDs explicitly published their IDs even to our crawler. This is despite the fact that our crawler does not insert itself in the P2P network, but merely queries it. Legitimate Storm nodes we tested against did not contact our crawler with such messages.

1We use a number of popular DNS blacklists (CBL [32], TQMCUBE [33] and UCEPROTECT [34]). We allow two to three days for the collected IP addresses to appear in these blacklists.
Figure 2.6: Correspondence between the number of IDs published by an IP address and its occurrence frequency in spam blacklists.
2.3.5 Discussion
In the previous section, we have shown that it is simple to identify Storm trackers/monitors by exploiting the relation between node IDs and the IP addresses they publish. By polluting the DHT index with a large number of IDs, the trackers can redirect P2P queries to themselves. Doing so allows them to answer queries with fake results, thus interfering with the botnet's operation. Furthermore, the same mechanism can be used to measure the size of the botnet. For example, the original Storm bots co-existed with non-infected hosts using the same Overnet network. Since every Storm bot queries for a known set of search keys every day, monitoring this key-space can provide an estimate of the actual number of Storm bots. Alternatively, one could track P2P botnets by actively crawling the DHT, as we have done in this paper. However, the nodes that perform these crawls can also be detected due to the anomalously large numbers of queries they generate.
The unfortunate implication of these results is that botmasters will eventually notice the existence of nodes interfering with their networks, probably using techniques similar to the ones presented in this paper. They can then launch attacks against the nodes which appear to monitor/poison the botnet. Such a counter-attack is not far-fetched considering that the Storm worm is believed to have launched similar aggressive retaliatory attacks in the past [13].
One potential solution to this predicament is to use multiple trackers, thereby reducing the number of IDs advertised by each tracker. As Figure 2.5 suggests, increasing the number of trackers by a factor of ~50 would reduce the suspicion on any particular tracker. In the case of nodes that crawl the DHT, we observed that, on average, only ~10% of the Storm nodes replied to our queries. This could be because the rest were behind NAT gateways or firewalls. For this reason, our crawler queries the reachable nodes multiple times during a single crawl. We found that, on average, a globally-reachable Storm node receives 23 times more queries while a crawl is in progress, compared to its normal query load. This wide difference in query load can be used to notify a botmaster that a scan is in progress and to identify the set of nodes participating in this scan.
The scale factors presented suggest that using multiple co-operative trackers/crawlers could alleviate the problem of monitor identification. An in-depth study of these techniques is deferred to future work.
2.4 Related Work
The majority of botnets today use Internet Relay Chat (IRC) to disseminate commands to individual bots. IRC's centralized architecture allows snooping of the C&C channels, thereby potentially revealing botnet membership and the commands passed onto the bot armies [4]. Due to these shortcomings, botmasters have innovated by migrating to C&C protocols that are harder to detect and infiltrate.
The Storm worm is a prime example of this evolution in the botnet ecosystem, due to its DHT-based C&C protocol and aggressive defensive capabilities. Consequently, it has been the subject of multiple research reports, most of them focusing on the analysis of the Peacomm binary. Grizzard et al. present a case study of Peacomm, by running a single specimen of the infection in a contained honeypot environment [22]. Detailed descriptions of the multiple techniques that Storm uses to disguise itself are provided in [7, 9, 10]. In contrast, our work presents an analysis of Storm's DHT protocol and is the first to discover traces of what seems to be widespread attempts to poison Storm's C&C network.
A number of recent measurement studies have focused on the widely deployed KAD P2P network, another variant of the Kademlia DHT that uses a slightly different routing table. Stutzbach et al. use a distributed crawler to study ID lookup performance in KAD [35]. Steiner et al. crawl the KAD network to estimate the lifetime of peer sessions in this network [27]. The goal of our work is not to measure the lifetime of the infected hosts, but to measure the actual distribution of node IDs and to provide insights into the Storm network. Furthermore, we have discovered important differences between the Storm and KAD networks, such as the non-uniform distribution of peer IDs.
Monitoring the evolution of the Storm botnet has been the subject of multiple recent research reports [36, 37]. While the crawler described in this paper can be used for the same purposes, the focus of our work is to investigate whether the monitors themselves can be externally identified.
2.5 Summary
We present results from a measurement study of the Storm botnet, which uses a decentralized P2P infrastructure to coordinate individual bots. Our study revealed unexpected artifacts in the distribution of node IDs which suggest the existence of external entities aiming to track/monitor the Storm network.
Specifically, we witnessed widespread attempts to poison Storm's Overnet network by injecting invalid IDs that point to unreachable IP addresses. Moreover, we found that a small number of routable IP addresses inject a large number of IDs, most likely in an attempt to monitor or interfere with the Storm network. While polluting the DHT index is an effective strategy to deter file-sharing networks, as users have to manually sift through bad search results, its effectiveness in stopping a botnet's operation is questionable. A study of the effectiveness of such poisoning techniques in curtailing botnets is an avenue for future work.
More importantly, trackers that inject IDs or crawl a botnet's network are easily identifiable and thus vulnerable to counter-attacks by the botnet's operators. Therefore, there is a critical need to develop effective P2P tracking technologies which can evade detection by miscreants.
Acknowledgements
We would like to thank Razvan Musaloiu-E. for his help in deploying the Storm botnet
crawler.
Chapter 3
On Using Mobility to Propagate
Malware
Mobility pervades networked devices today. For example, millions of users access
the Internet through laptops and PDAs equipped with WiFi cards connected to thousands
of Access Points (APs) located on campuses, in coffee shops, at airports, etc. This increase
in connectivity, however, comes at a high price: failure to secure these communication
channels provides a new propagation vector for spreading self-replicating malicious code.
As a matter of fact, the exploitation of these channels is not just our speculation: variants
of the Zotob/Mytob worm are suspected to have used physical movement of computers
across network domain boundaries as a propagation vector [15]. More recently, a series
of malware that attempt to exploit Bluetooth connections as an infection mechanism were
reported in the media [38]. The accepted practice for protecting against such worms today is to
place mobile nodes in a de-militarized zone (DMZ), separate from the rest of the network.
In such a scenario, all communication between the mobile nodes and the wired nodes passes
through a firewall. However, mobile nodes can still infect each other through contacts
within these de-militarized zones.
Unfortunately, modelling efforts have not kept pace with malware evolution, as
most previous work describes how infections spread over wired networks. To better understand
this impending threat, we develop a concise analytical model that predicts the speed
of infections over populations of nomadic users traversing a collection of network access
points. The accuracy of the model is validated through simulations driven by realistic mobility
models, drawn from university-wide traces at Dartmouth College [39]. We found
that, in networks with thousands of users and hundreds of APs, the infection can reach 65%
of the total population within only one day, a relatively short time considering that infections
follow the slow pace of node movements across network domains. Furthermore, if
mobile nodes are allowed to infect co-located nodes connected to the wired network, a scenario
modelling imperfect DMZs, we observed that even a small proportion of vulnerable
mobile nodes can propagate the infection to the majority of the network domains within a
single day.
Due to the high propagation speed of these worms, human defense mechanisms are
rendered implausible. Moreover, the threat from this class of infections stems from the
fact that mobile nodes trivially bypass existing perimeter defenses, such as firewalls. Since
cross-domain transfer of the infection is accomplished by the physical migration of infected
nodes, it is difficult to contain them when no controls exist to police the movement of nodes
across domains. Such gaps in network defenses can lead to global worm outbreaks. Finally,
the detection of these worms is challenging due to their stealthiness. This characteristic is
a consequence of the fact that the majority of current detection techniques rely on traffic
anomalies measured at network monitors (network telescopes [40]). Unfortunately, since
mobile infections scan within the domains of infected nodes, suspicious probes would be
absent from telescopes deployed at remote domains. This observation motivates the need
for developing novel malware containment technologies. One promising direction towards
this goal involves exploiting the spatial characteristics of the infection. Specifically, we observed
that by placing monitors in approximately 10% of the most visited domains, we can
detect the mobile worm before it reaches the majority of the population. While this seems
a straightforward solution to the early detection problem, we argue that monitor placement
is still a challenging problem with many intricacies.
The rest of the chapter is structured as follows: Section 3.1 introduces the model for
predicting the spread of infections among a population of mobile users. We compare the
model's predictions to simulation results driven by realistic mobility traces in Section 3.2,
where we also investigate a number of variants of this worm. In Section 3.3 we compare the
mobile worm to a traditional (i.e., globally scanning) worm, and in Section 3.4 we provide
intuition about the temporal evolution of the infection by connecting it to the structure of
the mobility graph. We discuss the issues involved with telescope placement in Section 3.5.
Section 3.6 presents previous models for malware and mobility patterns, and finally we
close in Section 3.7 with future research directions.
3.1 Worm Model
We model infections spreading over collections of mobile users who connect to the
Internet through a revolving set of network access points. This model consists of two
types of entities: (a) network domains through which users connect to the Internet, and (b)
mobile nodes, e.g., laptops and PDAs, that are susceptible to infections and move across
these domains. In this context, domains act as mixing regions in which mobile nodes can
reach each other. We assume that an infected mobile node can infect another susceptible
mobile node if they reside in the same domain, even for a short period of time. This is a
realistic assumption because an infected mobile node can eavesdrop on communications
from all the other wireless nodes in the same domain and attempt to infect them directly.
The evolution of an infection can be modelled as a discrete-time replication process
over the set $V$ of vulnerable nodes. We denote the probability that node $i$ is infected at time
step $t$ by $p_{i,t}$. Furthermore, let $\beta_{ij}$ be the probability that node $i$ contacts node $j$. Given
these conditions, node $i$ is not infected at time step $t$ iff it was not infected by time step
$t-1$ and no infected node in the domain it resides in contacted node $i$ during the last time
step. Because these events are independent, this probability can be expressed as:

$$1 - p_{i,t} = (1 - p_{i,t-1}) \prod_{j \neq i} (1 - \beta_{ji}\, p_{j,t-1})$$
$$1 - p_{i,t} \approx 1 - p_{i,t-1} - \sum_{j \neq i} \beta_{ji}\, p_{j,t-1}$$

Here, we use the approximation $(1-a)(1-b) \approx 1 - a - b$ when $a \ll 1$, $b \ll 1$. Thus
we have,

$$p_{i,t} \approx p_{i,t-1} + \sum_{j \neq i} \beta_{ji}\, p_{j,t-1} \qquad (3.1)$$
By representing $(p_{1,t}, p_{2,t}, \ldots)$ as a row vector $P_t$ and assigning $\beta_{ii} = 1$ (i.e., the probability
that a node $i$ contacts itself is trivially one), we can rewrite Equation (3.1) in matrix
form as:

$$P_t = P_{t-1} M \qquad (3.2)$$

where $M = [\beta_{ij}]$ is the system matrix, containing the pairwise contact probabilities. From
the definition of $P_t$, $p_{i,t}$ is the probability that node $i$ is infected at time $t$. Therefore, the
expected number of infected nodes after time $t$ is given by

$$E\left[\,|I|\,\right] = \sum_{i=1}^{|V|} p_{i,t} = \|P_t\|_1 \qquad (3.3)$$

where $I$ is the set of all infected nodes. This type of matrix multiplication view of an
infection is common in epidemic modelling (e.g., [41]).
We initiate the infection by infecting a single node, say $k$. The initial conditions are
then as follows:

$$p_{i,0} = \begin{cases} 1 & \text{if } i = k, \\ 0 & \text{otherwise.} \end{cases}$$

If multiple nodes are initially infected (also known as patient zeros), the corresponding
indices in $P_0$ are set to unity.
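As an illustration (our own sketch, not from the original text), the recursion in Equations (3.2) and (3.3) can be evaluated numerically for any given contact-probability matrix $M$ with $\beta_{ii} = 1$:

```python
import numpy as np

def expected_infections(M, patient_zero, steps):
    """Iterate P_t = P_{t-1} M (Eq. 3.2) and return E[|I|] = ||P_t||_1
    (Eq. 3.3) after each discrete time step."""
    n = M.shape[0]
    P = np.zeros(n)
    P[patient_zero] = 1.0           # initial condition: p_{k,0} = 1
    history = [P.sum()]
    for _ in range(steps):
        P = np.minimum(P @ M, 1.0)  # clip: the linear approximation in
                                    # Eq. (3.1) can overshoot probability 1
        history.append(P.sum())
    return history

# Toy 3-node system: beta_ii = 1 on the diagonal, small contact rates elsewhere
M = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
curve = expected_infections(M, patient_zero=0, steps=24)
```

The expected infection count is non-decreasing and bounded by $|V|$, matching the qualitative behavior of the analytical model.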
3.1.1 Mobility Model
It is evident that in order to estimate the expected number of infected nodes in Equation
(3.3), we need to calculate the contact probabilities $\beta_{ij}$. In turn, these probabilities
depend on the number of domains a node visits and the duration of time that the node resides
in each domain. We therefore need a mobility model that describes the movement of
mobile nodes across network domains.
We model the mobility pattern of individual nodes using semi-Markov chains. We
chose the more general semi-Markov model because it was shown that node residence
times do not follow the exponential distribution [42, 43], but are better modelled by heavy-tailed
distributions. The state space $S = \{1, \cdots, m\}$ of the homogeneous semi-Markov
chain is the set of all network domains. The transition matrix $P$ describing the chain is then
an $m \times m$ matrix, while $D = [d_i]$ is an $m \times 1$ vector, which gives the mean residence time
of the node in each domain.
We can then derive the steady-state transition probability distribution $\pi$ by solving the
following set of equations:

$$\pi = \pi P, \qquad \sum_{i=1}^{m} \pi_i = 1$$

Given the fraction of time $\pi$ that the user stays in each state and the mean residence
times $D$ for each state, it is easy to calculate the steady-state probability $\bar{\pi}_i$ of the user
staying in domain $i$:

$$\bar{\pi}_i = \frac{d_i \pi_i}{\sum_{j=1}^{m} d_j \pi_j} \qquad (3.4)$$
From Equation (3.4) we can subsequently compute the contact rate $\beta_{xy}$ between nodes
$x$ and $y$. This value is equal to the probability that both $x$ and $y$ are in the same domain at
some point in time. Without loss of generality, we say that when a node is in the "OFF" state
(i.e., it is not operational) it resides in the domain with index 1. Since the infection
does not propagate when nodes are not connected, we do not include the percentage of time
in the "OFF" state in the calculation of the contact rates. The contact rates are then given
by:

$$\beta_{xy} = \sum_{i=2}^{m} \bar{\pi}_i^x \, \bar{\pi}_i^y \qquad (3.5)$$

where $\bar{\pi}_i^x$ is the percentage of the time spent by $x$ in domain $i$. We substitute Equation (3.5)
into Equation (3.2) to obtain the number of infectees as a function of time.
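For concreteness, the steady-state computation in Equation (3.4) and the contact rate in Equation (3.5) can be sketched as follows (an illustrative implementation, not taken from the dissertation; here index 0 plays the role of the "OFF" state):

```python
import numpy as np

def occupancy(P, d):
    """Time-weighted steady-state occupancy of one node's semi-Markov
    chain (Eq. 3.4): P is the m x m transition matrix, d the vector of
    mean residence times per domain."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()              # solve pi = pi P, normalized to sum 1
    weighted = d * pi
    return weighted / weighted.sum()

def contact_rate(occ_x, occ_y):
    """Contact rate beta_xy (Eq. 3.5). Index 0 is the 'OFF' state and is
    excluded, since the infection cannot spread while a node is offline."""
    return float(np.dot(occ_x[1:], occ_y[1:]))

# Two-state example: state 0 = OFF, state 1 = a single campus domain
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
d = np.array([1.0, 1.0])            # equal mean residence times
occ = occupancy(P, d)               # stationary split between OFF and domain 1
beta = contact_rate(occ, occ)       # two nodes with identical mobility
```

The eigenvector associated with eigenvalue 1 of $P^T$ gives the stationary distribution; weighting by residence times yields the occupancy probabilities that feed the contact-rate sum.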
The last complication is that Equation (3.2) proceeds in discrete time steps of uniform
duration, while nodes actually have variable domain staying times. We address this discrepancy
by using the mean residence time across all domains as the discrete time step in
Equation (3.2). While doing so compromises the accuracy of the analytical model, as the
simulation results from Section 3.2 demonstrate, even with this compromise the model is
able to accurately track the infection's evolution.
3.2 Evaluation
We derive the parameters of the mobility model described in the previous section from
traces of actual mobile user behavior, available from Dartmouth College [39]. Each trace
is a time sequence of the access points the mobile users visit (identified by their MAC
addresses). Traces also contain the special "OFF" location, signifying a user's departure
from the network. The trace we use contains 626 different access points and tracks the
movement of mobile users from 9/23/2003 to 12/10/2003. Approximately 6% of the users
in our trace visited just a single domain before entering the "OFF" state. We removed
such users, since the states in their semi-Markov chains are not recurrent and their steady-state
probabilities in states other than the "OFF" state are trivially zero. In all, we had 6101 users.
We assume that all the mobile users in the system are vulnerable. We observed similar
infection curves when only a fraction of the mobile users were vulnerable. Furthermore,
the infection model can easily incorporate scenarios in which only a subset of the mobile
nodes are vulnerable by appropriately defining the set of vulnerable nodes, $V$. The mean
domain residence time of the users is approximately 67 minutes. We use this value as the
discrete time step in Equation (3.2).
3.2.1 Mobile node infection
We compare the model's predictions with results provided by detailed simulations. The
custom simulator we developed emulates the movements of mobile users over the same
collection of APs and tracks the evolution of the infection after an initial node (patient
zero) is infected. As before, we assume that the infection passes from an infected node to
any other node that resides in the same network at the same time. We ran 100 simulations,
each time randomly choosing a different initial node to infect.
Figure 3.1 graphs the evolution of the infection as a function of time. In addition to the
infection curve predicted by the analytical model, we present three representative simulation
runs. These curves represent the 5th, 50th, and 95th percentiles across all simulations,
where rank is calculated based on the time when 70% of vulnerable hosts are infected. Intuitively,
these curves represent a slow, average, and fast infection instance, depending on
which node was infected first.
First, we note that the model provides a decent approximation of the average infection
evolution, faithfully tracking the curves that represent the simulations. Furthermore, the
[Figure: fraction of infected nodes (0–0.8) vs. time (hours, 0–60); curves: Sim 5%, Sim 50%, Sim 95%, Model]
Figure 3.1: Percentage of infected users as a function of time as predicted by the analytical
model and as demonstrated by simulation.
infection spreads to approximately 60% of the users within a single day. Given that the
worm requires under a day to infect the majority of the population, we experimented by
starting the infection on different days during the period covered by the network trace. In
all cases we observed patterns very similar to those in Figure 3.1. We also found that the
evolution speed varied depending on the time of day when the first node was infected.
Worms that started during the daytime spread faster than those started at night. This is due
to the decreased movement of nodes during the night hours.
3.2.2 Mixing mobile and static nodes
So far we have assumed that mobile users cannot infect nodes connected to the static
(wired) network. This model corresponds to current security practices according to which
WiFi APs are separated from the rest of the network (e.g., a company's intranet) by firewalls.
However, firewalls are complex devices that are notoriously difficult to configure.
Therefore, it is possible that a misconfigured firewall would allow infected wireless devices
to contact hosts residing in the static part of the network. More commonly, laptops can
connect directly to the static portion of the network after they have roamed across several
wireless domains (e.g., during a business trip), effectively bypassing the barrier between the
static and mobile compartments of a network domain.
In this scenario, static hosts can be infected by mobile nodes and subsequently carry
the infection to other vulnerable nodes. Therefore, it is no longer necessary for mobile
nodes to simultaneously reside in the same domain for the infection to spread; a mobile
node entering a network domain can contract the infection from infected static nodes in that
domain. In order to understand how these infections spread, we modified the original
simulator to assume the worst-case scenario, wherein an infected mobile node instantly
infects any domain that it enters. The "instant infection" assumption is valid even for
a uniform scanning worm (i.e., one which follows a naive strategy of random scanning and
is therefore among the slower spreading worms). Even with a scan rate of 10 scans/sec and
domains with as few as 10% vulnerable nodes, one static node on average is infected
within the first second of an infected user's entry into the domain.
Figure 3.2(a) presents the number of network domains infected as a function of time
when mobile nodes can infect the domains they visit. The infection spreads to about 65%
of the domains within a day. It then slows down considerably and takes a long time to
infect the remaining domains. This result might seem straightforward, given that 65% of
[Figure: two panels of fraction of infected hosts vs. time (hours); curves: Sim 5%, Sim 50%, Sim 95%]
Figure 3.2: (a) Rate of domain infections as a function of time with the total mobile population.
(b) Rate of infection with only 25% of the mobile nodes.
mobile nodes contract the infection within one day. In order to investigate the relationship
between the number of mobile nodes carrying the infection and its spread over the set of
network domains, we repeated the previous experiment with a randomly selected subset
of 1500 wireless nodes (25% of the original population). The surprising result, as Figure
3.2(b) indicates, is that infection rates in this case are comparable to the previous case,
i.e., the infection reaches ∼60% of the domains within a day. This result indicates that the
worm's speed is not significantly hampered by the significantly smaller set of cross-domain
carriers. This phenomenon can be explained by the association graphs usually observed in
social networks [44]. In that context, as well as in the context of network domains visited
by mobile hosts, domain popularity has been shown to follow a heavy-tailed distribution,
wherein a small number of domains are extremely popular, followed by a large number of
less popular domains. As a result, the smaller subset of nodes is still likely to frequent
the very popular domains, thus fuelling the growth of the infection.
3.3 Detection
Thus far we have shown that a mobile infection can take up to a day to affect a significant
portion of the vulnerable population. Although this is fast enough to make human
defense mechanisms implausible, it is considerably slower compared even to the naïve uniform
scanning strategy, or more sophisticated variants such as flash worms that can spread
over the entire Internet in a few minutes [45].
The fact that such worms spread more slowly might lead to the conclusion that they
are easier to contain. This, however, is false. On the contrary, mobile infections are more
difficult to detect using conventional approaches, such as distributed network monitors [46,
47]. In the paragraphs that follow, we explain the underlying reason for this negative result.
3.3.1 Detection Speed
We compare the expected time to detect a mobile infection to the average detection time
of a uniform scanning worm. Here we assume that a single network telescope is used to
detect the infection. We define detection time as the time elapsed from the first infection
until the first probe arrives at the address space monitored by the telescope(s). Suppose
that the telescope covers a large fraction, $\alpha = 0.5$, of the IP space used in the network
domain where it is deployed. Then, the expected time $T$ to detect the first instance of the
infection for a uniform scanning worm is given by:
$$H(T) = \int_0^T I(t) \cdot s \, dt \approx \int_0^T e^{sft} \cdot s \, dt = \frac{N}{\alpha}$$
$$\Rightarrow T = \frac{1}{s f} \ln\!\left(\frac{N f}{\alpha} + 1\right) \qquad (3.6)$$

where $H(t)$ is the number of IP addresses scanned by all the infected nodes in $[0, t]$, $s$ is
the scan rate, $N$ is the total number of domains, and $f$ is the average density of vulnerable
nodes.
Substituting conservative values for $s = 20$ scans/min (the Witty worm had a scan rate
of roughly 1200 scans/min [48]), $N = 1000$, and $f = 0.01$ in Equation (3.6), we find that a
uniform worm will be detected within 15 minutes on the average. By this time the worm
has spread to less than 2% of the vulnerable population (calculated from the equation for
the uniform scanning worm). Furthermore, the placement of the telescope is immaterial
to the detection time. Thus, we conclude that such a telescope can be an effective early
warning device for typical worms.
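Plugging these parameter values into Equation (3.6) directly confirms the figure (a quick numerical sanity check, our own):

```python
import math

def detection_time(s, N, f, alpha):
    """Expected detection time of a uniform scanning worm (Eq. 3.6):
    s = scan rate, N = number of domains, f = vulnerable-node density,
    alpha = fraction of one domain's address space the telescope covers."""
    return math.log(N * f / alpha + 1.0) / (s * f)

T = detection_time(s=20, N=1000, f=0.01, alpha=0.5)  # ~15.2 minutes
```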
On the other hand, since mobile worms scan only their local network, detection time
is governed by the speed with which infected mobile nodes enter the domain where the
telescope is located. Considering the same (randomly placed) single telescope, detection
will occur, on average, when the worm has spread to half of the domains. Figure 3.2
indicates that the time for the worm to spread to 50% of the vulnerable domains is ∼15 hours.
Within this time, the worm infection has already taken off, infecting a large number of
hosts. Once the worm enters the domain which contains the network monitor, detection is
much faster. On the other hand, since detection time is dominated by the time necessary for
the worm to enter the domain, using larger telescopes within a domain does not significantly
reduce detection time.
In short, unlike traditional uniform-scanning worms, telescope size is not important
and random placement is of little use. On the other hand, given that the worm infects
popular domains first, it is prudent to place worm monitors in those domains.
3.4 Spatial evolution
Until now we have investigated the temporal behavior of the infection. However, an
equally interesting aspect is the infection's spatial evolution, that is, how the infection
spreads over the collection of network domains the mobile nodes visit. We note that Figures
3.1 and 3.2 flatten out considerably after an almost vertical growth during the middle
phase of the evolution graph. This behavioral change can be explained by dividing the
spatial evolution of the infection into a number of distinct phases. The infection initially
"moves" in the direction of domains which are extremely popular, since many nodes visit
them. This is the slow take-off phase. These popular domains (we call them hubs) are
closely connected by the group of mobile nodes which frequent them, thus forming a dense
core of the network graph. When the infection reaches this core, an exponential increase
in the number of infected hosts occurs, as the majority of vulnerable nodes frequently visit
the core. Finally, the infection gradually slows down after it has consumed the core and
extends towards domains with low contact rates (i.e., unpopular domains). Figure 3.3 illustrates
this phenomenon, where it is clear that popular domains are infected within the first
few hours of the infection.
[Figure: domain popularity (node-hours, ×10^7) vs. infection time (hours)]
Figure 3.3: The first time an infected node is seen at a network domain as a function of
the domain's popularity, defined as the cumulative node-hours occupancy of the domain.
3.4.1 Popularity
We define the popularity of a domain as the cumulative number of node-hours that
nodes spend in that domain. This definition accounts for both the distinct number of nodes
visiting the domain as well as the length of time a node resides in the domain.
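The node-hours popularity metric can be computed directly from visit records; a minimal sketch (the record format is our own, for illustration):

```python
from collections import defaultdict

def domain_popularity(visits):
    """Cumulative node-hours per domain from (node, domain, hours) records."""
    popularity = defaultdict(float)
    for _node, domain, hours in visits:
        popularity[domain] += hours
    return dict(popularity)

# Example: two nodes visiting two (hypothetical) domains
records = [("n1", "AP-lib", 2.0), ("n2", "AP-lib", 3.5), ("n1", "AP-cafe", 1.0)]
# domain_popularity(records) -> {"AP-lib": 5.5, "AP-cafe": 1.0}
```

Summing hours rather than counting distinct visitors is what lets the metric capture both how many nodes visit a domain and how long they stay.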
Intuitively, placing network monitors in the most popular domains yields the earliest detection
times.

[Figure: time of detection (hours, 9–16) vs. percentage of domains used as monitors (0–1)]
Figure 3.4: Detection time when monitors are deployed in the top x% of the domains.

To quantitatively measure the effect of placing multiple monitors, we placed
monitors in the top x% of the domains and measured the detection times. As Figure 3.4
shows, installing monitors in 10% of the domains reduced the detection time to about 10
hours. During this time the worm has spread to less than 10% of the hosts (as seen from
Figure 3.1). Installing additional monitors provides only marginal benefits, reducing the
detection time, in the limit, to a little over 9 hours.
3.5 Discussion
Deploying wireless network monitors may involve modifying APs to scan through the
packets they forward looking for traces of malware, or deploying honeypots acting as decoys.
As we showed, placing such monitors in the top 10% of the domains can help detect
the worm early enough. However, this strategy in itself is not sufficient to guarantee early
detection. We present two arguments to support this claim.
3.5.1 Popularity is dynamic
First, we investigate how domain popularities change over time and the effect these
changes have on detection time. For this purpose we use the access points from the previous
dataset [39] to calculate the popularity of each domain on a weekly basis. We then choose
an initial set of the 50 most popular APs (∼10% of the total AP population) during the
first week of the network trace and measure how this set compares with the set of top 50
APs for every other week. The similarity between the first and every other weekly set is
estimated by calculating the dot product between the indicator vectors of the two sets and
dividing the result by 50. In this case a product of one indicates that the sets are identical,
while zero indicates that no common members exist between the two sets.
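The scaled dot product of the two 0/1 indicator vectors reduces to the size of the set intersection divided by 50; as a sketch (our own, with hypothetical variable names):

```python
def weekly_similarity(top_first_week, top_other_week):
    """Dot product of the two sets' 0/1 indicator vectors, scaled by the
    set size: 1.0 means identical top-50 sets, 0.0 means disjoint."""
    return len(set(top_first_week) & set(top_other_week)) / len(top_first_week)

# Example: 25 of 50 APs in common -> similarity 0.5
sim = weekly_similarity(range(50), range(25, 75))  # 0.5
```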
Figure 3.5(a) plots how the similarity between the top 50 APs evolved during the year
2004. It is evident that there are wide variations, with two prominent dips around weeks
30 and 50. Closer inspection of the CRAWDAD dataset revealed that during the Fall and
Spring sessions, the APs in the residential buildings were the most popular. On the other
hand, APs in the academic buildings and athletic centers were highly ranked during intersessions,
explaining the aforementioned changes. Figure 3.5(b) shows the corresponding
median worm detection time over time, when monitors are statically placed in the top 50
domains according to the popularity results of the first week. While it may seem that the
difference in detection time is only a matter of two hours, varying between 10.5 and
12.5 hours, the effects of this difference are dramatic. As Figure 3.1 indicates, this disparity
results in an infection spread of <5% in the case of 10.5 hours, as opposed to ∼30% when
the detection time is 12.5 hours. Thus, reducing the detection time window is crucial if
worm defenses are to have sufficient time to be effective.
[Figure: (a) scaled dot product vs. week index; (b) detection time (hours) vs. week index]
Figure 3.5: (a) Similarity between the popularity of the top 50 domains on a weekly basis
for 2004. (b) Median detection time if the monitors are deployed statically.
3.5.2 Evasive worms
The second reason why static placement of monitors is insufficient is that worms can
potentially detect their presence and avoid the networks in which these monitors are deployed.
Rajab et al. have presented an efficient probe-response attack that can be used to
discover the locations of network monitors deployed on the (wired) Internet [49]. A similar
technique could potentially be applied in the context of mobile infections. In this case,
worm instances probe the domain they currently reside in, using standard network tools such
as ping and ARP, or even passively eavesdrop on all ongoing communications to the AP. If a
domain is believed to host a monitor, the worm will not attempt to infect any mobile nodes
in that domain, thus avoiding detection.
On the other hand, if avoiding popular domains, in which monitors are deployed, slows
down the infection to the point where human intervention is practical, then the threat posed
by these evasive infections is minimal. To verify whether this is true, we simulated such
an evasive worm that does not try to infect the 50 most popular domains, and measured its
infection speed. Unfortunately, as Figure 3.6 indicates, the infection rate is still significant,
with 60% of the hosts infected within 3 days on the average.
[Figure: fraction of infected hosts vs. time (hours); curves: Sim 5%, Sim 50%, Sim 95%]
Figure 3.6: Worm evolution when the worm is inactive in the top 50 domains.
From the two arguments presented above it is clear that placing monitors in the most
popular domains is not a complete solution to the problem of early detection.
3.6 Background and Related Work
A large volume of research has focused on modelling Internet worms. Among these,
the classic homogeneous worm model assumed all-to-all node connectivity and that every
susceptible node was a target of equal probability [50]. More recent models accounted
for non-uniform scanning strategies [51], as well as for the fact that the node population is
not uniformly distributed over the IP address space [47]. However, much of the prior
work ([41, 52, 53] among many others) primarily considers how malware propagates in
wired networks. Instead, we explore how mobility can facilitate the spread of infections
among groups of nomadic users traversing different network attachment points, such as
WiFi Access Points. In this case, unlike previous scenarios, each infected node has a time-varying
infection transmission probability depending on its local scope.
In the context of mobile networks, Anderson et al. derived the speed of mobile worms
through simulations [54]. While our results seem to be in broad agreement, we focus our
attention on the actual infection evolution, so as to infer the worm characteristics. Similar
trace-driven studies covering infections over Bluetooth networks were performed by Su et
al. [55]. Unlike those previous studies, which are limited to simulations performed using
a particular trace, we propose a general analytical model that predicts the evolution of infections
over a wide range of mobility patterns. Epidemic spreading in ad-hoc networks
has been studied by Mickens and Noble in [56]. That work explained why traditional epidemic
models fail in the case of mobile networks and proposed a new framework for such
networks. While that study focused on worms spreading within a single ad-hoc wireless
network, our model explains how infections are carried across a variety of networks by the
physical movement of mobile users.
The mobility model we use is similar to the semi-Markov model presented in [57].
Lee et al. developed a cumulative model for different user groups to obtain the AP-user
mobility patterns. Instead, we model the mobility patternsof individual users. We choose
to do so, because the derived mobility model is then used to calculate the contact rates
between mobile node pairs. This factor determines the rate at which the infection travels
among individual nodes.
Today, it is generally considered good practice to place mobile nodes in a DMZ separated
from wired nodes. Various enterprise solutions exist for doing so, e.g., Cisco's
network admission control [58]. We believe that these perimeter defenses by themselves
are insufficient and that a more fine-grained approach is needed to detect and contain mobile
worms.
3.7 Summary and Future Directions
We presented and validated an analytical model that describes the evolution of worms
that exploit node mobility to propagate. We evaluated infection speeds in different scenarios:
first, when mobile users can only infect each other as they move across a collection
of network domains, and second, when infections can spread from mobile users to static
nodes. Our ultimate goal is to use this model to design effective detection and containment
mechanisms for this novel category of worms. While we touched upon the difficulties of
designing detection mechanisms for this type of infection, we discuss detection in further
detail in the next chapter.
Even with effective detection mechanisms, the feasibility of policing nodes as they enter
popular domains is not straightforward. Numerous practical concerns for containment
mechanisms designed for mobile infections must be addressed, including how to exploit
topological information to limit the damage from potentially infected nodes, how to appropriately
apply the notion of hard-LANs [59] in this setting, and how to track (in a tamper-resistant
manner) the movement of nodes across network domains.
Acknowledgements
This work was supported in part by National Science Foundation grant CNS-0627611.
We gratefully acknowledge the use of trace data from the CRAWDAD archive at Dartmouth
College.
Chapter 4
On the Detection and Origin
Identification of Mobile Worms
In the previous chapter, we discussed the various difficulties in the detection of mo-
bile worms. The detection of these worms is challenging because the majority of tech-
niques against zero-day infections rely on recognizing anomalous patterns in inbound
traffic (e.g., [46] among others), or on outbound DNS, ARP, or failed connection requests from
local hosts [60, 61]. A mobile worm, on the other hand, can find victims by eavesdropping
on the radio channel, thus generating no scanning, ARP, or DNS traffic. Moreover, the
alternative solution of using honeypots for detection is also ineffective, because it requires
placing honeypots in the majority of the domains [62].
In this chapter, we present two mechanisms for countering mobile worms. The first
mechanism detects the existence of a worm spreading through a collection of wireless
domains, while the second identifies the origin of the worm. Doing so involves identifying
the node(s) that initiated the infection as well as the nodes infected during the very early
stages of the epidemic. In turn, origin identification enables further investigation into the
underlying causes and techniques used to breach the network's defenses, and can provide
information relevant to law enforcement. These node identities can also be used to con-
tain the infection, by blocking their traffic or by automatically generating attack signatures
based on the traffic they transmit.
Both proposed mechanisms extend the Random Moonwalk technique [63] and only re-
quire network flow records consisting of the start, duration, source, and destination of all
flows within a wireless network domain. These flow records are collected at every domain
and are either aggregated into a centralized database, or made available through a federated
database, similar to the network forensic alliance (NFA) proposed in [64]. We first show
that the original moonwalk is ineffective against mobile infections, and then present two
new heuristics that can detect and identify such infections. We evaluate the performance of
the proposed algorithms through simulations driven by network traces collected from a
university-wide wireless network. Our results show that a mobile infection can be reliably
detected before it infects 10% of the vulnerable population in a network with hundreds
of domains and thousands of mobile nodes. Furthermore, the proposed identification al-
gorithm limits the search for the initial infection victims to within 2% of the mobile node
population. Working in concert, the two algorithms we present can effectively protect users
against stealthy mobile worms.
The remainder of this chapter is organized as follows. In the following section, we review the standard moon-
walk algorithm. Section 4.2 describes how the moonwalk algorithm can be modified for the
online detection of mobile worms. In Section 4.3 we show how to trace the evolution path
of a mobile worm. Finally, Section 4.4 presents related work and we close in Section 4.5.
4.1 Background
As part of our previous work, we showed that mobile infections can spread through
tens of thousands of victims located in hundreds of domains within a day [62]. We also
showed that a mobile infection initially “moves” towards highly popular domains, because
many nodes visit them. When the infection reaches these popular domains, its growth rate
rapidly increases and in the final phase it slowly spreads to the remaining domains.
Intuitively, placing network monitors and honeypots in the most popular domains yields
the earliest detection times. In fact, we showed that by installing monitors in ~10% of the
most popular domains, one could detect the infection while it is still in its early phase.
Deploying such wireless network monitors involves modifying APs to inspect the packets
they forward, or deploying honeypots acting as mobile nodes. Unfortunately, deploying
monitors in the most popular domains is insufficient for a number of reasons. First, domain
popularity changes over time depending on the users' mobility patterns. Second, mobile
worms can potentially detect the presence of monitors and avoid popular domains in which
they may be deployed. Similar probe-response attacks, used to discover the locations of
network monitors deployed on the (wired) Internet, have been discussed in [49, 65]. Mobile
worms could use standard tools such as ICMP or ARP requests, or even eavesdrop,
to infer the size of a domain and avoid highly popular domains. Finally, worm origin
identification is almost impossible if monitors are not deployed in every domain.
4.1.1 Random Moonwalks
The random moonwalk is a post-mortem method for identifying the origins of a worm
attack on the Internet using network flow data [63]. Specifically, given a set of network
flow records corresponding to a host contact graph, a moonwalk starts at an arbitrarily
chosen edge e1 = 〈u1, v1, ts1, te1〉, where u1 and v1 are the source and destination, and
ts1 and te1 the start and end times of the flow, respectively. The next edge backward in
time is selected uniformly at random from the set of edges that arrived at u1 within the
past Δt seconds, i.e., e2 = 〈u2, u1, ts2, te2〉 with te2 < ts1 < te2 + Δt. This process
continues for a maximum number of hops, or until no prior edge is found. Multiple
moonwalks are taken, and the edges that appear with the highest frequency across all
moonwalks are computed. These edges are likely to be the top-level causal edges of the
worm tree (i.e., the edges that initiated the infection). The intuition behind this approach
is that worms generate tree-like contact graphs in which a small number of early malicious
edges are responsible for a large number of edges further down the tree; the initial causal
edges will therefore be traversed multiple times and have high occurrence frequencies.
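As a concrete illustration, the walk described above can be sketched in a few lines of Python. Flow records are modeled as (src, dst, t_start, t_end) tuples; the function names and the linear scan over the flow list are illustrative simplifications, not the implementation of [63]:

```python
import random

def random_moonwalk(flows, d, delta_t):
    """Perform one random moonwalk over flow records.

    A flow is a (src, dst, t_start, t_end) tuple. Starting from a random
    edge, repeatedly step backward in time to a flow that arrived at the
    current edge's source within the past delta_t time units, for at most
    d hops or until no prior edge exists.
    """
    walk = [random.choice(flows)]
    for _ in range(d - 1):
        src, _, t_start, _ = walk[-1]
        # Prior edges: flows into `src` that ended before the current
        # flow started, but no more than delta_t earlier.
        prior = [f for f in flows
                 if f[1] == src and f[3] < t_start < f[3] + delta_t]
        if not prior:
            break
        walk.append(random.choice(prior))
    return walk

def edge_frequencies(flows, walks=1000, d=50, delta_t=300):
    """Rank edges by how often they occur across many moonwalks."""
    freq = {}
    for _ in range(walks):
        for edge in random_moonwalk(flows, d, delta_t):
            freq[edge] = freq.get(edge, 0) + 1
    return sorted(freq.items(), key=lambda kv: -kv[1])
```

The most frequent edges returned by `edge_frequencies` are the candidate top-level causal edges of the worm tree.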
The effectiveness of the moonwalk algorithm decreases as the worm becomes stealthier
and generates smaller amounts of excess traffic. Tracking mobile worms is therefore
especially challenging for the moonwalk algorithm, as infected nodes eavesdrop to discover
victims instead of scanning for them. Because an infected node only seeks victims within
its current domain, it suffices for the moonwalk algorithm to focus on the intra-domain
flows. However, even considering this reduced edge set, edge frequency is not a reliable
indicator of the initial causal edges of a mobile infection. In fact, as we shall show shortly, it
is impossible to even detect the presence of a mobile infection using the standard moonwalk
algorithm. The underlying reason is that a typical host contact graph is globally sparse but
has considerable local correlation. In other words, the density of flows between nodes
in the same domain is higher than that across domains, for both worm and non-
malicious traffic. Moreover, global and local contacts are made on different timescales.
While local contacts occur on the timescales of normal host connections, the timescales of
inter-domain contacts are governed by the usually slower physical movement of nodes
across domains.
4.2 Mobile Worm Detection
4.2.1 Random Moonwalks and Mobile Worms
To demonstrate the shortcomings of the standard moonwalk algorithm, we simulate a
mobile infection that spreads over a group of mobile nodes traversing multiple network
domains. We do so using two models: one describing the mobility pattern of nodes across
domains, and another reflecting the traffic patterns of mobile nodes within a domain. We
derive the first model from traffic traces collected at the Dartmouth College campus,
available through CRAWDAD [39]. The trace we use contains 626 domains (i.e., APs) and
over 6,000 nodes, and was collected from a campus-wide WiFi network between 9/23/2003
and 12/10/2003. Each trace entry corresponds to the time that a host, identified by its MAC
address, connected to one of the network domains. The trace also includes a special
'OFF' location, signifying a host's departure from the network.
To the best of our knowledge, there are no datasets that capture traffic that originates
and terminates within the same wireless domain. For this reason, we generate this traffic syn-
thetically. Specifically, we build a flow model using measurements of intra-domain traffic,
collected over a period of two weeks from the wireless APs at the Information Security
Institute at Johns Hopkins University. Note that applications such as FTP create two TCP
flows, one for control messages and one for data. We combine all the flows corresponding
to the same transaction into a single semantic flow. We then model semantic flow inter-
arrival times using a Lognormal distribution and flow sizes using a bi-Pareto distribution,
as suggested by [66]. The parameters for these distributions are fitted from the collected
packet traces. Finally, we select the size of the moonwalk's time window Δt to maximize
the walk lengths, and set the hop count to a large value to allow the moonwalks
to continue as far back as possible. Table 4.1 summarizes the parameters used in our
simulations.
Description                         Setting
Number of domains                   626
Number of mobile nodes              6101
Flow Inter-arrival Model            Lognormal
Flow Duration Model                 Bi-Pareto
Mean Domain Residence Time (TR)     67 min
Mean OFF Time                       315 min
Moonwalk Window Size (Δt)           300 min
Maximum Moonwalk Hop Count (d)      50

Table 4.1: Simulation Parameters.
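For illustration, synthetic intra-domain traffic of the kind described above could be generated along the following lines. The distribution parameters are placeholders rather than the values fitted from our traces, and Python's plain Pareto generator stands in for the bi-Pareto model of [66]:

```python
import random

def generate_intra_domain_flows(nodes, horizon, mu=2.0, sigma=1.0, alpha=1.5):
    """Generate synthetic intra-domain flow records.

    Sketch under simplified assumptions: lognormal inter-arrival times
    (mu, sigma are illustrative, not the fitted trace values) and a
    plain Pareto duration standing in for the bi-Pareto model.
    """
    flows, t = [], 0.0
    while True:
        t += random.lognormvariate(mu, sigma)   # next flow arrival
        if t > horizon:
            break
        src, dst = random.sample(nodes, 2)      # distinct endpoints
        duration = random.paretovariate(alpha)  # heavy-tailed duration
        flows.append((src, dst, t, t + duration))
    return flows
```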
Given these parameters, we simulate two scenarios: one with only normal traffic, and
one in which a mobile worm is injected at a random network node at a certain point in
time. We let both simulations progress until the worm has infected 65% of
the network's nodes, and then invoke the random moonwalk algorithm for each of the two
scenarios.¹
The top panel of Figure 4.1 presents the results of the random moonwalk algorithm for
the network with no malicious traffic, while the bottom panel presents the results when a
mobile infection is injected at t = 10,000 sec (~167 min). The y-axis in these graphs
corresponds to the frequency with which a flow that starts at a certain point in time occurs
over the set of random moonwalks performed. Therefore, a large frequency value indicates
an edge that was traversed during multiple moonwalks.
When the moonwalk algorithm executes on an Internet trace that contains an actively
spreading scanning worm, the initial causal edges of the attack have the highest frequencies,
creating a pronounced spike in the frequency graph (see [63]). In contrast, as is evident
¹Similar results were derived for other infection percentages.
Figure 4.1: (a) Random moonwalk on a network with no malicious traffic. (b) Random moonwalk on the same network when a worm is injected at t ≈ 167 min. The y-axis represents the frequency with which flows starting at a particular time appear in the set of paths traversed by the moonwalks.
from Figure 4.1, there is no marked increase in edge frequencies when a mobile worm is
spreading. Comparing the two cases, it is difficult to even infer the existence of a worm
from the lower graph.
4.2.2 Proposed approach
As the results from the previous section indicate, edge frequency is not an effective
indicator of infection in mobile networks. Instead, we use a different heuristic: the average
Figure 4.2: (a) Average moonwalk length for a network with no malicious traffic and a network in which a worm is injected at t ≈ 420 min. Graphs are shown when 100% and 75% of the population is vulnerable. (b) Percentage of infected nodes as a function of time for the same worm.
moonwalk length. The intuition for selecting this attribute is that as the worm spreads, the
host contact graph becomes inherently denser, as infected nodes contact other nodes to
spread the infection. As a result, the length of a moonwalk tends to increase. Moreover,
these contact paths tend to span multiple network domains, which is unusual for normal
traffic. Based on these observations, we posit that a worm can be detected by noticing a
steep increase in the average moonwalk length.
To test this hypothesis, we compute the average moonwalk length for the
Figure 4.3: Average moonwalk length for a network with different volumes of normal traffic. The curve labelled 'High' corresponds to double the volume of traffic in 'Norm', while the curve labelled 'Low' represents a scenario in which the traffic is halved.
simulated network presented in the previous section, and observe whether the introduction
of a worm creates a marked increase in the average moonwalk length. To make the worm
even more stealthy, we allow infected hosts to avoid contacting nodes which they had
either previously subverted or attempted to subvert.
Figure 4.2(a) presents the average moonwalk lengths for a network with no malicious
traffic and for a network in which a worm is injected at time t = 25,000 sec (~420 min). It
is clear from this graph that the average moonwalk length increases considerably after the
infection starts. At approximately 650 minutes into the simulation, the walk length in the
worm scenario is almost twice that of the normal case. By that time the worm has spread to
less than 10% of the vulnerable population (see Figure 4.2(b)). This observation is crucial
for early detection, because it gives containment strategies more time to be effective.
The same experiment was conducted assuming that only 75% of the mobile population is
vulnerable. As Figure 4.2(a) shows, the difference in moonwalk lengths is still significant.
More importantly, the results in Figure 4.2 suggest that a simple threshold detector
could alert network operators to the presence of a mobile worm in its early stage of
infection. Briefly, such a detector periodically calculates the average moonwalk length and keeps
a running average of this length (e.g., using an exponentially weighted moving average).
When the difference between the current length and the long-term average passes a thresh-
old, the detector raises an alarm about an actively spreading worm. Since user traffic can
vary over time (e.g., traffic volume during the day is usually higher than during the night),
we conducted a simulation in which we measure the average moonwalk length under vary-
ing traffic volume. Figure 4.3 shows the average moonwalk length when the normal
background traffic is halved and doubled. As can be seen, while there is a small increase in
the average moonwalk length, it is not as marked as in the case when a worm is present.
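A minimal sketch of such a threshold detector, assuming an EWMA baseline and a multiplicative alarm threshold (both parameter values are illustrative, not tuned):

```python
class MoonwalkLengthDetector:
    """Threshold detector over the average moonwalk length.

    Keeps an exponentially weighted moving average of the periodically
    measured mean walk length and raises an alarm when the current
    measurement exceeds the long-term average by a multiplicative
    threshold. Parameter defaults are illustrative assumptions.
    """
    def __init__(self, alpha=0.1, threshold=1.5):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # alarm ratio over the baseline
        self.baseline = None

    def update(self, avg_walk_length):
        if self.baseline is None:
            self.baseline = avg_walk_length
            return False
        alarm = avg_walk_length > self.threshold * self.baseline
        # Only fold non-anomalous samples into the baseline, so a
        # spreading worm does not drag the baseline upward.
        if not alarm:
            self.baseline = (1 - self.alpha) * self.baseline \
                            + self.alpha * avg_walk_length
        return alarm
```

Feeding the detector a stream of stable measurements keeps it quiet; a doubling of the average walk length, as observed in Figure 4.2(a), trips the alarm.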
4.2.3 Effect of infection on moonwalk length
We analyze how fast the moonwalk length increases when a worm is present. To do so,
we use simplifying assumptions about the host contact graph and the worm attack to make
the analysis tractable. The goal of this analysis is not to provide a closed-form solution, but
instead to support the effectiveness of the detection technique proposed in Section 4.2.
Assume that the host contact graph has H hosts, a fraction f of which are vulnerable, and the
worm is a uniform scanning worm with a scan rate of s scans per unit of time. Further-
more, we assume that all flows, malicious as well as normal, last for a unit of time. Let c be
the average number of non-malicious flows into a node over a unit of time. Obviously, if
c > 1, all our moonwalks will have length d, where d is the upper bound on the number of
steps a moonwalk can follow, and the moonwalk is in fact useless. On the other hand, the
contact graph of non-malicious traffic is generally sparse and c << 1. Finally, we assume
that the worm starts at t = 0 and that a moonwalk step is one time unit long.
Let l_n(i) be the average length of a moonwalk under normal traffic at time step i. Then:

    l_n(i) = (l_n(i-1) + 1) * c,    i = 1, ..., N        (4.1)
    l_n(0) = 0

Therefore l_n(i) = c(1 - c^i)/(1 - c). Note that as c → 1, l_n(i) → i. This is intuitively
true, since in that case there are always incident flows on a node.
Let l_w(i) represent the average length of the moonwalk when a scanning worm is
present. Then:

    l_w(i) = (l_w(i-1) + 1) * (c + s*I(i-1)/H)           (4.2)
    l_w(0) = 0

where I(t) is the number of hosts infected at time t, and s*I(i-1)/H represents the average
number of scans from infected hosts that arrive at a node in a unit of time.
Let Δ(i) denote the difference in the moonwalk lengths for the normal and the worm
scenarios. Using Eqs. (4.1) and (4.2), we can express Δ(i) as:

    Δ(i) = l_w(i) - l_n(i)
    Δ(i) = c*Δ(i-1) + (l_w(i-1) + 1) * s*I(i-1)/H,    i = 1, ..., N    (4.3)

If the moonwalks start at time N, we can express Δ(N) by unfolding the recursion as:

    Δ(N) = (s*c^(N-1)/H) * Σ_{i=0}^{N-1} (l_w(i) + 1) * I(i) / c^i     (4.5)
Since l_w(i) > l_n(i) ≥ c, and by virtue of Eq. (4.1):

    Δ(N) ≥ (s*c^(N-2)/H) * Σ_{i=0}^{N-1} l_n(i+1) * I(i) / c^i         (4.6)
Using only the last term of the summation in (4.6) and substituting values for a uni-
form scanning worm (e.g., for the Witty worm [67], s = 350 scans per tick, with conservative
values f = 0.03 and c = 0.5), we find that by the point when the infection has overtaken 10% of the
vulnerable population, the dilation in the walk length is greater than twice the moonwalk
length in the normal traffic scenario.
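The growth of this dilation can also be checked numerically by iterating the two recursions. In the sketch below, `extra` stands for the worm term s·I/H in Eq. (4.2), held constant as a simplification; with the numbers above, once 10% of the vulnerable population (I = 0.1·f·H = 0.003·H) is infected, extra = 350 × 0.003 ≈ 1.05:

```python
def walk_lengths(N, c, extra):
    """Iterate recursions (4.1) and (4.2) side by side.

    `extra` stands for s*I(i-1)/H in Eq. (4.2); holding it constant is a
    simplification (in reality I grows over time), and walk lengths are
    not capped at the hop bound d here.
    """
    l_n = l_w = 0.0
    for _ in range(N):
        l_n = (l_n + 1.0) * c            # Eq. (4.1): normal traffic
        l_w = (l_w + 1.0) * (c + extra)  # Eq. (4.2): worm present
    return l_n, l_w
```

With c = 0.5, the normal walk length converges to c/(1 − c) = 1, while the worm-scenario walk length grows without bound, far exceeding twice the normal value.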
4.3 Worm Identification
The modified moonwalk algorithm from the previous section can detect the presence
of a mobile worm. We now show how it can also be adapted to identify the first infected
node (also known as patient zero) as well as to reconstruct the initial infection sequence.
Depending on the speed of detection, the identities of these infected nodes can then be used
to thwart the infection, for example by blocking traffic from those nodes and inspecting
their traffic to generate attack signatures [68, 69].
At the same time, it is generally impossible to pinpoint the patient zero(s) of any in-
fection using a purely flow-centric approach such as moonwalks. To see this, consider a
scenario in which the patient zero was contacted by a benign node prior to the start of the
infection. It is difficult to infer which of these two nodes is in fact the true patient zero
without inspecting the contents of the flow between these nodes, or the nodes themselves.
What the moonwalk algorithm can achieve is to considerably reduce the number of nodes
that must be inspected in order to reveal the origins of the infection.
Specifically, the algorithm identifies a small set of candidate infection trees, one of
which is the true infection tree. In this context we define the infection tree as the graph
induced by the worm's node infection sequence. The first step in this process is to iden-
tify each of these trees' roots. To do so, we modify the moonwalk algorithm in three key
aspects. First, we add a stopping condition to the moonwalk: a parameter Ts which de-
notes the estimated time when the infection started. We then halt every moonwalk when it
proceeds past Ts. This parameter can be estimated using the detection algorithm described
in Section 4.2. Specifically, we showed that just before the infection enters its exponential-
increase phase, the average path length departs from the normal average. Then, if the worm
is detected at time Td, we set Ts = Td − ΔT. The value of ΔT depends on the mobility
pattern and worm characteristics. In general, given a specific network and mobility model,
ΔT should be set to the amount of time required for an infection to propagate to a popular
domain from any wireless domain in the network. Second, in addition to edge frequencies,
we record the frequencies with which each root node (i.e., alleged patient zero) appears
in the moonwalks, as well as the average walk lengths associated with each of these root
nodes. Finally, we start each walk randomly, but only from nodes within the top p% most popular
domains. The rationale for this choice is based on the observation that a mobile infection
in its initial phase moves towards domains of high popularity [62]. Therefore it is more
likely that during the early stages of the infection a malicious flow will be encountered in
popular domains. We set p = 25% to maximize the probability that at least some of the
moonwalks will follow a backwards path on the infection tree.
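The three modifications above can be sketched as follows. This is a hypothetical illustration: flow records are (src, dst, t_start, t_end) tuples, `popular_nodes` is an assumed precomputed set of nodes residing in the top-p% popular domains, and the parameter defaults echo Table 4.1:

```python
import random
from collections import defaultdict

def origin_moonwalks(flows, popular_nodes, t_s, runs=10000, d=50, delta_t=300):
    """Moonwalks modified for origin identification (a sketch).

    (1) Walks halt once they would step to a flow starting before the
    estimated infection start t_s; (2) the frequency and average walk
    length of each root node (alleged patient zero) are recorded;
    (3) walks start only from flows sourced in popular domains.
    """
    start_edges = [f for f in flows if f[0] in popular_nodes]
    root_freq = defaultdict(int)
    root_lens = defaultdict(list)
    for _ in range(runs):
        walk = [random.choice(start_edges)]
        while len(walk) < d:
            src, _, t_start, _ = walk[-1]
            prior = [f for f in flows
                     if f[1] == src and f[3] < t_start < f[3] + delta_t
                     and f[2] >= t_s]          # stopping condition (1)
            if not prior:
                break
            walk.append(random.choice(prior))
        root = walk[-1][0]                     # earliest edge's source
        root_freq[root] += 1                   # recording (2)
        root_lens[root].append(len(walk))
    avg_len = {r: sum(v) / len(v) for r, v in root_lens.items()}
    return root_freq, avg_len
```

The returned per-root frequencies and average walk lengths are exactly the two quantities plotted against each other in the next step.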
Once we carry out the random moonwalks, we draw the scatter plot of root node
frequency versus walk length. The outliers in this scatter plot, that is, the nodes with
high frequencies and long walk lengths, are the possible roots of the infection trees. The
intuition behind this approach is two-fold: (a) the frequency of the actual patient zero is
high because worms tend to form tree-like structures and therefore multiple reverse paths
lead to that node; (b) unlike worms, non-malicious node contacts do not tend to form long
paths.
We evaluate the performance of the algorithm using the simulation setup presented
earlier. Figure 4.4 illustrates the scatter plot from the output of 10,000 moonwalks on 2.4 ×
10^5 flows, with ΔT set to five hours. We use a simple filtering algorithm for identifying the
Figure 4.4: The scatter plot of walk length versus root node frequency. The square dot indicates the actual patient zero.
outliers from this scatter plot. Points having walk length and frequency greater than 90%
of the rest are chosen as outliers. The dashed horizontal and vertical lines in Figure 4.4
represent these 90th percentiles. The points in the upper right quadrant of the graph are the
roots of the candidate infection trees. The actual patient zero is also shown in Figure 4.4
as a square dot. While more sophisticated outlier detection algorithms could be used, we
found in practice that this simple approach produces a small list of candidates that always
includes the actual patient zero.
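The percentile filter described above admits a very small implementation; the quantile helper below is a deliberately simple stand-in for a proper outlier detector:

```python
def candidate_roots(root_freq, avg_len, pct=0.90):
    """Pick outlier roots: those whose frequency and average walk length
    both fall at or above the pct-quantile over all observed roots.
    """
    def quantile(values, q):
        vals = sorted(values)
        return vals[min(int(q * len(vals)), len(vals) - 1)]
    f_cut = quantile(list(root_freq.values()), pct)
    l_cut = quantile(list(avg_len.values()), pct)
    return [r for r in root_freq
            if root_freq[r] >= f_cut and avg_len[r] >= l_cut]
```

The returned list corresponds to the upper-right quadrant of the scatter plot in Figure 4.4.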
Starting from each of the candidate patient zeros, we reconstruct the candidate infection
trees using all the edges that were traversed during the moonwalk phase. This is done using
a simple breadth-first search (BFS) traversal. Figure 4.5 presents the results of this traversal
for three of the candidates from Figure 4.4. The nodes in these trees need to be further
inspected for signs of infection. While the actual inspection method is out of the scope
of this work, we can show that only a small percentage of nodes must be inspected. For
example, if we traverse all tree nodes up to depth three, then ~2% of the total population
must be inspected. As an aside, the actual infection tree shown in Figure 4.5 was the one
rooted at node 5344, and all the nodes in that tree were actually infected. This result
is encouraging because it indicates that the number of nodes falsely identified as active
spreaders is rather low.
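The BFS reconstruction with a depth cut-off can be sketched as follows; any (src, dst) edge list gathered during the moonwalk phase would serve as input:

```python
from collections import deque

def infection_tree(root, walked_edges, max_depth=3):
    """Reconstruct a candidate infection tree by BFS from a candidate
    patient zero over the edges traversed during the moonwalk phase.
    Returns the set of nodes to inspect, up to max_depth hops away.
    """
    children = {}
    for src, dst in walked_edges:        # forward direction: src -> dst
        children.setdefault(src, set()).add(dst)
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                     # do not expand past the cut-off
        for nxt in children.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```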
Last, we investigate the effect of different Ts (estimated infection start time) values on
the outcomes of the algorithm. In our experiments Ts is set to the detection time minus ΔT
(five hours). In most cases, we noticed that this slightly underestimates the actual
infection onset. The problem with underestimation is that the true patient zero could have
been contacted by other nodes in this interval, thereby reducing the patient zero's frequency
in the moonwalks. Nonetheless, the proposed algorithm is still effective, as the nodes which
contacted the true patient zero still exhibit relatively high frequencies and long path lengths
in place of the patient zero itself. The downside is that as the estimation error increases,
the number of nodes that need to be inspected also increases. In our experiments, we
noticed that even if the actual start time is underestimated by about 10,000 seconds
(~2.8 hours), we needed to inspect at most 5% of the total node population.
4.3.1 Discussion
We have shown that the infection tree has both a high patient-zero occurrence frequency
and a long walk length. However, as the volume of normal traffic increases, it adds more
noise to the selection algorithm. In other words, the normal traffic starts forming trees with
lengths comparable to those of infection trees. As a result, the number of candidate trees to
Figure 4.5: Candidate infection trees reconstructed using a BFS search. The tree rooted at 5344 is the actual infection tree; all nodes in this tree were indeed infected by the worm. The trees rooted at 1167 and 2148 are benign. A directed edge between nodes X and Y indicates that X initiated at least one flow to Y.
inspect increases. In the extreme case, the infection tree could remain 'hidden' within the
volume of normal traffic.
We investigate the effect of the normal traffic volume by running the proposed identifi-
cation algorithm on networks with increasingly higher levels of normal traffic. To do so, we
keep the Lognormal distribution of flow inter-arrival times presented in Section 4.2.1,
but decrease the mean inter-arrival time, thus generating increasing levels of normal traf-
fic. As Figure 4.6 illustrates, the percentage of mobile nodes that should be investigated
for signs of infection increases as hosts spawn flows faster. Nonetheless, two encouraging
observations can be made. First, the algorithm continues to identify the infection tree as
the volume of normal traffic increases. Second, a decrease in the inter-arrival time by three
orders of magnitude increases the number of nodes that must be inspected only sixfold.
Figure 4.6: Percentage of mobile nodes that need to be inspected for signs of infection as a function of the normal traffic intensity.
Even when the average flow inter-arrival time is roughly ten minutes (a high value for
intra-domain traffic in wireless networks), the algorithm needs to inspect only 6% of the
overall node population.
Finally, we briefly address a few issues regarding the practicality of deploying the proposed
framework, with reference to the latency of real-time log collection, diverse background
traffic, and the size of connection logs. An in-depth study is deferred to future work.
As noted earlier, we assume that flow records are either aggregated at a
centralized database, or available through a federated database (as proposed in NFA [64]).
As seen from Figure 4.2, the detection algorithm can tolerate a latency of up to an hour
while still providing early detection. With regards to other types of background traffic, P2P
traffic also tends to form long paths; however, it does not cause large changes in the path
lengths. Since our detection algorithm concentrates on changes in the path lengths,
worm and normal P2P traffic can be differentiated. With regards to the amount of space
required to store the flow headers of the intra-domain traffic: assuming that the source host
ID, destination host ID, domain ID, and start and end times of a flow require four bytes each, a
flow record can be described in 20 bytes. In this case, even if we assume that nodes initiate
new connections at a rapid pace of one flow every minute, the storage space required for
the simulated network of 6,000 nodes is a modest 165 MB/day. Moreover, we expect the
storage space to grow slowly as the number of hosts increases, because the host contact
graph is sparse.
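This estimate is easy to verify with a back-of-the-envelope calculation, using the field sizes and flow rate assumed above:

```python
def flow_log_storage_mb_per_day(nodes, flows_per_min=1, record_bytes=20):
    """Daily flow-log storage in MB (2^20 bytes), assuming five 4-byte
    fields per record and one new flow per node per minute."""
    return nodes * flows_per_min * 60 * 24 * record_bytes / 2**20

# For the simulated 6,000-node network this gives ~165 MB/day.
```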
4.4 Related Work
The threat of mobile infections was first discussed by Anderson et al. [54]. Sarat et al.
derived the speed of mobile worms through analysis and simulations [62], while more re-
cently attacks against metro-area wireless networks were discussed by Akritidis et al. [70].
Our work is inspired by the work of Xie et al. on random moonwalks [63, 64]. How-
ever, that technique is primarily a post-mortem tool for identifying infected nodes, in the
context of Internet worms, using host contact graphs. As we showed in Section 4.2.1, the
effectiveness of the standard moonwalk method decreases rapidly as worms become more
stealthy and infections are carried across domains by mobile nodes. We address the limi-
tations of the original approach by exploring different heuristics, such as moonwalk length
and node occurrence frequency, and show that moonwalks can in fact be used as a tool for
the detection and origin identification of mobile worms.
Origin identification has been studied in the context of Internet worms. For example,
Kumar et al. presented a forensic analysis of the Witty worm [71] by reverse engineering
the random number generator used by the worm [72]. In contrast, our technique is flow-
based and thus worm-agnostic. Recently, there has been a body of work on securing en-
terprise networks [73, 74]. While that network environment is orthogonal to the one used
in this chapter, we believe that the moonwalk technique presented herein can play a role
within such centralized architectures to detect and provide forensic analysis of malicious
activities.
4.5 Summary and Future Work
This chapter presents mechanisms to detect the existence and to identify the evolution
of worms spreading through a collection of wireless domains, carried by the physical move-
ment of mobile hosts. The proposed approach extends the existing framework of random
moonwalks by focusing on the combination of moonwalk lengths and node frequencies to
detect the existence of a stealthy worm and determine the identities of the infection's initial
victims.
While we evaluated these algorithms in the context of mobile networks, we believe that
they are also applicable to other worm scenarios. Because moonwalks essentially cull out
worm edges in the presence of noisy background traffic, we believe them to be robust in
the presence of missing traffic, or in a distributed scenario in which some domains are non-
cooperative.
Acknowledgments
We gratefully acknowledge the use of trace data from the CRAWDAD archive at Dart-
mouth College. We thank Fabian Monrose, Razvan Musaloiu-E., and Moheeb Abu Rajab
for their suggestions. Brian Hoffman helped immensely in collecting the intranet data
traces. This work was supported in part by the National Science Foundation through grant
CNS-0627611.
Chapter 5
On Web Browser Protection
The web browser is the most widely used network application on the Internet today. The
past few years have seen a spate of browser-related vulnerabilities, e.g., cross-site scripting (XSS) and cross-site request forgery (CSRF) attacks. The majority of these attacks exploit the trust placed by a web browser in a web site providing content. While such trust works well for single-source content, recent years have seen web sites evolve from essentially single-principal sources into pages containing a mashup of code and data from multiple, perhaps mutually distrusting, sites. The increasing number of attacks exploiting web browsers indicates that the security policies that browsers currently enforce are clearly inadequate.
In this chapter, we focus on novel browser abstractions with the aim of alleviating browser vulnerabilities. Specifically, we propose two new abstractions: (a) for content which needs to be completely isolated from other domains, and (b) for content shared amongst domains with access control enabled. Existing browsers support the isolation abstraction using the <frame> or <iframe> tag. However, the origin of the frame and the document must be different. Consequently, this technique is ineffective against same-site XSS attacks, like the Samy worm [75]. Furthermore, the abstractions presented herein allow controlled sharing of content. To illustrate this, we use the example of a hypothetical social networking site in which users can view, share and execute each other's JavaScript. The abstractions in today's browsers are not granular enough to accomplish such sharing without compromising security, due to the danger of an XSS attack. For example, if Alice and Carol are allowed to submit scripts to Bob's profile page, then Bob has no way of selectively executing only Alice's script when he views his profile page in the browser. Allowing controlled access to the entities of an HTML page, e.g., the DOM, cookies, etc., can eliminate XSS attacks, even when scripts are allowed as user-generated input. We design a multi-principal browser which supports these abstractions. As a proof of concept of their effectiveness, we modify the Konqueror browser source code. The changes are backwards compatible with legacy systems.
The rest of this chapter is organized as follows. Section 5.1 provides a brief overview of present-day browser protection mechanisms and their vulnerabilities. Section 5.2 details the abstractions introduced in this chapter. In Section 5.3, we describe the implementation of the abstractions in the Konqueror browser. We present related work in Section 5.4. Finally, we conclude and present avenues for future work in Section 5.5.
5.1 Background
Web pages today provide a rich, interactive experience driven by client-side scripting, enabling asynchronous requests. Moreover, web pages are increasingly multi-principal, e.g., web mashups. These pages are composed of content originating from more than one site. For example, users of the pipes.yahoo.com mashup wizard connect to pipes.yahoo.com to get data. The request is proxied to the real data providers and the response data is then passed back from pipes.yahoo.com to the mashup. A custom mashup could, for instance, source image data from Flickr corresponding to a news item from CNN. Furthermore, the AJAX (Asynchronous JavaScript and XML) programming model is commonplace today in applications like Google Maps. AJAX uses client-side JavaScript to maintain interactivity while network-centric requests are relayed in the background using XMLHttpRequest calls to the server. Browsers of today are incapable of handling such complex access control policies. In practice, mashups are created using third-party proxies which reformulate the page before it is sent to the browser. This is not granular enough to be either secure or scalable.
5.1.1 Same Origin Policy
The same origin policy (SOP) governs access control in today's browsers. The philosophy of the SOP is simple: it is unsafe to trust content loaded from third-party websites in the context of a webpage. As semi-trusted scripts are run within the sandbox, they should only be allowed to access resources from the same website, not resources from other websites, which could potentially be malicious. Two pages share the same origin if the protocol, port and host are the same for both pages. Every browser window, <frame> and <iframe> is associated with an origin. While a page cannot directly query other websites for data due to the same origin policy, the <script> tag does not honor it. A web page might contain <script> elements sourced from different domains. Such scripts function under the purview of the document's origin and can access all of the document's resources. For example, if a page a.com/index.html contains a script tag <script src="http://b.com/myscript.js">, then myscript.js has access to all the DOM elements, cookies and data of a.com's index.html page. However, myscript.js cannot access any resource pertaining to b.com within this context.
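The origin comparison described above can be sketched in a few lines. This is a simplified model of the SOP check, assuming default ports for http and https; real browsers apply further rules (e.g., document.domain relaxation) not shown here.

```python
from urllib.parse import urlparse

def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin iff the (protocol, host, port) triples
    match. Default ports are filled in so http://a.com and
    http://a.com:80 compare as equal."""
    def origin(url):
        p = urlparse(url)
        port = p.port or {"http": 80, "https": 443}.get(p.scheme)
        return (p.scheme, p.hostname, port)
    return origin(url_a) == origin(url_b)
```

Note that changing any one component of the triple, including only the scheme, yields a distinct origin.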
5.1.2 XSS attacks
In XSS, an attacker typically exploits the case where a web server directly sources user input into a dynamically generated page without first filtering the input. Attacks can be either persistent or non-persistent. Persistent attacks (or stored vulnerabilities) occur when data provided to a web site by a user is stored on the server without being checked for script entities. The malicious script then executes with the site as its origin, and can send sensitive data back to the attacker. A classic example is the Samy worm [75], which propagated across the MySpace social-networking site. Non-persistent attacks are reflected attacks, wherein data provided by a web client is used immediately by the server side to generate a page of results for that user. If unvalidated user-supplied data is included in the resulting page without proper HTML encoding, client-side code can be injected into the dynamic page. An attacker can then use social engineering to trick a user into visiting the URL, which injects the malicious script into the dynamic page.
The root causes of XSS attacks are unsanitized user input and unexpected script execution. Typically, server-side applications of today sanitize input by HTML-encoding all user input (e.g., &lt; in place of <). However, websites often allow rich user input, in the form of HTML or images. Parsing for scripts within such rich user input is non-trivial, as demonstrated by the many existing ways of injecting a script [8].
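The difficulty of parsing for scripts, as opposed to encoding everything, can be illustrated with a toy comparison. The blacklist filter and the payload below are illustrative; real filters and real injection vectors are far more varied.

```python
import html

def naive_filter(s: str) -> str:
    # Blacklist approach: strip only literal <script> tags.
    return s.replace("<script>", "").replace("</script>", "")

def encode(s: str) -> str:
    # Encoding approach: neutralize all markup characters.
    return html.escape(s)

# Script injection without any <script> tag: an event handler attribute.
payload = '<img src=x onerror="steal()">'
```

Running both defenses over the payload shows the blacklist passes it through untouched, while HTML encoding leaves no markup characters for the browser to interpret.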
The other known approach to defending against XSS attacks is to constrain user data using the SOP. A cross-domain iframe is used to display all user-supplied data, inclusive of scripts. For example, Alice's content page is served from alice.server.com while Bob's user-generated content is put into an <iframe> sourced from bob.server.com. Since the origins are different, there is isolation. Such an approach is, however, not scalable, because the server has to maintain a different domain for every user's generated input. Furthermore, script interactions with the rest of the page are then restricted and the display is not flexible.
5.2 Trust Model
As discussed previously, existing browsers depend purely on the <script> tag for cross-domain communication. To address this, we introduce two additional abstractions, one for isolating content and the other for sharing it under controlled access, using the <isolate> tag. Consider a webpage containing a tag such as:
<isolate src="http://server.com/alice.html" id=110>
This creates an isolated environment, akin to an iframe with a different source. Since the content within <isolate> is private, when the src attribute indicates a path from a different domain, the enclosing page cannot access the content of the page within the isolate tag; this is a side effect of the SOP. However, when the content comes from the same domain, the enclosing content can fully access the isolated content. Finally, the isolated content cannot reach out to access (read/write/execute) script elements, DOM elements, cookies, etc. of the enclosing page, even if their origins are the same.
Sharing can be enabled by treating the id as a bitmask. To illustrate this, consider another isolate tag in the same page:
<isolate src="http://server.com/bob.html" id=101>
An access control scheme could, for example, allow sharing between all the isolated environments which have their most significant bit (MSB) set to 1. The method of access control could in fact be made similar to process access control in traditional operating systems. For now, we treat the topic of access-controlled sharing as an avenue for future work.
DEFENSE AGAINST XSS
As was seen in Section 5.1, XSS attacks arise due to a confused-deputy problem in the browser abstractions. Since the browser is unable to distinguish between "good" and "bad" scripts, an all-or-nothing approach is used. Using the abstractions presented herein, the web server serves untrusted content within an isolate tag.
We briefly outline our defense strategy using the example of a message forum where users are allowed to post scripts in addition to HTML. Every user-generated input is packaged into an isolate tag, such as:
<isolate src="http://forum.com/userA.html" id=101>
Here the id is specific to a user; every user is assigned a different id. In this way, sharing of scripts can also be implemented, if desired. Now, scripts can access elements within the isolate subtree, but any access to the enclosing page is denied. The rationale behind this technique is as follows: if we consider user-generated input as tainted information, the server can easily distinguish input from different users and differentially taint it using user ids. The isolate tag then instructs the browser to treat tainted content appropriately, either isolating it or allowing it to be shared. This technique works effectively even in the case of a non-persistent XSS attack, whereby the user-generated input is isolated from the enclosing page.
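The server-side tainting step described above can be sketched as a small templating helper. Function names and the post representation are ours, for illustration only; the essential point is that every user submission is emitted inside its own <isolate> tag carrying that user's id.

```python
def wrap_user_content(isolate_id: str, content_url: str) -> str:
    """Taint one user's submission: serve it from its own URL and wrap
    it in an <isolate> tag carrying that user's id."""
    return '<isolate src="{}" id={}>'.format(content_url, isolate_id)

def render_forum_page(posts):
    """posts: list of (isolate_id, url) pairs, one per user submission.
    The enclosing forum page never inlines raw user markup."""
    return "\n".join(wrap_user_content(i, u) for i, u in posts)
```

Because the browser denies any access from an isolate subtree to the enclosing page, a script smuggled into a post can no longer read the forum page's DOM or cookies.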
5.3 Konqueror Implementation
We built a proof-of-concept browser based on the abstractions designed above. To do so, we modified the Konqueror browser source code [76]; the modified browser runs on Linux. The changes are backwards compatible with legacy systems.
We implemented only the isolate abstraction, leaving the sharing abstraction as part of our future work. Our extension to the Konqueror source code sits between the JavaScript interpreter (KJS) and the HTML browser engine (KHTML). Whenever a script element is encountered, KHTML passes the script to the KJS interpreter, which then returns the results of evaluation to KHTML. Script execution can manipulate DOM elements. Therefore, whenever a DOM object is encountered within a script, calls are made back to the KHTML library for references to these objects. We illustrate the call flow using an example (Fig. 5.1).
<script type="text/javascript">
document.write("The title of my parent is " + parent.document.title);
</script>
The KHTML library calls the evaluate function of the KJS interpreter and passes it the script code. However, KJS needs a reference to the object document to resolve document.title. A call is made back to the KHTML library to obtain the reference to the corresponding DOM object. Similarly, a call is made from KJS back to the KHTML library
Figure 5.1: The proxy extension overlaid on top of a simplified JavaScript call graph.
to resolve parent of the document object. Our extension acts as a proxy between KHTML and KJS. Whenever calls are made from KJS to KHTML, we check whether the calls are allowed as per the isolation restriction. In the above example, if this script is from isolate content, then the KHTML library returns the document itself as the parent. Otherwise, the true parent is returned.
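The proxy's reference-rewriting logic can be sketched abstractly as follows. This is a model of the behavior only: the actual implementation is C++ inside Konqueror, and the class and method names here are ours, not Konqueror's.

```python
class ToyEngine:
    """Stand-in for the KHTML side: tracks which documents came from
    <isolate> content and what the real parent document is."""
    def __init__(self, isolated, parent):
        self.isolated, self.parent = isolated, parent

    def is_isolated(self, doc):
        return doc in self.isolated

    def true_parent(self, doc):
        return self.parent

class IsolationProxy:
    """Sketch of the extension sitting between the script interpreter
    and the browser engine (KJS and KHTML in the real system)."""
    def __init__(self, engine):
        self.engine = engine

    def resolve_parent(self, document):
        # A script inside <isolate> content gets the document itself
        # back as its parent, so it can never reach the enclosing page.
        if self.engine.is_isolated(document):
            return document
        return self.engine.true_parent(document)
```

With this interposition, the example script above would print the isolate's own title when run from isolated content, and the real parent's title otherwise.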
5.4 Related Work
There has been a plethora of work on studying and protecting browsers and the underlying operating system from browser vulnerabilities such as drive-by downloads. Moschuk et al. [77] conducted a study of spyware on the web by crawling 18 million URLs in May 2005. HoneyMonkey by Wang et al. detects exploits against Windows XP while visiting sites in Internet Explorer. Provos et al. [78] crawl millions of pages to determine which URLs are malicious (drive-by downloads, etc.). These pages can later be marked so as to caution the user against visiting them. Our work is different in that it focuses on the weaknesses in the security policies of the browser rather than web-based infections. Furthermore, these approaches typically deploy a browser within a virtual machine and detect any changes to the operating system while visiting websites in order to mark URLs. This approach is heavyweight and does not work against vulnerabilities such as XSS.
Several new browser communication proposals have emerged due to the limitations of the SOP [79], wherein a site may request information from any other site, and the responder can check the request to decide how to respond, e.g., Flash. While such policies are verifiable, they cannot contain XSS attacks. Subspace [80] provides a cross-domain communication mechanism using a small JavaScript library. Subspace divides a site into subdomains. A subdomain can be used to source scripts from other domains. A cross-subdomain channel is set up by setting the document.domain of the two subdomains to a common domain suffix. However, such an approach is cumbersome for the mashup developer when many sources are involved. Browser-enforced embedded policies (BEEP) [81] allow a website to whitelist scripts, i.e., declare which scripts are safe to run. While such a technique can combat XSS attacks, it still lacks abstractions for access control which could be employed in sharing scripts. MashupOS [82] is perhaps the closest to the work presented in this chapter. In the MashupOS project, a multi-principal browser is built based on the abstractions <Sandbox> and <OpenSandbox>. <Sandbox> is similar to the <Isolate> tag. <OpenSandbox> is similar to <Sandbox>, the only difference being that the enclosing page can access the sandboxed content. They also envisage <ServiceInstance> as a unit of abstraction which guarantees resource allocation and <CommRequest> for cross-domain communication. Sharing in their model is done explicitly by adding a CommRequest agent, while ours is closer to process sharing in Unix-like systems.
5.5 Summary and Future Work
Content sourced from various sites, combined with asynchronous programming models (AJAX), has made it necessary for browsers to become multi-principal. Browsers need to be able to handle trust relationships between different sites and between entities on the same site. This chapter focuses on providing abstractions for protection and sharing mechanisms, to improve browsers. This is a major improvement over today's browsers, which employ an all-or-nothing trust relationship. Using a modified version of the Konqueror browser, we showed how XSS attacks can be contained using the <isolate> abstraction.
While these abstractions act as a defense against XSS and help build robust mashups, they still fall short in terms of managing browser resources and fault containment. Browsers need to act as de facto operating systems for executing the client-side components of web applications, providing isolation as well as methods of resource management and fault containment. Such an architecture could perhaps help combat emerging attacks, e.g., pharming and puppetnets [83]. Puppetnets employ large swathes of rogue websites which redirect requests from web clients to third-party websites, thereby creating a denial-of-service (DoS) like phenomenon. Pharming attacks can be used for local subnet fingerprinting. For example, a rogue website can include malicious JavaScript in its page to scan a local subnet behind a firewall and send the scan results back. To deal with such attacks, a complete operating-system-style resource management abstraction in the browser, with isolated memory, display and network resources and fault containment, seems essential and deserves further study. In such a browser, extraneous web connections opened on behalf of a website could be monitored and perhaps curtailed.
Chapter 6
On the Use of Anycast in DNS
There have been several targeted DDoS attacks on one or more of the thirteen DNS root servers [6]. Such attacks are significant because the root nameservers provide an important translation service, vital to the core functioning of the Internet. Therefore, an attack on the DNS fabric tends to take down the entire Internet, rather than specific websites as is normally the case. As shown in Chapter 2 of this thesis, botnets of today can include hundreds of thousands of nodes, distributed all over the world. Hence, protecting the Internet infrastructure against such large adversaries is important. Accordingly, to meet this robustness criterion, anycast is widely deployed in DNS today [84]. The IP addresses of many top-level DNS nameservers correspond to anycast groups. Client requests sent to these addresses are delivered by the Internet routing infrastructure to the closest replica in the corresponding anycast group. DNS operators have deployed anycast for a number of reasons: reduced query latency, increased reliability and availability, as well as resiliency to DDoS attacks. While it is generally agreed that the deployment of anycast in DNS has been a positive step, no studies have been done to evaluate the performance improvement offered by anycast. This chapter, which is drawn from our work [85], presents the first comprehensive study in this area.
Specifically, we aim to answer the following questions: (1) Do servers deploying anycast experience a smaller number of outages, and what is the duration of these outages? (2) How stable is the anycast server selection over time? (3) Does anycast reduce query latencies? To answer these questions, we performed a measurement study, using clients deployed over PlanetLab [86], to measure the performance characteristics of four top-level servers using anycast and compared them to a server not using anycast. In our study, we identified a set of different anycast deployment strategies that are currently used in practice. Thus, we monitored servers that represent different points in the anycast design space to compare the effects of these design choices. Specifically, we evaluate the effects of single vs. multiple anycast addresses for a zone and global vs. localized visibility of the servers in the anycast group. We also compared these servers against a hypothetical zone with the same number of nameservers but where all the nameservers are individually addressable. By doing so, we can directly compare anycast to the traditional zone configuration guidelines [87].
Our results can be summarized as follows: We found that for all the measured zones, and independently of the anycast scheme used, the deployment of anycast decreases average query latency and increases availability when compared to centralized servers. Furthermore, our study shows that while the number of query failures is relatively small (≤ 0.7%), outages are long in duration (≈30% last more than 100 seconds), affected by long BGP routing convergence times. Interestingly, we show that, even though outage duration is not affected by the anycast scheme, the frequency of outages relates to the scheme used, i.e., whether servers have local or global visibility. In addition, we identified that the anycast scheme determines the percentage of queries directed to the closest anycast instance. This value ranges from about 37% for servers with a few global nodes to about 80% for servers in which all nodes are global. We also uncovered an inherent trade-off between the effectiveness of anycast in directing queries to the nearest server and the stability of the zone itself. For servers that advertise all their anycast group members globally, clients choose the nearest server most of the time. The negative effect, though, is that in this case the zone becomes vulnerable to an increased number of network outages and server switches.
The rest of this chapter is structured as follows: We give a brief introduction to anycast in Section 6.1 and explain our measurement methodology in Section 6.2. Section 6.3 presents the servers used in this study and provides the rationale for choosing them. We present our results and compare the different anycast strategies in Section 6.4. In Section 6.5, we outline a novel technique for configuring anycast groups that maximizes redundancy and distributes load evenly among the members of the anycast group. Finally, we present related work in Section 6.6 and conclude in Section 6.7.
6.1 Background
Anycast, first described in [17], provides a service whereby a host transmits a datagram to an anycast address and the internetwork is responsible for delivering the datagram to at least one, preferably the closest, of the servers in the anycast group. The motivation behind anycast is that it simplifies service discovery: a host does not have to choose from a list of replica servers, offloading to the network the responsibility of forwarding the request to the "best" server.
Figure 6.1: Sample anycast configuration.
Since the benefits of anycast are largely derived from its implementation, we briefly review how anycast is currently implemented in the Internet. In Fig. 6.1, the two servers Le and Lw are members of the anycast group represented by address I. Each of these servers (or rather their first-hop routers) advertises a prefix that covers I¹ using BGP [88] to Ra4 and Ra3 in ASA. Each of these routers in turn propagates the advertisement to its iBGP peers Ra1 and Ra2. The process continues until the advertisements reach the egress routers Rb4 and Rb3 of ASB, where customers Ca and Cb are connected, respectively. Router Rb4 chooses the advertisement from iBGP peer Rb1 because the IGP distance to Rb1 is shorter than the distance to Rb2. This selection is usually called hot-potato routing because it causes packets to exit the provider's network as early as possible. The final effect of these choices is that packets from Ca follow the path Ca → Rb4 → Rb1 → Ra1 → Ra4 → Lw. Similarly, packets from Cb follow the right vertical path. It is evident from this description that the combination of BGP hot-potato routing inside an autonomous system and shortest-AS-path routing across autonomous systems results in choosing the closest anycast server, closest being defined in terms of IGP metric and AS hop length.
Operators can incorporate anycast into their DNS zones, i.e., the domains that are under their administration, in a number of ways. For example, the operator can use one or multiple nameserver addresses (NS records in DNS parlance), each with a different anycast address. Anycast prefixes can be globally advertised or their scope can be limited to the immediate neighboring autonomous systems. Servers whose advertisements are scoped are called local nodes while nodes with no scoping are called global nodes. Local nodes limit the visibility of their advertisements by using the no-export BGP attribute. Peers receiving advertisements with this attribute should not forward the advertisement to their peers or providers. Scoping is used to support servers with limited transaction and bandwidth resources and servers serving only local networks. Finally, the anycast prefix(es) can originate from a single AS, or the zone operator can be multihomed so that multiple ASes inject the prefix into the global BGP table.
In addition to the anycast address, each server in the anycast group has a unique unicast address. This address is mainly used for management purposes (e.g., zone transfers) and is selected from prefixes different from the prefix containing the anycast address. This ensures that the management interface is reachable even if the anycast prefix becomes unavailable (e.g., during a routing outage or a DDoS attack on the anycast address), since the routing path to the anycast address is different from the path to the unicast address. The importance of this fact will become clear in Section 6.4.4, where we investigate whether anycast leads clients to the closest server.
¹The prefix is usually a /20. This requirement emerges from the fact that advertisements for shorter prefixes are not propagated by the routing infrastructure, to reduce the size of the global routing table.
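The two routing rules that drive anycast server selection, shortest AS path across domains and hot-potato (lowest IGP metric) within a domain, can be sketched as a single preference comparison. This is a simplified model of BGP route selection for illustration; the route attributes and values are assumptions, and real BGP applies several additional tie-breaking steps.

```python
def best_route(routes):
    """routes: list of dicts with 'as_path' (list of AS numbers) and
    'igp_metric' (distance to the egress point inside the local AS).
    Prefer the shortest AS path; break ties with the lowest IGP metric
    (hot-potato routing)."""
    return min(routes, key=lambda r: (len(r["as_path"]), r["igp_metric"]))
```

In the Fig. 6.1 scenario, both anycast instances are reached over equal-length AS paths, so the IGP tie-break decides: traffic exits toward the nearer egress, which is why Ca's packets end up at Lw.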
6.2 Measurement Methodology
Our goal is to investigate the implications of using anycast in DNS and to compare the performance benefits of different anycast configurations. The two primary factors affecting the performance of anycast are: (I) the number and location of the anycast servers relative to the DNS clients, and (II) the anycast scheme used, specifically whether scoping is used and whether one or more anycast addresses are visible to the clients. To quantify the relative benefits of each of these factors, we used the following four types of server configuration in our measurements, each representative of a different point in the anycast design space:
(1) A server with one or more instances in a single geographic location: While this case does not use anycast, we use it as a base case to explore the potential performance improvements of using multiple geographically distributed servers. We chose the B-root nameserver as the representative of this category. (2) A server using a single anycast address for all its instances, with multiple instances in different locations: We used the UltraDNS servers, which are authoritative for the .org and .info top-level domains, as the representative of this category. UltraDNS servers are members of two anycast groups, TLD1 and TLD2, with all the instances being globally visible. (3) A server using a single anycast address for all its instances, with multiple instances in different locations, some globally visible and some scoped to a local region: To investigate the effects of the number and location of anycast group members on performance, we chose two different examples: the F-root nameserver and the K-root nameserver. (4) A set of geographically distributed servers, each individually accessible via unicast: We used this case to evaluate the quality of the routing paths provided by the network fabric connecting the anycast servers to their clients. To enable a direct comparison with anycast, we want to keep the number and location of the nameservers constant. To do so, we used the F-root example, but in this case clients send requests to the unicast addresses of the F-root group members. Each client maintains a list of all the servers ordered by latency and sends its queries to the closest server on its list. If a server becomes unavailable, the client tries the subsequent servers on its list until it receives a response.
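The client behavior in case (4) can be sketched as a simple latency-ordered failover loop. The server names and the send_query interface are stand-ins for the actual unicast DNS request logic used in the measurements.

```python
def query_with_failover(servers, send_query, timeout=2.0):
    """servers: list of (name, measured_latency) pairs. Sort by latency
    and walk down the list until a server replies; send_query returns
    None on timeout. Models the case-(4) unicast client."""
    for name, _latency in sorted(servers, key=lambda s: s[1]):
        reply = send_query(name, timeout)
        if reply is not None:
            return name, reply
    return None, None
```

For example, if the lowest-latency server is down, the client transparently falls through to the next-closest one.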
Because DNS clients have no control over where their queries are directed, we need clients in multiple locations to cover all the servers in an anycast group. For this reason, we used the PlanetLab [86] testbed for our measurements. We collected data from the PlanetLab nodes from September 19, 2004 to October 8, 2004. At the time of our measurements, there were approximately 400 nodes in PlanetLab, contributed by universities and research labs around the globe. The results presented in this chapter are based on measurements from approximately 300 active PlanetLab nodes. As we already mentioned, the client locations relative to the servers can potentially affect our measurements. Table 6.1 shows the distribution of PlanetLab nodes based on their geographic location.
We ran a script on every PlanetLab node to send periodic DNS queries to each of the DNS servers mentioned earlier. The query interval is selected uniformly at random from [25, 35] seconds. We used this interval to achieve sub-minute accuracy for the outage durations reported in Section 6.4.2. Our script records the query latency and the server name corresponding to the anycast instance answering the query. The script uses "special" DNS requests to retrieve the name of the server replying to a request sent to the anycast address ([89] shows the request type for F-root).
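The probe loop described above can be sketched as follows. The resolve and record callbacks stand in for the actual DNS query and logging code on each PlanetLab node, and the injectable sleep function is only there so the loop can be exercised without waiting.

```python
import random
import time

def measurement_loop(resolve, record, n_rounds, sleep=time.sleep):
    """One probe node: issue a query, record (latency, answering
    instance), then wait uniformly in [25, 35] s before the next round.
    resolve() is a stand-in for a DNS request that also returns the
    answering instance's name (e.g., via a special identity query)."""
    for _ in range(n_rounds):
        start = time.time()
        server_name = resolve()
        latency = time.time() - start
        record(latency, server_name)
        sleep(random.uniform(25, 35))
```

The randomized interval avoids synchronized probing across nodes while keeping the sampling granularity below one minute.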
Continent       % of PL nodes
South America   0.5
Australia       1.8
Asia            15.8
Europe          16.7
North America   65.2
Table 6.1: Distribution of used PlanetLab nodes around the world.
From a client perspective, DNS has to be always available and fast. To see how anycast contributes towards these end-user requirements, we compare the selected anycast deployment schemes based on the following criteria:
QUERY LATENCY
Reduction of end-user delay is an oft-quoted benefit of deploying anycast. To test whether this claim is true, we measure the latency of requests sent to the monitored servers and compare the results. Since anycast achieves this reduction through localization, we also calculate the percentage of DNS queries that are in fact routed to the nearest anycast instance.
AVAILABILITY
For a global infrastructure service such as the DNS, availability is a key issue. To evaluate the impact of anycast on availability, we measure the number and duration of outage periods. An outage period is a time during which clients receive no replies to their requests; during such periods, client name queries are not resolved. Since DNS requests and replies use datagrams, in case of a timeout we resend the request twice to differentiate between dropped packets and real DNS outages. The beginning of an outage period is marked by the consecutive loss of all three requests. The end of the outage period is marked by the receipt of the first answer from a DNS server. The difference between the end and the start of an outage period gives the length of the outage period.
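The outage bookkeeping above reduces to a small state machine over the probe log. The event representation is an assumption: each entry summarizes one probe, with replied=False meaning the original request and both retransmissions were lost.

```python
def outage_periods(events):
    """events: chronological (timestamp, replied) pairs, one per probe.
    An outage starts at the first probe whose three requests were all
    lost and ends at the next probe that receives an answer. Returns a
    list of (start, end) intervals."""
    outages, start = [], None
    for t, replied in events:
        if not replied and start is None:
            start = t          # all three requests lost: outage begins
        elif replied and start is not None:
            outages.append((start, t))  # first answer: outage ends
            start = None
    return outages
```

For a probe log sampled roughly every 30 seconds, two consecutive lost probes followed by a success yield a single outage of about one minute.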
CONSTANCY
Constancy measures the affinity of clients to a specific instance of the anycast group.
If a client switches from one instance to another, we say a flip has occurred. We use
the number of flips as a measure of constancy. We also calculated the amount of time
PlanetLab nodes are directed to the same server in the anycast group as a metric of the
stability of the anycast service. While constancy is not critical for DNS queries over UDP,
TCP transactions will be reset during server changes. Even though the percentage of TCP
transactions is small today, we expect it will increase in the future with the introduction
of DNSSEC [90]. Furthermore, these results can be extrapolated to provide an indication
of whether longer transactions, such as bulk transfers over TCP, would be affected by server
changes.
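As a minimal sketch of the constancy metric (our own illustration; the input format is an assumption), a flip is simply a change of answering instance between consecutive queries:

```python
def count_flips(servers):
    """servers: chronological list of anycast-instance identifiers that
    answered a client's successive queries. A flip is any change of instance."""
    return sum(1 for a, b in zip(servers, servers[1:]) if a != b)
```

For example, the sequence ["PAO1", "PAO1", "SFO2", "PAO1"] contains two flips.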
6.3 Anycast Deployment Strategies
In this section, we present the configuration of the monitored anycast servers and the
distribution of requests from the PlanetLab clients to each of these servers.
6.3.1 Multiple Instances, One site: B-Root
The B-Root server has 3 nodes (b1/b2/b3.isi.edu), all of which are located in Los
Angeles, CA. All the servers reside in the same network; therefore, this scenario is
representative of a multiple-instance, one-site server (case 1 in Sec. 6.2).
6.3.2 Multiple Instances, Multiple Heterogeneous Sites:
F,K-root
This is the case where an anycast server has multiple instances deployed in geographically
diverse sites, with some instances being globally visible and the rest being
scoped to their local region. For our measurements, we selected two anycast groups that
follow this configuration: the anycast groups of the F-root and K-root nameservers.
Table 6.2 gives a complete listing of all the F-root clusters at the time of our measurements.
This list is publicly available at [91]. One can see from Table 6.2 that a high percentage
(∼70%) of nodes are served by the F-root clusters PAO1 (Palo Alto) and SFO2 (San
Francisco). This is because these two clusters have been deployed as global nodes [92].
The rest of the clusters are visible locally and serve clients only within their communities.
Out of the 26 F-root clusters listed at the time of this study, PlanetLab nodes contacted only
16. The reason for this behavior is that the unreachable clusters have local scope and no
PlanetLab node is located within that scope; from outside the scope, the AS path to the
global node is shorter than the path to the local node, so the PlanetLab site routes
requests to the global node instead of the local one.
Like F-root, K-root consists of global and local nodes, albeit with a much smaller group
size. K-root consists of multiple clusters primarily concentrated in Europe. Table 6.3 gives
a complete list of K-root clusters and their reachability from PlanetLab nodes. Clusters
at Amsterdam and London have global visibility, while the remaining clusters have local visibility
Cluster  Location                      %
PAO1     Palo Alto, CA, USA            38.5
SFO2     San Francisco, CA, USA        32.1
MUC1     Munich, Germany               4.9
HKG1     Hong Kong, China              3.5
LAX1     Los Angeles, CA, USA          3.4
YOW1     Ottawa, ON, Canada            3.1
LGA1     New York, NY, USA             2.8
SIN1     Singapore                     2.0
TLV1     Tel Aviv, Israel              1.7
SEL1     Seoul, Korea                  1.6
SJC1     San Jose, CA, USA             1.5
CDG1     Paris, France                 1.1
YYZ1     Toronto, ON, Canada           1.1
GRU1     Sao Paulo, Brazil             1.0
SVO1     Moscow, Russia                0.8
ROM1     Rome, Italy                   0.7
AKL1     Auckland, New Zealand         -
BNE1     Brisbane, Australia           -
DXB1     Dubai, UAE                    -
JNB1     Johannesburg, South Africa    -
MAD1     Madrid, Spain                 -
MTY1     Monterrey, Mexico             -
TPE1     Taipei, Taiwan                -
CGK1     Jakarta, Indonesia            -
LIS1     Lisboa, Portugal              -
PEK1     Beijing, China                -
Table 6.2: List of the 26 F-root sites. The last column shows the percentage of PlanetLab
nodes served by each F-root cluster. An example of an F-root server name is
SFO2a.f-rootservers.net.
[93]. This explains the high percentage (∼97%) of PlanetLab nodes served by LINX
(London) and AMS-IX (Amsterdam).
Cluster  Location                  %
ams-ix   Amsterdam, Netherlands    51.6
linx     London, UK                46.7
denic    Frankfurt, Germany        0.9
grnet    Athens, Greece            0.7
mix      Milan, Italy              -
qtel     Doha, Qatar               -
isnic    Reykjavik, Iceland        -
Table 6.3: List of the 7 K-root sites.
6.3.3 Multiple Instances, Multiple Homogeneous Sites:
UltraDNS
This is the case where an anycast server has multiple instances in diverse geographic
locations, with all of them being globally advertised. We used two of the UltraDNS anycast
servers as representative cases of this type of configuration. We should point out
that while each of the instances could in principle peer with different ISPs, UltraDNS
happens to use the same ISP for all the instances of a given anycast server.
Due to the unavailability of a complete listing of UltraDNS clusters, we only consider
clusters that are reachable from PlanetLab nodes. The names of the anycast instances,
returned in response to specially constructed DNS queries, provide a hint to the cluster
name. For example, the name udns1abld.ultradns.net suggests that the server
belongs to the abld (London) cluster. The location of these clusters can then be extracted
from the corresponding airport codes that show up in traceroute. Table 6.4 gives a list of
all the UltraDNS clusters reachable from PlanetLab.
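The naming convention can be exploited mechanically. The sketch below is our own illustration based solely on the single example name in the text; the exact pattern of UltraDNS instance names (prefix `udns`, a digit, then a four-letter cluster tag) is an assumption:

```python
import re

def cluster_of(server_name):
    """Extract the cluster tag from an UltraDNS instance name such as
    'udns1abld.ultradns.net'. The udns<digit><4-letter-tag> pattern is
    assumed from the one example given in the text."""
    m = re.match(r"udns\d+([a-z]{4})\.ultradns\.net$", server_name)
    return m.group(1) if m else None
```

For example, `cluster_of("udns1abld.ultradns.net")` returns `"abld"`; names that do not match the assumed pattern return `None`.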
While the F- and K-root servers use a hierarchical setup of global and local nodes,
UltraDNS uses a flat setup, where BGP advertisements from all instances are globally visible
throughout the Internet. Thus, DNS requests are more evenly distributed across UltraDNS
clusters than across the F- and K-root clusters. Instances in Europe (abld) and Asia (eqhk)
serve a smaller percentage of nodes since fewer PlanetLab nodes are located on these
continents (cf. Table 6.1). Even though UltraDNS nodes respond to both the TLD1 and TLD2
anycast addresses, the distribution of client requests across TLD1 and TLD2 is totally
different. For example, while pxpa receives 23% of the queries for TLD1, it receives only
7% of the queries for TLD2. To understand this behavior, we investigated whether DNS
queries directed from the same client to the TLD1 and TLD2 anycast addresses are indeed
resolved by nodes belonging to the same UltraDNS cluster.
Cluster  Location              TLD1 (%)  TLD2 (%)
pxpa     Palo Alto, CA, USA    23.1      7.5
eqab     Ashburn, VA, USA      20.4      10.4
abld     London, UK            15.6      -
eqch     Chicago, IL, USA      15.1      7.1
pxvn     Maclean, VA, USA      8.8       37.8
isi      Los Angeles, CA, USA  8.3       18.6
eqsj     San Jose, CA, USA     4.5       18.6
eqhk     Tokyo, Japan          4.2       -
Table 6.4: The list of the 8 UltraDNS clusters reachable from PlanetLab.
For a given PlanetLab node PLn, we denote the lists of TLD1 and TLD2 clusters that PLn
contacts by vectors l1 and l2 respectively. We define the correspondence (or similarity)
between these lists of clusters as the normalized inner product of the l1 and l2 vectors.
A correspondence of one implies that the lists are the same, while a correspondence of zero
implies that the two lists are completely different. Intermediate values imply a non-empty
intersection. For example, assume that a given PlanetLab node contacts clusters pxpa and abld
for TLD1 name resolution and clusters pxpa and eqab for TLD2 name resolution; then
l1 = [1, 0, 1, 0, 0, 0, 0, 0] and l2 = [1, 1, 0, 0, 0, 0, 0, 0] (following the order of clusters used
in Table 6.4). The correspondence between l1 and l2 is then equal to:
in Table 6.4). The correspondence between thel1 andl2 is then equal to:
1 · 1 + 1 · 0 + 0 · 1 + 6 · 0 · 0√2 ·√
2=
1
2
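The correspondence metric is simply the cosine similarity of the two indicator vectors. A small sketch (ours, not code from the study):

```python
import math

def correspondence(l1, l2):
    """Normalized inner product (cosine similarity) of two 0/1
    cluster-indicator vectors of equal length."""
    dot = sum(a * b for a, b in zip(l1, l2))
    n1 = math.sqrt(sum(a * a for a in l1))
    n2 = math.sqrt(sum(b * b for b in l2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Two cluster lists that share exactly one cluster out of two each
# give a correspondence of about 1/2, as in the worked example above.
```

Identical lists score 1, disjoint lists score 0, and partial overlap falls in between.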
[Figure: histogram; x-axis: Correspondence; y-axis: Frequency]
Figure 6.2: Histogram of correspondence between TLD1 vs TLD2 clusters contacted by
PlanetLab nodes.
As Figure 6.2 depicts, the majority of PlanetLab nodes (>71%) have a correspondence
of zero, indicating that queries directed at the TLD1 and TLD2 anycast addresses
and originating from the same PlanetLab node are answered by different clusters. The benefit
of this configuration is that in the event of a network outage affecting one of the anycast
addresses, the other address can be used, thus ensuring uninterrupted DNS service.
The reason why PlanetLab nodes mostly pick different clusters for TLD1 and TLD2
name resolution is that UltraDNS uses two different carriers for the TLD1 and TLD2 BGP
advertisements. Data from Routeviews [94] and traceroutes from PlanetLab nodes to
tld1/tld2.ultradns.net reveal that traffic to TLD1 is mostly routed via ASN 2914 (Verio)
while traffic to TLD2 is mostly routed via ASN 2828 (XO Communications). This means
that UltraDNS uses Verio for advertising TLD1 and XO for TLD2. The use of two different
providers for TLD1 and TLD2 also explains why in Table 6.4 the clusters abld and eqhk
receive zero queries for TLD2: XO Communications has no peering points outside North America,
and so queries from PlanetLab nodes in Europe and Asia are routed towards a US cluster.
6.4 Evaluation
This section examines: (1) the query latencies for the monitored servers; (2) the
availability of the monitored servers; (3) the affinity of clients to the server they are
directed to; and (4) the percentage of clients not reaching the replica server that is
closest to them, and the additional delay incurred.
6.4.1 Response times
Table 6.5 presents the mean, median, and standard deviation of query latencies for the
monitored servers over the whole measurement period. Figure 6.3 shows the response
time CDFs of the various anycast schemes. The median provides a better indication of the
expected behavior, since it is not skewed by individual clients with very high latencies. The
first observation we can make from this table is that anycast provides a sizable reduction
in query latency compared to the B-root server. The only exception to this trend is K-root.
This is due to the fact that even though K-root has multiple servers, they are located in
Europe and the Middle East, while most of the PlanetLab nodes are in North America.
Second, TLD1 has the lowest latency, even though F-root has more deployed servers. The
reason is that only two of the F-root servers have global scope and therefore client requests
may have to travel to a server that is further away. On the other hand, UltraDNS does not
use scoping, and client requests are distributed among a larger set of geographically diverse
servers, leading to shorter round trip times. Furthermore, the median latency for TLD1 is
lower than that of TLD2 since the clusters abld and eqhk are not reachable via the TLD2
anycast address, as shown in Table 6.4. Therefore queries to TLD2 from clients in Europe
and Asia have to travel to the US.
The last two rows of Table 6.5 represent synthetic results derived from actual measurements.
The min{TLD1,TLD2} row represents the average query latency for clients that
direct their queries to the closer of the TLD1 and TLD2 servers. Remember
that UltraDNS, which is authoritative for the .org and .info top level domains, uses two
Nameserver            Mean (ms)  Median (ms)  Std. Dev. (ms)
F-Root                75         70           85
B-Root                115        95           121
K-Root                140        121          104
TLD1                  96         54           207
TLD2                  104        85           237
min{TLD1,TLD2}        69         51           173
Hypothetical unicast  45         35           13
Table 6.5: Statistics of DNS response times
anycast addresses for these domains' nameservers. So this row represents the best-case
scenario, where a client can measure the latency to each of the nameservers and subsequently
direct its queries to the closest one. Indeed, clients based on BIND 9 exhibit this
behavior [95]. The last row of Table 6.5 shows the average latency for the hypothetical zone
where all the F-root servers are directly accessible via their unicast addresses and clients
forward their requests towards the closest DNS server. The latency of this zone is lower
than F-root due to scoping. As we already mentioned, scoping leads clients to pick a server
that is further away, since the announcements from servers with local scope that are closer
than the global server do not reach them.
TLD1 and TLD2 exhibit the highest variance in response times across all measured
servers. This is due to two reasons: variability in the delay of the network paths and
variability in the load on the anycast server. As we already explained, UltraDNS anycast
addresses are globally announced. In Section 6.4.3 we show that this results in clients
experiencing a higher number of "flips" (i.e., server changes), and consequently higher
[Figure: response time CDFs for the various name servers (TLD2, TLD1, F-Root, K-Root,
B-Root, min{TLD1,TLD2}, hypothetical unicast); x-axis: Time (ms)]
Figure 6.3: Response time CDF.
fluctuation in DNS response times. We also noticed periods of intermittently high query
times followed by outages specific to the eqab cluster between Sep 30 and Oct 2 that
contributed to the high variability of TLD1 and TLD2.
6.4.2 Availability
Considering the reliance of most Internet applications on DNS, ensuring continued
availability is a prime requirement for top-level name servers. Figure 6.4 is a histogram
of the percentage of queries which were unanswered by the monitored nameservers. As we
mentioned in Section 6.2, we retry individual unanswered queries twice; therefore, the
results presented here indicate queries lost due to network and server outages rather than
random packet loss.
For all the measured servers, the average percentage of unanswered queries is low (≤
0.9%). At the same time, the benefit of deploying servers in multiple locations is evident
from the fact that all anycast schemes perform better than B-Root. This is to be expected,
since robustness generally increases with geographic diversity. This is also the reason why
F-Root has a smaller percentage of unanswered queries compared to K-Root, even though both
of these servers use the same anycast scheme. There is, however, large variation between
the availability of the different anycast schemes, with F-root having overall half the losses
of TLD1.
[Figure: bar chart per zone (F-Root, K-Root, TLD1, TLD2, B-Root)]
Figure 6.4: Percentage of unanswered queries by various servers.
We use the term "outage" to indicate a window of time when a node is unsuccessful
in contacting its DNS server. Figure 6.5 plots the CDF of the duration of outages for the
different servers. The first observation from the graph is that all outages last at least 20
seconds, because of the time granularity with which we send DNS requests. Second, outages
for the hypothetical unicast server have the shortest duration. Indeed, some (20%-30%) of
the PlanetLab nodes experienced no outages. The maximum outage time is around 100 seconds,
indicating that in the worst case, a client will get a response after contacting at most three
servers. At the same time, the mean outage duration is approximately 40 sec, two to three
times shorter than for the other servers. The min{TLD1,TLD2} combined nameserver enjoys
the same benefit of shorter outage periods, since clients can switch from a failed server in
one of the anycast addresses to a server in the other address. All the real-world anycast
deployments exhibit a similar distribution of outage times, with F-root having the longest outage
periods. This reveals an interesting fact regarding anycast. Since anycast relies on Internet
routing, once an outage has occurred the recovery time is governed by the recovery time
of the network routing fabric. In fact, ≈30% of the outages last more than 100 seconds.
This is a direct consequence of the results presented by Labovitz et al. regarding delayed
network convergence [96]. The outage recovery time is largely independent of the anycast
scheme used.
[Figure: outage duration CDFs for the various name servers (TLD2, F-Root, TLD1, K-Root,
B-Root, hypothetical unicast, min{TLD1,TLD2}); x-axis: Seconds]
Figure 6.5: CDF of outage duration.
It appears counter-intuitive that F-root can have the smallest percentage of lost queries
and at the same time the longest-duration outages. However, outage duration is only
one part of the picture. It is also important to note the inter-outage interval and the number
of outages which occur per server. Figure 6.6 shows the inter-outage intervals, that is, the
amount of time between successive outages experienced by the same client. The findings
from this graph are encouraging, as they show average inter-outage periods on the order of
days. At the same time, inter-outage periods for TLD1 and TLD2 are shorter than those for
F-root. This finding is supported by Figure 6.7, depicting the average number of outages
per day aggregated over all the clients. One can see that TLD1 and TLD2 experience five
to eight times more outages than F-root. The reason why TLD1 and TLD2 have a higher
percentage of unanswered queries, even though the duration of their outages is shorter, is
that outages occur more frequently, giving a larger total number of unanswered queries.
[Figure: inter-outage time CDFs for the various name servers (F-Root, K-Root, TLD1, TLD2,
B-Root); x-axis: Time (min)]
Figure 6.6: CDF of inter-outage duration.
While at this point we don't fully understand why UltraDNS experiences more outages
than F-root, we conjecture that this is due to two reasons. First, all UltraDNS clusters
are global. As a result, clients follow a more diverse set of paths to reach their servers and are
therefore more exposed to BGP dynamics when links fail. Second, TLD1 and TLD2 are
single-homed while F-root is multi-homed. As a result, if the first-hop ISP of TLD1 fails,
all TLD1 clusters become unavailable. On the other hand, since F-root is multi-homed, the
impact of any single ISP failure on the overall availability is smaller.
[Figure: bar chart of the average number of outages per day, per zone (F-Root, K-Root,
TLD1, TLD2, B-Root)]
Figure 6.7: Number of outages observed by various servers.
6.4.3 Constancy
There is no guarantee that packets from a client will be consistently delivered to the
same anycast group member. As a matter of fact, given the implementation of anycast
outlined in Section 6.1, one expects that destinations will change over time as routing adapts
to network changes. In this section we present our findings on server switches (or flips) for
the monitored anycast servers. We classify flips into two categories: inter-cluster and
intra-cluster. An inter-cluster flip happens when consecutive client requests are directed to two
different geographic clusters and is due to BGP changes. Each of these clusters contains
multiple DNS servers, and an intra-cluster flip happens when the same client is directed to
different members located inside the same cluster. Intra-cluster flips are due to local load
balancing at the anycast cluster. As we saw in Sec. 6.4.1, the rate of flips affects the query
latency variance. Delay consistency is more sensitive to inter-cluster flips than intra-cluster
ones, because inter-cluster flips involve a change of transit route, and different routes may
have widely different delay characteristics.
[Figure: bar chart of flip percentage per nameserver (F-Root, K-Root, TLD1, TLD2)]
Figure 6.8: Number of flips observed as a percentage of the total number of queries sent to
each nameserver.
Figure 6.8 provides a histogram of the number of inter-cluster flips observed for the various
servers. Inter-cluster flips in anycast deployments using global and local servers mostly
occur between the global servers. The majority of the flips (>90%) for F-Root are between
the PAO and SFO global clusters, and for K-Root between AMS and LINX. Furthermore,
the total number of inter-cluster flips observed for the F-Root and K-Root nameservers is
20% lower than for TLD1 and TLD2. We believe the reason for this is that all UltraDNS
anycast clusters are globally visible, while the majority of the F-Root and K-Root ones are local
clusters. Therefore, at a client gateway, BGP paths to a greater number of UltraDNS clusters
are available compared to F-root clusters. Hence, UltraDNS server selection is more prone
to BGP changes (due to path failures).
Nameserver  Flips linked to an outage (%)
F-Root      65
K-Root      63
TLD1        52
TLD2        51
Table 6.6: Percentage of flips due to outages.
Flips and outages can often be correlated. For example, on Sep. 21st (cf. Fig. 6.11), a
considerable number of PlanetLab nodes faced outages in the service from the SFO cluster
of the F-Root server. After a brief outage of over a minute, service resumed with nodes
contacting the PAO cluster for F-root name resolution instead. Similarly, on Sep. 27th, for
the K-Root server, all PlanetLab nodes using the AMS cluster experienced an outage. After
a brief interval spanning over two minutes, all these nodes flipped to the LINX cluster.
However, flips need not necessarily occur immediately after outages. To investigate how
strongly flips are correlated with server outages, we counted the number of flips that are
linked to an outage. When a client flips to a different server after the server it was using
becomes unavailable, and later flips back to the original server, we say that these two flips
are related to the server outage. As Table 6.6 shows, in the case of the TLD1 and TLD2
UltraDNS servers, the occurrence of flips and outages are related to a lesser extent. Since
UltraDNS clusters are all global nodes, flips are more frequent and half of the time occur
independently of outages. We believe two causes are behind the remaining flips: path
changes in Internet routing and path failures recovered by the routing infrastructure within
the inter-query interval (25-35 seconds).
The percentage of flips across all the servers is very small, indicating that they offer
a stable service. We are also interested in the time that PlanetLab nodes remain with
the same server. We found that there is a range of 5 orders of magnitude in this metric!
As Figure 6.9 illustrates, while the mean time a node remains with the same server is
around 100 minutes, the lowest 10% of the nodes change servers every 1 minute, while the
most stable clients consistently choose the same server for days or weeks. This behavior is
evidence that a small number of network paths are very stable, while most other paths suffer
from outages, and a small percentage of paths have a pathological number of outages.
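The stability times above can be derived from a client's chronological server observations: each run from the first sighting of a server to the first flip away from it is one stability period. A minimal sketch (ours; the observation format is an assumption):

```python
def stability_periods(observations):
    """observations: chronological (timestamp_minutes, server) pairs from one
    client. Returns the lengths, in minutes, of maximal same-server runs
    (the final, still-open run is not counted)."""
    periods = []
    run_start, current = observations[0]
    for t, server in observations[1:]:
        if server != current:               # a flip ends the current run
            periods.append(t - run_start)
            run_start, current = t, server
    return periods
```

For example, a client seen on server A at minutes 0 and 30, on B at minute 60, and back on A at minute 180 yields stability periods of 60 and 120 minutes.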
[Figure: CDF of stability time (F-Root, K-Root, TLD1, TLD2); x-axis: Stability time (minutes)]
Figure 6.9: Period of time that PlanetLab nodes query the same server, for the monitored
servers.
Furthermore, for the servers that use global and local nodes (i.e., F- and K-root), we
investigated whether global or local nodes offer the more stable service. As Figure 6.10 indicates,
global nodes are more prone to switches, as we already mentioned. We believe the reason
for this behavior is that the network paths to global nodes are longer and therefore more
prone to BGP dynamics.
[Figure: cluster-wise CDF of stability time (F-Root local, F-Root global, K-Root local,
K-Root global); x-axis: Stability time (minutes)]
Figure 6.10: CDF of the cluster stability of F-Root and K-root.
Until now we have only discussed the wide-area load balancing aspect of anycast and
how it is affected by BGP route changes. Load balancing also occurs inside clusters, to
distribute queries among the individual servers that make up the cluster. F-root uses
IGP-based (OSPF) anycast for load balancing [89], but other configurations could use hardware
load balancers. Load balancers use either a per-packet or a per-flow mechanism. To
discover the load balancing scheme used by the nameservers, we use to our advantage the fact
that each PlanetLab site contains multiple nodes. These nodes can be expected to contact
the same anycast cluster. The similarity between the anycast servers contacted by nodes
of a single site provides a hint to the type of load balancer used within each cluster. A large
correlation between the servers contacted by the nodes of the same site indicates a per-packet load
[Figure: (a) timeline of total hourly F-root outages; (b) timeline of total hourly F-root
inter-cluster flips; x-axis: Date, Sep/20 to Oct/08]
Figure 6.11: Correlation of outages and flips for the F-root server. A similar correlation
was observed for the K-root server.
balancer (given a round-robin load-balancing scheme, we expect that packets from each
client will be sent to all the servers inside the DNS cluster). On the other hand, low
correlation indicates flow-based load distribution (a common technique in which the load
balancer hashes the clients' source addresses). Using this technique we discovered that all
the candidate nameservers used a flow-based technique, except for the B-Root server, which
used a per-packet load balancer. We observed that the B-root server faced a flip roughly
every half minute. This is typical of a per-packet load balancing technique, where successive
data packets are sent to different servers without regard for individual hosts or user
sessions. The other servers experience a negligible number of intra-cluster flips. Even in the
hash-based flow sharing case, intra-cluster flips may occur due to variations such as OSPF
weight changes or equipment failures. In general, flow-based hashing is preferred over
per-packet load balancing, as it consistently directs packets from a single client to the same
cluster member.
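A much-simplified variant of this inference (our own heuristic sketch, not the analysis actually used in the study; the input format and the threshold are assumptions) looks at how many distinct intra-cluster servers each co-located node sees: a per-packet round-robin balancer spreads one client's packets over many servers, while flow hashing pins each client to one:

```python
def infer_balancer(per_node_servers):
    """per_node_servers: {node_name: chronological list of intra-cluster
    server IDs seen in replies}. Returns a guess at the balancing scheme."""
    distinct = [len(set(servers)) for servers in per_node_servers.values()]
    avg = sum(distinct) / len(distinct)
    # A threshold of 1.5 distinct servers per node is an arbitrary illustration.
    return "per-packet" if avg > 1.5 else "flow-based"
```

Nodes that each cycle through several backend servers suggest per-packet balancing, as observed for B-Root; nodes each pinned to a single server suggest flow hashing.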
6.4.4 Effectiveness of Localization
As our earlier results indicate, anycast decreases query latencies by localizing client
requests amongst the various DNS server replicas. However, comparing the F-root query
latency to that of the hypothetical zone where all the servers are individually addressable
(Table 6.5) suggests that anycast does not always pick the closest server. This
raises an interesting question: Does anycast always lead clients to the closest instance
among all the servers in the anycast group? If not, how much farther away is the selected
server compared to the closest? Anycast server selection depends on the path selected
by BGP. These routing decisions are influenced by policies and sub-optimal heuristics, such
as using the path with the shortest AS hop count, and can therefore lead to suboptimal
choices. In fact, it is well known that in many cases the paths chosen by BGP are not the
shortest [97, 98].
An Optimistic Estimate: Directly comparing the query times of requests sent to the
unicast addresses of all the anycast group members to the query time of requests sent
to the server selected by anycast is potentially flawed, for a subtle reason. As we pointed
out in Section 6.1, the unicast addresses of the DNS servers are selected from address
ranges that are different from the one used for anycast. Therefore the path from a client to
the anycast address can be different from the path to the unicast address of the same server.
We use the following technique to get around this difficulty. Our technique is based
on the fact that if traceroutes from a client to the last-hop router and to the anycast address
follow the same path, we can obtain a good approximation of the round trip times incurred
by a client query to each of the different clusters by using the round trip time to the
last-hop router instead. Using traceroutes from the PlanetLab nodes, we found that this was
indeed the case for the F-Root and TLD2 servers, but not so for TLD1 and K-Root.
Figure 6.12 presents the additional network latency incurred by clients following the path
to the server selected by anycast over the path to the closest server. One can see that in both
cases the majority of the anycast queries contact their nearest cluster. About 60% of all the
F-Root requests are sent to the nearest F-root cluster and 80% of the TLD2 requests are sent to
the nearest TLD2 cluster. It must, however, be noted that this is an upper bound on the
optimality of the anycast path choice for F-root, as not all the anycast clusters are visible to
the PlanetLab nodes (cf. Table 6.2).
[Figure: CDF of additional round trip time (F-Root, TLD2); x-axis: RTT (ms)]
Figure 6.12: Additional round trip time for client queries to the anycast-selected F-root
and TLD2 servers over the closest servers.
A Pessimistic Estimate: We also measured the effectiveness of localization using
another approach, which yields a lower bound on the effectiveness of localization. First,
we calculate the geographic distance of each of the PlanetLab nodes to all the listed DNS
clusters in a zone. We do so by calculating the length of a hypothetical straight line over
the globe connecting the geographic locations of the PlanetLab node and the DNS server.
The locations of PlanetLab nodes are available through the PlanetLab website. Then, we
compare these geographic distances and determine whether the PlanetLab node contacts
the geographically closest server in that zone. While it is known that Internet paths are
longer than the direct geographic path connecting two end-points [98, 99], we assume that
all paths exhibit the same path inflation factor. Based on this assumption, we can directly
compare geographic distances to determine whether the best Internet path is selected for
each client.
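The straight-line-over-the-globe distance above is a great-circle distance; the haversine formula is one standard way to compute it. The sketch below is our own illustration (the cluster coordinates in the usage note are approximate), and it also picks the geographically closest cluster for a node:

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres between two
    (latitude, longitude) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_cluster(node, clusters):
    """node: (lat, lon); clusters: {name: (lat, lon)}.
    Returns the name of the geographically closest cluster."""
    return min(clusters, key=lambda c: great_circle_km(*node, *clusters[c]))
```

For a node near Palo Alto and approximate coordinates for the PAO1 and LGA1 clusters, `nearest_cluster` picks PAO1, mirroring the nearest-server check described above.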
[Figure: CDF of additional distance traveled (F-root, TLD1, TLD2, K-root); x-axis:
Additional Distance (ms)]
Figure 6.13: Additional distance over the optimal traveled by anycast queries to contact
their F-root, K-root, TLD1 and TLD2 servers.
Figure 6.13 presents the cumulative distribution of the additional distance across all
PlanetLab nodes for each zone. We observe that about 37% of all the anycast requests are
sent to the nearest F-root server while 35% of the anycast requests are sent to the nearest
K-root server. Approximately 75% of requests are served by the nearest TLD1 and TLD2
servers. In fact, the CDF for TLD2 closely matches that in Figure 6.12. However, this
is not the case for F-Root, because not all the clusters are visible from PlanetLab and
consequently not accounted for in Figure 6.12.
Using these two estimates, we can conclude that the effectiveness of localization is
between 37% and 60% for F-Root, ≥35% for K-Root, ≥75% for TLD1, and between 75% and 80%
for TLD2. It is not surprising that the TLD1 and TLD2 zones perform
considerably better than the other deployments. Not only does a larger portion of nodes contact
the closest server, but the additional distances for those that do not are also shorter. The reason is
that UltraDNS clusters are not differentiated into global and local. Consequently, PlanetLab
nodes have visibility to a greater number of BGP routes to UltraDNS clusters. Therefore,
it is more likely that anycast chooses the nearest UltraDNS cluster. In a somewhat
counter-intuitive way, the slowest 10% of TLD1 clients follow worse paths compared to TLD2,
even though TLD1 is advertised from two additional locations (London, Tokyo). We explain
this behavior with an example. Consider a client in Asia. If it does not pick the HK site
for TLD1, its requests are directed to the US; hence the large additional distance. TLD2,
on the other hand, is not advertised locally from HK, and therefore clients correctly pick the
US sites. The inverse effect is visible for K-Root: clients do not traverse large additional
distances compared to the closest cluster, because all of its clusters are located
within a relatively small geographical area.
6.4.5 Comparison of Deployment Strategies
Our study shows that the existing anycast configurations can be categorized into two
schemes: hierarchical and flat. The hierarchical scheme distinguishes anycast nodes into
local and global, while in the flat scheme all the nodes are globally visible. Anycast servers
in the flat configuration tend to have a more uniform distribution of load. Also, since a
client has a greater diversity of available anycast servers to choose from, the distance between
clients and DNS servers is generally shorter, as seen in Section 6.4.4. Consequently, the
majority of clients also have low query latency, as reflected in the low median query times
of the TLD1 and TLD2 anycast servers in Section 6.4.1.
However, in Section 6.4.2 we show that the flat scheme is more prone to outages. Even
though the outage durations follow a similar distribution for both schemes, given that
outage duration is a function of the BGP convergence time, the frequency of outages is
lower for the hierarchical scheme. That is possibly because, in the case of the flat
scheme, more instances are globally visible in the routing tables, and thus they can
potentially undergo path changes triggered by other network events. Furthermore, in
Section 6.4.3 we show that a large advertisement radius has an adverse effect on the
stability of response times and increases the frequency of server changes (flips) of the
anycast service. This is because the larger the advertisement radius, the greater a
server's sphere of influence, which in turn increases the number of server choices
available at a client.
We believe that an ideal anycast scheme would involve deploying a small number of
global nodes accompanied by a larger group of local nodes. The radius of advertisement
of the local nodes can be dynamically varied in order to maintain a minimum degree of
redundancy and fast failover. In the section that follows, we sketch how such a dynamic
scheme could be implemented and evaluate its performance via simulations.
6.5 Effect of Advertisement Radius
Here, we investigate the effect of varying a server's advertisement radius on the load
it receives and on anycast query latency. We simulate the AS-level topology of the
Internet using the connectivity data available from Route Views [94]. Server placement is
based on the actual placement of F-root servers available from [91]. An initial
advertisement radius is assigned to each of the servers. If a server has a radius of r,
then its prefix advertisement is visible r AS hops from the origin AS of this server.
Finally, we position 200 clients randomly across the set of all autonomous systems. While
we understand that this setup is not a true representation of the distribution of DNS
clients over the Internet, it nevertheless serves our goal of studying the effect of
advertisement radius on the load experienced by the servers. Each client selects the
server with the shortest AS path among all the visible paths.
Algorithm 1: Radius adjustment algorithm
(1) Radius[1 … Num_of_servers] ← 5
(2) Calculate Redundancy
(3) S ← server with maximum load
(4) while Redundancy ≥ 1 do
(5)     Radius[S] ← Radius[S] − 1
(6)     Calculate load on each server
(7)     Calculate Redundancy
(8)     S ← server with maximum load
We use the term “redundancy” to denote the minimum number of servers reachable by any
client. At the beginning of the simulation, we fix the radius of all servers to be equal
to a sufficiently large value (we used an initial radius of five). We then gradually
reduce the radius of the server with the maximum load, thus confining the server to serve
smaller communities, using Algorithm 1. We iterate this process until there exists some
client which is outside the sphere of influence of all the servers, i.e., it has a
redundancy of zero.
Figure 6.14 plots the load on the maximally loaded server as a function of the average
radius. Initially, when each server has a radius equal to 5, every client can reach at
least ten servers, while the busiest server serves 80% of the traffic. As we decrease the
radius of this server, its load decreases until another server becomes the maximally
loaded server.
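The greedy radius-reduction loop can be sketched in a few lines of Python. This is an
illustrative toy, not the simulation code used in this chapter: the distance matrix is a
hypothetical stand-in for the Route Views AS-hop data, a server is taken to be visible
when a client's AS-hop distance is at most the server's radius, and this variant stops one
step early so that it returns the last configuration in which every client still reaches
at least one server.

```python
def redundancy(dist, radius):
    """Minimum number of servers visible to any client. dist[c][s] is the
    AS-hop distance from client c to server s; server s is visible to c
    when dist[c][s] <= radius[s]."""
    return min(sum(1 for s, r in enumerate(radius) if row[s] <= r)
               for row in dist)

def loads(dist, radius):
    """Each client picks its nearest visible server; return request counts."""
    counts = [0] * len(radius)
    for row in dist:
        visible = [s for s, r in enumerate(radius) if row[s] <= r]
        if visible:
            counts[min(visible, key=lambda s: row[s])] += 1
    return counts

def adjust_radii(dist, init_radius=5):
    """Shrink the busiest server's radius while every client can still
    reach at least one server (redundancy >= 1)."""
    radius = [init_radius] * len(dist[0])
    while True:
        busiest = max(range(len(radius)), key=lambda s: loads(dist, radius)[s])
        trial = radius[:]
        trial[busiest] -= 1
        if redundancy(dist, trial) < 1:
            return radius  # stop just before some client loses all coverage
        radius = trial

# Toy instance: 3 clients, 2 servers, dist[c][s] in AS hops.
print(adjust_radii([[1, 3], [3, 1], [2, 2]]))
```

Each accepted iteration shaves one hop off some server's radius, so the loop terminates
once any further reduction would leave a client with no visible server.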
Figure 6.14: Variation of server load with varying server advertisement radius for a
random distribution of 200 clients. Redundancy is denoted by R.
Intuitively, as the average advertisement radius decreases, the maximum number of clients
served by a single host also decreases, thereby distributing excess load to other servers.
In the limit, the most heavily loaded server receives about three times the optimal load
(i.e., the load if clients were evenly distributed across servers). Based on this graph,
we can see that an ideal operational region exists where the maximum server load is low
while redundancy is greater than one. While this result is encouraging, it indicates that
an adaptive mechanism is needed to minimize server load while keeping adequate redundancy
levels. As far as we know, zones employing the global/local hierarchy do not use such a
mechanism today.
We also calculated the average path length as a function of the radius, presented in
Figure 6.15. Initially, clients have to travel a distance of approximately two ASes to
reach their closest server. However, as the average radius decreases, the path length
increases and consequently query latency also increases. The step-wise increase in path
length shown in Figure 6.15 is due to the nature of the Internet graph. A small number of
ASes have extremely high degree and very short distances to the majority of the other
ASes [100]. As long as the radius r of a DNS server located in one of these “hub”
autonomous systems is higher than its distance d from a client, that client is directed
to this server. When r < d, the client is directed to a more distant server and thus the
average path length increases.

Figure 6.15: Variation of average AS path length with change in the radii of the servers
for a random distribution of 200 clients.
6.6 Related Work
A number of existing studies have looked at the performance of the DNS infrastructure.
Danzig et al. presented measurements related to DNS traffic at a root name server [101].
Their main result was that the majority of DNS traffic was caused by bugs and
misconfigurations. Most of these problems have been fixed in recent DNS servers. Anycast
was not yet used for DNS name resolution back then. More recently, Brownlee et al.
monitored the DNS traffic from a large campus network and measured the latency and loss
rate of queries sent to the root nameservers [102]. Their main goal was to create a model
of DNS
request/response distribution. Our results on average latencies and loss rates match those
presented in that study. Interestingly, the authors of [102] observed that query times
show clear evidence of multipathing behavior and conjectured that this is due to load
balancing or changes in server load. Anycast, at the BGP level and within a cluster, is a
key cause of this observed multipathing. Pang et al. [103] measured the availability of
individual DNS authoritative and caching servers, and studied the different server
deployment strategies. The authors of [104] present some early results on their DNS
anycast stability experiment using a large number of vantage points on the Internet. While
this is probably the closest peer related work, and our results generally agree, we focus
on the different anycast deployment strategies and how they affect the performance of
anycast as observed from points spread around the Internet.
Jung et al. measured the performance of all DNS requests sent from the MIT campus and
investigated the effect of caching on DNS performance [105]. Wessels et al. compared the
effect of different caching techniques on the root nameservers [95]. Their results show
that some caching servers favor nameservers with lower round-trip times while others do
not. This indicates that the use of anycast benefits at least some resolvers, since it
transparently leads them to (approximately) the closest instance. On the other hand,
resolvers that actively select the closest DNS server would see a performance benefit if
the unicast addresses of the servers were exposed, as we showed in Section 6.4.1.
The effectiveness of anycast in providing redundancy and load sharing has been exploited
in a number of proposals. The AS112 project reduces unnecessary load on root
nameservers by directing queries for local zones to a distributed black hole implemented
via anycast [106]. The use of anycast has also been proposed for finding IPv6-to-IPv4
gateways [107] and to implement sink holes for the detection and containment of worm
activity [108]. Engel et al. provide results from their measurement of load
characteristics on a set of mirror web sites using anycast [109]. Hitesh et al. present a
scalable design for anycast and use a small subset of the PlanetLab nodes to measure the
affinity of existing anycast deployments [110]. While this work has some similarity to
ours, their focus is on the design of an anycast scheme. Finally, a number of proposals
have looked at alternatives to the existing DNS architecture with the goal of improving
query performance [111, 112].
6.7 Summary
In this chapter, we presented an analysis of the impact of anycast on DNS based on the
measurement of five top-level servers. We found that, overall, the deployment of anycast
is beneficial for the DNS infrastructure, since it decreases the average query latency and
increases the availability of the DNS servers. However, our study shows that while the
number of outages is relatively small, some of them are long in duration (≈ 30% last more
than 100 seconds), being affected by BGP routing convergence times. Moreover, we
identified two different anycast schemes currently deployed in DNS, and we show that these
different deployment strategies play a key role in determining the optimality and
robustness of anycast. Finally, we uncovered a trade-off, in which increasing the number
of globally
visible nodes increases the percentage of queries being directed to the closest cluster,
but at the same time destabilizes the service offered, in terms of increased server
switches and unanswered queries.
While this trade-off is clear from the results presented here, we do not fully understand
the underlying mechanisms that connect the scope of BGP advertisements, the rate of flips,
and the duration of outages. Doing so would require access to the BGP advertisements at
each monitoring point, which were unfortunately unavailable. We are currently developing a
theoretical model for the effect of link failures on service outages that we plan to
validate via simulations. We believe that this model, coupled with access to the actual
BGP advertisements, will provide deeper insight into the operation of anycast and the
trade-offs involved.
Acknowledgements
Joe Abley graciously responded to our queries regarding the implementation of anycast in
the F-root servers. We would also like to thank Lixia Zhang, Claudiu Danilov and
Alexandros Batsakis for their valuable comments.
Chapter 7
On the Effect of Router Buffer Sizes on
Low-Rate Denial of Service Attacks
Internet routers employ queues to buffer packets during periods of congestion. Until
recently, the size of buffers for TCP-dominated links was determined using the rule of
thumb proposed by Villamizar et al. in [113]. According to this rule, the size B of a
buffer is given by B = RTT × C, where RTT is the average round-trip time of the flows
traversing the link and C is the link capacity. While this rule of thumb was widely
accepted, Appenzeller et al. recently showed, based on TCP flow de-synchronization
dynamics, that queue size can actually be reduced without sacrificing utilization [19].
Given N flows, they show that a buffer of size B′ = (RTT × C)/√N suffices to maintain
utilization close to 100% for drop-tail queues. Since this result depends primarily on the
de-synchronization of TCP flows sharing the same queue, it is believed to extend to other
queuing schemes such as
RED [114].
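As a quick numeric illustration of the two sizing rules (a sketch, not from the
dissertation's simulations; the example parameters are chosen to match the OC-3 scenario
used later in Section 7.3.2):

```python
import math

def buffer_classic(rtt_s, capacity_bps):
    """Rule-of-thumb buffer size B = RTT * C, in bits."""
    return rtt_s * capacity_bps

def buffer_small(rtt_s, capacity_bps, n_flows):
    """Small-buffer rule of [19]: B' = RTT * C / sqrt(N), in bits."""
    return rtt_s * capacity_bps / math.sqrt(n_flows)

# Example: a 155 Mbps (OC-3) link, 250 ms average RTT, 250 long-lived flows.
rtt, cap, n = 0.250, 155e6, 250
print(buffer_classic(rtt, cap) / 1e6)   # ~38.75 Mb under the classic rule
print(buffer_small(rtt, cap, n) / 1e6)  # ~2.45 Mb under the sqrt(N) rule
```

For this link the √N rule shrinks the buffer by a factor of √250 ≈ 15.8, which is exactly
the reduction whose security implications the rest of the chapter examines.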
RED was the first in a series of Active Queue Management (AQM) schemes that use increases
in queue size to detect incipient congestion before the queue becomes full. Subsequent
extensions to RED, e.g., RED-PD [21], attempt to achieve a fair allocation of resources
among potentially selfish or malicious flows sharing the same link. Malicious flows may
violate the TCP congestion control algorithm in order to selfishly maximize their
throughput or cause denial of service (DoS) attacks, thereby minimizing the throughput
received by TCP flows sharing the same link. Since the majority of AQM schemes maintain
partial flow state for reasons of scalability, larger buffer sizes translate to more
accurate per-flow statistics and therefore a higher probability of detecting misbehavers.
This brings us to the main question we address in this chapter: while buffer size can be
reduced without affecting link utilization, does this reduction make the detection of
misbehavers harder? To test the vulnerability of smaller buffer queues to misbehaving
sources, we use a recently proposed class of DoS attacks called shrews [20]. These
malicious flows send short periodic bursts of traffic trying to fill up the buffer and
force TCP timeouts, thus throttling the throughput of TCP flows. We chose this type of
attack because shrews are difficult to detect due to their low average sending rate.
We use a mathematical model to show that smaller queues are indeed vulnerable to shrew
attacks. However, increasing the buffer to B′′ = mB′, with m ≪ √N, is sufficient to drive
the shrews' average transmitting rate required to cause a DoS attack considerably higher
than the min-max fair rate. When this happens, shrews can be detected by an AQM
scheme such as RED-PD and consequently penalized, without affecting compliant flows. We
validate our analysis using simulations in two different scenarios: (a) a 10 Mbps link
shared by 20 flows, and (b) a 155 Mbps link shared by 250 flows.
The rest of this chapter is structured as follows: We briefly introduce shrew attacks in
Section 7.1. Section 7.2 provides a mathematical analysis of the effect of increasing
buffer size on the sending rate of the shrews. Validation of the analysis through
simulations is shown in Section 7.3. Related work is presented in Section 7.4 and we
conclude in Section 7.5.
7.1 The Shrew Attack
We begin with a brief description of the shrew attack; a detailed discussion on shrews can
be found in [20]. Consider a bottleneck link shared by a large number of TCP flows. A
low-rate shrew DoS attack is a periodic burst of traffic (e.g., a square-wave pattern)
such as the one shown in Fig. 7.1. The shrew transmits at a high rate of P bps for a short
period of time l sec. For the rest of the time, it transmits at a much lower rate (almost
zero). This behavior repeats with a period of T sec. The average rate of a typical shrew
is given by P · l/T. Since the ratio l/T is small, the shrew appears to be a well-behaved
flow over larger timescales, thus evading detection.
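The square-wave pattern and its deceptively low average rate can be sketched as follows
(an illustrative sketch; the parameter values are the ones used later in Section 7.3.1,
not part of this description):

```python
def shrew_rate(t, peak_bps, burst_s, period_s):
    """Instantaneous sending rate of a square-wave shrew at time t (seconds):
    peak rate P during the first l seconds of each period T, near zero after."""
    return peak_bps if (t % period_s) < burst_s else 0.0

def average_rate(peak_bps, burst_s, period_s):
    """Average rate P * l / T over one period."""
    return peak_bps * burst_s / period_s

P, l, T = 10e6, 0.2, 1.2  # 10 Mbps peak, 200 ms burst, 1.2 s period
print(average_rate(P, l, T) / 1e6)     # ~1.67 Mbps: looks modest on average
print(shrew_rate(0.1, P, l, T) / 1e6)  # 10.0 Mbps: but bursts at full link speed
```

The gap between the average and the peak is exactly what lets the shrew fill the buffer
during a burst while staying under rate-based detection thresholds.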
The shrew works by keeping the buffer full for a sufficiently long time (typically on the
time scale of the flows' RTT), causing the router to forcefully drop multiple packets from
the same TCP flow. At this point, TCP flows will try to retransmit the packets after a
retransmission timeout (RTO). By setting its period T equal to the TCP flows' minRTO (the
authors of [115] suggest that all TCP flows should set their minRTO to 1 second), the
shrew causes the retransmitted packets to also be dropped. Subsequently, TCP performs an
exponential back-off, dropping its congestion window to one and doubling its RTO. Since
the new RTO is also a multiple of T, the flow experiences repeated packet losses.
Typically, the lower RTT flows are penalized more heavily than the higher RTT ones. As
Figure 7.2 shows (recreated from Figure 7 of [20]), shrews can considerably decrease the
throughput of competing TCP flows.

Figure 7.1: Square-wave shrew.

Figure 7.2: Effect of a single shrew on TCP throughput as a function of the RTT of flows
sharing a DropTail queue.

N : the number of flows sharing the link.
RTT : the average RTT of the flows.
C : the link capacity.
B : the buffer size, given by B = m · (RTT · C)/√N, m ≥ 1.
B0 = γ · B : the instantaneous queue size when the shrew attack is launched.
s : the number of shrews.
P : the burst rate of a single shrew.
l : the burst time.
T : the period of the shrew.

Table 7.1: Notation used in the mathematical analysis of the shrew attack.
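The timing interaction described above can be checked with a few lines. This is an
illustrative simplification (it ignores RTT variance and RTO estimation): with
minRTO = 1 s and shrew period T = 1 s, every exponentially backed-off timeout
(1, 2, 4, ... seconds) is a multiple of T, so each retransmission attempt again meets a
full buffer.

```python
def rto_sequence(min_rto_s, retries):
    """Exponentially backed-off retransmission timeouts: minRTO * 2^k."""
    return [min_rto_s * 2 ** k for k in range(retries)]

def losses_align(period_s, min_rto_s, retries=5):
    """True when every backed-off RTO is a multiple of the shrew period,
    i.e. each retransmission attempt lands inside a shrew burst."""
    return all(rto % period_s == 0 for rto in rto_sequence(min_rto_s, retries))

print(losses_align(1.0, 1.0))  # T = minRTO = 1 s: every retry is trapped
```

This is why [20] proposes RTO randomization as a countermeasure: breaking the alignment
between the backed-off RTOs and T lets some retransmissions slip through between bursts.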
7.2 Mathematical Analysis
We present here a simple fluid model used to analyze the effect of increasing buffer size
on shrew attacks. In our analysis, we assume an idealized AQM scheme that is able to
detect and penalize flows sending traffic at a rate higher than their fair rate. The
notation we use is listed in Table 7.1.
Consider an attack with s shrews launched on a link used by N TCP flows. We further assume
that the goal of the shrew attack is to throttle all flows with RTT < ρ sec. Then, the
minimum amount of total incoming traffic required to keep the buffer full for ρ sec is
given by:
Input traffic = B − (B0 − C · ρ) = C · ρ + (B − B0)    (7.1)
As shown in [20], if a shrew attack is launched on a link shared by a large number of TCP
flows, and the shrew attack throttles all flows with RTT < ρ, TCP flows with larger RTT
may consume the additional capacity. Furthermore, other background traffic sharing the
link, such as short TCP and UDP flows, which is unaffected by the shrew, also aids the
shrew in filling up the link. Consequently, the shrews need to send less traffic than
shown in Eq. (7.1). In the worst case, when the link is completely utilized by background
traffic, the shrews must at least account for (B − B0). Therefore, for a time period ρ:
Shrew traffic ≥ B − B0 = m · (1 − γ) · (RTT · C)/√N    (7.2)
where γ is the fraction of the buffer that was full at the beginning of the shrew attack.
The fraction γ depends on factors such as the queue type and the traffic mix traversing
the link. Note that the total traffic sent by the shrews during this time is equal to
(P · l) · s. Thus, we can rewrite Eq. (7.2) as:
(P · l) · s ≥ m · (1 − γ) · (RTT · C)/√N    (7.3)
One can see from Eq. (7.3) that if we increase the size of the buffer by using a larger
constant m′ = m + ∆m, the peak rate of each shrew must increase by:
∆P ≥ ∆m · (1 − γ) · (RTT · C)/(√N · l · s)    (7.4)
Eq. (7.4) reveals that with a unit increase in the multiplicative factor m, each
individual shrew needs to increase its sending rate by an order of O(1/√N). Given that the
fair bandwidth of a flow is f_bw = O(C/N), a small increase in m causes the sending rate
of each shrew to exceed f_bw, whereby the shrew is no longer a low-rate attack and will
therefore be detected by the AQM mechanism. Note that for high-speed links, ∆m ≪ √N, and
so the buffer size still remains ≪ RTT · C. Furthermore, as N increases, the fair
bandwidth allotted to each flow decreases. Consequently, the average sending rate of the
shrew is much higher than the fair bandwidth and the shrew is easier to detect.
We use a typical scenario as an illustration. Consider an OC-3 link (155 Mbps) carrying
150 TCP flows, with γ = 0.7, RTT = 250 ms, and l = 100 ms. In this case, the additional
buffer space that needs to be filled by the shrews for a unit increase of m is
∆ = (1 − γ) · (RTT · C)/√N ≈ 1 Mb. Therefore, if s = 5, each individual shrew needs to
increase its peak sending rate by 2 Mbps for a unit increase in m. Consequently, the
average sending rate of a shrew increases by 2 Mbps · l/T = 2 · 0.1/1 = 0.2 Mbps. The
min-max fair bandwidth for the link is 155/150 ≈ 1 Mbps. Thus, choosing m = 5
(< √150 ≈ 12.24) is sufficient to drive the sending rate of a single shrew sufficiently
high that it will be detected by an AQM scheme (e.g., RED-PD) which provides approximate
fairness.
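The arithmetic of this example can be reproduced directly from Eqs. (7.2)-(7.4). This is a
sketch for checking the numbers, with T = 1 s assumed for the period; the small
discrepancies against the rounded figures in the text come from √150 ≈ 12.25.

```python
import math

def extra_buffer_per_unit_m(gamma, rtt_s, cap_bps, n_flows):
    """Extra buffer (bits) the shrews must fill per unit increase of m,
    i.e. the Delta = (1 - gamma) * RTT * C / sqrt(N) term from Eq. (7.2)."""
    return (1 - gamma) * rtt_s * cap_bps / math.sqrt(n_flows)

# OC-3 example: gamma = 0.7, RTT = 250 ms, C = 155 Mbps, N = 150, s = 5 shrews.
gamma, rtt, cap, n, s, l, T = 0.7, 0.250, 155e6, 150, 5, 0.100, 1.0
delta = extra_buffer_per_unit_m(gamma, rtt, cap, n)  # ~0.95 Mb (text: ~1 Mb)
peak_increase = delta / (l * s)      # ~1.9 Mbps per shrew (text: ~2 Mbps)
avg_increase = peak_increase * l / T  # ~0.19 Mbps per shrew (text: ~0.2 Mbps)
fair_bw = cap / n                     # ~1.03 Mbps min-max fair rate
```

Five unit increases of m (m = 5) therefore push each shrew's average rate close to the
min-max fair rate, which is the detection threshold the idealized AQM enforces.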
7.3 Evaluation
We use ns-2 simulations to verify our mathematical analysis. Figure 7.3 shows the classic
dumb-bell topology we used in our simulations, with two sets of sources and sinks: the
first set consists of TCP source/sink pairs while the second set consists of shrews. All
TCP flows are long-duration SACK flows. We used SACK because it was found to be the
version of TCP most resistant to the shrew attack [20]. TCP sources start at a random time
in [0, 10] sec while the shrew attack starts at 100 sec, to allow the TCP flows to reach
steady state.
All the source-sink pairs are interconnected by the bottleneck link r0 → r1. The link
delays and speeds are shown in Figure 7.3. All the sinks have a one-way delay of 1 msec to
router r1. The one-way propagation delay of the TCP sources to r0 increases uniformly from
0 to 220 msec. Therefore, the round-trip time ranges uniformly from 20 msec to 460 msec,
as suggested in [116].
Figure 7.3: Dumb-bell configuration.
We set the buffer size of the r0 → r1 link to B = m · (RTT · C)/√N and we vary
m from 1 to √N to measure the effect of increasing buffer size on the throughput of the
TCP flows and the sending rate of the shrew. All the links have drop-tail queues except
the r0 → r1 link, which uses RED-PD [21]. RED-PD uses a configurable target round-trip
time R to derive the average sending rate of compliant TCP flows using the deterministic
model of TCP from [117]. According to this model, the sending rate of a compliant TCP flow
is B_R = √1.5/(R · √p), where p is the ambient loss rate computed over the recent history
of packet losses. Flows whose sending rate is higher than B_R are identified as
misbehaving and are monitored. The advantage of increasing R is that more misbehaving
flows can be identified. On the other hand, doing so increases the required amount of
per-flow state, which is proportional to the increase in R and the number of flows
traversing the link. In our simulations we use multiple values of R to evaluate the
sensitivity of RED-PD in detecting the shrews.
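The RED-PD threshold can be sketched numerically (an illustration, assuming B_R is
expressed in packets per second and p is the measured ambient loss rate; the 1% loss
figure is a hypothetical value, not one reported in the chapter):

```python
import math

def redpd_target_rate(target_rtt_s, loss_rate):
    """Compliant-TCP rate bound B_R = sqrt(1.5) / (R * sqrt(p)), in packets
    per second, from the deterministic TCP model of [117]. Flows sending
    faster than this are flagged for monitoring."""
    return math.sqrt(1.5) / (target_rtt_s * math.sqrt(loss_rate))

# At 1% ambient loss, raising R from 40 ms to 120 ms lowers the threshold,
# so more aggressive flows fall above it and get monitored.
print(redpd_target_rate(0.040, 0.01))  # ~306 pkts/s
print(redpd_target_rate(0.120, 0.01))  # ~102 pkts/s
```

This makes the trade-off in the text concrete: a larger R tightens the compliant-rate
bound (catching more misbehavers) at the cost of tracking loss history for more flows.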
In the following paragraphs, we present the results from two different scenarios, with
different link speeds and numbers of flows, to investigate the effect of increasing buffer
size on the throughput of TCP and the transmitting rate of the shrews. All the results are
based on at least 400 sec of simulation.
7.3.1 Low Speed Link
This scenario is similar to the one used in the original shrew attack paper [20]. The
capacity of the bottleneck link is set to 10 Mbps and the link is shared by 20 TCP flows
and a single shrew. The RED-PD threshold R is set to 40 msec. We use the shrew parameters
from [20], where P = 10 Mbps, l = 200 msec, and T = 1.2 sec, for an average sending rate
of 1.67 Mbps. Given a shrew with parameters P, l, T, we define an equivalent CBR to be a
CBR flow transmitting at a constant rate of P · l/T, equal to the average sending rate of
the shrew. We then compare the normalized throughput (as a percentage of the link
capacity) that each TCP flow achieves when it competes with a shrew to the throughput
achieved when the shrew is replaced by the equivalent CBR flow.
Figure 7.4: TCP throughput as a function of the RTT under increasing buffer sizes. Unless
otherwise specified, R = 40 msec.
Figure 7.4 plots the throughput obtained by the different TCP flows as a function of their
RTT. The first point to be noted in the graph is the well-known negative bias of TCP
against flows with high RTT. The more interesting point, however, is the reduced sending
rate of TCP flows across the whole RTT range when the shrew is active and the buffer size
is small (m = 1). The low RTT flows are more adversely affected. However, the TCP sending
rate increases with m. When m = 4, the throughput of the TCP flows is approximately equal
to that achieved when the shrew is replaced by the equivalent CBR source. This result
indicates that
the higher buffer size is indeed effective in minimizing the effect of the shrew.
             R = 40 msec        R = 120 msec
  m          CBR     Shrew      CBR     Shrew
  1          83%     43%        86%     79%
  2          86%     65%        87%     80%
  3          86%     78%        88%     81%
  4          86%     82%        88%     81%
  √20 ≈ 4.5  86%     83%        88%     81%

Table 7.2: Aggregate link utilization from 20 TCP flows.
Figure 7.5: When R increases to 120 msec, it is possible to have a small buffer size
(m = 2) without penalizing the TCP flows sharing the link with the shrew.
Figure 7.5 shows the effect of a unit increase in m on the throughput of the TCP flows.
When m = 2, the throughput of TCP flows with low RTT increases. However, there is still
some negative effect of the shrews on the higher RTT flows. Table 7.2 shows the percentage
of link capacity utilized by the 20 TCP flows for different values of R. The Shrew column
corresponds to the throughput obtained by the TCP flows under a shrew attack, while the
CBR column shows the throughput for the equivalent CBR flow. As seen from the table, the
TCP throughput is gradually restored with increasing m. When m = 4,
the negative effect of the shrew on the TCP flows is minimal.[1] The table also shows that
using a larger R value (120 msec) requires a smaller increase in the buffer size to
mitigate the shrew attack, because RED-PD with R = 120 msec is a better fairness
approximator than RED-PD with R = 40 msec. The reason is that when R = 40 msec, RED-PD
only detects flows whose RTT ≤ 40 msec. For the same reason, the results (not shown here)
for a simple RED queue are similar to those of RED-PD with R = 40 msec. Since most of the
flows in this experiment have higher RTT, RED-PD emulates RED for the majority of the
flows. Consequently, when R = 120 msec, the throughput attained by the TCP flows is closer
to the fair throughput allocation (C/N = 10/21 ≈ 470 Kbps). Of course, setting R equal to
the maximum RTT among the TCP flows would achieve perfect fairness and completely mitigate
the shrews, with the least increase in the buffer size. The downside is that, since the
amount of packet drop information stored by RED-PD is proportional to the number of flows
and R, larger R values result in higher state overhead.
From this first experiment, one may incorrectly suspect that in order for the shrew to be
neutralized, m ≈ √N is required. This is, however, an artifact of the small number of TCP
flows (20) sharing the link in this experiment. To show that small values of m are indeed
adequate, we repeated the experiment using a higher number of flows on a faster link.
[1] Utilization of 86% and 88% for TCP when m = 4.5 indicates that the competing flow is
suffering a high number of losses and the TCP flows are able to utilize the additional
capacity.
7.3.2 High Speed Link
Next, we consider a more realistic scenario, where the bottleneck link is an OC-3
(155 Mbps) link shared by 250 TCP flows. We use ten synchronized shrews (≈ 4% of the total
number of flows). This way, any single shrew has a lower average sending rate and is more
difficult to detect. For each shrew, P = 20 Mbps, l = 200 msec, and T = 1.2 sec, implying
an average sending rate of 3.33 Mbps. Therefore, all the synchronized shrews have an
aggregate peak rate of 200 Mbps for a burst time of 200 msec. Table 7.3 shows the link
utilization due to TCP flows obtained with shrews and with shrews replaced by equivalent
CBR flows. As seen in the previous simulation, the utilization steadily increases with m.
The corresponding bandwidth plot in Figure 7.6 illustrates that as m increases to ≈ 3, the
sending rate of the low RTT flows is restored to the no-shrews scenario. As in the
previous subsection, we repeat the experiment with R = 120 msec. As expected, increasing R
increases the throughput obtained by the high RTT flows, thereby improving the overall
link utilization considerably. In this case, we see that with R = 120 msec and m = 5, we
achieve TCP utilization as good as when m = 16. The same effect is evident in Figure 7.6.
             R = 40 msec        R = 120 msec
  m          CBR     Shrew      CBR     Shrew
  1          82%     52%        81%     53%
  3          82%     62%        82%     58%
  5          83%     68%        83%     82%
  8          82%     70%        82%     81%
  12         81%     76%        81%     81%
  √250 ≈ 16  80%     77%        80%     80%

Table 7.3: Aggregate TCP link utilization for 250 flows.
Figure 7.6: TCP throughput under increasing buffer sizes.
Figure 7.7: Peak and average shrew sending rate needed to maintain reduced link
utilization (two panels: m for 10 Mbps, 20 flows; m for 155 Mbps, 250 flows).
Eq. (7.4) in Section 7.2 shows that for the shrew attack to be effective, the peak rate P
as well as the shrew's average rate has to increase linearly with m. We verified this
analysis via simulation. The graphs in Figure 7.7 plot the shrews' normalized sending rate
required to maintain the same low link utilization as when m = 1. We see that in the case
of 20 flows, when m = 3 the average sending rate of the shrew is approximately 50% of the
link capacity. For 250 flows, with m = 5, each shrew needs to send at about 5% of the link
capacity, i.e., 50% (since s = 10) of the link traffic must be shrew traffic. This implies
that the shrew traffic is no longer low-rate traffic but actually a high-rate DDoS attack,
even with a relatively poor choice of R = 40 msec. Unlike shrews, high-rate DDoS attacks
are easy to detect and several schemes exist to contain them, e.g., [118]. The required
shrew sending rate increases more slowly when m > 5. The reason is as follows: to maintain
TCP utilization equal to that when m = 1 (≈ 50%), it is enough for the competing shrews to
fill up 50% of the link, irrespective of the buffer size. This can be achieved by sending
at a rate slightly greater than 50% of the link speed.
As seen from the last two subsections, a small increase in m is sufficient to improve the
throughput of the high RTT TCP flows almost to the levels seen without a DoS attack.
However, some effect of the shrew attack is still visible in the reduced utilization of
the link capacity. As described in [21], setting the target round-trip time R of the
RED-PD router to be very large (≥ maxRTT) guarantees perfect fairness, but the state to be
maintained is O(N · R), where N is the number of flows, and hence scalability is an issue.
Based on the mathematical analysis and simulations, setting a moderate target round-trip
time R (≈ 120 msec) and increasing m by a small value (≈ 5) provides performance
comparable to the no-shrew scenario, both in terms of utilization and fairness. We believe
that the effect of shrew attacks can be mitigated by using this two-pronged strategy. In
order to throttle TCP traffic, shrews would need to send at a considerably high proportion
of the link capacity, at which point the attack is no longer a low-rate attack and is
easier to detect.
7.4 Related Work

There has been a plethora of AQM schemes inspired by RED, the seminal work in this area [114]. Quite a few of these schemes aim to provide approximate fairness. However, fairness comes at the cost of maintaining additional state at the router. To minimize state overhead, AQM schemes use only partial flow state. Hence, false negatives can occur, and malicious flows can exploit these weaknesses to gain undue resource advantages. An example of such a malicious flow is the shrew attack, a low-rate TCP-targeted denial of service attack [20]. Various countermeasures to the shrew attack have been proposed. Notable among them are: (a) RTO randomization, although [20] argues that shrews can still filter out portions of TCP traffic; (b) router-level DDoS solutions such as IP traceback [119] and Pushback [118]; and (c) AQM modifications such as HAWK [120], which makes the policy decision to penalize all bursty flows. However, not all bursty traffic is malicious; for example, short TCP flows would be unduly penalized by HAWK due to their bursty nature. Rather than proposing a new AQM scheme, we show that a moderate increase in buffer size, coupled with the use of RED-PD, is sufficient to minimize the impact of shrew attacks on TCP traffic.
7.5 Summary

This chapter studied the effect of buffer sizes on the power of low-rate TCP-targeted DoS attacks to "shut down" competing TCP traffic. Using a simple mathematical analysis coupled with simulations, we showed that a relatively small increase in buffer size can mitigate the effect of shrew attacks on TCP traffic. The intuition behind our result is simple: as the buffer size increases, shrews need to fill a larger buffer to cause multiple TCP packet drops. This means that the shrews need to transmit at high rates, at which point they are no longer low-rate attacks and can be detected by existing AQM schemes such as RED-PD.
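This intuition can be made concrete with a back-of-the-envelope sketch. The symbols here (buffer of B packets, burst length l, drain rate C) are ours, chosen for illustration; this is not the chapter's analysis, only the qualitative relationship it describes.

```python
# Toy illustration of the summary's intuition (assumed symbols, not taken
# from the chapter's derivation): to overflow a FIFO buffer of B packets
# during a burst of length l seconds while the bottleneck drains at C
# packets/sec, a shrew must inject at roughly C + B/l packets/sec.
# Larger buffers push this burst rate up, so the attack stops being low-rate.
def required_burst_rate(link_rate_pps, buffer_pkts, burst_len_s):
    """Approximate send rate needed to fill the buffer within one burst."""
    return link_rate_pps + buffer_pkts / burst_len_s

small = required_burst_rate(1250.0, 50, 0.1)   # small buffer: ~1750 pps
large = required_burst_rate(1250.0, 250, 0.1)  # 5x buffer: ~3750 pps
assert large > small
```

The required burst rate grows linearly with the buffer size, which is exactly why a modest buffer increase pushes shrews out of the "low-rate" regime where they evade detection.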
Acknowledgements

We thank Razvan Musaloiu-E. for the discussions about RED-PD, and Kishore Kothapalli for patiently reviewing the mathematical analysis of the shrew attack.
Chapter 8
Future Work
The work described in this thesis contributes towards the goal of countering emerging malware threats and quantifying the robustness of the Internet in withstanding attacks. In general, we believe the task of ensuring a secure system is a challenging one. Intrusion agents such as worms and trojans are constantly evolving into more developed forms and using emerging technologies as delivery vehicles. In this sense, security research can be likened to an arms race, with miscreants pitted against security providers. To remain a step ahead, a proactive approach to forecasting and countering threats is essential. In keeping with this stance, we now discuss avenues for future research, drawing on insights from the work in this thesis and on what remains to be accomplished.
CHAPTER 8. CONCLUSIONS AND FUTURE WORK
8.0.1 Botnets

The past few years have seen a shift from traditional IRC-style botnets to decentralized and stealthier architectures. As Chapter 2 shows, current P2P bot tracking techniques are inadequate: they allow miscreants to identify botnet trackers and monitors using simple heuristics. Such a capability is dangerous as botnets become more powerful. As already witnessed, botmasters are not averse to carrying out DoS attacks which effectively take down parties inimical to their interests [13]. While cooperatively monitoring botnets holds promise for effectively detecting and tracking P2P botnets, containment of the malware is still quite difficult. The only sure way known to contain P2P botnets today is to deal with infected machines on an individual basis using anti-virus software. While network-centric techniques such as index poisoning work quite well to deter file-sharing P2P searches, their effectiveness against automated worm binaries is debatable and needs to be researched further.
Another emerging trend in the botnet ecosystem is the use of HTTP for C&C communication [121]. While IRC/P2P botnet traffic is easily distinguishable at routers, and accordingly filtered, traffic from HTTP botnets can camouflage itself in the milieu of other web-based traffic. Furthermore, inspection of HTTP traffic is usually frowned upon in enterprises, due to concerns of privacy violations. DNS cache timings and probes could come in handy in identifying outlier HTTP servers, which are potentially malicious.
8.0.2 Mobile Malware

In Chapters 3 and 4, we gave a detailed account of the evolution of mobile malware and possible detection techniques. Since the technique of random moonwalks essentially culls out worm edges in the presence of noisy background traffic, we believe it could be engineered to be robust even in distributed, lossy scenarios in which not all domains cooperate. However, even with effective detection mechanisms, policing nodes as they enter and exit domains is not straightforward. Specifically, extending the design of a hard-LAN [59] to a mobile setting is challenging and an avenue for future work.
A closely related field of research deals with smartphone worms, which can propagate using a variety of vectors, e.g., SMS/MMS, Bluetooth, and WiFi. While anti-virus (AV) software defenses do exist for smartphones, they are limited in their effectiveness due to limited battery power and storage and, more importantly, the fact that AV software depends on signatures. One could devise collaborative worm detection procedures, similar to the moonwalk algorithm, to combat these threats. However, a peculiarity of smartphone networks is the predictable nature of traffic spikes. For example, there is usually a surge in SMS traffic around the new year [122]. Worms can exploit this by propagating during that period. Most intrusion detection mechanisms that rely on historical data to define abnormal events can be evaded by a worm employing this strategy.
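The evasion argument above can be illustrated with a hypothetical sketch of a naive historical-threshold detector. The function, the constant k = 3, and the toy traffic volumes are all ours, chosen for illustration; this is not a detection procedure proposed in this thesis.

```python
# Hypothetical sketch: a naive anomaly detector that flags traffic above
# mean + k*std of historical volumes. Because past new-year surges are part
# of the history, they inflate the baseline, and a worm that propagates only
# during this year's surge stays below the alarm threshold.
import statistics

def is_anomalous(history, observed, k=3.0):
    """Flag the observed traffic volume if it exceeds mean + k*std of history."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return observed > mu + k * sigma

# Toy daily SMS volumes; the two large values model past new-year surges.
history = [100, 105, 98, 102, 500, 99, 101, 480]
assert not is_anomalous(history, 520)   # surge-riding worm traffic: missed
assert is_anomalous(history, 2000)      # only a blatant flood trips the alarm
```

The surge days widen the historical standard deviation enough that surge-plus-worm traffic looks normal, which is precisely the weakness a surge-timed worm would exploit.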
8.0.3 Web-based Malware

In recent years, the web browser has evolved from a single-principal, single-site application to one in which a single page contains a mashup of code and data from multiple, mutually distrusting sites. In Chapter 5 we introduced new abstractions which could combat attacks such as XSS. We believe that browsers still lack essential abstractions that could provide an operating-system-style environment for the websites running within them. A browser functioning as a de facto operating system for executing the client-side components of web applications would provide isolation, resource management, and fault containment, while offering powerful sharing paradigms.
Bibliography
[1] eEye Digital Security, “Code Red Worm,” http://www.eeye.com/html/Research/Advisories/AL20010717.html.
[2] R. Pang, V. Yegneswaran, P. Barford, V. Paxson, and L. Peterson, “Characteristics of Internet Background Radiation,” in Proceedings of ACM IMC, Oct. 2004.
[3] E. Cooke, F. Jahanian, and D. McPherson, “The Zombie Roundup: Understanding, Detecting, and Disrupting Botnets,” in Proceedings of the First Workshop on Steps to Reducing Unwanted Traffic on the Internet, Jul. 2005.
[4] M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, “A Multifaceted Approach to Understanding the Botnet Phenomenon,” in Proceedings of ACM SIGCOMM/USENIX Internet Measurement Conference (IMC), Oct. 2006, pp. 41–52.
[5] C. Nunnery and B. B. Kang, “Locating Zombie Nodes and Botmasters in Decentralized Peer-to-Peer Botnets,” available at: honeynet.uncc.edu/papers/P2PDetectConceptPaper.pdf, 2007.
[6] “DoS Attack Cripples Internet Root Servers,” available at: http://www.informationweek.com/story/showArticle.jhtml?articleID=197003903.
[7] J. Stewart, “Storm Worm DDoS Attack,” available at: http://www.secureworks.com/research/threats/storm-worm, Feb. 2007.
[8] RSnake, “XSS Cheat Sheet,” available at: http://ha.ckers.org/xss.html.
[9] F. Boldewin, “Peacomm.C - Cracking the Nutshell,” available at: http://www.reconstructer.org/papers/Peacomm.C-Crackingthenutshell.zip.
[10] P. Porras, H. Saidi, and V. Yegneswaran, “A Multi-perspective Analysis of the Storm (Peacomm) Worm,” available at: http://www.cyber-ta.org/pubs/StormWorm/report/, Oct. 2007.
[11] P. Maymounkov and D. Mazieres, “Kademlia: A Peer-to-peer Information System Based on the XOR Metric,” in Proceedings of the First International Workshop on Peer-to-Peer Systems (IPTPS), 2002. [Online]. Available: citeseer.ist.psu.edu/maymounkov02kademlia.html
[12] The Honeynet Project & Research Alliance, “Know Your Enemy: Fast-Flux Service Networks, An Ever Changing Enemy,” available at: http://www.honeynet.org/papers/ff/index.html, Jul. 2007.
[13] “Storm Worm Retaliates Against Security Researchers,” available at: http://www.theregister.co.uk/2007/10/25/storm_worm_backlash/, Oct. 2007.
[14] Gartner Research, “Forecast: Mobile Terminals, Worldwide, 2000-2009 (4Q05 Update),” available at: http://www.gartner.com/DisplayDocument?doc_cd=137396, Jan. 2006.
[15] “Zotob Causes Carnage in Corporate Networks,” available at: http://www.netfastusa.com/xq/asp/id.1338/p.5-6-1/qx/PressReleaseview.htm.
[16] “Same Origin Policy,” available at: http://www.mozilla.org/projects/security/components/same-origin.html.
[17] C. Partridge, T. Mendez, and W. Milliken, “Host Anycasting Service,” RFC 1546, 1993.
[18] T. Griffin and G. Wilfong, “An Analysis of BGP Convergence Properties,” in Proceedings of ACM SIGCOMM, Sep. 1999.
[19] G. Appenzeller, I. Keslassy, and N. McKeown, “Sizing Router Buffers,” in Proceedings of ACM SIGCOMM, Aug. 2004. [Online]. Available: citeseer.ist.psu.edu/article/appenzeller04sizing.html
[20] A. Kuzmanovic and E. Knightly, “Low-rate TCP-targeted Denial of Service Attacks (The Shrew vs. the Mice and Elephants),” in Proceedings of ACM SIGCOMM, Aug. 2003. [Online]. Available: citeseer.ist.psu.edu/kuzmanovic03lowrate.html
[21] R. Mahajan, S. Floyd, and D. Wetherall, “Controlling High-bandwidth Flows at the Congested Router,” in ICNP, Nov. 2001. [Online]. Available: citeseer.ist.psu.edu/article/mahajan01controlling.html
[22] J. Grizzard, V. Sharma, C. Nunnery, B. Kang, and D. Dagon, “Peer-to-Peer Botnets: Overview and Case Study,” in Proceedings of the First USENIX Workshop on Hot Topics in Botnets (HotBots’07), Apr. 2007.
[23] J. Liang, N. Naoumov, and K. W. Ross, “The Index Poisoning Attack in P2P File Sharing Systems,” in Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM), 2006.
[24] R. Brunner, “A Performance Evaluation of the Kad-protocol,” Master’s thesis, Corporate Communications Department, Institut Eurecom, France, Nov. 2006.
[25] X. Jiang, D. Xu, H. J. Wang, and E. H. Spafford, “Virtual Playgrounds for Worm Behavior Investigation,” in Proceedings of the Eighth International Symposium on Recent Advances in Intrusion Detection (RAID), Sep. 2005.
[26] F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” in Proceedings of the USENIX Annual Technical Conference, FREENIX Track, 2005.
[27] M. Steiner, T. En-Najjary, and E. W. Biersack, “A Global View of KAD,” in Proceedings of the Internet Measurement Conference (IMC), 2007.
[28] B. Sterling, “Microsoft Battles the Storm Worm,” available at: http://blog.wired.com/sterling/2007/09/microsoft-battl.html, Sep. 2007.
[29] MaxMind LLC, “MaxMind GeoIP Country Database,” available at: http://www.maxmind.com/, 2007.
[30] G. Keizer, “Massive Spam Shot of ’Storm Trojan’ Reaches Record Proportions,” available at: http://computerworld.com/action/article.do?command=viewArticleBasic&articleId=9016420, 2007.
[31] A. Ramachandran and N. Feamster, “Understanding the Network-level Behavior of Spammers,” SIGCOMM Comput. Commun. Rev., vol. 36, no. 4, pp. 291–302, 2006.
[32] “Composite Blocking List,” available at: http://cbl.abuseat.org/.
[33] D. C. Hart, “Real Time DNSBL and Spam Trap,” available at: http://tqmcube.com/.
[34] Admins WebSecurity GbR, “Germany’s First Spam Protection Database,” available at: http://www.uceprotect.net/en/index.php/.
[35] D. Stutzbach and R. Rejaie, “Understanding Churn in Peer-to-peer Networks,” in Proceedings of the 6th Internet Measurement Conference (IMC). New York, NY, USA: ACM Press, 2006, pp. 189–202.
[36] “Storm Worm Now Just a Squall,” available at: http://www.washingtonpost.com/wp-dyn/content/article/2007/10/22/AR2007102200021pf.html.
[37] “Measuring the Success Rate of Storm Worm,” available at: http://honeyblog.org/archives/156-Measuring-the-Success-Rate-of-Storm-Worm.html.
[38] “MacOSX Malware Latches onto Bluetooth Vulnerability,” available at: http://www.theregister.co.uk/2006/02/17/macosxbluetoothworm, 2006.
[39] “CRAWDAD: A Community Resource for Archiving Wireless Data at Dartmouth,” available at: http://crawdad.cs.dartmouth.edu/dartmouth/campus.
[40] D. Moore, “Network Telescopes: Observing Small or Distant Security Events,” in 11th USENIX Security Symposium, Invited Talk, Aug. 2002.
[41] Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos, “Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint,” in 22nd Symposium on Reliable Distributed Computing, Florence, Italy, Oct. 2003. [Online]. Available: citeseer.ist.psu.edu/wang03epidemic.html
[42] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-Area Network,” in 1st International Conference on Mobile Systems, Applications, and Services (MobiSys), San Francisco, CA, May 2003.
[43] R. Jain, A. Shivaprasad, D. Lelescu, and X. He, “Towards a Model of User Mobility and Registration Patterns,” SIGMOBILE Mob. Comput. Commun. Rev., vol. 8, no. 4, pp. 59–62, 2004.
[44] S. Eubank, V. S. A. Kumar, M. V. Marathe, A. Srinivasan, and N. Wang, “Structural and Algorithmic Aspects of Massive Social Networks,” in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2004, pp. 718–727.
[45] S. Staniford, D. Moore, V. Paxson, and N. Weaver, “The Top Speed of Flash Worms,” in Proceedings of the ACM Workshop on Rapid Malcode (WORM), Oct. 2004, pp. 33–42.
[46] M. Bailey, E. Cooke, F. Jahanian, J. Nazario, and D. Watson, “Internet Motion Sensor: A Distributed Blackhole Monitoring System,” in Proceedings of the ISOC Network and Distributed System Security Symposium (NDSS), 2005.
[47] M. A. Rajab, F. Monrose, and A. Terzis, “On the Effectiveness of Distributed Worm Monitoring,” in Proceedings of USENIX Security, 2005.
[48] C. Shannon and D. Moore, “The CAIDA Dataset on the Witty Worm - March 19-24, 2004,” http://www.caida.org/passive/witty/. Support for the Witty Worm dataset and the UCSD Network Telescope are provided by Cisco Systems, Limelight Networks, DHS, NSF, CAIDA, DARPA, Digital Envoy, and CAIDA Members.
[49] M. A. Rajab, F. Monrose, and A. Terzis, “Fast and Evasive Attacks: Highlighting the Challenges Ahead,” in Proceedings of the 9th International Symposium on Recent Advances in Intrusion Detection (RAID), Sep. 2006.
[50] H. Hethcote, “The Mathematics of Infectious Diseases,” SIAM Review, vol. 42, no. 4, 2000.
[51] Z. Chen, L. Gao, and K. Kwiat, “Modeling the Spread of Active Worms,” in Proceedings of IEEE INFOCOM, vol. 3, 2003, pp. 1890–1900.
[52] G. S. Canright and K. Engo-Monsen, “Epidemic Spreading over Networks - A View from Neighbourhoods,” Telektronikk, vol. 2005, no. 1, 2005, available at: http://www.telenor.com/telektronikk/volumes/pdf/1.2005/Page_065-085.pdf.
[53] S. Staniford, V. Paxson, and N. Weaver, “How to 0wn the Internet in Your Spare Time,” in Proceedings of the 11th USENIX Security Symposium, Aug. 2002.
[54] E. Anderson, K. Eustice, S. Markstrum, M. Hansen, and P. Reiher, “Mobile Contagion: Simulation of Infection and Defense,” in Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation (PADS). Washington, DC, USA: IEEE Computer Society, 2005, pp. 80–87.
[55] J. Su, K. W. Chan, A. G. Miklas, K. Po, A. Akhavan, S. Saroiu, E. de Lara, and A. Goel, “A Preliminary Investigation of Worm Infections in a Bluetooth Environment,” in 4th Workshop on Rapid Malcode, 2006.
[56] J. W. Mickens and B. D. Noble, “Modeling Epidemic Spreading in Mobile Environments,” in WiSe ’05: Proceedings of the 4th ACM Workshop on Wireless Security. New York, NY, USA: ACM Press, 2005, pp. 77–86.
[57] J.-K. Lee and J. C. Hou, “Modeling Steady-state and Transient Behaviors of User Mobility: Formulation, Analysis, and Application,” in MobiHoc ’06: Proceedings of the Seventh ACM International Symposium on Mobile Ad Hoc Networking and Computing. New York, NY, USA: ACM Press, 2006, pp. 85–96.
[58] “Cisco Network Admission Control,” Cisco NAC, available at: http://www.cisco.com/en/US/netsol/ns466/networkingsolutionspackage.html.
[59] N. Weaver, D. Ellis, S. Staniford, and V. Paxson, “Worms vs. Perimeters: The Case for Hard-LANs,” in Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects, 2004.
[60] D. Whyte, E. Kranakis, and P. van Oorschot, “ARP-Based Detection of Scanning Worms within an Enterprise Network,” in Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2005.
[61] S. E. Schechter, J. Jung, and A. W. Berger, “Fast Detection of Scanning Worm Infections,” in Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID), 2004.
[62] S. Sarat and A. Terzis, “On Using Mobility to Propagate Malware,” in Proceedings of the 5th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Apr. 2007.
[63] Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang, “Worm Origin Identification Using Random Moonwalks,” in Proceedings of the IEEE Symposium on Security and Privacy, May 2005, pp. 242–256.
[64] Y. Xie, V. Sekar, M. K. Reiter, and H. Zhang, “Forensic Analysis for Epidemic Attacks in Federated Networks,” in Proceedings of the IEEE International Conference on Network Protocols, Oct. 2006.
[65] J. Bethencourt, J. Franklin, and M. Vernon, “Mapping Internet Sensors with Probe Response Attacks,” in Proceedings of the 14th USENIX Security Symposium, Aug. 2005, pp. 193–212.
[66] F. Campos, M. Karaliopoulos, M. Papadopouli, and H. Shen, “Spatio-Temporal Modeling of Traffic Workload in a Campus WLAN,” in Proceedings of the Second Annual International Wireless Internet Conference, Boston, USA, 2006.
[67] C. Shannon and D. Moore, “The Spread of the Witty Worm,” IEEE Security and Privacy Magazine, vol. 2, no. 4, pp. 46–50, Jul. 2004.
[68] H.-A. Kim and B. Karp, “Autograph: Toward Automated, Distributed Worm Signature Detection,” in Proceedings of the 13th USENIX Security Symposium (Security 2004), 2004.
[69] S. Singh, C. Estan, G. Varghese, and S. Savage, “Automated Worm Fingerprinting,” in Proceedings of the 6th ACM/USENIX Symposium on Operating System Design and Implementation (OSDI), 2004.
[70] P. Akritidis, W. Chin, V. Lam, S. Sidiroglou, and K. Anagnostakis, “Proximity Breeds Danger: Emerging Threats in Metro-area Wireless Networks,” in Proceedings of the 16th USENIX Security Symposium, 2007.
[71] “W32.Witty.Worm,” available at: http://securityresponse.symantec.com/avcenter/venc/data/w32.witty.worm.html, Mar. 2004.
[72] A. Kumar, V. Paxson, and N. Weaver, “Exploiting Underlying Structure for Detailed Reconstruction of an Internet-scale Event,” in Proceedings of USENIX 1st Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI), 2005.
[73] M. Casado, T. Garfinkel, M. Freedman, A. Akella, D. Boneh, N. McKeown, and S. Shenker, “SANE: A Protection Architecture for Enterprise Networks,” in Proceedings of the 15th USENIX Security Symposium, Aug. 2006.
[74] M. Casado, M. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker, “Ethane: Taking Control of the Enterprise,” in Proceedings of ACM SIGCOMM, 2007.
[75] “The Samy Worm,” available at: http://namb.la/popular/tech.html, Oct. 2005.
[76] “Konqueror Web Browser,” available at: http://www.konqueror.org/features/browser.php.
[77] A. Moshchuk, T. Bragin, and D. Deville, “SpyProxy: Execution-based Detection of Malicious Web Content,” in Proceedings of the Sixteenth USENIX Security Symposium, 2007.
[78] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu, “The Ghost in the Browser: Analysis of Web-based Malware,” in Proceedings of the First USENIX Workshop on Hot Topics in Botnets (HotBots’07), Apr. 2007.
[79] D. Crockford, “JSONRequest,” available at: http://www.json.org/module.html.
[80] C. Jackson and H. J. Wang, “Subspace: Secure Cross-Domain Communication for Web Mashups,” in Proceedings of the Sixteenth World Wide Web Conference (WWW), May 2007.
[81] T. Jim, N. Swamy, and M. Hicks, “Defeating Script Injection Attacks with Browser-enforced Embedded Policies,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web. New York, NY, USA: ACM, 2007, pp. 601–610.
[82] H. J. Wang, X. Fan, J. Howell, and C. Jackson, “Protection and Communication Abstractions for Web Browsers in MashupOS,” in SOSP ’07: Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2007, pp. 1–16.
[83] V. T. Lam, S. Antonatos, P. Akritidis, and K. G. Anagnostakis, “Puppetnets: Misusing Web Browsers as a Distributed Attack Infrastructure,” in CCS ’06: Proceedings of the 13th ACM Conference on Computer and Communications Security. New York, NY, USA: ACM, 2006, pp. 221–234.
[84] T. Hardie, “Distributing Authoritative Name Servers via Shared Unicast Addresses,” RFC 3258, Apr. 2002.
[85] S. Sarat and A. Terzis, “On the Use of Anycast in DNS,” HiNRG, Johns Hopkins University Technical Report, Dec. 2004.
[86] Intel Research, “PlanetLab,” 2002, http://www.planet-lab.org/.
[87] R. Elz, R. Bush, S. Bradner, and M. Patton, “Selection and Operation of Secondary DNS Servers,” Jul. 1997.
[88] Y. Rekhter and T. Li, “A Border Gateway Protocol 4 (BGP-4),” RFC 1771, Mar. 1995.
[89] J. Abley, “A Software Approach to Distributing Requests for DNS Service Using GNU Zebra, ISC BIND 9, and FreeBSD,” in Proceedings of USENIX 2004 Annual Technical Conference, FREENIX Track, 2004. [Online]. Available: http://www.usenix.org/events/usenix04/tech/sigs/abley.html
[90] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose, “DNS Security Introduction and Requirements,” Work in progress: draft-ietf-dnsext-dnssec-intro-08, Dec. 2003.
[91] Internet Systems Consortium, Inc., “ISC F-Root,” http://www.isc.org/ops/f-root/.
[92] J. Abley, “Hierarchical Anycast for Global Service Distribution,” 2003, http://www.isc.org/pubs/tn/?tn=isc-tn-2003-1.html.
[93] “RIPE NCC K-Root,” http://k.root-servers.org/.
[94] “The Route Views Project,” available at: http://www.antc.uoregon.edu/route-views/.
[95] D. Wessels, M. Fomenkov, N. Brownlee, and K. Claffy, “Measurements and Laboratory Simulations of the Upper DNS Hierarchy,” in Proceedings of PAM 2004, Apr. 2004.
[96] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, “Delayed Internet Routing Convergence,” in Proceedings of ACM SIGCOMM 2000, 2000, pp. 175–187.
[97] S. Savage, A. Collins, E. Hoffman, J. Snell, and T. Anderson, “The End-to-End Effects of Internet Path Selection,” in Proceedings of SIGCOMM 1999, Aug. 1999.
[98] N. Spring, R. Mahajan, and T. Anderson, “Quantifying the Causes of Path Inflation,” in Proceedings of ACM SIGCOMM, Aug. 2003. [Online]. Available: http://www.acm.org/sigcomm/sigcomm2003/papers/p113-spring.pdf
[99] L. Gao and F. Wang, “The Extent of AS Path Inflation by Routing Policies,” in Proceedings of Global Internet Symposium, 2002. [Online]. Available: citeseer.ist.psu.edu/gao02extent.html
[100] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On Power-law Relationships of the Internet Topology,” in SIGCOMM, 1999, pp. 251–262. [Online]. Available: citeseer.ist.psu.edu/michalis99powerlaw.html
[101] P. B. Danzig, K. Obraczka, and A. Kumar, “An Analysis of Wide-Area Name Server Traffic,” in ACM SIGCOMM ’92, 1992.
[102] N. Brownlee and I. Ziedins, “Response Time Distributions for Global Name Servers,” in Proceedings of PAM 2002 Workshop, Mar. 2002.
[103] J. Pang, J. Hendricks, A. Akella, S. Seshan, B. Maggs, and R. De Prisco, “Availability, Usage and Deployment Characteristics of the Domain Name System,” in Proceedings of ACM IMC 2004, 2004.
[104] P. Boothe and R. Bush, “DNS Anycast Stability: Some Early Results,” available at: http://rip.psg.com/~randy/050223.anycast-apnic.pdf, 2005.
[105] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, “DNS Performance and the Effectiveness of Caching,” IEEE/ACM Trans. on Networking, Oct. 2002.
[106] “The AS112 Project,” http://www.as112.net.
[107] C. Huitema, “An Anycast Prefix for 6to4 Relay Routers,” RFC 3068, Jun. 2001.
[108] B. R. Greene and D. McPherson, “ISP Security: Deploying and Using Sinkholes,” http://www.nanog.org/mtg-0306/sink.html.
[109] R. Engel, V. Peris, and D. Saha, “Using IP Anycast for Load Distribution and Server Location,” in Proceedings of Global Internet, Dec. 1998.
[110] H. Ballani and P. Francis, “Towards a Deployable IP Anycast Service,” in Proceedings of WORLDS, Dec. 2004.
[111] K. Park, V. S. Pai, L. Peterson, and Z. Wang, “CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups,” in Proceedings of OSDI ’04, Dec. 2004.
[112] V. Ramasubramanian and E. G. Sirer, “The Design and Implementation of a Next Generation Name Service for the Internet,” in Proceedings of ACM SIGCOMM 2004, Aug. 2004.
[113] C. Villamizar and C. Song, “High Performance TCP in ANSNET,” SIGCOMM Computer Communications Review, vol. 24, no. 5, pp. 45–60, 1994.
[114] S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993. [Online]. Available: citeseer.ist.psu.edu/floyd93random.html
[115] M. Allman and V. Paxson, “On Estimating End-to-end Network Path Properties,” in Proceedings of ACM SIGCOMM, Aug. 1999. [Online]. Available: citeseer.csail.mit.edu/allman99estimating.html
[116] S. Floyd and E. Kohler, “Internet Research Needs Better Models,” in Proceedings of HotNets-I, Oct. 2002. [Online]. Available: citeseer.ist.psu.edu/floyd02internet.html
[117] S. Floyd and K. Fall, “Promoting the Use of End-to-End Congestion Control in the Internet,” IEEE/ACM Transactions on Networking, vol. 7, no. 4, pp. 458–473, Aug. 1999.
[118] J. Ioannidis and S. M. Bellovin, “Implementing Pushback: Router-based Defense Against DDoS Attacks,” in Proceedings of NDSS, Feb. 2002. [Online]. Available: citeseer.ist.psu.edu/ioannidis02implementing.html
[119] S. Savage, D. Wetherall, A. R. Karlin, and T. Anderson, “Practical Network Support for IP Traceback,” in Proceedings of ACM SIGCOMM, 2000, pp. 295–306. [Online]. Available: citeseer.ist.psu.edu/savage00practical.html
[120] Y.-K. Kwok, R. Tripathi, Y. Chen, and K. Hwang, “HAWK: Halting Anomalies with Weighted Choking to Rescue Well-Behaved TCP Sessions from Shrew DoS Attacks,” USC Internet and Grid Computing Lab, Tech. Rep. 2005-5, Feb. 2005.
[121] “Security Bites Podcast: Here Come the HTTP Botnets,” available at: http://www.news.com/2324-126403-6225814.html.
[122] P. Zerfos, X. Meng, S. H. Wong, V. Samanta, and S. Lu, “A Study of the Short Message Service of a Nationwide Cellular Network,” in Proceedings of ACM IMC, 2006.
Vita

Sandeep Sarat received the B.Tech. degree in Computer Science & Engineering from the Indian Institute of Technology, Madras in 2001, and enrolled in the Computer Science & Engineering Ph.D. program at the Johns Hopkins University the same year. His research focuses on the measurement, detection, and containment of current and emerging security threats on the Internet. His interests lie at the intersection of networks, operating systems, and security.

Starting in June 2008, Sandeep will work at Google in their New York office.