A Multi-faceted Approach to Countering Internet Threats
by
Sandeep Sarat
A dissertation submitted to The Johns Hopkins University in conformity with the
requirements for the degree of Doctor of Philosophy.
Baltimore, Maryland
May, 2008
© Sandeep Sarat 2008
All rights reserved
Abstract
While the Internet has revolutionized the communication landscape, it continues to
be plagued by issues of robustness and security that threaten the network's operation. In
this dissertation, we employ a multi-faceted approach to tackle these issues. Specifically,
measurement studies are performed to experimentally quantify the ground truth related to
the robustness of the Internet infrastructure and security threats in the wild. This work is
complemented by research on future Internet threats and mechanisms to contain them. The
sources of these threats are novel classes of malware which exploit the increasing attack
surface presented by rapidly evolving Internet technologies.
As an instance of quantifying the emerging security threats in the wild, we performed
a wide-scale measurement study of the Storm botnet. This botnet, which represents the
leading edge in botnet technology, uses a distributed peer-to-peer (P2P) architecture and
aggressively defends itself. We developed a crawler which actively scoured the P2P botnet
to determine its size and other structural properties. This study shows how traditional
P2P distributed hash tables (DHTs) differ from botnet DHTs, even though they use the same
underlying P2P protocol, mainly due to the diverse interests of the participating entities.
In particular, we could precisely identify nodes which poison the DHT index, using simple
heuristics. Such a capability is particularly alarming to the security community, considering
that the Storm botnet aggressively defends itself by carrying out DDoS attacks. The
bigger implication of this study is that current botnet monitoring techniques easily lend
themselves to miscreant counter-intelligence, thereby motivating the need for
stealthier monitoring techniques.
Since network defense is essentially an arms race, we not only address current threats,
but also look ahead to threats that are likely to arise in the future. We show that, as mo-
bile devices become pervasive and more powerful, malware can exploit their mobility pat-
terns to trivially propagate around perimeter defenses such as firewalls. Using an analytical
model, we estimate the speed with which such infections can propagate over a population of
nomadic users. We validate our results using realistic mobility traces from a campus-wide
wireless network with hundreds of access points and thousands of mobile users. We show
that, while the speed of propagation of mobile malware is slower when compared to tradi-
tional Internet worms, it is still fast enough to render manual countermeasures implausible.
Furthermore, we develop a novel probabilistic technique which uses a modified
version of random moonwalks to provide early detection of such mobile malware. The
proposed technique can reliably detect and pinpoint the origin of a mobile infection in the
early stages of its evolution.
As another direction in countering threats, we address vulnerabilities of the web
browser. The browser is the single most widely used application on the Internet today.
However, its security policies are largely antiquated in today's increasingly multi-principal,
asynchronous web programming model. We develop two novel abstractions, one
for sandboxing untrusted third-party content and another which enables controlled sharing
between domains, to transform the browser into a truly multi-principal platform.
Given that worms pose a global-scale threat to the Internet infrastructure, an evaluation
of robustness, quantifying the ability of the Internet to withstand attacks, is essential.
Specifically, we measured the performance of anycast on four top-level Domain Name
System (DNS) zones, which allowed us to quantify the reliability and resiliency of the DNS
zones against large-scale distributed denial of service (DDoS) attacks. We showed that outages in
DNS service are indeed rare. However, when they do occur, outages can last up to multiple
minutes, mainly due to slow Border Gateway Protocol (BGP) convergence.
At the other end of the spectrum, Internet routers can also be the subjects of DDoS
attacks. One such attack, known as the shrew attack, consists of small traffic bursts that
can temporarily inundate a router's queue while, at the same time, evading detection due to
their low average transmission rate. Using simple mathematical analysis and simulation,
we show that a relatively small buffer, combined with a fair queue management scheme, is
sufficient to detect and thwart low-rate TCP attacks against routers.
Advisor: Dr. Andreas Terzis, Department of Computer Science, Johns Hopkins University
Primary Reader: Dr. Gerald M. Masson, Department of Computer Science, Johns Hop-
kins University
Secondary Reader: Dr. Cristina Nita-Rotaru, Department of Computer Science, Purdue
University
Acknowledgements
My first, and most earnest, acknowledgements must go to my advisor, Prof. Andreas
Terzis. He took me under his wing when I was at a crossroads in the pursuit of the doctorate
degree in 2004. Thereafter, he has been one of the most affable advisors I have ever known.
I shall remain grateful to him for granting me freedom and guiding me in the research topics
of my fancy.
I wish to thank Professor Jonathan Shapiro, for his extremely insightful rants and intro-
ducing me to the world of systems and security in the early days of my PhD. I also wish to
thank Professor Rao Kosaraju, for providing me with an opportunity to TA the Randomized
Algorithms course, multiple times. My interactions with Prof. Shapiro and Prof. Kosaraju
have taught me the importance of maintaining a balanced perspective, involving both a
theoretical and a systems standpoint. I also wish to thank Dr. Gerald M. Masson and Dr.
Cristina Nita-Rotaru for their valuable critiques while serving on my defense committee.
I have been part of the Hopkins Internetworking Group (HiNRG) during the past five
years. The meticulously maintained schedule of each of its inhabitants will remain etched
in my brain forever. I thank Razvan Musaloiu-E. for all his technical help, the monthly
photo shoots, and the chocolates; Moheeb Abu Rajab for being the only other colleague in
HiNRG pursuing network security; Chieh-Lan Mike Liang for adhering to my dark room
policy and listening to my nonstop drivel; and Jeongil (John) Ko, Yin Chen and Sam Small for
putting up with me.
The time spent outside of the department, be it on the lush meadows of the upper quad,
in #343, in the JHUCC van or the basement music room, has been memorable owing
largely to: Amit Paliwal for his witty remarks; Puneet Bajpai, Utkarsh Sharma and Paritosh
Shroff for their camaraderie; Dheeraj Singaraju, Ashley Fernandes, Supratim Ray, and
Purshottam Dixit for jamming along with Hallowed Be Thy Name ever so often; Kishore
Kothapalli, Anshumal Sinha, Sridhar Swaroop and Pramod Singh Thakur for all the
discussions on diverse subjects; and Ranganath Teki, Santosh Vijaykumar, Piyush Jain and
Saurabh Paliwal for being my roommates. Finally, a big thanks to Harris (1806-1860), who
egged me on my way to the lab daily.
The acknowledgment of the long-term support of my parents, Saratchandran and
Meera, and my sister, Sapna, may be ritual in a pursuit of this sort, but is nonetheless
necessary, apt and heartfelt.
Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 Motivation and a Brief Chronology
  1.2 Thesis Contribution and Outline
    1.2.1 Tracking P2P Botnets
    1.2.2 Using Mobility to Propagate Malware
    1.2.3 Isolation and Sharing in the Web Browser
    1.2.4 Anycast in DNS
    1.2.5 Router Buffer and Robustness

2 On Tracking P2P Botnets
  2.1 Background
    2.1.1 Command and Control
    2.1.2 Encrypted Protocol
  2.2 Measurement Methodology
  2.3 Results
    2.3.1 Node ID distribution
    2.3.2 Population Estimates
    2.3.3 Relationship between peer addresses and identifiers
    2.3.4 Data from Spam Block Lists
    2.3.5 Discussion
  2.4 Related Work
  2.5 Summary

3 On Using Mobility to Propagate Malware
  3.1 Worm Model
    3.1.1 Mobility Model
  3.2 Evaluation
    3.2.1 Mobile node infection
    3.2.2 Mixing mobile and static nodes
  3.3 Detection
    3.3.1 Detection Speed
  3.4 Spatial evolution
    3.4.1 Popularity
  3.5 Discussion
    3.5.1 Popularity is dynamic
    3.5.2 Evasive worms
  3.6 Background and Related Work
  3.7 Summary and Future Directions

4 On the Detection and Origin Identification of Mobile Worms
  4.1 Background
    4.1.1 Random Moonwalks
  4.2 Mobile Worm Detection
    4.2.1 Random Moonwalks and Mobile Worms
    4.2.2 Proposed approach
    4.2.3 Effect of infection on moonwalk length
  4.3 Worm Identification
    4.3.1 Discussion
  4.4 Related Work
  4.5 Summary and Future Work

5 On Web Browser Protection
  5.1 Background
    5.1.1 Same Origin Policy
    5.1.2 XSS attacks
  5.2 Trust Model
  5.3 Konqueror Implementation
  5.4 Related Work
  5.5 Summary and Future Work

6 On the Use of Anycast in DNS
  6.1 Background
  6.2 Measurement Methodology
  6.3 Anycast Deployment Strategies
    6.3.1 Multiple Instances, One Site: B-Root
    6.3.2 Multiple Instances, Multiple Heterogeneous Sites: F-, K-root
    6.3.3 Multiple Instances, Multiple Homogeneous Sites: UltraDNS
  6.4 Evaluation
    6.4.1 Response times
    6.4.2 Availability
    6.4.3 Constancy
    6.4.4 Effectiveness of Localization
    6.4.5 Comparison of Deployment Strategies
  6.5 Effect of Advertisement Radius
  6.6 Related Work
  6.7 Summary

7 On the Effect of Router Buffer Sizes on Low-Rate Denial of Service Attacks
  7.1 The Shrew Attack
  7.2 Mathematical Analysis
  7.3 Evaluation
    7.3.1 Low Speed Link
    7.3.2 High Speed Link
  7.4 Related Work
  7.5 Summary

8 Future Work
  8.0.1 Botnets
  8.0.2 Mobile Malware
  8.0.3 Web based Malware

Bibliography
Vita
List of Tables

4.1 Simulation Parameters.
6.1 Distribution of used PlanetLab nodes around the world.
6.2 List of the 26 F-root sites. The last column shows the percentage of PlanetLab nodes served by each F-root cluster. An example of an F-root server is SFO2a.f-rootservers.net.
6.3 List of the 7 K-root sites.
6.4 The list of the 8 UltraDNS clusters reachable from PlanetLab.
6.5 Statistics of DNS response times.
6.6 Percentage of flips due to outages.
7.1 Notation used in the mathematical analysis of the shrew attack.
7.2 Aggregate link utilization from 20 TCP flows.
7.3 Aggregate TCP link utilization for 250 flows.
List of Figures

2.1 The distribution of Storm bot IDs over the 128-bit hash space for (a) the original Storm botnet (b) the encrypted Storm botnet. The results in this figure are based on data collected on 11/19/07.
2.2 Population estimates of the botnets from 11/09/2007 - 1/29/2008 for the (a) older Storm botnet (b) encrypted Storm botnet.
2.3 Top 15 countries in which peers are located, percentage-wise. The last bar, NA (Not Available), comprises non-publicly-routable IP addresses. (a) Older Storm botnet (b) Encrypted Storm botnet.
2.4 (a) Distribution of IDs attributed to unusable IP addresses in the original Storm network. (b) The distribution of IDs attributed to valid IP addresses in the same network. The x-axis represents the 128-bit hash space.
2.5 Cumulative density function of the number of IDs associated with a single IP address and port. Unreachable/non-routable IP addresses were not included in this distribution.
2.6 Correspondence between the number of IDs published by an IP address and occurrence in spam black lists.
3.1 Percentage of infected users as a function of time as predicted by the analytical model and as demonstrated by simulation.
3.2 (a) Rate of domain infections as a function of time with the total mobile population (b) Rate of infection with only 25% of the mobile nodes.
3.3 The first time an infected node is seen at a network domain as a function of the domain's popularity, defined as the cumulative node-hours of occupancy of a domain.
3.4 Detection time when monitors are deployed in the top x% of the domains.
3.5 (a) Similarity between the popularity of the top 50 domains on a weekly basis for 2004 (b) Median detection time if the monitors are deployed statically.
3.6 Worm evolution when the worm is inactive in the top 50 domains.
4.1 (a) Random moonwalk on a network with no malicious traffic. (b) Random moonwalk on the same network when a worm is injected at t ~ 167 min. The y-axis represents the frequency with which flows starting at a particular time appear in the set of paths traversed by the moonwalks.
4.2 (a) Average moonwalk length for a network with no malicious traffic and a network in which a worm is injected at t ~ 420 min. Graphs are shown when 100% and 75% of the population is vulnerable. (b) Percentage of infected nodes as a function of time for the same worm.
4.3 Average moonwalk length for a network with different volumes of normal traffic. The curve labelled 'High' corresponds to double the volume of traffic in 'Norm', while the curve labelled 'Low' represents a scenario in which the traffic is halved.
4.4 Scatter plot of walk length versus root node frequency. The square dot indicates the actual patient zero.
4.5 Candidate infection trees reconstructed using a BFS search. The tree rooted at 5344 is the actual infection tree; all nodes in this tree were indeed infected by the worm. The trees rooted at 1167 and 2148 are benign. A directed edge between nodes X and Y indicates that X initiated at least one flow to Y.
4.6 Percentage of mobile nodes that need to be inspected for signs of infection as a function of the normal traffic intensity.
5.1 The proxy extension overlaid on top of a simplified JavaScript call graph.
6.1 Sample anycast configuration.
6.2 Histogram of correspondence between TLD1 vs TLD2 clusters contacted by PlanetLab nodes.
6.3 Response time CDF.
6.4 Percentage of unanswered queries by various servers.
6.5 CDF of outage duration.
6.6 CDF of inter-outage duration.
6.7 Number of outages observed by various servers.
6.8 Number of flips observed as a percentage of the total number of queries sent to each nameserver.
6.9 Period of time that PlanetLab nodes query the same server for the monitored servers.
6.10 CDF of the cluster stability of F-root and K-root.
6.11 Correlation of outages and flips for the F-root server. A similar correlation was observed for the K-root server.
6.12 Additional round-trip time for client queries to the anycast-selected F-root and TLD2 servers over the closest servers.
6.13 Additional distance over the optimal traveled by anycast queries to contact their F-root, K-root, TLD1 and TLD2 server.
6.14 Variation of server load with varying server advertisement radius for a random distribution of 200 clients. Redundancy is denoted by R.
6.15 Variation of average AS path length with change in the radii of the server for a random distribution of 200 clients.
Dedication
To all those authors who begin with
“The Internet has witnessed an explosive growth”.
xvii
Chapter 1
Introduction
The Internet has witnessed an explosive growth over the past decade. Such growth has
meant that the Internet is now ubiquitous and used by a large set of entities with diverse
and possibly conflicting interests. As a result of this increased global scope, security is
essential for any Internet-enabled system. Over the years, the form and character of security
threats to network users and the network itself have evolved significantly. The motive of
the attacker has also seen a decisive shift from fun towards profit. Furthermore, attackers
have become increasingly sophisticated, exploiting vulnerabilities in multiple layers of
existing technologies as well as those in emerging, less mature technologies.
1.1 Motivation and a Brief Chronology
Early worms (e.g., Code Red I, 2001 [1]) used naive random scanning approaches to
infect new victims. However, they could be easily detected owing to the large amounts of
noisy scans generated as a side-effect. Evidence from recent malware, such as the Agobot
virus [2], shows attackers increasingly employing measures to evade detection and monitoring
infrastructures. Furthermore, the phenomenon of botnets is now commonplace,
whereby infected machines (bots) are unwittingly drafted into a network, called the
botnet [3]. A botnet can then be engineered to carry out a host of secondary malicious
activities, ranging from spamming and phishing to denial-of-service (DoS) attacks. The
controllers of this network (a.k.a. botmasters) communicate with the bots using a command
and control (C&C) channel, typically IRC [4]. While IRC channels remain the predominant
C&C method even today, new botnets are emerging which use a decentralized
communication mechanism, e.g. P2P, for reasons of increased robustness [5]. Finally,
the delivery mechanisms for malware have been increasing in sophistication and number.
Targeted attacks using fingerprinted web browsers and operating systems are increasingly
commonplace. Botnet sizes typically run from the tens of thousands to millions of bots.
A DoS attack from such botnets can cause great collateral damage to organizations
and to the infrastructure of the Internet itself. There have been numerous instances
of such events in the past. For example, in 2007, an attack on the DNS root servers nearly
took down three root zones [6]. Consequently, it is essential to quantify the ability of
the Internet infrastructure to withstand such large-scale DoS attacks.
1.2 Thesis Contribution and Outline
As the brief chronology in the previous section suggests, the malware ecosystem is
constantly evolving. In accordance, this thesis follows a multifaceted approach towards
addressing some of the security issues facing the Internet. First, measurement studies are
performed to experimentally quantify the ground truth related to the security threats in the
wild. Specifically, we study the Storm Worm [7], which represents the leading edge in
botnet technology. Since security research is essentially an arms race, we develop
countermeasures for existing threats and also look ahead to threats likely to occur in the future. We
develop novel abstractions to counter web browser vulnerabilities, e.g., cross site scripting
(XSS) [8]. These abstractions replace the antiquated browser security policies and can be
used to securely sandbox web content while allowing controlled sharing. Advancement
in communications technology presents new avenues for malware propagation. One such
emerging phenomenon is mobility. As mobile devices become pervasive and more powerful,
malware can exploit mobility as a vector for propagation. We analyze this phenomenon
and devise a novel technique to detect and contain the spread of a mobile worm.

While the above research deals mainly with end-users, at the other end of the spectrum
is the core of the Internet. Given that botnet sizes can run into the millions, and thus their
potentially lethal DDoS capabilities, we study whether the Internet is engineered for robustness.
Specifically, we study whether the DNS fabric of the Internet is robust in terms of availability
and resiliency. Finally, we conduct a short experiment on router buffer sizes and their
effect on DoS attacks. Of late, there has been a renewed interest towards reducing the size
of buffers in Internet routers. We study whether this reduction would enhance the lethality of
DoS attacks by making it easier for them to camouflage themselves as normal traffic.
Specifically, we look at shrew attacks, which are low-volume DoS attacks. We now briefly
describe the upcoming chapters and their corresponding contributions.
1.2.1 Tracking P2P Botnets
P2P botnets, which use DHTs for their C&C channels, are a relatively new entrant in
the botnet ecosystem. We track and study one such botnet, the Storm worm. The Storm
botnet, also known as Trojan.Peacomm [9, 10], made its first appearance in January 2007.
Storm is notable for its use of Kademlia [11] to coordinate the infected hosts and its use
of fast-flux DNS services to distribute binary updates [12]. Moreover, Storm aggressively
defends itself by resisting reverse engineering attempts and executing DDoS attacks against
external hosts that attempt to probe its operations [13]. Due to these aggressive mechanisms
and its distributed nature, little is known about the network of Storm-infected hosts. We
developed a crawler based on the Overnet protocol, and used it to crawl the Storm DHT.
Using this crawler, we estimate that approximately 300,000 end-hosts were members of
the Storm botnet during November 2007. Perhaps more important than the size estimates
are the anomalies we discovered during this process. First, unlike traditional DHTs, the
distribution of keys stored in the DHT is not uniform over the hash space as it is in other P2P
systems. Furthermore, we found a small percentage of nodes that publish an abnormally
large number of IDs. We provide evidence that these findings are the side-effects of actions
by entities external to the Storm network, meant to track and interfere with its operations.
Unfortunately, the fact that we were able to detect these activities also suggests that
sophisticated botmasters can also discover the "good" nodes in the botnet (i.e., those that monitor
its operation and do not participate in malicious activities). In essence, this study shows the
weakness of current botnet monitoring technology and serves as a call to arms for the
development of more stealthy methods to monitor P2P botnets. Chapter 2 of this thesis deals
with tracking P2P botnets.
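The anomaly described above, a few endpoints publishing abnormally many DHT IDs, lends itself to a very simple detection heuristic. The following is only an illustrative sketch, not the actual crawler code: the record format, the threshold of 10 IDs per endpoint, and the addresses are all assumptions made here for the example.

```python
from collections import defaultdict

def flag_poisoners(observations, threshold=10):
    """Group crawled (node_id, ip, port) records by endpoint and flag
    endpoints that publish at least `threshold` distinct DHT IDs."""
    ids_per_endpoint = defaultdict(set)
    for node_id, ip, port in observations:
        ids_per_endpoint[(ip, port)].add(node_id)
    return {ep: len(ids) for ep, ids in ids_per_endpoint.items()
            if len(ids) >= threshold}

# Hypothetical crawl records: one endpoint publishes 50 distinct IDs,
# while ordinary peers publish one each.
records = [("id%d" % i, "198.51.100.7", 4000) for i in range(50)]
records += [("idA", "203.0.113.5", 4001), ("idB", "203.0.113.9", 4002)]
print(flag_poisoners(records))  # -> {('198.51.100.7', 4000): 50}
```

A real deployment would of course tune the threshold against the observed CDF of IDs per address (cf. Figure 2.5) rather than fix it a priori.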
1.2.2 Using Mobility to Propagate Malware
New advancements in technology provide new avenues for malware propagation. One
such development, which has become a pervasive feature among computing devices, is the
ability to remain connected while being mobile. According to industry reports, 812 million
mobile terminals were sold in 2005 and sales of new devices are expected to top 1
billion in 2008 [14]. Looking forward, we expect widespread adoption of technologies
such as WiMAX, mesh networks and even vehicular wireless networks. This increase in
connectivity, however, comes at a high price: failure to properly secure these media will
provide new avenues for malicious behaviour. As a matter of fact, the exploitation of these
media is not just our speculation: variants of the Zotob/Mytob worm are suspected to have
used "physical" transfer as a propagation strategy [15]. More recently, a series of malware
that attempt to exploit Bluetooth connections as a medium for spreading were reported in
the media.
To better understand this impending threat, we investigate how mobility can be
exploited across a large number of end-hosts. Using an analytical model, we estimate the
speed with which such mobile contagion can propagate over a population of nomadic users.
We validate our results using realistic mobility traces from a campus-wide wireless network
deployment with hundreds of access points and thousands of mobile users. We show that,
while the speed of propagation of mobile malware is slower when compared to traditional
Internet worms, it is still fast enough to render manual countermeasures implausible.
Furthermore, given this sort of mobile contagion, we devise a novel technique using random
moonwalks to provide early detection of such mobile malware. We show that the proposed
mechanism can reliably detect the spread of a mobile worm in the early stages of its
evolution. We also devise techniques to conduct post-mortem forensic analysis of an infection,
whereby the originator of an infection (patient zero) can be identified with reasonable
accuracy. Chapter 3 deals with modelling the evolution and understanding the properties
of mobile worms. The detection and forensic methods for mobile contagion are presented in
Chapter 4.
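To give a flavor of the kind of analytical model used for propagation speed, a minimal susceptible-infected (SI) epidemic sketch is shown below. This is not the model of Chapter 3, which is driven by real mobility traces; the population size, contact rate, and time step here are assumptions chosen purely for illustration.

```python
def si_fraction(beta, n, i0, steps, dt=1.0):
    """Discrete-time SI epidemic: each infected node makes `beta`
    effective contacts per time step; a contact with a susceptible
    node infects it.  Returns the infected fraction after each step."""
    i = float(i0)
    curve = []
    for _ in range(steps):
        i += dt * beta * i * (n - i) / n  # logistic growth term
        i = min(i, n)                     # cannot exceed the population
        curve.append(i / n)
    return curve

# Assumed parameters: 1000 nomadic users, one initially infected device,
# 0.5 effective contacts per infected device per time step.
curve = si_fraction(beta=0.5, n=1000, i0=1, steps=40)
print("infected fraction after 40 steps: %.3f" % curve[-1])
```

Under these assumptions the infection follows the familiar S-shaped logistic curve: slow initial growth, a sharp rise once a critical mass is infected, then saturation, which is exactly why early detection (before the knee of the curve) matters.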
1.2.3 Isolation and Sharing in the Web Browser
Web browsers have relied on the Same Origin Policy (SOP) [16] to dictate trust
relationships between content loaded from websites. Content includes both data and code, in
the form of scripts. The SOP prevents a document or script loaded from one site of origin
from manipulating the properties of, or communicating with, a document loaded from
another site of origin. However, this policy provides absolutely no granularity and, as we
shall discuss, its all-or-nothing approach is the source of many browser vulnerabilities. For
example, the cross site scripting (XSS) attack is a confused-deputy problem which exploits
the browser's trust in executing all the scripts presented on a webpage. We present two
abstractions to address this issue. First, we develop an isolation abstraction which isolates all
unauthorized content. Second, we develop a sharing abstraction, which enables controlled
communication between entities. In Chapter 5 we show that, using the abstractions presented
above, attacks such as XSS can be countered.
1.2.4 Anycast in DNS
While the previous topics mainly address the security of end-hosts, the other key component in dealing with the robustness of the Internet is the study of its core components. The DNS is one such integral component and is used to resolve names to IP addresses. Anycast is a routing mechanism used in DNS zones [17]. We evaluate the reliability and resiliency of anycast. In this study, we use results from four top-level DNS servers to evaluate whether anycast indeed improves DNS service and compare different anycast configurations. Increased availability is one of the supposed advantages of anycast, and we found that the number of observed outages was indeed smaller for anycast, suggesting that it provides a mostly stable service. On the other hand, outages can last up to multiple minutes, mainly due to slow BGP convergence [18]. We also found that anycast indeed reduces query latency. Furthermore, depending on the anycast configuration used, 37% to 80% of the queries are directed to the closest anycast instance. Our measurements revealed an inherent trade-off between increasing the percentage of queries answered by the closest server and the stability of the DNS zone, measured by the number of query failures and server switches. Chapter 6 presents the results of our study on anycast.
1.2.5 Router Buffer and Robustness
Of late, there has been renewed interest in reducing router buffer sizes. This research is driven by increasing bandwidth speeds pushing up the cost of expensive buffer memory and the power consumption of memory chips. Router queues buffer packets during congestion epochs. A recent result by Appenzeller et al. [19] showed that the size of FIFO queues can be reduced considerably without sacrificing utilization. While Appenzeller showed that link utilization is not affected, the impact of this reduction on other aspects of queue management, such as fairness, is unclear. We investigate whether the reduction of buffer size renders DoS attacks more effective. While brute-force DoS attacks can be easily detected and contained, low-rate DoS attacks, called shrews, can throttle TCP connections by causing periodic packet drops [20]. Unfortunately, smaller buffer sizes make shrew attacks more effective and harder to detect, since shrews need to overflow a smaller buffer to cause drops. We show that a relatively small increase in the buffer size over the value proposed by Appenzeller is sufficient to render the shrew attack ineffective. Intuitively, bigger buffers require the shrews to transmit at much higher rates to fill the router queue. However, by doing so, shrews are no longer low-rate attacks and can be detected by Active Queue Management (AQM) techniques such as RED-PD [21]. The results from this experiment are presented in Chapter 7.
Chapter 2
On Tracking P2P Botnets
Botnets, networks of compromised machines under the control of botmasters, represent a significant threat to the Internet today. While traditionally using centralized command and control (C&C) architectures (e.g., IRC servers) to control their bots, botmasters have recently begun to employ P2P protocols for these tasks [4]. P2P architectures avoid the single points of failure inherent to IRC-based botnets, thus rendering them less vulnerable to takedown attacks that target the C&C servers.
The Storm botnet, also known as Trojan.Peacomm [9, 10], is one such P2P botnet which made its first appearance in January 2007. Storm is notable in its use of a DHT protocol, Kademlia [11], to coordinate the infected hosts and its use of fast-flux DNS services to distribute binary updates [12]. Storm aggressively defends itself by resisting reverse engineering attempts and executing DDoS attacks against external hosts that attempt to probe its operations [13]. Furthermore, an encrypted version of Storm emerged in the latter half of 2007, which used the XOR operation to encrypt its P2P network traffic. Due to such aggressive defense mechanisms and its distributed nature, little is known about the network of Storm-infected end-hosts.
Using these two versions of the Storm binary as examples of P2P botnets, we present estimates of their size and properties, using a custom-made crawler. Furthermore, we show that current techniques for tracking P2P botnets lend themselves to counter-intelligence employed by the botmasters. Such counter-intelligence can reveal the identities of the botnet trackers and thus subject them to attacks by these powerful botnets.
Using the crawler, we estimate that the population sizes of the older and encrypted versions of the Storm botnet were around 300,000 and 30,000, respectively, in November 2007.
Perhaps more important than the size estimates are the anomalies unearthed during our crawling process. Even though both versions use the same underlying Kademlia-based protocol, we discovered that the P2P keys stored in the older Storm botnet are not uniformly distributed over the hash space, as is typical of DHTs. This non-uniformity is primarily due to keys which point to unreachable and non-routable IP addresses (e.g., private, multicast, loopback, and unallocated IP addresses). While this irregularity is largely absent in the latter botnet, we found other atypical artifacts common to both DHTs. For example, we found a small percentage (<1%) of routable IPs that publish thousands of IDs, while the vast majority of Storm nodes publish only a small set of IDs.
We provide evidence that these findings are the side-effects of actions by entities external to the Storm network, meant to track and interfere with its operations. In that respect, they represent practical applications of the index poisoning attacks previously theorized by a number of researchers [22, 23]. Unfortunately, the fact that we were able to detect these activities also suggests that sophisticated botmasters can discover the "good" nodes in the botnet (i.e., those that monitor its operation and do not participate in malicious activities). This capability can subsequently be used by the botmaster to launch a DDoS attack against these nodes. Furthermore, even our crawler can be easily detected, due to the abnormally high number of queries it generates. While distributing the tracking task among multiple machines can alleviate the detection problem, we argue that doing so requires a large distributed network of monitors due to the Storm network's properties. Taken as a whole, our results should serve as a call to arms for the development of stealthier methods to monitor P2P botnets.
The rest of the chapter is divided into six sections. The section that follows outlines
Storm’s P2P architecture, while Section 2.2 describes our measurement methodology. We
present our findings from these measurements in Section 2.3 and cover related work in
Section 2.4. Finally, we close in Section 2.5 with a summary and future research directions.
2.1 Background
We summarize the functionality of the Storm network. Given the focus of our work, we
outline Storm’s Command and Control (C&C) protocol rather than presenting its different
infection vectors and malicious activities. Readers interested in these aspects are directed
to [7, 9, 10].
2.1.1 Command and Control
Storm uses the Overnet protocol, which in turn is based on the Kademlia Distributed Hash Table (DHT) [11]. Each peer, as well as each object stored in an Overnet network, is associated with a 128-bit identifier (ID). Peer identifiers are randomly generated using the MD4 cryptographic hash function [24]. Routing in Overnet is based on prefix matching, whereby the distance between two IDs is equal to the XOR of the two identifiers. For example, the distance between a = 0001 and b = 1110 is d(a, b) = a ⊕ b = 0001 ⊕ 1110 = 1111.
Overnet nodes organize their routing tables as lists of k-buckets. Specifically, for each 0 ≤ i < 128, the corresponding k-bucket holds up to k (= 20) <IP address, UDP port, Node ID> tuples corresponding to nodes whose distance from the current node falls within the [2^i, 2^(i+1)) range. This routing table resembles an unbalanced routing tree in which a node maintains only a few contacts to peers that are far away (i.e., corresponding to large values of i) and increasingly more contacts to nodes within shorter distance.
When an Overnet node receives any message (request or reply) from another node, it updates the appropriate k-bucket with the sender's node ID. The k-buckets effectively implement a least-recently-seen eviction policy, except that live nodes are never removed from the list. An Overnet node that receives a request for an ID returns the tuples of the k nodes it knows about that are closest to the requested ID. These tuples can come from a single k-bucket, or from multiple k-buckets if the closest k-bucket is not full. Routing then proceeds iteratively, by querying each successive peer on the route to the destination.
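The XOR distance metric and bucket selection described above can be sketched as follows (an illustrative sketch, not code from this work; the function names are ours):

```python
# Sketch of Kademlia/Overnet XOR distance and k-bucket selection.

def xor_distance(a: int, b: int) -> int:
    """Distance between two node IDs is their bitwise XOR."""
    return a ^ b

def bucket_index(own_id: int, other_id: int) -> int:
    """Return i such that the distance falls in [2^i, 2^(i+1))."""
    d = xor_distance(own_id, other_id)
    if d == 0:
        raise ValueError("a node does not place itself in a bucket")
    # The bucket index is the position of the highest set bit of d.
    return d.bit_length() - 1

# The 4-bit example from the text: d(0001, 1110) = 1111.
assert xor_distance(0b0001, 0b1110) == 0b1111
# That distance lies in [2^3, 2^4), i.e., k-bucket i = 3.
assert bucket_index(0b0001, 0b1110) == 3
```

Because the bucket index is the position of the highest differing bit, far-away peers (large i) share one bucket per distance octave, giving the unbalanced tree described above.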
When a new peer joins the network, it inserts itself into the contact lists of other nodes by performing a lookup for its own ID. Moreover, peers periodically query their own ID in an effort to keep their own as well as their peers' routing information fresh. When a peer wants to store an object corresponding to an <ID, value> pair, it locates the k nodes closest to the ID and asks them to store it. Finally, to limit the stale information in the system, publishers and peers periodically republish the <ID, value> pairs that they hold.
While Overnet suggests that nodes have persistent IDs, we observed that Storm-infected hosts choose different IDs every time they reboot and also when their DHT searches fail. Furthermore, the Storm binary has a hard-coded list of over 400 initial peers which it uses to attach itself to the network. Considering the large percentage of end-hosts residing in private address space, Overnet includes a special NAT discovery mechanism. Bots use this mechanism to detect whether they reside behind a NAT device and, if so, to advertise their globally visible IP address (rather than their private address) when they join the network.
In addition to performing periodic queries for their own IDs, Storm bots periodically search for a set of keys stored in the Overnet network. According to Stewart [7], bots generate these search keys through a built-in algorithm that uses the current date and a random number uniformly selected from [0...31]. The values associated with those keys contain an encrypted URL that the bots decrypt and retrieve using HTTP. We noticed that nodes change their own IDs when key searches for these URLs fail. A bot will then rejoin the DHT with its new ID and restart its search for the URL.
We note that because Storm uses the same Overnet protocol that popular file-sharing networks use (e.g., eDonkey and eMule), non-infected hosts can be used to store keys for Storm. The botnet was probably designed this way to leverage these P2P networks as a bootstrap mechanism during its early stages [5].
2.1.2 Encrypted Protocol
An encrypted version of the Storm C&C protocol emerged in October 2007. While based on the same Overnet protocol described above, the field types and values contained in the messages exchanged using this version of the protocol are encrypted using the XOR operation. For example, we observed that all Overnet message types were XOR'ed with the 0xAA key.
Given this weak encryption mechanism, we were able to discover the keys that Storm uses by observing the traffic between a machine running the encrypted Storm binary and a custom Overnet client that we developed. Specifically, we employed a setup similar to a virtual playground [25], in which we record and redirect all the traffic generated by the bot binary to the custom Overnet client. The client replies to the bot's queries with pre-defined responses.
We provide two examples of the methods used to retrieve the botnet's keys. First, the bot uses a .ini file which contains the list of peers it contacts to bootstrap itself to the DHT. This list of peers is stored in the clear and is under our control. Then, by XOR-ing the (encrypted) IP address included in an Overnet message with the IP address listed in the file, one can extract the key used to encrypt IP addresses. Likewise, our Overnet client replies to the bot's search queries with known IP addresses. Then, by observing the IP addresses that the bot attempts to connect to after receiving the client's response, we were able to derive the keys used in search queries.
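The known-plaintext key recovery described above can be sketched as follows (an illustrative sketch; the function names and byte values are our own examples, not Storm's actual keys):

```python
# Sketch of known-plaintext key recovery for a XOR cipher: since
# ciphertext = plaintext XOR key, we have key = ciphertext XOR plaintext.

def recover_key(ciphertext: bytes, known_plaintext: bytes) -> bytes:
    """Recover a XOR keystream from a ciphertext/plaintext pair."""
    return bytes(c ^ p for c, p in zip(ciphertext, known_plaintext))

def apply_key(data: bytes, key: bytes) -> bytes:
    """XOR is its own inverse: this both encrypts and decrypts."""
    return bytes(d ^ k for d, k in zip(data, key))

# Suppose the cleartext .ini file lists peer IP 10.0.0.1, and the field
# observed on the wire is that IP XOR'ed with a (hypothetical) key.
plain_ip = bytes([10, 0, 0, 1])
key = bytes([0xAA, 0xAA, 0xAA, 0xAA])   # single-byte key, repeated
cipher_ip = apply_key(plain_ip, key)

# XOR-ing the encrypted field with the known cleartext yields the key,
# which then decrypts any other field encrypted the same way.
assert recover_key(cipher_ip, plain_ip) == key
assert apply_key(cipher_ip, key) == plain_ip
```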
Using similar techniques we were able to retrieve all the keys necessary to traverse
Storm’s P2P network using the crawler presented in the section that follows.
2.2 Measurement Methodology
We leverage the Overnet protocol described above to discover properties of the Storm infection. Specifically, we developed a crawler that queries the network for randomly generated keys and records the node IDs, IP addresses, and port numbers that the peers return. Whenever the crawler receives a new usable (i.e., routable) IP address, it sends a query for another random ID to the corresponding node. The query process continues until the crawler's rate of discovery of new IPs becomes effectively zero. We seed the crawler's search with a list of peer IPs we collected by executing an instance of the Storm binary within a Qemu honeypot [26] running Windows XP. Through this process, we gathered a set of ~4,000 initial IPs for the older Storm botnet and ~2,200 IPs for the encrypted version.
We perform two types of crawls: a full crawl and a zone crawl. The IDs that the crawler queries during a full crawl are selected from the entire 128-bit space. On the other hand, the IDs queried during a zone crawl share the same prefix. For example, the 0x0A 8-bit prefix contains all 128-bit IDs whose most significant eight bits have value 0x0A.
While full crawls provide a more complete view of the network, they are resource-intensive in terms of the network traffic they generate and the amount of storage required for the results. Moreover, they require hours to complete, during which time the network's membership might change. On the other hand, 8-bit zone crawls typically finish in 10 minutes and generate significantly fewer queries (the reduction is proportional to the size of the zone queried).
2.3 Results
We present results derived from a measurement study conducted over a period of about three months (11/09/2007 - 1/29/2008), using the methodology described above.
2.3.1 Node ID distribution
As previously explained, full crawls are slow, resource-intensive, and potentially inaccurate. Therefore, we prefer to use zone crawls to estimate the number of peers within these zones and extrapolate the results to the full population. However, in order to do so, we need to ensure that the measured zones are representative of the larger network. Given that Storm node IDs are generated by a cryptographically secure hash function (i.e., MD4), they should be uniformly distributed over the 128-bit address space. Furthermore, results from other Kademlia-based networks have experimentally verified the existence of this uniform distribution [27]. Nonetheless, we conducted full crawls of both versions of the Storm network (i.e., the one with no encryption and the latter one which uses XOR encryption) to verify this conjecture.

Figure 2.1: The distribution of Storm bot IDs over the 128-bit hash space (percentage of peers per zone, 0x00 through 0xe0) for (a) the original Storm botnet and (b) the encrypted Storm botnet. The results in this figure are based on data collected on 11/19/07.
Figure 2.1 presents the results of two such full crawls performed on 11/19/07 (we observed similar patterns during other dates). Specifically, the left chart presents the distribution of IDs in the original Storm network, while the chart to the right presents the ID distribution for the network that uses encryption. While the distribution of IDs in the encrypted botnet is approximately uniform, IDs in the original network display marked non-uniformities, with recurring ramp-like structures that repeat at the beginning of each 3-bit zone. We defer the discussion about the underlying causes of this surprising non-uniformity until Section 2.3.3. For now, we use this result to select the length of the zones to crawl.
2.3.2 Population Estimates
For the older botnet, because the ID distribution has a regular pattern which recurs in every 3-bit zone, we infer the size of the overall population by crawling 3-bit zones. Specifically, we count the number of IPs whose corresponding IDs are within the zone and then multiply this count by the total number of zones. We chose two 3-bit zones: 0x00/3 and 0x80/3. We selected these two zones because, in Figure 2.1, the 0x00/3 zone accounts for the highest percentage of Storm peers, while 0x80/3 represents the average case. We followed a similar procedure for the encrypted version. Figure 2.2 presents the population estimates we derived by crawling both botnets. We see that the average population is about 400,000 for the older botnet and about 50,000 for the encrypted version. These values are in accordance with estimates from Microsoft (~500,000) [28] for the older botnet.
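The zone-crawl extrapolation above amounts to a simple back-of-the-envelope calculation, sketched below (the zone count is illustrative, not a measured figure):

```python
# Sketch of the zone-crawl extrapolation: count distinct usable IPs in an
# n-bit zone (which covers 1/2^n of the hash space), then multiply by the
# number of such zones, 2^n.

def extrapolate_population(ips_in_zone: int, zone_prefix_bits: int) -> int:
    """Scale a per-zone IP count up to a full-network estimate."""
    return ips_in_zone * (2 ** zone_prefix_bits)

# e.g., ~50,000 usable IPs found in one 3-bit zone (1/8 of the space)
# would imply a network-wide population of ~400,000.
assert extrapolate_population(50_000, 3) == 400_000
```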
We further categorized the end-hosts discovered during the zone crawls based on their country of origin. To do so, we utilized the MaxMind database [29] to map IP addresses to countries. Figure 2.3 presents the 15 countries with the highest number of infected nodes. For the older botnet, the continent with the highest percentage of infected hosts was North America, with the United States contributing approximately 30% of the peers. This population distribution deviates from the one observed for the popular KAD P2P file-sharing network, which is also based on Kademlia [27]. For the encrypted version of the botnet, the proportion was more uniform, with approximately 10% of the peers in North America. Furthermore, Figure 2.3 also indicates that, in both cases, <1% of the IP addresses we encountered could not be resolved by the database (listed as NA). This set consisted of private, multicast, loopback, and unallocated/reserved IP addresses. The existence of such addresses was unexpected because Storm uses an IP query-response mechanism to traverse NAT boxes and cannot communicate with unreachable/private IP addresses. Nonetheless, we do not include these addresses in the population estimates presented above.

Figure 2.2: Population estimates of the botnets from 11/09/2007 - 1/29/2008 for (a) the older Storm botnet and (b) the encrypted Storm botnet.
2.3.3 Relationship between peer addresses and identifiers
The unexpected non-uniformity of the ID distribution led us to investigate the IDs associated with each IP address. We found that approximately 1% of the IP addresses we encountered in the original Storm botnet consisted of private, multicast, loopback, and unallocated/reserved IP addresses. The existence of such addresses is unexpected because Storm, as described in Section 2.1.1, uses an IP query-response mechanism to detect the existence of NAT gateways and thus does not normally announce unreachable/private IP addresses.

Figure 2.3: Top 15 countries in which peers are located, by percentage. The last bar, NA (Not Available), comprises non-publicly-routable IP addresses. (a) Older Storm botnet. (b) Encrypted Storm botnet.
To our surprise, we found that this group of IP addresses is associated with 45% of the unique peer IDs we recorded, even though it accounts for <1% of the total population of nodes. In other words, 45% of the <IP address, UDP port, Node ID> tuples stored in the Storm network point to unusable IP addresses. Furthermore, as Figure 2.4(a) illustrates, the IDs associated with these IP addresses are the main contributors to the non-uniformity shown in Figure 2.1(a). As a matter of fact, the overall identifier distribution becomes considerably more uniform after removing these IDs (see Fig. 2.4(b)).
The most plausible explanation for the existence of these IDs is that some peers are deliberately injecting 'bogus' identifiers that point to unusable addresses in an attempt to poison the DHT. First studied by Liang et al. [23], index poisoning refers to the process of inserting a massive number of invalid records into a DHT's index, in an attempt to slow down lookups. In fact, other researchers have previously speculated that index poisoning could be an effective attack against the Storm botnet [22].

Figure 2.4: (a) Distribution of IDs attributed to unusable IP addresses in the original Storm network. (b) The distribution of IDs attributed to valid IP addresses in the same network. The x-axis represents the 128-bit hash space.
While the ramp-like structure is largely absent from the encrypted Storm botnet, we discovered other anomalous patterns common to both botnets which indicate possible instances of trackers/monitors. Specifically, we investigated whether IP addresses were associated with multiple IDs. As Figure 2.5 illustrates, for the original Storm botnet approximately 85% of the addresses are associated with a single ID. In the case of the encrypted version, 85% of the addresses are associated with <10 IDs. However, in both cases a very small percentage of IP addresses (<1%) are associated with a large number of IDs, some of them in the thousands.
We note that while multiple infected hosts behind the same NAT device would advertise the same IP address, they would use different ports and therefore do not contribute to the phenomenon shown in Figure 2.5. Moreover, while we noticed particular periods during which Storm specimens published different IDs every 10 minutes (typically when their searches failed), that behavior cannot account for the very large number of IDs shown in Figure 2.5. The reason is that stale IDs are removed from the Overnet DHT after 24 hours, while consecutive crawls executed 24 hours apart consistently registered the same IP addresses publishing large numbers of IDs. We are thus left with the only explanation that some nodes deliberately inject multiple IDs in an effort to interfere with or to monitor the Storm network.

Figure 2.5: Cumulative distribution function of the number of IDs associated with a single IP address and port, for the unencrypted and encrypted botnets. Unreachable/non-routable IP addresses were not included in this distribution.
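The anomaly described above suggests a simple detection heuristic, sketched below (a minimal sketch in our own naming; the threshold and sample data are illustrative, not measured values):

```python
# Sketch of a tracker-detection heuristic: group observed
# <IP, port, Node ID> tuples by endpoint and flag endpoints that publish
# an anomalously large number of distinct IDs.

from collections import defaultdict

def flag_suspected_trackers(tuples, threshold=100):
    """tuples: iterable of (ip, port, node_id). Returns flagged endpoints."""
    ids_per_endpoint = defaultdict(set)
    for ip, port, node_id in tuples:
        ids_per_endpoint[(ip, port)].add(node_id)
    return {endpoint for endpoint, ids in ids_per_endpoint.items()
            if len(ids) > threshold}

# A typical bot publishes a handful of IDs; a tracker publishes thousands.
observed = [("198.51.100.7", 4000, i) for i in range(5000)]   # suspected tracker
observed += [("203.0.113.9", 6881, i) for i in range(3)]      # ordinary bot
assert flag_suspected_trackers(observed) == {("198.51.100.7", 4000)}
```

Grouping by (IP, port) rather than IP alone avoids flagging multiple infected hosts behind the same NAT device, consistent with the observation above.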
2.3.4 Data from Spam Block Lists
To buttress our claim that nodes publishing a large number of IDs are not legitimate Storm nodes, we use an out-of-band mechanism to verify whether a given IP address truly hosts an instance of Storm. We leverage the fact that Storm sends gigantic amounts of unsolicited email [30]. Therefore, one would expect that active Storm nodes will be listed in spammer blacklists. This intuition was verified by Ramachandran et al., who found that presence in a blacklist is reasonably accurate in predicting whether a host is sending unsolicited email [31]. Given this result, we use the simple heuristic of classifying the IP addresses found in a zone crawl that are also listed in spammer blacklists1 as infected hosts. We then plot the percentage of blacklist-identified hosts versus the number of IDs associated with each host.
As Figure 2.6 suggests, the percentage of blacklisted nodes decreases with the number of IDs the nodes publish. This trend suggests that nodes which publish numerous IDs do not seem to participate in the malicious activities of the botnet and therefore are not likely to be infected hosts. Further confirmation of the conjecture that the IPs advertising a large number of IDs actually host trackers is the observation that these IPs belong to organizations which have traditionally been involved in security research. In fact, some of these organizations have even published media reports regarding the Storm phenomenon. Finally, a few of the nodes associated with large numbers of IDs explicitly published their IDs even to our crawler. This is despite the fact that our crawler does not insert itself in the P2P network, but merely queries it. Legitimate Storm nodes we tested against did not contact our crawler with such messages.

1We use a number of popular DNS blacklists (CBL [32], TQMCUBE [33] and UCEPROTECT [34]). We allow two to three days for the collected IP addresses to appear in these blacklists.
Figure 2.6: Correspondence between the number of IDs published by an IP address and its occurrence frequency in spam blacklists.
2.3.5 Discussion
In the previous section, we have shown that it is simple to identify Storm trackers/monitors by exploiting the relation between node IDs and the IP addresses they publish. By polluting the DHT index with a large number of IDs, the trackers can redirect P2P queries to themselves. Doing so allows them to answer queries with fake results, thus interfering with the botnet's operation. Furthermore, the same mechanism can be used to measure the size of the botnet. For example, the original Storm bots co-existed with non-infected hosts using the same Overnet network. Since every Storm bot queries for a known set of search keys every day, monitoring this key-space can provide an estimate of the actual number of Storm bots. Alternatively, one could track P2P botnets by actively crawling the DHT, as we have done in this paper. However, the nodes that perform these crawls can also be detected due to the anomalously large numbers of queries they generate.
The unfortunate implication of these results is that botmasters will eventually notice the existence of nodes interfering with their networks, probably using techniques similar to the ones presented in this paper. They can then launch attacks against the nodes which appear to monitor/poison the botnet. Such a counter-attack is not far-fetched considering that the Storm worm is believed to have launched similar aggressive retaliatory attacks in the past [13].
One potential solution to this predicament is to use multiple trackers, thereby reducing the number of IDs advertised by each tracker. As Figure 2.5 suggests, increasing the number of trackers by a factor of ~50 would reduce the suspicion on any particular tracker. In the case of nodes that crawl the DHT, we observed that, on average, only ~10% of the Storm nodes replied to our queries. This could be because the rest were behind NAT gateways or firewalls. For this reason, our crawler queries the reachable nodes multiple times during a single crawl. We found that, on average, a globally-reachable Storm node receives 23 times more queries while a crawl is in progress, compared to its normal query load. This wide difference in query load can be used to notify a botmaster that a scan is in progress and to identify the set of nodes participating in this scan.
The scale factors presented suggest that using multiple co-operative trackers/crawlers could alleviate the problem of monitor identification. An in-depth study of these techniques is deferred to future work.
2.4 Related Work
The majority of botnets today use Internet Relay Chat (IRC) to disseminate commands to individual bots. IRC's centralized architecture allows snooping of the C&C channels, thereby potentially revealing botnet membership and the commands passed onto the bot armies [4]. Due to these shortcomings, botmasters have innovated by migrating to C&C protocols that are harder to detect and infiltrate.
The Storm worm is a prime example of this evolution in the botnet ecosystem, due to its DHT-based C&C protocol and aggressive defensive capabilities. Consequently, it has been the subject of multiple research reports, most of them focusing on the analysis of the Peacomm binary. Grizzard et al. present a case study of Peacomm, by running a single specimen of the infection in a contained honeypot environment [22]. Detailed descriptions of the multiple techniques that Storm uses to disguise itself are provided in [7, 9, 10]. In contrast, our work presents an analysis of Storm's DHT protocol and is the first to discover traces of what seems to be widespread attempts to poison Storm's C&C network.
A number of recent measurement studies have focused on the widely deployed KAD P2P network, another variant of the Kademlia DHT that uses a slightly different routing table. Stutzbach et al. use a distributed crawler to study ID lookup performance in KAD [35]. Steiner et al. crawl the KAD network to estimate the lifetime of peer sessions in this network [27]. The goal of our work is not to measure the lifetime of the infected hosts, but to measure the actual distribution of node IDs and to provide insights into the Storm network. Furthermore, we have discovered important differences between the Storm and KAD networks, such as the non-uniform distribution of peer IDs.
Monitoring the evolution of the Storm botnet has been the subject of multiple recent research reports [36, 37]. While the crawler described in this paper can be used for the same purposes, the focus of our work is to investigate whether the monitors themselves can be externally identified.
2.5 Summary
We present results from a measurement study of the Storm botnet, which uses a decentralized P2P infrastructure to coordinate individual bots. Our study revealed unexpected artifacts in the distribution of node IDs which suggest the existence of external entities aiming to track/monitor the Storm network.
Specifically, we witnessed widespread attempts to poison Storm's Overnet network by injecting invalid IDs that point to unreachable IP addresses. Moreover, we found that a small number of routable IP addresses inject a large number of IDs, most likely in an attempt to monitor or interfere with the Storm network. While polluting the DHT index is an effective strategy to deter file-sharing networks, as users have to manually sift through bad search results, its effectiveness in stopping a botnet's operation is questionable. A study of the effectiveness of such poisoning techniques in curtailing botnets is an avenue for future work.
More importantly, trackers that inject IDs or crawl a botnet's network are easily identifiable and thus vulnerable to counter-attacks by the botnet's operators. Therefore, there is a critical need to develop effective P2P tracking technologies which can evade detection by miscreants.
Acknowledgements
We would like to thank Razvan Musaloiu-E. for his help in deploying the Storm botnet
crawler.
Chapter 3
On Using Mobility to Propagate
Malware
Mobility pervades networked devices today. For example, millions of users access
the Internet through laptops and PDAs equipped with WiFi cards connected to thousands
of Access Points (APs) located on campuses, in coffee shops, at airports, etc. This increase
in connectivity, however, comes at a high price: failure to secure these communication
channels provides a new propagation vector for spreading self-replicating malicious code.
As a matter of fact, the exploitation of these channels is not just our speculation: variants
of the Zotob/Mytob worm are suspected to have used physical movement of computers
across network domain boundaries as a propagation vector [15]. More recently, a series
of malware that attempt to exploit Bluetooth connections as an infection mechanism were
reported in the media [38]. The accepted practice for protecting against such worms today is to
place mobile nodes in a de-militarized zone (DMZ), separate from the rest of the network.
In such a scenario, all communication between the mobile nodes and the wired nodes passes
through a firewall. However, mobile nodes can still infect each other through contacts
within these de-militarized zones.
Unfortunately, modelling efforts have not kept pace with malware evolution, as
most previous work describes how infections spread over wired networks. To better understand
this impending threat, we develop a concise analytical model that predicts the speed
of infections over populations of nomadic users traversing a collection of network access
points. The accuracy of the model is validated through simulations driven by realistic mobility
models, drawn from university-wide traces at Dartmouth College [39]. We found
that, in networks with thousands of users and hundreds of APs, the infection can reach 65%
of the total population within only one day, a relatively short time considering that infections
follow the slow pace of node movements across network domains. Furthermore, if
mobile nodes are allowed to infect co-located nodes connected to the wired network, a scenario
modelling imperfect DMZs, we observed that even a small proportion of vulnerable
mobile nodes can propagate the infection to the majority of the network domains within a
single day.
Due to the high propagation speed of these worms, human defense mechanisms are
rendered implausible. Moreover, the threat from this class of infections stems from the
fact that mobile nodes trivially bypass existing perimeter defenses, such as firewalls. Since
cross-domain transfer of the infection is accomplished by the physical migration of infected
nodes, it is difficult to contain them when no controls exist to police the movement of nodes
across domains. Such gaps in network defenses can lead to global worm outbreaks. Finally,
the detection of these worms is challenging due to their stealthiness. This characteristic is
a consequence of the fact that the majority of current detection techniques rely on traffic
anomalies measured at network monitors (network telescopes [40]). Unfortunately, since
mobile infections scan within the domains of infected nodes, suspicious probes would be
absent from telescopes deployed at remote domains. This observation motivates the need
for developing novel malware containment technologies. One promising direction towards
this goal involves exploiting the spatial characteristics of the infection. Specifically, we observed
that by placing monitors in approximately 10% of the most visited domains, we can
detect the mobile worm before it reaches the majority of the population. While this seems
a straightforward solution to the early detection problem, we argue that monitor placement
is still a challenging problem with many intricacies.
The rest of the chapter is structured as follows: Section 3.1 introduces the model for
predicting the spread of infections among a population of mobile users. We compare the
model's predictions to simulation results driven by realistic mobility traces in Section 3.2,
where we also investigate a number of variants of this worm. In Section 3.3 we compare the
mobile worm to a traditional (i.e., globally scanning) worm, and in Section 3.4 we provide
intuition about the temporal evolution of the infection by connecting it to the structure of
the mobility graph. We discuss the issues involved with telescope placement in Section 3.5.
Section 3.6 presents previous models for malware and mobility patterns, and finally we
close in Section 3.7 with future research directions.
3.1 Worm Model
We model infections spreading over collections of mobile users who connect to the
Internet through a revolving set of network access points. This model consists of two
types of entities: (a) network domains through which users connect to the Internet, and (b)
mobile nodes, e.g., laptops and PDAs, that are susceptible to infections and move across
these domains. In this context, domains act as mixing regions in which mobile nodes can
reach each other. We assume that an infected mobile node can infect another susceptible
mobile node if they reside in the same domain, even for a short period of time. This is a
realistic assumption because an infected mobile node can eavesdrop on communications
from all the other wireless nodes in the same domain and attempt to infect them directly.
The evolution of an infection can be modelled as a discrete-time replication process
over the set $V$ of vulnerable nodes. We denote the probability that node $i$ is infected at time
step $t$ by $p_{i,t}$. Furthermore, let $\beta_{ij}$ be the probability that node $i$ contacts node $j$. Given
these conditions, node $i$ is not infected at time step $t$ iff it was not infected by time step
$t-1$ and no infected node in the domain it resides in contacted node $i$ during the last time
step. Because these events are independent, this probability can be expressed as:

$$1 - p_{i,t} = (1 - p_{i,t-1}) \prod_{j \neq i} (1 - \beta_{ji}\, p_{j,t-1})$$
$$1 - p_{i,t} \approx 1 - p_{i,t-1} - \sum_{j \neq i} \beta_{ji}\, p_{j,t-1}$$

Here, we use the approximation $(1-a)(1-b) \approx 1 - a - b$ when $a \ll 1$, $b \ll 1$. Thus
we have,

$$p_{i,t} \approx p_{i,t-1} + \sum_{j \neq i} \beta_{ji}\, p_{j,t-1} \qquad (3.1)$$
By representing $(p_{1,t}, p_{2,t}, \ldots)$ as a row vector $P_t$ and assigning $\beta_{ii} = 1$ (i.e., the probability
that a node $i$ contacts itself is trivially one), we can rewrite Equation (3.1) in matrix
form as:

$$P_t = P_{t-1} M \qquad (3.2)$$

where $M = [\beta_{ij}]$ is the system matrix, containing the pairwise contact probabilities. From
the definition of $P_t$, $p_{i,t}$ is the probability that node $i$ is infected at time $t$. Therefore, the
expected number of infected nodes after time $t$ is given by

$$E\left[\,|I|\,\right] = \sum_{i=1}^{|V|} p_{i,t} = \|P_t\|_1 \qquad (3.3)$$

where $I$ is the set of all infected nodes. This type of matrix multiplication view of an
infection is common in epidemic modelling (e.g., [41]).
We initiate the infection by infecting a single node, say $k$. The initial conditions are
then as follows:

$$p_{i,0} = \begin{cases} 1 & \text{if } i = k, \\ 0 & \text{otherwise.} \end{cases}$$

If multiple nodes are initially infected (also known as patient zeros), the corresponding
indices in $P_0$ are set to unity.
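As an illustration (our own sketch, not from the original text), the recursion in Equations (3.2) and (3.3) can be evaluated numerically for any given contact-probability matrix $M$ with $\beta_{ii} = 1$:

```python
import numpy as np

def expected_infections(M, patient_zero, steps):
    """Iterate P_t = P_{t-1} M (Eq. 3.2) and return E[|I|] = ||P_t||_1
    (Eq. 3.3) after each discrete time step."""
    n = M.shape[0]
    P = np.zeros(n)
    P[patient_zero] = 1.0           # initial condition: p_{k,0} = 1
    history = [P.sum()]
    for _ in range(steps):
        P = np.minimum(P @ M, 1.0)  # clip: the linear approximation in
                                    # Eq. (3.1) can overshoot probability 1
        history.append(P.sum())
    return history

# Toy 3-node system: beta_ii = 1 on the diagonal, small contact rates elsewhere
M = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
curve = expected_infections(M, patient_zero=0, steps=24)
```

The expected infection count is non-decreasing and bounded by $|V|$, matching the qualitative behavior of the analytical model.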
3.1.1 Mobility Model
It is evident that in order to estimate the expected number of infected nodes in Equation
(3.3), we need to calculate the contact probabilities $\beta_{ij}$. In turn, these probabilities
depend on the number of domains a node visits and the duration of time that the node resides
in each domain. We therefore need a mobility model that describes the movement of
mobile nodes across network domains.
We model the mobility pattern of individual nodes using semi-Markov chains. We
chose the more general semi-Markov model because it was shown that node residence
times do not follow the exponential distribution [42, 43], but are better modelled by heavy-tailed
distributions. The state space $S = \{1, \cdots, m\}$ of the homogeneous semi-Markov
chain is the set of all network domains. The transition matrix $P$ describing the chain is then
an $m \times m$ matrix, while $D = [d_i]$ is an $m \times 1$ vector, which gives the mean residence time
of the node in each domain.
We can then derive the steady-state transition probability distribution $\pi$ by solving the
following set of equations:

$$\pi = \pi P, \qquad \sum_{i=1}^{m} \pi_i = 1$$

Given the fraction of time $\pi$ that the user stays in each state and the mean residence
times $D$ for each state, it is easy to calculate the steady-state probability $\bar{\pi}_i$ of the user
staying in domain $i$:

$$\bar{\pi}_i = \frac{d_i \pi_i}{\sum_{j=1}^{m} d_j \pi_j} \qquad (3.4)$$
From Equation (3.4) we can subsequently compute the contact rate $\beta_{xy}$ between nodes
$x$ and $y$. This value is equal to the probability that both $x$ and $y$ are in the same domain at
some point in time. Without loss of generality, we say that when a node is in the "OFF" state
(i.e., it is not operational) it resides in the domain with index 1. Since the infection
does not propagate when nodes are not connected, we do not include the percentage of time
in the "OFF" state in the calculation of the contact rates. The contact rates are then given
by:

$$\beta_{xy} = \sum_{i=2}^{m} \bar{\pi}_i^x \, \bar{\pi}_i^y \qquad (3.5)$$

where $\bar{\pi}_i^x$ is the percentage of the time spent by $x$ in domain $i$. We substitute Equation (3.5)
into Equation (3.2) to obtain the number of infectees as a function of time.
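For concreteness, the steady-state computation in Equation (3.4) and the contact rate in Equation (3.5) can be sketched as follows (an illustrative implementation, not taken from the dissertation; here index 0 plays the role of the "OFF" state):

```python
import numpy as np

def occupancy(P, d):
    """Time-weighted steady-state occupancy of one node's semi-Markov
    chain (Eq. 3.4): P is the m x m transition matrix, d the vector of
    mean residence times per domain."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()              # solve pi = pi P, normalized to sum 1
    weighted = d * pi
    return weighted / weighted.sum()

def contact_rate(occ_x, occ_y):
    """Contact rate beta_xy (Eq. 3.5). Index 0 is the 'OFF' state and is
    excluded, since the infection cannot spread while a node is offline."""
    return float(np.dot(occ_x[1:], occ_y[1:]))

# Two-state example: state 0 = OFF, state 1 = a single campus domain
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
d = np.array([1.0, 1.0])            # equal mean residence times
occ = occupancy(P, d)               # stationary split between OFF and domain 1
beta = contact_rate(occ, occ)       # two nodes with identical mobility
```

The eigenvector associated with eigenvalue 1 of $P^T$ gives the stationary distribution; weighting by residence times yields the occupancy probabilities that feed the contact-rate sum.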
The last complication is that Equation (3.2) proceeds in discrete time steps of uniform
duration, while nodes actually have variable domain staying times. We address this discrepancy
by using the mean residence time across all domains as the discrete time step in
Equation (3.2). While doing so compromises the accuracy of the analytical model, as the
simulation results from Section 3.2 demonstrate, even with this compromise the model is
able to accurately track the infection's evolution.
3.2 Evaluation
We derive the parameters of the mobility model described in the previous section from
traces of actual mobile user behavior, available from Dartmouth College [39]. Each trace
is a time sequence of the access points the mobile users visit (identified by their MAC
addresses). Traces also contain the special "OFF" location, signifying a user's departure
from the network. The trace we use contains 626 different access points and tracks the
movement of mobile users from 9/23/2003 to 12/10/2003. Approximately 6% of the users
in our trace visited just a single domain before entering the "OFF" state. We removed
such users, since the states in their semi-Markov chains are not recurrent and their steady-state
probabilities in states other than the "OFF" state are trivially zero. In all, we had 6101 users.
We assume that all the mobile users in the system are vulnerable. We observed similar
infection curves when only a fraction of the mobile users were vulnerable. Furthermore,
the infection model can easily incorporate scenarios in which only a subset of the mobile
nodes are vulnerable by appropriately defining the set of vulnerable nodes, $V$. The mean
domain residence time of the users is approximately 67 minutes. We use this value as the
discrete time step in Equation (3.2).
3.2.1 Mobile node infection
We compare the model's predictions with results provided by detailed simulations. The
custom simulator we developed emulates the movements of mobile users over the same
collection of APs and tracks the evolution of the infection after an initial node (patient
zero) is infected. As before, we assume that the infection passes from an infected node to
any other node that resides in the same network at the same time. We ran 100 simulations,
each time randomly choosing a different initial node to infect.
Figure 3.1 graphs the evolution of the infection as a function of time. In addition to the
infection curve predicted by the analytical model, we present three representative simulation
runs. These curves represent the 5th, 50th, and 95th percentiles across all simulations,
where rank is calculated based on the time when 70% of vulnerable hosts are infected. Intuitively,
these curves represent a slow, average, and fast infection instance, depending on
which node was infected first.
First, we note that the model provides a decent approximation of the average infection
evolution, faithfully tracking the curves that represent the simulations. Furthermore, the
[Figure: fraction of infected nodes (0–0.8) vs. time (hours, 0–60); curves: Sim 5%, Sim 50%, Sim 95%, Model]
Figure 3.1: Percentage of infected users as a function of time as predicted by the analytical
model and as demonstrated by simulation.
infection spreads to approximately 60% of the users within a single day. Given that the
worm requires under a day to infect the majority of the population, we experimented by
starting the infection on different days during the period covered by the network trace. In
all cases we observed patterns very similar to those in Figure 3.1. We also found that the
evolution speed varied depending on the time of day when the first node was infected.
Worms that started during the daytime spread faster than those started at night. This is due
to the decreased movement of nodes during the night hours.
3.2.2 Mixing mobile and static nodes
So far we have assumed that mobile users cannot infect nodes connected to the static
(wired) network. This model corresponds to current security practices according to which
WiFi APs are separated from the rest of the network (e.g., a company's intranet) by firewalls.
However, firewalls are complex devices that are notoriously difficult to configure.
Therefore, it is possible that a misconfigured firewall would allow infected wireless devices
to contact hosts residing in the static part of the network. More commonly, laptops can
connect directly to the static portion of the network after they have roamed across several
wireless domains (e.g., during a business trip), effectively bypassing the barrier between the
static and mobile compartments of a network domain.
In this scenario, static hosts can be infected by mobile nodes and subsequently carry
the infection to other vulnerable nodes. Therefore, it is no longer necessary for mobile
nodes to simultaneously reside in the same domain for the infection to spread; a mobile
node entering a network domain can contract the infection from infected static nodes in that
domain. In order to understand how these infections spread, we modified the original
simulator to assume the worst-case scenario, wherein an infected mobile node instantly
infects any domain that it enters. The "instant infection" assumption is valid even for
a uniform scanning worm (i.e., one which follows a naive strategy of random scanning and
is therefore among the slower spreading worms). Even with a scan rate of 10 scans/sec and
domains with as few as 10% vulnerable nodes, one static node on average is infected
within the first second of an infected user's entry into the domain.
Figure 3.2(a) presents the number of network domains infected as a function of time
when mobile nodes can infect the domains they visit. The infection spreads to about 65%
of the domains within a day. It then slows down considerably and takes a long time to
infect the remaining domains. This result might seem straightforward, given that 65% of
[Figure: two panels of fraction of infected hosts vs. time (hours); curves: Sim 5%, Sim 50%, Sim 95%]
Figure 3.2: (a) Rate of domain infections as a function of time with the total mobile population.
(b) Rate of infection with only 25% of the mobile nodes.
mobile nodes contract the infection within one day. In order to investigate the relationship
between the number of mobile nodes carrying the infection and its spread over the set of
network domains, we repeated the previous experiment with a randomly selected subset
of 1500 wireless nodes (25% of the original population). The surprising result, as Figure
3.2(b) indicates, is that infection rates in this case are comparable to the previous case,
i.e., the infection reaches ∼60% of the domains within a day. This result indicates that the
worm's speed is not significantly hampered by the significantly smaller set of cross-domain
carriers. This phenomenon can be explained by the association graphs usually observed in
social networks [44]. In that context, as well as in the context of network domains visited
by mobile hosts, domain popularity has been shown to follow a heavy-tailed distribution,
wherein a small number of domains are extremely popular, followed by a large number of
less popular domains. As a result, the smaller subset of nodes is still likely to frequent
the very popular domains, thus fuelling the growth of the infection.
3.3 Detection
Thus far we have shown that a mobile infection can take up to a day to affect a significant
portion of the vulnerable population. Although this is fast enough to make human
defense mechanisms implausible, it is considerably slower compared even to the naïve uniform
scanning strategy, or more sophisticated variants such as flash worms that can spread
over the entire Internet in a few minutes [45].
The fact that such worms spread more slowly might lead to the conclusion that they
are easier to contain. This, however, is false. On the contrary, mobile infections are more
difficult to detect using conventional approaches, such as distributed network monitors [46,
47]. In the paragraphs that follow, we explain the underlying reason for this negative result.
3.3.1 Detection Speed
We compare the expected time to detect a mobile infection to the average detection time
of a uniform scanning worm. Here we assume that a single network telescope is used to
detect the infection. We define detection time as the time elapsed from the first infection
until the first probe arrives at the address space monitored by the telescope(s). Suppose
that the telescope covers a large fraction, $\alpha = 0.5$, of the IP space used in the network
domain where it is deployed. Then, the expected time $T$ to detect the first instance of the
infection for a uniform scanning worm is given by:
$$H(T) = \int_0^T I(t) \cdot s \, dt \approx \int_0^T e^{sft} \cdot s \, dt = \frac{N}{\alpha}$$
$$\Rightarrow T = \frac{1}{s f} \ln\!\left(\frac{N f}{\alpha} + 1\right) \qquad (3.6)$$

where $H(t)$ is the number of IP addresses scanned by all the infected nodes in $[0, t]$, $s$ is
the scan rate, $N$ is the total number of domains, and $f$ is the average density of vulnerable
nodes.
Substituting conservative values for $s = 20$ scans/min (the Witty worm had a scan rate
of roughly 1200 scans/min [48]), $N = 1000$, and $f = 0.01$ in Equation (3.6), we find that a
uniform worm will be detected within 15 minutes on the average. By this time the worm
has spread to less than 2% of the vulnerable population (calculated from the equation for
the uniform scanning worm). Furthermore, the placement of the telescope is immaterial
to the detection time. Thus, we conclude that such a telescope can be an effective early
warning device for typical worms.
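Plugging these parameter values into Equation (3.6) directly confirms the figure (a quick numerical sanity check, our own):

```python
import math

def detection_time(s, N, f, alpha):
    """Expected detection time of a uniform scanning worm (Eq. 3.6):
    s = scan rate, N = number of domains, f = vulnerable-node density,
    alpha = fraction of one domain's address space the telescope covers."""
    return math.log(N * f / alpha + 1.0) / (s * f)

T = detection_time(s=20, N=1000, f=0.01, alpha=0.5)  # ~15.2 minutes
```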
On the other hand, since mobile worms scan only their local network, detection time
is governed by the speed with which infected mobile nodes enter the domain where the
telescope is located. Considering the same (randomly placed) single telescope, detection
will occur, on average, when the worm has spread to half of the domains. Figure 3.2
indicates that the time for the worm to spread to 50% of the vulnerable domains is ∼15 hours.
Within this time, the worm infection has already taken off, infecting a large number of
hosts. Once the worm enters the domain which contains the network monitor, detection is
much faster. On the other hand, since detection time is dominated by the time necessary for
the worm to enter the domain, using larger telescopes within a domain does not significantly
reduce detection time.
In short, unlike traditional uniform-scanning worms, telescope size is not important
and random placement is of little use. On the other hand, given that the worm infects
popular domains first, it is prudent to place worm monitors in those domains.
3.4 Spatial evolution
Until now we have investigated the temporal behavior of the infection. However, an
equally interesting aspect is the infection's spatial evolution, that is, how the infection
spreads over the collection of network domains the mobile nodes visit. We note that Figures
3.1 and 3.2 flatten out considerably after an almost vertical growth during the middle
phase of the evolution graph. This behavioral change can be explained by dividing the
spatial evolution of the infection into a number of distinct phases. The infection initially
"moves" in the direction of domains which are extremely popular, since many nodes visit
them. This is the slow take-off phase. These popular domains (we call them hubs) are
closely connected by the group of mobile nodes which frequent them, thus forming a dense
core of the network graph. When the infection reaches this core, an exponential increase
in the number of infected hosts occurs, as the majority of vulnerable nodes frequently visit
the core. Finally, the infection gradually slows down after it has consumed the core and
extends towards domains with low contact rates (i.e., unpopular domains). Figure 3.3 illustrates
this phenomenon, where it is clear that popular domains are infected within the first
few hours of the infection.
[Figure: domain popularity (node-hours, ×10^7) vs. infection time (hours)]
Figure 3.3: The first time an infected node is seen at a network domain as a function of
the domain's popularity, defined as the cumulative node-hours occupancy of the domain.
3.4.1 Popularity
We define the popularity of a domain as the cumulative number of node-hours that
nodes spend in that domain. This definition accounts for both the distinct number of nodes
visiting the domain as well as the length of time a node resides in the domain.
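The node-hours popularity metric can be computed directly from visit records; a minimal sketch (the record format is our own, for illustration):

```python
from collections import defaultdict

def domain_popularity(visits):
    """Cumulative node-hours per domain from (node, domain, hours) records."""
    popularity = defaultdict(float)
    for _node, domain, hours in visits:
        popularity[domain] += hours
    return dict(popularity)

# Example: two nodes visiting two (hypothetical) domains
records = [("n1", "AP-lib", 2.0), ("n2", "AP-lib", 3.5), ("n1", "AP-cafe", 1.0)]
# domain_popularity(records) -> {"AP-lib": 5.5, "AP-cafe": 1.0}
```

Summing hours rather than counting distinct visitors is what lets the metric capture both how many nodes visit a domain and how long they stay.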
Intuitively, placing network monitors in the most popular domains yields the earliest detection
times.

[Figure: time of detection (hours, 9–16) vs. percentage of domains used as monitors (0–1)]
Figure 3.4: Detection time when monitors are deployed in the top x% of the domains.

To quantitatively measure the effect of placing multiple monitors, we placed
monitors in the top x% of the domains and measured the detection times. As Figure 3.4
shows, installing monitors in 10% of the domains reduced the detection time to about 10
hours. During this time the worm has spread to less than 10% of the hosts (as seen from
Figure 3.1). Installing additional monitors provides only marginal benefits, reducing the
detection time, in the limit, to a little over 9 hours.
3.5 Discussion
Deploying wireless network monitors may involve modifying APs to scan through the
packets they forward looking for traces of malware, or deploying honeypots acting as decoys.
As we showed, placing such monitors in the top 10% of the domains can help detect
the worm early enough. However, this strategy in itself is not sufficient to guarantee early
detection. We present two arguments to support this claim.
3.5.1 Popularity is dynamic
First, we investigate how domain popularities change over time and the effect these
changes have on detection time. For this purpose we use the access points from the previous
dataset [39] to calculate the popularity of each domain on a weekly basis. We then choose
an initial set of the 50 most popular APs (∼10% of the total AP population) during the
first week of the network trace and measure how this set compares with the set of top 50
APs for every other week. The similarity between the first and every other weekly set is
estimated by calculating the dot product between the indicator vectors of the two sets and
dividing the result by 50. In this case a product of one indicates that the sets are identical,
while zero indicates that no common members exist between the two sets.
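The scaled dot product of the two 0/1 indicator vectors reduces to the size of the set intersection divided by 50; as a sketch (our own, with hypothetical variable names):

```python
def weekly_similarity(top_first_week, top_other_week):
    """Dot product of the two sets' 0/1 indicator vectors, scaled by the
    set size: 1.0 means identical top-50 sets, 0.0 means disjoint."""
    return len(set(top_first_week) & set(top_other_week)) / len(top_first_week)

# Example: 25 of 50 APs in common -> similarity 0.5
sim = weekly_similarity(range(50), range(25, 75))  # 0.5
```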
Figure 3.5(a) plots how the similarity between the top 50 APs evolved during the year
2004. It is evident that there are wide variations, with two prominent dips around weeks
30 and 50. Closer inspection of the CRAWDAD dataset revealed that during the Fall and
Spring sessions, the APs in the residential buildings were the most popular. On the other
hand, APs in the academic buildings and athletic centers were highly ranked during intersessions,
explaining the aforementioned changes. Figure 3.5(b) shows the corresponding
median worm detection time over time, when monitors are statically placed in the top 50
domains according to the popularity results of the first week. While it may seem that the
difference in detection time is only a matter of two hours, varying between 10.5 and
12.5 hours, the effects of this difference are dramatic. As Figure 3.1 indicates, this disparity
results in an infection spread of <5% in the case of 10.5 hours, as opposed to ∼30% when
the detection time is 12.5 hours. Thus, reducing the detection time window is crucial if
worm defenses are to have sufficient time to be effective.
[Figure: (a) scaled dot product vs. week index; (b) detection time (hours) vs. week index]
Figure 3.5: (a) Similarity between the popularity of the top 50 domains on a weekly basis
for 2004. (b) Median detection time if the monitors are deployed statically.
3.5.2 Evasive worms
The second reason why static placement of monitors is insufficient is that worms can
potentially detect their presence and avoid the networks in which these monitors are deployed.
Rajab et al. have presented an efficient probe-response attack that can be used to
discover the locations of network monitors deployed on the (wired) Internet [49]. A similar
technique could potentially be applied in the context of mobile infections. In this case,
worm instances probe the domain they currently reside in, using standard network tools such
as ping and ARP, or even passively eavesdrop on all ongoing communications to the AP. If a
domain is believed to host a monitor, the worm will not attempt to infect any mobile nodes
in that domain, thus avoiding detection.
On the other hand, if avoiding popular domains, in which monitors are deployed, slows
down the infection to the point where human intervention is practical, then the threat posed
by these evasive infections is minimal. To verify whether this is true, we simulated such
an evasive worm that does not try to infect the 50 most popular domains, and measured its
infection speed. Unfortunately, as Figure 3.6 indicates, the infection rate is still significant,
with 60% of the hosts infected within 3 days on the average.
[Figure: fraction of infected hosts vs. time (hours); curves: Sim 5%, Sim 50%, Sim 95%]
Figure 3.6: Worm evolution when the worm is inactive in the top 50 domains.
From the two arguments presented above it is clear that placing monitors in the most
popular domains is not a complete solution to the problem of early detection.
3.6 Background and Related Work
A large volume of research has focused on modelling Internet worms. Among these,
the classic homogeneous worm model assumed all-to-all node connectivity and that every
susceptible node was a target of equal probability [50]. More recent models accounted
for non-uniform scanning strategies [51], as well as for the fact that the node population is
not uniformly distributed over the IP address space [47]. However, much of the prior
work ([41, 52, 53] among many others) primarily considers how malware propagates in
wired networks. Instead, we explore how mobility can facilitate the spread of infections
among groups of nomadic users traversing different network attachment points, such as
WiFi Access Points. In this case, unlike previous scenarios, each infected node has a time-varying
infection transmission probability depending on its local scope.
In the context of mobile networks, Anderson et al. derived the speed of mobile worms
through simulations [54]. While our results seem to be in broad agreement, we focus our
attention on the actual infection evolution, so as to infer the worm characteristics. Similar
trace-driven studies covering infections over Bluetooth networks were performed by Su et
al. [55]. Unlike those previous studies, which are limited to simulations performed using
a particular trace, we propose a general analytical model that predicts the evolution of infections
over a wide range of mobility patterns. Epidemic spreading in ad-hoc networks
has been studied by Mickens and Noble in [56]. That work explained why traditional epidemic
models fail in the case of mobile networks and proposed a new framework for such
networks. While that study focused on worms spreading within a single ad-hoc wireless
network, our model explains how infections are carried across a variety of networks by the
physical movement of mobile users.
The mobility model we use is similar to the semi-Markov model presented in [57].
Lee et al. developed a cumulative model for different user groups to obtain the AP-user
mobility patterns. Instead, we model the mobility patternsof individual users. We choose
to do so, because the derived mobility model is then used to calculate the contact rates
between mobile node pairs. This factor determines the rate at which the infection travels
among individual nodes.
Today, it is generally considered good practice to place mobile nodes in a DMZ separated
from wired nodes. Various enterprise solutions exist for doing so, e.g., Cisco's
network admission control [58]. We believe that these perimeter defenses by themselves
are insufficient and that a more fine-grained approach is needed to detect and contain mobile
worms.
3.7 Summary and Future Directions
We presented and validated an analytical model that describes the evolution of worms
that exploit node mobility to propagate. We evaluated infection speeds in different scenarios:
first, when mobile users can only infect each other as they move across a collection
of network domains, and second, when infections can spread from mobile users to static
nodes. Our ultimate goal is to use this model to design effective detection and containment
mechanisms for this novel category of worms. While we touched upon the difficulties of
designing detection mechanisms for this type of infection, we discuss detection in further
detail in the next chapter.
Even with effective detection mechanisms, the feasibility of policing nodes as they enter
popular domains is not straightforward. Numerous practical concerns for containment
mechanisms designed for mobile infections must be addressed, including how to exploit
topological information to limit the damage from potentially infected nodes, how to appropriately
apply the notion of hard-LANs [59] in this setting, and how to track (in a tamper-resistant
manner) the movement of nodes across network domains.
Acknowledgements
This work was supported in part by National Science Foundation grant CNS-0627611.
We gratefully acknowledge the use of trace data from the CRAWDAD archive at Dartmouth
College.
Chapter 4
On the Detection and Origin
Identification of Mobile Worms
In the previous chapter, we discussed the various difficulties in the detection of mo-
bile worms. The detection of these worms is challenging because the majority of tech-
niques against zero-day infections rely on recognizing anomalous patterns in inbound
traffic (e.g., [46] among others), or on outbound DNS, ARP, or failed connection requests from
local hosts [60, 61]. A mobile worm, on the other hand, can find victims by eavesdropping
on the radio channel, thus generating no scanning, ARP, or DNS traffic. Moreover, the
alternative solution of using honeypots for detection is also ineffective, because it requires
placing honeypots in the majority of the domains [62].
In this chapter, we present two mechanisms for countering mobile worms. The first
mechanism detects the existence of a worm spreading through a collection of wireless
domains, while the second identifies the origin of the worm. Doing so involves identifying
the node(s) that initiated the infection as well as the nodes infected during the very early
stages of the epidemic. In turn, origin identification enables further investigation into the
underlying causes and techniques used to breach the network's defenses, and can provide
information relevant to law enforcement. These node identities can also be used to con-
tain the infection, by blocking their traffic or by automatically generating attack signatures
based on the traffic they transmit.
Both proposed mechanisms extend the Random Moonwalk technique [63] and only re-
quire network flow records consisting of the start, duration, source, and destination of all
flows within a wireless network domain. These flow records are collected at every domain
and are either aggregated into a centralized database, or made available through a federated
database, similar to the network forensic alliance (NFA) proposed in [64]. We first show
that the original moonwalk is ineffective against mobile infections, and then present two
new heuristics that can detect and identify such infections. We evaluate the performance of
the proposed algorithms through simulations driven by network traces collected from a
university-wide wireless network. Our results show that a mobile infection can be reliably
detected before it infects 10% of the vulnerable population in a network with hundreds
of domains and thousands of mobile nodes. Furthermore, the proposed identification al-
gorithm limits the search for the initial infection victims to within 2% of the mobile node
population. Working in concert, the two algorithms we present can effectively protect users
against stealthy mobile worms.
The remainder of this chapter is organized as follows. In the following section, we review the standard moon-
walk algorithm. Section 4.2 describes how the moonwalk algorithm can be modified for the
online detection of mobile worms. In Section 4.3 we show how to trace the evolution path
of a mobile worm. Finally, Section 4.4 presents related work and we close in Section 4.5.
4.1 Background
As part of our previous work, we showed that mobile infections can spread through
tens of thousands of victims located in hundreds of domains within a day [62]. We also
showed that a mobile infection initially “moves” towards highly popular domains, because
many nodes visit them. When the infection reaches these popular domains, its growth rate
rapidly increases and in the final phase it slowly spreads to the remaining domains.
Intuitively, placing network monitors and honeypots in the most popular domains yields
the earliest detection times. In fact, we showed that by installing monitors in ~10% of the
most popular domains, one could detect the infection while it is still in its early phase.
Deploying such wireless network monitors involves modifying APs to inspect the packets
they forward, or deploying honeypots acting as mobile nodes. Unfortunately, deploying
monitors in the most popular domains is insufficient for a number of reasons. First, domain
popularity changes over time depending on the users' mobility patterns. Second, mobile
worms can potentially detect the presence of monitors and avoid popular domains in which
they may be deployed. Similar probe-response attacks, used to discover the locations of
network monitors deployed on the (wired) Internet, have been discussed in [49, 65]. Mobile
worms could use standard tools such as ICMP or ARP requests, or even eavesdrop,
to infer the size of a domain and avoid highly popular domains. Finally, worm origin
identification is almost impossible if monitors are not deployed in every domain.
4.1.1 Random Moonwalks
The random moonwalk is a post-mortem method for identifying the origins of a worm
attack on the Internet using network flow data [63]. Specifically, given a set of network
flow records corresponding to a host contact graph, a moonwalk starts at an arbitrarily
chosen edge e1 = 〈u1, v1, ts1, te1〉, where u1 and v1 are the source and destination, and
ts1 and te1 the start and end times of the flow, respectively. The next edge backward in
time is selected uniformly at random from the set of edges that arrived at u1 within the
past Δt seconds, i.e., e2 = 〈u2, u1, ts2, te2〉 with te2 < ts1 < te2 + Δt. This process
continues for a maximum number of hops, or until no prior edge is found. Multiple
moonwalks are taken, and the edges that appear with the highest frequency across all
moonwalks are computed. These edges are likely to be the top-level causal edges of the
worm tree (i.e., the edges that initiated the infection). The intuition behind this approach
is that worms generate tree-like contact graphs in which a small number of early malicious
edges are responsible for a large number of edges further down the tree; the initial causal
edges will therefore be traversed multiple times and have high occurrence frequencies.
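As a concrete illustration, the walk described above can be sketched in a few lines of Python. Flow records are modeled as (src, dst, t_start, t_end) tuples; the function names and the linear scan over the flow list are illustrative simplifications, not the implementation of [63]:

```python
import random

def random_moonwalk(flows, d, delta_t):
    """Perform one random moonwalk over flow records.

    A flow is a (src, dst, t_start, t_end) tuple. Starting from a random
    edge, repeatedly step backward in time to a flow that arrived at the
    current edge's source within the past delta_t time units, for at most
    d hops or until no prior edge exists.
    """
    walk = [random.choice(flows)]
    for _ in range(d - 1):
        src, _, t_start, _ = walk[-1]
        # Prior edges: flows into `src` that ended before the current
        # flow started, but no more than delta_t earlier.
        prior = [f for f in flows
                 if f[1] == src and f[3] < t_start < f[3] + delta_t]
        if not prior:
            break
        walk.append(random.choice(prior))
    return walk

def edge_frequencies(flows, walks=1000, d=50, delta_t=300):
    """Rank edges by how often they occur across many moonwalks."""
    freq = {}
    for _ in range(walks):
        for edge in random_moonwalk(flows, d, delta_t):
            freq[edge] = freq.get(edge, 0) + 1
    return sorted(freq.items(), key=lambda kv: -kv[1])
```

The most frequent edges returned by `edge_frequencies` are the candidate top-level causal edges of the worm tree.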
The effectiveness of the moonwalk algorithm decreases as the worm becomes stealthier
and generates smaller amounts of excess traffic. Tracking mobile worms is therefore
especially challenging for the moonwalk algorithm, as infected nodes eavesdrop to discover
victims instead of scanning for them. Because an infected node only seeks victims within
its current domain, it suffices for the moonwalk algorithm to focus on the intra-domain
flows. However, even considering this reduced edge set, edge frequency is not a reliable
indicator of the initial causal edges of a mobile infection. In fact, as we shall show shortly, it
is impossible to even detect the presence of a mobile infection using the standard moonwalk
algorithm. The underlying reason is that a typical host contact graph is globally sparse but
has considerable local correlation. In other words, the density of flows between nodes
in the same domain is higher than that across domains, for both worm and non-
malicious traffic. Moreover, global and local contacts are made on different timescales.
While local contacts occur on the timescales of normal host connections, the timescales of
inter-domain contacts are governed by the usually slower physical movement of nodes
across domains.
4.2 Mobile Worm Detection
4.2.1 Random Moonwalks and Mobile Worms
To demonstrate the shortcomings of the standard moonwalk algorithm, we simulate a
mobile infection that spreads over a group of mobile nodes traversing multiple network
domains. We do so using two models: one describing the mobility pattern of nodes across
domains, and another reflecting the traffic patterns of mobile nodes within a domain. We
derive the first model from traffic traces collected at the Dartmouth College campus,
available through CRAWDAD [39]. The trace we use contains 626 domains (i.e., APs) and
over 6,000 nodes, and was collected from a campus-wide WiFi network between 9/23/2003
and 12/10/2003. Each trace entry corresponds to the time that a host, identified by its MAC
address, connected to one of the network domains. The trace also includes a special
'OFF' location, signifying a host's departure from the network.
To the best of our knowledge, there are no datasets that capture traffic that originates
and terminates within the same wireless domain. For this reason, we generate this traffic syn-
thetically. Specifically, we build a flow model using measurements of intra-domain traffic,
collected over a period of two weeks from the wireless APs at the Information Security
Institute at Johns Hopkins University. Note that applications such as FTP create two TCP
flows, one for control messages and one for data. We combine all the flows corresponding
to the same transaction into a single semantic flow. We then model semantic flow inter-
arrival times using a Lognormal distribution and flow sizes using a bi-Pareto distribution,
as suggested by [66]. The parameters for these distributions are fitted from the collected
packet traces. Finally, we select the size of the moonwalk's time window Δt to maximize
the walk lengths, and set the hop count to a large value to allow the moonwalks
to continue as far back as possible. Table 4.1 summarizes the parameters used in our
simulations.
Description                         Setting
Number of domains                   626
Number of mobile nodes              6101
Flow Inter-arrival Model            Lognormal
Flow Duration Model                 Bi-Pareto
Mean Domain Residence Time (TR)     67 min
Mean OFF Time                       315 min
Moonwalk Window Size (Δt)           300 min
Maximum Moonwalk Hop Count (d)      50

Table 4.1: Simulation Parameters.
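For illustration, synthetic intra-domain traffic of the kind described above could be generated along the following lines. The distribution parameters are placeholders rather than the values fitted from our traces, and Python's plain Pareto generator stands in for the bi-Pareto model of [66]:

```python
import random

def generate_intra_domain_flows(nodes, horizon, mu=2.0, sigma=1.0, alpha=1.5):
    """Generate synthetic intra-domain flow records.

    Sketch under simplified assumptions: lognormal inter-arrival times
    (mu, sigma are illustrative, not the fitted trace values) and a
    plain Pareto duration standing in for the bi-Pareto model.
    """
    flows, t = [], 0.0
    while True:
        t += random.lognormvariate(mu, sigma)   # next flow arrival
        if t > horizon:
            break
        src, dst = random.sample(nodes, 2)      # distinct endpoints
        duration = random.paretovariate(alpha)  # heavy-tailed duration
        flows.append((src, dst, t, t + duration))
    return flows
```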
Given these parameters, we simulate two scenarios: one with only normal traffic, and
one in which a mobile worm is injected at a random network node at a certain point in
time. We let both simulations progress until the worm has infected 65% of
the network's nodes, and then invoke the random moonwalk algorithm for each of the two
scenarios.¹
The top panel of Figure 4.1 presents the results of the random moonwalk algorithm for
the network with no malicious traffic, while the bottom panel presents the results when a
mobile infection is injected at t = 10,000 sec (~167 min). The y-axis in these graphs
corresponds to the frequency with which a flow that starts at a certain point in time occurs
over the set of random moonwalks performed. Therefore, a large frequency value indicates
an edge that was traversed during multiple moonwalks.
When the moonwalk algorithm executes on an Internet trace that contains an actively
spreading scanning worm, the initial causal edges of the attack have the highest frequencies,
creating a pronounced spike in the frequency graph (see [63]). In contrast, as is evident
¹Similar results were derived for other infection percentages.
Figure 4.1: (a) Random moonwalk on a network with no malicious traffic. (b) Random moonwalk on the same network when a worm is injected at t ≈ 167 min. The y-axis represents the frequency with which flows starting at a particular time appear in the set of paths traversed by the moonwalks.
from Figure 4.1, there is no marked increase in edge frequencies when a mobile worm is
spreading. Comparing the two cases, it is difficult to even infer the existence of a worm
from the lower graph.
4.2.2 Proposed approach
As the results from the previous section indicate, edge frequency is not an effective
indicator of infection in mobile networks. Instead, we use a different heuristic: the average
Figure 4.2: (a) Average moonwalk length for a network with no malicious traffic and a network in which a worm is injected at t ≈ 420 min. Graphs are shown when 100% and 75% of the population is vulnerable. (b) Percentage of infected nodes as a function of time for the same worm.
moonwalk length. The intuition for selecting this attribute is that as the worm spreads, the
host contact graph becomes inherently denser, as infected nodes contact other nodes to
spread the infection. As a result, the length of a moonwalk tends to increase. Moreover,
these contact paths tend to span multiple network domains, which is unusual for normal
traffic. Based on these observations, we posit that a worm can be detected by noticing a
steep increase in the average moonwalk length.
To test this hypothesis, we compute the average moonwalk length for the
Figure 4.3: Average moonwalk length for a network with different volumes of normal traffic. The curve labelled 'High' corresponds to double the volume of traffic in 'Norm', while the curve labelled 'Low' represents a scenario in which the traffic is halved.
simulated network presented in the previous section, and observe whether the introduction
of a worm creates a marked increase in the average moonwalk length. To make the worm
even more stealthy, we allow infected hosts to avoid contacting nodes which they had
either previously subverted or attempted to subvert.
Figure 4.2(a) presents the average moonwalk lengths for a network with no malicious
traffic and for a network in which a worm is injected at time t = 25,000 sec (~420 min). It
is clear from this graph that the average moonwalk length increases considerably after the
infection starts. At approximately 650 minutes into the simulation, the walk length in the
worm scenario is almost twice that of the normal case. By that time the worm has spread to
less than 10% of the vulnerable population (see Figure 4.2(b)). This observation is crucial
for early detection, because it gives containment strategies more time to be effective.
The same experiment was conducted assuming that only 75% of the mobile population is
vulnerable. As Figure 4.2(a) shows, the difference in moonwalk lengths is still significant.
More importantly, the results in Figure 4.2 suggest that a simple threshold detector
could alert network operators to the presence of a mobile worm in its early stage of
infection. Briefly, such a detector periodically calculates the average moonwalk length and keeps
a running average of this length (e.g., using an exponentially weighted moving average).
When the difference between the current length and the long-term average passes a thresh-
old, the detector raises an alarm about an actively spreading worm. Since user traffic can
vary over time (e.g., traffic volume during the day is usually higher than during the night),
we conducted a simulation in which we measure the average moonwalk length under vary-
ing traffic volume. Figure 4.3 shows the average moonwalk length when the normal
background traffic is halved and doubled. As can be seen, while there is a small increase in
the average moonwalk length, it is not as marked as in the case when a worm is present.
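A minimal sketch of such a threshold detector, assuming an EWMA baseline and a multiplicative alarm threshold (both parameter values are illustrative, not tuned):

```python
class MoonwalkLengthDetector:
    """Threshold detector over the average moonwalk length.

    Keeps an exponentially weighted moving average of the periodically
    measured mean walk length and raises an alarm when the current
    measurement exceeds the long-term average by a multiplicative
    threshold. Parameter defaults are illustrative assumptions.
    """
    def __init__(self, alpha=0.1, threshold=1.5):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # alarm ratio over the baseline
        self.baseline = None

    def update(self, avg_walk_length):
        if self.baseline is None:
            self.baseline = avg_walk_length
            return False
        alarm = avg_walk_length > self.threshold * self.baseline
        # Only fold non-anomalous samples into the baseline, so a
        # spreading worm does not drag the baseline upward.
        if not alarm:
            self.baseline = (1 - self.alpha) * self.baseline \
                            + self.alpha * avg_walk_length
        return alarm
```

Feeding the detector a stream of stable measurements keeps it quiet; a doubling of the average walk length, as observed in Figure 4.2(a), trips the alarm.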
4.2.3 Effect of infection on moonwalk length
We analyze how fast the moonwalk length increases when a worm is present. To do so,
we use simplifying assumptions about the host contact graph and the worm attack to make
the analysis tractable. The goal of this analysis is not to provide a closed-form solution, but
instead to support the effectiveness of the detection technique proposed in Section 4.2.
Assume that the host contact graph has H hosts, a fraction f of which are vulnerable, and the
worm is a uniform scanning worm with a scan rate of s scans per unit of time. Further-
more, we assume that all flows, malicious as well as normal, last for a unit of time. Let c be
the average number of non-malicious flows into a node over a unit of time. Obviously, if
c > 1, all our moonwalks will have length d, where d is the upper bound on the number of
steps a moonwalk can follow, and the moonwalk is in fact useless. On the other hand, the
contact graph of non-malicious traffic is generally sparse and c << 1. Finally, we assume
that the worm starts at t = 0 and that a moonwalk step is one time unit long.
Let l_n(i) be the average length of a moonwalk under normal traffic at time step i. Then:

    l_n(i) = (l_n(i-1) + 1) * c,    i = 1, ..., N        (4.1)
    l_n(0) = 0

Therefore l_n(i) = c(1 - c^i)/(1 - c). Note that as c → 1, l_n(i) → i. This is intuitively
true, since in that case there are always incident flows on a node.
Let l_w(i) represent the average length of the moonwalk when a scanning worm is
present. Then:

    l_w(i) = (l_w(i-1) + 1) * (c + s*I(i-1)/H)           (4.2)
    l_w(0) = 0

where I(t) is the number of hosts infected at time t, and s*I(i-1)/H represents the average
number of scans from infected hosts that arrive at a node in a unit of time.
Let Δ(i) denote the difference in the moonwalk lengths for the normal and the worm
scenarios. Using Eqs. (4.1) and (4.2), we can express Δ(i) as:

    Δ(i) = l_w(i) - l_n(i)
    Δ(i) = c*Δ(i-1) + (l_w(i-1) + 1) * s*I(i-1)/H,    i = 1, ..., N    (4.3)

If the moonwalks start at time N, we can express Δ(N) by unfolding the recursion as:

    Δ(N) = (s*c^(N-1)/H) * Σ_{i=0}^{N-1} (l_w(i) + 1) * I(i) / c^i     (4.5)
Since l_w(i) > l_n(i) ≥ c, and by virtue of Eq. (4.1):

    Δ(N) ≥ (s*c^(N-2)/H) * Σ_{i=0}^{N-1} l_n(i+1) * I(i) / c^i         (4.6)
Using only the last term of the summation in (4.6) and substituting values for a uni-
form scanning worm (e.g., for the Witty worm [67], s = 350 scans per tick, with conservative
values f = 0.03 and c = 0.5), we find that by the point when the infection has overtaken 10% of the
vulnerable population, the dilation in the walk length is greater than twice the moonwalk
length in the normal traffic scenario.
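The growth of this dilation can also be checked numerically by iterating the two recursions. In the sketch below, `extra` stands for the worm term s·I/H in Eq. (4.2), held constant as a simplification; with the numbers above, once 10% of the vulnerable population (I = 0.1·f·H = 0.003·H) is infected, extra = 350 × 0.003 ≈ 1.05:

```python
def walk_lengths(N, c, extra):
    """Iterate recursions (4.1) and (4.2) side by side.

    `extra` stands for s*I(i-1)/H in Eq. (4.2); holding it constant is a
    simplification (in reality I grows over time), and walk lengths are
    not capped at the hop bound d here.
    """
    l_n = l_w = 0.0
    for _ in range(N):
        l_n = (l_n + 1.0) * c            # Eq. (4.1): normal traffic
        l_w = (l_w + 1.0) * (c + extra)  # Eq. (4.2): worm present
    return l_n, l_w
```

With c = 0.5, the normal walk length converges to c/(1 − c) = 1, while the worm-scenario walk length grows without bound, far exceeding twice the normal value.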
4.3 Worm Identification
The modified moonwalk algorithm from the previous section can detect the presence
of a mobile worm. We now show how it can also be adapted to identify the first infected
node (also known as patient zero) as well as to reconstruct the initial infection sequence.
Depending on the speed of detection, the identities of these infected nodes can then be used
to thwart the infection, for example by blocking traffic from those nodes and inspecting
their traffic to generate attack signatures [68, 69].
At the same time, it is generally impossible to pinpoint the patient zero(s) of any in-
fection using a purely flow-centric approach such as moonwalks. To see this, consider a
scenario in which the patient zero was contacted by a benign node prior to the start of the
infection. It is difficult to infer which of these two nodes is in fact the true patient zero
without inspecting the contents of the flow between these nodes, or the nodes themselves.
What the moonwalk algorithm can achieve is to considerably reduce the number of nodes
that must be inspected in order to reveal the origins of the infection.
Specifically, the algorithm identifies a small set of candidate infection trees, one of
which is the true infection tree. In this context we define the infection tree as the graph
induced by the worm's node infection sequence. The first step in this process is to iden-
tify each of these trees' roots. To do so, we modify the moonwalk algorithm in three key
aspects. First, we add a stopping condition to the moonwalk: a parameter Ts which de-
notes the estimated time when the infection started. We then halt every moonwalk when it
proceeds past Ts. This parameter can be estimated using the detection algorithm described
in Section 4.2. Specifically, we showed that just before the infection enters its exponential-
increase phase, the average path length departs from the normal average. Then, if the worm
is detected at time Td, we set Ts = Td − ΔT. The value of ΔT depends on the mobility
pattern and worm characteristics. In general, given a specific network and mobility model,
ΔT should be set to the amount of time required for an infection to propagate to a popular
domain from any wireless domain in the network. Second, in addition to edge frequencies,
we record the frequencies with which each root node (i.e., alleged patient zero) appears
in the moonwalks, as well as the average walk lengths associated with each of these root
nodes. Finally, we start each walk randomly, but only from nodes within the top p% most popular
domains. The rationale for this choice is based on the observation that a mobile infection
in its initial phase moves towards domains of high popularity [62]. Therefore it is more
likely that during the early stages of the infection a malicious flow will be encountered in
popular domains. We set p = 25% to maximize the probability that at least some of the
moonwalks will follow a backwards path on the infection tree.
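The three modifications above can be sketched as follows. This is a hypothetical illustration: flow records are (src, dst, t_start, t_end) tuples, `popular_nodes` is an assumed precomputed set of nodes residing in the top-p% popular domains, and the parameter defaults echo Table 4.1:

```python
import random
from collections import defaultdict

def origin_moonwalks(flows, popular_nodes, t_s, runs=10000, d=50, delta_t=300):
    """Moonwalks modified for origin identification (a sketch).

    (1) Walks halt once they would step to a flow starting before the
    estimated infection start t_s; (2) the frequency and average walk
    length of each root node (alleged patient zero) are recorded;
    (3) walks start only from flows sourced in popular domains.
    """
    start_edges = [f for f in flows if f[0] in popular_nodes]
    root_freq = defaultdict(int)
    root_lens = defaultdict(list)
    for _ in range(runs):
        walk = [random.choice(start_edges)]
        while len(walk) < d:
            src, _, t_start, _ = walk[-1]
            prior = [f for f in flows
                     if f[1] == src and f[3] < t_start < f[3] + delta_t
                     and f[2] >= t_s]          # stopping condition (1)
            if not prior:
                break
            walk.append(random.choice(prior))
        root = walk[-1][0]                     # earliest edge's source
        root_freq[root] += 1                   # recording (2)
        root_lens[root].append(len(walk))
    avg_len = {r: sum(v) / len(v) for r, v in root_lens.items()}
    return root_freq, avg_len
```

The returned per-root frequencies and average walk lengths are exactly the two quantities plotted against each other in the next step.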
Once we carry out the random moonwalks, we draw the scatter plot of root node
frequency versus walk length. The outliers in this scatter plot, that is, the nodes with
high frequencies and long walk lengths, are the possible roots of the infection trees. The
intuition behind this approach is two-fold: (a) the frequency of the actual patient zero is
high because worms tend to form tree-like structures and therefore multiple reverse paths
lead to that node; (b) unlike worms, non-malicious node contacts do not tend to form long
paths.
We evaluate the performance of the algorithm using the simulation setup presented
earlier. Figure 4.4 illustrates the scatter plot from the output of 10,000 moonwalks on 2.4 ×
10^5 flows, with ΔT set to five hours. We use a simple filtering algorithm for identifying the
Figure 4.4: The scatter plot of walk length versus root node frequency. The square dot indicates the actual patient zero.
outliers from this scatter plot. Points having walk length and frequency greater than 90%
of the rest are chosen as outliers. The dashed horizontal and vertical lines in Figure 4.4
represent these 90th percentiles. The points in the upper right quadrant of the graph are the
roots of the candidate infection trees. The actual patient zero is also shown in Figure 4.4
as a square dot. While more sophisticated outlier detection algorithms could be used, we
found in practice that this simple approach produces a small list of candidates that always
includes the actual patient zero.
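The percentile filter described above admits a very small implementation; the quantile helper below is a deliberately simple stand-in for a proper outlier detector:

```python
def candidate_roots(root_freq, avg_len, pct=0.90):
    """Pick outlier roots: those whose frequency and average walk length
    both fall at or above the pct-quantile over all observed roots.
    """
    def quantile(values, q):
        vals = sorted(values)
        return vals[min(int(q * len(vals)), len(vals) - 1)]
    f_cut = quantile(list(root_freq.values()), pct)
    l_cut = quantile(list(avg_len.values()), pct)
    return [r for r in root_freq
            if root_freq[r] >= f_cut and avg_len[r] >= l_cut]
```

The returned list corresponds to the upper-right quadrant of the scatter plot in Figure 4.4.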
Starting from each of the candidate patient zeros, we reconstruct the candidate infection
trees using all the edges that were traversed during the moonwalk phase. This is done using
a simple breadth-first search (BFS) traversal. Figure 4.5 presents the results of this traversal
for three of the candidates from Figure 4.4. The nodes in these trees need to be further
inspected for signs of infection. While the actual inspection method is out of the scope
of this work, we can show that only a small percentage of nodes must be inspected. For
example, if we traverse all tree nodes up to depth three, then ~2% of the total population
must be inspected. As an aside, the actual infection tree shown in Figure 4.5 was the one
rooted at node 5344, and all the nodes in that tree were actually infected. This result
is encouraging because it indicates that the number of nodes falsely identified as active
spreaders is rather low.
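The BFS reconstruction with a depth cut-off can be sketched as follows; any (src, dst) edge list gathered during the moonwalk phase would serve as input:

```python
from collections import deque

def infection_tree(root, walked_edges, max_depth=3):
    """Reconstruct a candidate infection tree by BFS from a candidate
    patient zero over the edges traversed during the moonwalk phase.
    Returns the set of nodes to inspect, up to max_depth hops away.
    """
    children = {}
    for src, dst in walked_edges:        # forward direction: src -> dst
        children.setdefault(src, set()).add(dst)
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                     # do not expand past the cut-off
        for nxt in children.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen
```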
Last, we investigate the effect of different Ts (estimated infection start time) values on
the outcomes of the algorithm. In our experiments Ts is set to the detection time minus ΔT
(five hours). In most cases, we noticed that this slightly underestimates the actual
infection onset. The problem with underestimation is that the true patient zero could have
been contacted by other nodes in this interval, thereby reducing the patient zero's frequency
in the moonwalks. Nonetheless, the proposed algorithm is still effective, as the nodes which
contacted the true patient zero still exhibit relatively high frequencies and long path lengths
in place of the patient zero itself. The downside is that as the estimation error increases,
the number of nodes that need to be inspected also increases. In our experiments, we
noticed that even if the actual start time is underestimated by about 10,000 seconds
(~2.8 hours), we needed to inspect at most 5% of the total node population.
4.3.1 Discussion
We have shown that the infection tree has both a high patient-zero occurrence frequency
and a long walk length. However, as the volume of normal traffic increases, it adds more
noise to the selection algorithm. In other words, the normal traffic starts forming trees with
lengths comparable to those of infection trees. As a result, the number of candidate trees to
Figure 4.5: Candidate infection trees reconstructed using a BFS search. The tree rooted at 5344 is the actual infection tree; all nodes in this tree were indeed infected by the worm. The trees rooted at 1167 and 2148 are benign. A directed edge between nodes X and Y indicates that X initiated at least one flow to Y.
inspect increases. In the extreme case, the infection tree could remain 'hidden' within the
volume of normal traffic.
We investigate the effect of the normal traffic volume by running the proposed identifi-
cation algorithm on networks with increasingly higher levels of normal traffic. To do so, we
keep the Lognormal distribution of flow inter-arrival times presented in Section 4.2.1,
but decrease the mean inter-arrival time, thus generating increasing levels of normal traf-
fic. As Figure 4.6 illustrates, the percentage of mobile nodes that should be investigated
for signs of infection increases as hosts spawn flows faster. Nonetheless, two encouraging
observations can be made. First, the algorithm continues to identify the infection tree as
the volume of normal traffic increases. Second, a decrease in the inter-arrival time by three
orders of magnitude increases the number of nodes that must be inspected only sixfold.
Figure 4.6: Percentage of mobile nodes that need to be inspected for signs of infection as a function of the normal traffic intensity.
Even when the average flow inter-arrival time is roughly ten minutes (a high value for
intra-domain traffic in wireless networks), the algorithm needs to inspect only 6% of the
overall node population.
Finally, we briefly address a few issues regarding the practicality of deploying the proposed
framework, with reference to the latency of real-time log collection, diverse background
traffic, and the size of connection logs. An in-depth study is deferred to future work.
As noted earlier, we assume that flow records are either aggregated at a
centralized database, or available through a federated database (as proposed in NFA [64]).
As seen from Figure 4.2, the detection algorithm can tolerate a latency of up to an hour
while still providing early detection. With regards to other types of background traffic, P2P
traffic also tends to form long paths; however, it does not cause large changes in the path
lengths. Since our detection algorithm concentrates on changes in the path lengths,
worm and normal P2P traffic can be differentiated. With regards to the amount of space
required to store the flow headers of the intra-domain traffic: assuming that the source host
ID, destination host ID, domain ID, and start and end times of a flow require four bytes each, a
flow record can be described in 20 bytes. In this case, even if we assume that nodes initiate
new connections at a rapid pace of one flow every minute, the storage space required for
the simulated network of 6,000 nodes is a modest 165 MB/day. Moreover, we expect the
storage space to grow slowly as the number of hosts increases, because the host contact
graph is sparse.
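This estimate is easy to verify with a back-of-the-envelope calculation, using the field sizes and flow rate assumed above:

```python
def flow_log_storage_mb_per_day(nodes, flows_per_min=1, record_bytes=20):
    """Daily flow-log storage in MB (2^20 bytes), assuming five 4-byte
    fields per record and one new flow per node per minute."""
    return nodes * flows_per_min * 60 * 24 * record_bytes / 2**20

# For the simulated 6,000-node network this gives ~165 MB/day.
```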
4.4 Related Work
The threat of mobile infections was first discussed by Anderson et al. [54]. Sarat et al.
derived the speed of mobile worms through analysis and simulations [62], while more re-
cently attacks against metro-area wireless networks were discussed by Akritidis et al. [70].
Our work is inspired by the work of Xie et al. on random moonwalks [63, 64]. How-
ever, that technique is primarily a post-mortem tool for identifying infected nodes, in the
context of Internet worms, using host contact graphs. As we showed in Section 4.2.1, the
effectiveness of the standard moonwalk method decreases rapidly as worms become more
stealthy and infections are carried across domains by mobile nodes. We address the limi-
tations of the original approach by exploring different heuristics, such as moonwalk length
and node occurrence frequency, and show that moonwalks can in fact be used as a tool for
the detection and origin identification of mobile worms.
Origin identification has been studied in the context of Internet worms. For example,
Kumar et al. presented a forensic analysis of the Witty worm [71] by reverse engineering
the random number generator used by the worm [72]. In contrast, our technique is flow-
based and thus worm-agnostic. Recently, there has been a body of work on securing en-
terprise networks [73, 74]. While that network environment is orthogonal to the one used
in this chapter, we believe that the moonwalk technique presented herein can play a role
within such centralized architectures to detect and provide forensic analysis of malicious
activities.
4.5 Summary and Future Work
This chapter presents mechanisms to detect the existence and to identify the evolution
of worms spreading through a collection of wireless domains, carried by the physical move-
ment of mobile hosts. The proposed approach extends the existing framework of random
moonwalks by focusing on the combination of moonwalk lengths and node frequencies to
detect the existence of a stealthy worm and determine the identities of the infection's initial
victims.
While we evaluated these algorithms in the context of mobile networks, we believe that
they are also applicable to other worm scenarios. Because moonwalks essentially cull out
worm edges in the presence of noisy background traffic, we believe them to be robust in
the presence of missing traffic, or in a distributed scenario in which some domains are non-
cooperative.
Acknowledgments
We gratefully acknowledge the use of trace data from the CRAWDAD archive at Dart-
mouth College. We thank Fabian Monrose, Razvan Musaloiu-E., and Moheeb Abu Rajab
for their suggestions. Brian Hoffman helped immensely in collecting the intranet data
traces. This work was supported in part by the National Science Foundation through grant
CNS-0627611.
Chapter 5
On Web Browser Protection
The web browser is the most widely used network application on the Internet today. The
past few years have seen a spate of browser-related vulnerabilities, e.g., cross-site scripting (XSS) and cross-site request forgery (CSRF) attacks. The majority of these attacks exploit the trust placed by a web browser in a web site providing content. While such trust works well for single-source content, recent years have seen web sites evolve from essentially single-principal sources into pages containing a mashup of code and data from multiple, perhaps mutually distrusting, sites. The increasing number of attacks exploiting web browsers indicates that the security policies that browsers currently enforce are clearly inadequate.
In this chapter, we focus on novel browser abstractions with the aim of alleviating browser vulnerabilities. Specifically, we propose two new abstractions: (a) for content which needs to be completely isolated from other domains, and (b) for content shared amongst domains with access control enabled. Existing browsers support the isolation abstraction using the <frame> or <iframe> tag. However, the origin of the frame and the document must be different. Consequently, this technique is ineffective against same-site XSS attacks, like the Samy worm [75]. Furthermore, the abstractions presented herein allow controlled sharing of content. To illustrate this, we use the example of a hypothetical social networking site in which users can view, share and execute each other's JavaScript. The abstractions in today's browsers are not granular enough to accomplish such sharing without compromising security, due to the danger of an XSS attack. For example, if Alice and Carol are allowed to submit scripts to Bob's profile page, then Bob has no way of selectively executing only Alice's script when he views his profile page in the browser. Allowing controlled access to the entities of an HTML page, e.g., the DOM, cookies, etc., can eliminate XSS attacks, even when scripts are allowed as user-generated input. We design a multi-principal browser which supports these abstractions. As a proof of concept of their effectiveness, we modify the Konqueror browser source code. The changes are backwards compatible with legacy systems.
The rest of this chapter is organized as follows. Section 5.1 provides a brief overview of present-day browser protection mechanisms and their vulnerabilities. Section 5.2 details the abstractions introduced in this chapter. In Section 5.3, we describe the implementation of the abstractions in the Konqueror browser. We present related work in Section 5.4. Finally, we conclude and present avenues for future work in Section 5.5.
5.1 Background
Web pages today provide a rich, interactive experience driven by client-side scripting, enabling asynchronous requests. Moreover, web pages are increasingly multi-principal, e.g., web mashups. These pages are composed of content originating from more than one site. For example, users of the pipes.yahoo.com mashup wizard connect to pipes.yahoo.com to get data. The request is proxied to the real data providers and the response data is then passed back from pipes.yahoo.com to the mashup. A custom mashup could, for instance, source image data from Flickr corresponding to a news item from CNN. Furthermore, the AJAX (Asynchronous JavaScript and XML) programming model is commonplace today in applications like Google Maps. AJAX uses client-side JavaScript to maintain interactivity while network-centric requests are relayed in the background using XMLHttpRequest calls to the server. Browsers of today are incapable of handling such complex access control policies. In practice, mashups are created using third-party proxies which reformulate the page before it is sent to the browser. This is not granular enough to be either secure or scalable.
5.1.1 Same Origin Policy
The same origin policy (SOP) governs access control in today's browsers. The philosophy of the SOP is simple: it is unsafe to trust content loaded from third-party websites in the context of a webpage. As semi-trusted scripts are run within the sandbox, they should only be allowed to access resources from the same website, not resources from other websites, which could potentially be malicious. Two pages share the same origin if the protocol, port and host are the same for both pages. Every browser window, <frame> and <iframe> is associated with an origin. While a page cannot directly query other websites for data due to the same origin policy, the <script> tag does not honor it. A web page might contain <script> elements sourced from different domains. Such scripts function under the purview of the document's origin and can access all of the document's resources. For example, if a page a.com/index.html contains a script tag <script src="http://b.com/myscript.js">, then myscript.js has access to all the DOM elements, cookies and data of a.com's index.html page. However, myscript.js cannot access any resource pertaining to b.com within this context.
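The origin comparison described above can be sketched in a few lines. This is a simplified model of the SOP check, assuming default ports for http and https; real browsers apply further rules (e.g., document.domain relaxation) not shown here.

```python
from urllib.parse import urlparse

def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin iff the (protocol, host, port) triples
    match. Default ports are filled in so http://a.com and
    http://a.com:80 compare as equal."""
    def origin(url):
        p = urlparse(url)
        port = p.port or {"http": 80, "https": 443}.get(p.scheme)
        return (p.scheme, p.hostname, port)
    return origin(url_a) == origin(url_b)
```

Note that changing any one component of the triple, including only the scheme, yields a distinct origin.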
5.1.2 XSS attacks
In XSS, an attacker typically exploits the case where a web server directly sources user input into a dynamically generated page without first filtering the input. Attacks can be either persistent or non-persistent. Persistent attacks (or stored vulnerabilities) occur when data provided to a web site by a user is stored on the server without being checked for script entities. The malicious script then executes with the site as its origin, and can send sensitive data back to the attacker. A classic example is the Samy worm [75], which propagated across the MySpace social-networking site. Non-persistent attacks are reflected attacks, wherein data provided by a web client is used immediately by the server side to generate a page of results for that user. If unvalidated user-supplied data is included in the resulting page without proper HTML encoding, client-side code can be injected into the dynamic page. An attacker can then use social engineering to trick a user into visiting the URL, which injects the malicious script into the dynamic page.
The root causes of XSS attacks are unsanitized user input and unexpected script execution. Typically, server-side applications of today sanitize input by HTML-encoding all user input (e.g., &lt; in place of <). However, websites often allow rich user input, in the form of HTML or images. Parsing for scripts within such rich user input is non-trivial, as demonstrated by the many existing ways of injecting a script [8].
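The difficulty of parsing for scripts, as opposed to encoding everything, can be illustrated with a toy comparison. The blacklist filter and the payload below are illustrative; real filters and real injection vectors are far more varied.

```python
import html

def naive_filter(s: str) -> str:
    # Blacklist approach: strip only literal <script> tags.
    return s.replace("<script>", "").replace("</script>", "")

def encode(s: str) -> str:
    # Encoding approach: neutralize all markup characters.
    return html.escape(s)

# Script injection without any <script> tag: an event handler attribute.
payload = '<img src=x onerror="steal()">'
```

Running both defenses over the payload shows the blacklist passes it through untouched, while HTML encoding leaves no markup characters for the browser to interpret.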
The other known approach to defending against XSS attacks is to constrain user data using the SOP. A cross-domain iframe is used to display all user-supplied data, inclusive of scripts. For example, Alice's content page is served from alice.server.com while Bob's user-generated content is put into an <iframe> sourced from bob.server.com. Since the origins are different, there is isolation. Such an approach is, however, not scalable, because the server has to maintain a different domain for every user's generated input. Furthermore, script interactions with the rest of the page are then restricted and the display is not flexible.
5.2 Trust Model
As discussed previously, existing browsers depend purely on the <script> tag for cross-domain communication. To address this, we introduce two additional abstractions, one for isolating content and the other for sharing it under controlled access, using the <isolate> tag. Consider a webpage containing a tag such as:
<isolate src="http://server.com/alice.html" id=110>
This creates an isolated environment, akin to an iframe with a different source. Since the content within <isolate> is private, when the src attribute indicates a path from a different domain, the enclosing page cannot access the content of the page within the isolate tag; this is a side effect of the SOP. However, when the content comes from the same domain, the enclosing content can fully access the isolated content. Finally, the isolated content cannot reach out to access (read/write/execute) script elements, DOM elements, cookies, etc. of the enclosing page, even if their origins are the same.
Sharing can be enabled by treating the id as a bitmask. To illustrate this, consider another isolate tag in the same page:
<isolate src="http://server.com/bob.html" id=101>
An access control scheme could, for example, allow sharing between all the isolated environments which have their most significant bit (MSB) set to 1. The method of access control could in fact be made similar to process access control in traditional operating systems. For now, we treat the topic of access-controlled sharing as an avenue for future work.
DEFENSE AGAINST XSS
As was seen in Section 5.1, XSS attacks arise due to a confused-deputy problem in the browser abstractions. Since the browser is unable to distinguish between "good" and "bad" scripts, an all-or-nothing approach is used. Using the abstractions presented herein, the web server serves untrusted content within an isolate tag.
We briefly outline our defense strategy using the example of a message forum where users are allowed to post scripts in addition to HTML. Every user-generated input is packaged into an isolate tag, such as:
<isolate src="http://forum.com/userA.html" id=101>
Here the id is specific to a user; every user is assigned a different id. In this way, sharing of scripts can also be implemented, if desired. Now, scripts can access elements within the isolate subtree, but any access to the enclosing page is denied. The rationale behind this technique is as follows: if we consider user-generated input as tainted information, the server can easily distinguish input from different users and differentially taint it using user ids. The isolate tag then instructs the browser to treat tainted content appropriately, either isolating it or allowing it to be shared. This technique works effectively even in the case of a non-persistent XSS attack, whereby the user-generated input is isolated from the enclosing page.
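The server-side tainting step described above can be sketched as a small templating helper. Function names and the post representation are ours, for illustration only; the essential point is that every user submission is emitted inside its own <isolate> tag carrying that user's id.

```python
def wrap_user_content(isolate_id: str, content_url: str) -> str:
    """Taint one user's submission: serve it from its own URL and wrap
    it in an <isolate> tag carrying that user's id."""
    return '<isolate src="{}" id={}>'.format(content_url, isolate_id)

def render_forum_page(posts):
    """posts: list of (isolate_id, url) pairs, one per user submission.
    The enclosing forum page never inlines raw user markup."""
    return "\n".join(wrap_user_content(i, u) for i, u in posts)
```

Because the browser denies any access from an isolate subtree to the enclosing page, a script smuggled into a post can no longer read the forum page's DOM or cookies.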
5.3 Konqueror Implementation
We built a proof-of-concept browser based on the abstractions designed above. To do so, we modified the Konqueror browser source code [76]; the modified browser runs on Linux. The changes are backwards compatible with legacy systems.
We implemented only the isolate abstraction, leaving the sharing abstraction as part of our future work. Our extension to the Konqueror source code sits between the JavaScript interpreter (KJS) and the HTML browser engine (KHTML). Whenever a script element is encountered, KHTML passes the script to the KJS interpreter, which then returns the results of evaluation to KHTML. Script execution can manipulate DOM elements. Therefore, whenever a DOM object is encountered within a script, calls are made back to the KHTML library for references to these objects. We illustrate the call flow using an example (Fig. 5.1).
<script type="text/javascript">
document.write("The title of my parent is " + parent.document.title);
</script>
The KHTML library calls the evaluate function of the KJS interpreter and passes it the script code. However, KJS needs a reference to the object document to resolve document.title. A call is made back to the KHTML library to obtain the reference to the corresponding DOM object. Similarly, a call is made from KJS back to the KHTML library
Figure 5.1: The proxy extension overlaid on top of a simplified JavaScript call graph.
to resolve parent of the document object. Our extension acts as a proxy between KHTML and KJS. Whenever calls are made from KJS to KHTML, we check whether the calls are allowed as per the isolation restriction. In the above example, if this script is from isolate content, then the KHTML library returns the document itself as the parent. Otherwise, the true parent is returned.
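The proxy's reference-rewriting logic can be sketched abstractly as follows. This is a model of the behavior only: the actual implementation is C++ inside Konqueror, and the class and method names here are ours, not Konqueror's.

```python
class ToyEngine:
    """Stand-in for the KHTML side: tracks which documents came from
    <isolate> content and what the real parent document is."""
    def __init__(self, isolated, parent):
        self.isolated, self.parent = isolated, parent

    def is_isolated(self, doc):
        return doc in self.isolated

    def true_parent(self, doc):
        return self.parent

class IsolationProxy:
    """Sketch of the extension sitting between the script interpreter
    and the browser engine (KJS and KHTML in the real system)."""
    def __init__(self, engine):
        self.engine = engine

    def resolve_parent(self, document):
        # A script inside <isolate> content gets the document itself
        # back as its parent, so it can never reach the enclosing page.
        if self.engine.is_isolated(document):
            return document
        return self.engine.true_parent(document)
```

With this interposition, the example script above would print the isolate's own title when run from isolated content, and the real parent's title otherwise.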
5.4 Related Work
There has been a plethora of work on studying and protecting browsers and the underlying operating system from browser vulnerabilities such as drive-by downloads. Moschuk et al. [77] conducted a study of spyware on the web by crawling 18 million URLs in May 2005. HoneyMonkey by Wang et al. detects exploits against Windows XP while visiting sites in Internet Explorer. Provos et al. [78] crawl millions of pages to determine which URLs are malicious (drive-by downloads, etc.). These pages can later be marked so as to caution the user against visiting them. Our work is different in that it focuses on the weaknesses in the security policies of the browser rather than web-based infections. Furthermore, these approaches typically deploy a browser within a virtual machine and detect any changes to the operating system while visiting websites in order to mark URLs. This approach is heavyweight and does not work against vulnerabilities such as XSS.
Several new browser communication proposals have emerged due to the limitations of the SOP [79], wherein a site may request information from any other site, and the responder can check the request to decide how to respond, e.g., Flash. While such policies are verifiable, they cannot contain XSS attacks. Subspace [80] provides a cross-domain communication mechanism using a small JavaScript library. Subspace divides a site into subdomains. A subdomain can be used to source scripts from other domains. A cross-subdomain channel is set up by setting the document.domain of the two subdomains to a common domain suffix. However, such an approach is cumbersome for the mashup developer when many sources are involved. Browser-enforced embedded policies (BEEP) [81] allow a website to whitelist scripts, i.e., declare which scripts are safe to run. While such a technique can combat XSS attacks, it still lacks abstractions for access control which could be employed in sharing scripts. MashupOS [82] is perhaps the closest to the work presented in this chapter. In the MashupOS project, a multi-principal browser is built based on the abstractions <Sandbox> and <OpenSandbox>. <Sandbox> is similar to the <Isolate> tag. <OpenSandbox> is similar to <Sandbox>, the only difference being that the enclosing page can access the sandboxed content. They also envisage <ServiceInstance> as a unit of abstraction which guarantees resource allocation and <CommRequest> for cross-domain communication. Sharing in their model is done explicitly by adding a CommRequest agent, while ours is closer to process sharing in Unix-like systems.
5.5 Summary and Future Work
Content sourced from various sites, combined with asynchronous programming models (AJAX), has made it necessary for browsers to become multi-principal. Browsers need to be able to handle trust relationships between different sites and between entities on the same site. This chapter focuses on providing abstractions for protection and sharing mechanisms, to improve browsers. This is a major improvement over today's browsers, which employ an all-or-nothing trust relationship. Using a modified version of the Konqueror browser, we showed how XSS attacks can be contained using the <isolate> abstraction.
While these abstractions act as a defense against XSS and help build robust mashups, they still fall short in terms of managing browser resources and fault containment. Browsers need to act as de facto operating systems for executing the client-side components of web applications, providing isolation as well as methods of resource management and fault containment. Such an architecture could perhaps help combat emerging attacks, e.g., pharming and puppetnets [83]. Puppetnets employ large swathes of rogue websites which redirect requests from web clients to third-party websites, thereby creating a denial-of-service (DoS) like phenomenon. Pharming attacks can be used for local subnet fingerprinting. For example, a rogue website can include malicious JavaScript in its page to scan a local subnet behind a firewall and send the scan results back. To deal with such attacks, a complete operating-system-style resource management abstraction in the browser, with isolated memory, display and network resources and fault containment, seems essential and deserves further study. In such a browser, extraneous web connections opened on behalf of a website could be monitored and perhaps curtailed.
Chapter 6
On the Use of Anycast in DNS
There have been several targeted DDoS attacks on one or more of the thirteen DNS root servers [6]. Such attacks are significant because the root nameservers provide an important translation service, vital to the core functioning of the Internet. Therefore, an attack on the DNS fabric tends to take down the entire Internet, rather than specific websites as is normally the case. As shown in Chapter 2 of this thesis, botnets of today can include hundreds of thousands of nodes, distributed all over the world. Hence, protecting the Internet infrastructure against such large adversaries is important. Accordingly, to meet this robustness criterion, anycast is widely deployed in DNS today [84]. The IP addresses of many top-level DNS nameservers correspond to anycast groups. Client requests sent to these addresses are delivered by the Internet routing infrastructure to the closest replica in the corresponding anycast group. DNS operators have deployed anycast for a number of reasons: reduced query latency, increased reliability and availability, as well as resiliency to DDoS attacks. While it is generally agreed that the deployment of anycast in DNS has been a positive step, no studies have been done to evaluate the performance improvement offered by anycast. This chapter, which is drawn from our work [85], presents the first comprehensive study in this area.
Specifically, we aim to answer the following questions: (1) Do servers deploying anycast experience a smaller number of outages, and what is the duration of these outages? (2) How stable is the anycast server selection over time? (3) Does anycast reduce query latencies? To answer these questions, we performed a measurement study, using clients deployed over PlanetLab [86], to measure the performance characteristics of four top-level servers using anycast and compared them to a server not using anycast. In our study, we identified a set of different anycast deployment strategies that are currently used in practice. Thus, we monitored servers that represent different points in the anycast design space to compare the effects of these design choices. Specifically, we evaluate the effects of single vs. multiple anycast addresses for a zone and global vs. localized visibility of the servers in the anycast group. We also compared these servers against a hypothetical zone with the same number of nameservers but where all the nameservers are individually addressable. By doing so, we can directly compare anycast to the traditional zone configuration guidelines [87].
Our results can be summarized as follows: We found that for all the measured zones, and independently of the anycast scheme used, the deployment of anycast decreases average query latency and increases availability when compared to centralized servers. Furthermore, our study shows that while the number of query failures is relatively small (≤ 0.7%), outages are long in duration (≈30% last more than 100 seconds), affected by long BGP routing convergence times. Interestingly, we show that, even though outage duration is not affected by the anycast scheme, the frequency of outages relates to the scheme used, i.e., whether servers have local or global visibility. In addition, we identified that the anycast scheme determines the percentage of queries directed to the closest anycast instance. This value ranges from about 37% for servers with a few global nodes to about 80% for servers in which all nodes are global. We also uncovered an inherent trade-off between the effectiveness of anycast in directing queries to the nearest server and the stability of the zone itself. For servers that advertise all their anycast group members globally, clients choose the nearest server most of the time. The negative effect, though, is that in this case the zone becomes vulnerable to an increased number of network outages and server switches.
The rest of this chapter is structured as follows: We give a brief introduction to anycast in Section 6.1 and explain our measurement methodology in Section 6.2. Section 6.3 presents the servers used in this study and provides the rationale for choosing them. We present our results and compare the different anycast strategies in Section 6.4. In Section 6.5, we outline a novel technique for configuring anycast groups that maximizes redundancy and distributes load evenly among the members of the anycast group. Finally, we present related work in Section 6.6 and conclude in Section 6.7.
6.1 Background
Anycast, first described in [17], provides a service whereby a host transmits a datagram to an anycast address and the internetwork is responsible for delivering the datagram to at least one, preferably the closest, of the servers in the anycast group. The motivation behind anycast is that it simplifies service discovery: a host does not have to choose from a list of replica servers, offloading to the network the responsibility of forwarding the request to the "best" server.
Figure 6.1: Sample anycast configuration.
Since the benefits of anycast are largely derived from its implementation, we briefly review how anycast is currently implemented in the Internet. In Fig. 6.1, the two servers Le and Lw are members of the anycast group represented by address I. Each of these servers (or rather their first-hop routers) advertises a prefix that covers I¹ using BGP [88] to Ra4 and Ra3 in ASA. Each of these routers in turn propagates the advertisement to its iBGP peers Ra1 and Ra2. The process continues until the advertisements reach the egress routers Rb4 and Rb3 of ASB, where customers Ca and Cb are connected, respectively. Router Rb4 chooses the advertisement from iBGP peer Rb1 because the IGP distance to Rb1 is shorter than the distance to Rb2. This selection is usually called hot-potato routing because it causes packets to exit the provider's network as early as possible. The final effect of these choices is that packets from Ca follow the path Ca → Rb4 → Rb1 → Ra1 → Ra4 → Lw. Similarly, packets from Cb follow the right vertical path. It is evident from this description that the combination of BGP hot-potato routing inside an autonomous system and shortest-AS-path routing across autonomous systems results in choosing the closest anycast server, closest being defined in terms of IGP metric and AS hop length.
Operators can incorporate anycast into their DNS zones, i.e., the domains that are under their administration, in a number of ways. For example, the operator can use one or multiple nameserver addresses (NS records in DNS parlance), each with a different anycast address. Anycast prefixes can be globally advertised or their scope can be limited to the immediate neighboring autonomous systems. Servers whose advertisements are scoped are called local nodes while nodes with no scoping are called global nodes. Local nodes limit the visibility of their advertisements by using the no-export BGP attribute. Peers receiving advertisements with this attribute should not forward the advertisement to their peers or providers. Scoping is used to support servers with limited transaction and bandwidth resources and servers serving only local networks. Finally, the anycast prefix(es) can originate from a single AS, or the zone operator can be multihomed so that multiple ASes inject the prefix into the global BGP table.
In addition to the anycast address, each server in the anycast group has a unique unicast address. This address is mainly used for management purposes (e.g., zone transfers) and is selected from prefixes different from the prefix containing the anycast address. This ensures that the management interface is reachable even if the anycast prefix becomes unavailable (e.g., during a routing outage or a DDoS attack on the anycast address), since the routing path to the anycast address is different from the path to the unicast address. The importance of this fact will become clear in Section 6.4.4, where we investigate whether anycast leads clients to the closest server.
¹The prefix is usually a /20. This requirement emerges from the fact that advertisements for shorter prefixes are not propagated by the routing infrastructure, to reduce the size of the global routing table.
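The two routing rules that drive anycast server selection, shortest AS path across domains and hot-potato (lowest IGP metric) within a domain, can be sketched as a single preference comparison. This is a simplified model of BGP route selection for illustration; the route attributes and values are assumptions, and real BGP applies several additional tie-breaking steps.

```python
def best_route(routes):
    """routes: list of dicts with 'as_path' (list of AS numbers) and
    'igp_metric' (distance to the egress point inside the local AS).
    Prefer the shortest AS path; break ties with the lowest IGP metric
    (hot-potato routing)."""
    return min(routes, key=lambda r: (len(r["as_path"]), r["igp_metric"]))
```

In the Fig. 6.1 scenario, both anycast instances are reached over equal-length AS paths, so the IGP tie-break decides: traffic exits toward the nearer egress, which is why Ca's packets end up at Lw.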
6.2 Measurement Methodology
Our goal is to investigate the implications of using anycast in DNS and to compare the performance benefits of different anycast configurations. The two primary factors affecting the performance of anycast are: (I) the number and location of the anycast servers relative to the DNS clients, and (II) the anycast scheme used, specifically whether scoping is used and whether one or more anycast addresses are visible to the clients. To quantify the relative benefits of each of these factors, we used the following four types of server configuration in our measurements, each representative of a different point in the anycast design space:
(1) A server with one or more instances in a single geographic location: While this case does not use anycast, we use it as a base case to explore the potential performance improvements of using multiple geographically distributed servers. We chose the B-root nameserver as the representative of this category. (2) A server using a single anycast address for all its instances, with multiple instances in different locations: We used the UltraDNS servers, which are authoritative for the .org and .info top-level domains, as the representative of this category. UltraDNS servers are members of two anycast groups, TLD1 and TLD2, with all the instances being globally visible. (3) A server using a single anycast address for all its instances, with multiple instances in different locations, some globally visible and some scoped to a local region: To investigate the effects of the number and location of anycast group members on performance, we chose two different examples: the F-root nameserver and the K-root nameserver. (4) A set of geographically distributed servers, each individually accessible via unicast: We used this case to evaluate the quality of the routing paths provided by the network fabric connecting the anycast servers to their clients. To enable a direct comparison with anycast, we want to keep the number and location of the nameservers constant. To do so, we used the F-root example, but in this case clients send requests to the unicast addresses of the F-root group members. Each client maintains a list of all the servers ordered by latency and sends its queries to the closest server on its list. If a server becomes unavailable, the client tries the subsequent servers on its list until it receives a response.
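The client behavior in case (4) can be sketched as a simple latency-ordered failover loop. The server names and the send_query interface are stand-ins for the actual unicast DNS request logic used in the measurements.

```python
def query_with_failover(servers, send_query, timeout=2.0):
    """servers: list of (name, measured_latency) pairs. Sort by latency
    and walk down the list until a server replies; send_query returns
    None on timeout. Models the case-(4) unicast client."""
    for name, _latency in sorted(servers, key=lambda s: s[1]):
        reply = send_query(name, timeout)
        if reply is not None:
            return name, reply
    return None, None
```

For example, if the lowest-latency server is down, the client transparently falls through to the next-closest one.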
Because DNS clients have no control over where their queries are directed, we need clients in multiple locations to cover all the servers in an anycast group. For this reason, we used the PlanetLab [86] testbed for our measurements. We collected data from the PlanetLab nodes from September 19, 2004 to October 8, 2004. At the time of our measurements, there were approximately 400 nodes in PlanetLab, contributed by universities and research labs around the globe. The results presented in this chapter are based on measurements from approximately 300 active PlanetLab nodes. As we already mentioned, the client locations relative to the servers can potentially affect our measurements. Table 6.1 shows the distribution of PlanetLab nodes based on their geographic location.
We ran a script on every PlanetLab node to send periodic DNS queries to each of the DNS servers mentioned earlier. The query interval is selected uniformly at random from [25, 35] seconds. We used this interval to achieve sub-minute accuracy for the outage durations reported in Section 6.4.2. Our script records the query latency and the server name corresponding to the anycast instance answering the query. The script uses "special" DNS requests to retrieve the name of the server replying to a request sent to the anycast address ([89] shows the request type for F-root).
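The probe loop described above can be sketched as follows. The resolve and record callbacks stand in for the actual DNS query and logging code on each PlanetLab node, and the injectable sleep function is only there so the loop can be exercised without waiting.

```python
import random
import time

def measurement_loop(resolve, record, n_rounds, sleep=time.sleep):
    """One probe node: issue a query, record (latency, answering
    instance), then wait uniformly in [25, 35] s before the next round.
    resolve() is a stand-in for a DNS request that also returns the
    answering instance's name (e.g., via a special identity query)."""
    for _ in range(n_rounds):
        start = time.time()
        server_name = resolve()
        latency = time.time() - start
        record(latency, server_name)
        sleep(random.uniform(25, 35))
```

The randomized interval avoids synchronized probing across nodes while keeping the sampling granularity below one minute.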
Continent       % of PL nodes
South America   0.5
Australia       1.8
Asia            15.8
Europe          16.7
North America   65.2
Table 6.1: Distribution of used PlanetLab nodes around the world.
From a client perspective, DNS has to be always available and fast. To see how anycast contributes towards these end-user requirements, we compare the selected anycast deployment schemes based on the following criteria:
QUERY LATENCY
Reduction of end-user delay is an oft-quoted benefit of deploying anycast. To test whether this claim is true, we measure the latency of requests sent to the monitored servers and compare the results. Since anycast achieves this reduction through localization, we also calculate the percentage of DNS queries that are in fact routed to the nearest anycast instance.
AVAILABILITY
For a global infrastructure service such as the DNS, availability is a key issue. To evaluate the impact of anycast on availability, we measure the number and duration of outage periods. An outage period is a time during which clients receive no replies to their requests; during such periods, client name queries are not resolved. Since DNS requests and replies use datagrams, in case of a timeout we resend the request twice to differentiate between dropped packets and real DNS outages. The beginning of an outage period is marked by the consecutive loss of all three requests. The end of the outage period is marked by the receipt of the first answer from a DNS server. The difference between the end and the start of an outage period gives the length of the outage period.
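The outage bookkeeping above reduces to a small state machine over the probe log. The event representation is an assumption: each entry summarizes one probe, with replied=False meaning the original request and both retransmissions were lost.

```python
def outage_periods(events):
    """events: chronological (timestamp, replied) pairs, one per probe.
    An outage starts at the first probe whose three requests were all
    lost and ends at the next probe that receives an answer. Returns a
    list of (start, end) intervals."""
    outages, start = [], None
    for t, replied in events:
        if not replied and start is None:
            start = t          # all three requests lost: outage begins
        elif replied and start is not None:
            outages.append((start, t))  # first answer: outage ends
            start = None
    return outages
```

For a probe log sampled roughly every 30 seconds, two consecutive lost probes followed by a success yield a single outage of about one minute.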
CONSTANCY
Constancy measures the affinity of clients to a specific instance of the anycast group.
If a client switches from one instance to another, we say a flip has occurred. We use
the number of flips as a measure of constancy. We also calculated the amount of time
PlanetLab nodes are directed to the same server in the anycast group as a metric of the
stability of the anycast service. While constancy is not critical for DNS queries over UDP,
TCP transactions will be reset during server changes. Even though the percentage of TCP
transactions is small today, we expect it will increase in the future with the introduction
of DNSSEC [90]. Furthermore, these results can be extrapolated to provide an indication
of whether longer transactions, such as bulk transfers over TCP, would be affected by server
changes.
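As a minimal sketch of the constancy metric (our own illustration; the input format is an assumption), a flip is simply a change of answering instance between consecutive queries:

```python
def count_flips(servers):
    """servers: chronological list of anycast-instance identifiers that
    answered a client's successive queries. A flip is any change of instance."""
    return sum(1 for a, b in zip(servers, servers[1:]) if a != b)
```

For example, the sequence ["PAO1", "PAO1", "SFO2", "PAO1"] contains two flips.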
6.3 Anycast Deployment Strategies
In this section, we present the configuration of the monitored anycast servers and the
distribution of requests from the PlanetLab clients to each of these servers.
6.3.1 Multiple Instances, One site: B-Root
The B-Root server has 3 nodes (b1/b2/b3.isi.edu), all of which are located in Los
Angeles, CA. All the servers reside in the same network; therefore, this scenario is
representative of a multiple-instance, one-site server (case 1 in Sec. 6.2).
6.3.2 Multiple Instances, Multiple Heterogeneous Sites:
F,K-root
This is the case where an anycast server has multiple instances deployed in geographically
diverse sites, with some instances being globally visible and the rest being
scoped to their local region. For our measurements, we selected two anycast groups that
follow this configuration: the anycast groups of the F-root and K-root nameservers.
Table 6.2 gives a complete listing of all the F-root clusters at the time of our measurements.
This list is publicly available at [91]. One can see from Table 6.2 that a high percentage
(∼70%) of nodes are served by the F-root clusters PAO1 (Palo Alto) and SFO2 (San
Francisco). This is because these two clusters have been deployed as global nodes [92].
The rest of the clusters are visible locally and serve clients only within their communities.
Out of the 26 F-root clusters listed at the time of this study, PlanetLab nodes contacted only
16. The reason for this behavior is that the unreachable clusters have local scope and no
PlanetLab node is located within that scope; from outside the scope, the AS path to the
global node is shorter than the path to the local node, so the PlanetLab site routes
requests to the global node instead of the local one.
Like F-root, K-root consists of global and local nodes, albeit with a much smaller group
size. K-root consists of multiple clusters primarily concentrated in Europe. Table 6.3 gives
a complete list of K-root clusters and their reachability from PlanetLab nodes. Clusters
at Amsterdam and London have global visibility, while the remaining clusters have local visibility
Cluster  Location                      %
PAO1     Palo Alto, CA, USA            38.5
SFO2     San Francisco, CA, USA        32.1
MUC1     Munich, Germany               4.9
HKG1     Hong Kong, China              3.5
LAX1     Los Angeles, CA, USA          3.4
YOW1     Ottawa, ON, Canada            3.1
LGA1     New York, NY, USA             2.8
SIN1     Singapore                     2.0
TLV1     Tel Aviv, Israel              1.7
SEL1     Seoul, Korea                  1.6
SJC1     San Jose, CA, USA             1.5
CDG1     Paris, France                 1.1
YYZ1     Toronto, ON, Canada           1.1
GRU1     Sao Paulo, Brazil             1.0
SVO1     Moscow, Russia                0.8
ROM1     Rome, Italy                   0.7
AKL1     Auckland, New Zealand         -
BNE1     Brisbane, Australia           -
DXB1     Dubai, UAE                    -
JNB1     Johannesburg, South Africa    -
MAD1     Madrid, Spain                 -
MTY1     Monterrey, Mexico             -
TPE1     Taipei, Taiwan                -
CGK1     Jakarta, Indonesia            -
LIS1     Lisboa, Portugal              -
PEK1     Beijing, China                -
Table 6.2: List of the 26 F-root sites. The last column shows the percentage of PlanetLab
nodes served by each F-root cluster. An example of an F-root server name is
SFO2a.f-rootservers.net.
[93]. This explains the high percentage (∼97%) of PlanetLab nodes served by LINX
(London) and AMS-IX (Amsterdam).
Cluster  Location                  %
ams-ix   Amsterdam, Netherlands    51.6
linx     London, UK                46.7
denic    Frankfurt, Germany        0.9
grnet    Athens, Greece            0.7
mix      Milan, Italy              -
qtel     Doha, Qatar               -
isnic    Reykjavik, Iceland        -
Table 6.3: List of the 7 K-root sites.
6.3.3 Multiple Instances, Multiple Homogeneous Sites:
UltraDNS
This is the case where an anycast server has multiple instances in diverse geographic
locations, with all of them being globally advertised. We used two of the UltraDNS anycast
servers as representative cases of this type of configuration. We should point out
that while each of the instances could in principle peer with different ISPs, UltraDNS
happens to use the same ISP for all the instances of a given anycast server.
Due to the unavailability of a complete listing of UltraDNS clusters, we only consider
clusters that are reachable from PlanetLab nodes. The names of the anycast instances,
returned in response to specially constructed DNS queries, provide a hint to the cluster
name. For example, the name udns1abld.ultradns.net suggests that the server
belongs to the abld (London) cluster. The location of these clusters can then be extracted
from the corresponding airport codes that show up in traceroute. Table 6.4 gives a list of
all the UltraDNS clusters reachable from PlanetLab.
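The naming convention can be exploited mechanically. The sketch below is our own illustration based solely on the single example name in the text; the exact pattern of UltraDNS instance names (prefix `udns`, a digit, then a four-letter cluster tag) is an assumption:

```python
import re

def cluster_of(server_name):
    """Extract the cluster tag from an UltraDNS instance name such as
    'udns1abld.ultradns.net'. The udns<digit><4-letter-tag> pattern is
    assumed from the one example given in the text."""
    m = re.match(r"udns\d+([a-z]{4})\.ultradns\.net$", server_name)
    return m.group(1) if m else None
```

For example, `cluster_of("udns1abld.ultradns.net")` returns `"abld"`; names that do not match the assumed pattern return `None`.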
While the F- and K-root servers use a hierarchical setup of global and local nodes,
UltraDNS uses a flat setup, where BGP advertisements from all instances are globally visible
throughout the Internet. Thus, DNS requests are more evenly distributed across UltraDNS
clusters than across the F- and K-root clusters. Instances in Europe (abld) and Asia (eqhk)
serve a smaller percentage of nodes since fewer PlanetLab nodes are located on these
continents (cf. Table 6.1). Even though UltraDNS nodes respond to both the TLD1 and TLD2
anycast addresses, the distribution of client requests across TLD1 and TLD2 is totally
different. For example, while pxpa receives 23% of the queries for TLD1, it receives only
7% of the queries for TLD2. To understand this behavior, we investigated whether DNS
queries directed from the same client to the TLD1 and TLD2 anycast addresses are indeed
resolved by nodes belonging to the same UltraDNS cluster.
Cluster  Location              TLD1 (%)  TLD2 (%)
pxpa     Palo Alto, CA, USA    23.1      7.5
eqab     Ashburn, VA, USA      20.4      10.4
abld     London, UK            15.6      -
eqch     Chicago, IL, USA      15.1      7.1
pxvn     Maclean, VA, USA      8.8       37.8
isi      Los Angeles, CA, USA  8.3       18.6
eqsj     San Jose, CA, USA     4.5       18.6
eqhk     Tokyo, Japan          4.2       -
Table 6.4: The list of the 8 UltraDNS clusters reachable from PlanetLab.
For a given PlanetLab node PLn, we denote the lists of TLD1 and TLD2 clusters that PLn
contacts by vectors l1 and l2 respectively. We define the correspondence (or similarity)
between these lists of clusters as the normalized inner product of the l1 and l2 vectors.
A correspondence of one implies that the lists are the same, while a correspondence of zero
implies that the two lists are completely different. Intermediate values imply a non-empty
intersection. For example, assume that a given PlanetLab node contacts clusters pxpa and abld
for TLD1 name resolution and clusters pxpa and eqab for TLD2 name resolution; then
l1 = [1, 0, 1, 0, 0, 0, 0, 0] and l2 = [1, 1, 0, 0, 0, 0, 0, 0] (following the order of clusters used
in Table 6.4). The correspondence between l1 and l2 is then equal to:
in Table 6.4). The correspondence between thel1 andl2 is then equal to:
1 · 1 + 1 · 0 + 0 · 1 + 6 · 0 · 0√2 ·√
2=
1
2
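The correspondence metric is simply the cosine similarity of the two indicator vectors. A small sketch (ours, not code from the study):

```python
import math

def correspondence(l1, l2):
    """Normalized inner product (cosine similarity) of two 0/1
    cluster-indicator vectors of equal length."""
    dot = sum(a * b for a, b in zip(l1, l2))
    n1 = math.sqrt(sum(a * a for a in l1))
    n2 = math.sqrt(sum(b * b for b in l2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Two cluster lists that share exactly one cluster out of two each
# give a correspondence of about 1/2, as in the worked example above.
```

Identical lists score 1, disjoint lists score 0, and partial overlap falls in between.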
[Figure: histogram; x-axis: Correspondence; y-axis: Frequency]
Figure 6.2: Histogram of correspondence between TLD1 vs TLD2 clusters contacted by
PlanetLab nodes.
As Figure 6.2 depicts, the majority of PlanetLab nodes (>71%) have a correspondence
of zero, indicating that queries directed at the TLD1 and TLD2 anycast addresses
and originating from the same PlanetLab node are answered by different clusters. The benefit
of this configuration is that in the event of a network outage affecting one of the anycast
addresses, the other address can be used, thus ensuring uninterrupted DNS service.
The reason why PlanetLab nodes mostly pick different clusters for TLD1 and TLD2
name resolution is that UltraDNS uses two different carriers for the TLD1 and TLD2 BGP
advertisements. Data from Routeviews [94] and traceroutes from PlanetLab nodes to
tld1/tld2.ultradns.net reveal that traffic to TLD1 is mostly routed via ASN 2914 (Verio)
while traffic to TLD2 is mostly routed via ASN 2828 (XO Communications). This means
that UltraDNS uses Verio for advertising TLD1 and XO for TLD2. The use of two different
providers for TLD1 and TLD2 also explains why in Table 6.4 the clusters abld and eqhk
receive zero queries for TLD2: XO Communications has no peering points outside North America,
and so queries from PlanetLab nodes in Europe and Asia are routed towards a US cluster.
6.4 Evaluation
This section examines: (1) the query latencies for the monitored servers; (2) the
availability of the monitored servers; (3) the affinity of clients to the server they are
directed to; and (4) the percentage of clients not reaching the replica server that is
closest to them, and the additional delay incurred.
6.4.1 Response times
Table 6.5 presents the mean, median, and standard deviation of query latencies for the
monitored servers over the whole measurement period. Figure 6.3 shows the response
time CDFs of the various anycast schemes. The median provides a better indication of the
expected behavior, since it is not skewed by individual clients with very high latencies. The
first observation we can make from this table is that anycast provides a sizable reduction
in query latency compared to the B-root server. The only exception to this trend is K-root.
This is due to the fact that even though K-root has multiple servers, they are located in
Europe and the Middle East, while most of the PlanetLab nodes are in North America.
Second, TLD1 has the lowest latency, even though F-root has more deployed servers. The
reason is that only two of the F-root servers have global scope and therefore client requests
may have to travel to a server that is further away. On the other hand, UltraDNS does not
use scoping, and client requests are distributed among a larger set of geographically diverse
servers, leading to shorter round trip times. Furthermore, the median latency for TLD1 is
lower than that of TLD2 since the clusters abld and eqhk are not reachable via the TLD2
anycast address, as shown in Table 6.4. Therefore queries to TLD2 from clients in Europe
and Asia have to travel to the US.
The last two rows of Table 6.5 represent synthetic results derived from actual measurements.
The min{TLD1,TLD2} row represents the average query latency for clients that
direct their queries to the closer of the TLD1 and TLD2 servers. Remember
that UltraDNS, which is authoritative for the .org and .info top level domains, uses two
Nameserver            Mean (ms)  Median (ms)  Std. Dev. (ms)
F-Root                75         70           85
B-Root                115        95           121
K-Root                140        121          104
TLD1                  96         54           207
TLD2                  104        85           237
min{TLD1,TLD2}        69         51           173
Hypothetical unicast  45         35           13
Table 6.5: Statistics of DNS response times
anycast addresses for these domains' nameservers. So this row represents the best-case
scenario, where a client can measure the latency to each of the nameservers and subsequently
direct its queries to the closest one. Indeed, clients based on BIND 9 exhibit this
behavior [95]. The last row of Table 6.5 shows the average latency for the hypothetical zone
where all the F-root servers are directly accessible via their unicast addresses and clients
forward their requests towards the closest DNS server. The latency of this zone is lower
than F-root due to scoping. As we already mentioned, scoping leads clients to pick a server
that is further away, since the announcements from servers with local scope that are closer
than the global server do not reach them.
TLD1 and TLD2 exhibit the highest variance in response times across all measured
servers. This is due to two reasons: variability in the delay of the network paths and
variability in the load on the anycast server. As we already explained, UltraDNS anycast
addresses are globally announced. In Section 6.4.3 we show that this results in clients
experiencing a higher number of "flips" (i.e., server changes), and consequently higher
[Figure: response time CDFs for the various name servers (TLD2, TLD1, F-Root, K-Root,
B-Root, min{TLD1,TLD2}, hypothetical unicast); x-axis: Time (ms)]
Figure 6.3: Response time CDF.
fluctuation in DNS response times. We also noticed periods of intermittently high query
times followed by outages specific to the eqab cluster between Sep 30 and Oct 2 that
contributed to the high variability of TLD1 and TLD2.
6.4.2 Availability
Considering the reliance of most Internet applications on DNS, ensuring continued
availability is a prime requirement for top-level name servers. Figure 6.4 is a histogram
of the percentage of queries which were unanswered by the monitored nameservers. As we
mentioned in Section 6.2, we retry individual unanswered queries twice; therefore, the
results presented here indicate queries lost due to network and server outages rather than
random packet loss.
For all the measured servers, the average percentage of unanswered queries is low (≤
0.9%). At the same time, the benefit of deploying servers in multiple locations is evident
from the fact that all anycast schemes perform better than B-Root. This is to be expected,
since robustness generally increases with geographic diversity. This is also the reason why
F-Root has a smaller percentage of unanswered queries compared to K-Root, even though both
of these servers use the same anycast scheme. There is, however, large variation between
the availability of the different anycast schemes, with F-root having overall half the losses
of TLD1.
[Figure: bar chart per zone (F-Root, K-Root, TLD1, TLD2, B-Root)]
Figure 6.4: Percentage of unanswered queries by various servers.
We use the term "outage" to indicate a window of time when a node is unsuccessful
in contacting its DNS server. Figure 6.5 plots the CDF of the duration of outages for the
different servers. The first observation from the graph is that all outages last at least 20
seconds, because of the time granularity with which we send DNS requests. Second, outages
for the hypothetical unicast server have the shortest duration. Indeed, some (20%-30%) of
the PlanetLab nodes experienced no outages. The maximum outage time is around 100 seconds,
indicating that in the worst case, a client will get a response after contacting at most three
servers. At the same time, the mean outage duration is approximately 40 sec, two to three
times shorter than for the other servers. The min{TLD1,TLD2} combined nameserver enjoys
the same benefit of shorter outage periods, since clients can switch from a failed server in
one of the anycast addresses to a server in the other address. All the real-world anycast
deployments exhibit a similar distribution of outage times, with F-root having the longest outage
periods. This reveals an interesting fact regarding anycast. Since anycast relies on Internet
routing, once an outage has occurred the recovery time is governed by the recovery time
of the network routing fabric. In fact, ≈30% of the outages last more than 100 seconds.
This is a direct consequence of the results presented by Labovitz et al. regarding delayed
network convergence [96]. The outage recovery time is largely independent of the anycast
scheme used.
[Figure: outage duration CDFs for the various name servers (TLD2, F-Root, TLD1, K-Root,
B-Root, hypothetical unicast, min{TLD1,TLD2}); x-axis: Seconds]
Figure 6.5: CDF of outage duration.
It appears counter-intuitive that F-root can have the smallest percentage of lost queries
and at the same time the longest-duration outages. However, outage duration is only
one part of the picture. It is also important to note the inter-outage interval and the number
of outages which occur per server. Figure 6.6 shows the inter-outage intervals, that is, the
amount of time between successive outages experienced by the same client. The findings
from this graph are encouraging, as they show average inter-outage periods on the order of
days. At the same time, inter-outage periods for TLD1 and TLD2 are shorter than those for
F-root. This finding is supported by Figure 6.7, depicting the average number of outages
per day aggregated over all the clients. One can see that TLD1 and TLD2 experience five
to eight times more outages than F-root. The reason why TLD1 and TLD2 have a higher
percentage of unanswered queries, even though the duration of their outages is shorter, is
that outages occur more frequently, giving a larger total number of unanswered queries.
[Figure: inter-outage time CDFs for the various name servers (F-Root, K-Root, TLD1, TLD2,
B-Root); x-axis: Time (min)]
Figure 6.6: CDF of inter-outage duration.
While at this point we don't fully understand why UltraDNS experiences more outages
than F-root, we conjecture that this is due to two reasons. First, all UltraDNS clusters
are global. As a result, clients follow a more diverse set of paths to reach their servers and are
therefore more exposed to BGP dynamics when links fail. Second, TLD1 and TLD2 are
single-homed while F-root is multi-homed. As a result, if the first-hop ISP of TLD1 fails,
all TLD1 clusters become unavailable. On the other hand, since F-root is multi-homed, the
impact of any single ISP failure on the overall availability is smaller.
[Figure: bar chart of the average number of outages per day, per zone (F-Root, K-Root,
TLD1, TLD2, B-Root)]
Figure 6.7: Number of outages observed by various servers.
6.4.3 Constancy
There is no guarantee that packets from a client will be consistently delivered to the
same anycast group member. As a matter of fact, given the implementation of anycast
outlined in Section 6.1, one expects that destinations will change over time as routing adapts
to network changes. In this section we present our findings on server switches (or flips) for
the monitored anycast servers. We classify flips into two categories: inter-cluster and
intra-cluster. An inter-cluster flip happens when consecutive client requests are directed to two
different geographic clusters and is due to BGP changes. Each of these clusters contains
multiple DNS servers, and an intra-cluster flip happens when the same client is directed to
different members located inside the same cluster. Intra-cluster flips are due to local load
balancing at the anycast cluster. As we saw in Sec. 6.4.1, the rate of flips affects the query
latency variance. Delay consistency is more sensitive to inter-cluster flips than intra-cluster
ones, because inter-cluster flips involve a change of transit route, and different routes may
have widely different delay characteristics.
[Figure: bar chart of flip percentage per nameserver (F-Root, K-Root, TLD1, TLD2)]
Figure 6.8: Number of flips observed as a percentage of the total number of queries sent to
each nameserver.
Figure 6.8 provides a histogram of the number of inter-cluster flips observed for the various
servers. Inter-cluster flips in anycast deployments using global and local servers mostly
occur between the global servers. The majority of the flips (>90%) for F-Root are between
the PAO and SFO global clusters, and for K-Root between AMS and LINX. Furthermore,
the total number of inter-cluster flips observed for the F-Root and K-Root nameservers is
20% lower than for TLD1 and TLD2. We believe the reason for this is that all UltraDNS
anycast clusters are globally visible, while the majority of the F-Root and K-Root ones are local
clusters. Therefore, at a client gateway, BGP paths to a greater number of UltraDNS clusters
are available compared to F-root clusters. Hence, UltraDNS server selection is more prone
to BGP changes (due to path failures).
Nameserver  Flips linked to an outage (%)
F-Root      65
K-Root      63
TLD1        52
TLD2        51
Table 6.6: Percentage of flips due to outages.
Flips and outages can often be correlated. For example, on Sep. 21st (cf. Fig. 6.11), a
considerable number of PlanetLab nodes faced outages in the service from the SFO cluster
of the F-Root server. After a brief outage of over a minute, service resumed with nodes
contacting the PAO cluster for F-root name resolution instead. Similarly, on Sep. 27th, for
the K-Root server, all PlanetLab nodes using the AMS cluster experienced an outage. After
a brief interval spanning over two minutes, all these nodes flipped to the LINX cluster.
However, flips need not necessarily occur immediately after outages. To investigate how
strongly flips are correlated with server outages, we counted the number of flips that are
linked to an outage. When a client flips to a different server after the server it was using
becomes unavailable, and later flips back to the original server, we say that these two flips
are related to the server outage. As Table 6.6 shows, in the case of the TLD1 and TLD2
UltraDNS servers, the occurrence of flips and outages are related to a lesser extent. Since
UltraDNS clusters are all global nodes, flips are more frequent and half of the time occur
independently of outages. We believe two causes are behind the remaining flips: path
changes in Internet routing and path failures recovered by the routing infrastructure within
the inter-query interval (25-35 seconds).
The percentage of flips across all the servers is very small, indicating that they offer
a stable service. We are also interested in the time that PlanetLab nodes remain with
the same server. We found that there is a range of 5 orders of magnitude in this metric!
As Figure 6.9 illustrates, while the mean time a node remains with the same server is
around 100 minutes, the lowest 10% of the nodes change servers every 1 minute, while the
most stable clients consistently choose the same server for days or weeks. This behavior is
evidence that a small number of network paths are very stable, while most other paths suffer
from outages, and a small percentage of paths have a pathological number of outages.
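The stability times above can be derived from a client's chronological server observations: each run from the first sighting of a server to the first flip away from it is one stability period. A minimal sketch (ours; the observation format is an assumption):

```python
def stability_periods(observations):
    """observations: chronological (timestamp_minutes, server) pairs from one
    client. Returns the lengths, in minutes, of maximal same-server runs
    (the final, still-open run is not counted)."""
    periods = []
    run_start, current = observations[0]
    for t, server in observations[1:]:
        if server != current:               # a flip ends the current run
            periods.append(t - run_start)
            run_start, current = t, server
    return periods
```

For example, a client seen on server A at minutes 0 and 30, on B at minute 60, and back on A at minute 180 yields stability periods of 60 and 120 minutes.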
[Figure: CDF of stability time (F-Root, K-Root, TLD1, TLD2); x-axis: Stability time (minutes)]
Figure 6.9: Period of time that PlanetLab nodes query the same server, for the monitored
servers.
Furthermore, for the servers that use global and local nodes (i.e., F- and K-root), we
investigated whether global or local nodes offer the more stable service. As Figure 6.10 indicates,
global nodes are more prone to switches, as we already mentioned. We believe the reason
for this behavior is that the network paths to global nodes are longer and therefore more
prone to BGP dynamics.
[Figure: cluster-wise CDF of stability time (F-Root local, F-Root global, K-Root local,
K-Root global); x-axis: Stability time (minutes)]
Figure 6.10: CDF of the cluster stability of F-Root and K-root.
Until now we have only discussed the wide-area load balancing aspect of anycast and
how it is affected by BGP route changes. Load balancing also occurs inside clusters, to
distribute queries among the individual servers that make up the cluster. F-root uses
IGP-based (OSPF) anycast for load balancing [89], but other configurations could use hardware
load balancers. Load balancers use either a per-packet or a per-flow mechanism. To
discover the load balancing scheme used by the nameservers, we use to our advantage the fact
that each PlanetLab site contains multiple nodes. These nodes can be expected to contact
the same anycast cluster. The similarity between the anycast servers contacted by nodes
of a single site provides a hint to the type of load balancer used within each cluster. A large
correlation between the servers contacted by the nodes of the same site indicates a per-packet load
[Figure: (a) timeline of total hourly F-root outages; (b) timeline of total hourly F-root
inter-cluster flips; x-axis: Date, Sep/20 to Oct/08]
Figure 6.11: Correlation of outages and flips for the F-root server. A similar correlation
was observed for the K-root server.
balancer (given a round-robin load-balancing scheme, we expect that packets from each
client will be sent to all the servers inside the DNS cluster). On the other hand, low
correlation indicates flow-based load distribution (a common technique in which the load
balancer hashes the clients' source addresses). Using this technique we discovered that all
the candidate nameservers used a flow-based technique, except for the B-Root server, which
used a per-packet load balancer. We observed that the B-root server faced a flip roughly
every half minute. This is typical of a per-packet load balancing technique, where successive
data packets are sent to different servers without regard for individual hosts or user
sessions. The other servers experience a negligible number of intra-cluster flips. Even in the
hash-based flow sharing case, intra-cluster flips may occur due to variations such as OSPF
weight changes or equipment failures. In general, flow-based hashing is preferred over
per-packet load balancing, as it consistently directs packets from a single client to the same
cluster member.
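A much-simplified variant of this inference (our own heuristic sketch, not the analysis actually used in the study; the input format and the threshold are assumptions) looks at how many distinct intra-cluster servers each co-located node sees: a per-packet round-robin balancer spreads one client's packets over many servers, while flow hashing pins each client to one:

```python
def infer_balancer(per_node_servers):
    """per_node_servers: {node_name: chronological list of intra-cluster
    server IDs seen in replies}. Returns a guess at the balancing scheme."""
    distinct = [len(set(servers)) for servers in per_node_servers.values()]
    avg = sum(distinct) / len(distinct)
    # A threshold of 1.5 distinct servers per node is an arbitrary illustration.
    return "per-packet" if avg > 1.5 else "flow-based"
```

Nodes that each cycle through several backend servers suggest per-packet balancing, as observed for B-Root; nodes each pinned to a single server suggest flow hashing.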
6.4.4 Effectiveness of Localization
As our earlier results indicate, anycast decreases query latencies by localizing client
requests amongst the various DNS server replicas. However, comparing the F-root query
latency to that of the hypothetical zone where all the servers are individually addressable
(Table 6.5) suggests that anycast does not always pick the closest server. This
raises an interesting question: Does anycast always lead clients to the closest instance
among all the servers in the anycast group? If not, how much farther away is the selected
server compared to the closest? Anycast server selection depends on the path selected
by BGP. These routing decisions are influenced by policies and sub-optimal heuristics, such
as using the path with the shortest AS hop count, and can therefore lead to suboptimal
choices. In fact, it is well known that in many cases the paths chosen by BGP are not the
shortest [97, 98].
An Optimistic Estimate: Directly comparing the query times of requests sent to the
unicast addresses of all the anycast group members to the query time of requests sent
to the server selected by anycast is potentially flawed, for a subtle reason. As we pointed
out in Section 6.1, the unicast addresses of the DNS servers are selected from address
ranges that are different from the one used for anycast. Therefore the path from a client to
the anycast address can be different from the path to the unicast address of the same server.
We use the following technique to get around this difficulty. Our technique is based
on the fact that if traceroutes from a client to the last-hop router and to the anycast address
follow the same path, we can obtain a good approximation of the round trip times incurred
by a client query to each of the different clusters by using the round trip time to the
last-hop router instead. Using traceroutes from the PlanetLab nodes, we found that this was
indeed the case for the F-Root and TLD2 servers, but not so for TLD1 and K-Root.
Figure 6.12 presents the additional network latency incurred by clients following the path
to the server selected by anycast over the path to the closest server. One can see that in both
cases the majority of the anycast queries contact their nearest cluster. About 60% of all the
F-Root requests are sent to the nearest F-root cluster and 80% of the TLD2 requests are sent to
the nearest TLD2 cluster. It must, however, be noted that this is an upper bound on the
optimality of the anycast path choice for F-root, as not all the anycast clusters are visible to
the PlanetLab nodes (cf. Table 6.2).
[Figure: CDF of additional round trip time (F-Root, TLD2); x-axis: RTT (ms)]
Figure 6.12: Additional round trip time for client queries to the anycast-selected F-root
and TLD2 servers over the closest servers.
A Pessimistic Estimate: We also measured the effectiveness of localization using
another approach, which yields a lower bound on the effectiveness of localization. First,
we calculate the geographic distance of each of the PlanetLab nodes to all the listed DNS
clusters in a zone. We do so by calculating the length of a hypothetical straight line over
the globe connecting the geographic locations of the PlanetLab node and the DNS server.
The locations of PlanetLab nodes are available through the PlanetLab website. Then, we
compare these geographic distances and determine whether the PlanetLab node contacts
the geographically closest server in that zone. While it is known that Internet paths are
longer than the direct geographic path connecting two end-points [98, 99], we assume that
all paths exhibit the same path inflation factor. Based on this assumption, we can directly
compare geographic distances to determine whether the best Internet path is selected for
each client.
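The straight-line-over-the-globe distance above is a great-circle distance; the haversine formula is one standard way to compute it. The sketch below is our own illustration (the cluster coordinates in the usage note are approximate), and it also picks the geographically closest cluster for a node:

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres between two
    (latitude, longitude) points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_cluster(node, clusters):
    """node: (lat, lon); clusters: {name: (lat, lon)}.
    Returns the name of the geographically closest cluster."""
    return min(clusters, key=lambda c: great_circle_km(*node, *clusters[c]))
```

For a node near Palo Alto and approximate coordinates for the PAO1 and LGA1 clusters, `nearest_cluster` picks PAO1, mirroring the nearest-server check described above.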
[Figure: CDF of additional distance traveled (F-root, TLD1, TLD2, K-root); x-axis:
Additional Distance (ms)]
Figure 6.13: Additional distance over the optimal traveled by anycast queries to contact
their F-root, K-root, TLD1 and TLD2 servers.
Figure 6.13 presents the cumulative distribution of the additional distance across all
PlanetLab nodes for each zone. We observe that about 37% of all the anycast requests are
sent to the nearest F-root server while 35% of the anycast requests are sent to the nearest
K-root server. Approximately 75% of requests are served by the nearest TLD1 and TLD2
servers. In fact, the CDF for TLD2 closely matches that in Figure 6.12. However, this
is not the case for F-Root, because not all the clusters are visible from PlanetLab and
consequently not accounted for in Figure 6.12.
Using these two estimates, we can conclude that the effectiveness of localization is
between 37% and 60% for F-Root, ≥35% for K-Root, ≥75% for TLD1, and between 75% and 80%
for TLD2. It is not surprising that the TLD1 and TLD2 zones perform
considerably better than the other deployments. Not only does a larger portion of nodes contact
the closest server, but the additional distances for those that do not are also shorter. The reason is
that UltraDNS clusters are not differentiated into global and local. Consequently, PlanetLab
nodes have visibility to a greater number of BGP routes to UltraDNS clusters. Therefore,
it is more likely that anycast chooses the nearest UltraDNS cluster. In a somewhat
counter-intuitive way, the slowest 10% of TLD1 clients follow worse paths compared to TLD2,
even though TLD1 is advertised from two additional locations (London, Tokyo). We explain
this behavior with an example. Consider a client in Asia. If it does not pick the HK site
for TLD1, its requests are directed to the US; hence the large additional distance. TLD2,
on the other hand, is not advertised locally from HK, and therefore clients correctly pick the
US sites. The inverse effect is visible for K-Root: clients do not traverse large additional
distances compared to the closest cluster, because all of its clusters are located
within a relatively small geographical area.
6.4.5 Comparison of Deployment Strategies
Our study shows that the existing anycast configurations can be categorized into two
schemes: hierarchical and flat. The hierarchical scheme distinguishes anycast nodes into
local and global, while in the flat scheme all the nodes are globally visible. Anycast servers
in the flat configuration tend to have a more uniform distribution of load. Also, since a
client has a greater diversity of available anycast servers to choose from, the distance between
clients and DNS servers is generally shorter, as seen in Section 6.4.4. Consequently, the
majority of clients also have low query latency, as reflected in the low median query times
of the TLD1 and TLD2 anycast servers in Section 6.4.1.
However, in Section 6.4.2 we show that the flat scheme is more prone to outages. Even
though the outage durations follow a similar distribution for both schemes, given that
outage duration is a function of the BGP convergence time, the frequency of outages is
lower for the hierarchical scheme. That is possibly because, in the case of the flat
scheme, more instances are globally visible in the routing tables, and thus they can
potentially undergo path changes triggered by other network events. Furthermore, in
Section 6.4.3 we show that a large advertisement radius has an adverse effect on the
stability of response times and increases the frequency of server changes (flips) of the
anycast service. This is because the larger the advertisement radius, the greater a
server's sphere of influence, which in turn increases the number of server choices
available at a client.
We believe that an ideal anycast scheme would involve deploying a small number of
global nodes accompanied by a larger group of local nodes. The radius of advertisement
of the local nodes can be dynamically varied in order to maintain a minimum degree of
redundancy and fast failover. In the section that follows, we sketch how such a dynamic
scheme could be implemented and evaluate its performance via simulations.
6.5 Effect of Advertisement Radius
Here, we investigate the effect of varying a server's advertisement radius on the load
it receives and on anycast query latency. We simulate the AS-level topology of the
Internet using the connectivity data available from Route Views [94]. Server placement is
based on the actual placement of F-root servers available from [91]. An initial
advertisement radius is assigned to each of the servers. If a server has a radius of r,
then its prefix advertisement is visible r AS hops from the origin AS of this server.
Finally, we position 200 clients randomly across the set of all autonomous systems. While
we understand that this setup is not a true representation of the distribution of DNS
clients over the Internet, it nevertheless serves our goal of studying the effect of
advertisement radius on the load experienced by the servers. Each client selects the
server with the shortest AS path among all the visible paths.
Algorithm 1: Radius adjustment algorithm
(1) Radius[1 … Num_of_servers] ← 5
(2) Calculate Redundancy
(3) S ← server with maximum load
(4) while Redundancy ≥ 1 do
(5)     Radius[S] ← Radius[S] − 1
(6)     Calculate load on each server
(7)     Calculate Redundancy
(8)     S ← server with maximum load
We use the term “redundancy” to denote the minimum number of servers reachable by any
client. At the beginning of the simulation, we fix the radius of all servers to be equal
to a sufficiently large value (we used an initial radius of five). We then gradually
reduce the radius of the server with the maximum load, thus confining the server to serve
smaller communities, using Algorithm 1. We iterate this process until there exists some
client which is outside the sphere of influence of all the servers, i.e., it has a
redundancy of zero.
Figure 6.14 plots the load on the maximally loaded server as a function of the average
radius. Initially, when each server has a radius equal to 5, every client can reach at
least ten servers, while the busiest server serves 80% of the traffic. As we decrease the
radius of this server, its load decreases until another server becomes the maximally
loaded server.
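The greedy radius-reduction loop can be sketched in a few lines of Python. This is an
illustrative toy, not the simulation code used in this chapter: the distance matrix is a
hypothetical stand-in for the Route Views AS-hop data, a server is taken to be visible
when a client's AS-hop distance is at most the server's radius, and this variant stops one
step early so that it returns the last configuration in which every client still reaches
at least one server.

```python
def redundancy(dist, radius):
    """Minimum number of servers visible to any client. dist[c][s] is the
    AS-hop distance from client c to server s; server s is visible to c
    when dist[c][s] <= radius[s]."""
    return min(sum(1 for s, r in enumerate(radius) if row[s] <= r)
               for row in dist)

def loads(dist, radius):
    """Each client picks its nearest visible server; return request counts."""
    counts = [0] * len(radius)
    for row in dist:
        visible = [s for s, r in enumerate(radius) if row[s] <= r]
        if visible:
            counts[min(visible, key=lambda s: row[s])] += 1
    return counts

def adjust_radii(dist, init_radius=5):
    """Shrink the busiest server's radius while every client can still
    reach at least one server (redundancy >= 1)."""
    radius = [init_radius] * len(dist[0])
    while True:
        busiest = max(range(len(radius)), key=lambda s: loads(dist, radius)[s])
        trial = radius[:]
        trial[busiest] -= 1
        if redundancy(dist, trial) < 1:
            return radius  # stop just before some client loses all coverage
        radius = trial

# Toy instance: 3 clients, 2 servers, dist[c][s] in AS hops.
print(adjust_radii([[1, 3], [3, 1], [2, 2]]))
```

Each accepted iteration shaves one hop off some server's radius, so the loop terminates
once any further reduction would leave a client with no visible server.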
Figure 6.14: Variation of server load with varying server advertisement radius for a
random distribution of 200 clients. Redundancy is denoted by R.
Intuitively, as the average advertisement radius decreases, the maximum number of clients
served by a single host also decreases, thereby distributing excess load to other servers.
In the limit, the most heavily loaded server receives about three times the optimal load
(i.e., the load if clients were evenly distributed across servers). Based on this graph,
we can see that an ideal operational region exists where the maximum server load is low
while redundancy is greater than one. While this result is encouraging, it indicates that
an adaptive mechanism is needed to minimize server load while keeping adequate redundancy
levels. As far as we know, zones employing the global/local hierarchy do not use such a
mechanism today.
We also calculated the average path length as a function of the radius, presented in
Figure 6.15. Initially, clients have to travel a distance of approximately two ASes to
reach their closest server. However, as the average radius decreases, the path length
increases and consequently query latency also increases. The step-wise increase in path
length shown in Figure 6.15 is due to the nature of the Internet graph. A small number of
ASes have extremely high degree and very short distances to the majority of the other
ASes [100]. As long as the radius r of a DNS server located in one of these “hub”
autonomous systems is higher than its distance d from a client, that client is directed
to this server. When r < d, the client is directed to a more distant server and thus the
average path length increases.

Figure 6.15: Variation of average AS path length with change in the radii of the servers
for a random distribution of 200 clients.
6.6 Related Work
A number of existing studies have looked at the performance of the DNS infrastructure.
Danzig et al. presented measurements related to DNS traffic at a root name server [101].
Their main result was that the majority of DNS traffic was caused by bugs and
misconfigurations. Most of these problems have been fixed in recent DNS servers. Anycast
was not yet used for DNS name resolution back then. More recently, Brownlee et al.
monitored the DNS traffic from a large campus network and measured the latency and loss
rate of queries sent to the root nameservers [102]. Their main goal was to create a model
of DNS
request/response distribution. Our results on average latencies and loss rates match those
presented in that study. Interestingly, the authors of [102] observed that query times
show clear evidence of multipathing behavior and conjectured that this is due to load
balancing or changes in server load. Anycast, at the BGP level and within a cluster, is a
key cause of this observed multipathing. Pang et al. [103] measured the availability of
individual DNS authoritative and caching servers, and studied the different server
deployment strategies. The authors of [104] present some early results on their DNS
anycast stability experiment using a large number of vantage points on the Internet. While
this is probably the closest peer related work, and our results generally agree, we focus
on the different anycast deployment strategies and how they affect the performance of
anycast as observed from points spread around the Internet.
Jung et al. measured the performance of all DNS requests sent from the MIT campus and
investigated the effect of caching on DNS performance [105]. Wessels et al. compared the
effect of different caching techniques on the root nameservers [95]. Their results show
that some caching servers favor nameservers with lower round-trip times while others do
not. This indicates that the use of anycast benefits at least some resolvers, since it
transparently leads them to (approximately) the closest instance. On the other hand,
resolvers that actively select the closest DNS server would see a performance benefit if
the unicast addresses of the servers were exposed, as we showed in Section 6.4.1.
The effectiveness of anycast in providing redundancy and load sharing has been exploited
in a number of proposals. The AS112 project reduces unnecessary load on root
nameservers by directing queries for local zones to a distributed black hole implemented
via anycast [106]. The use of anycast has also been proposed for finding IPv6-to-IPv4
gateways [107] and to implement sink holes for the detection and containment of worm
activity [108]. Engel et al. provide results from their measurement of load
characteristics on a set of mirror web sites using anycast [109]. Hitesh et al. present a
scalable design for anycast and use a small subset of the PlanetLab nodes to measure the
affinity of existing anycast deployments [110]. While this work has some similarity to
ours, their focus is on the design of an anycast scheme. Finally, a number of proposals
have looked at alternatives to the existing DNS architecture with the goal of improving
query performance [111, 112].
6.7 Summary
In this chapter, we presented an analysis of the impact of anycast on DNS based on the
measurement of five top-level servers. We found that, overall, the deployment of anycast
is beneficial for the DNS infrastructure, since it decreases the average query latency and
increases the availability of the DNS servers. However, our study shows that while the
number of outages is relatively small, some of them are long in duration (≈ 30% last more
than 100 seconds), being affected by BGP routing convergence times. Moreover, we
identified two different anycast schemes currently deployed in DNS, and we show that these
different deployment strategies play a key role in determining the optimality and
robustness of anycast. Finally, we uncovered a trade-off, in which increasing the number
of globally
visible nodes increases the percentage of queries being directed to the closest cluster,
but at the same time destabilizes the service offered, in terms of increased server
switches and unanswered queries.
While this trade-off is clear from the results presented here, we do not fully understand
the underlying mechanisms that connect the scope of BGP advertisements, the rate of flips,
and the duration of outages. Doing so would require access to the BGP advertisements at
each monitoring point, which were unfortunately unavailable. We are currently developing a
theoretical model for the effect of link failures on service outages that we plan to
validate via simulations. We believe that this model, coupled with access to the actual
BGP advertisements, will provide deeper insight into the operation of anycast and the
trade-offs involved.
Acknowledgements
Joe Abley graciously responded to our queries regarding the implementation of anycast in
the F-root servers. We would also like to thank Lixia Zhang, Claudiu Danilov and
Alexandros Batsakis for their valuable comments.
Chapter 7
On the Effect of Router Buffer Sizes on
Low-Rate Denial of Service Attacks
Internet routers employ queues to buffer packets during periods of congestion. Until
recently, the size of buffers for TCP-dominated links was determined using the rule of
thumb proposed by Villamizar et al. in [113]. According to this rule, the size B of a
buffer is given by B = RTT × C, where RTT is the average round-trip time of the flows
traversing the link and C is the link capacity. While this rule of thumb was widely
accepted, Appenzeller et al. recently showed, based on TCP flow de-synchronization
dynamics, that queue size can actually be reduced without sacrificing utilization [19].
Given N flows, they show that a buffer of size B′ = (RTT × C)/√N suffices to maintain
utilization close to 100% for drop-tail queues. Since this result depends primarily on the
de-synchronization of TCP flows sharing the same queue, it is believed to extend to other
queuing schemes such as
RED [114].
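As a quick numeric illustration of the two sizing rules (a sketch, not from the
dissertation's simulations; the example parameters are chosen to match the OC-3 scenario
used later in Section 7.3.2):

```python
import math

def buffer_classic(rtt_s, capacity_bps):
    """Rule-of-thumb buffer size B = RTT * C, in bits."""
    return rtt_s * capacity_bps

def buffer_small(rtt_s, capacity_bps, n_flows):
    """Small-buffer rule of [19]: B' = RTT * C / sqrt(N), in bits."""
    return rtt_s * capacity_bps / math.sqrt(n_flows)

# Example: a 155 Mbps (OC-3) link, 250 ms average RTT, 250 long-lived flows.
rtt, cap, n = 0.250, 155e6, 250
print(buffer_classic(rtt, cap) / 1e6)   # ~38.75 Mb under the classic rule
print(buffer_small(rtt, cap, n) / 1e6)  # ~2.45 Mb under the sqrt(N) rule
```

For this link the √N rule shrinks the buffer by a factor of √250 ≈ 15.8, which is exactly
the reduction whose security implications the rest of the chapter examines.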
RED was the first in a series of Active Queue Management (AQM) schemes that use increases
in queue size to detect incipient congestion before the queue becomes full. Subsequent
extensions to RED, e.g., RED-PD [21], attempt to achieve a fair allocation of resources
among potentially selfish or malicious flows sharing the same link. Malicious flows may
violate the TCP congestion control algorithm in order to selfishly maximize their
throughput or cause denial of service (DoS) attacks, thereby minimizing the throughput
received by TCP flows sharing the same link. Since the majority of AQM schemes maintain
partial flow state for reasons of scalability, larger buffer sizes translate to more
accurate per-flow statistics and therefore a higher probability of detecting misbehavers.
This brings us to the main question we address in this chapter: while buffer size can be
reduced without affecting link utilization, does this reduction make the detection of
misbehavers harder? To test the vulnerability of smaller buffer queues to misbehaving
sources, we use a recently proposed class of DoS attacks called shrews [20]. These
malicious flows send short periodic bursts of traffic trying to fill up the buffer and
force TCP timeouts, thus throttling the throughput of TCP flows. We chose this type of
attack because shrews are difficult to detect due to their low average sending rate.
We use a mathematical model to show that smaller queues are indeed vulnerable to shrew
attacks. However, increasing the buffer to B′′ = mB′, with m ≪ √N, is sufficient to drive
the shrews' average transmitting rate required to cause a DoS attack considerably higher
than the min-max fair rate. When this happens, shrews can be detected by an AQM
scheme such as RED-PD and consequently penalized, without affecting compliant flows. We
validate our analysis using simulations in two different scenarios: (a) a 10 Mbps link
shared by 20 flows, and (b) a 155 Mbps link shared by 250 flows.
The rest of this chapter is structured as follows: We briefly introduce shrew attacks in
Section 7.1. Section 7.2 provides a mathematical analysis of the effect of increasing
buffer size on the sending rate of the shrews. Validation of the analysis through
simulations is shown in Section 7.3. Related work is presented in Section 7.4 and we
conclude in Section 7.5.
7.1 The Shrew Attack
We begin with a brief description of the shrew attack; a detailed discussion on shrews can
be found in [20]. Consider a bottleneck link shared by a large number of TCP flows. A
low-rate shrew DoS attack is a periodic burst of traffic (e.g., a square-wave pattern)
such as the one shown in Fig. 7.1. The shrew transmits at a high rate of P bps for a short
period of time l sec. For the rest of the time, it transmits at a much lower rate (almost
zero). This behavior repeats with a period of T sec. The average rate of a typical shrew
is given by P · l/T. Since the ratio l/T is small, the shrew appears to be a well-behaved
flow over larger timescales, thus evading detection.
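The square-wave pattern and its deceptively low average rate can be sketched as follows
(an illustrative sketch; the parameter values are the ones used later in Section 7.3.1,
not part of this description):

```python
def shrew_rate(t, peak_bps, burst_s, period_s):
    """Instantaneous sending rate of a square-wave shrew at time t (seconds):
    peak rate P during the first l seconds of each period T, near zero after."""
    return peak_bps if (t % period_s) < burst_s else 0.0

def average_rate(peak_bps, burst_s, period_s):
    """Average rate P * l / T over one period."""
    return peak_bps * burst_s / period_s

P, l, T = 10e6, 0.2, 1.2  # 10 Mbps peak, 200 ms burst, 1.2 s period
print(average_rate(P, l, T) / 1e6)     # ~1.67 Mbps: looks modest on average
print(shrew_rate(0.1, P, l, T) / 1e6)  # 10.0 Mbps: but bursts at full link speed
```

The gap between the average and the peak is exactly what lets the shrew fill the buffer
during a burst while staying under rate-based detection thresholds.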
The shrew works by keeping the buffer full for a sufficiently long time (typically on the
time scale of the flows' RTT), causing the router to forcefully drop multiple packets from
the same TCP flow. At this point, TCP flows will try to retransmit the packets after a
retransmission timeout (RTO). By setting its period T equal to the TCP flows' minRTO (the
authors of [115] suggest that all TCP flows should set their minRTO to 1 second), the
shrew causes the retransmitted packets to also be dropped. Subsequently, TCP performs an
exponential back-off, dropping its congestion window to one and doubling its RTO. Since
the new RTO is also a multiple of T, the flow experiences repeated packet losses.
Typically, the lower RTT flows are penalized more heavily than the higher RTT ones. As
Figure 7.2 shows (recreated from Figure 7 of [20]), shrews can considerably decrease the
throughput of competing TCP flows.

Figure 7.1: Square-wave shrew.

Figure 7.2: Effect of a single shrew on TCP throughput as a function of the RTT of flows
sharing a DropTail queue.

N : the number of flows sharing the link.
RTT : the average RTT of the flows.
C : the link capacity.
B : the buffer size, given by B = m · (RTT · C)/√N, m ≥ 1.
B0 = γ · B : the instantaneous queue size when the shrew attack is launched.
s : the number of shrews.
P : the burst rate of a single shrew.
l : the burst time.
T : the period of the shrew.

Table 7.1: Notation used in the mathematical analysis of the shrew attack.
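The timing interaction described above can be checked with a few lines. This is an
illustrative simplification (it ignores RTT variance and RTO estimation): with
minRTO = 1 s and shrew period T = 1 s, every exponentially backed-off timeout
(1, 2, 4, ... seconds) is a multiple of T, so each retransmission attempt again meets a
full buffer.

```python
def rto_sequence(min_rto_s, retries):
    """Exponentially backed-off retransmission timeouts: minRTO * 2^k."""
    return [min_rto_s * 2 ** k for k in range(retries)]

def losses_align(period_s, min_rto_s, retries=5):
    """True when every backed-off RTO is a multiple of the shrew period,
    i.e. each retransmission attempt lands inside a shrew burst."""
    return all(rto % period_s == 0 for rto in rto_sequence(min_rto_s, retries))

print(losses_align(1.0, 1.0))  # T = minRTO = 1 s: every retry is trapped
```

This is why [20] proposes RTO randomization as a countermeasure: breaking the alignment
between the backed-off RTOs and T lets some retransmissions slip through between bursts.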
7.2 Mathematical Analysis
We present here a simple fluid model used to analyze the effect of increasing buffer size
on shrew attacks. In our analysis, we assume an idealized AQM scheme that is able to
detect and penalize flows sending traffic at a rate higher than their fair rate. The
notation we use is listed in Table 7.1.
Consider an attack with s shrews launched on a link used by N TCP flows. We further assume
that the goal of the shrew attack is to throttle all flows with RTT < ρ sec. Then, the
minimum amount of total incoming traffic required to keep the buffer full for ρ sec is
given by:
Input traffic = B − (B0 − C · ρ) = C · ρ + (B − B0)    (7.1)
As shown in [20], if a shrew attack is launched on a link shared by a large number of TCP
flows, and the shrew attack throttles all flows with RTT < ρ, TCP flows with larger RTT
may consume the additional capacity. Furthermore, other background traffic sharing the
link, such as short TCP and UDP flows, which is unaffected by the shrew, also aids the
shrew in filling up the link. Consequently, the shrews need to send less traffic than
shown in Eq. (7.1). In the worst case, when the link is completely utilized by background
traffic, the shrews must at least account for (B − B0). Therefore, for a time period ρ:
Shrew traffic ≥ B − B0 = m · (1 − γ) · (RTT · C)/√N    (7.2)
where γ is the fraction of the buffer that was full at the beginning of the shrew attack.
The fraction γ depends on factors such as the queue type and the traffic mix traversing
the link. Note that the total traffic sent by the shrews during this time is equal to
(P · l) · s. Thus, we can rewrite Eq. (7.2) as:
(P · l) · s ≥ m · (1 − γ) · (RTT · C)/√N    (7.3)
One can see from Eq. (7.3) that if we increase the size of the buffer by using a larger
constant m′ = m + ∆m, the peak rate of each shrew must increase by:
∆P ≥ ∆m · (1 − γ) · (RTT · C)/(√N · l · s)    (7.4)
Eq. (7.4) reveals that with a unit increase in the multiplicative factor m, each
individual shrew needs to increase its sending rate by an order of O(1/√N). Given that the
fair bandwidth of a flow is f_bw = O(C/N), a small increase in m causes the sending rate
of each shrew to exceed f_bw, whereby the shrew is no longer a low-rate attack and will
therefore be detected by the AQM mechanism. Note that for high-speed links, ∆m ≪ √N, and
so the buffer size still remains ≪ RTT · C. Furthermore, as N increases, the fair
bandwidth allotted to each flow decreases. Consequently, the average sending rate of the
shrew is much higher than the fair bandwidth and the shrew is easier to detect.
We use a typical scenario as an illustration. Consider an OC-3 link (155 Mbps) carrying
150 TCP flows, with γ = 0.7, RTT = 250 ms, and l = 100 ms. In this case, the additional
buffer space that needs to be filled by the shrews for a unit increase of m is
∆ = (1 − γ) · (RTT · C)/√N ≈ 1 Mb. Therefore, if s = 5, each individual shrew needs to
increase its peak sending rate by 2 Mbps for a unit increase in m. Consequently, the
average sending rate of a shrew increases by 2 Mbps · l/T = 2 · 0.1/1 = 0.2 Mbps. The
min-max fair bandwidth for the link is 155/150 ≈ 1 Mbps. Thus, choosing m = 5
(< √150 ≈ 12.24) is sufficient to drive the sending rate of a single shrew sufficiently
high that it will be detected by an AQM scheme (e.g., RED-PD) which provides approximate
fairness.
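The arithmetic of this example can be reproduced directly from Eqs. (7.2)-(7.4). This is a
sketch for checking the numbers, with T = 1 s assumed for the period; the small
discrepancies against the rounded figures in the text come from √150 ≈ 12.25.

```python
import math

def extra_buffer_per_unit_m(gamma, rtt_s, cap_bps, n_flows):
    """Extra buffer (bits) the shrews must fill per unit increase of m,
    i.e. the Delta = (1 - gamma) * RTT * C / sqrt(N) term from Eq. (7.2)."""
    return (1 - gamma) * rtt_s * cap_bps / math.sqrt(n_flows)

# OC-3 example: gamma = 0.7, RTT = 250 ms, C = 155 Mbps, N = 150, s = 5 shrews.
gamma, rtt, cap, n, s, l, T = 0.7, 0.250, 155e6, 150, 5, 0.100, 1.0
delta = extra_buffer_per_unit_m(gamma, rtt, cap, n)  # ~0.95 Mb (text: ~1 Mb)
peak_increase = delta / (l * s)      # ~1.9 Mbps per shrew (text: ~2 Mbps)
avg_increase = peak_increase * l / T  # ~0.19 Mbps per shrew (text: ~0.2 Mbps)
fair_bw = cap / n                     # ~1.03 Mbps min-max fair rate
```

Five unit increases of m (m = 5) therefore push each shrew's average rate close to the
min-max fair rate, which is the detection threshold the idealized AQM enforces.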
7.3 Evaluation
We use ns-2 simulations to verify our mathematical analysis. Figure 7.3 shows the classic
dumb-bell topology we used in our simulations, with two sets of sources and sinks: the
first set consists of TCP source/sink pairs while the second set consists of shrews. All
TCP flows are long-duration SACK flows. We used SACK because it was found to be the
version of TCP most resistant to the shrew attack [20]. TCP sources start at a random time
in [0, 10] sec while the shrew attack starts at 100 sec, to allow the TCP flows to reach
steady state.
All the source-sink pairs are interconnected by the bottleneck link r0 → r1. The link
delays and speeds are shown in Figure 7.3. All the sinks have a one-way delay of 1 msec to
router r1. The one-way propagation delay of the TCP sources to r0 increases uniformly from
0 to 220 msec. Therefore, the round-trip time ranges uniformly from 20 msec to 460 msec,
as suggested in [116].
Figure 7.3: Dumb-bell configuration.
We set the buffer size of the r0 → r1 link to B = m · (RTT · C)/√N and we vary
m from 1 to √N to measure the effect of increasing buffer size on the throughput of the
TCP flows and the sending rate of the shrew. All the links have drop-tail queues except
the r0 → r1 link, which uses RED-PD [21]. RED-PD uses a configurable target round-trip
time R to derive the average sending rate of compliant TCP flows using the deterministic
model of TCP from [117]. According to this model, the sending rate of a compliant TCP flow
is B_R = √1.5/(R · √p), where p is the ambient loss rate computed over the recent history
of packet losses. Flows whose sending rate is higher than B_R are identified as
misbehaving and are monitored. The advantage of increasing R is that more misbehaving
flows can be identified. On the other hand, doing so increases the required amount of
per-flow state, which is proportional to the increase in R and the number of flows
traversing the link. In our simulations we use multiple values of R to evaluate the
sensitivity of RED-PD in detecting the shrews.
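The RED-PD threshold can be sketched numerically (an illustration, assuming B_R is
expressed in packets per second and p is the measured ambient loss rate; the 1% loss
figure is a hypothetical value, not one reported in the chapter):

```python
import math

def redpd_target_rate(target_rtt_s, loss_rate):
    """Compliant-TCP rate bound B_R = sqrt(1.5) / (R * sqrt(p)), in packets
    per second, from the deterministic TCP model of [117]. Flows sending
    faster than this are flagged for monitoring."""
    return math.sqrt(1.5) / (target_rtt_s * math.sqrt(loss_rate))

# At 1% ambient loss, raising R from 40 ms to 120 ms lowers the threshold,
# so more aggressive flows fall above it and get monitored.
print(redpd_target_rate(0.040, 0.01))  # ~306 pkts/s
print(redpd_target_rate(0.120, 0.01))  # ~102 pkts/s
```

This makes the trade-off in the text concrete: a larger R tightens the compliant-rate
bound (catching more misbehavers) at the cost of tracking loss history for more flows.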
In the following paragraphs, we present the results from two different scenarios, with
different link speeds and numbers of flows, to investigate the effect of increasing buffer
size on the throughput of TCP and the transmitting rate of the shrews. All the results are
based on at least 400 sec of simulation.
7.3.1 Low Speed Link
This scenario is similar to the one used in the original shrew attack paper [20]. The
capacity of the bottleneck link is set to 10 Mbps and the link is shared by 20 TCP flows
and a single shrew. The RED-PD threshold R is set to 40 msec. We use the shrew parameters
from [20], where P = 10 Mbps, l = 200 msec, and T = 1.2 sec, for an average sending rate
of 1.67 Mbps. Given a shrew with parameters P, l, T, we define an equivalent CBR to be a
CBR flow transmitting at a constant rate of P · l/T, equal to the average sending rate of
the shrew. We then compare the normalized throughput (as a percentage of the link
capacity) that each TCP flow achieves when it competes with a shrew to the throughput
achieved when the shrew is replaced by the equivalent CBR flow.
Figure 7.4: TCP throughput as a function of the RTT under increasing buffer sizes. Unless
otherwise specified, R = 40 msec.
Figure 7.4 plots the throughput obtained by the different TCP flows as a function of their
RTT. The first point to be noted in the graph is the well-known negative bias of TCP
against flows with high RTT. The more interesting point, however, is the reduced sending
rate of TCP flows across the whole RTT range when the shrew is active and the buffer size
is small (m = 1). The low RTT flows are more adversely affected. However, the TCP sending
rate increases with m. When m = 4, the throughput of the TCP flows is approximately equal
to that achieved when the shrew is replaced by the equivalent CBR source. This result
indicates that
the higher buffer size is indeed effective in minimizing the effect of the shrew.
             R = 40 msec        R = 120 msec
  m          CBR     Shrew      CBR     Shrew
  1          83%     43%        86%     79%
  2          86%     65%        87%     80%
  3          86%     78%        88%     81%
  4          86%     82%        88%     81%
  √20 ≈ 4.5  86%     83%        88%     81%

Table 7.2: Aggregate link utilization from 20 TCP flows.
Figure 7.5: When R increases to 120 msec, it is possible to have a small buffer size
(m = 2) without penalizing the TCP flows sharing the link with the shrew.
Figure 7.5 shows the effect of a unit increase in m on the throughput of the TCP flows.
When m = 2, the throughput of TCP flows with low RTT increases. However, there is still
some negative effect of the shrews on the higher RTT flows. Table 7.2 shows the percentage
of link capacity utilized by the 20 TCP flows for different values of R. The Shrew column
corresponds to the throughput obtained by the TCP flows under a shrew attack, while the
CBR column shows the throughput for the equivalent CBR flow. As seen from the table, the
TCP throughput is gradually restored with increasing m. When m = 4,
the negative effect of the shrew on the TCP flows is minimal.[1] The table also shows that
using a larger R value (120 msec) requires a smaller increase in the buffer size to
mitigate the shrew attack, because RED-PD with R = 120 msec is a better fairness
approximator than RED-PD with R = 40 msec. The reason is that when R = 40 msec, RED-PD
only detects flows whose RTT ≤ 40 msec. For the same reason, the results (not shown here)
for a simple RED queue are similar to those of RED-PD with R = 40 msec. Since most of the
flows in this experiment have higher RTT, RED-PD emulates RED for the majority of the
flows. Consequently, when R = 120 msec, the throughput attained by the TCP flows is closer
to the fair throughput allocation (C/N = 10/21 ≈ 470 Kbps). Of course, setting R equal to
the maximum RTT among the TCP flows would achieve perfect fairness and completely mitigate
the shrews, with the least increase in the buffer size. The downside is that, since the
amount of packet drop information stored by RED-PD is proportional to the number of flows
and R, larger R values result in higher state overhead.
From this first experiment, one may incorrectly suspect that in order for the shrew to be
neutralized, m ≈ √N is required. This is, however, an artifact of the small number of TCP
flows (20) sharing the link in this experiment. To show that small values of m are indeed
adequate, we repeated the experiment using a higher number of flows on a faster link.
[1] Utilization of 86% and 88% for TCP when m = 4.5 indicates that the competing flow is
suffering a high number of losses and the TCP flows are able to utilize the additional
capacity.
7.3.2 High Speed Link
Next, we consider a more realistic scenario, where the bottleneck link is an OC-3
(155 Mbps) link shared by 250 TCP flows. We use ten synchronized shrews (≈ 4% of the total
number of flows). This way, any single shrew has a lower average sending rate and is more
difficult to detect. For each shrew, P = 20 Mbps, l = 200 msec, and T = 1.2 sec, implying
an average sending rate of 3.33 Mbps. Therefore, all the synchronized shrews have an
aggregate peak rate of 200 Mbps for a burst time of 200 msec. Table 7.3 shows the link
utilization due to TCP flows obtained with shrews and with shrews replaced by equivalent
CBR flows. As seen in the previous simulation, the utilization steadily increases with m.
The corresponding bandwidth plot in Figure 7.6 illustrates that as m increases to ≈ 3, the
sending rate of the low RTT flows is restored to the no-shrews scenario. As in the
previous subsection, we repeat the experiment with R = 120 msec. As expected, increasing R
increases the throughput obtained by the high RTT flows, thereby improving the overall
link utilization considerably. In this case, we see that with R = 120 msec and m = 5, we
achieve TCP utilization as good as when m = 16. The same effect is evident in Figure 7.6.
             R = 40 msec        R = 120 msec
  m          CBR     Shrew      CBR     Shrew
  1          82%     52%        81%     53%
  3          82%     62%        82%     58%
  5          83%     68%        83%     82%
  8          82%     70%        82%     81%
  12         81%     76%        81%     81%
  √250 ≈ 16  80%     77%        80%     80%

Table 7.3: Aggregate TCP link utilization for 250 flows.
Figure 7.6: TCP throughput under increasing buffer sizes.
Figure 7.7: Peak and average shrew sending rate needed to maintain reduced link
utilization (two panels: m for 10 Mbps, 20 flows; m for 155 Mbps, 250 flows).
Eq. (7.4) in Section 7.2 shows that for the shrew attack to be effective, the peak rate P
as well as the shrew's average rate has to increase linearly with m. We verified this
analysis via simulation. The graphs in Figure 7.7 plot the shrews' normalized sending rate
required to maintain the same low link utilization as when m = 1. We see that in the case
of 20 flows, when m = 3 the average sending rate of the shrew is approximately 50% of the
link capacity. For 250 flows, with m = 5, each shrew needs to send at about 5% of the link
capacity, i.e., 50% (since s = 10) of the link traffic must be shrew traffic. This implies
that the shrew traffic is no longer low-rate traffic but actually a high-rate DDoS attack,
even with a relatively poor choice of R = 40 msec. Unlike shrews, high-rate DDoS attacks
are easy to detect and several schemes exist to contain them, e.g., [118]. The required
shrew sending rate increases more slowly when m > 5. The reason is as follows: to maintain
TCP utilization equal to that when m = 1 (≈ 50%), it is enough for the competing shrews to
fill up 50% of the link, irrespective of the buffer size. This can be achieved by sending
at a rate slightly greater than 50% of the link speed.
As seen from the last two subsections, a small increase in m is sufficient to improve the
throughput of the high RTT TCP flows almost to the levels seen without a DoS attack.
However, some effect of the shrew attack is still visible in the reduced utilization of
the link capacity. As described in [21], setting the target round-trip time R of the
RED-PD router to be very large (≥ maxRTT) guarantees perfect fairness, but the state to be
maintained is O(N · R), where N is the number of flows, and hence scalability is an issue.
Based on the mathematical analysis and simulations, setting a moderate target round-trip
time R (≈ 120 msec) and increasing m by a small value (≈ 5) provides performance
comparable to the no-shrew scenario, both in terms of utilization and fairness. We believe
that the effect of shrew attacks can be mitigated by using this two-pronged strategy. In
order to throttle TCP traffic, shrews would need to send at a considerably high proportion
of the link capacity, at which point the attack is no longer a low-rate attack and is
easier to detect.
7.4 Related Work

There has been a plethora of AQM schemes inspired by RED, the seminal work in this area [114]. Quite a few of these schemes aim to provide approximate fairness. However, fairness comes at the cost of maintaining additional state at the router. To minimize state overhead, AQM schemes use only partial flow state. Hence, false negatives can occur, and malicious flows can exploit these weaknesses to gain undue resource advantages. An example of such a malicious flow is the shrew attack, a low-rate TCP-targeted denial of service attack [20]. Various countermeasures to the shrew attack have been proposed. Notable among them are: (a) RTO randomization, although [20] argues that shrews can still filter out portions of TCP traffic; (b) router-level DDoS solutions such as IP traceback [119] and Pushback [118]; and (c) AQM modifications such as HAWK [120], which makes the policy decision to penalize all bursty flows. However, not all bursty traffic is malicious; for example, short TCP flows would be unduly penalized by HAWK due to their bursty nature. Rather than proposing a new AQM scheme, we show that a moderate increase in buffer size, coupled with the use of RED-PD, is sufficient to minimize the impact of shrew attacks on TCP traffic.
7.5 Summary

This chapter studied the effect of buffer sizes on the power of low-rate TCP-targeted DoS attacks to "shut down" competing TCP traffic. Using a simple mathematical analysis coupled with simulations, we showed that a relatively small increase in buffer size can mitigate the effect of shrew attacks on TCP traffic. The intuition behind our result is simple: as the buffer size increases, shrews need to fill a larger buffer to cause multiple TCP packet drops. This means that the shrews need to transmit at high rates, at which point they are no longer low-rate attacks and can be detected by existing AQM schemes such as RED-PD.
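This intuition can be made concrete with a back-of-the-envelope sketch. The symbols here (buffer of B packets, burst length l, drain rate C) are ours, chosen for illustration; this is not the chapter's analysis, only the qualitative relationship it describes.

```python
# Toy illustration of the summary's intuition (assumed symbols, not taken
# from the chapter's derivation): to overflow a FIFO buffer of B packets
# during a burst of length l seconds while the bottleneck drains at C
# packets/sec, a shrew must inject at roughly C + B/l packets/sec.
# Larger buffers push this burst rate up, so the attack stops being low-rate.
def required_burst_rate(link_rate_pps, buffer_pkts, burst_len_s):
    """Approximate send rate needed to fill the buffer within one burst."""
    return link_rate_pps + buffer_pkts / burst_len_s

small = required_burst_rate(1250.0, 50, 0.1)   # small buffer: ~1750 pps
large = required_burst_rate(1250.0, 250, 0.1)  # 5x buffer: ~3750 pps
assert large > small
```

The required burst rate grows linearly with the buffer size, which is exactly why a modest buffer increase pushes shrews out of the "low-rate" regime where they evade detection.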
Acknowledgements

We thank Razvan Musaloiu-E. for the discussions about RED-PD, and Kishore Kothapalli for patiently reviewing the mathematical analysis of the shrew attack.
Chapter 8
Future Work
The work described in this thesis contributes towards the goal of countering emerging malware threats and quantifying the robustness of the Internet in withstanding attacks. In general, we believe the task of ensuring a secure system is a challenging one. Intrusion agents such as worms and trojans are constantly evolving into more developed forms and using emerging technologies as delivery vehicles. In this sense, security research can be likened to an arms race, with miscreants pitted against security providers. To remain a step ahead, a proactive approach to forecasting and countering threats is essential. In keeping with this stance, we now discuss avenues for future research, drawing on insights from the work in this thesis and on what remains to be accomplished.
CHAPTER 8. CONCLUSIONS AND FUTURE WORK
8.0.1 Botnets

The past few years have seen a shift from traditional IRC-style botnets to decentralized and stealthier architectures. As Chapter 2 shows, current P2P bot tracking techniques are inadequate: they allow miscreants to identify botnet trackers and monitors using simple heuristics. Such a capability is dangerous as botnets become more powerful. As already witnessed, botmasters are not averse to carrying out DoS attacks which effectively take down parties inimical to their interests [13]. While cooperatively monitoring botnets holds promise for effectively detecting and tracking P2P botnets, containment of the malware is still quite difficult. The only sure way known to contain P2P botnets today is to deal with infected machines on an individual basis using anti-virus software. While network-centric techniques such as index poisoning work quite well to deter file-sharing P2P searches, their effectiveness against automated worm binaries is debatable and needs to be researched further.
Another emerging trend in the botnet ecosystem is the use of HTTP for C&C communication [121]. While IRC/P2P botnet traffic is easily distinguishable at routers, and accordingly filtered, traffic from HTTP botnets can camouflage itself in the milieu of other web-based traffic. Furthermore, inspection of HTTP traffic is usually frowned upon in enterprises, due to concerns of privacy violations. DNS cache timings and probes could come in handy in identifying outlier HTTP servers, which are potentially malicious.
8.0.2 Mobile Malware

In Chapters 3 and 4, we gave a detailed account of the evolution of mobile malware and possible detection techniques. Since the technique of random moonwalks essentially culls out worm edges in the presence of noisy background traffic, we believe it could be engineered to be robust even in distributed, lossy scenarios in which not all domains cooperate. However, even with effective detection mechanisms, policing nodes as they enter and exit domains is not straightforward. Specifically, extending the design of a hard-LAN [59] to a mobile setting is challenging and an avenue for future work.
A closely related field of research deals with smartphone worms, which can propagate using a variety of vectors, e.g., SMS/MMS, Bluetooth, and WiFi. While anti-virus (AV) software defenses do exist for smartphones, they are limited in their effectiveness due to limited battery power and storage and, more importantly, the fact that AV software depends on signatures. One could devise collaborative worm detection procedures, similar to the moonwalk algorithm, to combat these threats. However, a peculiarity of smartphone networks is the predictable nature of traffic spikes. For example, there is usually a surge in SMS traffic around the new year [122]. Worms can exploit this by propagating during that period. Most intrusion detection mechanisms that rely on historical data to define abnormal events can be evaded by a worm employing this strategy.
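The evasion argument above can be illustrated with a hypothetical sketch of a naive historical-threshold detector. The function, the constant k = 3, and the toy traffic volumes are all ours, chosen for illustration; this is not a detection procedure proposed in this thesis.

```python
# Hypothetical sketch: a naive anomaly detector that flags traffic above
# mean + k*std of historical volumes. Because past new-year surges are part
# of the history, they inflate the baseline, and a worm that propagates only
# during this year's surge stays below the alarm threshold.
import statistics

def is_anomalous(history, observed, k=3.0):
    """Flag the observed traffic volume if it exceeds mean + k*std of history."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return observed > mu + k * sigma

# Toy daily SMS volumes; the two large values model past new-year surges.
history = [100, 105, 98, 102, 500, 99, 101, 480]
assert not is_anomalous(history, 520)   # surge-riding worm traffic: missed
assert is_anomalous(history, 2000)      # only a blatant flood trips the alarm
```

The surge days widen the historical standard deviation enough that surge-plus-worm traffic looks normal, which is precisely the weakness a surge-timed worm would exploit.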
8.0.3 Web-based Malware

In recent years, the web browser has evolved from a single-principal, single-site application to one in which a single page contains a mashup of code and data from multiple, mutually distrusting sites. In Chapter 5 we introduced new abstractions which could combat attacks such as XSS. We believe that browsers still lack essential abstractions that could provide an operating-system-style environment for the websites running within them. A browser functioning as a de facto operating system for executing the client-side components of web applications would provide isolation, resource management, and fault containment, while offering powerful sharing paradigms.
Bibliography
[1] eEye Digital Security, “Code Red Worm,” http://www.eeye.com/html/Research/Advisories/AL20010717.html.
[2] R. Pang, V. Yegneswaran, P. Barford, V. Paxson, and L. Peterson, “Characteristics of Internet Background Radiation,” in Proceedings of ACM IMC, Oct. 2004.
[3] E. Cooke, F. Jahanian, and D. McPherson, “The Zombie Roundup: Understanding, Detecting, and Disrupting Botnets,” in Proceedings of the First Workshop on Steps to Reducing Unwanted Traffic on the Internet, Jul. 2005.
[4] M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, “A Multifaceted Approach to Understanding the Botnet Phenomenon,” in Proceedings of ACM SIGCOMM/USENIX Internet Measurement Conference (IMC), Oct. 2006, pp. 41–52.
[5] C. Nunnery and B. B. Kang, “Locating Zombie Nodes and Botmasters in Decentralized Peer-to-Peer Botnets,” available at: honeynet.uncc.edu/papers/P2PDetectConceptPaper.pdf, 2007.
[6] “DoS Attack Cripples Internet Root Servers,” available at: http://www.informationweek.com/story/showArticle.jhtml?articleID=197003903.
[7] J. Stewart, “Storm Worm DDoS Attack,” available at: http://www.secureworks.com/research/threats/storm-worm, Feb. 2007.
[8] RSnake, “XSS Cheat Sheet,” available at: http://ha.ckers.org/xss.html.
[9] F. Boldewin, “Peacomm.C - Cracking the Nutshell,” available at: http://www.reconstructer.org/papers/Peacomm.C-Crackingthenutshell.zip.
[10] P. Porras, H. Saidi, and V. Yegneswaran, “A Multi-perspective Analysis of the Storm (Peacomm) Worm,” available at: http://www.cyber-ta.org/pubs/StormWorm/report/, Oct. 2007.
[11] P. Maymounkov and D. Mazieres, “Kademlia: A Peer-to-peer Information System Based on the XOR Metric,” in Proceedings of the First International Workshop on Peer-to-Peer Systems (IPTPS), 2002. [Online]. Available: citeseer.ist.psu.edu/maymounkov02kademlia.html
[12] The Honeynet Project & Research Alliance, “Know Your Enemy: Fast-Flux Service Networks, An Ever Changing Enemy,” available at: http://www.honeynet.org/papers/ff/index.html, Jul. 2007.
[13] “Storm Worm Retaliates Against Security Researchers,” available at: http://www.theregister.co.uk/2007/10/25/storm_worm_backlash/, Oct. 2007.
[14] Gartner Research, “Forecast: Mobile Terminals, Worldwide, 2000-2009 (4Q05 Update),” available at: http://www.gartner.com/DisplayDocument?doc_cd=137396, Jan. 2006.
[15] “Zotob Causes Carnage in Corporate Networks,” available at: http://www.netfastusa.com/xq/asp/id.1338/p.5-6-1/qx/PressReleaseview.htm.
[16] “Same Origin Policy,” available at: http://www.mozilla.org/projects/security/components/same-origin.html.
[17] C. Partridge, T. Mendez, and W. Milliken, “Host Anycasting Service,” RFC 1546, 1993.
[18] T. Griffin and G. Wilfong, “An Analysis of BGP Convergence Properties,” in Proceedings of ACM SIGCOMM, Sep. 1999.
[19] G. Appenzeller, I. Keslassy, and N. McKeown, “Sizing Router Buffers,” in Proceedings of ACM SIGCOMM, Aug. 2004. [Online]. Available: citeseer.ist.psu.edu/article/appenzeller04sizing.html
[20] A. Kuzmanovic and E. Knightly, “Low-rate TCP-targeted Denial of Service Attacks (The Shrew vs. the Mice and Elephants),” in Proceedings of ACM SIGCOMM, Aug. 2003. [Online]. Available: citeseer.ist.psu.edu/kuzmanovic03lowrate.html
[21] R. Mahajan, S. Floyd, and D. Wetherall, “Controlling High-bandwidth Flows at the Congested Router,” in ICNP, Nov. 2001. [Online]. Available: citeseer.ist.psu.edu/article/mahajan01controlling.html
[22] J. Grizzard, V. Sharma, C. Nunnery, B. Kang, and D. Dagon, “Peer-to-Peer Botnets: Overview and Case Study,” in Proceedings of the First USENIX Workshop on Hot Topics in Botnets (HotBots’07), Apr. 2007.
[23] J. Liang, N. Naoumov, and K. W. Ross, “The Index Poisoning Attack in P2P File Sharing Systems,” in Proceedings of the 25th IEEE International Conference on Computer Communications (INFOCOM), 2006.
[24] R. Brunner, “A Performance Evaluation of the Kad-protocol,” Master’s thesis, Corporate Communications Department, Institut Eurecom, France, Nov. 2006.
[25] X. Jiang, D. Xu, H. J. Wang, and E. H. Spafford, “Virtual Playgrounds for Worm Behavior Investigation,” in Proceedings of the Eighth International Symposium on Recent Advances in Intrusion Detection (RAID), Sep. 2005.
[26] F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” in Proceedings of the USENIX Annual Technical Conference, FREENIX Track, 2005.
[27] M. Steiner, T. En-Najjary, and E. W. Biersack, “A Global View of KAD,” in Proceedings of the Internet Measurement Conference (IMC), 2007.
[28] B. Sterling, “Microsoft Battles the Storm Worm,” available at: http://blog.wired.com/sterling/2007/09/microsoft-battl.html, Sep. 2007.
[29] MaxMind LLC, “MaxMind GeoIP Country Database,” available at: http://www.maxmind.com/, 2007.
[30] G. Keizer, “Massive Spam Shot of ’Storm Trojan’ Reaches Record Proportions,” available at: http://computerworld.com/action/article.do?command=viewArticleBasic&articleId=9016420, 2007.
[31] A. Ramachandran and N. Feamster, “Understanding the Network-level Behavior of Spammers,” SIGCOMM Comput. Commun. Rev., vol. 36, no. 4, pp. 291–302, 2006.
[32] “Composite Blocking List,” available at: http://cbl.abuseat.org/.
[33] D. C. Hart, “Real Time DNSBL and Spam Trap,” available at: http://tqmcube.com/.
[34] Admins WebSecurity GbR, “Germany’s First Spam Protection Database,” available at: http://www.uceprotect.net/en/index.php/.
[35] D. Stutzbach and R. Rejaie, “Understanding Churn in Peer-to-peer Networks,” in Proceedings of the 6th Internet Measurement Conference (IMC). New York, NY, USA: ACM Press, 2006, pp. 189–202.
[36] “Storm Worm Now Just a Squall,” available at: http://www.washingtonpost.com/wp-dyn/content/article/2007/10/22/AR2007102200021pf.html.
[37] “Measuring the Success Rate of Storm Worm,” available at: http://honeyblog.org/archives/156-Measuring-the-Success-Rate-of-Storm-Worm.html.
[38] “MacOSX Malware Latches onto Bluetooth Vulnerability,” available at: http://www.theregister.co.uk/2006/02/17/macosxbluetoothworm, 2006.
[39] “CRAWDAD: A Community Resource for Archiving Wireless Data at Dartmouth,” available at: http://crawdad.cs.dartmouth.edu/dartmouth/campus.
[40] D. Moore, “Network Telescopes: Observing Small or Distant Security Events,” in 11th USENIX Security Symposium, Invited Talk, Aug. 2002.
[41] Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos, “Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint,” in 22nd Symposium on Reliable Distributed Computing, Florence, Italy, Oct. 2003. [Online]. Available: citeseer.ist.psu.edu/wang03epidemic.html
[42] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-Area Network,” in 1st International Conference on Mobile Systems, Applications, and Services (MobiSys), San Francisco, CA, May 2003.
[43] R. Jain, A. Shivaprasad, D. Lelescu, and X. He, “Towards a Model of User Mobility and Registration Patterns,” SIGMOBILE Mob. Comput. Commun. Rev., vol. 8, no. 4, pp. 59–62, 2004.
[44] S. Eubank, V. S. A. Kumar, M. V. Marathe, A. Srinivasan, and N. Wang, “Structural and Algorithmic Aspects of Massive Social Networks,” in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2004, pp. 718–727.
[45] S. Staniford, D. Moore, V. Paxson, and N. Weaver, “The Top Speed of Flash Worms,” in Proceedings of the ACM Workshop on Rapid Malcode (WORM), Oct. 2004, pp. 33–42.
[46] M. Bailey, E. Cooke, F. Jahanian, J. Nazario, and D. Watson, “Internet Motion Sensor: A Distributed Blackhole Monitoring System,” in Proceedings of the ISOC Network and Distributed System Security Symposium (NDSS), 2005.
[47] M. A. Rajab, F. Monrose, and A. Terzis, “On the Effectiveness of Distributed Worm Monitoring,” in Proceedings of USENIX Security, 2005.
[48] C. Shannon and D. Moore, “The CAIDA Dataset on the Witty Worm - March 19-24, 2004,” http://www.caida.org/passive/witty/. Support for the Witty Worm dataset and the UCSD Network Telescope are provided by Cisco Systems, Limelight Networks, DHS, NSF, CAIDA, DARPA, Digital Envoy, and CAIDA Members.
[49] M. A. Rajab, F. Monrose, and A. Terzis, “Fast and Evasive Attacks: Highlighting the Challenges Ahead,” in Proceedings of the 9th International Symposium on Recent Advances in Intrusion Detection (RAID), Sep. 2006.
[50] H. Hethcote, “The Mathematics of Infectious Diseases,” SIAM Review, vol. 42, no. 4, 2000.
[51] Z. Chen, L. Gao, and K. Kwiat, “Modeling the Spread of Active Worms,” in Proceedings of IEEE INFOCOM, vol. 3, 2003, pp. 1890–1900.
[52] G. S. Canright and K. Engo-Monsen, “Epidemic Spreading over Networks - A View from Neighbourhoods,” Telektronikk, vol. 2005, no. 1, 2005, available at: http://www.telenor.com/telektronikk/volumes/pdf/1.2005/Page_065-085.pdf.
[53] S. Staniford, V. Paxson, and N. Weaver, “How to 0wn the Internet in Your Spare Time,” in Proceedings of the 11th USENIX Security Symposium, Aug. 2002.
[54] E. Anderson, K. Eustice, S. Markstrum, M. Hansen, and P. Reiher, “Mobile Contagion: Simulation of Infection and Defense,” in Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation (PADS). Washington, DC, USA: IEEE Computer Society, 2005, pp. 80–87.
[55] J. Su, K. W. Chan, A. G. Miklas, K. Po, A. Akhavan, S. Saroiu, E. de Lara, and A. Goel, “A Preliminary Investigation of Worm Infections in a Bluetooth Environment,” in 4th Workshop on Rapid Malcode, 2006.
[56] J. W. Mickens and B. D. Noble, “Modeling Epidemic Spreading in Mobile Environments,” in WiSe ’05: Proceedings of the 4th ACM Workshop on Wireless Security. New York, NY, USA: ACM Press, 2005, pp. 77–86.
[57] J.-K. Lee and J. C. Hou, “Modeling Steady-state and Transient Behaviors of User Mobility: Formulation, Analysis, and Application,” in MobiHoc ’06: Proceedings of the Seventh ACM International Symposium on Mobile Ad Hoc Networking and Computing. New York, NY, USA: ACM Press, 2006, pp. 85–96.
[58] “Cisco Network Admission Control,” Cisco NAC, available at: http://www.cisco.com/en/US/netsol/ns466/networkingsolutionspackage.html.
[59] N. Weaver, D. Ellis, S. Staniford, and V. Paxson, “Worms vs. Perimeters: The Case for Hard-LANs,” in Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects, 2004.
[60] D. Whyte, E. Kranakis, and P. van Oorschot, “ARP-Based Detection of Scanning Worms within an Enterprise Network,” in Proceedings of the Annual Computer Security Applications Conference (ACSAC), 2005.
[61] S. E. Schechter, J. Jung, and A. W. Berger, “Fast Detection of Scanning Worm Infections,” in Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID), 2004.
[62] S. Sarat and A. Terzis, “On Using Mobility to Propagate Malware,” in Proceedings of the 5th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Apr. 2007.
[63] Y. Xie, V. Sekar, D. A. Maltz, M. K. Reiter, and H. Zhang, “Worm Origin Identification Using Random Moonwalks,” in Proceedings of the IEEE Symposium on Security and Privacy, May 2005, pp. 242–256.
[64] Y. Xie, V. Sekar, M. K. Reiter, and H. Zhang, “Forensic Analysis for Epidemic Attacks in Federated Networks,” in Proceedings of the IEEE International Conference on Network Protocols, Oct. 2006.
[65] J. Bethencourt, J. Franklin, and M. Vernon, “Mapping Internet Sensors with Probe Response Attacks,” in Proceedings of the 14th USENIX Security Symposium, Aug. 2005, pp. 193–212.
[66] F. Campos, M. Karaliopoulos, M. Papadopouli, and H. Shen, “Spatio-Temporal Modeling of Traffic Workload in a Campus WLAN,” in Proceedings of the Second Annual International Wireless Internet Conference, Boston, USA, 2006.
[67] C. Shannon and D. Moore, “The Spread of the Witty Worm,” IEEE Security and Privacy Magazine, vol. 2, no. 4, pp. 46–50, Jul. 2004.
[68] H.-A. Kim and B. Karp, “Autograph: Toward Automated, Distributed Worm Signature Detection,” in Proceedings of the 13th USENIX Security Symposium (Security 2004), 2004.
[69] S. Singh, C. Estan, G. Varghese, and S. Savage, “Automated Worm Fingerprinting,” in Proceedings of the 6th ACM/USENIX Symposium on Operating System Design and Implementation (OSDI), 2004.
[70] P. Akritidis, W. Chin, V. Lam, S. Sidiroglou, and K. Anagnostakis, “Proximity Breeds Danger: Emerging Threats in Metro-area Wireless Networks,” in Proceedings of the 16th USENIX Security Symposium, 2007.
[71] “W32.Witty.Worm,” available at: http://securityresponse.symantec.com/avcenter/venc/data/w32.witty.worm.html, Mar. 2004.
[72] A. Kumar, V. Paxson, and N. Weaver, “Exploiting Underlying Structure for Detailed Reconstruction of an Internet-scale Event,” in Proceedings of USENIX 1st Workshop on Steps to Reducing Unwanted Traffic on the Internet (SRUTI), 2005.
[73] M. Casado, T. Garfinkel, M. Freedman, A. Akella, D. Boneh, N. McKeown, and S. Shenker, “SANE: A Protection Architecture for Enterprise Networks,” in Proceedings of the 15th USENIX Security Symposium, Aug. 2006.
[74] M. Casado, M. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker, “Ethane: Taking Control of the Enterprise,” in Proceedings of ACM SIGCOMM, 2007.
[75] “The Samy Worm,” available at: http://namb.la/popular/tech.html, Oct. 2005.
[76] “Konqueror Web Browser,” available at: http://www.konqueror.org/features/browser.php.
[77] A. Moshchuk, T. Bragin, and D. Deville, “SpyProxy: Execution-based Detection of Malicious Web Content,” in Proceedings of the Sixteenth USENIX Security Symposium, 2007.
[78] N. Provos, D. McNamee, P. Mavrommatis, K. Wang, and N. Modadugu, “The Ghost in the Browser: Analysis of Web-based Malware,” in Proceedings of the First USENIX Workshop on Hot Topics in Botnets (HotBots’07), Apr. 2007.
[79] D. Crockford, “JSONRequest,” available at: http://www.json.org/module.html.
[80] C. Jackson and H. J. Wang, “Subspace: Secure Cross-Domain Communication for Web Mashups,” in Proceedings of the Sixteenth World Wide Web Conference (WWW), May 2007.
[81] T. Jim, N. Swamy, and M. Hicks, “Defeating Script Injection Attacks with Browser-enforced Embedded Policies,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web. New York, NY, USA: ACM, 2007, pp. 601–610.
[82] H. J. Wang, X. Fan, J. Howell, and C. Jackson, “Protection and Communication Abstractions for Web Browsers in MashupOS,” in SOSP ’07: Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2007, pp. 1–16.
[83] V. T. Lam, S. Antonatos, P. Akritidis, and K. G. Anagnostakis, “Puppetnets: Misusing Web Browsers as a Distributed Attack Infrastructure,” in CCS ’06: Proceedings of the 13th ACM Conference on Computer and Communications Security. New York, NY, USA: ACM, 2006, pp. 221–234.
[84] T. Hardie, “Distributing Authoritative Name Servers via Shared Unicast Addresses,” RFC 3258, Apr. 2002.
[85] S. Sarat and A. Terzis, “On the Use of Anycast in DNS,” HiNRG, Johns Hopkins University Technical Report, Dec. 2004.
[86] Intel Research, “PlanetLab,” 2002, http://www.planet-lab.org/.
[87] R. Elz, R. Bush, S. Bradner, and M. Patton, “Selection and Operation of Secondary DNS Servers,” Jul. 1997.
[88] Y. Rekhter and T. Li, “A Border Gateway Protocol 4 (BGP-4),” RFC 1771, Mar. 1995.
[89] J. Abley, “A Software Approach to Distributing Requests for DNS Service Using GNU Zebra, ISC BIND 9, and FreeBSD,” in Proceedings of USENIX 2004 Annual Technical Conference, FREENIX Track, 2004. [Online]. Available: http://www.usenix.org/events/usenix04/tech/sigs/abley.html
[90] R. Arends, R. Austein, M. Larson, D. Massey, and S. Rose, “DNS Security Introduction and Requirements,” Work in progress: draft-ietf-dnsext-dnssec-intro-08, Dec. 2003.
[91] Internet Systems Consortium, Inc., “ISC F-Root,” http://www.isc.org/ops/f-root/.
[92] J. Abley, “Hierarchical Anycast for Global Service Distribution,” 2003, http://www.isc.org/pubs/tn/?tn=isc-tn-2003-1.html.
[93] “RIPE NCC K-Root,” http://k.root-servers.org/.
[94] “The Route Views Project,” available at: http://www.antc.uoregon.edu/route-views/.
[95] D. Wessels, M. Fomenkov, N. Brownlee, and K. Claffy, “Measurements and Laboratory Simulations of the Upper DNS Hierarchy,” in Proceedings of PAM 2004, Apr. 2004.
[96] C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, “Delayed Internet Routing Convergence,” in Proceedings of ACM SIGCOMM 2000, 2000, pp. 175–187.
[97] S. Savage, A. Collins, E. Hoffman, J. Snell, and T. Anderson, “The End-to-End Effects of Internet Path Selection,” in Proceedings of SIGCOMM 1999, Aug. 1999.
[98] N. Spring, R. Mahajan, and T. Anderson, “Quantifying the Causes of Path Inflation,” in Proceedings of ACM SIGCOMM, Aug. 2003. [Online]. Available: http://www.acm.org/sigcomm/sigcomm2003/papers/p113-spring.pdf
[99] L. Gao and F. Wang, “The Extent of AS Path Inflation by Routing Policies,” in Proceedings of Global Internet Symposium, 2002. [Online]. Available: citeseer.ist.psu.edu/gao02extent.html
[100] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On Power-law Relationships of the Internet Topology,” in SIGCOMM, 1999, pp. 251–262. [Online]. Available: citeseer.ist.psu.edu/michalis99powerlaw.html
[101] P. B. Danzig, K. Obraczka, and A. Kumar, “An Analysis of Wide-Area Name Server Traffic,” in ACM SIGCOMM ’92, 1992.
[102] N. Brownlee and I. Ziedins, “Response Time Distributions for Global Name Servers,” in Proceedings of PAM 2002 Workshop, Mar. 2002.
[103] J. Pang, J. Hendricks, A. Akella, S. Seshan, B. Maggs, and R. De Prisco, “Availability, Usage and Deployment Characteristics of the Domain Name System,” in Proceedings of ACM IMC 2004, 2004.
[104] P. Boothe and R. Bush, “DNS Anycast Stability: Some Early Results,” available at: http://rip.psg.com/~randy/050223.anycast-apnic.pdf, 2005.
[105] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, “DNS Performance and the Effectiveness of Caching,” IEEE/ACM Trans. on Networking, Oct. 2002.
[106] “The AS112 Project,” http://www.as112.net.
[107] C. Huitema, “An Anycast Prefix for 6to4 Relay Routers,” RFC 3068, Jun. 2001.
[108] B. R. Greene and D. McPherson, “ISP Security: Deploying and Using Sinkholes,” http://www.nanog.org/mtg-0306/sink.html.
[109] R. Engel, V. Peris, and D. Saha, “Using IP Anycast for Load Distribution and Server Location,” in Proceedings of Global Internet, Dec. 1998.
[110] H. Ballani and P. Francis, “Towards a Deployable IP Anycast Service,” in Proceedings of WORLDS, Dec. 2004.
[111] K. Park, V. S. Pai, L. Peterson, and Z. Wang, “CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups,” in Proceedings of OSDI ’04, Dec. 2004.
[112] V. Ramasubramanian and E. G. Sirer, “The Design and Implementation of a Next Generation Name Service for the Internet,” in Proceedings of ACM SIGCOMM 2004, Aug. 2004.
[113] C. Villamizar and C. Song, “High Performance TCP in ANSNET,” SIGCOMM Computer Communications Review, vol. 24, no. 5, pp. 45–60, 1994.
[114] S. Floyd and V. Jacobson, “Random Early Detection Gateways for Congestion Avoidance,” IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993. [Online]. Available: citeseer.ist.psu.edu/floyd93random.html
[115] M. Allman and V. Paxson, “On Estimating End-to-end Network Path Properties,” in Proceedings of ACM SIGCOMM, Aug. 1999. [Online]. Available: citeseer.csail.mit.edu/allman99estimating.html
[116] S. Floyd and E. Kohler, “Internet Research Needs Better Models,” in Proceedings of HotNets-I, Oct. 2002. [Online]. Available: citeseer.ist.psu.edu/floyd02internet.html
[117] S. Floyd and K. Fall, “Promoting the Use of End-to-End Congestion Control in the Internet,” IEEE/ACM Transactions on Networking, vol. 7, no. 4, pp. 458–473, Aug. 1999.
[118] J. Ioannidis and S. M. Bellovin, “Implementing Pushback: Router-based Defense Against DDoS Attacks,” in Proceedings of NDSS, Feb. 2002. [Online]. Available: citeseer.ist.psu.edu/ioannidis02implementing.html
[119] S. Savage, D. Wetherall, A. R. Karlin, and T. Anderson, “Practical Network Support for IP Traceback,” in Proceedings of ACM SIGCOMM, 2000, pp. 295–306. [Online]. Available: citeseer.ist.psu.edu/savage00practical.html
[120] Y.-K. Kwok, R. Tripathi, Y. Chen, and K. Hwang, “HAWK: Halting Anomalies with Weighted Choking to Rescue Well-Behaved TCP Sessions from Shrew DoS Attacks,” USC Internet and Grid Computing Lab, Tech. Rep. 2005-5, Feb. 2005.
[121] “Security Bites Podcast: Here Come the HTTP Botnets,” available at: http://www.news.com/2324-126403-6225814.html.
[122] P. Zerfos, X. Meng, S. H. Wong, V. Samanta, and S. Lu, “A Study of the Short Message Service of a Nationwide Cellular Network,” in Proceedings of ACM IMC, 2006.
Vita

Sandeep Sarat received the B.Tech. degree in Computer Science & Engineering from the Indian Institute of Technology, Madras in 2001, and enrolled in the Computer Science & Engineering Ph.D. program at the Johns Hopkins University the same year. His research focuses on the measurement, detection, and containment of current and emerging security threats on the Internet. His interests lie at the intersection of networks, operating systems, and security.

Starting in June 2008, Sandeep will work at Google in their New York office.