Worms, Viruses, and Cascading Failures in networks
D. TowsleyU. Massachusetts
Collaborators: W. Gong, C. Zou (UMass) A. Ganesh , L. Massoulie (Microsoft)
Internet as enabler of terrific apps
Internet as enabler of terrific apps
… but also of malicious behavior worms, viruses
Internet as a complex system critical DNS, BGP
infrastructures
Worms and failures Code Red worm
more than 360,000 infected in less than one day
disrupted parts of BGP infrastructure SQL Slammer
less than 15 minutes to infect 75,000 hosts congested parts of Internet
BGP errors in one network → cascade of faults in BGP in another network
Goals
what are appropriate models? deterministic stochastic
what makes worm/virus/failure virulent?
how does topology affect virulence?
Outline
worms, deterministic models cascading failures, stochastic
models summary
Worm spreading behavior
scan for vulnerable hosts sequential, random, topological uniform, local preference
virulence sensitive to scanning strategy host speed, bandwidth protocol …
Worm spreading model
address space, size
N vulnerable hosts
scan rate (per host),
N
Simple worm spreading model
I(t) - number of infected hosts at time t
Epidemic model:
with initial condition I(0)
))t(IN)(t(I)t(I
0.E+00
1.E+05
2.E+05
3.E+05
4.E+05
time
num
ber
inf
ecte
d h
osts
Code Red: model measurements from
two Class A networks scan rate I(t)
epidemic model matches increasing part of observed Code Red data (Staniford)
What about decrease?What about decrease? human
countermeasures congestion
Zou, etal, 20020
100000
200000
300000
400000
500000
600000
2 4 6 8 10 12 14 16 18
# of Scans( Eichman)
Model
04:00 09:00 14:00 19:00 00:000
100,000
200,000
300,000
400,000
500,000
600,000
# of scanstimetime
scan
rat
esc
an r
ate
D. GoldsmithK. Eichman
Assumptions
classic epidemic model ignore countermeasures ignore congestion
Code Red parameters = 358/min N = 360,000
uniform scan, 232
I(0) = 10 100s minutes to spread
0
50000
100000
150000
200000
250000
300000
350000
400000
0 100 200 300 400 500 600
Time (minutes)
No
. in
fect
ed
Worm virulence
increase increase I(0) decrease
Worm virulence
increase increase I(0) decrease smarter scanning
The perfect worm
perfect worm scan vulnerable nodes exactly once
flash worm (Staniford,…)
uniform scan of vulnerable nodes (N)
)()( tIN
tI
))()(()( tINtIN
tI
),min()( / NetI Nt
0.E+00
5.E+04
1.E+05
2.E+05
2.E+05
3.E+05
3.E+05
4.E+05
4.E+05
0 2 4 6 8 10 12 14Time (seconds)
No
. in
fec
ted perfect worm
with delay
flash worm
with delay
Perfect Code Red worm
I(0) = 10 = 358/min N = 360,000 all hosts infected
within 2 sec. add 2 sec. infection
delay -> six-fold slowdown random scan almost
perfect!
I(0) = 10 = 358/min N = 360,000 all hosts infected
within 2 sec. add 2 sec. infection
delay -> six-fold
slowdown random scan almost
perfect!
0.E+00
5.E+04
1.E+05
2.E+05
2.E+05
3.E+05
3.E+05
4.E+05
4.E+05
0 2 4 6 8 10 12 14Time (seconds)
No
. in
fec
ted perfect worm
with delay
flash worm
with delay
Perfect Code Red worm
0.E+00
5.E+04
1.E+05
2.E+05
2.E+05
3.E+05
3.E+05
4.E+05
4.E+05
0 100 200 300 400 500 600
Time (minutes)
No.
infe
cted
Hitlist, routing worms
hitlist worm increases I(0)
routing worm decreases BGP table information:
= .29 232
– 29% of IP address space
Hitlist, routing worms
Code Red style worm
= 358/min N = 360,000 hitlist, I(0) = 10,000
routing worm as effective as hitlist worm
hitlist/routing worm extremely virulent
0
50000
100000
150000
200000
250000
300000
350000
400000
0 100 200 300 400 500 600Time (minutes)
No
. in
fect
ed
Code Red worm
Hit-list worm
Routing worm
Hitlist routing worm
1
Local preference worm
K subnetworks p – probability scan
local subnet (1-p) – prob. scan
outside local subnet 2K
1-p
p
…
Local preference worm
Nk, no. vulnerable hosts in subnet k
Ik(t), no. infected hosts in subnet k
fits epidemic model for interacting groups
set of coupled ODEs
Local preference worm
K = 116 Nk = 360,000/K I1(0) = 10;
Ik(0) = 0, k>1 = 358/min
provides some of the locality of a routing worm
0
50000
100000
150000
200000
250000
300000
350000
400000
0 100 200 300 400 500 600
Time (minutes)
No.
infe
cted
Uniform worm
Routing worm
Preference p=0.1
Preference p=0.5
Preference p=0.99
Questions
topological worms
sequential scan
bandwidth constraints
topology?
failure recovery?
Topology and fast/slow recovery
model description
general network topologies conditions for fast-slow recovery
specific network topologies complete graphs (BGP routers) hypercubes (peer-to-peer networks) power-law graphs (Internet AS graph; E-
mail address book graph)
Susceptible-Infective-Susceptible
(SIS) epidemic model
Also known as contact process; see [Liggett]
topology: undirected, finite graph G=(V,E), connected ;
Xv = 1 if node v down (infected)
Xv = 0 if node v up (healthy)
Model
{Xv vV} Markov process on {0,1}V with jump rates:
Xv → 1 with rate w→vXw
Xv → 0 with rate
unique absorbing state at 0 all other states communicate, 0 is
reachable
Time to absorption system eventually recovers
how long does this take? T = time to hit 0 (from a given
initial condition)
how does E[T] depend on G?
Example
G = line segment or ring with n nodes Fix Theorem (Durrett and Liu): There is
critical c > 0 such that,
if c , then E[T] = O(log n)
if c , then log E[T] ≈ na
signature of phase transition in infinite 1-D lattice.
Fast recovery, spectral radius
- spectral radius of graph adjacency matrix, A; n=|V| .
Then, P(X(t) 0) ≤ c n½ exp([ -]t)
Hence, when < ,
Survival time T satisfies:
E(T) ≤ [log(n)+1]/[ - ]
Coupling proof
Consider “Branching Random Walk”, i.e.Markov process {Yv}vV
Yv → Yv +1 with rate w~v Yw = (AY)v
Yv → Yv -1 with rate Yv
Can couple processes so that, for all t,
X(t) ≤ Y(t).
Branching random walk bound
By “linearity” of Y, dE[Y(t)]/dt = ( A - I) Y(t),soE[Y(t)] = exp( A - I) Y(0) ; Use P(X(t) 0) ≤ vV E[Yv(t)]
Slow recovery
Graph isoperimetric constant:
|||),(|
min2/|:| S
SSEnSS
“perimeter”
“area”S
S
Generalized isoperimetric constant
|||),(|
inf|:| S
SSEmSS
m
Slow die-out and isoperimetric constant
Suppose for some m ≤ n/2, r := [
m] / > 1
Then, with positive probability, epidemics survive for time at least rm/[2m]
Hence, if m = n, survival time T satisfies
log (E[T]) = (na)
Coupling proof
Let |X| = v Xv .
Then |X| dominates process Z on {0,…,m} with transition rates:
z→ z+1 at rate z,z→ z-1 at rate z.
Then study absorption time for Z
Complete graph
Here, = n-1, m = n-m
By picking m = na, a < 1,
Thresholds: fast recovery if / < 1/(n-1)
slow recovery if /> 1/(n-na)
Hypercube {0,1}d
Here, d = log2(n) and = d
For m=2k, k < d, m = d-k
Hence, for k = d, Thresholds: ,
fast recovery if / < 1/d slow recovery if /> 1/[d(1-)]
Erdős-Rényi random graph
edge between each pair of nodes present with probability pn independent of others
dense: dn := npn = Ω(log n)
then ρ ~ ~ dn with high probability
Star network
spectral radius: n1/2
isoperimetric constant: m = 1 for all m < n/2
general results not useful
Specialized analysis yields: for arbitrary constant c > 0, if < c/n1/2,
fast recovery, E[T] = O(log(n))
if /> na-1/2 , for a > 0,slow recovery, log(E[T]) = (na)
Power-law random graph
Power-law graph with exponent : number of degree k vertices k-
E.g. Internet AS graph with = 2.1
Expected degree PLRG [Chung et al]: expected degrees w1 > ··· > wn :
edge (i,j) present w.p. wi wj/k wk
particular choice: wi = c1(i+c2)-1/( -1)
Power-law random graph (2)
Spectral radius of PLRG [Chung et al.,03]:
Denote by m max. expected degree (m=w1), and by d average of expected degrees.
Then:
522
523 .),(
.),()(
m
mA
PLRG, > 2.5
Epidemics on full graph live longer than on sub-graph.
Look at star induced by node 1:slow die-out for / > m-1/2
Compare to spectral radius condition:
Fast die-out for / < m-1/2
Two thresholds differ by m ; same gap as for star
PLRG, 2 < < 2.5
Consider top N nodes, for suitable N;
Erdős-Rényi core, with isoperimetric constant:
= F()
Gap between thresholds and : constant factor, F()
Open problems
gap between upper and lower bounds in sparse ER graphs power law random graphs for < 2.5
spectral radius bound tight in examples, always true?
conditioned on slow recovery, how many nodes are down at intermediate times?
extensions to other graphs and to SIR epidemics
Observations
neither parameter tight gap for topologies with diverse
degrees spectral radius “seems” to be right
nothing between log n and exp(n) ?
Hitlist, routing worms
hitlist worm increase I(0)
routing worm decrease BGP table information: = .29 232
– 29% of IP address space /8 aggregation: = .45 232
– 116 out of 256 possible 8 bit prefixes
0110…0xxx
8
The appearance of phase transitions
N=200, ks =1, kl=0.01Mean time to absorption goes down from 1047 , to about 0 in a matter of few states
Accuracy of fluid model
population: 360,000
scan rate = N(358/min, 1002) normal distr.
scanning space: 232
I(0) =1 100 simulations 0
50000
100000
150000
200000
250000
300000
350000
400000
time (minutes)
No
. In
fect
ed
Mean 95%
5%
Accuracy of fluid model
population: 360,000
scan rate = N(358/min, 1002) normal distr.
scanning space: 232
I(0) =10 100 simulations
0
50000
100000
150000
200000
250000
300000
350000
400000
1 100 200 300 400 500 600 700
time (minutes)
No
. In
fect
ed
Code Red worm
Mean
Accuracy of fluid model
population: 360,000
scan rate = N(358/min, 1002) normal distr.
scanning space: 232
I(0) =10 100 simulations
0
50000
100000
150000
200000
250000
300000
350000
400000
1 100 200 300 400 500 600 700
time (minutes)
No
. In
fect
ed
Mean 95%
5%
Local preference worm
- local scan rate
’- global scan rate
initial conditions Ik(0)
)()(')( tINtItItI kkkj
jkk
Erdős-Rényi random graph
edge between each pair of nodes present with probability pn independent of others
sparse: pn = c log(n)/n, c > 1. then ρ ≤ c’ log(n), ≥ c’’ log(n)
with high probability, for some c’’ < c < c’
dense: dn := npn = Ω(log n)
then ρ ~ ~ dn with high probability