data streaming algorithms for estimating entropy of ...jx/reprints/sigm06entropy_talk.pdfdata...

23
Data Streaming Algorithms for Estimating Entropy of Network Traffic Ashwin Lall University of Rochester Vyas Sekar Carnegie Mellon University Mitsunori Ogihara University of Rochester Jun (Jim) Xu Georgia Inst. of Technology Hui Zhang Carnegie Mellon University June 28th, 2006

Upload: duonglien

Post on 13-Apr-2019

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Data Streaming Algorithms for Estimating

Entropy of Network Traffic

Ashwin LallUniversity of Rochester

Vyas SekarCarnegie Mellon University

Mitsunori OgiharaUniversity of Rochester

Jun (Jim) XuGeorgia Inst. of Technology

Hui ZhangCarnegie Mellon University

June 28th, 2006

Page 2: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Outline

• Problem Definition and Applications

• Some Lower Bounds

• The Streaming Algorithm

• The Sieving Algorithm

• Results

• Open Questions, Conclusions, References

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 1

Page 3: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Streaming Model - Notation

• A stream is an ordered tuple over the alphabet

[n] = 1, 2, 3, . . . , n.

• We denote the number of times that the item i is seen by mi and thetotal number of items in the stream by m, i.e., m =

∑n

i=1mi.

• The number of distinct items in the stream (i.e., the number of itemswith non-zero count) is denoted by n0.

• The tth item in the stream is represented by at. So, the stream isrepresented by (a1, a2, a2, . . . , am), where each at ∈ [n].

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 2

Page 4: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Entropy of a Stream

The sample entropy (or just entropy) of a stream is defined to be

H ≡ −n∑

i=1

(mi/m) log (mi/m).

All logarithms are base 2 and, by convention, we define 0 log 0 ≡ 0.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 3

Page 5: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Entropy of a Stream

We will focus on computing the value S ≡∑n

i=1mi log mi and note that

H = −n∑

i=1

mi

mlog (

mi

m)

=−1

m[∑

i

mi log mi −∑

i

mi log m]

= log m −1

m

i

mi log mi

= log m −1

mS,

so that we can compute H from S if we know the value of m.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 4

Page 6: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Motivation

• Entropy is a good measure of “diversity”

• Simple volume-based methods are insufficient to detect anomalies

• Several works have suggested entropy as a good metric to detectanomalies (Feinstein et al., Lakhina et al., Wagner et al., Xu et al.)

• Entropy can be used to detect a diverse spectrum of attacks, e.g.,denial-of-service, distributed denial-of-service, port scans, worm attacks

• Having access to this information allows us to profile the anomalousbehavior

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 5

Page 7: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Some Lower Bounds

Theorem: Any randomized streaming algorithm to compute the exact valueof S when there are at most m items must use Ω(m) bits of space.

Proof Idea: Reduction from the communication complexity problem of setdisjointness.

Theorem: Any deterministic streaming algorithm to approximate S withrelative error less than 1/3 must use Ω(m) bits of space.

Proof Idea: A counting argument to show that every sublinear algorithmmakes an error of at least 1/3 on some input.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 6

Page 8: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

(ε, δ)-Approximation

An (ε, δ)-approximation algorithm for X is one that returns an estimate X ′

with relative error more than ε with probability at most δ. That is

Pr(|X − X ′| ≥ Xε) ≤ δ.

For example, the user may specify ε = 0.05, δ = 0.01 (i.e., at least 99% ofthe time the estimate is accurate to within 5% error). These parametersaffect the space usage of the algorithm, so there is a tradeoff of accuracyversus space.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 7

Page 9: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Algorithm

The strategy will be to sample as follows:

r = rand(1, m) c = 4

1 5 6 11 .................................. 6 ..... 6 .......... 6 ............ 6 ...

t = 1 t = m

and compute the following estimating variable:

X = m(c log c − (c − 1) log (c − 1)).

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 8

Page 10: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Algorithm Analysis

This estimator X = m(c log c− (c−1) log (c − 1)) is an unbiased estimatorof S since

E[X] =m

m

n∑

i=1

mi∑

j=1

(j log j − (j − 1) log (j − 1))

=

n∑

i=1

mi log mi

= S.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 9

Page 11: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Algorithm Analysis, contd.

Next, we bound the variance of X:

V ar(X) = E(X2) − E(X)2 ≤ E(X2)

=m2

m[

n∑

i=1

mi∑

j=2

(j log j − (j − 1) log (j − 1))2]

≤ mn∑

i=1

mi∑

j=2

(2 log j)2 ≤ 4mn∑

i=1

mi log2 mi

≤ 4m log m(∑

i

mi log mi) ≤ 4(∑

i

mi log mi) log m(∑

i

mi log mi)

= 4S2 log m,

assuming that, on average, each item appears in the stream at least twice.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 10

Page 12: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Algorithm contd.

If we compute s1 = 32 log m/ε2 such estimators and compute their averageY , then by Chebyschev’s inequality we have:

Pr(|Y − S| > εS) ≤V ar(Y )

ε2S2

≤4 log mS2

s1ε2S2=

4 log m

s1ε2

≤1

8.

We do this averaging for s2 = 2 log (1/δ) groups. By a Chernoff boundwe can get that at least s2/2 averages have relative error at most εwith probability at least 1 − δ. Hence, the median of averages is an(ε, δ)-approximation.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 11

Page 13: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Sieving Algorithm

• KEY IDEA: Separating out the elephants decreases the variance, andhence the space usage, of the previous algorithm

• Each packet is now sampled with some fixed probability p

• If a particular item is sampled two or more times, it is considered anelephant and its exact count is estimated

• For all items that are not elephants we use the previous algorithm

• The entropy is estimated by adding the contribution from the elephants(from their estimated counts) and the ants (using the earlier algorithm)

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 12

Page 14: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

The Sieving Algorithm, contd.

0 10 20 30 40 50 600.46

0.47

0.48

0.49

0.5

0.51

0.52

0.53

0.54

0.55

Epoch

Rela

tive

Cont

ribut

ion

to S

Est

imat

e

ElephantsMice

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 13

Page 15: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Evaluation Traces

- Trace 1 Trace 2 Trace 3Date Feb. 2, 2004 Aug. 5, 2003 Apr. 24, 2003

Where USC Los Netto Mid-sized dept. UNC Access LinkEpoch Length 1 min. 5 min. 1 min.Trace Length 60 min. 60 min. 60 min.

# Packets per Epoch 1.7M 0.5M 2.5M# Distinct Addresses 30,267 2,587 25,565

# Distinct Ports 15,165 4,672 8,080

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 14

Page 16: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Algorithms Compared

• Simple sampling

– Sample packets at rate p– Counts are normalized by 1/p and used to compute entropy

• Sample and Hold (Estan and Varghese)

– Sample positions and keep counts from that point on– Adjust counts by adding 1/p and sum to compute entropy

• Algorithm 1 - The Streaming Algorithm

– Uses AMS-type estimator

• Algorithm 2 - The Sieving Algorithm

– Estimates contribution of elephants and ants separately

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 15

Page 17: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Evaluation Results

0 0.05 0.1 0.15 0.20

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Relative error

Frac

tion

of m

easu

rem

ent e

poch

s

SamplingSample and HoldAlgorithm 1Sieving Algorithm

0 0.05 0.1 0.15 0.2 0.250

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Relative error

Frac

tion

of m

easu

rem

ent e

poch

s

SamplingSample and HoldAlgorithm 1Sieving Algorithm

0 0.02 0.04 0.06 0.08 0.1 0.120

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Relative error

Frac

tion

of m

easu

rem

ent e

poch

s

SamplingSample and HoldAlgorithm 1Sieving Algorithm

CDF of the three traces on destination address entropy

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 16

Page 18: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Evaluation Results

Distribution Sample Sample&Hold Algo. 1 SievingDSTADDR 0.198 0.192 0.013 0.013SRCADDR 0.054 0.049 0.017 0.004DSTPORT 0.116 0.109 0.016 0.015SRCPORT 0.069 0.062 0.016 0.005

Trace 3: Mean relative error in S estimate

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 17

Page 19: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Evaluation Results

0 10 20 30 40 50 600.515

0.52

0.525

0.53

0.535

0.54

0.545

0.55

0.555

Epoch

Stan

dard

ized

Entro

py

ActualAlgorithm 1 Estimate

0 10 20 30 40 50 600.52

0.525

0.53

0.535

0.54

0.545

0.55

Epoch

Stan

dard

ized

Entro

py

ActualSieving Estimate

Verifying the accuracy in estimating the standardized entropy

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 18

Page 20: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Open Problems and Conclusions

• There is much to be learnt about streaming algorithms and how theyoutperform traditional sampling methods

• The idea of separating out the elephants and performing computationsfor them and the rest of the stream separately may have a much wideruse

• Are there methods out there to improve our results further? Can we dobetter if we make slightly stronger assumptions about the distribution ofthe stream?

• How hard would it be to maintain the entropy of arbitrary subpopulations?For example, to allow a network operator to study the behavior of asingle source address after the data has been collected.

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 19

Page 21: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

References

• N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the

frequency moments. In Proceedings of the 28th annual ACM Symposium on Theory of

Computation (STOC 1996), New York, NY, USA, 1996. ACM Press.

• A. Chakrabarti, K. Do Ba, S. Muthukrishnan. Estimating entropy and entropy norm

on data streams. In Proceedings of the 23rd International Symposium on Theoretical

Aspects of Computer Science (STACS 2006), Marseille, France, 2006. Springer LNCS.

• B. Kalyanasundaram and G. Schnitger. The probabilistic communication complexity of

set intersection. In SIAM Journal of Discrete Mathematics, 5(4):545-557, 1992.

Thanks for your attention!

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 20

Page 22: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Extra slides - Varying Sampling Rate

0

0.05

0.1

0.15

0.2

0.25

0.3

0 20 40 60 80 100 120

Mea

n re

lativ

e er

ror

Space usage (in KB)

SampleSample&Hold

Algorithm 1Sieving algorithm

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0 20 40 60 80 100 120

Max

imum

rela

tive

erro

r

Space usage (in KB)

SampleSample&Hold

Algorithm 1Sieving algorithm

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 21

Page 23: Data Streaming Algorithms for Estimating Entropy of ...jx/reprints/Sigm06entropy_talk.pdfData Streaming Algorithms for Estimating Entropy of Network Tra c Ashwin Lall University of

Extra slides - Threshold for Sieving

2 3 4 50

0.01

0.02

0.03

0.04

0.05

0.06

Threshold for elephant status (k)

Rela

tive

Erro

r

Trace 1Trace 2Trace3

Data Streaming for Estimating Entropy of Network Traffic SIGMETRICS/Performance 2006 22