Download - SNMP & Flow Introduction
-
8/3/2019 SNMP & Flow Introduction
1/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Compact data structures: data streaming
Luca Becchetti
Sapienza Universita di Roma Rome, Italy
April 26, 2008
http://www.dis.uniroma1.it/~becchet/http://www.uniroma1.it/http://www.uniroma1.it/http://www.dis.uniroma1.it/~becchet/ -
8/3/2019 SNMP & Flow Introduction
2/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
1 What is streaming?
2 Count-Min sketch
3 Inner product and join
4 Heavy hitters
-
8/3/2019 SNMP & Flow Introduction
3/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Informally...
Streaming involves ([Muthukrishnan, 2005a]):
Small number of passes over data. (Typically 1?)
Sublinear space (sublinear in the universe or number ofstream items?)
A model of computation...
Similar to dynamic, online, approximation or randomizedalgorithms, but with more constraints
Constraints impose limitations that make many easyproblems hard (further in this lecture)
Being poly-time/poly-space no longer sufficient
-
8/3/2019 SNMP & Flow Introduction
4/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Motivations
US Bbones [Odlyzko, 2003] [Ipoque GMBH, 2007]
Traffic explosion in past years [Muthukrishnan, 2005a]
30 billions emails, 1 billions SMS, IMs daily (2005) 1 billion ackets router x hr
-
8/3/2019 SNMP & Flow Introduction
5/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Motivations
LAN
LAN
INTERNET
Router
SNMP logFlow log
Packet log
Logs
SNMP: (Router ID, Interface ID, Timestamp, Bytes sent sincelast obs.)
Flow: (Source IP, Dest IP, Start Time, Duration, No. Packets,No. Bytes)
(Source IP, Dest IP, Src/Dest Port Numbers, Time, No. Bytes)
-
8/3/2019 SNMP & Flow Introduction
6/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Challenges [Muthukrishnan, 2005a]
1 link with 2 Gb/s. Say avg packet size is 50 bytesNumber of pkts/sec = 5 Million
Time per pkt = 0.2 sec (time available for processing)
If we capture pkt headers per packet: src/dest IP, time,
no of bytes, etc. at least 10 bytesSpace per second is 50 Mb. Space per day is 4.5 Tb perlink
ISPs have hundreds of links.
Focus is on solutions for real applications
Note: we seek solutions that work in practice easy toimplement, require small space, allow fast updates and queries
-
8/3/2019 SNMP & Flow Introduction
7/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Some typical queries [Muthukrishnan, 2005a]
How many distinct IP addresses use a given link currentlyor anytime during the day?
What are the top k voluminous flows currently inprogress in a link?
How many distinct flows were observed?Are traffic patterns in two routers correlated? What are(un)usual trends?
Network monitoring just one possible application
On-line statistics on search engines query logs
On-line statistics on server logs
...
-
8/3/2019 SNMP & Flow Introduction
8/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
The streaming model [Muthukrishnan, 2005b]
Processingunit
Uj
Uj+1
Uj+2
Uj+3
Streaming model
Data flowData item arrive over time
Uj processed before Uj+1 arrives
Only one pass (or few passes, at most O(log))
-
8/3/2019 SNMP & Flow Introduction
9/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
The streaming model [Muthukrishnan, 2005b]
Underlying data (signal): an n-dimensional array A, n
typically large (e.g., the size of the IP address space)Update arrive over time. The j-th update is a pairUj = (i, x), where i is an index:
Ai = Ai + x
In general, x can be any
Initially: Ai = 0, i = 1, . . . n
A(t): the state of the array after the first t updates
Goal
Compute and maintain functions over A in small space,with fast updates and computation
typically, space
-
8/3/2019 SNMP & Flow Introduction
10/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Special cases of the model[Muthukrishnan, 2005b]
j-th update changes A(j) (Time series model)
Cash register: x 0
Turnstile model: most general model
In this lecture
Compact summaries of data streams (Count-Minsketches [Cormode and Muthukrishnan, 2005])
Applications: point queries, range sums, heavy hitters,quantiles
Only a drop in a sea of results...
-
8/3/2019 SNMP & Flow Introduction
11/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
The CM sketch[Cormode and Muthukrishnan, 2005]
In the sequel: cash-register model
2-Dimensional array whose size is determined by designparameters and (their meaning explained further)
Array is C[j, l], where j = 1, . . . , d and l = 1, . . . , w
d =
ln 1
(depth)w =
e
(width)
Every entry initially 0
d hash functions h1, . . . , hd chosen uniformly at random
from a pairwise-independent family (see first lecture)hr : {1, . . . , n} {1, . . . , w}
Update
Pair (i, c) is observed, meaning that, ideally, Ai = Ai + c
-
8/3/2019 SNMP & Flow Introduction
12/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
CM sketch: Update procedure
w
d
hd
h1
i
+c
+c
+c
+c
CM sketch update
update(i, c)
Require: i: array index, c: value
1: for j : 1 . . . d do2: C[j, hj(i)] = C[j, hj(i)] + c3: end for
-
8/3/2019 SNMP & Flow Introduction
13/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Point query
Basic query, building block for the others
Q(i): estimate Ai
Point query estimate
PQ(i)Require: i: array index
1: return Ai = minj C[j, hj(i)]
Theorem ([Cormode and Muthukrishnan, 2005])
Ai Ai. Furthermore, P
Ai > Ai + A1
, where
A1 =n
i=1 |Ai| is the 1-norm of A.
-
8/3/2019 SNMP & Flow Introduction
14/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Proof of theorem
Define Iijk = 1 if (i = k)
(hj(i) = hj(k)), 0 otherwise
P[hj(i) = hj(k)] 1
w =
e by pairwise independenceDefine Xij =
nk=1 IijkAk
Xij 0 and C[i,j] = Ai + Xij Ai Ai
Xij measures the error introduced by collisions
E[Xij] = E[n
k=1 IijkAk] =n
k=1 AkE[Iijk] eA1
Notice that the only random variables are the Iijks andthe Xijs
Furthermore,
P
Ai > Ai + A1
= P[j : C[j, hj(i)] > Ai + A1]
= P[j : Ai + Xij > Ai + A1] = P[j : Xij > eE[Xij]]
< ed ,
where the fourth inequality follows from Markovs inequality.
-
8/3/2019 SNMP & Flow Introduction
15/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Join size
4 2 2 3 4 2
142142
R.A
S.A
1 2 3 4
1 2 3 4
*
*R.B
R.C
Join of two relations R and S on same attribute A
Tuple
We want to compute COUNT(R 1A S)
Space (n) required (in the picture: n = 4)
In the example: COUNT(R 1A S) = 3x2 + 2x2 = 10
Underlying operation: compute scalar product of two
non-negative vectors of same size
S i
-
8/3/2019 SNMP & Flow Introduction
16/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Inner product
Given two vectors A and B, we want to keep a compact
summary of A B
We use two CM sketches CA and CB for A and B
Estimation
Let Sj =w
k=1 CA[j, k]CB[j, k]Estimation: return S minj Sj
Theorem ([Cormode and Muthukrishnan, 2005])
S A B. Furthermore:
P[S > A B + A1B] .
Q1: Prove theorem above. Proof follows same lines as forpoint query
St i H hi
-
8/3/2019 SNMP & Flow Introduction
17/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Heavy hitters[Cormode and Muthukrishnan, 2005]
-heavy hitters of A: {i : Ai > A1}
No. -heavy hitters: between 0 and 1/
Approximate heavy hitters: accept i such thatAi ( )A1 for some specified <
We consider the cash-register model
Heavy hitters algorithm: ingredients
CM sketch and point query basic building blocks
Return items whose estimate exceeds A1
Assume cs is the s-th update A1 =t
s=1 cs A1 can be easily maintained and updated in smallspace
Streaming H hi d d
-
8/3/2019 SNMP & Flow Introduction
18/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Heavy hitters: update and query
update(i, c, H) {Update CM sketch}
Require: i: array index, c 0, H: heap1: Ai = PQ(i)
2: if Ai > A1 then3: HeapUpdate(i, Ai, H) {Insert if i H}4: end if
5:s = HeapMin(H)
6: while PQ(s) A1 do7: HeapDelete(s)
8: s = HeapMin(H)
9: end while
Heap and query
Generic heap element: pair (i, Ai) ordered by Ai
Heavy hitters: return all elements i in H such that
Ai > A1
Streaming H hi f
-
8/3/2019 SNMP & Flow Introduction
19/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Heavy hitters: performance
Theorem ([Cormode and Muthukrishnan, 2005])
Assume Inserts only (cash register model). With CM sketchesusing space O
1
log n
and update time O
log n
per item:
Every heavy hitter is output
With probability at least 1 : i) no item whose real
count is ( )A1 is output and ii) the number ofitems in the heap is O
1
Question
Assume d =e
and w =
ln n
and let T be the estimatedset of heavy hitters. Recall that Ai Ai. Assume that, afterany number t of insertions: Ai < ( )A1. Prove that
P[i T] .
Streaming S l ti
-
8/3/2019 SNMP & Flow Introduction
20/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Solution
Since i T, we have Ai > A1 j : C[i, hj(i)] > A1.
Hence:
Ai < ( )A1 C[i, hj(i)] Ai > A1, j.
This implies:
P[i T] = P
Ai > Ai + A1
= P[j : C[i, hj(i)] > Ai + A1] < ed =
n,
where the third inequality follows from the general result seenfor PQ(i). Finally, there are at most n items, so:
P[i : i T] .
Streaming
-
8/3/2019 SNMP & Flow Introduction
21/22
Streaming
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Cormode, G. and Muthukrishnan, S. (2005).
An improved data stream summary: the count-min sketch andits applications.
J. Algorithms, 55(1):5875.
Ipoque GMBH, G. (2007).
Internet study 2007. URL: http://www.ipoque.com/.
Muthukrishnan, S. (2005a).Data stream algorithms. URL:http://www.cs.rutgers.edu/muthu/str05.html.
Muthukrishnan, S. (2005b).
Data streams: Algorithms and applications.In Foundations and Trends in Theoretical Computer Science,Now Publishers or World Scientific, volume 1.
Draft at authors homepage:http://www.cs.rutgers.edu/muthu/stream-1-1.ps.
Streaming
-
8/3/2019 SNMP & Flow Introduction
22/22
S g
L. Becchetti
What isstreaming?
Count-Minsketch
Inner productand join
Heavy hitters
Odlyzko, A. M. (2003).
Internet traffic growth: sources and implications.
In Proc. of SPIE conference on Optical Transmission Systemsand Equipment for WDM Networking, pages 115.