compact data structures: data streamingbecchett/elective/slide/stream.pdf · 30 billions emails, 1...
TRANSCRIPT
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Compact data structures: data streaming
Luca Becchetti
“Sapienza” Universita di Roma – Rome, Italy
April 26, 2012
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
1 What is streaming?
2 Probabilistic excursus
3 Count-Min sketch
4 Heavy hitters
5 Inner product and join
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Motivations
US Bbones [Odlyzko, 2003] [Ipoque GMBH, 2007]
Traffic explosion in past years [Muthukrishnan, 2005a]30 billions emails, 1 billions SMS, IMs daily (2005)≈ 1 billion packets/router x hr
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Motivations
LAN
LAN
INTERNET
Router
SNMP logFlow log
Packet log
Logs
SNMP: (Router ID, Interface ID, Timestamp, Bytes sent sincelast obs.)
Flow: (Source IP, Dest IP, Start Time, Duration, No. Packets,No. Bytes)
(Source IP, Dest IP, Src/Dest Port Numbers, Time, No. Bytes)
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Challenges [Muthukrishnan, 2005a]
1 link with 2 Gb/s. Say avg packet size is 50 bytes
Number of pkts/sec = 5 Million
Time per pkt = 0.2 µsec (time available for processing)
If we capture pkt headers per packet: src/dest IP, time,no of bytes, etc. at least 10 bytes
Space per second is 50 MB. Space per day is 4.5 TB perlink
ISPs have hundreds of links.
Focus is on solutions for real applications
Note: we seek solutions that work in practice → easy toimplement, require small space, allow fast updates and queries
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Some typical queries [Muthukrishnan, 2005a]
How many distinct IP addresses use a given link currentlyor anytime during the day?
What are the top k voluminous flows currently inprogress in a link?
How many distinct flows were observed?
Are traffic patterns in two routers correlated? What are(un)usual trends?
Network monitoring just one possible application
On-line statistics on search engines’ query logs
On-line statistics on server logs
Finding near-duplicate Web pages
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Informally...
Streaming involves ([Muthukrishnan, 2005a]):
Small number of passes over data. (Typically 1?)
Sublinear space (sublinear in the universe or number ofstream items?)
A model of computation...
Similar to dynamic, online, approximation or randomizedalgorithms, but with more constraints
Constraints impose limitations that make many “easy”problems hard (further in this lecture)
Being poly-time/poly-space no longer sufficient
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
The streaming model [Muthukrishnan, 2005b]
Processingunit
Uj
Uj+1
Uj+2
Uj+3
Streaming model
Data flow
Data item arrive over time
Uj processed before Uj+1 arrives
Only one pass (or few passes, at most O(log))
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
The streaming model [Muthukrishnan, 2005b]
Underlying data (signal): an n-dimensional array A, ntypically large (e.g., the size of the IP address space)
Update arrive over time. The j-th update is a pairUj = (i , x), where i is an item (index):
Ai = Ai + x
In general, x can be any
Initially: Ai = 0,∀i = 1, . . . n
A(t): the state of the array after the first t updates
Goal
Compute and maintain functions over A in small space,with fast updates and computation
typically, space << n
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Caveats about updates, items and values
j-th update Uj = (i , x)
item i is from a discrete universe of finite size
value x
Example
i = (Source IP, Dest. IP, Protocol)x = packet size (in bytes)
Wlog, i can be considered an integer
Hash to an integer otherwise. Example:
i = (“151.100.12.3”, “210.15.0.2”, “TCP”)
i → h(i), where h(·) maps strings to integers
For example, the integer corresponding to theconcatenation of the strings
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Special cases of the model[Muthukrishnan, 2005b]
j-th update changes A(j) (Time series model)
Cash register: x ≥ 0
Turnstile model: most general model
In this lecture
Compact summaries of data streams (Count-Minsketches [Cormode and Muthukrishnan, 2005])
Statistics: point queries, heavy hitters, join-size, No.distinct items
Only a drop in a sea of results...
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Statistics and aggregates
Point query Q(i): estimate Ai
basic building block for more complex queries
φ-heavy hitters of A: return all i : Ai > φ‖A‖1
Other aggregates (see further)
Join size of two DB relations observed in a streamingfashion
Scalar product of two streams (viewed as two vectors ofsame size)
Number of distinct items observed in A
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Markov’s and Chebyshev’s inequalities
Theorem (Markov’s inequality)
Let X denote a random variable that assumes onlynon-negative values. Then, for every a > 0:
P[X ≥ a] ≤ E[X ]
a.
Theorem (Chebyshev’s inequality)
Let X denote a random variable. Then, for every a > 0:
P[|X − E[X ] | ≥ a] ≤ var [X ]
a2.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Hash functions
- Often used to implement dictionaries- In general, a hash function h : U → [n] maps elements ofsome discrete universe U onto integers belonging to somerange [n] = 0, 1, . . . , n− 1. Typically, |U| >> n. Ideally, themapping should be uniform. We assume without loss ofgenerality that U is some subset of the integers (why can westate this?)
U h(x)
0
n-1
Ideal behaviour: items in U mapped uniformly at random in 0, ... , n-1
x
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Hash functions/2
Mapping should look “random” → If m items are mapped,then every i ∈ [n] should be the image of ≈ m/n items.
E.g.: if U = 0, . . . ,m − 1 consider h(x) = x mod p,with p a suitable prime.
Problem: this works if items from U appear at random→ often many correlations present
Create an adversarial sequence that maps all elements ofthe sequence onto the same i
Main question in many applications: mitigate the impactof adversarial sequences
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Randomizing the hash function
Use a randomly generated hash function to map items tointegers.Idea: even if correlations present, items are mappedrandomly. Ideal behaviour (hard to achieve in practice)
For each x ∈ U, P[h(x) = j ] = 1/n, for everyj = 1, . . . , n
The values h(x) are independent
Caveats
This does not mean that every evaluation of h(x) yields adifferent random mapping, but only that h(x) is equallylikely to take any value in [0, . . . , n − 1]
Not easy to design an “ideal” hash function (many trulyrandom bits necessary)
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Families of universal hash functions
We assume we have a suitably defined family F of hashfunctions, such that every member of h ∈ F is a functionh : U → [n].
Definition
F is a 2-universal hash family if, for any h(·) chosen uniformlyat random from F and for every x , y ∈ U we have:
P[h(x) = h(y)] ≤ 1
n.
Definitions generalizes to k-universality[Mitzenmacher and Upfal, 2005, Section 13.3]
Problem: define “compact” universal hash families
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
A 2-universal family
Assume U = [m] and assume the range of the hash functionswe use is [n], where m ≥ n (typically, m >> n). We considerthe family F defined by hab(x) = ((ax + b) mod p) mod n,where a ∈ 1, . . . , p − 1, b ∈ 0, . . . , p and p is a primep ≥ m.
How to choose u.a.r. from FFor a given p: Simply choose a u.a.r. from 1, . . . , p − 1 andb u.a.r. from 0, . . . , p
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
A 2-universal family/cont.
Theorem ([Carter and Wegman, 1979,Mitzenmacher and Upfal, 2005])
F is a 2-universal hash family. In particular, if a, b are chosenuniformly at random:
P[hab(x) = i ] =1
n,∀x ∈ U, i ∈ [n].
P[hab(x) = hab(y)] ≤ 1
n,∀x , y ∈ U.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
The CM sketch[Cormode and Muthukrishnan, 2005]
In the remainder: cash-register model
2-Dimensional array whose size is determined by designparameters ε and δ (their meaning explained further)
Array is C [j , l ], where j = 1, . . . , d and l = 1, . . . ,w
d =⌈ln 1
δ
⌉(depth)
w =⌈eε
⌉(width)
Every entry initially 0
d hash functions h1, . . . , hd chosen uniformly at randomfrom a 2-universal (pairwise-independent) family (seefirst lecture)
hr : 1, . . . , n → 1, . . . ,w
Update
Pair (i , c) is observed, meaning that, ideally, Ai = Ai + c
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
CM sketch: Update procedure
w
d
hd
h1
i
+c
+c
+c
+c
CM sketch update
update(i, c)
Require: i: array index, c: value
1: for j : 1 . . . d do2: C[j, hj(i)] = C[j, hj(i)] + c
3: end for
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Point query
Basic query, building block for the others
Q(i): estimate Ai
Point query estimate
PQ(i)
Require: i: array index
1: return Ai = minj C[j, hj(i)]
Theorem ([Cormode and Muthukrishnan, 2005])
Ai ≥ Ai . Furthermore, P[Ai > Ai + ε‖A‖1
]≤ δ, where
‖A‖1 =∑n
i=1 |Ai | is the 1-norm of A.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Proof of theorem
Define Iijk = 1 if (i 6= k)⋂
(hj(i) = hj(k)), 0 otherwise
P[hj(i) = hj(k)] ≤ 1w ≤
εe by pairwise independence
Define Xij =∑n
k=1 IijkAk
Xij ≥ 0 and C [j , hj(i)] = Ai + Xij → Ai ≥ Ai
Xij is the error introduced by collisions
E[Xij ] = E[∑n
k=1 IijkAk ] =∑n
k=1 AkE[Iijk ] ≤ εe ‖A‖1
Notice that the only random variables are the Iijk ’s andthe Xij ’s
Furthermore,
P[Ai > Ai + ε‖A‖1
]= P[∀j : C [j , hj(i)] > Ai + ε‖A‖1]
= P[∀j : Ai + Xij > Ai + ε‖A‖1] = P[∀j : Xij > eE[Xij ]]
= Πdj=1P[Xij > eE[Xij ]] < e−d ≤ δ,
where the fifth inequality follows from Markov’s inequality.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Heavy hitters[Cormode and Muthukrishnan, 2005]
φ-heavy hitters of A: i : Ai > φ‖A‖1Fact: No. φ-heavy hitters between 0 and 1/φ
Approximate heavy hitters: accept i such thatAi ≥ (φ− ε)‖A‖1 for some specified ε < φ
We consider the cash-register model
Heavy hitters algorithm: ingredients
CM sketch and point query basic building blocks
Return items whose estimate exceeds φ‖A‖1
Assume cs is the s-th update → ‖A‖1 =∑t
s=1 cs →‖A‖1 can be easily maintained and updated in smallspace
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Heavy hitters: update and query
update(i, c, S, H)
Require: i: array index, c ≥ 0, S: CM sketch, H: heap
1: update(i, c, S) Update CM sketch2: Ai = PQ(i)
3: if Ai > φ‖A‖1 then4: HeapUpdate(i, Ai, H) Insert if i 6∈ H5: end if6: s = HeapMin(H)
7: while PQ(s) ≤ φ‖A‖1 do
8: HeapDelete(s)
9: s = HeapMin(H)
10: end while
Heap and query
Generic heap element: pair (i , Ai ) ordered by Ai
Heavy hitters: return all elements i in H such thatAi > φ‖A‖1
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Heavy hitters: performance
Theorem ([Cormode and Muthukrishnan, 2005])
Assume Inserts only (cash register model). With CM sketches
using space O(
1ε log ‖A‖1
δ
)and update time O
(log ‖A‖1
δ
)per
item:
Every heavy hitter is output
With probability at least 1− δ : i) no item whose realcount is ≤ (φ− ε)‖A‖1 is output and ii) the number of
items in the heap is O(
1φ−ε
)Question
Assume d =⌈eε
⌉and w =
⌈ln n
δ
⌉and let T be the estimated
set of heavy hitters. Recall that Ai ≥ Ai . After any number tof insertions, define Sε = i : Ai < (φ− ε)‖A‖1. Prove that
P[Sε ∩ T 6= ∅] ≤ δ.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Solution
Consider i ∈ Sε. If i ∈ T ,Ai > φ‖A‖1 → ∀j : C [j , hj(i)] > φ‖A‖1. Hence:
Ai < (φ− ε)‖A‖1 → C [j , hj(i)]− Ai > ε‖A‖1,∀j .
This implies:
P[i ∈ T ] = P[Ai > Ai + ε‖A‖1
]= P[∀j : C [j , hj(i)] > Ai + ε‖A‖1] < e−d =
δ
n,
where the third inequality follows from the general result seenfor PQ(i). Finally, since |Sε| ≤ n:
P[Sε ∩ T 6= ∅] = P[∪i∈Sε(i ∈ T )] ≤ δ
n· |Sε| ≤ δ.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Join size
4 2 2 3 4 2
142142
R.A
S.A
1 2 3 4
1 2 3 4
**R.B
R.C
Join of two relations R and S on same attribute A
Tuple
We want to compute COUNT(R 1A S)
Space Ω(n) required (in the picture: n = 4)In the example: COUNT(R 1A S) = 3x2 + 2x2 = 10
Underlying operation: compute scalar product of twonon-negative vectors of same size
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Inner product
Given two streams A and B, we want to keep a compactsummary of A · BWe use two CM sketches CA and CB for A and B
Estimation
Let Sj =∑w
k=1 CA[j , k]CB [j , k]
Estimation: return S ← minj Sj
Theorem ([Cormode and Muthukrishnan, 2005])
S ≥ A · B. Furthermore:
P[S > A · B + ε‖A‖1‖B‖1] ≤ δ.
Q1: Prove theorem above. Proof follows same lines as forpoint query
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Carter, J. L. and Wegman, M. N. (1979).
Universal classes of hash functions.
Journal of Computer and System Sciences, 18(2):143–154.
Cormode, G. and Muthukrishnan, S. (2005).
An improved data stream summary: the count-min sketch andits applications.
J. Algorithms, 55(1):58–75.
Ipoque GMBH, G. (2007).
Internet study 2007. URL: http://www.ipoque.com/.
Mitzenmacher, M. and Upfal, E. (2005).
Probability and Computing : Randomized Algorithms andProbabilistic Analysis.
Cambridge University Press.
Muthukrishnan, S. (2005a).
Data stream algorithms. URL:http://www.cs.rutgers.edu/˜muthu/str05.html.
Streaming
L. Becchetti
What isstreaming?
Probabilisticexcursus
Count-Minsketch
Heavy hitters
Inner productand join
Muthukrishnan, S. (2005b).
Data streams: Algorithms and applications.
In Foundations and Trends in Theoretical Computer Science,Now Publishers or World Scientific, volume 1.
Draft at authors homepage:http://www.cs.rutgers.edu/˜muthu/stream-1-1.ps.
Odlyzko, A. M. (2003).
Internet traffic growth: sources and implications.
In Proc. of SPIE conference on Optical Transmission Systemsand Equipment for WDM Networking, pages 1–15.