

Data Streams

Everything Data, CompSci 216, Spring 2018


How much data is generated every minute in the world?

2

https://fossbytes.com/how-much-data-is-generated-every-minute-in-the-world/


Data stream
•  A potentially infinite sequence, where data arrives one record at a time

•  We only get one look—can’t go back

•  We want to answer a standing query that produces new/updated results as the stream goes by

•  We have limited space to remember whatever we deem necessary to answer the query

3


Several questions about streams
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries

4


Several questions about streams
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service

5


Several questions about streams
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service
•  How do we count the number of unique records seen so far?
–  E.g., # of unique visitors (by IP) to a website

6


Outline
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service
•  How do we count the number of unique records seen so far?
–  E.g., # of unique visitors (by IP) to a website

7


Outline
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service
•  How do we count the number of unique records seen so far?
–  E.g., # of unique visitors (by IP) to a website

8


Sampling static data vs. stream
•  With a static dataset
–  We know the total data size N
–  We can access an arbitrary record
•  With a stream
–  There is no N, just the number of records we have previously seen
–  We only get one look at any record, in arrival order

9


Reservoir sampling
•  Make one pass over data
•  Maintain a reservoir of n records → after reading t records, the reservoir is a random sample of the first t records

•  The algorithm tells us how to update the reservoir upon every new record arrival

10


Simple algorithm
•  Initialize reservoir to the first n records
•  For the t-th (new) record:
–  Pick a random number x from {1, …, t}
–  If x ≤ n, then replace the x-th record in the reservoir with the new record

•  That’s it!

11

[Figure: a stream … 12 18 14 16 23 22 … up to the current position t ("now"), and a reservoir of n records (14 11 9 20 8 18); the new record replaces one of the reservoir slots with probability p = n/t.]
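A minimal Python sketch of the simple algorithm above (the function name and the use of Python's random module are my own, not from the slides):

import random

def reservoir_sample(stream, n):
    """Keep a uniform random sample of n records from a stream, in one pass."""
    reservoir = []
    for t, record in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(record)        # the first n records fill the reservoir
        else:
            x = random.randint(1, t)        # pick x uniformly from {1, ..., t}
            if x <= n:
                reservoir[x - 1] = record   # replace the x-th record in the reservoir
    return reservoir

For example, reservoir_sample(range(1_000_000), 10) reads the stream once and returns 10 records, each present with probability 10/1,000,000.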


But why?
•  Analyze a simple case: reservoir size n = 1
•  The t-th (new) record is included with p = 1/t
•  The probability that an earlier record r, the i-th (i < t), is the sample from the stream of length t is
–  P[r is sampled on arrival] × P[r survives to the end]
   = (1/i) × (1 − 1/(i+1)) × (1 − 1/(i+2)) × … × (1 − 1/t)
   = (1/i) × (i/(i+1)) × ((i+1)/(i+2)) × … × ((t−1)/t)
   = 1/t   (the product telescopes)
•  How about reservoir size n > 1?

12
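A quick way to convince yourself of the 1/t claim above is a short simulation (my own sketch, assuming the n = 1 version of the algorithm):

import random
from collections import Counter

def reservoir_one(stream):
    # Reservoir of size n = 1: the t-th record replaces the current sample with prob. 1/t.
    kept = None
    for t, record in enumerate(stream, start=1):
        if random.randint(1, t) == 1:
            kept = record
    return kept

counts = Counter(reservoir_one(range(1, 11)) for _ in range(100_000))
print(counts)   # each of the 10 records should appear in roughly 10% of the trials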


But why?
•  Baseline: if t = n, obviously the reservoir has a "random sample" of all records seen so far
•  Induction: suppose the reservoir is a random sample of the first t records for t = k − 1
–  i.e., P[r in reservoir after k − 1 steps] = n/(k − 1)
•  What happens when t = k?
–  The new record is included with prob. n/k
–  For any other record r, P[r in reservoir]
   = P[r in reservoir after k − 1 steps] × P[r is not replaced in step k]
   = [n/(k − 1)] × (1 − 1/k) = n/k
   (the new record lands in r's slot with prob. (n/k) × (1/n) = 1/k)

13


An improvement
•  What if skipping new records is cheaper than accessing them one by one?
•  After adding the t-th record to the reservoir:
–  Simulate forward until we need to add a new record to the reservoir; skip until then (sketched in code below)
–  Or calculate the CDF of the skip size:
   P[skip size ≤ s] = 1 − (1 − n/(t+1)) (1 − n/(t+2)) … (1 − n/(t+s+1))
•  Sample from this CDF, skip accordingly
•  Pick one record in the reservoir to replace

14
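A minimal sketch of the "simulate forward" option (my own code; the CDF-inversion variant is not shown). Only random draws are made for the skipped records, which are never read:

import random

def skip_until_next_insert(t, n):
    """Return how many records after position t can be skipped before one
    must be added to a reservoir of size n."""
    skipped = 0
    while True:
        t += 1
        if random.randint(1, t) <= n:   # record t enters the reservoir with prob. n/t
            return skipped
        skipped += 1

The caller then reads only the one record after the skip and replaces a uniformly chosen slot in the reservoir.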


Summary of reservoir sampling
•  Helps create a "running" random sample of fixed size over a stream

•  Very useful when computing/accessing the whole dataset is expensive

15


Outline
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service
•  How do we count the number of unique records seen so far?
–  E.g., # of unique visitors (by IP) to a website

16


Problem boils down to…
•  Given a large set S (e.g., all values seen so far), check whether a given value x is in S
•  Suppose we have n bits of storage (n ≪ |S|)
–  We cannot afford to store S itself

17


Approximation comes to the rescue
•  If x is in S, return true with prob. 1
–  i.e., no false negatives
•  If x is not in S, return false with high prob.
–  i.e., possible false positives

18


Primitive: hash function
•  h: S ⟶ {1, 2, …, n}
–  Hashes values uniformly to integers in [1, n], i.e., P[h(x) = i] = 1/n
•  "Compressing" a value down with one h loses too much information, so we use k independent hash functions h1, h2, …, hk

19
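One common way to get k roughly independent hash functions in practice is to salt a single base hash; a minimal Python sketch (my own construction, using 0-based buckets, not something the slides prescribe):

import hashlib

def make_hashes(k, n):
    """Return k hash functions, each mapping a value to a bucket in {0, ..., n-1}."""
    def make_one(seed):
        def h(x):
            digest = hashlib.sha256(f"{seed}:{x}".encode()).hexdigest()
            return int(digest, 16) % n
        return h
    return [make_one(seed) for seed in range(k)]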


Bloom filter
•  Initialization of Bloom filter B (an n-bit array)
–  Set all n bits of B to 0
•  Add all x in S (all values seen so far)
–  Compute h1(x), h2(x), …, hk(x)
–  Set the corresponding k bits of B to 1

e.g. k=3

20

[Figure: a 10-bit array 0 1 0 0 1 1 0 1 1 0; with k = 3, element x1 sets bits h1(x1), h2(x1), h3(x1) and element x2 sets bits h1(x2), h2(x2), h3(x2).]


Bloom filter
•  Initialization of Bloom filter B
–  Set all n bits of B to 0
•  Add all x in S (all values seen so far)
–  Compute h1(x), h2(x), …, hk(x)
–  Set the corresponding k bits of B to 1
•  Check if a new arrival x is in S
–  Compute h1(x), h2(x), …, hk(x)
–  Return true iff the corresponding k bits of B are all 1

21
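Putting the three operations together, a minimal Bloom filter sketch in Python (class and method names are mine; the salted-hash construction is the same assumption as before):

import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n                 # n-bit array, initially all 0

    def _positions(self, x):
        # k salted hashes of x, each a bucket in {0, ..., n-1}
        return [int(hashlib.sha256(f"{i}:{x}".encode()).hexdigest(), 16) % self.n
                for i in range(self.k)]

    def add(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1              # set the corresponding k bits

    def check(self, x):
        # True iff all k bits are 1: no false negatives, but possible false positives
        return all(self.bits[pos] for pos in self._positions(x))

For example, after B = BloomFilter(n=8000, k=3) and B.add("duke.edu"), B.check("duke.edu") is always True, while B.check("unc.edu") is False with high probability.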


No false negatives
•  If x is really in S
–  Then by construction we have set bits h1(x), h2(x), …, hk(x) to 1
–  So check will surely return true

22

[Figure: the same 10-bit array 0 1 0 0 1 1 0 1 1 0; the bits h1(x1), h2(x1), h3(x1) set when x1 was added are all 1, so the check for x1 returns true.]


False positive probability
•  If x is not in S
–  Check returns true if each bit hj(x) is 1 due to some other value(s) in S

23

[Figure: the same 10-bit array with bits set by x1 and x2; a value not in S can still hash to k bits that all happen to be 1.]


False positive probability
•  If x is not in S
–  Check returns true if each bit hj(x) is 1 due to some other value(s) in S

P[bit i is 1] = 1 − P[bit i was not set by the k|S| hashes]
            = 1 − (1 − 1/n)^(k|S|)

P[k particular bits are 1] = P[bit i is 1]^k
            = (1 − (1 − 1/n)^(k|S|))^k ≈ (1 − e^(−k|S|/n))^k

24


Example
•  Suppose there are |S| = 10^9 elements
•  Suppose we have 1 GB (8×10^9 bits) memory
•  If k = 1:
–  P[false positive] ≈ (1 − e^(−k|S|/n))^k = 1 − e^(−1/8) ≈ 0.1175
•  If k = 2:
–  P[false positive] ≈ (1 − e^(−k|S|/n))^k = (1 − e^(−2/8))^2 ≈ 0.0489

25
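A few lines of Python (mine, just to check the arithmetic) reproduce these numbers and show the trade-off plotted on the next slide:

import math

S = 1e9      # |S|, number of elements
n = 8e9      # bits of memory (1 GB)

def false_positive_prob(k):
    return (1 - math.exp(-k * S / n)) ** k

for k in range(1, 9):
    print(k, round(false_positive_prob(k), 4))
# k=1 -> 0.1175, k=2 -> 0.0489; the minimum lies near k = (n/S) * ln 2 ≈ 5.5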


Example
•  Suppose there are |S| = 10^9 elements
•  Suppose we have 1 GB (8×10^9 bits) memory

26

[Figure: plot of false positive prob. versus k (# hash functions). As we increase the # of bits to set to 1 per element, it is more likely that a bunch of bits become 1 just by chance.]


Summary of Bloom filter
•  Helps check membership in a large set that cannot be stored entirely
•  No false negatives
–  Good for applications like a URL shortener
•  False positive probability can be tweaked by the choice of n and k

27


Outline
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Sample can be used to answer queries
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  E.g., a URL shortening service
•  How do we count the number of unique records seen so far?
–  E.g., # of unique visitors (by IP) to a website

28


Can you use a Bloom filter?
•  Increment a counter whenever check returns false for an incoming value
•  Because of the zero false-negative and nonzero false-positive probabilities, we will consistently underestimate the # of distinct values
•  Also, the Bloom filter does more than we need—can we use the n bits more efficiently?

29


FM (Flajolet-Martin) sketch
Let Tail0(h(x)) = # of trailing consecutive 0's
•  Tail0(101001) = 0
•  Tail0(101010) = 1
•  Tail0(001100) = 2
•  Tail0(101000) = 3
•  Tail0(000000) = 6

30
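Tail0 is easy to compute on the bit string of a hash value; a short sketch (the function name is mine):

def tail0(bits: str) -> int:
    """Number of trailing consecutive 0's in a bit string."""
    return len(bits) - len(bits.rstrip("0"))

assert tail0("101001") == 0
assert tail0("101010") == 1
assert tail0("001100") == 2
assert tail0("101000") == 3
assert tail0("000000") == 6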


FM sketch
•  Maintain a value K (max 0-tail length seen so far)
•  Initialize K to 0
•  For each new value x:
–  Compute Tail0(h(x))
–  Replace K with this value if it is greater than K
•  F' = 2^K is an estimate of F, the true number of distinct elements
•  K requires very little space to store

31
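A minimal FM sketch in Python (my own helper names; hashing with SHA-256 truncated to 32 bits is an assumption, since the slides do not specify h):

import hashlib

def tail0_of(x, bits=32):
    # Hash x to a 32-bit string and count trailing zeros.
    h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % (2 ** bits)
    b = format(h, f"0{bits}b")
    return len(b) - len(b.rstrip("0"))

def fm_estimate(stream):
    K = 0                                  # max 0-tail length seen so far
    for x in stream:
        K = max(K, tail0_of(x))
    return 2 ** K                          # F' = 2^K estimates the number of distinct elements

Only K needs to be stored per sketch; a single estimate can easily be off by a factor of 2 or more, which is what the next slides address.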


Rough intuition
If we have F distinct elements, we'd expect
•  F/2 of them to have Tail0(x) ≥ 1
•  F/4 of them to have Tail0(x) ≥ 2
•  …
•  F/2^K of them to have Tail0(x) ≥ K
→ the largest K we observe should satisfy F/2^K ≈ 1, i.e., 2^K ≈ F

So F' = 2^K is a pretty good guess of F

32


How good is the result?
•  F: the true number of distinct elements
•  F': guess by FM sketch
•  We can show that for all c > 3,
   P[F/c ≤ F' ≤ cF] > 1 − 3/c
•  But that's not very accurate!

33


Use more sketches!
•  Use the "median of means" trick
•  Maintain a × b FM sketches
–  Use independent hash functions!
–  F11, F12, …, F1a
   F21, F22, …, F2a
   …
   Fb1, Fb2, …, Fba

34


Use more sketches!
•  Use the "median of means" trick
•  Maintain a × b FM sketches
•  Compute the mean over each group of a
–  Gi = mean(Fi1, Fi2, …, Fia)

35


Use more sketches!
•  Use the "median of means" trick
•  Maintain a × b FM sketches
•  Compute the mean over each group of a
•  Return the median of the b means as the answer
–  Median(G1, G2, …, Gb)
–  Very close to the original count F for sufficiently large a and b

36
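A sketch of the median-of-means combination (my own code; giving each of the a × b sketches its own hash salt is one simple way to approximate independent hash functions):

import hashlib
import statistics

def tail0_of(x, salt, bits=32):
    h = int(hashlib.sha256(f"{salt}:{x}".encode()).hexdigest(), 16) % (2 ** bits)
    b = format(h, f"0{bits}b")
    return len(b) - len(b.rstrip("0"))

def fm_median_of_means(stream, a, b):
    K = [[0] * a for _ in range(b)]                 # one max-tail counter per sketch
    for x in stream:
        for i in range(b):
            for j in range(a):
                K[i][j] = max(K[i][j], tail0_of(x, salt=(i, j)))
    means = [statistics.mean(2 ** k for k in row) for row in K]   # G_i: mean of each group of a
    return statistics.median(means)                               # median of the b means

With a and b in the tens, the result is typically within a small factor of the true distinct count; the original Flajolet-Martin paper additionally applies a bias-correction constant, which these slides omit.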


Summary of FM sketch
•  Helps estimate the # of distinct elements in a large set that cannot be stored entirely
•  Each FM sketch is very rough, but groups of them improve estimation
•  Trick question: do FM sketches support membership checks like a Bloom filter?
–  No — too much error on any particular check
–  Specialization gives us better efficiency

37


Summary
•  How do we maintain a random sample of size n for all data we've seen so far?
–  Reservoir Sampling
•  How do we maintain a data structure to check if a new arrival has appeared before?
–  Bloom Filter
•  How do we count the number of unique records seen so far?
–  FM Sketch

38


Summary
Tricks for big data covered in class
•  Parallel processing (e.g., MapReduce)
•  Approximate processing
–  Sampling (downsize data)
–  Stream processing (linear time, limited space)

39

Image: http://bigdatapix.tumblr.com/