1 maintaining bernoulli samples over evolving multisets rainer gemulla wolfgang lehner technische...
TRANSCRIPT
![Page 1: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/1.jpg)
1
Maintaining Bernoulli SamplesOver Evolving Multisets
Rainer Gemulla Wolfgang Lehner
Technische Universität Dresden
Peter J. Haas
IBM Almaden Research Center
![Page 2: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/2.jpg)
2
Motivation• Sampling: crucial for information systems
– Externally: quick approximate answers to user queries– Internally: Speed up design and optimization tasks
• Incremental sample maintenance– Key to instant availability– Should avoid base-data accesses
Sample
Data
(updates), deletes, inserts
“Local”
“Remote”
XToo expensive!
![Page 3: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/3.jpg)
3
What Kind of Samples?• Uniform sampling
– Samples of equal size have equal probability– Popular and flexible, used in more complex schemes
• Bernoulli– Each item independently included (prob. = q)– Easy to subsample, parallelize (i.e., merge)
• Multiset sampling (most previous work is on sets)– Compact representation– Used in network monitoring, schema discovery, etc.
ab
acb
aa c
Dataset R Sample S
Bern(q)a
a
ab
![Page 4: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/4.jpg)
4
Outline
• Background– Classical Bernoulli sampling on sets– A naïve multiset Bernoulli sampling algorithm
• New sampling algorithm + proof sketch– Idea: augment sample with “tracking counters”
• Exploiting tracking counters for unbiased estimation– For dataset frequencies (reduced variance)– For # of distinct items in the dataset
• Subsampling algorithm• Negative result on merging• Related work
![Page 5: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/5.jpg)
5
Classical Bernoulli sampling
• Bern(q) sampling of sets– Uniform scheme– Binomial sample size
• Originally designed for insertion-only• But handling deletions from R is easy
– Remove deleted item from S if present
{ } ( ; , ) (1 )R nnRP S n B n R q q q
n
![Page 6: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/6.jpg)
6
Multiset Bernoulli Sampling
• In a Bern(q) multiset sample:- Frequency X(t) of item t T is Binomial(N(t),q)
- Item frequencies are mutually independent
• Handling insertions is easy– Insert item t into R and (with probability q) into S– I.e., increment counters (or create new counters)
• Deletions from a multiset: not obvious– Multiple copies of item t in both S and R– Only local information is available
![Page 7: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/7.jpg)
7
A Naïve Algorithm
Sample
DataDeletion of t
Delete t from sample
With prob. X(t) / N(t)
X(t) copies of item t in sample
N(t) copies of item t in dataset
– Problem: must know N(t)• Impractical to track N(t) for every distinct t in dataset• Track N(t) only for distinct t in sample?
– No: when t first enters, must access dataset to compute N(t)
Insertion of t
Insert t into sample
With prob. q
![Page 8: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/8.jpg)
8
New Algorithm• Key idea: use tracking counters (GM98)
– After j-th transaction, augmented sample Sj is
Sj = { (Xj (t),Yj (t)): t T and Xj (t) > 0}• Xj(t) = frequency of item t in the sample
• Yj(t) = net # of insertions of t into R since t joined sample
Deletion of t
Delete t from sample
With prob. (Xj(t) – 1) / (Yj(t) – 1)
Sample
Data
Xj(t) copies of item t in dataset
Nj(t) copies of item t in dataset
Insertion of t
Insert t into sample
With prob. q
![Page 9: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/9.jpg)
9
A Correctness Proofi Ri Si
i*-1 {t t} { }
i* {t t t} {t}
i*+1 {t t t t} {t t}
… … …
j {t t t t … t} {t t … t}Yj - 1 items Xj - 1 items
![Page 10: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/10.jpg)
10
A Correctness Proofi Ri Si
( | ) ( 1| 1) ( 1; 1, )1 1jj j jP X k Y m P k Y m B mX k q
i*-1 {t t} { }
i* {t t t} {t}
i*+1 {t t t t} {t t}
… … …
j {t t t t … t} {t t … t}Yj - 1 items Xj - 1 items
Red sample obtained from red dataset via naïve algorithm, hence Bern(q)
![Page 11: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/11.jpg)
11
Proof (Continued)
• Can show (by induction)
– Intuitive when insertions only (Nj = j)
• Uncondition on Yj to finish proof
(1 ) if 0( )
(1 ) otherwise
j
j
N
j N m
q mP Y m
q q
( ) ( | ) ( )j j j jmP X k P X k Y m P Y m
= B(k-1;m-1,q) by previous slide
![Page 12: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/12.jpg)
12
Frequency Estimation• Naïve (Horvitz-Thompson) unbiased estimator
• Exploit tracking counter:
• Theorem
• Can extend to other aggregates (see paper)
1 1ˆiX
iiXN
q
X
1 if 0ˆ0 if
1
0i
iY
i
i q YN
Y
Y
andˆ ˆ ˆ[ ] [ ] [ ]i i iY i Y XE N N V N V N
![Page 13: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/13.jpg)
13
Estimating Distinct-Value Counts
• If usual DV estimators unavailable (BH+07)• Obtain S’ from S: insert t D(S) with probability
• Can show: P(t S’) = q for t D(R)• HT unbiased estimator: = |S’| / q• Improve via conditioning (Var[E[U|V]] ≤ Var[U]):
1 if ( ) 1( )
if ( ) 1
Y tp t
q Y t
ˆHTD
( )ˆ ˆ[ | ] ( ) /Y HT t D S
D E D S p t q
![Page 14: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/14.jpg)
14
Subsampling
aa a
b b cBern(q) sample of R
aaa
bb
c
a a
bdd
R
• Why needed?– Sample is too large– For merging
• Challenge:– Generate statistically
correct tracking-counter value Y’
• New algorithm– See paper a
a c Bern(q’) sample of R
q’ < q
![Page 15: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/15.jpg)
15
Merging
R1aa a
b b c
S1a
a
c
sample
R2a a
b d d
S2a
d
b
sample
aaa
bb
c
a a
bdd
merge
R = R1 R2
• Easy case– Set sampling or no
further maintenance
– S = S1 S2
• Otherwise:– If R1 R2 Ø and
0 < q < 1, then there exists no statistically correct merging algorithm
a acb S Ra
d
![Page 16: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/16.jpg)
16
Related Work on Multiset Sampling
• Gibbons and Matias [1998]– Concise samples: maintain Xi(t), handles inserts only
– Counting Samples: maintain Yi(t), compute Xi(t) on demand
– Frequency estimator for hot items: Yi(t) – 1 + 0.418 / q
• Biased, higher mean-squared error than new estimator
• Distinct-Item sampling [CMR05,FIS05,Gi01]– Simple random sample of (t, N(t)) pairs– High space, time overhead
![Page 17: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/17.jpg)
17
Maintaining Bernoulli SamplesOver Evolving Multisets
Rainer Gemulla Wolfgang Lehner
Technische Universität Dresden
Peter J. Haas
IBM Almaden Research Center
![Page 18: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/18.jpg)
18
Backup Slides
![Page 19: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/19.jpg)
19
Subsampling• Easy case (no further maintenance)
– Take Bern(q*) subsample, where q* = q’ / q– Actually, just generate X’ directly as
Binomial(X,q*)
• Hard case (to continue maintenance)– Must also generate new tracking-counter
value Y’
• Approach: generate X’ then Y’ | {X’,Y}
![Page 20: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/20.jpg)
20
Subsampling: The Hard Case• Generate X’ = +
= 1 iff item included in S at time i* is retained• P( = 1) = 1 - P( = 0) = q*
is # of other items in S that are retained in S’ is Binomial(X-1,q*)
• Generate Y’ according to “correct” distribution– P(Y’ = m | X’, Y, )
![Page 21: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/21.jpg)
21
Subsampling (Continued)
1
1
' '( ' | ', , 0) 1 when ' 0
Y
i m
X XP Y m X Y X
m i
( ' | ', , 1) [1 if and 0 otherwise ]P Y m X Y m Y
Generate Y - Y’ using acceptance/rejection (Vitter 1984)
i : 1 2 3 4 5 6
Y X6 6 = 5 , = 3
Y X6 6' = 5 , ' = 2
Y X6 6' = 3 , ' = 2
t t
t
t t t tO rig ina l sam ple
t t t t t = 1
= 0 t t t t t t
![Page 22: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/22.jpg)
22
Related Workon Set-Based Sampling
• Methods that access dataset– CAR, CARWOR [OR86], backing samples [GMP02]
• Methods (for bounded samples) that do not access dataset– Reservoir sampling [FMR62, Vi85] (inserts only)– Stream sampling [BDM02] (sliding window)– Random pairing [GLH06] (resizing also discussed)
![Page 23: 1 Maintaining Bernoulli Samples Over Evolving Multisets Rainer Gemulla Wolfgang Lehner Technische Universität Dresden Peter J. Haas IBM Almaden Research](https://reader030.vdocuments.us/reader030/viewer/2022032703/56649d145503460f949e8842/html5/thumbnails/23.jpg)
23
More Motivation:A Sample Warehouse
Full-ScaleWarehouse Of Data Partitions
Sample
Sample
Sample
S1,1 S1,2 Sn,mWarehouseof Samples
merge
S*,* S1-2,3-7 etc