scalable histograms on larger probabilistic datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee...

31
Scalable Histograms on Larger Probabilistic Data Mingwang Tang and Feifei Li School of Computing, University of Utah September 21, 2015

Upload: others

Post on 19-Nov-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Scalable Histograms on Larger Probabilistic Data

Mingwang Tang and Feifei Li

School of Computing, University of Utah

September 21, 2015

Page 2: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Introduction

New Challenges

Large scale data size

Distributed data sources

Uncertainty

Data synopsis on large probabilistic data

Scalable histograms on large probabilistic data

2 / 31

Page 3: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Histograms on deterministic data

V-optimal histogram: Given a frequency vector −→v = {v1, . . . , vn},where vi is the frequency of item i in [n], a space budget B, it seeksto minimize the SSE error:

min{B∑

k=1

ek∑i=sk

(vi − b̂k)2} (1)

1 2 3 4 5 6 7 8 9 10 11 12

vn = 12, B = 3

Optimal B-bucket histogram takes O(Bn2) time.

3 / 31

Page 4: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Database

Probabilistic database D on domain [n] = {1, . . . , n}

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1

D = {g1, g2, . . . , gn} where

gi = {(gi (W ),Pr(W ))|W ∈ W} (2)

4 / 31

Page 5: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Database

Probabilistic database D on domain [n] = {1, . . . , n}

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1 : W2

D = {g1, g2, . . . , gn} where

gi = {(gi (W ),Pr(W ))|W ∈ W} (2)

5 / 31

Page 6: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Data Models

gi = {(gi (W ),Pr(W ))|W ∈ W}

Tuple Model

Each tuple tj =⟨(tj1, pj1), . . . , (tj`j , pj`j )

⟩. Each tjk is drawn from

[n] for k ∈ [1, `j ].

1−∑`jk=1 pjk specify the possibility that tj generates no item.

t1t2t3

{(1, 0.2), (3, 0.3), (7, 0.2)}

t|τ |

{(3, 0.3), (5, 0.1), (9, 0.4)}{(3, 0.5), (10, 0.4), (13, 0.1)}

6 / 31

Page 7: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Data Models

gi = {(gi (W ),Pr(W ))|W ∈ W}

Tuple Model

Each tuple tj =⟨(tj1, pj1), . . . , (tj`j , pj`j )

⟩. Each tjk is drawn from

[n] for k ∈ [1, `j ].

1−∑`jk=1 pjk specify the possibility that tj generates no item.

t1t2t3

{(1, 0.2), (3, 0.3), (7, 0.2)}

t|τ |

{(3, 0.3), (5, 0.1), (9, 0.4)}{(3, 0.5), (10, 0.4), (13, 0.1)}

7 / 31

Page 8: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Data Models

gi = {(gi (W ),Pr(W ))|W ∈ W}

Value Model

Each tuple tj = 〈 j : fj = ((fj1, pj1), . . . , (fj`j , pj`j )) 〉, j is drawn from[n].

Pr(fj = 0) = 1−∑`jk=1 pjk

t1t2t3

tn

{< 3, (10, 0.3), (15, 0.2), (20, 0.5) >}{< 2, (6, 0.4), (7, 0.3), (15, 0.3) >}{< 1, (50, 0.2), (7, 0.1), (14, 0.2) >}

8 / 31

Page 9: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Probabilistic Data Models

gi = {(gi (W ),Pr(W ))|W ∈ W}

Value Model

Each tuple tj = 〈 j : fj = ((fj1, pj1), . . . , (fj`j , pj`j )) 〉, j is drawn from[n].

Pr(fj = 0) = 1−∑`jk=1 pjk

t1t2t3

tn

{< 3, (10, 0.3), (15, 0.2), (20, 0.5) >}{< 2, (6, 0.4), (7, 0.3), (15, 0.3) >}{< 1, (50, 0.2), (7, 0.1), (14, 0.2) >}

9 / 31

Page 10: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Histograms on Probabilistic data

Possible world semantic

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1 : W2

gi : frequency of item i becomes random variable across possibleworlds

Expectation based histogram

H(n,B) = min{EW

B∑k=1

ek∑j=sk

(gj − b̂k)2

}.[ICDE09] G. Cormode et al., Histograms and wavelets on probabilistic data, ICDE 2009

[VLDB09] G. Cormode et al., Probabilistic histograms for probabilistic data, VLDB 2009

10 / 31

Page 11: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Efficient computation of bucket error

The optimal B bucket histogram takes O(Bn2) time.

[TKDE10] shows that the minimal error of a bucket b = (s, e, b̂) is:

SSE(b, b̂) =e∑

i=s

EW [g2i ]−

1

e − s + 1EW [

e∑i=s

gi ]2. (3)

by setting b̂ = 1e−s+1EW

[∑ei=s gi

].

Based on two precomputed arrays (A,B), SSE (b, b̂) can becomputed in constant time.

[TKDE10] G. Cormode et al., Histograms and wavelets on probabilistic data, TKDE 2010

11 / 31

Page 12: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Method

Pmerge method based on partition and merge principle

Partition phase: partition the domain n into m sub-domain of equalsize and compute the local optimal B buckets for each sub-domain.

Merge phase: merge mB input buckets from the partition phase intoB buckets.

123

1 164

bucket sub-domain boundary2 3

0domain value75 6 8 9 10 11 12 13 14 15

partition

frequency

12 / 31

Page 13: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Method

Pmerge method based on partition and merge principle

Partition phase: partition the domain n into m sub-domain of equalsize and compute the local optimal B buckets for each sub-domain.

Merge phase: merge mB input buckets from the partition phase intoB buckets.

123

1 164

bucket sub-domain boundary2 3

0domain value75 6 8 9 10 11 12 13 14 15

partition

frequency

13 / 31

Page 14: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Method

Pmerge method based on partition and merge principle

Partition phase: partition the domain n into m sub-domain of equalsize and compute the local optimal B buckets for each sub-domain.

Merge phase: merge mB input buckets from the partition phase intoB buckets.

123

1 16

w=2

4

bucket sub-domain boundary

: frequency in W1 : frequency in W2

2 3

w=2

0domain value

123

0

7

: weighted frequency

w=3w=1

w=1w=3 w=3

w=1

domain value

5 6 8 9 10 11 12 13 14 15

partition

merge

1 1642 3 75 6 8 9 10 11 12 13 14 15

frequency

14 / 31

Page 15: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Method

Pmerge method based on partition and merge principle

Partition phase: partition the domain n into m sub-domain of equalsize and compute the local optimal B buckets for each sub-domain.

Merge phase: merge mB input buckets from the partition phase intoB buckets.

123

1 16

w=2

4

bucket sub-domain boundary

: frequency in W1 : frequency in W2

2 3

w=2

0domain value

123

0

7

: weighted frequency

w=3w=1

w=1w=3 w=3

w=1

domain value

5 6 8 9 10 11 12 13 14 15

partition

merge

1 1642 3 75 6 8 9 10 11 12 13 14 15

frequency

15 / 31

Page 16: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Recursive Merging Method

Pmerge method:

Approximation quality: Pmerge produces a√10 approximation in

O(N + Bn2/m + B3m2) time.

Recursive merging (RPmerge):

Partition [n] into m` subdomains, producing Bm`.Using ` iterations and each iteration reduce the domain size by afactor of m.Takes O(N + B n2

m` + B3 ∑`i=1 m

(i+1)) time and the RPmerge

method gives a 10`2 approximation of the optimal B-buckets

histogram found by OptHist.

In practice, Pmerge and RPmerge always provide close tooptimal approximation quality as shown in our experiments.

16 / 31

Page 17: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Distributed and Parallel Pmerge

Partition phase in the distributed environment

τ` . . .

τ1

τβ

n

1Probabilistic Database D m sub-domains

Communication cost

Computing Ak ,Bk arrays in the partition phase

Tuple model: O(βn) bytes.Value model: O(n) bytes.

O(Bm) bytes in the merge phase for both models.

17 / 31

Page 18: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Distributed and Parallel Pmerge

Partition phase in the distributed environment

τ` . . . ti (i, (E[fi],E[f2i ]))

h(i) = di/ dn/mee

τ1

τβ

n

1Probabilistic Database D m sub-domains

Communication cost

Computing Ak ,Bk arrays in the partition phase

Tuple model: O(βn) bytes.Value model: O(n) bytes.

O(Bm) bytes in the merge phase for both models.

18 / 31

Page 19: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Distributed and Parallel Pmerge

Partition phase in the distributed environment

τ` . . . ti (i, (E[fi],E[f2i ]))

h(i) = di/ dn/mee

τ1

τβ

A,B

n

1Probabilistic Database D m sub-domains

Communication cost

Computing Ak ,Bk arrays in the partition phase

Tuple model: O(βn) bytes.Value model: O(n) bytes.

O(Bm) bytes in the merge phase for both models.

19 / 31

Page 20: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Based on Sampling

Sampling A,B arrays in the partition phase

Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]

Estimate Ak ,Bk arrays using quantile sampling

2 3 5 9E[fi] · · ·

item: 1 2 3 4 . . .

20 / 31

Page 21: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Based on Sampling

Sampling A,B arrays in the partition phase

Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]

Estimate Ak ,Bk arrays using quantile sampling

2 3 5 9E[fi] · · ·

item: 1 2 3 4 . . .

21 / 31

Page 22: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Based on Sampling

Sampling A,B arrays in the partition phase

Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]

Estimate Ak ,Bk arrays using quantile sampling

2 3 5 9E[fi] · · ·3 3 3 3 3

item: 1 2 3 4 . . .

22 / 31

Page 23: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Based on Sampling

Sampling A,B arrays in the partition phase

Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]

Estimate Ak ,Bk arrays using quantile sampling

2 3 5 9E[fi] · · ·3 3 3 3 3

p = min{Θ(

√β

εN),Θ( 1

ε2N)}

item: 1 2 3 4 . . .

23 / 31

Page 24: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Pmerge Based on Sketch

Tuple model A,B arrays

Estimate F2 =∑j

i=sk(∑β`=1 EW,`[gi ])

2 using AMS Sketchtechniques and binary decomposition of domain [sk , ek ].

s e

F2 = εM ′′k

F2 = 2εM ′′k

F2 = M ′′k

EW [gαk,1 ]EW [gsk ] EW [gek ] EW,ℓ[gsk ] EW,ℓ[gek ]· · · · · ·

(a) binary decomposition (b) local Q-AMS

F2 = εM ′′k

EW [gαk, 1

ε−1

] EW,ℓ[gαk,1 ]EW,ℓ[gαk, 1

ε−1

]

F2 = 2εM ′′k

· · ·· · ·· · ·

AMS

· · ·· · ·· · ·AMS

AMS

24 / 31

Page 25: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Outline

1 Optimal B-buckets Histograms

2 Approximate Histograms

3 Pmerge Based on Sampling

4 Experiments

25 / 31

Page 26: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Experiment setup

Generate tuple model and the value model dataset using the clientid filed of 1998 WorldCup dataset and atmospheric measurementsfrom the SAMOS project.

The default experimental parameters:Symbol Definition Default Value

B number of buckets 400n domain size 100k (600k)` depth of recursions 2

26 / 31

Page 27: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Running time:

n: domain size

10 50 100 150 200

102

104

106

domain size: n (×103)

Tim

e(seconds)

OptHist PMerge

RPMerge

Figure: Tuple Model

10 50 100 150 200

102

104

106

domain size: n (×103)

Tim

e(seconds)

OptHist PMerge

RPMerge

Figure: Value Model

27 / 31

Page 28: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Approximation Ratio:

n: domain size

10 50 100 150 200

1

1.02

1.04

1.06

1.08

domain size: n (×103)

Approxim

ationratio

PMerge RPMerge

Figure: Tuple Model

10 50 100 150 200

1

1.0001

1.0002

1.0003

1.0004

domain size: n (×103)

Approxim

ationratio

PMerge RPMerge

Figure: Value Model

28 / 31

Page 29: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Running time on large scale probabilistic data

n: domain size

200 400 600 800 100010

2

103

104

105

domain size: n (×103)

Tim

e(seconds)

RPMerge Parallel-PMerge

Parallel-RPMerge

Figure: Tuple Model

200 400 600 80010

2

103

104

105

domain size: n (×103)

Tim

e(×

102seconds)

RPMerge Parallel-PMerge

Parallel-RPMerge

Figure: Value Model

29 / 31

Page 30: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

Conclusion & Future work

Conclusion

Novel approximation methods for constructing scalable histogramson large probabilistic data.The quality of the approximate histograms are almost as good as theoptimal histogram in practice.Extended the techniques to distributed and parallel settings tofurther improve scalability.

Future work

extend our study to probabilistic histograms with pdf bucketrepresentatives and handle histogram of other error metrics

30 / 31

Page 31: Scalable Histograms on Larger Probabilistic Datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee ˝1 ˝ n 1 Probabilistic Database D m sub-domains Communication cost Computing A

The end

Thank You

Q and A

31 / 31