scalable histograms on larger probabilistic datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee...

Scalable Histograms on Larger Probabilistic Data

Mingwang Tang and Feifei Li

School of Computing, University of Utah

September 21, 2015

Introduction

New Challenges

Large scale data size

Distributed data sources

Uncertainty

Data synopsis on large probabilistic data

Scalable histograms on large probabilistic data

2 / 31

Histograms on deterministic data

V-optimal histogram: Given a frequency vector −→v = {v1, . . . , vn},where vi is the frequency of item i in [n], a space budget B, it seeksto minimize the SSE error:

min{B∑

k=1

ek∑i=sk

(vi − b̂k)2} (1)

1 2 3 4 5 6 7 8 9 10 11 12

vn = 12, B = 3

Optimal B-bucket histogram takes O(Bn2) time.

3 / 31

Probabilistic Database

Probabilistic database D on domain [n] = {1, . . . , n}

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1

D = {g1, g2, . . . , gn} where

gi = {(gi (W ),Pr(W ))|W ∈ W} (2)

4 / 31

Probabilistic Database

Probabilistic database D on domain [n] = {1, . . . , n}

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1 : W2

D = {g1, g2, . . . , gn} where

gi = {(gi (W ),Pr(W ))|W ∈ W} (2)

5 / 31

Probabilistic Data Models

gi = {(gi (W ),Pr(W ))|W ∈ W}

Tuple Model

Each tuple tj =⟨(tj1, pj1), . . . , (tj`j , pj`j )

⟩. Each tjk is drawn from

[n] for k ∈ [1, `j ].

1−∑`jk=1 pjk specify the possibility that tj generates no item.

t1t2t3

{(1, 0.2), (3, 0.3), (7, 0.2)}

t|τ |

{(3, 0.3), (5, 0.1), (9, 0.4)}{(3, 0.5), (10, 0.4), (13, 0.1)}

6 / 31


gi = {(gi (W ),Pr(W ))|W ∈ W}

Tuple Model

Each tuple tj =⟨(tj1, pj1), . . . , (tj`j , pj`j )

⟩. Each tjk is drawn from

[n] for k ∈ [1, `j ].

1−∑`jk=1 pjk specify the possibility that tj generates no item.

t1t2t3

{(1, 0.2), (3, 0.3), (7, 0.2)}

t|τ |

{(3, 0.3), (5, 0.1), (9, 0.4)}{(3, 0.5), (10, 0.4), (13, 0.1)}

7 / 31


gi = {(gi (W ),Pr(W ))|W ∈ W}

Value Model

Each tuple tj = 〈 j : fj = ((fj1, pj1), . . . , (fj`j , pj`j )) 〉, j is drawn from[n].

Pr(fj = 0) = 1−∑`jk=1 pjk

t1t2t3

tn

{< 3, (10, 0.3), (15, 0.2), (20, 0.5) >}{< 2, (6, 0.4), (7, 0.3), (15, 0.3) >}{< 1, (50, 0.2), (7, 0.1), (14, 0.2) >}

8 / 31


gi = {(gi (W ),Pr(W ))|W ∈ W}

Value Model

Each tuple tj = 〈 j : fj = ((fj1, pj1), . . . , (fj`j , pj`j )) 〉, j is drawn from[n].

Pr(fj = 0) = 1−∑`jk=1 pjk

t1t2t3

tn

{< 3, (10, 0.3), (15, 0.2), (20, 0.5) >}{< 2, (6, 0.4), (7, 0.3), (15, 0.3) >}{< 1, (50, 0.2), (7, 0.1), (14, 0.2) >}

9 / 31

Histograms on Probabilistic data

Possible world semantic

123

1 1642 30

75 6 8 9 10 11 12 13 14 15

frequency

items

: W1 : W2

gi : frequency of item i becomes random variable across possibleworlds

Expectation based histogram

H(n,B) = min{EW

B∑k=1

ek∑j=sk

(gj − b̂k)2

}.[ICDE09] G. Cormode et al., Histograms and wavelets on probabilistic data, ICDE 2009

[VLDB09] G. Cormode et al., Probabilistic histograms for probabilistic data, VLDB 2009

10 / 31

Efficient computation of bucket error

The optimal B bucket histogram takes O(Bn2) time.

[TKDE10] shows that the minimal error of a bucket b = (s, e, b̂) is:

SSE(b, b̂) =e∑

i=s

EW [g2i ]−

1

e − s + 1EW [

e∑i=s

gi ]2. (3)

by setting b̂ = 1e−s+1EW

[∑ei=s gi

].

Based on two precomputed arrays (A,B), SSE (b, b̂) can becomputed in constant time.

[TKDE10] G. Cormode et al., Histograms and wavelets on probabilistic data, TKDE 2010

11 / 31

Pmerge Method

Pmerge method based on partition and merge principle

Partition phase: partition the domain n into m sub-domain of equalsize and compute the local optimal B buckets for each sub-domain.

Merge phase: merge mB input buckets from the partition phase intoB buckets.

123

1 164

bucket sub-domain boundary2 3

0domain value75 6 8 9 10 11 12 13 14 15

partition

frequency

12 / 31

Pmerge Method




123

1 164

bucket sub-domain boundary2 3

0domain value75 6 8 9 10 11 12 13 14 15

partition

frequency

13 / 31

Pmerge Method




123

1 16

w=2

4

bucket sub-domain boundary

: frequency in W1 : frequency in W2

2 3

w=2

0domain value

123

0

7

: weighted frequency

w=3w=1

w=1w=3 w=3

w=1

domain value

5 6 8 9 10 11 12 13 14 15

partition

merge

1 1642 3 75 6 8 9 10 11 12 13 14 15

frequency

14 / 31

Pmerge Method




123

1 16

w=2

4

bucket sub-domain boundary

: frequency in W1 : frequency in W2

2 3

w=2

0domain value

123

0

7

: weighted frequency

w=3w=1

w=1w=3 w=3

w=1

domain value

5 6 8 9 10 11 12 13 14 15

partition

merge

1 1642 3 75 6 8 9 10 11 12 13 14 15

frequency

15 / 31

Recursive Merging Method

Pmerge method:

Approximation quality: Pmerge produces a√10 approximation in

O(N + Bn2/m + B3m2) time.

Recursive merging (RPmerge):

Partition [n] into m` subdomains, producing Bm`.Using ` iterations and each iteration reduce the domain size by afactor of m.Takes O(N + B n2

m` + B3 ∑`i=1 m

(i+1)) time and the RPmerge

method gives a 10`2 approximation of the optimal B-buckets

histogram found by OptHist.

In practice, Pmerge and RPmerge always provide close tooptimal approximation quality as shown in our experiments.

16 / 31

Distributed and Parallel Pmerge

Partition phase in the distributed environment

τ` . . .

τ1

τβ

n

1Probabilistic Database D m sub-domains

Communication cost

Computing Ak ,Bk arrays in the partition phase

Tuple model: O(βn) bytes.Value model: O(n) bytes.

O(Bm) bytes in the merge phase for both models.

17 / 31



τ` . . . ti (i, (E[fi],E[f2i ]))

h(i) = di/ dn/mee

τ1

τβ

n


Communication cost




18 / 31



τ` . . . ti (i, (E[fi],E[f2i ]))

h(i) = di/ dn/mee

τ1

τβ

A,B

n


Communication cost




19 / 31

Pmerge Based on Sampling

Sampling A,B arrays in the partition phase

Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]

Estimate Ak ,Bk arrays using quantile sampling

2 3 5 9E[fi] · · ·

item: 1 2 3 4 . . .

20 / 31



Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]


2 3 5 9E[fi] · · ·

item: 1 2 3 4 . . .

21 / 31



Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]


2 3 5 9E[fi] · · ·3 3 3 3 3

item: 1 2 3 4 . . .

22 / 31



Ak [j ] =i∑

i=1

E[f 2j ],Bk [j ] =

j∑i=1

E[fi ]


2 3 5 9E[fi] · · ·3 3 3 3 3

p = min{Θ(

√β

εN),Θ( 1

ε2N)}

item: 1 2 3 4 . . .

23 / 31

Pmerge Based on Sketch

Tuple model A,B arrays

Estimate F2 =∑j

i=sk(∑β`=1 EW,`[gi ])

2 using AMS Sketchtechniques and binary decomposition of domain [sk , ek ].

s e

F2 = εM ′′k

F2 = 2εM ′′k

F2 = M ′′k

EW [gαk,1 ]EW [gsk ] EW [gek ] EW,ℓ[gsk ] EW,ℓ[gek ]· · · · · ·

(a) binary decomposition (b) local Q-AMS

F2 = εM ′′k

EW [gαk, 1

ε−1

] EW,ℓ[gαk,1 ]EW,ℓ[gαk, 1

ε−1

]

F2 = 2εM ′′k

· · ·· · ·· · ·

AMS

· · ·· · ·· · ·AMS

AMS

24 / 31

Outline

1 Optimal B-buckets Histograms

2 Approximate Histograms

3 Pmerge Based on Sampling

4 Experiments

25 / 31

Experiment setup

Generate tuple model and the value model dataset using the clientid filed of 1998 WorldCup dataset and atmospheric measurementsfrom the SAMOS project.

The default experimental parameters:Symbol Definition Default Value

B number of buckets 400n domain size 100k (600k)` depth of recursions 2

26 / 31

Running time:

n: domain size

10 50 100 150 200

102

104

106

domain size: n (×103)

Tim

e(seconds)

OptHist PMerge

RPMerge

Figure: Tuple Model

10 50 100 150 200

102

104

106


Tim

e(seconds)

OptHist PMerge

RPMerge

Figure: Value Model

27 / 31

Approximation Ratio:

n: domain size

10 50 100 150 200

1

1.02

1.04

1.06

1.08


Approxim

ationratio

PMerge RPMerge

Figure: Tuple Model

10 50 100 150 200

1

1.0001

1.0002

1.0003

1.0004


Approxim

ationratio

PMerge RPMerge

Figure: Value Model

28 / 31

Running time on large scale probabilistic data

n: domain size

200 400 600 800 100010

2

103

104

105


Tim

e(seconds)

RPMerge Parallel-PMerge

Parallel-RPMerge

Figure: Tuple Model

200 400 600 80010

2

103

104

105


Tim

e(×

102seconds)

RPMerge Parallel-PMerge

Parallel-RPMerge

Figure: Value Model

29 / 31

Conclusion & Future work

Conclusion

Novel approximation methods for constructing scalable histogramson large probabilistic data.The quality of the approximate histograms are almost as good as theoptimal histogram in practice.Extended the techniques to distributed and parallel settings tofurther improve scalability.

Future work

extend our study to probabilistic histograms with pdf bucketrepresentatives and handle histogram of other error metrics

30 / 31

The end

Thank You

Q and A

31 / 31

scalable histograms on larger probabilistic datalifeifei/papers/probhist-talk.pdfh(i) = di=dn=m ee...

Documents