
Page 1: Fast Algorithms For Hierarchical Range Histogram Constructions

Authors: Sudipto Guha, Nick Koudas, Divesh Srivastava. ACM PODS 2002

Page 2

Layout
• Introduction
• Related Works
• Problem Definition
• Problem Solution
  – A Sparse Interval Set System
  – The Dynamic Programming algorithm
• Experimental Evaluation
• Conclusions

Page 3

Introduction
• Data Warehousing and OLAP applications
  – OLAP – Online analytical processing
• Data has multiple logical dimensions with natural hierarchies defined on it
• OLAP queries
  – usually involve hierarchical selections on some of the dimensions
  – often aggregate measure attributes

Page 4

Introduction – Cont.

Page 5

Histograms
• Numeric attribute value domain
• Space-efficient
• Conditions on a given dimension - hierarchical ranges
• Range estimation depends on a good solution to the histogram construction problem

Page 6

The Main Idea
• Proposes fast, practical algorithms for the problem of constructing hierarchical range histograms

Page 7

The Main Contributions
• A novel notion of sparse intervals
• A proposed algorithm that effectively trades space for construction time without compromising accuracy
• First practical approach to the problem

Page 8

Previous Works
• V-Optimal histograms
  – Minimize error for equality queries
  – But… constructed by taking only equality queries into account
• Koudas et al. - a polynomial-time algorithm
  – For special and general cases
  – But… the polynomial has a high degree
• Gilbert et al. – a pseudo-polynomial-time algorithm, optimal for arbitrary ranges
  – But… the polynomial has a high degree

Page 9

Problem Definition
• An array A[1,n] of non-negative real numbers
• The average of items A[a],…,A[b] is

$A[a,b] = \frac{A[a] + \dots + A[b]}{b - a + 1}$

Page 10

Histogram Definition
• A histogram of array A[1,n] using B buckets is specified by B+1 integers $b_1, \dots, b_{B+1}$ with $0 = b_1 < b_2 < \dots < b_{B+1} = n$
• Each interval $[b_i + 1, b_{i+1}]$ is a bucket
• Each $b_i$ is a bucket boundary

Page 11

Histogram Definition – Cont.
• Stored as
  – a series of bucket boundaries
  – the average $A[b_i + 1, b_{i+1}]$ of the array values in each bucket
  – the bucket sum can be obtained from the average and the bucket length

Page 12

Histograms – Cont.
• Mostly support equality queries
  – “give me A[i]”
• Hierarchical range queries

Page 13

Hierarchical Range Queries Definition
• A range query $R_{ij}$ asks for the sum $s_{ij} = A[i] + \dots + A[j]$
• A set S of range queries is hierarchical if for any two queries $R_{ij}$ and $R_{kl}$ in S, the ranges [i,j] and [k,l] are
  – disjoint
  – or contained one in the other
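The hierarchical condition can be checked pairwise. The following is a minimal sketch, assuming each query is given as an inclusive pair (i, j); the function name `is_hierarchical` is hypothetical and not part of the original slides.

```python
from itertools import combinations

def is_hierarchical(ranges):
    """Return True if every pair of inclusive ranges (i, j) is disjoint or nested."""
    for (i, j), (k, l) in combinations(ranges, 2):
        disjoint = j < k or l < i
        nested = (i <= k and l <= j) or (k <= i and j <= l)
        if not (disjoint or nested):
            return False
    return True

# Nested and disjoint ranges form a hierarchy; partially overlapping ones do not.
print(is_hierarchical([(1, 8), (1, 4), (5, 8), (2, 3)]))
print(is_hierarchical([(1, 4), (3, 6)]))
```

The pairwise check is quadratic in the number of queries, which is enough for illustrating the definition.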

Page 14

Hierarchical Range Queries – Cont.
• Generalize equality queries: $R_{ii} = A[i]$
• Can be displayed as a tree
  – Each node u has an associated range $r_u$
  – Node v is a child of node u iff $r_v \subset r_u$ and there is no w such that $r_v \subset r_w \subset r_u$

Page 15

Workload Definition
• A workload W consists of
  – A set S of hierarchical range queries
  – A probability $p_{ij}$ for each query $R_{ij}$ in S; this probability can be obtained by monitoring and logging
• Simple probabilities model

Page 16

How The Histogram Works
1. A histogram H of array A[1,n]
2. Query $R_{ij}$
3. An expected answer $s_{ij} = A[i] + \dots + A[j]$
4. Left bucket $[b_l + 1, b_{l+1}]$ such that $b_l + 1 \le i \le b_{l+1}$
5. Right bucket $[b_r + 1, b_{r+1}]$ such that $b_r + 1 \le j \le b_{r+1}$
6. Calculate the precise total of the values in the buckets between the left and right buckets
7. Estimate the sums for the portions within the left and right buckets

Page 17

How The Histogram Works – Cont.
8. The sum of A in the interval $[i,j] \cap [b_l + 1, b_{l+1}]$ is estimated by $|[i,j] \cap [b_l + 1, b_{l+1}]| \cdot A[b_l + 1, b_{l+1}]$
  – Uniformity assumption
9. The right bucket likewise

Page 18

The Total Estimate
• The total estimate $\hat{s}_{ij}$ of $s_{ij}$
• left bucket estimation + right bucket estimation + exact sum for buckets in between
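The estimation procedure can be sketched as a single pass over the buckets. This is an illustration under assumptions, not the authors' code: the function name `estimate_range_sum` and the list layout (`boundaries` holding the B+1 bucket boundaries starting at 0, `averages` holding one stored average per bucket) are hypothetical. For a bucket that the query covers completely, overlap times average equals the exact bucket sum; for the left and right buckets it is the uniformity estimate.

```python
def estimate_range_sum(boundaries, averages, i, j):
    """Estimate A[i] + ... + A[j] (1-based, inclusive) from a histogram.

    boundaries: B+1 increasing integers, first 0 and last n
    averages:   averages[t] is the average of bucket [boundaries[t] + 1, boundaries[t + 1]]
    """
    total = 0.0
    for t, avg in enumerate(averages):
        lo, hi = boundaries[t] + 1, boundaries[t + 1]
        overlap = min(j, hi) - max(i, lo) + 1  # number of queried positions in this bucket
        if overlap > 0:
            total += overlap * avg
    return total

# A = [1, 3, 5, 5, 2, 2, 2, 2] with buckets [1,2], [3,4], [5,8]
boundaries = [0, 2, 4, 8]
averages = [2.0, 5.0, 2.0]
print(estimate_range_sum(boundaries, averages, 2, 6))  # true sum is 17
```

In this example the only error comes from the left bucket, where A[2] = 3 is estimated by the bucket average 2.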

Page 19

Determining the average
• Construct a prefix sum array $\sum_{j=1}^{b_i} A[j]$ for all $b_i$
• Given $b_i$ and $b_{i+1}$, return the average in constant time
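A minimal sketch of the prefix-sum idea, assuming the array A[1..n] is held in a 0-based Python list; the helper names `prefix_sums`, `range_average` and `bucket_average` are hypothetical. This version stores a prefix sum for every position, not only for the bucket boundaries.

```python
from itertools import accumulate

def prefix_sums(A):
    """P[k] = A[1] + ... + A[k], with P[0] = 0 (A is a 0-based list holding A[1..n])."""
    return [0] + list(accumulate(A))

def range_average(P, a, b):
    """Average of A[a..b] (1-based, inclusive), computed in O(1) from prefix sums."""
    return (P[b] - P[a - 1]) / (b - a + 1)

def bucket_average(P, b_i, b_next):
    """Average of the bucket [b_i + 1, b_next]."""
    return range_average(P, b_i + 1, b_next)

P = prefix_sums([2, 4, 6, 8])
print(P)
print(bucket_average(P, 0, 2))
```

Building the prefix sums takes one linear pass; every bucket average afterwards costs two lookups and one division.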

Page 20

Optimal histogram definition
• The error $e_{ij}$ of the range query $R_{ij}$ is $e_{ij} = (s_{ij} - \hat{s}_{ij})^2$
• Given a histogram H and workload W, the total expected error for estimating W is $\sum_{R_{ij}} p_{ij} e_{ij}$, summed over all queries in W
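The total expected error is a probability-weighted sum of squared errors. A small sketch, assuming the workload is a dictionary from (i, j) to its probability and that the exact and estimated sums are supplied as callables; all names are hypothetical.

```python
def total_expected_error(workload, exact, estimate):
    """workload maps (i, j) -> p_ij; exact and estimate map (i, j) -> a range sum."""
    return sum(p * (exact(i, j) - estimate(i, j)) ** 2
               for (i, j), p in workload.items())

A = [1, 3, 5, 5]                      # A[1..4], stored 0-based
bucket_avg = sum(A) / len(A)          # a single-bucket histogram, average 3.5

def exact(i, j):
    return sum(A[i - 1:j])

def estimate(i, j):
    return (j - i + 1) * bucket_avg   # uniformity assumption inside the bucket

workload = {(1, 4): 0.5, (1, 2): 0.25, (3, 4): 0.25}
print(total_expected_error(workload, exact, estimate))
```

Here the full-range query is answered exactly, while each half-range query is off by 3, so the weighted error is 0.25 * 9 + 0.25 * 9.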

Page 21

Optimal Histogram Definition – Cont.

• Given W, an optimal histogram with B buckets of array A[1,n] is the histogram with at most B buckets that has the minimum total expected error for estimating workload W among all histograms with at most B buckets

Page 22

Fast Histogram (FH) Construction for Hierarchical Range Queries
• Given an array A[1,n], B buckets and workload W
• E denotes the total expected error of the optimal histogram
• Find algorithms that construct hierarchical range (HR) histograms with an error of at most E, trading space for construction time

Page 23

Layout
• Introduction
• Related Works
• Problem Definition
• Problem Solution
  – A Sparse Interval Set System
  – The Dynamic Programming algorithm
• Experimental Evaluation
• Conclusions

Page 24

FH construction
• Constructing a set of “sparse intervals”
  – Increases the number of buckets
  – Any arbitrary interval can be represented
• Dynamic programming algorithm

Page 25

A Sparse Interval System
• Given an integer $r \ge 1$, set $l = n^{1/r}$
• Level 1 points: $0, \dots, n$
• Level 2 points: $0, l, 2l, 3l, \dots$
• Level j+1 points: $0, l^j, 2l^j, 3l^j, \dots$
• Last, level r+1 points: $0, n$

Page 26

A Sparse Interval System
• The interval [0,n] is in the sparse system S
• Any pair of level j points lying between two adjacent level j+1 points defines an interval in S

Page 27

A Sparse Interval System Example

n=8 ; r=3 ; l=2

Level 1 points: 0 1 2 3 4 5 6 7 8
Level 2 points: 0 2 4 6 8
Level 3 points: 0 4 8
Level 4 points: 0 8
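The example can be reproduced with a short sketch. It assumes that n is an exact power l^r, that an interval is written as a pair of points (a, b) with a < b, and that the two adjacent level j+1 points themselves count as "lying between"; the function names are hypothetical.

```python
from itertools import combinations

def level_points(n, l, r):
    """levels[j] holds the level j points for j = 1..r+1 (levels[0] is unused)."""
    levels = [None]
    for j in range(1, r + 1):
        levels.append(list(range(0, n + 1, l ** (j - 1))))
    levels.append([0, n])
    return levels

def sparse_intervals(n, l, r):
    """All intervals (a, b) of the sparse system, under the assumptions stated above."""
    levels = level_points(n, l, r)
    S = {(0, n)}
    for j in range(1, r + 1):
        coarse = levels[j + 1]
        for lo, hi in zip(coarse, coarse[1:]):       # adjacent level j+1 points
            pts = [p for p in levels[j] if lo <= p <= hi]
            S.update(combinations(pts, 2))           # every pair of level j points
    return S

print(level_points(8, 2, 3)[1:])
print(len(sparse_intervals(8, 2, 3)), "sparse intervals out of", 9 * 8 // 2)
```

For n=8 the sparse system keeps 15 of the 36 possible intervals; the saving grows quickly with n.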

Page 28

Sparse Interval System Properties

• Any interval over [0,n] can be written as a disjoint union of at most 2r intervals in the sparse system

Page 29

Claim
• Any interval [0,x] can be expressed as a partition of at most r intervals from the sparse system

Page 30

Claim Proof
• By induction
• Induction hypothesis: any interval $[0,x]$ where $x < l^j$ can be expressed as at most j intervals
• Base case: true for j=1

Page 31

Claim Proof – Cont.
• Step for j+1
• Consider $l^j \le x < l^{j+1}$
• We can write the interval as $[0, t l^j]$ and $[t l^j, x]$, where t is maximal ($t < l$)
• $[0, t l^j]$ is a valid interval in the sparse system (in level j+1; 0 and $l^{j+1}$ are adjacent)

Page 32

Claim Proof – Cont.
• $[t l^j, x]$ is essentially similar to $[0, x - t l^j]$
• $x - t l^j < l^j$ since t is maximal. Therefore, by induction, it can be expressed by j intervals
• Total: j+1
• Since $x \le l^r$, any interval $[0,x]$ can be expressed as a union of r intervals
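The induction translates into a greedy decomposition: repeatedly peel off the longest piece whose length is a multiple of the largest usable power of l. The sketch below is an illustration of that argument, with a hypothetical function name and intervals written as (start, end) pairs of points.

```python
def decompose_prefix(x, l):
    """Split [0, x] into consecutive pieces (start, end), following the induction."""
    pieces = []
    start = 0
    while start < x:
        rem = x - start
        step = 1
        while step * l <= rem:        # largest power of l not exceeding the remainder
            step *= l
        t = rem // step               # maximal t with t * step <= rem (so t < l)
        pieces.append((start, start + t * step))
        start += t * step
    return pieces

# n = 8, l = 2, r = 3: every prefix needs at most r = 3 sparse intervals
print(decompose_prefix(7, 2))
print(decompose_prefix(8, 2))
```

Each loop iteration strictly lowers the usable power of l, which is why the number of pieces is bounded by the number of levels.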

Page 33

Observation
• Any interval $[a,b]$ can be expressed as $2r$ intervals
• By cutting it at a point of the form $a \le \alpha l^j \le b$ with maximum j
• By symmetry, $[a, \alpha l^j]$ and $[\alpha l^j, b]$ can each be expressed as a disjoint union of at most r intervals

Page 34

Lemma
• In a sparse set system with parameter r, the number of intervals containing a point is at most $O(r\,n^{2/r})$

Page 35

Lemma Proof
• Consider the level 1 intervals
• There are at most $O(l^2)$ such intervals that contain a specific point
  – There are l points between adjacent points of level 2
  – l points can create at most $O(l^2)$ intervals
• Level j intervals behave on level j points the same as level 1 intervals on the original points
• Extend to r levels… (the r+1'th level adds one more interval)

Page 36

Layout
• Introduction
• Related Works
• Problem Definition
• Problem Solution
  – A Sparse Interval Set System
  – The Dynamic Programming algorithm
• Experimental Evaluation
• Conclusions

Page 37

Hierarchy Representation By a Tree
• Ranges define a hierarchy based on the inclusion relationship
• T is a hierarchy representation by a tree
  – Each node v of T is associated with a range $R_{ij} = [v_L, v_R]$
  – The weight $w_v$ is $p_{ij}$
  – The error $e_v$ is $e_{ij}$

Page 38

Representation By a Binary Tree
• We allow $w_u = 0$
• If a node has children $u_1, \dots, u_t$, transform it into a node with two children
  – $u_1$
  – a new node with weight 0, whose children are $u_2, \dots, u_t$
• The size of the tree increases only by a factor of 2
• So assume that the tree is binary
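The transformation can be sketched as follows. The `Node` class, the function names, and the choice to give each helper node the span of the children it groups are assumptions made for illustration; children are assumed to be listed in left-to-right order.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: int
    hi: int
    weight: float
    children: list = field(default_factory=list)

def binarize(node):
    """Return a copy of the tree in which every node has at most two children."""
    kids = [binarize(c) for c in node.children]
    while len(kids) > 2:
        right = kids.pop()
        left = kids.pop()
        # weight-0 helper node grouping the two rightmost subtrees
        kids.append(Node(left.lo, right.hi, 0.0, [left, right]))
    return Node(node.lo, node.hi, node.weight, kids)

def size(node):
    return 1 + sum(size(c) for c in node.children)

root = Node(1, 8, 0.4, [Node(1, 2, 0.2), Node(3, 4, 0.2), Node(5, 8, 0.2)])
binary = binarize(root)
print(size(root), "->", size(binary))
```

Because helper nodes carry weight 0, they add nothing to the weighted error, so the optimum is unchanged by this rewriting.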

Page 39

Dynamic Programming Algorithm - FH
• Best(v,left,right,p) denotes the smallest error of the range $[v_L, v_R]$
• v – tree node associated with $[v_L, v_R]$
• left – overlapping interval on the left
• right – overlapping interval on the right
• v contains p intervals completely
• Formally, left contains $v_L$ and right contains $v_R$

Page 40

FH stages
• Let the children of v be y and z with ranges $[y_L, y_R]$ and $[z_L, z_R]$
• Cases (a) + (b)

Page 41

Cases (a)+(b)
• For all possible intervals I that contain $y_R$ and $z_L$, compute

$\mathrm{cost}(I) = \min_{k_1} \big( w_y e_y + w_z e_z + \mathrm{Best}(y, \mathit{left}, I, p - k_1 - 1) + \mathrm{Best}(z, I, \mathit{right}, k_1) \big)$

• In the case that I finishes on $y_R$

$\mathrm{cost}(I) = \min_{k_1} \big( w_y e_y + w_z e_z + \mathrm{Best}(y, \mathit{left}, I, p - k_1) + \mathrm{Best}(z, \emptyset, \mathit{right}, k_1) \big)$

Page 42

Cases (a)+(b)

Page 43

FH stages – Cont.
• Return $\min_I \mathrm{cost}(I)$
• When interval I is fixed, $e_y$ and $e_z$ are automatically defined and can be computed in O(1) time.

Page 44

Time complexity
• Time spent evaluating cost(I) is O(p)
• The running time depends on the number of choices of interval I
• Let C(S) be the maximum number of intervals in an interval system S that contain a particular element ($y_R$)
• If all intervals are allowed then $C(S) = O(n^2)$

Page 45

Time complexity – Cont.
• The running time of the algorithm FH for each table entry is $O(p\,C(S))$
• The number of entries for each tree node v is $O(B\,C(S)^2)$
  – Since there are C(S)+1 choices of left (all intervals that contain $v_L$, and $\emptyset$). Similarly for right
• Work for every tree node: $O(p\,C(S) \cdot B\,C(S)^2) = O(B^2 C(S)^3)$

Page 46

Time complexity – Cont.
• Total work including preprocessing is $O(n + T B^2 C(S)^3)$
• When S is the set of all possible intervals: $O(n^6 T B^2)$
• The result matches the time complexity of the previous algorithm (for arbitrary intervals)

Page 47

Time Complexity For a Sparse Set System
• S – a sparse set system with parameter r
• Run FH with 2r(B-1) buckets
• Error - less than or equal to that of the original B bucket histogram
  – A histogram with B buckets can be expressed as a histogram with 2r(B-1) buckets in the sparse system

Page 48

Time Complexity – Cont.
• $C(S) = r\,n^{2/r}$
• Set $r = 6/\epsilon$
  – In time $O(n + \epsilon^{-5} B^2 n^{\epsilon} T)$ we can construct a solution with $O(B/\epsilon)$ buckets whose error is at most the error of any solution of the original problem with B buckets
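As a hedged check of where a bound of this shape can come from, one can substitute the sparse-system quantities into the total work of FH. The derivation below assumes that the total work for a run with $B'$ buckets is $O(n + T B'^2 C(S)^3)$, that $B' = 2r(B-1)$, that $C(S) = O(r\,n^{2/r})$, and that $r = 6/\epsilon$ for a parameter $\epsilon > 0$; the symbols $B'$ and $\epsilon$ are assumptions introduced here for illustration.

```latex
\begin{align*}
T \, B'^2 \, C(S)^3
  &= T \, \bigl(2r(B-1)\bigr)^2 \, \bigl(r\,n^{2/r}\bigr)^3 \\
  &= 4\, T\, r^5\, (B-1)^2\, n^{6/r} \\
  &= 4 \cdot 6^5 \, \epsilon^{-5}\, T\, (B-1)^2\, n^{\epsilon}
     \qquad (r = 6/\epsilon) \\
  &= O\!\left(\epsilon^{-5} B^2 n^{\epsilon} T\right)
\end{align*}
```

Under the same assumptions the number of buckets is $2r(B-1) = 12(B-1)/\epsilon = O(B/\epsilon)$.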

Page 49

Some Notes
• Get alternate tradeoffs by constructing a different sparse set system
  – Complete binary tree on [0,n]
  – Allow intervals such that one end point is an ancestor of the other
  – Any arbitrary interval can be expressed as a disjoint union of two intervals from the sparse set
  – C(S) = O(n)
  – Solution with 2B buckets in time $O(n^3 B^2 T)$

Page 50

Experimental Evaluation
• FH was implemented with r=6
• Compared to an algorithm A0 presented by Gilbert et al.
  – Optimized for arbitrary range queries
  – For a data series of length n to be approximated with B buckets, constructs a histogram consisting of 2B buckets in time $O(n^2 B)$
  – The only known algorithm with reasonable complexity

Page 51

Description of Data Sets
• A: A real data set of length 1000 extracted from an AT&T operational warehouse
• B: A synthetic data set of length 2000, Zipf-distributed with skew parameter 0.5
• C: A synthetic data set of varying length, representing samples from a Gaussian distribution with mean and variance 250

Page 52

Workload Description
• A normal distribution is used to assign the probabilities to a full hierarchy
• Then normalization to obtain a probability distribution
• W1 – generated by sampling N(10,10)
• W2 – generated by sampling N(10,50)

Page 53

Performance Evaluation
• Accuracy and construction time
• Parameters
  – Total space allowed for histogram
  – Total size of the data set

Page 54

Computing Accuracy
• Ask 1000 queries
• Report the total expected sum squared error of the workload execution on the histogram

Page 55

Results for Data Set A

Page 56

Results for Set A – Cont.
• The accuracy of FH is superior to A0
• FH is more accurate for smaller variance (W1)
• As the variance increases, the workload gets closer to uniform (which A0 is optimized for)
• A0 is linear in the space
• FH is better in construction time for the same range of space

Page 57

Results for Data Set B

Page 58

Results for Set B – Cont.
• Similar to A
• Accuracy improves much faster with space
  – since the distribution is Zipf
• The savings in construction time for FH are dramatic
  – since data set B is twice the size of A

Page 59

Results for Data Set C

Page 60

Results for Set C – Cont.
• Data set size increases (x axis); total space is 20
• A0 has a plateau
  – Due to the way the data is generated in the experiment (Gaussian tail)
• Quadratic trend in construction time for A0
• FH – near-linear increase in construction time

Page 61

Conclusions
• The first practical approach to the problem of constructing hierarchical range histograms
• The dynamic programming algorithms effectively trade space for construction time without compromising histogram accuracy
• A novel notion of sparse intervals

Page 62

Future plans
• A formal study of the dynamic properties of hierarchical range histograms
• How should one modify these histograms under data or workload modifications?

Page 63

The END

Thanks for listening