
1

Wavelet Synopses with Error Guarantees

Minos Garofalakis and Phillip B. Gibbons, Information Sciences Research Center, Bell Labs, Lucent Technologies, Murray Hill, NJ 07974

ACM SIGMOD 2002

2

Outline

Introduction
Wavelet basics
Probabilistic wavelet synopses
Experimental study
Conclusions

3

Introduction

The wavelet decomposition has demonstrated its effectiveness in reducing large amounts of data to compact sets of wavelet coefficients (termed "wavelet synopses") that can be used to provide fast and reasonably accurate approximate answers to queries.

Due to the exploratory nature of many Decision Support System applications, there are a number of scenarios in which the user may prefer a fast approximate answer.

4

Introduction

A major criticism of wavelet-based techniques is that conventional wavelet synopses cannot provide guarantees on the error of individual approximate query answers.

5

Introduction

The problem with approximate query processing over wavelet synopses stems from their deterministic approach to selecting coefficients and from their lack of error guarantees.

We propose an approach to building wavelet synopses that enables unbiased approximate query answers, with error guarantees on the accuracy of individual answers.

6

Introduction

The technique is based on a probabilistic thresholding scheme that assigns each coefficient a probability of being retained, based on its importance to the reconstruction of individual data values, and then flips coins to select the synopsis.

7

Wavelet basics

Given the data vector A, the wavelet transform of A can be computed as follows: repeatedly average and difference neighboring values, keeping the pairwise semi-differences as detail coefficients at each resolution level.

In order to equalize the importance of all wavelet coefficients, we normalize: each coefficient is divided by sqrt(2^l), where l is the resolution level at which the coefficient appears (l = 0 being the coarsest level).
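As a concrete illustration of this computation, here is a minimal Python sketch (my own, not the authors' code) of the pairwise averaging-and-differencing recursion; the normalization step described above can then be applied by dividing each detail coefficient by sqrt(2^l).

def haar_transform(data):
    # Compute the (unnormalized) Haar wavelet decomposition of `data`,
    # whose length must be a power of two. Returns the coefficients in
    # error-tree order: c_0 is the overall average, followed by the
    # detail coefficients from the coarsest to the finest level.
    assert len(data) > 0 and len(data) & (len(data) - 1) == 0
    averages = list(data)
    levels = []  # detail coefficients, finest level first
    while len(averages) > 1:
        details = [(averages[2*i] - averages[2*i + 1]) / 2.0
                   for i in range(len(averages) // 2)]
        averages = [(averages[2*i] + averages[2*i + 1]) / 2.0
                    for i in range(len(averages) // 2)]
        levels.append(details)
    coeffs = averages  # [overall average c_0]
    for details in reversed(levels):
        coeffs.extend(details)
    return coeffs

# For A = [2, 2, 0, 2, 3, 5, 4, 4]:
# haar_transform(A) -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]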

8

Wavelet basics

A helpful tool for exploring and understanding the key properties of the wavelet decomposition is the error tree structure: its internal nodes hold the wavelet coefficients and its leaves hold the original data values.

9

Wavelet basics

The important reconstruction properties:

(P1) The reconstruction of any data value d_i depends only on the values of the nodes in path(d_i), the set of nodes on the path from the root of the error tree to the leaf d_i.

(P2) The range sum d(l:h) = sum over c_j in path(d_l) union path(d_h) of x_j * c_j, where x_j = h - l + 1 if c_j is the root, and otherwise x_j is the number of leaves in [l, h] in c_j's left subtree minus the number of leaves in [l, h] in its right subtree.

10

Wavelet basics

d_5 = c_0 - c_2 + c_5 - c_10 = 65 - 14 + (-20) - 28 = 3
d(3:5) = 3c_0 + (1 - 2)c_2 - c_4 + 2c_5 - c_9 + (1 - 1)c_10 = 93
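To make property (P1) concrete, here is a small sketch (helper names are my own) that rebuilds a single data value by walking its root-to-leaf path: the detail coefficient is added when the path descends left and subtracted when it descends right.

def reconstruct(coeffs, i):
    # Rebuild data value d_i from coefficients in error-tree order:
    # coeffs[0] = overall average c_0; internal node j holds c_j and
    # has children 2j and 2j+1. Only nodes on path(d_i) are touched (P1).
    n = len(coeffs)
    value = coeffs[0]
    j, lo, hi = 1, 0, n - 1      # node j covers leaves lo..hi
    while j < n:
        mid = (lo + hi) // 2
        if i <= mid:
            value += coeffs[j]   # descend into the left subtree
            j, hi = 2 * j, mid
        else:
            value -= coeffs[j]   # descend into the right subtree
            j, lo = 2 * j + 1, mid + 1
    return value

# Property (P2) can be checked naively against the path-based formula:
# d(l:h) == sum(reconstruct(coeffs, i) for i in range(l, h + 1))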

11

Probabilistic wavelet synopses: A. The problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retains the B wavelet coefficients with the largest absolute value after normalization; this deterministic process minimizes the overall L2 error.

12

Probabilistic wavelet synopses: A. The problem with conventional wavelets

With only the largest coefficients retained, the same reconstructions become:
d_5 = 65 - 0 + 0 - 0 = 65
d(3:5) = 3·65 - 0 - 0 + 0 - 0 = 195

13

Probabilistic wavelet synopses: A. The problem with conventional wavelets

Root causes: (1) strict deterministic thresholding; (2) independent thresholding; (3) the bias resulting from dropping coefficients without compensating for their loss.

14

Probabilistic wavelet synopses: B. General approach

Our scheme deterministically retains the most important coefficients while randomly rounding each of the other coefficients either up to a larger value (its "rounding value") or down to zero.

By carefully selecting the rounding values we ensure that: (1) we expect a total of B coefficients to be retained; and (2) we minimize a desired error metric in the reconstruction of the data.

15

Probabilistic wavelet synopses: B. General approach

The key idea in the thresholding scheme is to associate with each coefficient c_i a random variable C_i such that (1) C_i = 0 with some probability, and (2) E[C_i] = c_i, where we select a rounding value λ_i for each non-zero c_i such that 0 < c_i/λ_i ≤ 1 (i.e., λ_i has the same sign as c_i and |λ_i| ≥ |c_i|).

16

Probabilistic wavelet synopses: B. General approach

Our thresholding scheme essentially "rounds" each non-zero wavelet coefficient c_i independently to either λ_i or zero by flipping a biased coin with success probability y_i = c_i/λ_i.

Its variance is simply Var(C_i) = λ_i·c_i - c_i² = c_i·(λ_i - c_i).
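The coin-flipping step itself is tiny; the sketch below (function and variable names are my own) retains each non-zero c_i as λ_i with probability y_i = c_i/λ_i, which yields E[C_i] = c_i and the variance stated above.

import random

def round_coefficients(coeffs, lambdas, rng=None):
    # C_i = lambda_i with probability y_i = c_i / lambda_i, else 0.
    # Assumes 0 < c_i / lambda_i <= 1 for every non-zero c_i.
    rng = rng or random.Random(0)
    synopsis = {}
    for i, (c, lam) in enumerate(zip(coeffs, lambdas)):
        if c == 0:
            continue              # zero coefficients are never stored
        y = c / lam               # retention probability in (0, 1]
        if rng.random() < y:
            synopsis[i] = lam     # rounded up and retained
        # else: rounded down to zero (unbiased: E[C_i] = y * lam = c)
    return synopsis               # expected size is the sum of the y_i's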

17

Probabilistic wavelet synopses: B. General approach

For example, λ_0 = c_0 (retained outright), λ_10 = 2c_10, and λ_i = 3c_i/2 for the remaining coefficients.

18

Probabilistic wavelet synopses: B. General approach

The impact of the λ_i's:
Choosing λ_i closer to c_i reduces the variance (but raises the retention probability y_i).
Choosing λ_i further from c_i reduces the expected number of retained coefficients (but raises the variance).

19

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

A reasonable approach is to select the λ_i values in a way that minimizes some overall error metric (e.g., the expected L2 error).

20

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

Letting y_i = c_i/λ_i and noting that Var(C_i) = c_i²·(1/y_i - 1), the expected L2-error minimization problem is equivalent to:

minimize Σ_i c_i²/y_i subject to Σ_i y_i ≤ B and 0 < y_i ≤ 1.

Based on the Cauchy-Schwarz inequality, the minimum value of the objective is reached when the y_i are proportional to |c_i|, i.e., y_i = min{1, |c_i|·B / Σ_j |c_j|}.
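In code, the closed form above can be computed directly; this is a sketch I wrote around it (the cap-and-redistribute loop handles the y_i ≤ 1 constraint):

def min_l2_probabilities(coeffs, budget):
    # Cauchy-Schwarz optimum: y_i proportional to |c_i|, capped at 1.
    # Any coefficient whose proportional share reaches 1 is retained
    # outright, and the leftover budget is redistributed among the rest.
    y = [0.0] * len(coeffs)
    active = [i for i, c in enumerate(coeffs) if c != 0]
    b = float(budget)
    while active and b > 0:
        total = sum(abs(coeffs[i]) for i in active)
        capped = [i for i in active if abs(coeffs[i]) * b / total >= 1.0]
        if not capped:
            for i in active:
                y[i] = abs(coeffs[i]) * b / total
            break
        for i in capped:
            y[i] = 1.0            # deterministically retained
        active = [i for i in active if y[i] == 0.0]
        b -= len(capped)
    return y                      # rounding values: lambda_i = c_i / y_i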

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual data values (relative error).

The goal is to produce an estimate d̂_i for each data value d_i such that the relative error |d̂_i - d_i| / max{|d_i|, S} is small, where S is a sanity bound that keeps very small data values from dominating the metric.

23

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The expected value of the estimate is E[d̂_i] = d_i (it is unbiased), so we would like to minimize the variance of the reconstruction.

More precisely, we seek to minimize the normalized standard error for a reconstructed data value: NSE(d̂_i) = sqrt(Var(d̂_i)) / max{|d_i|, S}.

24

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

Note that by applying Chebyshev's inequality we obtain, for all α > 1:

Pr( |d̂_i - d_i| / max{|d_i|, S} ≥ α·NSE(d̂_i) ) ≤ 1/α²

so that minimizing the NSE will indeed minimize the probabilistic bounds on the relative error metric.
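As a quick empirical sanity check of this bound (a simulation of my own, not an experiment from the paper), one can round a single coefficient many times and compare the tail frequency with 1/α²:

import random

def chebyshev_check(c=10.0, lam=25.0, s=1.0, alpha=2.0, trials=100_000):
    # Empirically verify Pr(|C - c| / max(|c|, s) >= alpha * NSE) <= 1/alpha^2.
    rng = random.Random(1)
    y = c / lam                                      # retention probability
    nse = (c * (lam - c)) ** 0.5 / max(abs(c), s)    # sqrt(Var) / max(|c|, S)
    exceed = sum(
        abs((lam if rng.random() < y else 0.0) - c) / max(abs(c), s)
        >= alpha * nse
        for _ in range(trials)
    )
    return exceed / trials, 1.0 / alpha ** 2         # empirical tail vs. bound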

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We would like to formulate a dynamic-programming recurrence for this problem.

Let PATHS_j denote the set of all root-to-leaf paths in T_j (the error subtree rooted at node j), and let M[j, B] denote the optimal value of the maximum NSE among all data values d_k in T_j, assuming a space budget of B.

27

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

M[j, B] is computed by the recurrence depicted in equation (11): it minimizes, over the retention probability y_j and over the share b_L of the remaining budget given to j's left subtree, the maximum cost across j's two child subtrees.

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The problem in (11) is that the y_i and b_L each range over a continuous interval, which makes the recurrence infeasible to use directly.

The key technical idea is to quantize the solution space.

We modify the constraint so that each retention probability is an integer multiple of 1/q, i.e., y_i ∈ {1/q, 2/q, ..., 1}, where q is an input integer.
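To show how quantization makes the recurrence computable, here is a deliberately simplified dynamic program of my own: it minimizes the maximum root-to-leaf sum of rounding variances Var(C_j) = c_j²·(1/y_j - 1) rather than the paper's NSE objective, but it exercises the same quantized state space, with y values restricted to multiples of 1/q and integer budget splits b_L.

from functools import lru_cache

def min_max_path_cost(coeffs, budget, q):
    # Simplified illustration of the quantized DP. Budgets are integer
    # units of 1/q; node j's retention probability is y_j = k/q, k in 1..q.
    # coeffs[0] (the overall average) is assumed retained separately.
    n = len(coeffs)

    @lru_cache(maxsize=None)
    def m(j, b):                           # b = remaining budget in 1/q units
        if j >= n:                         # below the leaves: no cost
            return 0.0
        if coeffs[j] == 0:                 # zero coefficient: spend nothing
            return min(max(m(2 * j, bl), m(2 * j + 1, b - bl))
                       for bl in range(b + 1))
        best = float("inf")                # y_j > 0 required, so b = 0 fails
        for k in range(1, min(b, q) + 1):  # y_j = k/q uses k budget units
            var = coeffs[j] ** 2 * (q / k - 1.0)
            rest = b - k
            for bl in range(rest + 1):     # split the rest between children
                best = min(best, var + max(m(2 * j, bl),
                                           m(2 * j + 1, rest - bl)))
        return best

    return m(1, budget * q)                # root detail coefficient is node 1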

30

Probabilistic wavelet synopses: E. Low-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities y_i, where, as before, the y_i's are selected to minimize a desired error metric.

31

Probabilistic wavelet synopses: F. Summary of the approach

32

Experimental study

A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0).

We also use a real-world data set downloaded from the National Forest Service.

We set q = 10, the sanity bound S to the 10th percentile of the data values, and the perturbation Δ = min{0.01, S/100}.
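The generator can be sketched as follows (an assumed reimplementation, with frequencies proportional to 1/rank^z and scaled to a desired total; the paper's exact generator parameters are not reproduced here):

def zipfian_frequencies(n, z, total):
    # Frequency of the i-th most frequent value is proportional to 1/(i+1)^z.
    weights = [1.0 / (i + 1) ** z for i in range(n)]
    scale = total / sum(weights)
    return [w * scale for w in weights]

# e.g. zipfian_frequencies(1024, z=0.3, total=100_000)  # mild skew
#      zipfian_frequencies(1024, z=2.0, total=100_000)  # heavy skew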

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions

We have introduced probabilistic wavelet synopses, the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers.

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics.

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach.

2

Outline

Introduction Wavelet basics Probabilistic wavelet synopses Experimental study Conclusions

3

Introduction The wavelet decomposition has demonstrated

the effectiveness in reducing large amounts of data to compact sets of wavelet coefficients (termed ldquowavelet synopsesrdquo) that can be used to provide fast and reasonably accurate approximate answers to queries

Due to exploratory nature of many Decision Support Systems applications there are a number of scenarios in which the user may prefer a fast approximate answer

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

3

Introduction The wavelet decomposition has demonstrated

the effectiveness in reducing large amounts of data to compact sets of wavelet coefficients (termed ldquowavelet synopsesrdquo) that can be used to provide fast and reasonably accurate approximate answers to queries

Due to exploratory nature of many Decision Support Systems applications there are a number of scenarios in which the user may prefer a fast approximate answer

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses: E. Low-bias probabilistic wavelet synopses

Each coefficient is either retained (with its exact value ci) or discarded according to the probability yi, where, as before, the yi's are selected to minimize a desired error metric; this trades a small bias for reduced variance.
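A hedged sketch of this variant (our reading of the slide: a retained coefficient keeps its exact value, so the estimator carries a small bias of (1 - yi)·|ci| but has lower variance than the unbiased rounding scheme):

```python
import numpy as np

def low_bias_round(coeffs, y, rng):
    """Keep c_i unchanged with probability y_i, else drop it.

    E[C_i] = y_i * c_i   (bias (1 - y_i)|c_i|, small when y_i is near 1);
    Var(C_i) = c_i^2 * y_i * (1 - y_i), which is lower than the unbiased
    scheme's c_i^2 * (1 - y_i) / y_i for any y_i < 1.
    """
    coeffs = np.asarray(coeffs, dtype=float)
    keep = rng.random(coeffs.shape) < np.asarray(y)
    return np.where(keep, coeffs, 0.0)

rng = np.random.default_rng(2)
c = np.array([65.0, -14.0, -20.0, 28.0])
y = np.array([1.0, 0.5, 0.6, 0.8])
samples = np.array([low_bias_round(c, y, rng) for _ in range(100_000)])
print(samples.mean(axis=0))   # approx y * c, not c: slightly biased
print(samples.var(axis=0))    # approx c**2 * y * (1 - y)
```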

31

Probabilistic wavelet synopses: F. Summary of the approach

32

Experimental study

A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0).

We also use a real-world data set downloaded from the National Forest Service.

We set q = 10, take the sanity bound S to be the 10th percentile of the data values, and use perturbation Δ = min{0.01, S/100}.
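Such skewed frequency vectors can be produced along the following lines (a sketch under our own assumptions about domain size and total count; the paper does not specify its generator):

```python
import numpy as np

def zipf_frequencies(n_values, total_count, z, rng):
    """Frequency vector over n_values domain points with Zipfian skew z.

    Rank r gets probability proportional to 1 / r**z; counts are drawn
    by placing total_count items according to these probabilities.
    """
    ranks = np.arange(1, n_values + 1, dtype=float)
    p = ranks ** -z
    p /= p.sum()
    counts = rng.multinomial(total_count, p)
    rng.shuffle(counts)            # scatter the skewed ranks over the domain
    return counts

rng = np.random.default_rng(3)
for z in (0.3, 1.0, 2.0):          # skew levels in the studied range
    f = zipf_frequencies(1024, 100_000, z, rng)
    print(z, f.max(), np.count_nonzero(f))
```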

33-35

Experimental study

(Result charts comparing the probabilistic and deterministic synopses; not reproduced here.)

36

Conclusions

We have introduced probabilistic wavelet synopses, the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers.

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics.

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach.


Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

top related