
Approximation and Load Shedding for QoS in DSMS*

CS240B Notes

By

Carlo Zaniolo

CSD--UCLA

* Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Synopses and Approximation

Synopsis: bounded-memory approximation of the stream history
– Succinct summary of old stream tuples
– Like indexes/materialized views, but the base data is unavailable

Examples
– Sliding windows
– Samples
– Histograms
– Wavelet representation
– Sketching techniques

Approximate algorithms: e.g., median, quantiles, …

Fast and light Data Mining algorithms

Overview of Stream Synopses

Windows: logical, physical (covered)

Samples: Answering queries using samples

Histograms: Equi-depth histograms, On-line quantile computation

Wavelets: Haar-wavelet histogram construction & maintenance

Sampling: Basics

• Idea: A small random sample S of the data often represents all the data well
– For a fast approximate answer, apply a “modified” query to S
– Example: select agg from R where odd(R.e) (n=12)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

Sample S: 9 5 1 8

– If agg is avg, return the average of the odd elements in S (answer: 5)
– If agg is count, return the average over all elements e in S of
  • n if e is odd
  • 0 if e is even
  (answer: 12*3/4 = 9)

• Unbiased: For expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
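The two estimators in this example can be sketched in a few lines of Python. This is only an illustration using the stream and sample from the slide; the function names `estimate_avg`, `estimate_count`, and `is_odd` are hypothetical, not part of any system discussed in these notes.

```python
def estimate_avg(sample, predicate):
    """Average of the sampled elements that satisfy the predicate.

    Assumes at least one sampled element matches; otherwise the
    estimate is undefined (division by zero).
    """
    matching = [e for e in sample if predicate(e)]
    return sum(matching) / len(matching)


def estimate_count(sample, predicate, n):
    """Scale the matching fraction of the sample up to the stream size n."""
    matches = sum(1 for e in sample if predicate(e))
    return n * matches / len(sample)


def is_odd(e):
    return e % 2 == 1


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]

avg_estimate = estimate_avg(sample, is_odd)                   # 5.0
count_estimate = estimate_count(sample, is_odd, len(stream))  # 9.0
```

Note that the count estimate needs the stream size n, which the system must track alongside the sample.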

Probabilistic Guarantees

Example: Actual answer is within 5 ± 1 with prob 0.9

Use tail inequalities to give probabilistic bounds on the returned answer
– Markov inequality
– Chebyshev’s inequality
– Hoeffding’s inequality
– Chernoff bound
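As one illustration of how such a bound is used, Hoeffding's inequality says that the mean of m independent samples, each bounded in [lo, hi], deviates from its expectation by at least eps with probability at most 2·exp(−2·m·eps² / (hi − lo)²). The sketch below inverts that bound to get a sample size. It assumes independent (with-replacement) sampling and a known value range; the function names are hypothetical.

```python
import math


def hoeffding_failure_prob(m, eps, lo, hi):
    """Upper bound on P(|sample mean - true mean| >= eps) for m samples in [lo, hi]."""
    return min(1.0, 2 * math.exp(-2 * m * eps ** 2 / (hi - lo) ** 2))


def hoeffding_sample_size(eps, delta, lo, hi):
    """Smallest m for which the Hoeffding bound drops to delta or below."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * eps ** 2))


# Error of at most 1 with probability at least 0.9, for values in [1, 9].
m = hoeffding_sample_size(eps=1, delta=0.1, lo=1, hi=9)
```

The bound is distribution-free, so it is often loose; it depends only on the value range, not on the stream size.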

Sampling: Some Background

Reservoir Sampling [Vit85]: Maintains a sample S of pre-assigned size M over a stream of arbitrary size
– Add each new element to S with probability M/n, where n is the current number of stream elements
– If an element is added, evict a random element from S
– Instead of flipping a coin for each element, determine the number of elements to skip before the next one is added to S
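The first two steps can be sketched as follows. This is a minimal version that flips a coin per element; it does not implement the skip-counting optimization mentioned in the last step. The function name and the `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Keep a uniform random sample of size m over a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= m:
            # The first m elements fill the reservoir directly.
            sample.append(item)
        elif rng.random() < m / n:
            # Admit with probability m/n and evict a random resident.
            sample[rng.randrange(m)] = item
    return sample


sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Only the reservoir and the counter n are kept in memory, so space is O(M) regardless of stream length.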

Concise Sampling [GM98]: Duplicates in sample S are stored as <value, count> pairs (thus potentially boosting the actual sample size)

Counting Samples [GM98]: For answering hot list queries (k most frequent values)

Window Sampling [BDM02, BOZ08]: Maintains a sample S of pre-assigned size M over a window on a stream (reservoir sampling with expiring tuples)
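To see why <value, count> pairs can boost the effective sample size, the sketch below shows only the storage representation, not the maintenance algorithm of [GM98]. It assumes scalar values, and it assumes a footprint of one slot per singleton and two slots per pair; `to_concise` and `footprint` are hypothetical names.

```python
from collections import Counter


def to_concise(sample_values):
    """Store duplicated values as (value, count) pairs and singletons as-is."""
    counts = Counter(sample_values)
    return [(v, c) if c > 1 else v for v, c in counts.items()]


def footprint(concise):
    """Memory slots used: 2 per (value, count) pair, 1 per singleton."""
    return sum(2 if isinstance(entry, tuple) else 1 for entry in concise)


points = [5, 5, 5, 5, 3, 3, 7]     # 7 sample points
concise = to_concise(points)       # [(5, 4), (3, 2), 7]
slots = footprint(concise)         # 5 slots instead of 7
```

The saving grows with skew: the more duplicates in the sample, the more sample points fit in the same memory.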

Load Shedding Using Samples

Given a complex query graph, how should the sampling process be used and managed? [BDM04]

More about this later [LawZ08]

Histograms

Histograms approximate the frequency distribution of element values in a stream

A histogram (typically) consists of
– A partitioning of element domain values into buckets
– A count per bucket B (of the number of elements in B)

Widely used in DBMS query optimization

Many types proposed:
– Equi-depth histograms: select buckets such that counts per bucket are equal
– V-optimal histograms: select buckets to minimize frequency variance within buckets
– Wavelet-based histograms


Types of Histograms

• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal

[Figure: count per bucket vs. domain values 1–20, equi-depth histogram]

• V-Optimal Histograms [IP95] [JKM98]
– Idea: Select buckets to minimize frequency variance within buckets

[Figure: count per bucket vs. domain values 1–20, V-optimal histogram]

minimize $\sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$

where $f_v$ is the frequency of value $v$, $C_B$ is the count for bucket $B$, and $V_B$ is the number of domain values in $B$.
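The V-optimal objective can be evaluated for a given bucketing as in the sketch below: for each bucket, sum the squared deviations of the value frequencies from the bucket mean (the bucket count divided by the number of values in the bucket). The frequency vector and the bucket boundaries are hypothetical, and the sketch only scores a bucketing; it does not search for the optimal one.

```python
def v_optimal_cost(freqs, buckets):
    """Sum over buckets of squared deviations from the bucket's mean frequency.

    freqs[i] is the frequency of the i-th domain value; each bucket is a
    half-open index range (start, end).
    """
    cost = 0.0
    for start, end in buckets:
        bucket = freqs[start:end]
        mean = sum(bucket) / len(bucket)
        cost += sum((f - mean) ** 2 for f in bucket)
    return cost


freqs = [2, 2, 0, 2, 3, 5, 4, 4]                        # hypothetical frequencies
two_buckets = v_optimal_cost(freqs, [(0, 4), (4, 8)])   # 5.0
one_bucket = v_optimal_cost(freqs, [(0, 8)])            # 17.5
```

Splitting where the frequencies change level lowers the cost sharply, which is the intuition behind V-optimal bucket selection.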

Equi-Depth Histogram Construction

For histogram with b buckets, compute elements with rank n/b, 2n/b, ..., (b-1)n/b

Example: (n=12, b=4)

Data stream: 9 3 5 2 7 1 6 5 8 4 9 1

After sort: 1 1 2 3 4 5 5 6 7 8 9 9

rank = 3 (.25-quantile), rank = 6 (.5-quantile), rank = 9 (.75-quantile)
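For the example above, the bucket boundaries are the elements at ranks n/b, 2n/b, ..., (b-1)n/b of the sorted data. A minimal offline sketch follows: it sorts the full data set, which a real stream system would replace with an on-line quantile algorithm. The function name is hypothetical, and n >= b is assumed.

```python
def equi_depth_boundaries(values, b):
    """Return the elements at ranks n/b, 2n/b, ..., (b-1)n/b (1-based ranks)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // b - 1] for i in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
boundaries = equi_depth_boundaries(stream, 4)  # [2, 5, 7]
```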




Answering Queries Using Histograms [IP99]

• (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation

• Example: select count(*) from R where 4 <= R.e <= 15
– Count spread evenly among bucket values
– answer: 3.5 * BC (BC is the count per bucket)

[Figure: equi-depth histogram over domain values 1–20, with the query range 4 ≤ R.e ≤ 15 marked]

• For equi-depth histograms, maximum error: BC*2
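The "spread the count evenly among bucket values" rule can be sketched as below. The bucket boundaries are hypothetical (the figure's actual boundaries are not recoverable from these notes); they are chosen so that the range 4..15 covers 3.5 buckets, with a count of 2 per bucket.

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(*) where lo <= e <= hi from (b_lo, b_hi, count) buckets.

    Each bucket covers the integer values b_lo..b_hi inclusive, and its
    count is assumed to be spread evenly over those values.
    """
    total = 0.0
    for b_lo, b_hi, count in buckets:
        overlap = min(hi, b_hi) - max(lo, b_lo) + 1
        if overlap > 0:
            total += count * overlap / (b_hi - b_lo + 1)
    return total


# Hypothetical equi-depth buckets over the domain 1..20, count 2 each.
buckets = [(1, 1, 2), (2, 5, 2), (6, 9, 2), (10, 12, 2), (13, 15, 2), (16, 20, 2)]
estimate = estimate_range_count(buckets, 4, 15)  # 3.5 buckets * 2 = 7.0
```

Only the two buckets cut by the range endpoints contribute error; fully covered buckets are counted exactly.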

Approximate Algorithms

– Quantiles using samples
– Quantiles from synopses
– One-pass algorithms for approximate samples
– … Much work in this area … omitted

One-Dimensional Haar Wavelets

• Wavelets: Mathematical tool for hierarchical decomposition of functions/signals

• Haar wavelets: Simplest wavelet basis, easy to understand and implement
– Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
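The recursive pairwise averaging and differencing can be sketched as follows. The sketch assumes the input length is a power of two and uses the unnormalized convention of the example (average (a+b)/2, detail (a-b)/2); `haar_decompose` is a hypothetical name.

```python
def haar_decompose(values):
    """Return [overall average, then detail coefficients from coarse to fine]."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones computed earlier.
        details = level_details + details
    return averages + details


coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```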


Haar Wavelet Coefficients

• Hierarchical decomposition structure (a.k.a. “error tree”)

[Figure: error tree over the original frequency distribution 2 2 0 2 3 5 4 4, with coefficients 2.75 and -1.25 at the top, 0.5 and 0 below them, and 0, -1, -1, 0 at the lowest level; alongside, the coefficient “supports” showing the + or - sign with which each coefficient contributes to each data value]
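The error tree can be read top-down to rebuild the data: each detail coefficient is added to the left half of its support and subtracted from the right half. A minimal sketch of that reconstruction, under the same unnormalized convention as the decomposition example and with a hypothetical function name:

```python
def haar_reconstruct(coeffs):
    """Invert the Haar decomposition: expand each average into (avg + d, avg - d)."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        # One detail coefficient per current average at this resolution.
        details = coeffs[pos:pos + len(values)]
        pos += len(details)
        values = [v for avg, d in zip(values, details) for v in (avg + d, avg - d)]
    return values


data = haar_reconstruct([2.75, -1.25, 0.5, 0, 0, -1, -1, 0])
# [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
```

Any single data value depends only on the coefficients on its root-to-leaf path, which is what makes dropping small coefficients a local approximation.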

Compressed Wavelet Representations

Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution

Steps
– Compute cumulative frequency distribution C
– Compute linear wavelet transform of C
– Greedy heuristic methods
  • Retain coefficients leading to large error reduction
  • Throw away coefficients that give small increase in error
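The greedy "throw away" heuristic can be illustrated with a simplified sketch. Note the simplifications, which are assumptions of this example rather than the method above: it works on Haar (not linear) wavelet coefficients of the raw (not cumulative) frequencies, and it measures error as the sum of squared differences after reconstruction. All names are hypothetical.

```python
def haar_reconstruct(coeffs):
    """Rebuild the data from unnormalized Haar coefficients."""
    values = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(values)]
        pos += len(details)
        values = [v for avg, d in zip(values, details) for v in (avg + d, avg - d)]
    return values


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compress(coeffs, budget):
    """Zero out coefficients one at a time, always dropping the one whose
    removal increases the reconstruction error the least, until at most
    `budget` non-zero coefficients remain."""
    original = haar_reconstruct(coeffs)
    kept = list(coeffs)
    alive = [i for i, c in enumerate(kept) if c != 0]
    while len(alive) > budget:
        best_i, best_err = None, None
        for i in alive:
            trial = list(kept)
            trial[i] = 0
            err = squared_error(original, haar_reconstruct(trial))
            if best_err is None or err < best_err:
                best_i, best_err = i, err
        kept[best_i] = 0
        alive.remove(best_i)
    return kept


coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
small = compress(coeffs, 3)
```

The retained coefficients, together with their positions, form the compressed synopsis; the zeroed ones need not be stored.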

Sketches

Conventional data summaries fall short:
– Quantiles and 1-d histograms: cannot capture attribute correlations
– Samples (e.g., using reservoir sampling) perform poorly for joins
– Multi-d histograms/wavelets: construction requires multiple passes over the data

Different approach: Randomized sketch synopses
– Only logarithmic space
– Probabilistic guarantees on the quality of the approximate answer
– Can handle extreme cases

Overview

Windows: logical, physical (covered)

Samples: Answering queries using samples

Histograms: Equi-depth histograms, On-line quantile computation

Wavelets: Haar-wavelet histogram construction & maintenance

Sketches

QoS by load shedding

QoS and Load Shedding

When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)

Load shedding affects queries and their answers: drop the tasks and the tuples that will cause the least loss

Introducing load shedding in a data stream manager is a challenging problem

Random load shedding or semantic load shedding

Load Shedding in Aurora

QoS for each application as a function relating output to its utility

– Delay based, drop based, value based

Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least

– Determining when, where and how much load to shed

Load Shedding in STREAM

Formulate load shedding as an optimization problem for multiple sliding window aggregate queries

– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate

Consider placement of load shedding operators in query plan

– Each operator i sheds load uniformly with probability p_i
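A single load shedding operator of this kind can be sketched as a random filter that drops each tuple independently with probability p; a downstream sum or count is then scaled by 1/(1-p) to compensate. This sketch covers only one operator and does not address the placement optimization; the function names and the `rng` parameter are assumptions made for the example.

```python
import random


def shed(stream, p, rng):
    """Drop each tuple independently with probability p (requires 0 <= p < 1)."""
    return [t for t in stream if rng.random() >= p]


def scaled_sum(kept, p):
    """Compensate a sum computed over the surviving tuples for the shed load."""
    return sum(kept) / (1 - p)


rng = random.Random(0)
stream = list(range(1, 101))
kept = shed(stream, 0.5, rng)
estimate = scaled_sum(kept, 0.5)
```

The scaled sum is an unbiased estimate of the true sum, but its variance grows as p approaches 1, which is why the choice of p per operator matters for answer accuracy.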

References

[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a Moving Window over Streaming Data”. Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 633–634, 2002.

[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.

[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.

[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.

[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004, pp. 350–361.

[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.
