Post on 22-Dec-2015
Approximation and Load Shedding for QoS in DSMS*
CS240B Notes
By
Carlo Zaniolo
CSD--UCLA
* Notes based on a VLDB’02 tutorial by Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi
Synopses and Approximation
Synopsis: bounded-memory history-approximation
– Succinct summary of old stream tuples
– Like indexes/materialized-views, but base data is unavailable
Examples
– Sliding Windows
– Samples
– Histograms
– Wavelet representation
– Sketching techniques
Approximate Algorithms: e.g., median, quantiles,…
Fast and light Data Mining algorithms
Overview of Stream Synopses
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Garofalakis, Gehrke, Rastogi, VLDB’02
Sampling: Basics
• Idea: A small random sample S of the data often represents all the data well
– For a fast approximate answer, apply a “modified” query to S
– Example: select agg from R where odd(R.e) (n=12)
  Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
  Sample S: 9 5 1 8
– If agg is avg, return the average of the odd elements in S
  answer: 5
– If agg is count, return n times the average over all elements e in S of
  • 1 if e is odd
  • 0 if e is even
  answer: 12*3/4 = 9
• Unbiased: For expressions involving count, sum, avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer
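The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the "modified query" idea for the odd(R.e) predicate; the function name estimate_from_sample and its parameters are illustrative and do not come from the tutorial.

```python
def estimate_from_sample(sample, n, agg):
    """Approximate `select agg from R where odd(R.e)` from a sample of R.

    sample: elements drawn uniformly at random from the stream
    n:      total number of stream elements seen so far
    agg:    "avg" or "count"
    """
    odd = [e for e in sample if e % 2 == 1]
    if agg == "avg":
        # Average of the qualifying sample elements (undefined if there are none).
        return sum(odd) / len(odd) if odd else None
    if agg == "count":
        # Fraction of qualifying elements in the sample, scaled up to the stream size.
        return n * len(odd) / len(sample)
    raise ValueError("unsupported aggregate: " + agg)


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
sample = [9, 5, 1, 8]
avg_estimate = estimate_from_sample(sample, len(stream), "avg")      # 5.0
count_estimate = estimate_from_sample(sample, len(stream), "count")  # 9.0
```

The true count of odd elements in this stream is 8, so the estimate of 9 is close but not exact, which is what the probabilistic guarantees on the next slide quantify.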
Probabilistic Guarantees
Example: Actual answer is within 5 ± 1 with prob 0.9
Use Tail Inequalities to give probabilistic bounds on returned answer
– Markov Inequality
– Chebyshev’s Inequality
– Hoeffding’s Inequality
– Chernoff Bound
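As one illustration of how such a bound is used, Hoeffding's inequality says that the mean of m independent values, each lying in an interval of width R, deviates from its expectation by at least eps with probability at most 2*exp(-2*m*eps^2/R^2). The sketch below turns that into a sample-size calculation. The function names are illustrative, and the independence of the sampled values is an assumption of the sketch.

```python
import math


def hoeffding_failure_bound(m, epsilon, value_range=1.0):
    """Upper bound on P(|sample mean - true mean| >= epsilon) for m independent
    values that each lie in an interval of width value_range."""
    return min(1.0, 2.0 * math.exp(-2.0 * m * epsilon ** 2 / value_range ** 2))


def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Smallest m for which the bound above is at most delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Estimating a fraction (values are 0 or 1) to within +/- 0.1 with probability 0.9:
m = hoeffding_sample_size(epsilon=0.1, delta=0.1)
```

Note that the required sample size depends only on the error and confidence targets, not on the length of the stream.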
Sampling—some background
Reservoir Sampling [Vit85]: Maintains a sample S having a pre-assigned size M on a stream of arbitrary size
– Add each new element to S with probability M/n, where n is the current number of stream elements
– If an element is added, evict a random element from S
– Instead of flipping a coin for each element, determine the number of elements to skip before the next to be added to S
Concise sampling [GM98]: Duplicates in sample S stored as <value, count> pairs (thus, potentially boosting actual sample size)
Counting Samples [GM98]: for answering hot list queries (k most frequent values)
Window Sampling [BDM02,BOZ08]. Maintains a sample S having a pre-assigned size M on a window on a stream—reservoir sampling with expiring tuples.
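The basic reservoir sampling step described above can be sketched as follows. This is the simple coin-flip-per-element variant, not the skip-based optimization, and it does not handle expiring tuples. The function name and the optional rng parameter are illustrative choices.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of size m (or the whole stream if shorter)."""
    if rng is None:
        rng = random.Random()
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= m:
            # The first m elements fill the reservoir.
            sample.append(element)
        elif rng.random() < m / n:
            # Admit the n-th element with probability m/n and
            # evict a uniformly chosen resident to make room.
            sample[rng.randrange(m)] = element
    return sample


sample = reservoir_sample(range(1000), 10, random.Random(42))
```

Only the reservoir itself is kept in memory, so the space used is independent of the stream length.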
Load Shedding Using Samples
Given a complex query graph, how should the sampling process be used and managed? [BDM04]
More about this later [LawZ08]
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
Histograms
Histograms approximate the frequency distribution of element values in a stream
A histogram (typically) consists of
– A partitioning of element domain values into buckets
– A count per bucket B (of the number of elements in B)
Widely used in DBMS query optimization
Many types proposed:
– Equi-Depth Histograms: select buckets such that counts per bucket are equal
– V-Optimal Histograms: select buckets to minimize frequency variance within buckets
– Wavelet-based Histograms
Types of Histograms
• Equi-Depth Histograms
– Idea: Select buckets such that counts per bucket are equal
[Figure: count for bucket vs. domain values 1 to 20]
• V-Optimal Histograms [IP95] [JKM98]
– Idea: Select buckets to minimize frequency variance within buckets
$$\text{minimize} \; \sum_{B} \sum_{v \in B} \left( f_v - \frac{C_B}{V_B} \right)^2$$
  where f_v is the frequency of value v, C_B is the count for bucket B, and V_B is the number of domain values in B
[Figure: count for bucket vs. domain values 1 to 20]
Equi-Depth Histogram Construction
For histogram with b buckets, compute elements with rank n/b, 2n/b, ..., (b-1)n/b
Example: (n=12, b=4)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
After sort: 1 1 2 3 4 5 5 6 7 8 9 9
rank = 3 (.25-quantile)
rank = 6 (.5-quantile)
rank = 9 (.75-quantile)
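The construction in this example can be reproduced with a short offline sketch that sorts the data and reads off the elements at ranks n/b, 2n/b, ..., (b-1)n/b. Sorting needs the whole data set in memory, so this only illustrates the definition and is not a one-pass stream algorithm. The function name is illustrative, and n is assumed to be at least b.

```python
def equi_depth_boundaries(values, b):
    """Return the elements of rank n/b, 2n/b, ..., (b-1)n/b (ranks are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    # Rank k*n/b corresponds to 0-based index k*n//b - 1.
    return [ordered[(k * n) // b - 1] for k in range(1, b)]


stream = [9, 3, 5, 2, 7, 1, 6, 5, 8, 4, 9, 1]
boundaries = equi_depth_boundaries(stream, b=4)  # elements at ranks 3, 6, 9
```

For the stream above, the elements at ranks 3, 6, and 9 of the sorted sequence are 2, 5, and 7.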
Answering Queries Using Histograms [IP99]
• (Implicitly) map the histogram back to an approximate relation, and apply the query to the approximate relation
• Example: select count(*) from R where 4 <= R.e <= 15
– Count spread evenly among bucket values
– answer: 3.5 * BC, where BC is the count per bucket
• For equi-depth histograms, maximum error: 2 * BC
[Figure: histogram buckets over domain values 1 to 20, with the query range 4 <= R.e <= 15 marked]
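The "spread the count evenly among bucket values" rule can be sketched as below. The bucket boundaries of the original figure are not recoverable from these notes, so the boundaries used here are hypothetical, chosen only so that the query range covers 3.5 buckets as in the slide. The function name and the (low, high, count) bucket representation are also illustrative.

```python
def estimate_range_count(buckets, low, high):
    """Estimate count(*) for low <= e <= high over an integer domain.

    buckets: list of (bucket_low, bucket_high, count) with inclusive bounds;
             each bucket's count is assumed to be spread evenly over its values.
    """
    total = 0.0
    for bucket_low, bucket_high, count in buckets:
        overlap = min(bucket_high, high) - max(bucket_low, low) + 1
        if overlap > 0:
            width = bucket_high - bucket_low + 1
            total += count * overlap / width
    return total


BC = 4  # count per bucket (equi-depth); hypothetical value
# Hypothetical bucket boundaries over the domain 1..20.
buckets = [(1, 4, BC), (5, 8, BC), (9, 12, BC), (13, 14, BC), (15, 18, BC), (19, 20, BC)]
estimate = estimate_range_count(buckets, 4, 15)  # 0.25 + 1 + 1 + 1 + 0.25 buckets
```

Only the two buckets that the range cuts through contribute uncertainty; the fully covered buckets are counted exactly.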
Approximate Algorithms
– Quantiles Using Samples
– Quantiles from Synopses
– One pass algorithms for approximate samples …
– Much work in this area … omitted
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
One-Dimensional Haar Wavelets
• Wavelets: Mathematical tool for hierarchical decomposition of functions/signals
• Haar wavelets: Simplest wavelet basis, easy to understand and implement
– Recursive pairwise averaging and differencing at different resolutions

Resolution   Averages                    Detail Coefficients
3            [2, 2, 0, 2, 3, 5, 4, 4]    ----
2            [2, 1, 4, 4]                [0, -1, -1, 0]
1            [1.5, 4]                    [0.5, 0]
0            [2.75]                      [-1.25]

Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
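The recursive pairwise averaging and differencing can be sketched in a few lines. The sketch assumes the input length is a power of two and uses the unnormalized convention of the example: each average is (a + b) / 2 and each detail coefficient is (a - b) / 2. The function name is illustrative.

```python
def haar_decompose(values):
    """Return [overall average, detail coefficients from coarsest to finest]."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = [(averages[i], averages[i + 1]) for i in range(0, len(averages), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        # Coarser levels are computed later, so they go in front.
        details = level_details + details
    return averages + details


coefficients = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
```

The result has as many entries as the input, so the decomposition itself loses no information; compression comes only from dropping coefficients afterwards.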
Haar Wavelet Coefficients
• Hierarchical decomposition structure (a.k.a. “error tree”)
[Figure: error tree with root 2.75, then -1.25, then 0.5 and 0, then 0, -1, -1, 0, above the original frequency distribution 2 2 0 2 3 5 4 4]
[Figure: coefficient “supports”, showing for each coefficient the range of data values it contributes to and whether it contributes with sign + or -]
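One way to read the error tree: each original value equals the root average plus or minus every detail coefficient on its root-to-leaf path, with + in the left half of a coefficient's support and - in the right half. The sketch below reconstructs the data level by level under that reading. It assumes the coefficients are listed as overall average first, then details from coarsest to finest, with a power-of-two length; the function name is illustrative.

```python
def haar_reconstruct(coefficients):
    """Invert the unnormalized Haar decomposition."""
    values = [coefficients[0]]  # start from the overall average
    position = 1
    while position < len(coefficients):
        details = coefficients[position:position + len(values)]
        next_values = []
        for average, detail in zip(values, details):
            next_values.append(average + detail)  # left half of the support: +
            next_values.append(average - detail)  # right half of the support: -
        position += len(values)
        values = next_values
    return values


data = haar_reconstruct([2.75, -1.25, 0.5, 0, 0, -1, -1, 0])
```

For instance, the first data value is 2.75 + (-1.25) + 0.5 + 0 = 2, using only the four coefficients on its path.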
Compressed Wavelet Representations
Key idea: Use a compact subset of Haar/linear wavelet coefficients for approximating frequency distribution
Steps
– Compute cumulative frequency distribution C
– Compute linear wavelet transform of C
– Greedy heuristic methods
  • Retain coefficients leading to large error reduction
  • Throw away coefficients that give small increase in error
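A much simpler stand-in for the greedy step is plain thresholding: keep the k coefficients with the largest normalized magnitude and zero the rest. The sketch below applies it to unnormalized Haar coefficients of the raw frequencies, not to the linear wavelet transform of the cumulative distribution named in the steps above, so it should be read as an illustration of coefficient pruning only. The normalization by the square root of the support size and both function names are assumptions of this sketch.

```python
import math


def support_size(index, n):
    """Number of data values influenced by the coefficient at `index`
    (error-tree order: index 0 is the overall average)."""
    if index == 0:
        return n
    return n >> (index.bit_length() - 1)


def keep_largest(coefficients, k):
    """Zero all but the k coefficients with the largest normalized magnitude."""
    n = len(coefficients)
    ranked = sorted(
        range(n),
        key=lambda i: abs(coefficients[i]) * math.sqrt(support_size(i, n)),
        reverse=True,
    )
    kept = set(ranked[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coefficients)]


compressed = keep_largest([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], k=4)
```

Weighting by support size matters because a coarse coefficient affects many data values while a fine one affects only two.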
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
Sketches
Conventional data summaries fall short:
– Quantiles and 1-d histograms: Cannot capture attribute correlations
– Samples (e.g., using Reservoir Sampling) perform poorly for joins
– Multi-d histograms/wavelets: Construction requires multiple passes over the data
Different approach: Randomized sketch synopses
– Only logarithmic space
– Probabilistic guarantees on the quality of the approximate answer
– Can handle extreme cases.
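These notes do not name a specific sketch. As one illustration of the idea, the following is a minimal sketch in the style of the AMS estimator for the self-join size (the sum of squared value frequencies): each counter adds a pseudo-random +1 or -1 per arriving value, and the squared counters are combined by a median of means. The class name, the parameter defaults, and the polynomial hash used to derive the signs are all assumptions of this example, and stream values are assumed to be non-negative integers.

```python
import random
from statistics import median

PRIME = 2 ** 31 - 1  # modulus for the hash polynomials


class AmsSketch:
    def __init__(self, groups=5, per_group=16, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        # One random cubic polynomial per counter; its low bit supplies the sign.
        self.polys = [[rng.randrange(PRIME) for _ in range(4)]
                      for _ in range(groups * per_group)]
        self.counters = [0] * (groups * per_group)

    def _sign(self, j, value):
        a0, a1, a2, a3 = self.polys[j]
        h = (a0 + value * (a1 + value * (a2 + value * a3))) % PRIME
        return 1 if h & 1 else -1

    def add(self, value):
        """Process one stream element."""
        for j in range(len(self.counters)):
            self.counters[j] += self._sign(j, value)

    def estimate_self_join_size(self):
        """Median over groups of the mean squared counter."""
        means = []
        for g in range(self.groups):
            block = self.counters[g * self.per_group:(g + 1) * self.per_group]
            means.append(sum(c * c for c in block) / self.per_group)
        return median(means)


sketch = AmsSketch()
for _ in range(5):
    sketch.add(7)
single_value_estimate = sketch.estimate_self_join_size()
```

The synopsis is a fixed number of counters regardless of stream length or domain size; more counters per group tighten the error and more groups raise the confidence. With a single distinct value every counter is +5 or -5, so the estimate is exactly 25 in that case.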
Overview
Windows: logical, physical (covered)
Samples: Answering queries using samples
Histograms: Equi-depth histograms, On-line quantile computation
Wavelets: Haar-wavelet histogram construction & maintenance
Sketches
QoS by load shedding.
QoS and Load Shedding
When the input stream rate exceeds system capacity, a stream manager can shed load (tuples)
Load shedding affects queries and their answers: drop the tasks and the tuples that will cause least loss
Introducing load shedding in a data stream manager is a challenging problem
Random load shedding or semantic load shedding
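Random load shedding can be sketched as a filter that keeps each tuple with probability capacity / arrival rate whenever the arrival rate exceeds capacity. The function name, the rate parameters, and the list-based interface are illustrative; a real stream manager would measure rates continuously and decide where in the query plan to drop. Semantic load shedding, which chooses tuples by their content, is not shown.

```python
import random


def shed_randomly(tuples, arrival_rate, capacity, rng=None):
    """Keep each tuple independently with probability min(1, capacity / arrival_rate).

    arrival_rate and capacity are in the same unit (e.g. tuples per second);
    arrival_rate is assumed to be positive.
    """
    if rng is None:
        rng = random.Random()
    keep_probability = min(1.0, capacity / arrival_rate)
    return [t for t in tuples if rng.random() < keep_probability]


batch = list(range(100))
not_overloaded = shed_randomly(batch, arrival_rate=100, capacity=200)  # nothing dropped
overloaded = shed_randomly(batch, arrival_rate=400, capacity=100, rng=random.Random(1))
```

Because the surviving tuples form a random sample, the sampling-based estimators from earlier in these notes apply to the reduced stream.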
Load Shedding in Aurora
QoS for each application as a function relating output to its utility
– Delay based, drop based, value based
Techniques for introducing load shedding operators in a plan such that QoS is disrupted the least
– Determining when, where and how much load to shed
Load Shedding in STREAM
Formulate load shedding as an optimization problem for multiple sliding window aggregate queries
– Minimize inaccuracy in answers subject to output rate matching or exceeding arrival rate
Consider placement of load shedding operators in query plan
– Each operator sheds load uniformly with probability p_i
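To see how shedding at several points in a plan affects an aggregate, the sketch below reads the probability attached to each load shedding operator as a drop probability and assumes the operators along a path drop independently. Both are assumptions of this example and are not stated in the notes. A tuple then survives the path with the product of the individual keep probabilities, and a sum over the survivors can be scaled up by the inverse of that product. Function names are illustrative.

```python
def survival_probability(drop_probabilities):
    """Probability that a tuple passes every shedder on its path."""
    survival = 1.0
    for p in drop_probabilities:
        survival *= 1.0 - p
    return survival


def scaled_sum(surviving_values, drop_probabilities):
    """Estimate the sum over all tuples from the tuples that survived."""
    return sum(surviving_values) / survival_probability(drop_probabilities)


path = [0.5, 0.5]  # two shedders, each dropping half of its input
survival = survival_probability(path)
estimated_sum = scaled_sum([3, 5], path)
```

The lower the survival probability of a path, the more each surviving tuple is magnified, which is why the placement of the shedders affects the inaccuracy of each query differently.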
References
[BDM02] B. Babcock, M. Datar, R. Motwani. “Sampling from a moving window over streaming data”. Proceedings of the thirteenth annual ACM-SIAM Symposium on Discrete Algorithms, p. 633–634, 2002.
[BOZ08] V. Braverman, R. Ostrovsky, C. Zaniolo. “Succinct Sampling on Streams”. Submitted for publication.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving Approximate Query Answers”. ACM SIGMOD 1998.
[BDM04] B. Babcock, M. Datar, R. Motwani. “Load Shedding for Aggregation Queries over Data Streams”. ICDE 2004: 350-361.
[LawZ08] Y.-N. Law and C. Zaniolo. “Improving the Accuracy of Continuous Aggregates and Mining Queries on Data Streams under Load Shedding”. International Journal of Business Intelligence and Data Mining, 2008.