Presented by Srikanth Vadada, Fall 2010 - CSE 6339, 23rd Sep 2010
University of Texas at Arlington
Presented By
Srikanth Vadada
Fall 2010 - CSE 6339
23rd Sep 2010
Dynamic Sample Selection for Approximate Query Processing
Brian Babcock, Surajit Chaudhuri, Gautam Das
ACM SIGMOD 2003
3 Key Terms for the Topic
Dynamic Selection
Biased Sample
Approximate Query Processing (AQP)
Goal: Dynamically construct an appropriate Biased Sample for Approximate Query Processing
Why Approximate Query Processing (AQP)?
Rapid strides in Data Collection & Management Technologies
Resulting in very large Databases
Effective Data Analysis methods - Ongoing Research
Analysis Queries require aggregation or summarization
Expensive Running Times
Requirements of Analysis Systems (Decision Support Systems)
Short Query Response Time
Exactness of Query Results is less important
AQP Techniques are the Solution
Why Sampling Techniques?
Data Analysis Queries
OLAP - Materialized Views of Data Cubes
Building Indexes
Physical Data Design - Uses Preprocessing Time & Space
Effective when the Query Workload is known in advance
Expensive to build indexes for all possible queries
AQP & Physical Database Design Methods are Complementary
Approximate Query Processing (Related Work)
Online Aggregation
Hellerstein J., Haas P., Wang H. Online Aggregation, ACM SIGMOD 1997
Approximate answers are produced during the early stages of query processing
Gradual refinement until all the data is processed
Advantages: No pre-processing required. Allows progressive refinement of answers at runtime.
Disadvantages: Requires random disk access (slow). Requires query processor code changes.
Approximate Query Processing (Related Work)
Join Synopses - Sampling based method
Acharya S., Gibbons P. B., Poosala V., Ramaswamy S. Join Synopses for Approximate Query Answering, ACM SIGMOD 1999
Join Queries – Primary Key Joins
Pre Computation - Join of Fact Tables with Dimension Tables
Disadvantages: Does not extend to queries that involve non-foreign key joins
Approximate Query Processing (Related Work)
Icicles - Weighted Sampling method
Ganti V., Lee M. L., Ramakrishnan R. ICICLES: Self-tuning Samples for Approximate Query Answering, VLDB, 2000.
Frequency of Tuple Access by Queries in Workload
Disadvantages: Addresses only low selectivity problem
Dynamic Sample Selection
“Appropriately Biased” Samples
Give accurate approximate results for most queries
The Appropriately Biased Sample varies from Query to Query
Previous Sampling Methods vs Dynamic Sample Selection
Previous sampling - Single Sample with fixed bias
Dynamic Sample Selection - Individually Tailored Sample for each query
Creation of subsamples is done offline - Preprocessing
Assembly into an overall sample is done online - Runtime
Effectiveness of Biased Sampling - Example
Database consists of 100 Product Tuples
Product = “Stereo” - 90 Tuples
Product = “TV” - 10 Tuples
Sampling 10 tuples in 2 ways
1st Sample - 10% of the tuples chosen uniformly, each with weight 10
2nd Sample - 0% of the “Stereo” tuples and 100% of the “TV” tuples, each with weight 1
Query - Count of “TV” Tuples
Which sample always gives the correct answer?
2nd Sample - Always gives the exact answer
1st Sample - Only if exactly 1 of the TV tuples is chosen
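The arithmetic behind this example can be checked with a short sketch. The function and constant names are illustrative and not taken from the paper, and the sketch assumes the uniform sample draws 10 of the 100 tuples without replacement.

```python
from math import comb

N_STEREO, N_TV = 90, 10   # tuples per product, from the example
SAMPLE_SIZE = 10          # both sampling schemes keep 10 tuples
N_TOTAL = N_STEREO + N_TV


def uniform_estimate(tv_in_sample: int) -> int:
    """COUNT estimate from the uniform sample: each sampled tuple has weight 10."""
    weight = N_TOTAL // SAMPLE_SIZE
    return tv_in_sample * weight


def prob_tv_in_uniform_sample(k: int) -> float:
    """Hypergeometric probability that exactly k of the 10 sampled tuples are TVs."""
    return comb(N_TV, k) * comb(N_STEREO, SAMPLE_SIZE - k) / comb(N_TOTAL, SAMPLE_SIZE)


# Biased sample: all 10 TV tuples, each with weight 1, so the count is always exact.
biased_estimate = N_TV * 1

# Uniform sample: the estimate equals 10 only when exactly one TV tuple is drawn.
p_exact = prob_tv_in_uniform_sample(1)

print("biased estimate:", biased_estimate)
print("uniform estimate when 1 TV tuple is sampled:", uniform_estimate(1))
print("probability the uniform estimate is exact:", round(p_exact, 3))
```

The uniform sample is exact in well under half of the draws, while the biased sample is exact every time for this particular query.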
Static vs Dynamic Sampling Selection
[Figure: Standard Sampling (DATA with a single SAMPLE) vs Dynamic Sampling (DATA with multiple SAMPLEs)]
Dynamic Sample Selection Architecture (1/3)
Pre-Processing Phase
Standard Sampling Methods do not take advantage of extra Disk space
Dynamic Sample Selection (DSS) uses this space effectively by creating a large sample containing a family of differently biased subsamples
Step 1 - Examine the Data Distribution to create a set of biased samples; this results in Overlapping Strata
Step 2 - Samples are created with potentially different sampling rates for each stratum. Generate Metadata describing the characteristics of each sample
[Figure: pre-processing phase, with boxes labeled Query Workload, Data, Select Strata, Build Sample, Sample Data, Meta-Data]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
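The two pre-processing steps can be sketched as follows. This is a minimal illustration, assuming strata are supplied as named predicates with a per-stratum sampling rate; `build_samples`, the stratum names, and the metadata fields are hypothetical and are not prescribed by the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


def build_samples(
    rows: List[Row],
    strata: Dict[str, Tuple[Callable[[Row], bool], float]],
    seed: int = 0,
) -> Tuple[Dict[str, List[Row]], Dict[str, Dict[str, float]]]:
    """Build one sample table per stratum and record metadata about it.

    `strata` maps a stratum name to (predicate, sampling_rate). Strata may
    overlap, so a row can land in more than one sample table.
    """
    rng = random.Random(seed)
    samples: Dict[str, List[Row]] = {}
    metadata: Dict[str, Dict[str, float]] = {}
    for name, (predicate, rate) in strata.items():
        members = [row for row in rows if predicate(row)]
        # Bernoulli sampling: keep each member independently with probability `rate`.
        samples[name] = [row for row in members if rng.random() < rate]
        metadata[name] = {
            "rate": rate,
            "stratum_size": len(members),
            "sample_size": len(samples[name]),
        }
    return samples, metadata


# Toy data: 90 "Stereo" rows and 10 "TV" rows.
rows = [{"product": "Stereo"}] * 90 + [{"product": "TV"}] * 10
strata = {
    "all_rows": (lambda r: True, 0.10),                 # uniform 10% sample
    "tv_only": (lambda r: r["product"] == "TV", 1.0),   # keep every TV row
}
samples, metadata = build_samples(rows, strata)
```

The metadata dictionary is what a runtime component would consult to decide which sample tables fit an incoming query.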
Dynamic Sample Selection Architecture (2/3)
Runtime Phase
When Queries are issued at Runtime, DSS rewrites them to run against sample tables
The appropriate Sample Tables to use are determined by comparing the Query with the Meta-Data
Algorithms for choosing which samples to build during pre-processing and which samples to use for Query Processing are not described
[Figure: runtime phase, with boxes labeled Query, Meta-Data, Sample Data, Choose Samples, Rewrite Query]
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
Dynamic Sample Selection Architecture (3/3)
Policies for Sample Selection
Choice of Sample guided by the syntax of the incoming Query
Examples
Separate sample for each table; choose the sample based on the FROM clause
Separate sample for each pre-specified aggregate expression; choose the sample based on the SELECT clause
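The first policy can be sketched as a lookup keyed on the table named in the FROM clause. The mapping `SAMPLE_FOR_TABLE`, the sample table name, and the scale factor are hypothetical, and the regular expressions only handle a single-table FROM clause with `count(*)`; this is not a SQL parser.

```python
import re

# Hypothetical metadata: base table -> (sample table, scale factor = 1 / sampling rate).
SAMPLE_FOR_TABLE = {
    "employee_tbl": ("employee_tbl_sample", 10),
}

FROM_CLAUSE = re.compile(r"\bfrom\s+(\w+)", re.IGNORECASE)
COUNT_STAR = re.compile(r"count\(\*\)", re.IGNORECASE)


def rewrite_query(sql: str) -> str:
    """Pick a sample from the FROM clause and rewrite the query to use it."""
    match = FROM_CLAUSE.search(sql)
    if match is None:
        return sql  # no FROM clause: nothing to rewrite
    entry = SAMPLE_FOR_TABLE.get(match.group(1).lower())
    if entry is None:
        return sql  # no sample for this table: run against the base table
    sample_table, scale = entry
    # Replace the table name, then scale COUNT(*) back up to the full table.
    rewritten = sql[: match.start(1)] + sample_table + sql[match.end(1):]
    return COUNT_STAR.sub(f"count(*) * {scale}", rewritten)


query = "select Age, Income, count(*) from Employee_Tbl group by Age, Income"
rewritten = rewrite_query(query)
print(rewritten)
```

A policy keyed on the SELECT clause would follow the same shape, with the lookup keyed on the aggregate expression instead of the table name.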
Motivation - Small Group Sampling (1 / 2)
Motivation - Small Group Sampling ( 2 / 2)
[Figure labels: Large Groups, Small Groups]
Small Group Sampling - Example
Example: Aggregation queries with “group-bys”
select Age, Income, count(*)
from Employee_Tbl
group by Age, Income
Age  Income (in 1000s)  Designation
30   60                 Developer
35   70                 Developer
25   60                 Developer
25   70                 Lead
30   70                 Lead
25   70                 Developer
25   70                 Developer
30   60                 Developer
30   60                 Developer
40   70                 Developer
35   100                Manager
25   60                 Developer
25   70                 Developer
35   100                Developer
30   60                 Developer
Small Group Sampling – Illustration (1/3)
Overall sample - perform uniform sampling on large groups.
Small group tables - one or more sample tables for smaller groups.
Pre-Processing Phase:
1. Create an overall sample s_overall
2. “Age” Histogram (Column Index: 0)
r: Base Sampling rate, determines the size of the Overall Sample (e.g., 30%)
t: Small group fraction, maximum size of each small group table (e.g., 20%)
[Figure: small group table s_age and overall sample s_overall]
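The pre-processing phase can be sketched on the Employee_Tbl rows from the example. The sketch assumes that a value belongs to a small group when it is among the rarest values of its column and those rarest values together cover at most a fraction t of the rows. Function and variable names are illustrative, and the Designation column is left out.

```python
import random
from collections import Counter

# (Age, Income) pairs from the example table.
ROWS = [
    (30, 60), (35, 70), (25, 60), (25, 70), (30, 70),
    (25, 70), (25, 70), (30, 60), (30, 60), (40, 70),
    (35, 100), (25, 60), (25, 70), (35, 100), (30, 60),
]
COLUMNS = ["Age", "Income"]  # column index 0 -> bit 001, index 1 -> bit 010


def small_values(rows, col, t):
    """Rarest values of column `col` whose rows total at most t * len(rows)."""
    hist = Counter(row[col] for row in rows)
    budget = t * len(rows)
    chosen, used = set(), 0
    # Walk the histogram from the rarest value upward until the budget is spent.
    for value, freq in sorted(hist.items(), key=lambda kv: (kv[1], kv[0])):
        if used + freq > budget:
            break
        chosen.add(value)
        used += freq
    return chosen


def preprocess(rows, r=0.30, t=0.20, seed=0):
    """Return (small group tables, overall sample); every row carries a bitmask."""
    small = [small_values(rows, c, t) for c in range(len(COLUMNS))]

    def bitmask(row):
        # Bit c is set when the row's value in column c is a small-group value.
        return sum(1 << c for c in range(len(COLUMNS)) if row[c] in small[c])

    tagged = [row + (bitmask(row),) for row in rows]
    # Small group tables are not sampled: they keep all of their rows.
    tables = {
        "s_" + name.lower(): [row for row in tagged if row[-1] & (1 << c)]
        for c, name in enumerate(COLUMNS)
    }
    # Overall sample: uniform sample of roughly r * N rows.
    rng = random.Random(seed)
    s_overall = rng.sample(tagged, max(1, int(r * len(rows))))
    return tables, s_overall


tables, s_overall = preprocess(ROWS)
print(tables["s_age"], tables["s_income"], len(s_overall))
```

Under this assumption and t = 20%, the sketch puts the Age = 40 row in s_age and the two Income = 100 rows in s_income; a different threshold rule would give different small group tables.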
Small Group Sampling – Illustration (2/3)
Pre-Processing Phase:
“Income” Histogram (Column Index: 1)
[Figure: small group table s_income and overall sample s_overall; bitmask values shown: 011, 011]
Small Group Sampling – Illustration (3/3)
Runtime Phase:
Query Issued -
SELECT Age, Income, count(*) FROM Employee_Tbl GROUP BY Age, Income
Rewritten Query -
SELECT Age, Income, count(*) FROM s_age GROUP BY Age, Income
UNION ALL
SELECT Age, Income, count(*) FROM s_income WHERE Bitmask & 1 = 0 GROUP BY Age, Income
/* 1 = 2^0, i.e., 001 (e.g., 010 & 001 = 000, row kept; 011 & 001 = 001, row skipped) */
UNION ALL
SELECT Age, Income, count(*) * (100.0/30) FROM s_overall WHERE Bitmask & 3 = 0 GROUP BY Age, Income
/* 3 = 2^0 + 2^1, i.e., 011 (e.g., 001 & 011 = 001; 011 & 011 = 011; 010 & 011 = 010; all nonzero, rows skipped) */
[Figure: s_age (Column Index: 0 - 001), s_income (Column Index: 1 - 010), s_overall]
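The rewritten query can be tried end to end with Python's built-in sqlite3 module. The table contents below are assumptions for illustration: the small group tables hold the rows with Age = 40 and Income = 100, and s_overall is a hand-picked set of rows standing in for a 30% uniform sample. Parentheses are added around the bitmask tests for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in ("s_age", "s_income", "s_overall"):
    cur.execute(f"CREATE TABLE {table} (Age INTEGER, Income INTEGER, Bitmask INTEGER)")

# Bit 0 (001) marks a small Age group, bit 1 (010) a small Income group.
cur.executemany("INSERT INTO s_age VALUES (?, ?, ?)", [(40, 70, 1)])
cur.executemany("INSERT INTO s_income VALUES (?, ?, ?)", [(35, 100, 2), (35, 100, 2)])
# Hand-picked stand-in for a 30% uniform sample of the 15-row table.
cur.executemany(
    "INSERT INTO s_overall VALUES (?, ?, ?)",
    [(30, 60, 0), (25, 70, 0), (35, 100, 2), (25, 60, 0)],
)

REWRITTEN = """
SELECT Age, Income, COUNT(*) AS cnt FROM s_age
GROUP BY Age, Income
UNION ALL
SELECT Age, Income, COUNT(*) FROM s_income
WHERE (Bitmask & 1) = 0          -- skip rows already counted from s_age
GROUP BY Age, Income
UNION ALL
SELECT Age, Income, COUNT(*) * (100.0 / 30) FROM s_overall
WHERE (Bitmask & 3) = 0          -- skip rows covered by any small group table
GROUP BY Age, Income
"""

# Each (Age, Income) group has one bitmask value, so it shows up in one branch only.
answer = {(age, income): cnt for age, income, cnt in cur.execute(REWRITTEN)}
conn.close()

for group, cnt in sorted(answer.items()):
    print(group, round(cnt, 2))
```

Groups answered from the small group tables come back with exact counts, while groups answered from s_overall are scaled estimates.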
Accuracy Metrics (1/2)
As many groups as possible should be preserved in the approximate answer
Error in the aggregate value for each group should be small
Q = Aggregation Query
Let $G = \{g_1, g_2, g_3, \dots, g_n\}$ be the set of n groups in the answer to Q
$x_i$ = aggregate value for group $g_i$
A = Approximate Answer to Q
Let $G' = \{g_{i_1}, g_{i_2}, g_{i_3}, \dots, g_{i_m}\}$ be the set of m groups in A
$x'_{i_j}$ = aggregate value for group $g_{i_j}$ in A
Accuracy Metrics (2 / 2)
Percentage of Groups from Q missed by A
Average relative error on Q of A
Average squared relative error on Q of A
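The formulas for these three metrics did not survive on this slide. The sketch below assumes a reading in which every group missed by A contributes a relative error of 1 (100%) and the averages are taken over all n groups of Q. The function name, the metric name SqRelErr, and the dictionary representation are illustrative.

```python
from typing import Dict, Hashable


def accuracy_metrics(exact: Dict[Hashable, float], approx: Dict[Hashable, float]) -> Dict[str, float]:
    """Compare an approximate group-by answer against the exact one.

    exact  maps each group g_i of Q to its aggregate value x_i (nonzero).
    approx maps each group found by A to its estimated value x'_i.
    Assumption: a missed group counts as a relative error of 1.
    """
    n = len(exact)
    found = [g for g in exact if g in approx]
    missed = n - len(found)  # n - m
    rel = sum(abs(exact[g] - approx[g]) / abs(exact[g]) for g in found)
    sq = sum(((exact[g] - approx[g]) / exact[g]) ** 2 for g in found)
    return {
        "PctGroups": 100.0 * missed / n,   # percentage of groups missed
        "RelErr": (missed + rel) / n,      # average relative error
        "SqRelErr": (missed + sq) / n,     # average squared relative error
    }


# Toy answers: group "d" is missed, group "b" is off by 25%.
exact = {"a": 10, "b": 20, "c": 5, "d": 1}
approx = {"a": 10, "b": 15, "c": 5}
metrics = accuracy_metrics(exact, approx)
print(metrics)
```

Counting a missed group as a full error keeps a method from looking accurate simply because it drops the groups it estimates badly.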
Experimental Results ( 1 / 2)
TPC-H Database, Count & Sum queries, Number of Columns in all Tables = 245
RelErr and PctGroups increased for both Uniform Sampling and Small Group Sampling
The increase was more pronounced for Uniform Sampling
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
Experimental Results ( 2 / 2)
Uniform Sampling outperforms Small group sampling at low skews
Small group sampling does better at moderate to high skew
Speedup decreases as the number of grouping columns increases
Fig Source: Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
Conclusions
Dynamic Sample Selection improves on previous AQP Methods
Productively utilizes additional Disk Space
Small Group Sampling targets aggregate queries with group-bys
Small Group sampling outperforms other techniques
References
Babcock B., Chaudhuri S. and Das G. Dynamic Sample Selection for Approximate Query Processing, ACM SIGMOD 2003.
http://crystal.uta.edu/~cse6339/Fall08DBIR.htm
http://crystal.uta.edu/~cse6339/Fall09DBIR.htm
Q&A
Questions?