![Page 1: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/1.jpg)
Dynamic Sample Selection for Approximate Query Processing
Brian BabcockStanford University
Surajit Chaudhuri
Microsoft Research
Gautam DasMicrosoft Research
![Page 2: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/2.jpg)
Why Approximation is Useful Large data warehouses
Gigabytes to terabytes of data Data analysis applications
Decision support Data Mining
Query characteristics: Access large fraction of database Seek to identify general patterns / trends Absolute precision unnecessary
$89,000 after 5 secs vs. $89,034.57 after 2 hrs
![Page 3: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/3.jpg)
Two Phases of Approximate Query Processing (AQP)
1. Offline pre-processing of the database E.g. generate histograms or random
samples OK to use considerable space and time
(hours)
2. Runtime query processing Query answers must be fast (seconds) Only time to access small amount of data E.g. extrapolate from random sample
![Page 4: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/4.jpg)
AQP Example
Product
Amount
CPU 1
CPU 1
CPU 2
CPU 3
CPU 4
Disk 1
Disk 2
Monitor 1
Product
Amount
CPU 1
CPU 2
CPU 3
Disk 2
Sales SalesSample
SELECT SUM(Amount) FROM SalesWHERE Product = 'CPU'
Exact Answer: 1+1+2+3+4 = 11
Approx. Answer: (1+2+3)*2= 12
![Page 5: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/5.jpg)
Non-uniform Sampling “Biased” samples often more accurate
than uniform samples All data records are not created equal
Frequently queried values Extreme high and low values Uncommon values
Optimal bias differs from query to query Past work: carefully select biased sample
to give good answers for many queries
![Page 6: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/6.jpg)
Related Work Non-sampling-based approaches
Online Aggregation Hellerstein, Haas, and Wang 97 Histograms Ioannidis and Poosala 99 Wavelets Chakrabarti, Garofalakis, Rastogi, and Shim 00
Sampling-based approaches AQUA project Acharya, Gibbons, and Poosala 99 Congressional Acharya, Gibbons, and Poosala 00 Self-Tuning Ganti, Lee, and Ramakrishnan 00 Outliers Chaudhuri, Das, Datar, Motwani, and Narasayya
01 Workload Chaudhuri, Das, and Narasayya 01
![Page 7: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/7.jpg)
Dynamic Sample Selection
DA
TA
SA
MPLE
Standard Sampling
DA
TA
SA
MPLE
SA
MPLE
SA
MPLE
SA
MPLE
Dynamic Sample Selection
![Page 8: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/8.jpg)
Dynamic Sample Selection
Construct many differently-biased samples For each query, use the best sample and
ignore the others
How to pick a good set of samples?
Given a query, what’s the best sample?
Improved accuracy, no change to query time Query time is the scarce resource OK to use extra pre-processing, disk space
![Page 9: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/9.jpg)
Small vs. Large Groups Consider group-by aggregation queries.
E.g. Total sales of CPUs in each state E.g. Avg sale price for each product in each state
Number of records per group may vary widely
Problem: Rare values are under-represented in uniform sample “California” much more common than “Alaska” “Alaska” only appears a few times in the sample Approximate answer for “Alaska” likely to be bad
In a group-by query, small groups are hard
![Page 10: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/10.jpg)
Small Group Sampling
Large Groups: Use Uniform Random Sample
Well-represented in sample Good quality of approximation
Main idea: Treat small and large groups differently
![Page 11: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/11.jpg)
Small Group Sampling
Small Groups: Use Original Data Contain few records, by definition Thus can be scanned very quickly
Main idea: Treat small and large groups differently
![Page 12: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/12.jpg)
Small Group Sampling
Small groups are query-dependent Depend on grouping attributes Depend on selection predicates
How do we know which rows to scan to find the small groups?
Main idea: Treat small and large groups differently
![Page 13: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/13.jpg)
Finding the Small Groups Heuristic idea:
Most small groups in most queries have a rare value for at least one grouping
attribute Small group in this query rare value in entire
DB Not always true (snowblower sales in California)
Summary of Small Group Sampling: Identify rare values during pre-processing Store rows with rare values in a different (small) table
for each attribute: the small groups tables At query time, scan small groups table for each
grouping attribute
![Page 14: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/14.jpg)
Pre-Processing Steps Create a table sample_all containing a
uniform random sample of all data For each attribute A in the schema:
Identify rare values for attribute A Create a table smGrps_A containing all
records with rare A values Size of smGrps_A table limited by threshold
(2:1 ratio between sample_all and smGrps)
smGrps_B
smGrps_D
smGrps_C
smGrps_Asample_all
![Page 15: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/15.jpg)
Pre-Processing Steps Augment rows in sample_all, smGrps_* with table
membership information Some rows may be added to multiple tables One extra bitmask column: which small group tables
contain this row? Used to avoid double-counting during query processing
smGrps_B
smGrps_D
smGrps_C
smGrps_Asample_all
DATA
![Page 16: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/16.jpg)
smGrps_B
smGrps_A
Answering Queries Using Small Group Sampling
Values of attribute AV
alu
es
of
att
rib
ute
B
Common RareC
om
mo
nR
ar
esample_all
![Page 17: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/17.jpg)
Query Answering Example Run query on small group table for each grouping attribute Run scaled query on sample_all Combine answers
SELECT A,B,COUNT(*)FROM FACT_TBLWHERE C=10GROUP BY A,B
SELECT A,B,COUNT(*) as cntFROM smGrps_A WHERE C=10 GROUP BY A,B UNION ALL
SELECT A,B,100 * COUNT(*) as cntFROM sample_allWHERE C=10 AND bitmask & 3 = 0GROUP BY A,B
SELECT A,B,COUNT(*) as cntFROM smGrps_B WHERE C=10 AND bitmask & 1 = 0GROUP BY A,B UNION ALL
![Page 18: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/18.jpg)
Experimental Setup Two data sources
Skewed version of TPC-H benchmark database Real-world database: 1 month of product sales
Randomly generated queries Compared different AQP methods
Small Group, Uniform, Basic Congress Each allowed to query same number of rows
Evaluating approximate answers Average relative error in approximate answer
across groups Number of groups absent from approximate
answer (not present in sample)
![Page 19: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/19.jpg)
Relative Error – TPC-H
0.00
0.20
0.40
0.60
0.80
1.00
1.20
1.40
1.60
1.80
1 2 3 4
Number of grouping columns
Ave
rag
e R
elat
ive
Err
or
SmGroup Uniform
![Page 20: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/20.jpg)
Groups Missed – TPC-H
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
1 2 3 4
Number of grouping columns
Per
cen
t o
f G
rou
ps
Mis
sed
SmGroup Uniform
![Page 21: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/21.jpg)
Relative Error – Sales Data
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
0.80
0.90
1.00
1 2 3 4
Number of grouping columns
Ave
rag
e R
elat
ive
Err
or
SmGroup
Basic Congress
Uniform
![Page 22: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/22.jpg)
Groups Missed – Sales Data
0%
5%
10%
15%
20%
25%
30%
35%
40%
1 2 3 4
Number of grouping columns
Per
cen
t o
f G
rou
ps
Mis
sed
SmGroup
Basic CongressUniform
![Page 23: Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research](https://reader036.vdocuments.us/reader036/viewer/2022062511/551affdf550346cf5a8b46b1/html5/thumbnails/23.jpg)
Summary Dynamic Sample Selection
Gain accuracy at the cost of disk space. Non-uniform samples are good, but different
ones are good for different queries. Build lots of different non-uniform samples. For each query, pick the best sample.
Small Group Sampling Treat large and small groups differently. Uniform sampling works well for large groups. Small groups are cheap to scan in their
entirety.