Top-K Query Evaluation on Probabilistic Data
Christopher Ré, Nilesh Dalvi and Dan Suciu
University of Washington
Evaluating Complex SQL on PDBs, 12/8/2006
High Level Overview
• DBMS: precise answers over clean data
• Data are often imprecise
  • Information integration
  • Information extraction
• Probabilistic DBs (PDBs) handle imprecision
  • Many low-quality answers
  • Top-K ranked by probability
• This talk: compute Top-K efficiently
Overview
• Motivating Example
• Query Processing Background
• Multisimulation
• Experimental Results
Example Application
IMDB
• Lots of interesting data about movies (e.g. actors, directors)
• Well maintained and clean
• But no reviews!
On the web there are lots of reviews
• How will I know which movie they are about? Alice needs to do information extraction and object reconciliation.
• Is a movie good or bad? Alice wants to do sentiment analysis.
A probabilistic database can help Alice store and query her uncertain data.
Example query: find all years where ‘Anthony Hopkins’ starred in a good movie
Imprecision is out there… Object Reconciliation

Data extracted from reviews:

| RID | Title |
|---|---|
| r124 | 12 Monkeys |
| r155 | Twelve Monkeys |
| r175 | 2 Monkey |
| r194 | Monk |

Clean IMDB data:

| MID | Title |
|---|---|
| m232 | 12 Monkeys |
| m143 | Monkey Love |

• Output: (RID, MID) pairs
• Fellegi-Sunter approach: score (s) each (RID, MID) pair
• Our approach: convert scores to probabilities
[Figure: score axis with two thresholds, t′ and t, and regions labeled “Match” and “No Match”]
Resulting match probabilities:

| RID | MID | Prob |
|---|---|---|
| r175 | m232 | 0.8 |
| r175 | m143 | 0.2 |
Query Processing Background
| RID | MID | Prob |
|---|---|---|
| r175 | m232 | 0.8 |
| r175 | m143 | 0.2 |

Query processing builds an event expression
• Intensional query processing [FR97]
• Associate an event with each tuple
• Probability that the event is satisfied = query value
Technical point: projection as the last operator implies the result is a DNF
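A minimal sketch of how a join followed by a final projection yields a DNF event per output tuple. The event variable names (`x1`, `y1`, ...) and the `review` rows are hypothetical and only illustrate the idea; the `match` rows reuse the RID/MID values from the table above.

```python
from itertools import product

# Hypothetical probabilistic tables: every tuple carries one Boolean event variable.
match = [("r175", "m232", "x1"), ("r175", "m143", "x2")]
review = [("r175", "good", "y1"), ("r175", "good", "y2")]  # made-up rows


def join_then_project(match, review):
    """Return {MID: DNF}; a DNF is a list of clauses, each a set of event variables."""
    dnf = {}
    for (rid_m, mid, e_m), (rid_r, rating, e_r) in product(match, review):
        if rid_m == rid_r and rating == "good":
            # Join: both input tuples must be present, so their events are ANDed.
            clause = frozenset({e_m, e_r})
            # Projection on MID: alternative derivations are ORed together.
            dnf.setdefault(mid, []).append(clause)
    return dnf


def format_dnf(clauses):
    """Render a DNF as text, e.g. (a ∧ b) ∨ (a ∧ c)."""
    return " ∨ ".join("(" + " ∧ ".join(sorted(c)) + ")" for c in clauses)


events = join_then_project(match, review)
```

Because the projection is applied last, each output tuple ends up with a plain OR of AND-clauses, which is the DNF shape that the sampling step works on.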
DNF Sampling at a High Level
• Estimate p(t), the probability that the DNF is satisfied
• Do this for each output tuple t
• #P-hard [Valiant79], even for conjunctive queries only [RDS06, DS04]
• Randomized approximation [LK84]
• Simulation reduces uncertainty
[Figure: an interval on the 0.0 to 1.0 axis showing the uncertainty about p(t)]
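A standard randomized scheme for DNF probability is the Karp-Luby estimator, assumed here to be what [LK84] points to. The sketch below is illustrative only: it assumes independent event variables and clauses made of positive literals, and the probabilities in the example are made up. `exact_probability` is a brute-force check that is only feasible for tiny DNFs.

```python
import random
from itertools import product


def clause_weight(clause, prob):
    """Probability that every variable in the clause is true (independence assumed)."""
    w = 1.0
    for v in clause:
        w *= prob[v]
    return w


def karp_luby(dnf, prob, n_samples, rng):
    """Estimate P(DNF) for positive clauses over independent variables."""
    weights = [clause_weight(c, prob) for c in dnf]
    total = sum(weights)
    variables = sorted(set().union(*dnf))
    hits = 0
    for _ in range(n_samples):
        # 1. Pick clause i with probability proportional to its weight.
        i = rng.choices(range(len(dnf)), weights=weights)[0]
        # 2. Sample a world conditioned on clause i being true.
        world = {v: (v in dnf[i]) or (rng.random() < prob[v]) for v in variables}
        # 3. Count the sample only if i is the first clause the world satisfies.
        first = next(j for j, c in enumerate(dnf) if all(world[v] for v in c))
        hits += first == i
    return total * hits / n_samples


def exact_probability(dnf, prob):
    """Brute-force P(DNF) by enumerating all worlds (exponential, for checking only)."""
    variables = sorted(set().union(*dnf))
    result = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        if any(all(world[v] for v in c) for c in dnf):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if world[v] else 1.0 - prob[v]
            result += weight
    return result


dnf = [frozenset({"x1", "y1"}), frozenset({"x1", "y2"})]
prob = {"x1": 0.8, "y1": 0.5, "y2": 0.6}  # hypothetical tuple probabilities
estimate = karp_luby(dnf, prob, 20000, random.Random(0))
exact = exact_probability(dnf, prob)
```

More samples give a tighter estimate, which is the sense in which simulation reduces the uncertainty about p(t).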
Naïve Query Processing
Naïve algorithm (PTIME): simulate until all intervals are ε-small
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled with ranks 1 to 4]
Can we do better?
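For reference, the naïve strategy can be sketched as follows. This is a toy model, not the authors' simulator: `simulate` is a hypothetical stand-in that halves an interval's slack around a hidden true value, whereas a real system would shrink the interval by drawing more Monte Carlo samples. The candidate names and probabilities are made up.

```python
def simulate(interval, p):
    """Toy simulation step: halve the interval's slack around the hidden value p."""
    lo, hi = interval
    return ((p + lo) / 2, (p + hi) / 2)


def naive_top_k(true_p, k, eps):
    """Shrink every interval below width eps, then rank by lower bound."""
    intervals = {name: (0.0, 1.0) for name in true_p}
    steps = 0
    while any(hi - lo > eps for lo, hi in intervals.values()):
        for name, (lo, hi) in intervals.items():
            if hi - lo > eps:
                intervals[name] = simulate((lo, hi), true_p[name])
                steps += 1
    ranked = sorted(intervals, key=lambda n: intervals[n][0], reverse=True)
    return ranked[:k], steps


# Hypothetical hidden probabilities for four output tuples.
true_p = {"t1": 0.875, "t2": 0.625, "t3": 0.25, "t4": 0.125}
top, steps = naive_top_k(true_p, k=2, eps=0.01)
```

Every interval is refined to the same precision, including those that are clearly outside the Top-K, and the ranking is only reliable when the true values differ by more than ε.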
A Better Method: Multisimulation
• Separate the Top-K with few simulations
  • Concentrate on the intervals in the Top-K
• Asymptotically, confidence intervals are nested
• Compare against OPT
  • OPT “knows” which intervals to simulate
[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, labeled with ranks 1 to 4]
The Critical Region
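The critical region is the interval (k-th highest min, (k+1)-st highest max), taken over the lower and upper endpoints of all confidence intervals.
[Figure: confidence intervals on the 0.0 to 1.0 axis with the critical region marked for k = 2]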
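The definition above translates directly into a few lines of code. The sketch assumes each confidence interval is a `(lo, hi)` pair and that there are more than k intervals; the example intervals are made up. It also assumes, as the stopping test, that the Top-K is separated once the region is empty, i.e. once its lower end is no smaller than its upper end.

```python
def critical_region(intervals, k):
    """Return (c, d): c is the k-th highest lower bound, d the (k+1)-st highest upper bound."""
    lows = sorted((lo for lo, _ in intervals), reverse=True)
    highs = sorted((hi for _, hi in intervals), reverse=True)
    return lows[k - 1], highs[k]


def is_separated(intervals, k):
    """Assumed stopping test: the critical region is empty."""
    c, d = critical_region(intervals, k)
    return c >= d


overlapping = [(0.7, 0.95), (0.4, 0.8), (0.3, 0.6), (0.05, 0.2)]
separated = [(0.7, 0.95), (0.5, 0.8), (0.3, 0.45), (0.05, 0.2)]
```

In `overlapping` the second and third intervals still overlap, so the region for k = 2 is non-empty; in `separated` the two best intervals lie entirely above the rest.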
Three Simple Rules: Rule 1
• Pick a “double crosser”
• OPT must pick this too
Three Simple Rules: Rule 2
• If all crossers are lower crossers (or all are upper crossers), pick a maximal one
• OPT must pick this too
Three Simple Rules: Rule 3
• Pick an upper and a lower crosser
• OPT may only pick 1 of these two
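The three rules can be combined into one selection loop. The sketch below is a toy illustration under several assumptions that are not stated on the slides: a crosser is taken to be an interval that strictly straddles an end of the critical region (c, d), a double crosser straddles both ends, "maximal" is approximated by the widest interval, a fallback handles exact ties, and `simulate` is a hypothetical stand-in that halves an interval's slack around a hidden true value. The candidate names and probabilities are made up.

```python
def simulate(interval, p):
    """Toy simulation step: halve the interval's slack around the hidden value p."""
    lo, hi = interval
    return ((p + lo) / 2, (p + hi) / 2)


def critical_region(intervals, k):
    """Return (k-th highest lower bound, (k+1)-st highest upper bound)."""
    lows = sorted((lo for lo, _ in intervals.values()), reverse=True)
    highs = sorted((hi for _, hi in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]


def choose(intervals, k):
    """Return the names to simulate next, or [] once the Top-K is separated."""
    c, d = critical_region(intervals, k)
    if c >= d:
        return []

    def width(name):
        return intervals[name][1] - intervals[name][0]

    items = intervals.items()
    double = [n for n, (lo, hi) in items if lo < c and d < hi]
    if double:  # Rule 1: a double crosser exists
        return [max(double, key=width)]
    lower = [n for n, (lo, hi) in items if lo < c < hi]
    upper = [n for n, (lo, hi) in items if lo < d < hi]
    if lower and upper:  # Rule 3: one lower and one upper crosser
        return [max(lower, key=width), max(upper, key=width)]
    if lower or upper:  # Rule 2: only one kind of crosser is left
        return [max(lower or upper, key=width)]
    # Assumed tie handling: no strict crosser, take the widest overlapping interval.
    return [max((n for n, (lo, hi) in items if lo < d and c < hi), key=width)]


def multisimulate(true_p, k):
    """Simulate only the chosen intervals until the critical region is empty."""
    intervals = {name: (0.0, 1.0) for name in true_p}
    steps = 0
    while True:
        picks = choose(intervals, k)
        if not picks:
            break
        for name in picks:
            intervals[name] = simulate(intervals[name], true_p[name])
            steps += 1
    c, _ = critical_region(intervals, k)
    return sorted(n for n, (lo, _) in intervals.items() if lo >= c), steps


# Hypothetical hidden probabilities for four output tuples.
true_p = {"t1": 0.875, "t2": 0.625, "t3": 0.25, "t4": 0.125}
top, steps = multisimulate(true_p, k=2)
```

The loop stops as soon as the Top-K set is separated from the rest, so intervals far from the critical region are refined only as much as needed.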
Multisimulation is a 2-Approx
• Thm: Multisimulation performs at most twice as many simulations as OPT
• And no deterministic algorithm can do better on every instance
Extensions
• Top-K Set (shown)
• Anytime (produce from 1 to k)
• Rank (produce top k ranked)
• All (rank all intervals)
Experiment Details: Uncertain tuples
| Table | # Tuples |
|---|---|
| StringMatch | 339k |
| ActorMatch | 6,758k |
| DirectorMatch | 18k |

| Table | # Tuples |
|---|---|
| Reviews | 292k |
Running Time
Query SS: “Find all years in which Anthony Hopkins was in a highly rated movie”
• Small number of output tuples (33)
• Small DNFs per output (avg. 20.4, max. 63)
Running Time
Query LL: “Find all directors who have a highly rated drama but low rated comedy”
• Large number of output tuples (1415)
• Large DNFs per output (avg. 234.8, max. 9088)
Conclusions
• Mystiq is a general-purpose probabilistic database
• Multisimulation and logical optimization are key to performance on large data sets
• Advert: demo on my laptop
Running Time
Query SL: “Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction”
• Small number of output tuples (33)
• Large DNFs per output (avg. 117.7, max. 685)
Running Time
Query LS: “Find all directors in the 80s who had a highly rated movie”
• Large number of output tuples (3259)
• Small DNFs per output (avg. 3.03, max. 30)