Top-K Query Evaluation on Probabilistic Data

Christopher Ré, Nilesh Dalvi and Dan Suciu

University of Washington

Evaluating Complex SQL on PDBs, 12/8/2006

High Level Overview

• DBMS: Precise answers over clean data

• Data are often imprecise: Information Integration, Information Extraction

• Probabilistic DBs (PDBs) handle imprecision

• Many low quality answers

• Top-K ranked by probability

This talk: Compute Top-K Efficiently

Overview

• Motivating Example

• Query Processing Background

• Multisimulation

• Experimental Results

Example Application

IMDB

• Lots of interesting data about movies (e.g. actors, directors)

• Well maintained and clean

• But no reviews!

On the web there are lots of reviews

How will I know which movie they are about?

Alice needs to do information extraction and object reconciliation.

Is a movie good or bad?

Alice wants to do sentiment analysis.

A probabilistic database can help Alice store and query her uncertain data.

Find all years where ‘Anthony Hopkins’ starred in a good movie

Imprecision is out there…

Object Reconciliation

Data extracted from Reviews:

RID   Title
r124  12 Monkeys
r155  Twelve Monkeys
r175  2 Monkey
r194  Monk

Clean IMDB Data:

MID   Title
m232  12 Monkeys
m143  Monkey Love

Output: (RID,MID) pairs

Fellegi-Sunter Approach: Score (s) each (RID,MID)

[Figure: score axis with thresholds t’ and t separating the No Match and Match regions]

Our Approach: Convert scores to probabilities

Match probabilities for (RID,MID) pairs:

RID   MID   Prob
r175  m232  0.8
r175  m143  0.2

Query Processing Background

RID   MID   Prob
r175  m232  0.8
r175  m143  0.2

Query Processing builds event expression

• Intensional Query Processing [FR97]

• Associate to each tuple an event

• Probability event is satisfied = query value

Technical Point: Projection as last operator implies result is a DNF
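
A minimal sketch of what "probability event is satisfied" means for a DNF event expression: the Python below computes it exactly by enumerating every possible world. The event names (`e1`, `e2`, `e3`), their probabilities, the independence of events, and the function name `dnf_probability` are all assumptions made for illustration; none of them come from the talk.

```python
from itertools import product

def dnf_probability(clauses, probs):
    """Exact probability that a monotone DNF is satisfied.

    clauses: list of sets of event names; each set is one conjunction.
    probs:   dict mapping each event name to its marginal probability.
    Events are assumed independent (an assumption of this sketch).
    """
    events = sorted(probs)
    total = 0.0
    # Enumerate every possible world (truth assignment to the events).
    for world in product([False, True], repeat=len(events)):
        truth = dict(zip(events, world))
        weight = 1.0
        for e in events:
            weight *= probs[e] if truth[e] else 1.0 - probs[e]
        # The DNF holds if at least one conjunction is fully true.
        if any(all(truth[e] for e in clause) for clause in clauses):
            total += weight
    return total

# Hypothetical expression: (e1 AND e2) OR (e1 AND e3)
example_probs = {"e1": 0.8, "e2": 0.5, "e3": 0.5}
example_clauses = [{"e1", "e2"}, {"e1", "e3"}]
p = dnf_probability(example_clauses, example_probs)
```

The loop visits 2^n worlds for n events, so this only works for tiny expressions; that cost is the reason for turning to sampling.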

DNF Sampling at a High Level

• Estimate p(t), the probability that the DNF is satisfied

• Do for each output tuple, t

• #P-Hard [Valiant79] even if only conjunctive queries [RDS06,DS04]

• Randomized Approximation [LK84]

Simulation reduces uncertainty

[Figure: interval on the 0.0 to 1.0 axis showing the uncertainty about p(t)]
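
The randomized approximation can be sketched with a Karp-Luby style estimator for a monotone DNF. This is an illustrative sketch, not the talk's implementation: it assumes independent events, and the names `karp_luby`, `clause_prob`, and the example events `e1`..`e3` are hypothetical.

```python
import random

def clause_prob(clause, probs):
    """Probability that every event in one conjunction is true."""
    p = 1.0
    for e in clause:
        p *= probs[e]
    return p

def karp_luby(clauses, probs, n_samples, rng):
    """Karp-Luby style estimate of P(monotone DNF), independent events."""
    weights = [clause_prob(c, probs) for c in clauses]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(n_samples):
        # 1. Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # 2. Sample a world conditioned on clause i being true.
        world = {e: (e in clauses[i]) or (rng.random() < p)
                 for e, p in probs.items()}
        # 3. Count the sample only if i is the first satisfied clause,
        #    so each satisfying world is counted exactly once.
        first = next(j for j, c in enumerate(clauses)
                     if all(world[e] for e in c))
        if first == i:
            hits += 1
    return total * hits / n_samples

# Hypothetical expression: (e1 AND e2) OR (e1 AND e3); exact value is 0.6.
example_probs = {"e1": 0.8, "e2": 0.5, "e3": 0.5}
example_clauses = [{"e1", "e2"}, {"e1", "e3"}]
estimate = karp_luby(example_clauses, example_probs, 20000, random.Random(0))
```

Each extra sample tightens the estimate, which is the sense in which simulation reduces uncertainty about p(t).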

Naïve Query Processing

Naïve algorithm (PTIME): Simulate until all intervals are “Epsilon”-small

[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, with labels 1, 3, 4, 2]
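
A rough sketch of the naïve strategy: simulate every tuple in lockstep until every interval is narrower than epsilon. Several things here are assumptions made only for illustration: a single Bernoulli trial stands in for one simulation step, the intervals are Hoeffding bounds with a fixed `delta`, and the probabilities attached to the four actors are made up.

```python
import math
import random

def interval(successes, n, delta=0.05):
    """Hoeffding confidence interval for a Bernoulli mean, clipped to [0, 1]."""
    if n == 0:
        return 0.0, 1.0
    mean = successes / n
    half = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - half), min(1.0, mean + half)

def naive_simulate(true_probs, eps, rng):
    """Simulate every tuple in lockstep until all intervals are eps-small."""
    names = list(true_probs)
    successes = {t: 0 for t in names}
    trials = {t: 0 for t in names}
    steps = 0

    def width(t):
        low, high = interval(successes[t], trials[t])
        return high - low

    while any(width(t) > eps for t in names):
        for t in names:
            # One simulation step: a single Bernoulli trial standing in
            # for one sampling step on the tuple's DNF.
            successes[t] += rng.random() < true_probs[t]
            trials[t] += 1
            steps += 1
    return {t: interval(successes[t], trials[t]) for t in names}, steps

# Hypothetical probabilities; only the actor names appear in the talk.
actors = {"Christopher Walken": 0.7, "Harvey Keitel": 0.4,
          "Samuel L. Jackson": 0.3, "Bruce Willis": 0.6}
bounds, steps = naive_simulate(actors, eps=0.1, rng=random.Random(0))
```

Every tuple pays the full cost of reaching width epsilon, even when its interval is already clearly inside or outside the Top-K.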

Can we do better?

A Better Method: Multisimulation

• Separate Top-K with few simulations

• Concentrate on intervals in Top-K

• Asymptotically, confidence intervals are nested

• Compare against OPT, which “knows” which intervals to simulate

[Figure: confidence intervals on the 0.0 to 1.0 axis for Christopher Walken, Harvey Keitel, Samuel L. Jackson and Bruce Willis, with labels 1, 3, 4, 2]

The Critical Region

The critical region is the interval (kth-highest min, (k+1)st-highest max)

[Figure: intervals on the 0.0 to 1.0 axis with the critical region marked, for k = 2]
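
The definition can be written down directly. In this small sketch the function names and example intervals are hypothetical, and the stopping test (the region is empty once its left end reaches its right end) is an assumption about how the region is used rather than something stated on the slide.

```python
def critical_region(intervals, k):
    """Return (c, d): kth-highest lower bound, (k+1)st-highest upper bound.

    intervals: list of (low, high) pairs, with len(intervals) > k.
    """
    lows = sorted((low for low, _ in intervals), reverse=True)
    highs = sorted((high for _, high in intervals), reverse=True)
    return lows[k - 1], highs[k]

def topk_separated(intervals, k):
    """Assumed stopping test: the critical region is empty (c >= d)."""
    c, d = critical_region(intervals, k)
    return c >= d

# Hypothetical intervals for k = 2.
overlapping = [(0.6, 0.9), (0.5, 0.8), (0.2, 0.7), (0.1, 0.4)]
separated = [(0.6, 0.9), (0.5, 0.8), (0.2, 0.45), (0.1, 0.4)]
region = critical_region(overlapping, 2)
```

In the first example the third interval still reaches above the second-highest lower bound, so the region is non-empty; in the second example the top two intervals lie entirely above the other two.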

Three Simple Rules: Rule 1

• Pick a “Double Crosser”

• OPT must pick this too

Three Simple Rules: Rule 2

• If all crossers are lower crossers, or all are upper crossers, pick the maximal one

• OPT must pick this too

Three Simple Rules: Rule 3

• Pick an upper and a lower crosser

• OPT may only pick 1 of these two
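
The sketch below is not the three-rule procedure. It is a simplified stand-in that only illustrates the underlying idea of spending simulations on intervals that still touch the critical region and stopping once the region is empty. The Hoeffding intervals, the Bernoulli trial used as a simulation step, the `max_steps` cap, and the probabilities are all assumptions made for illustration.

```python
import math
import random

def interval(successes, n, delta=0.05):
    """Hoeffding confidence interval for a Bernoulli mean, clipped to [0, 1]."""
    if n == 0:
        return 0.0, 1.0
    mean = successes / n
    half = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - half), min(1.0, mean + half)

def critical_region(bounds, k):
    """(kth-highest lower bound, (k+1)st-highest upper bound)."""
    lows = sorted((lo for lo, _ in bounds), reverse=True)
    highs = sorted((hi for _, hi in bounds), reverse=True)
    return lows[k - 1], highs[k]

def focused_topk(true_probs, k, rng, max_steps=200_000):
    """Simulate only intervals overlapping the critical region."""
    names = list(true_probs)
    succ = {t: 0 for t in names}
    n = {t: 0 for t in names}
    steps = 0
    while steps < max_steps:
        bounds = {t: interval(succ[t], n[t]) for t in names}
        c, d = critical_region(list(bounds.values()), k)
        if c >= d:
            break  # critical region empty: the top-k set is separated
        active = [t for t in names if bounds[t][0] < d and bounds[t][1] > c]
        if not active:
            break
        for t in active:
            # One simulation step, modeled here as a Bernoulli trial.
            succ[t] += rng.random() < true_probs[t]
            n[t] += 1
            steps += 1
    bounds = {t: interval(succ[t], n[t]) for t in names}
    top = sorted(names, key=lambda t: bounds[t][0], reverse=True)[:k]
    return set(top), steps

# Hypothetical probabilities; only the actor names appear in the talk.
actors = {"Christopher Walken": 0.9, "Bruce Willis": 0.8,
          "Harvey Keitel": 0.2, "Samuel L. Jackson": 0.1}
top2, steps = focused_topk(actors, k=2, rng=random.Random(0))
```

With well-separated probabilities the loop stops long before the intervals become epsilon-small, which is the saving the naïve method misses. The cap `max_steps` guards against tuples whose probabilities are too close to separate.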

Multisimulation is a 2-Approx

Thm: Multisimulation performs at most twice as many simulations as OPT. And, no deterministic algorithm can do better on every instance.

Extensions

• Top-K Set (shown)

• Anytime (produce from 1 to k)

• Rank (produce top k ranked)

• All (rank all intervals)

Experiment Details: Uncertain tuples

Table          # Tuples
StringMatch    339k
ActorMatch     6,758k
DirectorMatch  18k

Table          # Tuples
Reviews        292k

Running Time

“Find all years in which Anthony Hopkins was in a highly rated movie” (SS)

• Small Number of Tuples Output (33)

• Small DNFs per Output (Avg. 20.4, Max 63)

Running Time

“Find all directors who have a highly rated drama but low rated comedy” (LL)

• Large #Tuples Output (1415)

• Large DNFs per Output (Avg. 234.8, Max 9088)

Conclusions

MystiQ is a general purpose probabilistic database

Multisimulation and Logical Optimization are key to performance on large data sets

Advert: Demo on my laptop

Running Time

“Find all actors in Pulp Fiction who appeared in two very bad movies in the five years before appearing in Pulp Fiction” (SL)

• Small Number of Tuples Output (33)

• Large DNFs per Output (Avg. 117.7, Max 685)

Running Time

“Find all directors in the 80s who had a highly rated movie” (LS)

• Large #Tuples Output (3259)

• Small DNFs per Output (Avg. 3.03, Max 30)
