efficient query evaluation on probabilistic databases
![Page 1: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/1.jpg)
Efficient Query Evaluation on Probabilistic Databases
Papers by Nilesh Dalvi, Dan Suciu, Chris Re
![Page 2: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/2.jpg)
Outline
• Motivation
• Definitions through examples
• Evaluation
• Complexity
![Page 3: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/3.jpg)
Motivation
• Imprecise information on the web
• Partial Information
• Contradictions
• Imprecise queries
![Page 4: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/4.jpg)
Imprecise Querying
![Page 5: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/5.jpg)
Interpreting the ‘~’
• For the actor's name we can use edit distance, frequency similarity measures, …
• For the film's rating we can use user preferences, analysis of previous queries, …
• But how to combine them?
• And how to assign a score to a tuple w.r.t. the entire query?
![Page 6: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/6.jpg)
Probabilistic DB
• Each tuple has a probability of appearing in the DB
• Assume tuple independence
• Distribution over all possible DB instances
• Possible Worlds Semantics
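As a rough illustration of the possible worlds semantics, the sketch below enumerates every subset of a small tuple-independent relation together with its probability. The relation `S` and its probabilities are made up for the example, and `possible_worlds` is a hypothetical helper name, not something defined in the slides.

```python
from itertools import product

def possible_worlds(tuple_probs):
    """Yield (world, probability) pairs for a tuple-independent relation.

    tuple_probs maps each tuple to its marginal probability of being present.
    """
    tuples = list(tuple_probs)
    for mask in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, mask) if keep)
        prob = 1.0
        for t, keep in zip(tuples, mask):
            # Independence: multiply p for present tuples, (1 - p) for absent ones.
            prob *= tuple_probs[t] if keep else 1.0 - tuple_probs[t]
        yield world, prob

# Hypothetical relation S(A, B) with assumed probabilities.
S = {("m", 1): 0.8, ("n", 1): 0.5}
worlds = dict(possible_worlds(S))
```

A relation with k tuples yields 2^k worlds whose probabilities sum to 1, which is why enumerating them is only feasible for toy inputs.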
![Page 7: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/7.jpg)
Example
![Page 8: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/8.jpg)
Semantics
• A query is evaluated on every possible world
• Note that for each concrete world, the query may have several answers
• In this case, for each answer, sum the probabilities of the worlds in which it appears in the set of answers
• Example
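The definition can also be sketched directly in code: evaluate the query in each world and add that world's probability to every answer it returns. The data, the relation names `S(A, B)` and `T(C, D)`, and the query (join on B = C, project on D) are assumptions chosen for illustration; the helper names are hypothetical.

```python
from itertools import product
from collections import defaultdict

def worlds(db):
    """Enumerate (world, probability) over a dict {(relation, tuple): prob}."""
    facts = list(db)
    for mask in product([False, True], repeat=len(facts)):
        prob = 1.0
        for fact, keep in zip(facts, mask):
            prob *= db[fact] if keep else 1.0 - db[fact]
        yield {fact for fact, keep in zip(facts, mask) if keep}, prob

def query(world):
    """Answers of: project D from (S join T on B = C), in one concrete world."""
    s_rows = [t for rel, t in world if rel == "S"]
    t_rows = [t for rel, t in world if rel == "T"]
    return {d for (_a, b) in s_rows for (c, d) in t_rows if b == c}

def answer_probabilities(db):
    """Sum, for each answer, the probabilities of the worlds that return it."""
    result = defaultdict(float)
    for world, prob in worlds(db):
        for answer in query(world):
            result[answer] += prob
    return dict(result)

# Assumed toy database.
db = {("S", ("m", 1)): 0.8, ("S", ("n", 1)): 0.5, ("T", (1, "p")): 0.6}
probs = answer_probabilities(db)
```

With these assumed numbers the answer `p` appears exactly in the worlds that contain the T tuple and at least one S tuple, so its probability is 0.6 * (1 - 0.2 * 0.5) = 0.54.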
![Page 9: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/9.jpg)
Example (Join on B=C)
![Page 10: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/10.jpg)
Another Example (join and projection on A)
![Page 11: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/11.jpg)
Solution attempt
• Obtain a query plan
• Compute intermediate results along with probabilities
• A plan in our (first) example: first compute the join, then project on D
![Page 12: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/12.jpg)
Evaluation of the plan
![Page 13: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/13.jpg)
Wrong!
• The tuples in the original DB were independent
• The tuples in the intermediate DB are not!
• Thus the multiplication (for the projection) is incorrect.
![Page 14: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/14.jpg)
The problem is hard
• Theorem: Answering a query over a general probabilistic DB is #P-hard (Data Complexity)
• #P-hard is the “equivalent” of NP-hard for counting (function) problems
• E.g. #SAT - given a Boolean formula, compute how many satisfying assignments it has.
• Unlikely to have a polynomial-time solution
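To make #SAT concrete, here is a brute-force counter, offered only as a sketch: it tries all 2^n assignments, which is exactly the exponential behaviour that #P-hardness suggests cannot be avoided in general. The function name `count_sat` and the sample formula are illustrative assumptions.

```python
from itertools import product

def count_sat(formula, variables):
    """Count the assignments of `variables` that make `formula` true."""
    count = 0
    for values in product([False, True], repeat=len(variables)):
        if formula(dict(zip(variables, values))):
            count += 1
    return count

# Sample formula (assumed for illustration): (x or y) and z
n_models = count_sat(lambda v: (v["x"] or v["y"]) and v["z"], ["x", "y", "z"])
```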
![Page 15: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/15.jpg)
Other plans
• Some query plans are OK
• These are plans that preserve independencies
• Let us represent the query as a logical formula
• Tuples that support the answer ‘p’ satisfy: (s1 or s2) and t1
![Page 16: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/16.jpg)
Plans and formulas
• The query was P((s1 or s2) and t1)
• First join, then project corresponds to P((s1 and t1) or (s2 and t1))
• This conversion is fine in a classic DB
• But (s1 and t1) and (s2 and t1) are not independent events!
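The gap between the two computations can be checked numerically. The sketch below compares the exact value of P((s1 or s2) and t1), obtained by enumerating truth assignments, with the value produced when the joined tuples are wrongly treated as independent. The probabilities for s1, s2 and t1 are assumed values, and `exact_probability` is a hypothetical helper.

```python
from itertools import product

def exact_probability(event, probs):
    """P(event) by summing over all truth assignments of independent atoms."""
    names = list(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= probs[name] if world[name] else 1.0 - probs[name]
        if event(world):
            total += weight
    return total

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}  # assumed values

exact = exact_probability(lambda w: (w["s1"] or w["s2"]) and w["t1"], probs)

# Join first, then project, multiplying as if the joined tuples were independent.
p_s1t1 = probs["s1"] * probs["t1"]
p_s2t1 = probs["s2"] * probs["t1"]
join_then_project = 1.0 - (1.0 - p_s1t1) * (1.0 - p_s2t1)
```

Under these assumptions the exact value is 0.54 while the join-then-project computation gives 0.636; the difference comes from counting t1 twice.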
![Page 17: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/17.jpg)
Safe Plan
• A plan that preserves independencies is called safe
• In our example: first project s over b, only then join with t
• = first compute the ‘OR’, then the ‘AND’
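A minimal sketch of this plan, assuming two toy relations S(A, B) and T(C, D) with made-up probabilities: project S on B (the 'OR'), join with T on B = C (the 'AND'), then project on D. The operators `independent_project` and `independent_join` are hypothetical names; they simply apply the independence formulas and are only correct when the plan is safe.

```python
def independent_project(relation, key):
    """Project a probabilistic relation, OR-ing tuples assumed to be independent.

    relation maps tuples to probabilities; key(t) returns the projected tuple.
    """
    out = {}
    for t, p in relation.items():
        k = key(t)
        out[k] = 1.0 - (1.0 - out.get(k, 0.0)) * (1.0 - p)
    return out

def independent_join(left, right, match, combine):
    """Join two probabilistic relations, AND-ing tuples assumed to be independent."""
    return {
        combine(l, r): pl * pr
        for l, pl in left.items()
        for r, pr in right.items()
        if match(l, r)
    }

# Assumed toy data.
S = {("m", 1): 0.8, ("n", 1): 0.5}
T = {(1, "p"): 0.6}

s_on_b = independent_project(S, key=lambda t: (t[1],))            # the 'OR'
joined = independent_join(s_on_b, T,
                          match=lambda l, r: l[0] == r[0],
                          combine=lambda l, r: (l[0], r[1]))      # the 'AND'
answer = independent_project(joined, key=lambda t: (t[1],))
```

Here the final projection is trivial because the join yields a single tuple, and the result for `p` is 0.9 * 0.6 = 0.54, matching P((s1 or s2) and t1) for these numbers.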
![Page 18: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/18.jpg)
Safe Plan
![Page 19: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/19.jpg)
Intuition on evaluation
• Work with probabilistic events
• Carry the events during evaluation
![Page 20: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/20.jpg)
Probabilistic Events
• Atomic events correspond to tuples in the original DB
• Complex events (Boolean combinations of events) correspond to tuples in intermediate DBs
• Translate a query plan to a complex event
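One way to picture "carrying the events" is to attach to every intermediate tuple a Boolean formula over atomic events and compute a probability only at the end. The sketch below stores each event in DNF as a set of frozensets of atomic event names; this representation, the helper names and the probabilities are all assumptions made for illustration.

```python
from itertools import product

def join_events(e1, e2):
    """AND of two DNF events (each a set of frozensets of atomic event names)."""
    return {c1 | c2 for c1 in e1 for c2 in e2}

def or_events(events):
    """OR of several DNF events, as produced by a projection."""
    out = set()
    for e in events:
        out |= e
    return out

def probability(event, probs):
    """Exact P(event) by enumerating truth assignments of the atoms it mentions."""
    atoms = sorted(set().union(*event))
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        truth = dict(zip(atoms, values))
        if any(all(truth[a] for a in clause) for clause in event):
            weight = 1.0
            for a in atoms:
                weight *= probs[a] if truth[a] else 1.0 - probs[a]
            total += weight
    return total

# Atomic events for three base tuples (assumed probabilities).
probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
s1, s2, t1 = ({frozenset({name})} for name in ("s1", "s2", "t1"))

# Join first, then project: the event stays correct even though the plan is unsafe.
answer_event = or_events([join_events(s1, t1), join_events(s2, t1)])
p_answer = probability(answer_event, probs)
```

Because the event remembers that t1 is shared by both joined tuples, the final probability is 0.54 regardless of the plan; the cost is that evaluating the event exactly may take exponential time.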
![Page 21: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/21.jpg)
Translation
![Page 22: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/22.jpg)
Translating events to probabilities (works iff the DB preserves independence!)
![Page 23: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/23.jpg)
Safe Plans
• A relational algebra expression has multiple equivalent expressions
• Each corresponds to a concrete execution plan.
• Some of these plans correspond to correct probabilistic computations, others to incorrect ones
• Let us try to detect what makes a plan safe.
![Page 24: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/24.jpg)
So what can we do?
• 1. Compute a safe plan when there is one
• 2. Compute an approximation when there is none
![Page 25: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/25.jpg)
Approximation
• The most common method is Monte Carlo approximation
• Originally by Karp, improved in [suciu07]
• Guarantees convergence
• The error is greater than e with a probability of less than d after (4*n / e^2) * ln(2/d) sampling steps
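For intuition, here is a sketch of a Karp-Luby style estimator for the probability of a DNF event whose atoms are independent. It is a textbook variant written for this note, not necessarily the exact procedure of the cited papers. `sample_bound` just evaluates the formula from the slide, under the assumption that n denotes the number of clauses; the clause list and probabilities are made up.

```python
import math
import random

def karp_luby(clauses, probs, samples, rng):
    """Estimate P(C1 or ... or Cm), each Ci a conjunction of independent atoms."""
    weights = [math.prod(probs[a] for a in c) for c in clauses]
    total = sum(weights)
    atoms = sorted(set().union(*clauses))
    hits = 0
    for _ in range(samples):
        # Pick a clause proportionally to its probability, then a world where it holds.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        world = {a: (a in clauses[i]) or (rng.random() < probs[a]) for a in atoms}
        # Count the world only for the first clause it satisfies (no double counting).
        first = next(j for j, c in enumerate(clauses) if all(world[a] for a in c))
        hits += (first == i)
    return total * hits / samples

def sample_bound(n, eps, delta):
    """Number of steps from the slide's formula: (4n / eps^2) * ln(2 / delta)."""
    return math.ceil(4 * n / eps ** 2 * math.log(2 / delta))

probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}  # assumed values
clauses = [frozenset({"s1", "t1"}), frozenset({"s2", "t1"})]
estimate = karp_luby(clauses, probs, samples=20000, rng=random.Random(0))
steps = sample_bound(n=len(clauses), eps=0.1, delta=0.05)
```

The estimate should land close to the exact value 0.54 for these inputs. Sampling only worlds that satisfy some clause is what keeps the variance under control when the event itself is rare.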
![Page 26: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/26.jpg)
Functional Dependencies (FDs)
• A functional dependency {A1,…,An} -> B holds for a relation R if the values of A1,…,An determine the value of B, i.e. any two tuples of R that agree on A1,…,An also agree on B
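The definition translates into a simple check on a concrete relation instance. The sketch below is illustrative only: `fd_holds` is a hypothetical helper, rows are represented as dictionaries, and the sample relation is invented. Note that it tests one instance, whereas an FD is normally a constraint on all instances of a schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in `rows` (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Invented sample relation R(A, B, C).
R = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
]
a_determines_b = fd_holds(R, ["A"], ["B"])
a_determines_c = fd_holds(R, ["A"], ["C"])
```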
![Page 27: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/27.jpg)
Safe plans using FDs
• Selections and joins (over conjunctive queries) are always safe (but may make subsequent operations unsafe)
• Projection on a1,…,ak of the result obtained from q is safe if, for every relation R in q, there is an FD a1,…,ak, R.E -> Head(q), where R.E is the event attribute of R and Head(q) are the attributes in the result of q
![Page 28: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/28.jpg)
Intuition
• Projection over a1,…,an = OR over all tuples that have the same values of {a1,…,an}
• For the ORed tuples to be independent, each atomic event must be sufficient to distinguish between them (otherwise it appears in more than one)
• I.e. it uniquely determines the other atomic events appearing in the tuple
• Hence the FD (valid only in combination with a1,…,an)
![Page 29: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/29.jpg)
Conjunctive Queries and Union thereof
• Whiteboard discussion
![Page 30: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/30.jpg)
Safe Plan algorithm
• Top-down
• Push all safe projections late in the plan
• When you can’t, split the query q into two sub-queries q1 and q2 such that their join is q (when possible)
• If stuck, the query is unsafe
![Page 31: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/31.jpg)
(Union of) Conjunctive Queries by example
• T(x) :- R(x,y), S(y,30)
• T(x) :- P(x,y)
• In relational algebra?
– Multiple possible translations
– Correspond to different orderings of operations
– Each option is called a “query plan”
![Page 32: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/32.jpg)
More notations
• Head(q) is the set of head variables in q, FreeVar(q) is the set of free variables (i.e. non-head variables) in q
• R.Key is the set of variables in the key position of the relation R
• R.NonKey is the set of variables in the non-key positions of the relation R
• R.Pred is the predicate that q applies to R
• For x in FreeVar(q), denote by qx a new query whose body is identical to that of q and where Head(qx) = Head(q) U {x}
![Page 33: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/33.jpg)
![Page 34: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/34.jpg)
![Page 35: Efficient Query Evaluation on Probabilistic Databases](https://reader036.vdocuments.us/reader036/viewer/2022062500/5681540f550346895dc20e81/html5/thumbnails/35.jpg)
Conclusion
• Probabilistic DB is a very strong tool
• Combines the exact semantics of classic DB with capabilities of IR
• Exact evaluation becomes hard sometimes
• But there are good approximations (with bounds!)