Probabilities in Databases and Logics I. Nilesh Dalvi and Dan Suciu, University of Washington

Slide 1: Probabilities in Databases and Logics I
  Nilesh Dalvi and Dan Suciu
  University of Washington

Slide 2: Two Lectures
  Today:
    - probabilistic databases to model imprecision
    - probabilistic logics
  Tomorrow:
    - probabilistic databases to model incompleteness
    - random graphs

Slide 3: Motivation
  - Record reconciliation
  - Information extraction
  - Constraint violations
  - Schema matching

Slide 4: Databases 101
  Tables:
    Movie
      title               year
      Twelve Monkeys      1995
      Monkey Love         1997
      Monkey Love         1935
      Monkey Love Planet  2005
  Queries:
    SELECT title, rating FROM Movie, Review WHERE title = name
  Answers:
      title               rating
      Twelve Monkeys      3
      Monkey Love         5
      Monkey Love Planet  5

Slide 5: Problem Setting
  Tables:
    Movie (title, year, p); year and p values not shown:
      Twelve Monkeys
      Monkey Love
      Monkey Love
      Monkey Love Pl
    Review (only the first name is shown):
      name         rating  p
      Monkey Love  good    .5
                   fair    .2
                   fair    .6
                   poor    .9
  Queries:
    A(x,y) :- Review(x,y), Movie(x,z), z > 1991
  Answers (top k):
      title           rating  p
      Twelve Monkeys  fair    .53
      Monkey Love     good    .42
      Monkey Love Pl  fair    .15
  Problem: complexity of query evaluation

Slide 6: Two Problems
  Fixed schema S, conjunctive query Q(x,y).
  Query evaluation problem:
    - Fix answer tuple (a,b)
    - Given database I, compute Pr(Q(a,b))
  Top-k answering problem:
    - Fix k > 0
    - Given database I, find the k answer tuples with the highest probabilities

Slide 7: Related Work: DB
  - Cavallo & Pittarelli: 1987
  - Barbara, Garcia-Molina, Porter: 1992
  - Lakshmanan, Leone, Ross & Subrahmanian: 1997
  - Fuhr & Roellke: 1997
  - Dalvi & Suciu: 2004
  - Widom: 2005

Slide 8: Related Work: Logic
  - Query reliability [Gradel, Gurevich, Hirsch 98]
  - Degrees of belief [Bacchus, Grove, Halpern, Koller 96]
  - Probabilistic logic [Nilsson]
  - Probabilistic model checking [Kwiatkowska 02]
  - Probabilistic Relational Model [Taskar, Abbeel, Koller 02]

Slide 9: Outline
  - Definitions
  - Query evaluation
  - Top-k answering (joint with Chris Re)
  - Conclusions

Slide 10: Application 1: Record Linkage
  Movies (Title): Monk; 12 Monkeys; Twelve Monkey
  Reviews (Review): Twelve Monkeys; 12 Monkeys (1996); Monkey Love; Planet of the Apes
  Which records match?
  Extensive research area.
  Data cleaning remains expensive and critical:
    - Prob. DBs: garbage in, ranked answers out
    - Today: garbage in, garbage out

Slide 11: Application 1: Fuzzy Object Matching
  TitleMatch (scores from q-gram or edit distance):
      title               review       p
      Twelve Monkeys      12 Monkeys   0.7
      Monkey Love         Monkeys      0.45
      Monkey Love 1935    Monkey Love  0.82
      Monkey Love 1935    Monkey Boy   0.68
      Monkey Love Planet  Monkey Love  0.8

Slide 12: Application 1: Fuzzy Object Matching
  - Intensively studied: record linkage, deduplication, object reconciliation, etc.
  - Current usage: score vs. threshold
  - New usage: scores as probabilities

Slide 13: Application 1: Fuzzy Object Matching
  Queries: find all movies rated highly by both Joe and Jim
  Answers (title, year, p), top k; year and p values not shown:
      Monkey Love
      Twelve Monkeys
      Monkey Love
      Monkey Love Planet

Slide 14: Application 2: Information Extraction
  - Collection of unstructured documents
  - Define tables, populate them
  - Scores
  - Variety of machine learning techniques

Slide 15: Application 2: Information Extraction
  Table (Posted By, Review Text); review texts are truncated in the source:
    - "[...] Monkeys is an OK movie"
    - "[...] is one of the best movies I've seen"
    - "[...] never seen 12 Monkeys but I love Monk."
  Which movie is the review about?
  Is this review positive or negative?

Slide 16: Application 2: Information Extraction
  Review text: "I've never seen 12 Monkeys but I love Monk."
  Facts table:
      Movie       Actor  Rating  (score)
      12 Monkeys  Monk   fair    0.3
  - Avatar (IBM): documents to corporate facts
  - TextRunner (UW): 100M Web pages to 1BN facts
  Extensive research area.
  Today: facts can be used only in isolation.
  Probabilistic DBs: SQL queries, rank answers.

Slide 17: Application 2: Information Extraction
  Extensive area: from text segmentation to extraction from the WWW
    - AVATAR (IBM): corporate data from docs
    - TextRunner (UW): 1,000,000,000 facts from the WWW
    - ATTENEX (startup): discovery for law offices
  Inherently imprecise!
  Probabilistic DBs: keep and use scores.

Slide 18: Application 3: Activity Recognition
  Equip people with sensors, classify activity [Lester05, Liao05]
      Name  Time  Activity
      Sue   t     run | walk
      Sue   t+1   walk | stand | sit
  Elderly health care (e.g. Alzheimer's)
  Probabilistic DBs: use scores to rank answers
  Has Mr. Johnson eaten in the last 12 hours?

Slide 19: Application 3: Activity Recognition
  Table:
      subject  time  act  p
      Barbara  t     run  0.3
      Barbara  t     eat  0.45
      Jim      t+1   eat  0.82
      Barbara  t+1   eat  0.45

Slide 20: Other Applications
  - Sensor data, RFID [Madden]
  - Constraint violations [Bertossi02, Fuxman06]
  - Schema matching [Doan03]
  - Security/privacy [Evfimievski03, Miklau04]
  - Bio-, medical-, clinical informatics [DataGrid (a startup)]
  - Personal information management [Karger, Halevy]

Slide 21: Summary of Applications
  - Large range of apps with imprecise data
  - Specific techniques exist
  - Imprecisions are handled at the application level
  - Goal of probabilistic databases: manage all imprecisions uniformly

Slide 22: Outline
  - Applications
  - Data model
  - Queries
  - Multisimulation
  - Conclusions

Slide 23: Probabilistic Database
  Schema S, domain D, set of instances Inst.
  Definition. A probabilistic database is a probability distribution
    $\Pr : Inst \to [0,1]$, with $\sum_{I} \Pr[I] = 1$.
  If $\Pr[I] > 0$ then I is called a possible world.

Slide 24: Probabilistic Database
  Representation:
    - Independent tuples: I-database DB over some schema $S^{i}$
    - Independent and disjoint tuples: ID-database DB over some schema $S^{id}$
  Semantics: DB means a probability distribution Pr over schema S.

Slide 25: Independent Events
  - A tuple is in the database with probability p
  - Any two tuples are independent events

Slide 26: I-Databases
  Representation: $Reviews^{i}(M, S, p)$
      Movie  Score  P
      m42    good   p_1
      m99    good   p_2
      m76    poor   p_3
  Possible worlds semantics: Reviews(M, S) has 8 possible worlds $I_1, \ldots, I_8$, one per subset of the three tuples:
    $\Pr[I_1] = (1-p_1)(1-p_2)(1-p_3)$   (empty instance)
    $\Pr[I_4] = p_1 p_2 (1-p_3)$         ({m42 good, m99 good})
    $\Pr[I_8] = p_1 p_2 p_3$             (all three tuples)
    $\Pr[I_1] + \Pr[I_2] + \cdots + \Pr[I_8] = 1$

Slide 27: Disjoint Events
  - Needed in many-to-1 matchings
  - Possible values for attributes [Barbara92]
  Compact form: John has Age 34 (0.3) or 43 (0.7); Mary has Age 25.
  As a table (Mary's P is not shown):
      Name  Age  P
      John  34   0.3
      John  43   0.7
      Mary  25

Slide 28: ID-Databases
  Representation: $Activities^{id}$; the superscript d marks the key attribute Time
      Time^d  Activity  P
      t       walk      p_1
      t       run       p_2
      t+1     walk      p_3
  Possible worlds: Activities(Time, Act) has 6 possible worlds $I_1, \ldots, I_6$ (at most one activity per time):
    $\Pr[I_1] = (1-p_1-p_2)(1-p_3)$   (empty instance)
    $\Pr[I_3] = p_2 (1-p_3)$          ({t run})
    $\Pr[I_5] = p_1 p_3$              ({t walk, t+1 walk})
    $\Pr[I_1] + \Pr[I_2] + \cdots + \Pr[I_6] = 1$

Slide 29: ID subsumes I
  $Reviews^{id}(Movie^d, Score^d, P)$ = $Reviews^{i}(Movie, Score, P)$:
      Movie  Score  P
      m42    good   p_1
      m99    good   p_2
      m76    poor   p_3
  When every attribute is a key attribute, no two tuples share a key, so all tuples are independent.
  Note: $Reviews^{id}(Movie, Score, P)$, with no key attributes, means all tuples are disjoint.

Slide 30: Queries
  $Movie^{i}$(id, year, P) and $Review^{i}$(mid, rating, p); table contents not recovered.
  Q(y) :- Movie(x,y), Review(x,z), z >= 3
  Syntax: conjunctive queries over schema S

Slide 31: Two Query Semantics
  Possible answer sets (used for views):
    given a set A: $\Pr[\{t \mid I \models Q(t)\} = A]$
  Possible tuples (used for query evaluation and top-k; this talk):
    given a tuple t: $\Pr[I \models Q(t)]$

Slide 32: Possible Answer Sets
  MoviewReviewMatch(mid,rid) :-
      Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
      Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
  [Figure: possible worlds of TitleMatch and ActorMatch, labeled p_1, ..., p_5, map to three possible answer sets of MoviewReviewMatch with probabilities P_1 + P_2, P_3, and P_4 + P_5 + P_6.]

Slide 33: Query Semantics 1
  Q(y) :- Movie(x,y), Review(x,z)
  [Figure: four possible worlds of Movie(id, year) with probabilities p_1, ..., p_4 map to answer sets over year with probabilities p_2, p_4, and p_1 + p_3.]
  Possible answer sets: useful for views

Slide 34: ID + Views = Complete
  Theorem. ID-databases plus views are complete for representing possible worlds distributions.
  Example: five possible worlds of R(A):
      {a, b, c}        with probability p_1
      {a, c, d, f, g}  with probability p_2
      {b, c, d, g}     with probability p_3
      {a, c}           with probability p_4
      {b, c, d, f, g}  with probability p_5
  Encoding:
    RA(A, W): (a, w_1), (b, w_1), (c, w_1), (a, w_2), (c, w_2), (d, w_2), (f, ...
    $PWD^{id}$(W, P): (w_1, p_1), (w_2, p_2), (w_3, p_3), (w_4, p_4), (w_5, p_5)
    View: R(x) :- RA(x,w), $PWD^{id}$(w)

Slide 35: Query Semantics
  Q(y) :- Movie(x,y), Review(x,z)
  [Figure: the same four possible worlds of Movie(id, year), with probabilities p_1, ..., p_4.]
  Tuple probabilities (top k); numeric values and two of the years are not recovered:
      year  p
      1935  p_2 + p_3
            p_1 + p_3
            p_3

Slide 36: Complex Correlations
  MoviewReviewMatch(mid,rid) :-
      Movie(mid,x,-), Review(rid,y,-,-), TitleMatch(x,y),
      Actor(u,mid), ActorReview(v,rid), ActorMatch(u,v)
  Views: from atomic events to complex events

Slide 37: Complex Correlations
  [Figure: repeats the view definition and possible-worlds diagram of slide 32.]

Slide 38: Summary on Data Model
  Data model:
    - Semantics = possible worlds
    - Syntax = I-databases or ID-databases
  Queries:
    - Syntax = unchanged (conjunctive queries)
    - Semantics = tuple probabilities

Slide 39: Outline
  - Definitions
  - Query evaluation
  - Top-k answering
  - Conclusions

Slide 40: Problem Definition
  Fix schema S, query Q, answer tuple t.
  Problem: given an I/ID-database DB, compute $\Pr[I \models Q(t)]$ (notation: $\Pr[Q(t)]$).
  Conventions:
    - For upper bounds (P or #P): probabilities are rationals
    - For lower bounds (#P): probabilities are 1/2

Slide 41: Query Evaluation on I-Databases
  Outline:
    - Intuition
    - Extensional plans: PTIME case
    - Hard queries: #P-complete case
    - Dichotomy Theorem

Slide 42: Intuition
  $Movie^{i}$:
      id   year  p
      m42  1995  p_1
      m99  2002  p_2
      m76  2002  p_3
      m05  2005  p_4
  $Review^{i}$:
      mid  rate  p
      m42  4     q_1
      m42  2     q_2
      m42  3     q_3
      m99  1     q_4
      m99  3     q_5
      m76  5     q_6
  Q(y) :- Movie(x,y), Review(x,z)
  Answer:
      Year  p
      1995  $p_1 (1 - (1-q_1)(1-q_2)(1-q_3))$
      2002  $1 - (1 - p_2 (1 - (1-q_4)(1-q_5)))(1 - p_3 q_6)$

Slide 43: I-Extensional Plans [Barbara92, Lakshmanan97]
  Add a probability p to each operator's output:
    - Join:       $p = p_1 \cdot p_2$
    - Projection: $p = 1 - (1-p_1)(1-p_2) \cdots (1-p_n)$
    - Selection:  $p = p$
  Note: data complexity is PTIME

Slide 44: Extensional Query Plans
  [Figure: one example per operator.]
    - Join: (x, p) and (x, q) give (x, pq)
    - Projection: (x, p_1), (x, p_2), (x, p_3) give (x, $1 - (1-p_1)(1-p_2)(1-p_3)$)
    - Selection: (x, p) gives (x, p)

Slide 45: Extensional Query Plans
  - Each tuple t has a probability t.P
  - Algebra operators compute t.P
  - Data complexity: PTIME

Slide 46
  Q(y) :- Movie(x,y), Review(x,z)
  Input: Movie = {(1995, m1) with probability p}; Review = {m1 with probabilities q_1, q_2, q_3}
  INCORRECT plan (join first, then project):
    join:    (1995, m1, $p q_1$), (1995, m1, $p q_2$), (1995, m1, $p q_3$)
    project: 1995 with $1 - (1 - p q_1)(1 - p q_2)(1 - p q_3)$
  CORRECT plan (project Review first, then join):
    project: m1 with $1 - (1-q_1)(1-q_2)(1-q_3)$
    join:    (1995, m1) with $p (1 - (1-q_1)(1-q_2)(1-q_3))$

Slide 47: Observation 1
  The answer depends on the query plan.

Slide 48: #P-Complete Queries
  $Q_{bad}$ :- $R^{i}(x)$, $S(x,y)$, $T^{i}(y)$
  Tables: $R^{i}$(A, p) with probabilities p_1, ..., p_4; $S$(A, B); $T^{i}$(B, p) with probabilities q_1, ..., q_4
  Theorem: data complexity is #P-complete.

Slide 49: Proof
  Theorem [Provan&Ball83]. Counting the number of satisfying assignments for bipartite DNF is #P-complete.
  Reduction, for the formula $x_2 y_3 \lor x_1 y_2 \lor x_4 y_3 \lor x_3 y_1$:
    $R^{i}$(A, p): x_1, x_2, x_3, x_4, each with probability 1/2
    $T^{i}$(B, p): y_1, y_2, y_3, each with probability 1/2
    $S$(A, B): (x_2, y_3), (x_1, y_2), (x_4, y_3), (x_3, y_1)
  $Q_{bad}$ :- $R^{i}(x)$, $S(x,y)$, $T^{i}(y)$
  With all seven probabilities equal to 1/2, the number of satisfying assignments is $2^7 \cdot \Pr(Q_{bad})$.

Slide 50: Observation 2
  Some queries (like $Q_{bad}$) don't admit a correct extensional plan!

Slide 51: I-Dichotomy
  Q = boolean conjunctive query.
  Definition 1. For each variable x: goals(x) = set of goals that contain x.
  Definition 2. Q is hierarchical if for all x, y:
    (a) $goals(x) \cap goals(y) = \emptyset$, or
    (b) $goals(x) \subseteq goals(y)$, or
    (c) $goals(y) \subseteq goals(x)$

Slide 52
  Hierarchical:     Q :- R(x), S(x,y), T(x,y,z), K(x,v)
  Non-hierarchical: Q :- R(x), S(x,y), T(y)
  [Figure: diagrams showing, for each query, which goals contain each variable.]

Slide 53: I-Dichotomy [Dalvi&S.04]
  Theorem. Let Q be a conjunctive query without self-joins.
  Then one of the following holds:
    - Q is in PTIME; Q has a correct extensional plan; Q is hierarchical
    or:
    - Q is #P-complete; Q has subgoals R(x,...), S(x,y,...), T(y,...)
  Schema: $S^{i} = \{R_1^{i}, R_2^{i}, \ldots, R_m^{i}\}$

Slide 54: Proof
  Lemma 1. If Q is non-hierarchical, then it is #P-complete.
  Proof, by example: Q :- $R^{i}(v,x)$, $S^{i}(x,y)$, $T^{i}(y,z)$, $K^{i}(z)$
  The rest is like for $Q_{bad}$.

Slide 55: Proof
  Lemma 2. If Q is hierarchical, then it is in PTIME.
  Proof. Case 1: Q has no root:
    $\Pr(Q) = \Pr(Q_1) \Pr(Q_2) \Pr(Q_3)$
  This is an extensional join.

Slide 56: Proof
  Case 2: Q has root x, with Dom = $\{a_1, a_2, \ldots, a_n\}$:
    $\Pr(Q) = 1 - (1 - \Pr(Q(a_1/x)))(1 - \Pr(Q(a_2/x))) \cdots (1 - \Pr(Q(a_n/x)))$
  This is an extensional projection. QED

Slide 57: Query Evaluation on ID-Databases
  - ID-extensional plans
  - #P-complete queries
  - Dichotomy Theorem

Slide 58: Extensional Plans for ID-DBs
  Only difference: two kinds of projections:
    - independent: $1 - (1-p_1) \cdots (1-p_n)$
    - disjoint:    $p_1 + \cdots + p_n$

Slide 59: #P-Complete Queries
  $Q_1$ :- $R^{i}(x)$, $S^{i}(x,y)$, $T^{i}(y)$
  $Q_2$ :- $R^{d}(x^d, y)$, $S^{d}(y^d, z)$
  $Q_3$ :- $R^{d}(x^d, y)$, $S^{d}(z^d, y)$

Slide 60: ID-DB Dichotomy [Dalvi&S.04]
  Schema $S^{id}$ such that each table is either $R^{i}$ or $R^{id}$.
  Theorem. Let Q be a conjunctive query without self-joins. Then one of the following holds:
    - Q is in PTIME; Q has a correct extensional plan
    or:
    - Q is #P-complete; Q has one of $Q_1$, $Q_2$, $Q_3$ as a subquery

Slide 61: Extensions
  Extensions of the dichotomy theorem exist for:
    - Mixed schemas (some relations are deterministic)
    - Functional dependencies

Slide 62: Summary on Query Evaluation
  - Extensional plans: popular, efficient, BUT:
      - equivalent plans lead to different results
      - only some queries admit correct plans
  - Some simple queries: #P-complete complexity
  - Dichotomy theorem
  - Future work: remove the no-self-join restriction

Slide 63: Summary on Queries
  - Extensional plans: popular in the past [Barbara92, Lakshmanan97]
  - But not all are correct [Dalvi&S.04]
  - Some queries have no correct ext. plans
  - Need extensions to the DBMS

Slide 64: Summary of Query Complexity
  - Probabilistic databases have high complexity: #P
  - Extensional plans: popular and efficient [Barbara92, Lakshmanan97]
  - BUT: the answer depends on the plan [Dalvi&S.04]
  - When [...], the query has high complexity

Slide 65: Outline
  - Definitions
  - Query evaluation
  - Top-k answering (joint with Chris Re)
  - Conclusions

Slide 66: Event Expressions
  - Atomic events: $e_1, e_2, \ldots$
  - Probabilities: $p_1, p_2, \ldots$
  - Event expressions: $e_1 e_2 \lor e_1 e_3$

Slide 67: Intensional Query Plans [Fuhr97]
  [Figure: one example per operator.]
    - Join: (x, p) and (x, q) give (x, $p \land q$)
    - Projection: (x, p_1), (x, p_2), (x, p_3) give (x, $p_1 \lor p_2 \lor p_3$)
    - Selection: (x, p) gives (x, p)

Slide 68: Probabilities of Boolean Expressions
  $p_1 = \Pr(e_1)$, $p_2 = \Pr(e_2)$, $p_3 = \Pr(e_3)$
  Given $E = e_1 e_2 \lor e_1 e_3$, compute the probability $p = \Pr(E)$.
  A: $E = e_1 (e_2 \lor e_3)$, so $p = p_1 (1 - (1-p_2)(1-p_3))$

Slide 69: Top-k Ranking Problem
  Fix schema S, query Q, number k > 0.
  Problem: given an I- or ID-database DB, find the k answers $t_1, \ldots, t_k$ with the highest probabilities:
    $\Pr[Q(t_1)] > \Pr[Q(t_2)] > \cdots > \Pr[Q(t_k)] > \cdots$
  Note: checking $\Pr[Q(t_i)] > \Pr[Q(t_j)]$ is PP-complete.
  Goal: efficient polynomial-time approximation.

Slide 70: Probabilities of Boolean Expressions
      A    p
      e_1  p_1
      e_2  p_2
      e_3  p_3
  What is the probability of $e_1 e_2 \lor e_1 e_3 \lor e_2 e_3$?
    $(1-p_1) p_2 p_3 + p_1 (1-p_2) p_3 + p_1 p_2 (1-p_3) + p_1 p_2 p_3$
  Theorem [Valiant]. #P-hard.

Slide 71: Monte Carlo Simulation
  Algorithm:
    - randomly pick each of $e_1, e_2, e_3$ = false or true
    - compute $e_1 e_2 \lor e_1 e_3 \lor e_2 e_3$: true or false?
    - repeat
  Approximate the probability $p$ with the observed frequency $\hat{p}$; $p$ lies in $[\hat{p} - \varepsilon, \hat{p} + \varepsilon]$.
  Better: PTAS [Karp&Luby83], with $\Pr(|p - \hat{p}| < \varepsilon) > 1 - \delta$

Slide 72: Monte Carlo Simulation
  [Figure: the confidence interval around p on the [0,1] axis after N = 0, 1, 2, 3 simulation steps.]

Slide 73: The Multisimulation Problem
      Year  P
      1995  ??
      2002  ??
      1933  ??
      1984  ??
  Schedule simulation steps to find the top k.

Slide 74: Multisimulation
  How to find the top k out of n?
  Example: looking for the top k = 2. Which one to simulate next?
  [Figure: confidence intervals for p_5, p_1, p_4, p_2, p_3 on the [0,1] axis.]

Slide 75: Multisimulation
  Critical region: (k-th left endpoint, (k+1)-th right endpoint)
  [Figure: intervals on the [0,1] axis, k = 2.]

Slide 76: Multisimulation Algorithm
  Case 1: pick a double crosser and simulate it.
  [Figure: the chosen interval, k = 2.]

Slide 77: Multisimulation Algorithm
  Case 2: pick both a left AND a right crosser.
  [Figure: the two chosen intervals, k = 2.]

Slide 78: Multisimulation Algorithm
  Case 3: pick a max crosser and simulate it.
  [Figure: the chosen interval, k = 2.]

Slide 79: Multisimulation Algorithm
  End: when the critical region is empty.
  To sort the top k, find the top k-1, etc.

Slide 80: Multisimulation Algorithm
  Theorem.
    (1) It runs in < 2 x the optimal number of steps.
    (2) No other deterministic algorithm does better.

Slide 81: Experiments
  IMDB+AMZN: about 10M tuples, 60% probabilistic; k = 10 (the value of n is not recovered)
  [Plot: series "simulate all", "multisim", "plus optimization", "engine time".]

Slide 82: Experiments
  [Plot only.]

Slide 83: Summary on Top-k Answering
  - Simple algorithm, optimal (x2) w.r.t. a very powerful standard
  - The marriage of probabilistic and top-k answers makes probabilistic databases practical

Slide 84: Experiments
  [Plot only.]

Slide 85: Outline
  - Definitions
  - Query evaluation
  - Top-k answering
  - Conclusions

Slide 86: Related Work
  - Probabilistic databases: Cavallo87, Barbara92, Lakshmanan97, Fuhr97, Dalvi04, Widom05
  - Extensional/intensional plans: Fuhr97
  - Probabilities for degrees of belief: Fagin90, Bacchus96
  - Simulation of boolean functions: Karp&Luby
  - Complexity of boolean function probability: Valiant79

Slide 87: Conclusions
  - Strong motivation from practical applications
  - Opportunity to merge query and search technologies
  - Probabilistic DBs are hard!
  - Great opportunity for impactful theory work
  Tomorrow: applications of random graphs to model incompleteness in databases

Slide 88: Research at UW
  - finish the complexity dichotomy
  - aggregate queries
  - constraints
  - incomplete databases (random graphs)
  Thank you! Questions?
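
Sketch: Monte Carlo estimate of an event expression
  The Monte Carlo Simulation slide describes a sampling loop: pick each atomic event true or false at random, evaluate the event expression, repeat, and use the observed frequency as the estimate. Below is a minimal Python sketch of that loop, checked against the exact sum over all truth assignments from the Probabilities of Boolean Expressions slide. The DNF encoding (a list of clauses, each a tuple of event indices) and all function names are assumptions made for illustration; they do not come from the slides. This is the naive sampler, not the Karp-Luby estimator, and it assumes the atomic events are independent.

```python
import itertools
import random

# Hypothetical encoding (not from the slides): a DNF event expression is a
# list of clauses; each clause is a tuple of atomic-event indices that must
# all be true. E = e1 e2 v e1 e3 v e2 e3, with events indexed from 0:
E = [(0, 1), (0, 2), (1, 2)]


def holds(dnf, world):
    """Evaluate the DNF on one truth assignment (a tuple of bools)."""
    return any(all(world[i] for i in clause) for clause in dnf)


def exact_probability(dnf, probs):
    """Sum the weights of all satisfying assignments.

    Exponential in len(probs); only meant for checking small examples.
    """
    total = 0.0
    for world in itertools.product([False, True], repeat=len(probs)):
        if holds(dnf, world):
            weight = 1.0
            for p, present in zip(probs, world):
                weight *= p if present else 1.0 - p
            total += weight
    return total


def monte_carlo_probability(dnf, probs, trials, rng):
    """Naive estimator: sample each event independently, count how often E holds."""
    hits = 0
    for _ in range(trials):
        world = tuple(rng.random() < p for p in probs)
        hits += holds(dnf, world)
    return hits / trials


if __name__ == "__main__":
    probs = [0.5, 0.5, 0.5]
    print("exact:   ", exact_probability(E, probs))
    print("estimate:", monte_carlo_probability(E, probs, 20000, random.Random(0)))
```

  For E = e1 e2 v e1 e3, the exact routine agrees with the closed form p1 (1 - (1-p2)(1-p3)) from the same slide. A multisimulation scheduler would call the sampling step a few trials at a time per answer tuple instead of running a fixed number of trials for each.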
