On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases
Presented by
Xi Zhang
February 8th, 2008
Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
- Conclusion
Probabilistic Databases
- Motivation
  - Uncertainty/vagueness/imprecision in data
- History
  - Incomplete information in relational DB [Imielinski & Lipski 1984]
  - Probabilistic DB model [Cavallo & Pittarelli 1987]
  - Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]
- Comeback
  - Flourishing of uncertain data in real-world applications
  - Examples: WWW, biological data, sensor networks, etc.
Probabilistic Database Model [Fuhr & Rölleke 1997]
- Probabilistic Database Model
  - A generalization of relational DB
- Probabilistic Relational Algebra (PRA)
  - A generalization of standard relational algebra
A Table in Probabilistic Database

DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   eDT(1, IR)
2      DB    0.7   eDT(2, DB)
3      IR    0.8   eDT(3, IR)
3      DB    0.5   eDT(3, DB)
4      AI    0.8   eDT(4, AI)

- Each tuple is annotated with an event expression
- Basic events are independent
Probabilistic Relational Algebra
- Just like in Relational Algebra…
  - Selection
  - Projection
  - Join
  - Union
  - Difference (-)
Selection (Term = IR) on DocTerm

DocNo  Term  Prob  Complex Event
1      IR    0.9   eDT(1, IR)
3      IR    0.8   eDT(3, IR)

- Complex event: in a derived table, a propositional expression of basic events
Projection (DocTerm onto Term)

Term  Prob  Complex Event
IR    0.98  eDT(1, IR) ∨ eDT(3, IR)
DB    0.85  eDT(2, DB) ∨ eDT(3, DB)
AI    0.80  eDT(4, AI)
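The projection probabilities can be checked with a short sketch. It assumes the basic events are independent, so a disjunction has probability 1 minus the product of the complements. The function name `project_term` is illustrative and not part of PRA.

```python
from collections import defaultdict

# DocTerm rows as (DocNo, Term, Prob); each row is an independent basic event.
doc_term = [(1, "IR", 0.9), (2, "DB", 0.7), (3, "IR", 0.8), (3, "DB", 0.5), (4, "AI", 0.8)]


def project_term(rows):
    """Project onto Term: P(e1 or ... or en) = 1 - prod(1 - P(ei)) for independent events."""
    none_true = defaultdict(lambda: 1.0)  # Term -> probability that no contributing row holds
    for _doc, term, prob in rows:
        none_true[term] *= 1.0 - prob
    return {term: 1.0 - q for term, q in none_true.items()}


term_prob = project_term(doc_term)
for term, prob in term_prob.items():
    print(term, round(prob, 2))
```

This shortcut is only valid here because no basic event occurs twice in any of the disjunctions.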
Join

DocTerm:
DocNo  Term  Prob  Basic Event
1      IR    0.9   eDT(1, IR)
2      DB    0.7   eDT(2, DB)

DocAu:
DocNo  AName  Prob  Basic Event
1      Bauer  0.9   eDU(1, Bauer)
2      Meier  0.8   eDU(2, Meier)

Join result:
DocAu.DocNo  AName  DocTerm.DocNo  Term  Prob     Complex Event
1            Bauer  1              IR    0.9*0.9  eDU(1, Bauer) ∧ eDT(1, IR)
1            Bauer  2              DB    0.9*0.7  eDU(1, Bauer) ∧ eDT(2, DB)
2            Meier  1              IR    0.8*0.9  eDU(2, Meier) ∧ eDT(1, IR)
2            Meier  2              DB    0.8*0.7  eDU(2, Meier) ∧ eDT(2, DB)
Join + Projection

IR (documents about IR, derived from DocTerm):
DocNo  Prob  Complex Event
1      0.9   eDT(1, IR)
3      0.8   eDT(3, IR)

DB (documents about DB, derived from DocTerm):
DocNo  Prob  Complex Event
2      0.7   eDT(2, DB)
3      0.5   eDT(3, DB)

DocAu:
DocNo  AName    Prob  Basic Event
1      Bauer    0.9   eDU(1, Bauer)
2      Bauer    0.3   eDU(2, Bauer)
2      Meier    0.9   eDU(2, Meier)
2      Schmidt  0.8   eDU(2, Schmidt)
3      Schmidt  0.7   eDU(3, Schmidt)
4      Koch     0.9   eDU(4, Koch)
4      Bauer    0.6   eDU(4, Bauer)

DocAu joined with IR, projected onto AName:
AName    Prob  Complex Event
Bauer    0.81  eDU(1, Bauer) ∧ eDT(1, IR)
Schmidt  0.56  eDU(3, S) ∧ eDT(3, IR)

DocAu joined with DB, projected onto AName:
AName    Prob  Complex Event
Bauer    0.21  eDU(2, Bauer) ∧ eDT(2, DB)
Meier    0.63  eDU(2, Meier) ∧ eDT(2, DB)
Schmidt  0.91  (eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB))

Joining the two results on AName:
AName    Prob (product of inputs)  Complex Event
Bauer    0.81 * 0.21 = 0.1701      (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
Schmidt  0.56 * 0.91 = 0.5096      (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))

Probability of Schmidt's complex event computed from the event expression itself: 0.4368
Join + Projection: Intensional Semantics vs. Extensional Semantics
- For Schmidt, 0.4368 is the intensional result and 0.5096 is the extensional result
Intensional vs. Extensional
- Intensional Semantics
  - Assumes data independence of base tables
  - Keeps track of data dependence during the evaluation
- Extensional Semantics
  - Assumes data independence during the evaluation
  - Could be WRONG with probability computation!
- When Intensional = Extensional?
  - No identical underlying basic events in the event expression

AName    Extensional Prob      Intensional Prob  Complex Event
Bauer    0.81 * 0.21 = 0.1701  0.1701            (eDU(1, B) ∧ eDT(1, IR)) ∧ (eDU(2, B) ∧ eDT(2, DB))
Schmidt  0.56 * 0.91 = 0.5096  0.4368            (eDU(3, S) ∧ eDT(3, IR)) ∧ ((eDU(2, S) ∧ eDT(2, DB)) ∨ (eDU(3, S) ∧ eDT(3, DB)))

- Identical basic event: eDU(3, S) occurs twice in Schmidt's event expression
[Fuhr & Rölleke 1997]
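The intensional figures can be reproduced by brute force. The sketch below enumerates every truth assignment of the basic events and sums the weight of those in which the event expression holds. It assumes independent basic events with the probabilities given on the slides. The helper names (`intensional`, `bauer`, `schmidt`) are illustrative only.

```python
from itertools import product

# Basic events and probabilities from DocTerm (eDT) and DocAu (eDU) on the slides.
basic = {
    "eDT(1,IR)": 0.9, "eDT(2,DB)": 0.7, "eDT(3,IR)": 0.8, "eDT(3,DB)": 0.5,
    "eDU(1,B)": 0.9, "eDU(2,B)": 0.3, "eDU(2,S)": 0.8, "eDU(3,S)": 0.7,
}


def intensional(expr):
    """Sum the probability of every truth assignment in which expr holds."""
    names = list(basic)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if expr(world):
            weight = 1.0
            for name in names:
                weight *= basic[name] if world[name] else 1.0 - basic[name]
            total += weight
    return total


def bauer(w):
    return (w["eDU(1,B)"] and w["eDT(1,IR)"]) and (w["eDU(2,B)"] and w["eDT(2,DB)"])


def schmidt(w):
    # eDU(3,S) occurs in both conjuncts, so the conjuncts are not independent.
    return (w["eDU(3,S)"] and w["eDT(3,IR)"]) and (
        (w["eDU(2,S)"] and w["eDT(2,DB)"]) or (w["eDU(3,S)"] and w["eDT(3,DB)"])
    )


p_bauer = intensional(bauer)
p_schmidt = intensional(schmidt)
print(round(p_bauer, 4), round(p_schmidt, 4))
```

Enumeration is exponential in the number of basic events, so this is only a way to check small examples, not an evaluation strategy.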
Summary
- Probabilistic DB Model
  - Concept of event
  - Basic vs. complex event
  - Event expression
- Probabilistic Relational Algebra
  - Just like in Relational Algebra…
- Computation of event probabilities
  - Intensional vs. extensional semantics
  - Yield the same result when there is NO data dependence in event expressions
Outline
- Background
  - Probabilistic database model
  - Top-k queries & scoring functions
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion
Top-k Queries
- Traditionally, given
  - Objects: o1, o2, …, on
  - A non-negative integer: k
  - A scoring function s over the objects
- Question:
  - What are the k objects with the highest score?
- Have been studied in Web, XML, and relational databases, and more recently in probabilistic databases.
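For the deterministic case, a top-k query is a few lines with the standard library. The object names and scores below are hypothetical and serve only to illustrate the definition.

```python
import heapq


def top_k(objects, score, k):
    """Return the k objects with the highest score, best first."""
    return heapq.nlargest(k, objects, key=score)


# Hypothetical objects and scores, for illustration only.
scores = {"o1": 0.45, "o2": 0.65, "o3": 0.55}
result = top_k(scores, scores.get, 2)
print(result)
```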
Outline
- Background
- Motivation Examples
  - Smart Environment Example
  - Sensor Network Example
- Top-k Queries in Probabilistic Databases
- Conclusion
Motivating Example I
- Smart Environment
- Sample Question
  - “Who were the two visitors in the lab last Saturday night?”
- Data
  - Biometric data from sensors
    - We can see how well those data match the profile of every candidate: a scoring function
  - Historical statistics
    - e.g. probability of a certain candidate being in the lab on Saturday nights
Motivating Example I (cont.)

PersonnelBiometrics:
Person  Face Detection  Voice Detection  …  score( … )  Probability of being in lab on Saturday nights
Aiden   0.70            0.60             …  0.65        0.3
Bob     0.50            0.60             …  0.55        0.9
Chris   0.50            0.40             …  0.45        0.4

Question: Find two people in the lab last Saturday night
- A Top-2 query over the above probabilistic database under the above scoring function
Motivating Example II
- Sensor Network in a Habitat
- Sample Question
  - “What is the temperature of the warmest spot?”
- Data
  - Sensor readings from different sensors
    - At a sampling time, only one “real” reading from a sensor
    - Each sensor reading comes with a confidence value
Motivating Example II (cont.)

Temp (F)  Prob  Part
22        0.6   C1 (from Sensor 1)
10        0.1   C1 (from Sensor 1)
25        0.4   C2 (from Sensor 2)
15        0.6   C2 (from Sensor 2)

Question: What is the temperature of the warmest spot?
- A Top-1 query over the above probabilistic database under the scoring function proportional to temperature
Outline
- Background
- Motivation Examples
- Top-k Queries in Probabilistic Databases
  - Semantics
  - Query Evaluation
- Conclusion
Models
- A probabilistic relation Rp = <R, p, C>
  - R: the support deterministic relation
  - p: probability function
  - C: a partition of R, such that the probabilities of the tuples in each part sum to at most 1
- Simple vs. General probabilistic relation
  - Simple
    - Assume tuple independence, i.e. |C| = |R|
    - E.g. smart environment example
  - General
    - Tuples can be independent or exclusive, i.e. |C| < |R|
    - E.g. sensor network example
Challenges
- Given
  - A probabilistic relation Rp = <R, p, C>
  - An injective scoring function s over R (no ties)
  - A non-negative integer k
- What is the top-k answer set over Rp? (Semantics)
- How to compute the top-k answer of Rp? (Query Evaluation)
Properties
- Exact-k
  - If R has at least k tuples, then exactly k tuples are returned as the top-k answer
- Faithfulness
  - A “better” tuple, i.e. higher in score and probability, is more likely to be in the top-k answer, compared to a “worse” one
- Stability
  - Raising the score/prob. of a winning tuple will not cause it to lose
  - Lowering the score/prob. of a losing tuple will not cause it to win
Global-Topk Semantics
- Given
  - A probabilistic relation Rp = <R, p, C>
  - An injective scoring function s over R (no ties)
  - A non-negative integer k
- What is the top-k answer set over Rp? (Semantics)
- Global-Topk
  - Return the k highest-ranked tuples according to their probability of being in top-k answers in possible worlds
- Global-Topk satisfies the aforementioned three properties
Smart Environment Example

PersonnelBiometrics:
Person  Face Detection  Voice Detection  …  Score( … )  Prob.
Aiden   0.70            0.60             …  0.65        0.3
Bob     0.50            0.60             …  0.55        0.9
Chris   0.50            0.40             …  0.45        0.4

Query: Find two people in lab on last Saturday night

Possible worlds and their Top-2 answers:
World                Prob   Top-2
{}                   0.042  {}
{Aiden}              0.018  {Aiden}
{Bob}                0.378  {Bob}
{Chris}              0.028  {Chris}
{Aiden, Bob}         0.162  {Aiden, Bob}
{Aiden, Chris}       0.012  {Aiden, Chris}
{Bob, Chris}         0.252  {Bob, Chris}
{Aiden, Bob, Chris}  0.108  {Aiden, Bob}

Global-Topk Semantics:
- Pr(Aiden in top-2) = 0.3
- Pr(Bob in top-2) = 0.9
- Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
- Top-2 Answer: {Aiden, Bob}
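The Global-Topk numbers of this example can be reproduced by enumerating the possible worlds directly. This is a brute-force sketch for checking the definition on tiny inputs, not the evaluation algorithm of the talk. The function names are illustrative.

```python
from itertools import product

# (name, score, probability) from PersonnelBiometrics; tuples are independent.
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]


def possible_worlds(tuples):
    """Yield (world, probability) for every subset of independent tuples."""
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
        yield [t for t, here in zip(tuples, present) if here], prob


def global_topk(tuples, k):
    """Rank tuples by their total probability of being in the top-k of a world."""
    in_topk = {name: 0.0 for name, _, _ in tuples}
    for world, prob in possible_worlds(tuples):
        for name, _, _ in sorted(world, key=lambda t: t[1], reverse=True)[:k]:
            in_topk[name] += prob
    answer = sorted(in_topk, key=in_topk.get, reverse=True)[:k]
    return answer, in_topk


answer, in_topk = global_topk(people, 2)
print(answer, {name: round(p, 3) for name, p in in_topk.items()})
```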
U-Topk Semantics
- Given
  - A probabilistic relation Rp = <R, p, C>
  - An injective scoring function s over R (no ties)
  - A non-negative integer k
- What is the top-k answer set over Rp? (Semantics)
- U-Topk
  - Return the most probable top-k answer set that belongs to possible worlds
- U-Topk does not satisfy all three properties
Smart Environment Example

Query: Find two people in lab on last Saturday night
(PersonnelBiometrics table and possible worlds as in the Global-Topk example)

U-Topk Semantics:
- Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
- …
- Pr({Bob}) = 0.378
- Top-2 Answer: {Bob}
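The same brute-force enumeration illustrates U-Topk: group the possible worlds by their top-2 answer set and pick the most probable set. This is a sketch for checking the example only. The function names are illustrative.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from PersonnelBiometrics; tuples are independent.
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]


def possible_worlds(tuples):
    """Yield (world, probability) for every subset of independent tuples."""
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
        yield [t for t, here in zip(tuples, present) if here], prob


def u_topk(tuples, k):
    """Return the most probable top-k answer set and all answer-set probabilities."""
    answer_prob = defaultdict(float)
    for world, prob in possible_worlds(tuples):
        top = sorted(world, key=lambda t: t[1], reverse=True)[:k]
        answer_prob[frozenset(name for name, _, _ in top)] += prob
    best = max(answer_prob, key=answer_prob.get)
    return best, answer_prob


best, answer_prob = u_topk(people, 2)
print(set(best), round(answer_prob[best], 3))
```

The winning set has a single element although k = 2, which is the Exact-k violation noted for U-Topk.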
U-kRanks Semantics
- Given
  - A probabilistic relation Rp = <R, p, C>
  - An injective scoring function s over R (no ties)
  - A non-negative integer k
- What is the top-k answer set over Rp? (Semantics)
- U-kRanks
  - For i = 1, 2, …, k, return the most probable ith-ranked tuple across all possible worlds
- U-kRanks does not satisfy all three properties
Smart Environment Example

Query: Find two people in lab on last Saturday night
(PersonnelBiometrics table and possible worlds as in the Global-Topk example)

U-kRanks Semantics:
Person  Rank-1  Rank-2
Aiden   0.3     0
Bob     0.63    0.27
Chris   0.028   0.264

- e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264
- Highest at rank-1: Bob (0.63)
- Highest at rank-2: Bob (0.27)
- Top-2 Answer: {Bob}
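The rank table can also be rebuilt by enumeration. The sketch below accumulates, for each rank, the probability of each tuple occupying that rank and then picks the winner per rank. It is a brute-force check of the example. The function names are illustrative.

```python
from collections import defaultdict
from itertools import product

# (name, score, probability) from PersonnelBiometrics; tuples are independent.
people = [("Aiden", 0.65, 0.3), ("Bob", 0.55, 0.9), ("Chris", 0.45, 0.4)]


def possible_worlds(tuples):
    """Yield (world, probability) for every subset of independent tuples."""
    for present in product([False, True], repeat=len(tuples)):
        prob = 1.0
        for (_, _, p), here in zip(tuples, present):
            prob *= p if here else 1.0 - p
        yield [t for t, here in zip(tuples, present) if here], prob


def u_kranks(tuples, k):
    """For each rank 1..k, find the tuple most likely to hold that rank."""
    rank_prob = [defaultdict(float) for _ in range(k)]
    for world, prob in possible_worlds(tuples):
        ranked = sorted(world, key=lambda t: t[1], reverse=True)
        for i, (name, _, _) in enumerate(ranked[:k]):
            rank_prob[i][name] += prob
    winners = [max(r, key=r.get) for r in rank_prob]
    return winners, rank_prob


winners, rank_prob = u_kranks(people, 2)
print(winners)
```

Bob wins both ranks, so the answer collapses to the single tuple {Bob}.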
Properties

Semantics    Exact-k  Faithfulness  Stability
Global-Topk  Yes      Yes           Yes
U-Topk       No       Yes/No*       Yes
U-kRanks     No       No            No

* Yes when the relation is simple, No otherwise
Global-Topk: a better semantics
Challenges
- Given
  - A probabilistic relation Rp = <R, p, C>
  - An injective scoring function s over R (no ties)
  - A non-negative integer k
- What is the top-k answer set over Rp? (Semantics): Global-Topk
- How to compute the top-k answer of Rp? (Query Evaluation)
Global-Topk in Simple Relation
- Given Rp = <R, p, C>, a scoring function s, a non-negative integer k
- Assumptions
  - Tuples are independent, i.e. |C| = |R|
  - R = {t1, t2, …, tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)
Global-Topk in Simple Relation
- Query Evaluation
  - Recursion
    - P_{k,s}(t_i): Global-Topk probability of tuple t_i
  - Dynamic Programming
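The recursion itself did not survive in this text, so the following is only a sketch of one standard O(kn) dynamic program for independent tuples and is not necessarily the recursion shown in the talk. It relies on the fact that t_i is in the top-k exactly when t_i exists and fewer than k higher-scored tuples exist. The function name and the `count` array are assumptions made for illustration.

```python
def global_topk_probs(probs, k):
    """probs: tuple probabilities in decreasing score order (independent tuples).

    Returns the probability of being in the top-k for every tuple, in O(kn) time.
    """
    if k <= 0:
        return [0.0] * len(probs)
    # count[j] = probability that exactly j of the tuples processed so far exist (j < k).
    count = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # t_i is in the top-k iff it exists and fewer than k higher-scored tuples exist.
        result.append(p * sum(count))
        # Fold t_i into the counts, downwards so count[j - 1] is still the old value.
        for j in range(k - 1, 0, -1):
            count[j] = count[j] * (1.0 - p) + count[j - 1] * p
        count[0] *= 1.0 - p
    return result


# Aiden, Bob, Chris in decreasing score order, k = 2.
topk_probs = global_topk_probs([0.3, 0.9, 0.4], 2)
print([round(p, 3) for p in topk_probs])
```

On the smart environment data this agrees with the possible-worlds figures 0.3, 0.9 and 0.292.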
- Optimization: Threshold Algorithm (TA) [Fagin & Lotem 2001]
  - Given a system of objects, such that
    - For each object attribute, there is a sorted list ranking objects in the decreasing order of its score on that attribute
    - An aggregation function f combines individual attribute scores xi, i = 1, 2, …, m, to obtain the overall object score f(x1, x2, …, xm)
    - f is monotonic: f(x1, x2, …, xm) <= f(x'1, x'2, …, x'm) whenever xi <= x'i for every i
  - TA is cost-optimal in finding the top-k objects
  - TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
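A minimal sketch of the generic Threshold Algorithm may help make the stopping condition concrete. It assumes every object appears in every attribute list and that all attribute scores fit in memory. The face and voice values come from the PersonnelBiometrics table, and averaging them is an assumption made for illustration that happens to reproduce the slide's scores.

```python
import heapq


def threshold_algorithm(lists, f, k):
    """lists: one dict per attribute, mapping object -> score on that attribute.
    f: monotonic aggregation function over a list of m attribute scores.
    Returns up to k (object, score) pairs, best first.
    """
    if k <= 0:
        return []
    # Sorted access: every attribute list in decreasing order of score.
    sorted_lists = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    seen = {}
    top = []
    for row in zip(*sorted_lists):
        for obj, _ in row:
            if obj not in seen:
                # Random access: fetch the object's score on every attribute.
                seen[obj] = f([l[obj] for l in lists])
        # No unseen object can score above f of the last values seen in each list.
        threshold = f([value for _, value in row])
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top


def mean(xs):
    return sum(xs) / len(xs)


face = {"Aiden": 0.70, "Bob": 0.50, "Chris": 0.50}
voice = {"Aiden": 0.60, "Bob": 0.60, "Chris": 0.40}
top2 = threshold_algorithm([face, voice], mean, 2)
print([name for name, _ in top2])
```

Recomputing the current top-k in every round keeps the sketch short; a real implementation would maintain a bounded heap instead.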
Applying TA Optimization
- Global-Topk
  - Two attributes: probability & score
  - Aggregation function: Global-Topk probability
Global-Topk in General Relation
- Given Rp = <R, p, C>, a scoring function s, a non-negative integer k
- Assumptions
  - Tuples are independent or exclusive, i.e. |C| < |R|
  - R = {t1, t2, …, tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)
Global-Topk in General Relation
- Induced Event Relation
  - For each tuple in R, there is a probabilistic relation Ep = <E, pE, CE> generated by the following two rules
  - Ep is simple
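Sensor Network Example

Prob. Relation (general):
Temp (F)  Prob  Part
22        0.6   C1 (from Sensor 1)
10        0.1   C1 (from Sensor 1)
25        0.4   C2 (from Sensor 2)
15        0.6   C2 (from Sensor 2)

For example: t = (15, 0.6)

Induced Event Relation (simple), generated by Rule 1 and Rule 2:
Event  Prob
t_eC1  0.6 = sum of p(t') over the tuples t' in Ci that score higher than t, where i = 1
t_et   0.6 = p(t)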
Evaluating Global-Topk in General Relation
For each tuple t, generate corresponding induced event relation
Compute the Global-Topk probability of t by Theorem 4.3
Pick the k tuples with the highest Global-Topk probability
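The two rules and Theorem 4.3 are not reproduced in this text, so the sketch below follows only what the example table suggests: for a tuple t, every other part contributes one independent event "some higher-scored tuple of that part exists", and t is in the top-k when t exists and fewer than k of those events occur. The sensor probabilities are taken as read from the table above, whose column order is an assumption. All function names are illustrative.

```python
def fewer_than_k(event_probs, k):
    """Probability that fewer than k of the independent events occur (k >= 1)."""
    count = [1.0] + [0.0] * (k - 1)  # count[j] = P(exactly j events so far)
    for p in event_probs:
        for j in range(k - 1, 0, -1):
            count[j] = count[j] * (1.0 - p) + count[j - 1] * p
        count[0] *= 1.0 - p
    return sum(count)


def global_topk_general(parts, k):
    """parts: list of parts, each a list of mutually exclusive (score, prob) tuples.

    Scores are assumed injective, so they double as tuple identifiers. Requires k >= 1.
    """
    result = {}
    for i, part in enumerate(parts):
        for score, prob in part:
            # One induced event per other part that has a higher-scored tuple.
            events = []
            for j, other in enumerate(parts):
                if j != i:
                    higher = sum(p for s, p in other if s > score)
                    if higher > 0:
                        events.append(higher)
            result[score] = prob * fewer_than_k(events, k)
    answer = sorted(result, key=result.get, reverse=True)[:k]
    return answer, result


sensors = [[(22, 0.6), (10, 0.1)], [(25, 0.4), (15, 0.6)]]
answer, probs = global_topk_general(sensors, 1)
print(answer, {temp: round(p, 2) for temp, p in probs.items()})
```

Each tuple needs O(n) work to build its events and O(kn) for the counting step, which is in line with the O(kn^2) bound quoted for the general case.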
Summary on Query Evaluation
- Simple (Independent Tuples)
  - Dynamic Programming
    - Tuples are ordered on their scores
    - Recursion on the tuple index and k
- General (Independent/Exclusive Tuples)
  - Polynomial reduction to simple cases
Complexity

         Global-Topk  U-Topk                 U-kRanks
Simple   O(kn)        O(kn)                  O(kn)
General  O(kn^2)      Θ(m k n^(k-1) lg n)*   Ω(m n^(k-1))*

* m is a rule engine related factor; it represents how complicated the relationship between tuples could be
Conclusion
- Three intuitive semantic properties for top-k queries in probabilistic databases
- Global-Topk semantics, which satisfies all the properties above
- Query evaluation algorithms for Global-Topk in simple and general probabilistic databases

Future Problems
- Weak order scoring function
  - Allow ties
  - Not clear how to extend properties
  - Not clear how to define the semantics (other than “arbitrary tie breaker”)
- Preference Strength
  - Sensitivity to Score
    - Given a prob. relation Rp, if the DB is sufficiently large, by manipulating the scores of tuples, we would be able to get different answers
    - NOT satisfied by our semantics
    - NOT satisfied by any semantics in the literature
  - Need to consider preference strength in the semantics
Related Works
- Introduction to Probabilistic Databases
  - Probabilistic DB Model & Probabilistic Relational Algebra [Fuhr & Rölleke 1997]
- Top-k Query in Probabilistic Databases
  - On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]
  - Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]