on the semantics and evaluation of top-k queries in probabilistic databases presented by xi zhang...

On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Presented by

Xi Zhang

Feburary 8th, 2008

Outline

Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Outline

Background Probabilistic database model Top-k queries & scoring functions

Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Probabilistic Databases

Motivation Uncertainty/vagueness/imprecision in data

History Imcomplete information in relational DB [Imielinski & Lipski

1984] Probabilistic DB model [Cavallo & Pittarelli 1987] Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.]

Comeback Flourish of uncertain data in real world application

Examples: WWW, Biological data, Sensor network etc.

Probabilistic Database Model [Fubr & Rölleke 1997]

Probabilisitc Database Model A generalizaiton of relational DB

Probabilistic Relational Algebra (PRA) A generalization of standard relational algebra

DocNo Term

1

2

3

3

4

IR

DB

IR

DB

AI

Prob

0.9

0.7

0.8

0.5

0.8

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

eDT(3, IR)

eDT(3, DB)

eDT(4, AI)

A Table in Probabilistic Database

Event expression

Independent events

Probabilistic Relational Algebra

Just like in Relational Algebra… Selection Projection Join Union Difference -

DocNo Term

1

2

3

3

4

IR

DB

IR

DB

AI

Prob

0.9

0.7

0.8

0.5

0.8

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

eDT(3, IR)

eDT(3, DB)

eDT(4, AI)

Selection

DocNo Term

1

3

IR

IR

Prob

0.9

0.8

Complex Event

eDT(1, IR)

eDT(3, IR)

In derived table

Propositional expression of basic events

DocNo Term

1

2

3

3

4

IR

DB

IR

DB

AI

Prob

0.9

0.7

0.8

0.5

0.8

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

eDT(3, IR)

eDT(3, DB)

eDT(4, AI)

Projection

Term

IR

DB

AI

Prob

0.98

0.85

0.80

Complex Event

eDT(1, IR) eDT(3, IR)

eDT(2, DB) eDT(2, DB)

eDT(4, AI)

Join

DocNo Term

1

2

IR

DB

Prob

0.9

0.7

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

DocNo AName

1

2

Bauer

Meier

Prob

0.9

0.8

Basic Event

eDU(1, Bauer)

eDU(2, Meier)

DocAu:

DocAu.DocNo

AName DocTerm.DocNo

Term

1

1

2

2

Bauer

Bauer

Meier

Meier

1

2

1

2

IR

DB

IR

DB

Prob

0.9*0.9

0.9*0.7

0.8*0.9

0.8*0.7

Complex Event

eDU(1, Bauer) eDT(1, IR)

eDU(1, Bauer) eDT(2, DB)

eDU(2, Meier) eDT(1, IR)

eDU(2, Meier) eDT(2, DB)

DocNo Term

1

2

3

3

4

IR

DB

IR

DB

AI

Prob

0.9

0.7

0.8

0.5

0.8

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

eDT(3, IR)

eDT(3, DB)

eDT(4, AI)

Join + Projection

DocNo

1

3

Prob

0.9

0.8

Complex Event

eDT(1, IR)

eDT(3, IR)

IR:

DocNo

2

3

Prob

0.7

0.5

Complex Event

eDT(2, DB)

eDT(3, DB)

DB:

DocNo AName

1

2

2

2

3

4

4

Bauer

Bauer

Meier

Schmidt

Schmidt

Koch

Bauer

Prob

0.9

0.3

0.9

0.8

0.7

0.9

0.6

Basic Event

eDU(1, Bauer)

eDU(2, Bauer)

eDU(2, Meier)

eDU(2, Schmidt)

eDU(3, Schmidt)

eDU(3, Koch)

eDU(3, Bauer)

DocAu:

AName

Bauer

Schimdt

AName

Bauer

Meier

Schmidt

Prob

0.81

0.56

Complex Event


eDU(3, S) eDT(3, IR)

Prob

0.21

0.63

0.91

Complex Event



(eDU(2, S) eDT(2, DB))

(eDU(3, S) eDT(3, DB) )

AName

Bauer

Schmidt

0.81 * 0.21 = 0.1701

0.56 * 0.91 = 0.5096

Prob Complex Event

(eDU(1, B) eDT(1, IR)) (eDU(2, B) eDT(2, DB))

(eDU(3, S) eDT(3, IR) )

( (eDU(2, S) eDT(2, DB)) (eDU(3, S) eDT(3, DB) ) )0.4368

DocNo Term

1

2

3

3

4

IR

DB

IR

DB

AI

Prob

0.9

0.7

0.8

0.5

0.8

DocTerm:

Basic Event

eDT(1, IR)

eDT(2, DB)

eDT(3, IR)

eDT(3, DB)

eDT(4, AI)

DocNo

1

3

Prob

0.9

0.8

Complex Event

eDT(1, IR)

eDT(3, IR)

IR:

DocNo

2

3

Prob

0.7

0.5

Complex Event

eDT(2, DB)

eDT(3, DB)

DB:

DocNo AName

1

2

2

2

3

4

4

Bauer

Bauer

Meier

Schmidt

Schmidt

Koch

Bauer

Prob

0.9

0.3

0.9

0.8

0.7

0.9

0.6

Basic Event

eDU(1, Bauer)

eDU(2, Bauer)

eDU(2, Meier)

eDU(2, Schmidt)

eDU(3, Schmidt)

eDU(3, Koch)

eDU(3, Bauer)

DocAu:

AName

Bauer

Schimdt

AName

Bauer

Meier

Schmidt

Prob

0.81

0.56

Complex Event


eDU(3, S) eDT(3, IR)

Prob

0.21

0.63

0.91

Complex Event



(eDU(2, S) eDT(2, DB))

(eDU(3, S) eDT(3, DB) )

AName

Bauer

Schmidt

0.81 * 0.21 = 0.1701

0.56 * 0.91 = 0.5096

Prob Complex Event


(eDU(3, S) eDT(3, IR) )


Intensional Semanticsv.s.

Extensional Semantics

Join + Projection

Intensional v.s Extensional

Intensional Semantics Assume data independence of base tables Keeps track of data dependence during the

evaluation Extensional Semantics

Assume data independence during the evaluation Could be WRONG with probability computation!

When Intensional = Extensional?

No identical underlying basic events in the event expression

AName

Bauer

Schmidt

Prob

0.81 * 0.21 = 0.1701

0.56 * 0.91 = 0.5096

Complex Event


(eDU(3, S) eDT(3, IR) )


Identical basic event

Fubr & Rölleke 1997

Summary Probabilisitc DB Model

Concept of event Basic v.s. complex event Event expression

Probabilistic Relational Algebra Just like in Relational Algebra…

Computation of event probabilities Intensional v.s. extensional semantics Yield the same result when NO data dependence in event

expressions

Outline

Background Probabilistic database model Top-k queries & scoring functions

Motivation Examples Top-k Queries in Probabilistic Databases

Semantics Query Evaluation

Conclusion

Top-k Queries

Traditonally, givenObjects: o1, o2, …, on

An non-negative integer: k

A scoring function s:

Question:

What are the k objects with the highest score?

Have been studied in Web, XML, Relational Databases, and more recently in Probabilistic Databases.

Scoring Function

A scoring function s over a deterministic relation R is

For any ti and tj from R,

Outline

Background Motivation Examples

Smart Enviroment Example Sensor Network Example

Top-k Queries in Probabilistic Databases Conclusion

Motivating Example I

Smart Environment Sample Question

“Who were the two visitors in the lab last Saturday night?” Data

Biometric data from sensors We would be able to see how those data match the profile of every

candidate -- a scoring function Historical statistics

e. g. Probability of a certain candidate being in lab on Saturday nights

Motivating Example I (cont.)

Face Voice Detection, Detection,

Aiden score( 0.70 , 0.60, … ) = 0.65

Bob score( 0.50 , 0.60, … ) = 0.55

Chris score( 0.50 , 0.40, … ) = 0.45

Probability of being in lab on Saturday nights

0.3

0.9

0.4

PersonnelBiometrics

score( … )

Question: Find two people in the lab last Saturday night

a Top-2 query over the above probabilistic database under the above scoring function

Motivating Example II

Sensor Network in a Habitat Sample Question

“What is the temperature of the warmest spot?” Data

Sensor readings from different sensors At a sampling time, only one “real” reading from a

sensor Each sensor reading comes with a confidence value

Motivating Example II (cont.)

Temp (F)

22

10

25

15

Prob

0.6

0.1

Question: What is the temperature of the warmest spot?

a Top-1 query over the above probabilistic database under the scoring function proportional to temperature

0.4

0.6

C1 (from Sensor 1)

C2 (from Sensor 2)

Outline

Background Motivation Examples Top-k Queries in Probabilistic Databases

Semantics Query Evaluation

Conclusion

Models A probabilistic relation Rp=<R, p, >

R: the support deterministic relation p: probability function : a partition of R, such that

Simple v.s. General probabilistic relation Simple

Assume tuple independence, i.e. ||=|R| E.g. smart environment example

General Tuples can be independent or exclusive, i.e. ||<|R| E.g. sensor network example

Challenges

Given A probabilistic relation Rp=<R, p, > An injective scoring function s over R

No ties A non-negative integer k

What is the top-k answer set over Rp ? (Semantics)

How to compute the top-k answer of Rp ? (Query Evaluation)

What is a “Good” Semantics?

Desired Properties Exact-k Faithfulness Stability

Properties

Exact-k If R has at least k tuples, then exactly k tuples are returned

as the top-k answer

Faithfulness A “better” tuple, i.e. higher in score and probability, is more

likely to be in the top-k answer, compared to a “worse” one

Stability Raising the score/prob. of a winning tuple will not cause it

to lose Lowering the score/prob. of a losing tuple will not cause it

to win

Global-Topk Semantics



What is the top-k answer set over Rp ? (Semantics) Global-Topk

Return the k highest-ranked tuples according to their probability of being in top-k answers in possible worlds

Global-Topk satisfies aforementioned three properties

Smart Environment Example

Score( 0.50 , 0.40, … ) = 0.45Chris

Score( 0.50 , 0.60, … ) = 0.55Bob

Score( 0.70 , 0.60, … ) = 0.65Aiden


Prob.

0.3

0.9

0.4

PersonnelBiometrics

Score( … )

Query: Find two people in lab on last Saturday night

Aiden

Bob

Chris

Aiden Bob ChrisAiden

Bob

Aiden

Chris

Bob

Chris

0.0180.042 0.378 0.028 0.1620.108

0.2520.012

Top-2

possible worlds

Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292

Global-Topk Semantics:

Pr(Aiden in top-2) = 0.3Pr(Bob in top-2) = 0.9 Top-2 Answer

Other Semantics

Soliman, Ilyas & Chang 2007 Two Alternative Semantics

U-Topk U-kRanks

U-Topk Semantics



What is the top-k answer set over Rp ? (Semantics) U-Topk

Return the most probable top-k answer set that belongs to possible worlds

U-Topk does not satisfies all three properties


Score( 0.50 , 0.40, … ) = 0.45Chris

Score( 0.50 , 0.60, … ) = 0.55Bob

Score( 0.70 , 0.60, … ) = 0.65Aiden


Prob.

0.3

0.9

0.4

PersonnelBiometrics

Score( … )


Aiden

Bob

Chris


Bob

Aiden

Chris

Bob

Chris

0.0180.042 0.378 0.028 0.1620.108

0.2520.012

Top-2

possible worlds

Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27

U-Topk Semantics:

…Pr({Bob}) = 0.378 Top-2 Answer

U-kRanks Semantics



What is the top-k answer set over Rp ? (Semantics) U-kRanks

For i=1,2,…,k, return the most probable ith-ranked tuples across all possible worlds

U-kRanks does not satisfies all three properties


Score( 0.50 , 0.40, … ) = 0.45Chris

Score( 0.50 , 0.60, … ) = 0.55Bob

Score( 0.70 , 0.60, … ) = 0.65Aiden


Prob.

0.3

0.9

0.4

PersonnelBiometrics

Score( … )


Aiden

Bob

Chris


Bob

Aiden

Chris

Bob

Chris

0.0180.042 0.378 0.028 0.1620.108

0.2520.012

Top-2

possible worlds

e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.292

U-kRanks Semantics:

Top-2 Answer{Bob}

Aiden Bob

Rank-1

Rank-2

0.3

0

0.63

0.27

0.028

0.264

Chris Highest at rank-1

Highest at rank-2

Properties

Semantics Exact-k Faithfulness Stability

Global-Topk

U-Topk

U-kRanks

Yes

No

No

Yes

Yes/No*

No

Yes

Yes

No

* Yes when the relation is simple, No otherwise

A better sementics

Challenges



What is the top-k answer set over Rp ? (Semantics)

How to compute the top-k answer of Rp ? (Query Evaluation)

Global-Topk

Global-Topk in Simple Relation

Given Rp=<R, p, >, a scoring function s, a non-negative integer k Assumptions

Tuples are independent, i.e. ||=|R| R={t1,t2,…tn}, ordered in the decreasing order of their

scores, i.e.

Global-Topk in Simple Relation

Query Evaluation Recursion

Pk,s(ti): Global-Topk probability of tuple ti

Dynamic Programming

Optimization Threshold Algorithm (TA)

[Fagin & Lotem 2001] Given a system of objects, such that

For each object attribute, there is a sorted list ranking objects in the decreasing order of its score on that attribute

An aggregation function f combines individual attribute scores xi, i=1,2,…m, to obtain the overall object score f(x1,x2,…,xm)

f is monotonic f(x1,x2,…,xm)<= f(x’1,x’2,…,x’m) whenever xi<=x’i for every i

TA is cost-optimal in finding the top-k objects TA and its variants are widely used in ranking queries, e.g.

top-k, skyline, etc.

Applying TA Optimization

Global-Topk Two attributes: probability & score Aggregation function: Global-Topk probability

Global-Topk in General Relation

Given Rp=<R, p, >, a scoring function s, a non-negative integer k Assumptions

Tuples are independent or exclusive, i.e. ||<|R| R={t1,t2,…tn}, ordered in the decreasing order of their

scores, i.e.


Induced Event Relation For each tuple in R, there is a probabilistic relation

Ep=<E, pE, E> generated by the following two rules

Ep is simple

Sensor Network Example

Temp (F)

22

10

25

15

Prob

0.6

0.1

0.4

0.6

C1 (from Sensor 1)

C2 (from Sensor 2)

15 0.6

Event

teC1

tet

0.6 =

0.6 = p(t)

For example:

Induced Event Relation (simple)

t=

where i=1

Prob

Rule 1

Rule 2

Prob. Relation (general)

Evaluating Global-Topk in General Relation

For each tuple t, generate corresponding induced event relation

Compute the Global-Topk probability of t by Theorem 4.3

Pick the k tuples with the highest Global-Topk probability

Summary on Query Evaluation

Simple (Independent Tuples) Dynamic Programming

Tuples are ordered on their scores Recursion on the tuple index and k

General (Independent/Exclusive Tuples) Polynomial reduction to simple cases

Complexity

Global-Topk

U-Topk U-kRanks

Simple O(kn) O(kn) O(kn)

General O(kn2) Θ(mknk-1 lg n)* Ω(mnk-1)*

* m is a rule engine related factor m represents how complicated the relationship between tuples could be

Outline

Background Motivation Examples Top-k Queries in Probabilistic Databases Conclusion

Conclusion

Three intuitive semantic properties for top-k queries in probability databases

Global-Topk semantics which satisfies all the properties above

Query evaluation algorithm for Global-Topk in simple and general probabilistic databases

Future Problems Weak order scoring function

Allow ties Not clear how to extend properties Not clear how to define the semantics (other than “arbitrary

tie breaker”) Preference Strength

Sensitivity to Score Given a prob. relation Rp, if the DB is sufficiently large, by

manipulating the scores of tuples, we would be able to get different answers

NOT satisfied by our semantics NOT satisfied by any semantics in literature

Need to consider preference strength in the semantics

Thank you !

Related Works

Introduction to Probabilistic Databases Probabilistic DB Model & Probabilistic Relational

Algebra [Fubr & Rölleke 1997] Top-K Query in Probabilistic Databases

On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases [Zhang & Chomicki 2008]

Alternative Top-k Semantics and Query Evaluation in Probabilistic Databases [Soliman, Ilyas & Chang 2007]

on the semantics and evaluation of top-k queries in probabilistic databases presented by xi zhang...

Documents

db e dt3

ir e dt3

ir e dt2

db e du2

ir e du2

db e du3

ir e du3

db e dt4