Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik


TRANSCRIPT

Page 1: probabilistic ranking

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Vagelis Hristidis, Florida International University; Gerhard Weikum, MPI Informatik

Page 2: probabilistic ranking

Vagelis Hristidis VLDB 2004 2

Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 3: probabilistic ranking


Motivation

SQL Returns Unordered Sets of Results

Overwhelms Users of Information Discovery Applications

How Can Ranking be Introduced, Given that ALL Results Satisfy Query?

Page 4: probabilistic ranking


Example – Realtor Database

House Attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year

Query: City = 'Seattle' AND Waterfront = TRUE

Too Many Results! Intuitively, Houses with lower Price, more Bedrooms, or a BoatDock are generally preferable

Page 5: probabilistic ranking


Rank According to Unspecified Attributes

Score of a Result Tuple t depends on:

Global Score: Global Importance of Unspecified Attribute Values [CIDR2003]. E.g., Newer Houses are generally preferred

Conditional Score: Correlations between Specified and Unspecified Attribute Values. E.g., Waterfront → BoatDock; Many Bedrooms → Good School District

Page 6: probabilistic ranking


Key Problems

Given a Query Q, How to Combine the Global and Conditional Scores into a Ranking Function? Use Probabilistic Information Retrieval (PIR).

How to Calculate the Global and Conditional Scores? Use the Query Workload and Data.

Page 7: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 8: probabilistic ranking


Architecture

Page 9: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 10: probabilistic ranking


PIR Review – Bayes' Rule, Product Rule

Document (Tuple) t, Query Q
R: Relevant Documents
R̄ = D − R: Irrelevant Documents

Bayes' Rule:   p(a|b) = p(b|a) p(a) / p(b)
Product Rule:  p(a,b|c) = p(a|c) p(b|a,c)

Score(t) = p(R|t) / p(R̄|t)
         = [p(t|R) p(R) / p(t)] / [p(t|R̄) p(R̄) / p(t)]
         ∝ p(t|R) / p(t|R̄)
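The Bayes-rule derivation can be made concrete with a minimal sketch; the tuple names and likelihoods below are invented for illustration, not taken from the paper:

```python
# Rank-equivalent score from Bayes' rule: Score(t) = p(R|t)/p(Rbar|t)
# reduces to p(t|R)/p(t|Rbar) once the tuple-independent factor
# p(R)/p(Rbar) is dropped (it is the same for every tuple).
def score(p_t_given_r, p_t_given_rbar):
    return p_t_given_r / p_t_given_rbar

# Assumed likelihoods of each tuple under the relevant (R) and
# irrelevant (Rbar) models:
likelihoods = {"t1": (0.08, 0.02), "t2": (0.03, 0.06)}
ranking = sorted(likelihoods, key=lambda t: score(*likelihoods[t]), reverse=True)
```

Dropping p(R)/p(R̄) changes every score by the same factor, so the ranking is unaffected.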

Page 11: probabilistic ranking


Ranking Function – Adapt PIR

Query Q: X1=x1 AND … AND Xs=xs, X = {X1, …, Xs}

Result-Tuple t(X,Y), where X: Specified Attributes, Y: Unspecified Attributes

Score(t) ∝ p(t|R) / p(t|D)
         = p(X,Y|R) / p(X,Y|D)
         = [p(X|R) p(Y|X,R)] / [p(X|D) p(Y|X,D)]
         ∝ p(Y|R) / p(Y|X,D)

(R̄ is approximated by D, since R is much smaller than D; every tuple in R satisfies X, so p(X|R) = 1 and p(Y|X,R) = p(Y|R); p(X|D) is common to all result tuples and can be dropped.)

Page 12: probabilistic ranking


Ranking Function – Limited Conditional Independence

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed:

p(X|C) = ∏_{x∈X} p(x|C)        p(Y|C) = ∏_{y∈Y} p(y|C)

Score(t) ∝ ∏_{y∈Y} [ p(y|R) / p(y|D) ] · ∏_{y∈Y} ∏_{x∈X} [ 1 / p(x|y,D) ]

Probabilities conditioned on R are estimated using the Workload; probabilities conditioned on D are estimated using the Data.
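As a concrete (hypothetical) instance, the score Score(t) ∝ ∏_{y∈Y} p(y|R)/p(y|D) · ∏_{y∈Y}∏_{x∈X} 1/p(x|y,D) can be evaluated directly once the atomic probabilities are available; all attribute values and probabilities below are invented for illustration:

```python
from math import prod

def score(Y, X, p_y_R, p_y_D, p_x_given_y_D):
    # Product over unspecified values y: p(y|R)/p(y|D),
    # times product over all (x, y) pairs: 1/p(x|y,D).
    s = prod(p_y_R[y] / p_y_D[y] for y in Y)
    s *= prod(1.0 / p_x_given_y_D[(x, y)] for y in Y for x in X)
    return s

# Query specifies Waterfront=TRUE; the tuple's unspecified value is BoatDock=TRUE.
s = score(
    Y=["BoatDock=TRUE"],
    X=["Waterfront=TRUE"],
    p_y_R={"BoatDock=TRUE": 0.6},
    p_y_D={"BoatDock=TRUE": 0.2},
    p_x_given_y_D={("Waterfront=TRUE", "BoatDock=TRUE"): 0.5},
)
# 0.6/0.2 * 1/0.5 = 6.0
```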

Page 13: probabilistic ranking


Atomic Probabilities Estimation Using Workload W

If Many Queries Specify Set X of Conditions then there is Preference Correlation between Attributes in X.

Global: E.g., If Many Queries ask for Waterfront then p(Waterfront=TRUE) is high.

Conditional: E.g., If Many Queries ask for 4-Bedroom Houses in Good School Districts, then p(Bedrooms=4 | SchoolDistrict='good') and p(SchoolDistrict='good' | Bedrooms=4) are high.

Using Limited Conditional Independence:

p(y|R) ≈ p(y|X,W) = p(y|W) · ∏_{x∈X} p(x|y,W)
                     Global Part   Conditional Part

The Probabilities p(x|y,W) (and p(x|y,D)) are Calculated Using Standard Association Rule Mining Techniques on W (and D)
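The workload-side estimation can be sketched as simple (co-)occurrence counting, the frequency-counting analogue of association-rule confidence; the toy workload below is invented:

```python
from collections import Counter
from itertools import permutations

# Each workload query is represented as the set of attribute values it specifies.
workload = [
    {"Waterfront=TRUE", "City=Seattle"},
    {"Waterfront=TRUE", "BoatDock=TRUE"},
    {"Waterfront=TRUE"},
    {"Bedrooms=4", "SchoolDistrict=good"},
]

single, pair = Counter(), Counter()
for q in workload:
    single.update(q)
    pair.update(permutations(q, 2))  # ordered pairs of co-specified values

def p_y_W(y):
    # Global part: fraction of workload queries specifying y.
    return single[y] / len(workload)

def p_x_given_y_W(x, y):
    # Conditional part: confidence of the "rule" y -> x in the workload.
    return pair[(x, y)] / single[y]
```

The same counting applied to the database tuples instead of the workload queries yields the p(·|D) estimates.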

Page 14: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 15: probabilistic ranking


Performance

Page 16: probabilistic ranking


Scan Algorithm

Preprocessing – Atomic Probabilities Module: Computes and Indexes the Quantities p(y|W), p(y|D), p(x|y,W), and p(x|y,D) for All Distinct Values x and y

Execution:
- Select Tuples that Satisfy the Query
- Scan and Compute Score for Each Result-Tuple
- Return Top-K Tuples
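The execution phase of Scan can be sketched with a bounded heap that keeps only the k best tuples seen so far; the scoring function is left abstract and supplied by the caller:

```python
import heapq

def scan_topk(result_tuples, score, k):
    # Score every tuple that satisfied the query; keep the k best.
    heap = []  # min-heap of (score, position): heap[0] is the worst kept entry
    for i, t in enumerate(result_tuples):
        s = score(t)
        if len(heap) < k:
            heapq.heappush(heap, (s, i))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, i))
    return [result_tuples[i] for _, i in sorted(heap, reverse=True)]

# Toy usage: tuples scored by their single numeric field.
top2 = scan_topk([5, 1, 9, 3, 7], score=lambda t: t, k=2)
```

The cost is one score computation per selected tuple, which is why Scan degrades as the number of tuples satisfying the query grows (see the execution experiments below).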

Page 17: probabilistic ranking


List Merge Algorithm

Preprocessing: For Each Distinct Value x of the Database, Calculate and Store the Conditional (Cx) and the Global (Gx) Lists as follows.

For Each Tuple t Containing x, Calculate

CondScore = ∏_{z∈t} p(x|z,W) / p(x|z,D)

GlobScore = ∏_{z∈t} p(z|W) / p(z|D)

and add t to Cx and Gx respectively; Sort Cx, Gx by decreasing scores.

Execution: Query Q: X1=x1 AND … AND Xs=xs

Execute Threshold Algorithm [Fag01] on the following lists: Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs

Final Formula:

Score(t) ∝ ∏_{y∈Y} [ p(y|W) / p(y|D) ] · ∏_{y∈Y} ∏_{x∈X} [ p(x|y,W) / p(x|y,D) ]
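The merge step can be sketched as a simplified Threshold Algorithm over score lists whose overall score is the product of the per-list scores; this toy version assumes every tuple appears in every list and omits the paper's optimizations:

```python
def threshold_merge(lists, k):
    # lists: {list_name: {tuple_id: partial_score}}; the overall score of
    # a tuple is the product of its partial scores across all lists.
    ordered = {name: sorted(d.items(), key=lambda kv: -kv[1])
               for name, d in lists.items()}
    seen = {}  # tuple_id -> full score (via "random access" to all lists)
    length = len(next(iter(ordered.values())))
    for depth in range(length):
        threshold = 1.0
        for name, lst in ordered.items():
            tid, s = lst[depth]          # sorted access in each list
            threshold *= s               # product of last-seen partial scores
            if tid not in seen:
                full = 1.0
                for d in lists.values():
                    full *= d[tid]
                seen[tid] = full
        top = sorted(seen.items(), key=lambda kv: -kv[1])[:k]
        # Stop when the k-th best full score already beats the threshold:
        # no unseen tuple can score higher than the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return [tid for tid, _ in top]
    return [tid for tid, _ in sorted(seen.items(), key=lambda kv: -kv[1])[:k]]

top2 = threshold_merge(
    {"Cx": {"a": 0.9, "b": 0.5, "c": 0.1},
     "G":  {"a": 0.8, "b": 0.9, "c": 0.2}}, k=2)
```

Early termination is what lets List Merge avoid scoring every selected tuple, unlike Scan.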

Page 18: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 19: probabilistic ranking


Quality Experiments

Compare our Conditional Ranking Method with the Global Method [CIDR03]

Surveyed 14 MSR employees

Datasets:
- MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
- Internet Movie Database (http://www.imdb.com)

Each User Behaved According to Various Profiles, e.g.: singles, middle-class family, rich retirees… teenage males, people interested in comedies of the 80s…

First Collect Workloads, Then Compare the Results of the 2 Methods for a Set of Queries

Page 20: probabilistic ranking


Quality Experiments – Average Precision

      Seattle Homes        Movies
      COND     GLOB        COND     GLOB
Q1    0.70     0.26        0.48     0.35
Q2    0.76     0.62        0.53     0.43
Q3    0.90     0.54        0.58     0.20
Q4    0.84     0.32        0.45     0.48
Q5    0.44     0.48        0.43     0.40

For 5 queries, users were asked to Mark 10 out of a Set of 30 likely results containing: the Top-10 results of both the Conditional and Global methods plus a few randomly selected tuples.

Precision = Recall
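Since each user marks exactly 10 tuples and each method returns a Top-10 list, precision and recall coincide here; a minimal sketch of the metric (the tuple IDs are illustrative):

```python
def precision_at_k(returned, relevant, k):
    # |top-k ∩ relevant| / k; this equals recall when |relevant| == k.
    return len(set(returned[:k]) & set(relevant)) / k

# 2 of the top 3 returned tuples were marked relevant by the user.
p = precision_at_k(["h1", "h2", "h3"], {"h2", "h3", "h4"}, k=3)
```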

Page 21: probabilistic ranking


Quality Experiments - Fraction of Users Preferring Each Algorithm

[Two bar charts: fraction of users preferring each algorithm (CONDITIONAL vs GLOBAL) for queries Q1–Q5]

Seattle Homes and Movies Datasets; 5 new queries; Top-5 Result-lists

Page 22: probabilistic ranking


Performance Experiments

Setup: Microsoft SQL Server 2000 RDBMS; P4 2.8-GHz PC, 1 GB RAM; C#, Connected to the RDBMS through DAO

Datasets:

Table           NumTuples    Database Size (MB)
Seattle Homes   17463        1.936
US Homes        1380762      140.432

Compared Algorithms: LM (List Merge), Scan

Page 23: probabilistic ranking


Performance Experiments - Precomputation

Time and Space Consumed by Index Module

Dataset   Lists Building Time   Lists Size
Seattle   1500 msec             7.8 MB
US        80000 msec            457.6 MB

Page 24: probabilistic ranking


Performance Experiments - Execution

Varying the Number of Tuples Satisfying the Selection Conditions (US Homes Database, 2-Attribute Queries)

#Selected Tuples   LM Time (msec)   Scan Time (msec)
350                800              6515
2000               700              39234
5000               600              115282
30000              550              566516
80000              500              3806531

Page 25: probabilistic ranking


Performance Experiments - Execution

[Chart: execution time (msec) of LM vs Scan as NumSpecifiedAttributes varies from 1 to 3]

US Homes Database

Page 26: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 27: probabilistic ranking


Related Work

[CIDR2003]: Uses the Workload; Focuses on the Empty-Answer Problem. Drawback: Global Ranking Regardless of the Query. E.g., it is desirable for an expensive house to be away from a tram line, but for a cheap house to be close to one.

Collaborative Filtering: Requires Training Data of Queries and their Ranked Results

Relevance-Feedback Techniques for Learning Similarity in Multimedia and Relational Databases

Page 28: probabilistic ranking


Roadmap

- Problem Definition
- Architecture
- Probabilistic Information Retrieval
- Performance Experiments
- Related Work
- Conclusion

Page 29: probabilistic ranking


Conclusions – Future Work

Conclusions: A Completely Automated Approach for the Many-Answers Problem which Leverages Data and Workload Statistics and Correlations; Based on PIR

Future Work: Empty-Answer Problem; Handle Plain Text Attributes

Page 30: probabilistic ranking


Questions?

Page 31: probabilistic ranking


Performance Experiments - Execution

[Chart: execution time (msec) of LM, LMM, and Scan as NumSelectedTuples varies from 0 to 4000]

LMM: List Merge where the lists for one of the two specified attributes are missing, halving the space

Seattle Homes Database