Query Dependent Pseudo-Relevance Feedback based on Wikipedia
SIGIR ‘09
Advisor: Dr. Koh Jia-Ling
Speaker: Lin, Yi-Jhen
Date: 2010/01/24


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Introduction

• Queries are often too short to convey the specific information need clearly.

• Among query expansion methods, we prefer pseudo-relevance feedback (PRF) because it requires no user input.

•The problem is that the top retrieved documents frequently contain non-relevant documents, which introduce noise into the expanded query.


Introduction

• Meanwhile, web resources such as Wikipedia have emerged that can potentially supplement the initial search results in PRF.

•Wikipedia covers a great many topics, and it reflects the diverse interests and information needs of users.

•The basic entry in Wikipedia is an entity page, which is an article that contains information focusing on one single entity.


Introduction

• The aim of this study is to explore the possible utility of Wikipedia as a resource for improving PRF in IR.

• We propose three methods for expansion term selection and evaluate their effectiveness, each modeling the Wikipedia-based pseudo-relevance information from a different perspective.

•We incorporate the expansion terms into the original query and use language modeling IR to evaluate these methods.


Introduction – Query Types

• Using PRF on the basis of Wikipedia, we categorize a query into one of three types:
  1) EQ: queries about a specific entity, e.g. “Scalable Vector Machine”
  2) AQ: ambiguous queries, e.g. “Apple”, “Pirate” (http://en.wikipedia.org/wiki/Pirate_(disambiguation)), “Poaching” (http://en.wikipedia.org/wiki/Poaching_(disambiguation))
  3) BQ: broader queries (neither EQ nor AQ)


Introduction – Pseudo-Relevant Documents

• Pseudo-relevant documents are generated in two ways according to the query type:
  1) using the top-ranked articles retrieved from Wikipedia (regarded as a collection) in response to the query
  2) using the Wikipedia entity page corresponding to the query


Introduction – Term Distribution and Structure

• In selecting expansion terms, term distributions and the structure of Wikipedia pages are taken into account.

• We propose and compare a supervised method and an unsupervised (field-based) method for this task.


Introduction

[Diagram: for the query “Data mining”, check Wikipedia to find pseudo-relevant documents, take the top-ranked docs, and select terms to expand the query.]


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Query Categorization

• We briefly summarize the relevant features of Wikipedia for our study, and then examine the different categories of queries that are typically encountered.

•Wikipedia Data Set (summarize the relevant features )

•Query Categorization


Wikipedia Data Set

•An analysis of 200 randomly selected articles describing a single topic showed that only 4% failed to adhere to a standard format.


Query Categorization

• We define three types of queries according to their relationship with Wikipedia topics:

• 1) EQ: queries about a specific entity, e.g. “Scalable Vector Machine”

• 2) AQ: ambiguous queries, e.g. “Apple”, “Pirate” (http://en.wikipedia.org/wiki/Pirate_(disambiguation)), “Poaching” (http://en.wikipedia.org/wiki/Poaching_(disambiguation))

• 3) BQ: broader queries (neither EQ nor AQ)
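The categorization above can be sketched as a lookup against Wikipedia titles, checking disambiguation pages before entity pages (a minimal illustration; the title sets here are hypothetical stand-ins for actual Wikipedia lookups):

```python
def categorize_query(query, entity_titles, disambiguation_titles):
    """Categorize a query as EQ (entity), AQ (ambiguous), or BQ (broader)
    based on its relationship with Wikipedia topics."""
    q = query.lower()
    if q in disambiguation_titles:  # matches a disambiguation page -> ambiguous
        return "AQ"
    if q in entity_titles:          # matches a single entity page -> entity query
        return "EQ"
    return "BQ"                     # neither -> broader query

# Tiny hypothetical title sets for illustration
entities = {"scalable vector machine"}
disambigs = {"apple", "pirate", "poaching"}

print(categorize_query("Apple", entities, disambigs))                    # AQ
print(categorize_query("Scalable Vector Machine", entities, disambigs))  # EQ
print(categorize_query("climate policy", entities, disambigs))           # BQ
```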


Query Categorization – Strategy for AQ (a disambiguation process)

• Given an ambiguous query:
  1) use the query-likelihood language model (initial search) to retrieve the top-ranked 100 documents
  2) cluster these documents using K-means
  3) rank the clusters by a cluster-based language model
  4) compare the top-ranked cluster to all the referents (entity pages) extracted from the disambiguation page associated with the query
  5) choose the top-matching entity page for the query
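Steps 4–5 of the process above can be sketched as matching the top-ranked cluster against the referent pages; the toy documents and the use of cosine similarity over raw term frequencies are illustrative assumptions, not the paper's exact matching function:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def pick_referent(top_cluster_docs, referent_pages):
    """Match the top-ranked cluster against the referent entity pages
    from the disambiguation page; return the best-matching title."""
    cluster_tf = Counter(w for d in top_cluster_docs for w in d.split())
    return max(referent_pages,
               key=lambda title: cosine(cluster_tf,
                                        Counter(referent_pages[title].split())))

# Hypothetical toy data for the ambiguous query "apple"
cluster = ["iphone ipad mac apple software", "apple computer company products"]
referents = {
    "Apple Inc.": "apple computer company iphone mac software",
    "Apple (fruit)": "apple fruit tree orchard cultivar",
}
print(pick_referent(cluster, referents))  # → Apple Inc.
```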


Query Categorization

• 3) rank the clusters by a cluster-based language model, as proposed by Lee et al. [19]


Query Categorization – Evaluation

• 650 queries in total

• Each participant was asked to judge whether a query is ambiguous or not.

• If it was, the participant determined which referent from the disambiguation page is most likely to be mapped to the query.

• If it was not, the participant manually searched Wikipedia with the query to identify whether or not it is defined by Wikipedia (EQ).


Query Categorization

• Evaluation

• Participants were in general agreement (87%) in judging whether a query is ambiguous or not.

• In determining which referent a query should be mapped to, there was only 54% agreement.


Query Categorization – Evaluation of the cluster-based disambiguation process (most queries from TREC topic sets are AQ)

• We define that for each query:

• If at least two participants indicate a referent as the most likely mapping target, this target is used as the answer.

• If a query has no answer, it is not counted in the evaluation process.

• Experimental results show that our disambiguation process leads to an accuracy of 57% for AQ.


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Query Expansion Method

• #1 Relevance Model (baseline)

• A relevance model is a query expansion approach based on the language modeling framework.

• In the model, the query words q1, q2, …, qm and the words w in relevant documents are sampled identically and independently from a distribution R.
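The slides do not reproduce the formula; as a reference point, the standard relevance model estimate of Lavrenko and Croft (an assumption here, since the slides omit it) is:

```latex
P(w \mid R) \approx P(w \mid q_1,\dots,q_m)
  = \frac{P(w, q_1,\dots,q_m)}{P(q_1,\dots,q_m)},
\qquad
P(w, q_1,\dots,q_m) = \sum_{D \in F} P(D)\, P(w \mid D) \prod_{i=1}^{m} P(q_i \mid D)
```

where F is the set of pseudo-relevant feedback documents.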


Query Expansion Method

• #2 Strategy for Entity/Ambiguous Queries

• Both EQ and AQ can be associated with a specific Wikipedia entity page.

• Instead of considering the top-ranked documents from the test collection, only the corresponding entity page from Wikipedia is used as pseudo-relevant information.


Query Expansion Method

• #2 Strategy for Entity/Ambiguous Queries

• All the terms in the entity page are ranked, then the top K terms are chosen for expansion:

• Score(t) = tf · idf, where idf = log(N / df) and N is the number of documents in the Wikipedia collection
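A minimal sketch of this scoring step (the toy "entity page" text and the document-frequency table are illustrative assumptions):

```python
import math
from collections import Counter

def top_k_expansion_terms(entity_page_text, df, n_docs, k=5):
    """Rank terms in an entity page by Score(t) = tf * idf,
    with idf = log(N / df), and return the top-k terms."""
    tf = Counter(entity_page_text.lower().split())
    scores = {t: tf[t] * math.log(n_docs / df.get(t, 1)) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Toy example: a tiny "entity page" and hypothetical document frequencies
page = "data mining data analysis patterns large data sets knowledge discovery"
df = {"data": 50000, "mining": 800, "analysis": 20000, "patterns": 5000,
      "large": 60000, "sets": 30000, "knowledge": 9000, "discovery": 4000}
print(top_k_expansion_terms(page, df, n_docs=100000, k=3))
# → ['mining', 'discovery', 'patterns']
```

Note how the rare term "mining" outranks "data" despite the latter's higher frequency, because idf penalizes common terms.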


Query Expansion Method – Term Distribution and Structure

• #3 Field Evidence for Query Expansion

• E.g., the importance of a term appearing in the overview may differ from that of the same term appearing in an appendix.

• We examine two methods for utilizing evidence from different fields:

• #3.1 Unsupervised Method
• #3.2 Supervised Method


#3.1 Unsupervised Method

• We replace the term frequency in a pseudo-relevant document from the original relevance model with a linear combination of weighted per-field term frequencies.
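The replacement can be sketched as tf'(t, D) = Σ_f w_f · tf_f(t, D); the per-field counts and weight values below are illustrative assumptions, not the paper's tuned weights:

```python
def weighted_tf(term, field_tfs, field_weights):
    """Unsupervised field-based expansion: replace a term's raw frequency
    with a linear combination of its per-field frequencies,
    tf'(t, D) = sum_f w_f * tf_f(t, D)."""
    return sum(field_weights[f] * field_tfs.get(f, {}).get(term, 0)
               for f in field_weights)

# Hypothetical per-field counts for the term "mining" in one Wikipedia page
fields = {"overview": {"mining": 2}, "content": {"mining": 5}, "links": {"mining": 1}}
weights = {"overview": 0.2, "content": 0.5, "links": 0.3}  # illustrative weights
print(weighted_tf("mining", fields, weights))  # ≈ 3.2
```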


#3.2 Supervised Method

• An alternative way of utilizing field-information evidence is to transfer it into features for supervised learning.

• An SVM with a radial basis function (RBF) kernel is used.

• Each expansion term is represented by a feature vector (10 features here).


#3.2 Supervised Method

• The first group of features captures term distributions (TD) in the PRF documents and collections.

• The features we used include:
  1) TD in the test collection
  2) TD in PRF (top 10) from the test collection
  3) TD in the Wikipedia collection
  4) TD in PRF (top 10) from the Wikipedia collection


#3.2 Supervised Method

• The second group of features is based on field evidence (structure).

• As described before, we divide each entity page into six fields. One feature is defined for each field.
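The slides do not reproduce the per-field feature formula; the sketch below is an assumed form (a term's weighted relative frequency within the field), used only to illustrate what "one feature per field" might look like:

```python
def field_feature(term, field_text, field_weight=1.0):
    """One per-field feature for a candidate expansion term.
    Assumed sketch: the term's weighted relative frequency within the field.
    (The slide's exact formula is not reproduced here.)"""
    tokens = field_text.lower().split()
    if not tokens:
        return 0.0
    return field_weight * tokens.count(term) / len(tokens)

print(field_feature("mining", "data mining and text mining methods"))  # ≈ 0.333
```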


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Experiment

• Experiment Settings

• In our experiments, documents are retrieved for a given query by the query-likelihood language model (initial search).

• Experiments were conducted using four TREC collections.

• Retrieval effectiveness is measured in terms of Mean Average Precision (MAP).


Experiment

• Baselines:
  query-likelihood model (QL)
  relevance model (RMC)
  relevance model based on Wikipedia (RMW)

• Parameters for the relevance model: N = 10 (pseudo-relevant docs), K = 100 (added terms), λ = 0.6


Experiment – Using Entity Pages for Relevance Feedback

• Our method utilizes only the entity page corresponding to the query for PRF (RE).

• Not all queries can be mapped to a specific Wikipedia entity page, so the method is only applicable to EQ and AQ.


Experiment

• Field-based expansion

• Note that in our experiments on field-based expansion, the top retrieved documents are considered pseudo-relevant documents for EQ and AQ (not an entity page).

• The method is also applicable to BQ.


Experiment – Field-based expansion

• For the supervised method, we compare two ways of incorporating expansion terms for retrieval.

• The first is to add the top-ranked 100 good terms (SL).

• The second is to add the top-ranked 10 good terms, each weighted by its classification probability (SLW).

• Unsupervised method: the relevance model with weighted TFs is denoted as RMWTF.


Experiment

• Performance improves as the weights for Links and Content increase.

• Increasing the weight of Overview leads to a deterioration in performance.

• This shows that the positions where a term appears have different impacts as indicators of term relevancy.


Experiment

• Query Dependent Expansion

• RE for EQ and AQ; RMWTF for BQ.

• We denote this query dependent method as QD.


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Conclusion

• We have explored the utilization of Wikipedia in PRF.

• TREC topics are categorized into three types based on Wikipedia. (We evaluated these methods on four TREC collections.)

• We propose and study different methods for term selection using pseudo-relevance information from Wikipedia entity pages.

• Our experimental results show that the query dependent approach can improve over a baseline relevance model.


Outline

• Introduction
• Query Categorization
• Query Expansion Methods
• Experiments
• Conclusion
• Future Work


Future Work

• More investigation is needed for the broader queries.

• For ambiguous queries, if the disambiguation process can achieve higher accuracy, the effectiveness of the final retrieval will improve.

• For the supervised term selection method, the results obtained are not yet satisfactory in terms of accuracy.

• By combining the initial results from the test collection and Wikipedia, one may be able to develop an expansion strategy that is robust to the query being degraded by either resource.
