
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval

Ronan Cummins, Colm O’Riordan
Digital Enterprise Research Institute

SIGIR 2009

2010. 07. 09.
Summarized by Jaehui Park, IDS Lab., Seoul National University

Copyright 2008 by CEBT

CONTENTS
INTRODUCTION
RELATED RESEARCH
PROXIMITY MEASURES
PROXIMITY RETRIEVAL MODEL
EXPERIMENTS
– SETUP
– RESULTS
CONCLUSION



INTRODUCTION
The occurrences of the query-terms in the document
– Intuition: documents in which query-terms occur closer together should be ranked higher than documents in which the query-terms appear far apart.
The relationships between all query-terms
– The pairwise similarity between terms
Contributions
– A list of term-term proximity measures
– An intuitive framework for the proximity model
– Machine learning approach to search through the space of term-term proximity functions
– Performance evaluations



PROXIMITY MEASURES
Example document D and query Q

Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14
D:        a  b  c  d  a  b  d  e  f  g  h  a  i  j
Q:        a  b

pos(D,a) = {1,5,12}, pos(D,b) = {2,6}
tf(D,a) = 3, tf(D,b) = 2
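As a minimal sketch, the position sets and term frequencies of the example can be derived from the token sequence of D. The helper name `positions` is an illustrative assumption, not something named on the slide.

```python
from collections import defaultdict

def positions(doc_terms):
    """Map each term to the list of its 1-based positions in the document."""
    pos = defaultdict(list)
    for i, term in enumerate(doc_terms, start=1):
        pos[term].append(i)
    return dict(pos)

# The example document from the slide, one token per position.
D = "a b c d a b d e f g h a i j".split()

pos = positions(D)
tf = {term: len(plist) for term, plist in pos.items()}

print(pos["a"], pos["b"])  # [1, 5, 12] [2, 6]
print(tf["a"], tf["b"])    # 3 2
```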

12 measures are introduced.
– The distance between the positions of a pair of terms in a document (1~6)
– Combining the term frequencies of each term in the document (7,8)
– The terms in the entire query (9,10)
– Normalization measures (11,12)


PROXIMITY MEASURES
min_dist(a,b,D) = 1
The minimum distance between any occurrences of a and b in D.
– closeness -> relatedness

diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 2
The difference between the average positions of a and b in D.
– Where each term tends to occur

avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
The average distance between a and b for all possible position combinations in D.
– Promoting the terms that consistently occur close to one another in a localised area
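The three measures above can be reproduced with a short sketch over the example position lists. Taking the absolute value in `diff_avg_pos` is an assumption; the slide only shows the case where the first average is the larger one.

```python
from itertools import product

# Positions of the two query terms in the example document D.
pos_a, pos_b = [1, 5, 12], [2, 6]

def min_dist(pa, pb):
    """Smallest distance between any occurrence of a and any occurrence of b."""
    return min(abs(x - y) for x, y in product(pa, pb))

def diff_avg_pos(pa, pb):
    """Difference between the average positions (absolute value is an assumption)."""
    return abs(sum(pa) / len(pa) - sum(pb) / len(pb))

def avg_dist(pa, pb):
    """Average distance over all position combinations of a and b."""
    return sum(abs(x - y) for x, y in product(pa, pb)) / (len(pa) * len(pb))

print(min_dist(pos_a, pos_b))            # 1
print(diff_avg_pos(pos_a, pos_b))        # 2.0
print(round(avg_dist(pos_a, pos_b), 2))  # 4.33
```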



PROXIMITY MEASURES
avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
– The occurrence of a at position 12 may be completely unrelated to b

match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
The smallest distance achievable when each occurrence of a term is uniquely matched to another occurrence of a term.

max_dist(a,b,D) = (12-2) = 10
The maximum distance between any two occurrences of a and b.
– Useful normalization factor
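A rough sketch of avg_min_dist and match_dist on the example positions follows. The brute-force matching and the division by the number of occurrences of the rarer term are assumptions inferred from the worked values on the slide, not a confirmed reading of the paper.

```python
from itertools import permutations

# Positions of the two query terms in the example document D.
pos_a, pos_b = [1, 5, 12], [2, 6]

def avg_min_dist(pa, pb):
    """Average, over the rarer term's occurrences, of the distance to the
    nearest occurrence of the other term."""
    few, many = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    return sum(min(abs(x - y) for y in many) for x in few) / len(few)

def match_dist(pa, pb):
    """Smallest average distance when each occurrence of the rarer term is
    matched to a distinct occurrence of the other term (brute force)."""
    few, many = (pa, pb) if len(pa) <= len(pb) else (pb, pa)
    best = min(
        sum(abs(x - y) for x, y in zip(few, chosen))
        for chosen in permutations(many, len(few))
    )
    return best / len(few)

print(avg_min_dist(pos_a, pos_b))  # 1.0
print(match_dist(pos_a, pos_b))    # 1.0
```

The permutation search grows factorially with the term frequencies, so it is only suitable for illustrating small examples like this one.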



PROXIMITY MEASURES
sum(tf(a),tf(b)) = 3+2 = 5
The sum of the term frequencies of a and b in D.
– An implicit indication of the proximity of both terms

prod(tf(a),tf(b)) = 3*2 = 6
The product of the term frequencies of a and b in D.
– An implicit indication of the proximity of both terms

fullcover(Q,D) = 12
The length of the document that covers all occurrences of the query-terms.
– A query-specific measure

min_cover(Q,D) = 2
The length of the document that covers all query-terms at least once.
– min_dist + 1 for a two-term query
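The term-frequency and cover measures can be sketched as follows. The function names and the sliding-window approach are illustrative assumptions; the sketch also assumes that only query-terms that actually occur in D are passed in.

```python
# Positions of the query terms that occur in the example document D.
pos = {"a": [1, 5, 12], "b": [2, 6]}

tf_sum = len(pos["a"]) + len(pos["b"])   # sum of term frequencies
tf_prod = len(pos["a"]) * len(pos["b"])  # product of term frequencies

def full_cover(pos):
    """Length of the span that covers every occurrence of every query term."""
    all_pos = [p for plist in pos.values() for p in plist]
    return max(all_pos) - min(all_pos) + 1

def min_cover(pos):
    """Length of the shortest span containing each query term at least once."""
    events = sorted((p, t) for t, plist in pos.items() for p in plist)
    need, count, best, left = len(pos), {}, None, 0
    for right_pos, term in events:
        count[term] = count.get(term, 0) + 1
        # Shrink the window from the left while it still holds every term.
        while len(count) == need:
            left_pos, left_term = events[left]
            span = right_pos - left_pos + 1
            best = span if best is None else min(best, span)
            count[left_term] -= 1
            if count[left_term] == 0:
                del count[left_term]
            left += 1
    return best

print(tf_sum, tf_prod)  # 5 6
print(full_cover(pos))  # 12
print(min_cover(pos))   # 2
```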



PROXIMITY MEASURES
dl(D) = 14
The length of the document.
– A useful factor for normalization in IR

qt(Q,D) = 2
The number of unique terms that match both document and query.



PROXIMITY MEASURES
Correlations of measures
– FBIS, FT, FR collections from TREC disks 4 and 5
– OHSUMED collection
– Performing re-ranking on the top-N (=1000) documents from an initial ranked list using a proximity function
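The re-ranking step can be pictured with a small sketch. The function name `rerank`, the document ids and the distance values are all hypothetical; negating the distance so that closer terms rank higher is likewise an assumption made for the illustration.

```python
def rerank(initial_ranking, proximity_score, top_n=1000):
    """Re-order the first top_n documents by proximity_score (highest first)
    and leave the remainder of the initial ranking untouched."""
    head = sorted(initial_ranking[:top_n], key=proximity_score, reverse=True)
    return head + initial_ranking[top_n:]

# Hypothetical min_dist values for four documents of an initial ranked list.
toy_min_dist = {"d1": 4, "d2": 1, "d3": 7, "d4": 2}
initial = ["d1", "d2", "d3", "d4"]

# Smaller distance should rank higher, so the distance is negated.
reranked = rerank(initial, lambda d: -toy_min_dist[d], top_n=3)
print(reranked)  # ['d2', 'd1', 'd3', 'd4']
```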



PROXIMITY MEASURES
Inverse correlations
– Exceptions: qt (correlated with relevance)



PROXIMITY RETRIEVAL MODEL
Extending a vector model
– Documents and queries as matrices
– Ex) 3-term query
– w(): a standard term-weighting scheme
– p(): a proximity function
No theoretical basis
– An intuitive extension of a vector based approach
Genetic Programming (GP) technique
– Combining some or all of the 12 proximity measures
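The matrices themselves are not reproduced in this summary, so the following is only one plausible reading: w() fills the diagonal (one entry per query-term), p() fills the off-diagonal entries (one per term pair), and the document score adds both parts. The additive combination, the function names and the toy values are all assumptions.

```python
from itertools import combinations

def score(query_terms, w, p):
    """Sum of per-term weights plus proximity over all unordered term pairs."""
    term_part = sum(w(t) for t in query_terms)
    pair_part = sum(p(a, b) for a, b in combinations(query_terms, 2))
    return term_part + pair_part

# Hypothetical weights for a 3-term query and a constant proximity value.
toy_weights = {"t1": 1.0, "t2": 2.0, "t3": 0.5}
query = ["t1", "t2", "t3"]

pairs = list(combinations(query, 2))
print(len(pairs))  # 3
print(score(query, toy_weights.get, lambda a, b: 0.25))  # 4.25
```

A 3-term query therefore contributes three term weights and three pairwise proximity values under this reading.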



EXPERIMENTAL SETUP
Term weighting scheme
– BM25 scheme
– Previous work
Proximity function
– The benchmark proximity functions
– BM25 + t()
– ES + t()



EXPERIMENTAL SETUP
GP process
– A heuristic stochastic search algorithm
Training
– Financial Times
– 69500 documents
– Queries: 25 title only, 30 title + descriptions
– Fitness function: MAP
GP
– Ranking documents using the weighting scheme for top 3000 documents
– 6 runs of GP
– Initial population of 2000 for 30 generations
– Elitist strategy
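Since the fitness of each candidate function is its MAP on the training queries, a small sketch of that computation may help. It uses the standard definition of mean average precision; the function names, document ids and relevance judgements are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries with their ranked lists and relevant sets.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
fitness = mean_average_precision(runs)
print(round(fitness, 3))  # 0.667
```

In a GP run, a value like `fitness` would be computed once per candidate proximity function and used to decide which candidates survive into the next generation.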



EXPERIMENTAL RESULTS
Wilcoxon signed-rank test



CONCLUSION
We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
We have indicated the potential correlation between each of the individual measures and relevance.
– min_dist is highly correlated with relevance.
We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
We adopt a population-based learning technique (GP) which learns useful proximity functions.
– An evaluation of three proximity functions
It is possible to use combinations of proximity measures to improve the performance of IR systems for both short and long queries.