Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval
Ronan Cummins, Colm O’Riordan
Digital Enterprise Research Institute
SIGIR 2009
2010. 07. 09.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Copyright 2008 by CEBT
CONTENTS
INTRODUCTION
RELATED RESEARCH
PROXIMITY MEASURES
PROXIMITY RETRIEVAL MODEL
EXPERIMENTS
– SETUP
– RESULTS
CONCLUSION
INTRODUCTION
The occurrences of the query-terms in the document
Intuition
– Documents in which query-terms occur closer together should be ranked higher than documents in which the query-terms appear far apart.
The relationships between all query-terms
– The pairwise similarity between terms
Contributions
– A list of term-term proximity measures
– An intuitive framework for the proximity model
– Machine learning approach to search through the space of term-term proximity functions
– Performance evaluations
PROXIMITY MEASURES
Example document D and query Q:
Position   1  2  3  4  5  6  7  8  9 10 11 12 13 14
D          a  b  c  d  a  b  d  e  f  g  h  a  i  j
Q          a  b
pos(D,a) = {1,5,12}, pos(D,b) = {2,6}
tf(D,a) = 3, tf(D,b) = 2
12 measures are introduced.
– Measures 1~6: the distance between the positions of a pair of terms in a document
– Measures 7, 8: combinations of the term-frequencies of each term in the document
– Measures 9, 10: measures over the terms in the entire query
– Measures 11, 12: normalization measures
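As a minimal sketch, the position lists and term frequencies of the example can be derived from the token sequence in a single pass. The helper name `positions` and the lowercase final token `j` are assumptions made for illustration; positions are 1-based, as in the example.

```python
from collections import defaultdict

def positions(doc_terms):
    """Map each term to the sorted list of its 1-based positions."""
    pos = defaultdict(list)
    for i, term in enumerate(doc_terms, start=1):
        pos[term].append(i)
    return dict(pos)

# Example document from the slide (14 tokens); the last token is assumed to be "j".
D = "a b c d a b d e f g h a i j".split()

pos = positions(D)
tf = {term: len(p) for term, p in pos.items()}  # term frequency = number of positions

print(pos["a"], pos["b"])  # [1, 5, 12] [2, 6]
print(tf["a"], tf["b"])    # 3 2
```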
PROXIMITY MEASURES
min_dist(a,b,D) = 1
The minimum distance between any occurrences of a and b in D.
– Closeness -> relatedness
diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 2
The difference between the average positions of a and b in D.
– Where each term tends to occur
avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
The average distance between a and b over all possible position combinations in D.
– Promotes terms that consistently occur close to one another in a localised area
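The three measures above can be checked against the worked example with a short sketch. The function signatures (taking two position lists) are an assumption for illustration, and `diff_avg_pos` takes the absolute value so that the measure is symmetric in a and b, which the slide does not state explicitly.

```python
def min_dist(p, q):
    """Smallest distance between any occurrence in p and any occurrence in q."""
    return min(abs(i - j) for i in p for j in q)

def diff_avg_pos(p, q):
    """Absolute difference between the average positions (abs() is an assumption)."""
    return abs(sum(p) / len(p) - sum(q) / len(q))

def avg_dist(p, q):
    """Average distance over all len(p) * len(q) position combinations."""
    return sum(abs(i - j) for i in p for j in q) / (len(p) * len(q))

pos_a, pos_b = [1, 5, 12], [2, 6]  # pos(D,a), pos(D,b) from the example

print(min_dist(pos_a, pos_b), diff_avg_pos(pos_a, pos_b), round(avg_dist(pos_a, pos_b), 2))
# prints: 1 2.0 4.33
```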
PROXIMITY MEASURES
avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
– The occurrence of a at position 12 may be completely unrelated to b
match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
The smallest distance achievable when each occurrence of a term is uniquely matched to another occurrence of a term.
max_dist(a,b,D) = (12-2) = 10
The maximum distance between any two occurrences of a and b.
– Useful normalization factor
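A hedged sketch of these three measures follows. The brute-force matching in `match_dist` and its normalisation by the number of matched pairs are assumptions inferred from the worked example, not a confirmed reading of the paper. `max_dist` is taken literally as the largest |i - j| over all occurrence pairs, which gives 12 - 2 = 10 on the example document.

```python
from itertools import permutations

def _order(p, q):
    """Return (fewer, more): the shorter position list first."""
    return (p, q) if len(p) <= len(q) else (q, p)

def avg_min_dist(p, q):
    """Average, over the less frequent term, of the distance to the nearest other term."""
    fewer, more = _order(p, q)
    return sum(min(abs(i - j) for j in more) for i in fewer) / len(fewer)

def match_dist(p, q):
    """Best one-to-one matching of occurrences, averaged over matched pairs.

    Exhaustive search over permutations: fine for small term frequencies only.
    """
    fewer, more = _order(p, q)
    best = min(
        sum(abs(i - j) for i, j in zip(fewer, chosen))
        for chosen in permutations(more, len(fewer))
    )
    return best / len(fewer)

def max_dist(p, q):
    """Largest distance between any occurrence in p and any occurrence in q."""
    return max(abs(i - j) for i in p for j in q)

pos_a, pos_b = [1, 5, 12], [2, 6]  # pos(D,a), pos(D,b) from the example
print(avg_min_dist(pos_a, pos_b), match_dist(pos_a, pos_b), max_dist(pos_a, pos_b))
# prints: 1.0 1.0 10
```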
PROXIMITY MEASURES
sum(tf(a),tf(b)) = 3+2 = 5
The sum of the term frequencies of a and b in D.
– An implicit indication of the proximity of both terms
prod(tf(a),tf(b)) = 3*2 = 6
The product of the term frequencies of a and b in D.
– An implicit indication of the proximity of both terms
fullcover(Q,D) = 12
The length of the document span that covers all occurrences of the query-terms.
– A query-specific measure
min-cover(Q,D) = 2
The length of the shortest document span that covers all query-terms at least once.
– min_dist + 1 for a two-term query
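The two cover measures can be sketched as follows. The function names `full_cover` and `min_cover` and the choice to consider only query-terms that actually occur in the document are assumptions for illustration; the final document token is assumed to be "j".

```python
def full_cover(doc, query):
    """Length of the span from the first to the last query-term occurrence."""
    hits = [i for i, t in enumerate(doc, start=1) if t in query]
    return hits[-1] - hits[0] + 1 if hits else 0

def min_cover(doc, query):
    """Length of the shortest span containing each matched query-term at least once."""
    present = set(query) & set(doc)
    if not present:
        return 0
    best = len(doc)
    last_seen = {}  # most recent position of each matched query-term
    for i, t in enumerate(doc, start=1):
        if t in present:
            last_seen[t] = i
            if len(last_seen) == len(present):
                # Shortest window ending at i starts at the oldest "last seen" position.
                best = min(best, i - min(last_seen.values()) + 1)
    return best

D = "a b c d a b d e f g h a i j".split()
Q = ["a", "b"]
print(full_cover(D, Q), min_cover(D, Q))  # prints: 12 2
```

For a two-term query, `min_cover` equals min_dist + 1, matching the note on the slide.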
PROXIMITY MEASURES
dl(D) = 14
The length of the document.
– A useful factor for normalization in IR
qt(Q,D) = 2
The number of unique terms that match both document and query.
PROXIMITY MEASURES
Correlations of measures
– FBIS, FT, FR collections from TREC disks 4 and 5
– OHSUMED collection
– Re-ranking the top-N (N = 1000) documents from an initial ranked list using a proximity function
PROXIMITY MEASURES
Inverse correlations
Exceptions: * qt: correlated with relevance
PROXIMITY RETRIEVAL MODEL
Extending a vector model
Documents and queries as matrices
– Ex) 3-term query
– w(): a standard term-weighting scheme
– p(): a proximity function
No theoretical basis
– An intuitive extension of a vector based approach
– Genetic Programming (GP) technique
  Combining some or all of the 12 proximity measures
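The matrix for the 3-term example did not survive in this transcript, so the following is only a sketch under stated assumptions: the document matrix is taken to hold w(t_i) on its diagonal and p(t_i, t_j) above the diagonal, and the score is taken to be the sum of all those entries. The function `score` and the toy `w` and `p` are hypothetical and are not the authors' formulas.

```python
from itertools import combinations

def score(query_terms, w, p):
    """Sum of term weights (diagonal) plus pairwise proximity values (upper triangle)."""
    terms = sorted(set(query_terms))
    weight_part = sum(w(t) for t in terms)
    proximity_part = sum(p(a, b) for a, b in combinations(terms, 2))
    return weight_part + proximity_part

# Toy stand-ins: a constant term weight and a constant proximity value.
toy_w = lambda term: 1.0
toy_p = lambda a, b: 0.5

# A 3-term query has 3 diagonal entries and 3 term pairs.
print(score(["t1", "t2", "t3"], toy_w, toy_p))  # prints: 4.5
```

With n query-terms there are n(n-1)/2 pairs, so the proximity part grows quadratically with query length.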
EXPERIMENTAL SETUP
Term weighting scheme
– BM25 scheme
– Previous work
Proximity function
– The benchmark proximity functions: BM25 + t(), ES + t()
EXPERIMENTAL SETUP
GP process
– A heuristic stochastic search algorithm
Training: Financial Times
– 69500 documents
– Queries: 25 title only, 30 title + descriptions
– Fitness function: MAP
GP
– Ranking documents using the weighting scheme for top 3000 documents
– 6 runs of GP
  Initial population of 2000 for 30 generations
  Elitist strategy
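The fitness function used during training is MAP (mean average precision). A minimal sketch of the standard definition is given below; the function names and the toy document identifiers are hypothetical, and this is not the authors' evaluation code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries with made-up document ids.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),              # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # prints: 0.6667
```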
EXPERIMENTAL RESULTS
Wilcoxon signed-rank test
EXPERIMENTAL RESULTS
Wilcoxon signed-rank test
CONCLUSION
We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
We have indicated the potential correlation between each of the individual measures and relevance.
– min_dist is highly correlated with relevance.
We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
We adopt a population-based learning technique (GP) which learns useful proximity functions.
– An evaluation of three proximity functions
It is possible to use combinations of proximity measures to improve the performance of IR systems for both short and long queries.