SIGIR’2005
Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia
[email protected]
Image from: http://www.awesomelibrary.org/images/solar-system-nasa.jpg
Background
Document:
Query:
A core problem in Information Retrieval (IR):
Determine the relevance of a document to a query
Relevant?
How relevant?
IR Models & Perspectives
IR models define the representation of documents, queries, and the relevance relationship between them.
The key behind every IR model is its primary perspective on information retrieval.

Model                  Perspective
Boolean model          Set theory and Boolean algebra
Vector space model     Vector and linear algebra
Probabilistic model    Probabilistic
Language model         Probabilistic
…                      …
Background
Hard questions
What is the essence of information retrieval?
What is the right perspective on it?
So far, we have learned more about IR each time a new perspective has been adopted.
It would therefore be helpful to view IR problems from further new perspectives.
We try to view IR from the perspective of physics.
Background
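Newton's Law of Universal Gravitation (1687 AD):
$$F = G \frac{m_1 m_2}{d^2}$$
(Figure: two masses m1 and m2 separated by distance d. From: http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html)
Background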
From http://www.enterprisemission.com/hyper2a.php
Background
We live in a physical world governed by fundamental physical laws.
Can we get help from “the God” in acquiring a deeper understanding of information retrieval?
We simply start from Newton’s Universal Law of Gravitation…
Background
Preliminary Achievements
It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary achievements:
We build a new IR model, GBM, from which many effective ranking functions can be derived.
The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function.
A more reasonable approach to structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust under various conditions.
Note on BM25: it was first discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption. Amati and van Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1=1.2, b=0.75; or k1=2, b=0.75) can be approximated numerically. A complete theoretical derivation of the BM25 formula is still lacking.
Outline
Background
Gravitation-based Model
    Notations & Basic Concepts
    Discrete GBM Model
    Continuous GBM Model
    Model analysis
GBM Model for Structured Document Retrieval
Summary
GBM: Initial Idea
A mapping needs to be built from the concepts of information retrieval to those of physics:
Document          ->  Object
Query             ->  Object
Relevance score   ->  Attractive force
IR concepts & notations:
|D|       Document length
df(t)     Document frequency of t
avdl      Average document length in a collection
N         Total number of documents
c(t,D)    Number of occurrences of t in D (also written as tf(t,D))
Physics concepts: mass, distance, …
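GBM: Notations & Basic Concepts
Particle (= atom): the basic element of any object.
A particle has two attributes: mass and type.
Type: determined by the term object the particle is part of.
For two particles P1 (mass m1, type type(P1)) and P2 (mass m2, type type(P2)) at distance d:
$$F(P_1, P_2) = \begin{cases} G \dfrac{m_1 m_2}{d^2} & \text{if } type(P_1) = type(P_2) \\ 0 & \text{otherwise} \end{cases}$$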
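A minimal sketch of the particle-level rule, assuming it reads: two particles attract each other by Newton's law only when they have the same type, and exert no force otherwise. The function name `particle_force`, the default `G = 1.0`, and the sample values are illustrative assumptions, not part of the slides.

```python
def particle_force(m1, type1, m2, type2, d, G=1.0):
    """Attractive force between two particles under the same-type rule.

    G = 1.0 is an arbitrary choice for illustration; the slides leave G symbolic.
    """
    if d <= 0:
        raise ValueError("distance d must be positive")
    if type1 != type2:
        # Particles of different types (i.e. from different terms) do not attract.
        return 0.0
    return G * m1 * m2 / d ** 2


# Hypothetical sample values.
same_type = particle_force(2.0, "t1", 3.0, "t1", d=2.0)   # 1.0 * 2 * 3 / 4
other_type = particle_force(2.0, "t1", 3.0, "t2", d=2.0)  # types differ
```

The type check is what later restricts the document-query force to matching terms only.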
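GBM: Notations & Basic Concepts
A term object has 4 attributes: type, shape, mass, and diameter.
H(D): Hidden terms in document D
Two natural assumptions:
$$m(D) = \sum_{t \in D \cup H(D)} m(t, D)$$
$$\mathrm{di}(D) = \sum_{t \in D \cup H(D)} \mathrm{di}(t, D)$$
(Figure: document D with diameter di(D), made up of term objects t1, t2, …, tn; term tn is labeled with di(tn,D) and r(tn,D), and term t1 is shown as a group of particles.)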
Notation List
Discrete GBM Model
(Figure: query Q with terms t1, t2, t3 at distance l from document D. Initially the occurrences of t1, t2, t3 and the other terms x are mixed in D. The structure of D is changed under the attraction of query Q: the occurrences of each query term line up next to the matching term of Q, with the other terms x behind them.)
Key Points:
1. Under the attraction of the query terms, the structure of each document is adjusted to an optimized-term-placement state.
2. The relevance between a document and a query is defined as the attractive force between them when the document is in its optimized-term-placement state.
Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized.
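Term Weighting Formula
(Figure: query Q with terms t1, t2, t3 at distance l from document D in its optimized-term-placement state.)
The force between query term t and its i-th nearest occurrence in D:
$$F_{dis}(t, D, i) = \frac{G \, m(t,Q) \, m(t,D)}{\left( l + (i + 1/2) \, \mathrm{di}(t,D) \right)^2}$$
The maximal (optimized) gravitational force between t and D (with l = 1):
$$F_{dis}(t, D) = \sum_{i=0}^{c(t,D)-1} F_{dis}(t, D, i) = \sum_{i=0}^{c(t,D)-1} \frac{G \, m(t,Q) \, m(t,D)}{\left( 1 + (i + 1/2) \, \mathrm{di}(t,D) \right)^2}$$
The attractive force between D and Q:
$$F_{dis}(Q, D) = \sum_{t \in Q} F_{dis}(t, D)$$
Unknown expressions: m(t,Q), m(t,D), and di(t,D)
Need: Mass and diameter estimation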
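A minimal sketch of the discrete term weighting, assuming the i-th nearest occurrence of a term acts as a point mass at distance l + (i + 1/2) * di(t,D) from the query term, and that the per-term forces are added over the query terms. The function names, the defaults `l = 1.0` and `G = 1.0`, and the toy masses are assumptions for illustration only.

```python
def discrete_term_force(tf, m_query, m_term, di_term, l=1.0, G=1.0):
    """Sum the point-mass forces over the tf occurrences of one term."""
    total = 0.0
    for i in range(tf):
        # Centre of the i-th nearest occurrence (i = 0 is the nearest one).
        distance = l + (i + 0.5) * di_term
        total += G * m_query * m_term / distance ** 2
    return total


def discrete_query_force(query_terms, tf_in_doc, m_query, m_term, di_term):
    """Add up the per-term forces over all query terms."""
    return sum(
        discrete_term_force(tf_in_doc.get(t, 0), m_query[t], m_term[t], di_term)
        for t in query_terms
    )


# Toy setting (hypothetical): unit masses, term diameter 0.5.
weights = [discrete_term_force(tf, 1.0, 1.0, 0.5) for tf in range(5)]
```

Each additional occurrence sits farther from the query, so it adds less force than the previous one. That is the model's built-in term-frequency saturation.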
Mass and Diameter Estimation
Define a document-independent mass m(t) for each (type of) term. It denotes the average mass of term t in the whole collection.
(Assumption-1) For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection:
$$\frac{m(t_1, D)}{m(t_2, D)} = \frac{m(t_1)}{m(t_2)}$$
(Assumption-2) All terms in the same document have equal diameters:
$$\mathrm{di}(t, D) = \frac{\mathrm{di}(D)}{|D \cup H(D)|}$$
(Assumption-3)
$$\frac{\sum_{t \in D \cup H(D)} m(t)}{|D \cup H(D)|} = m_0$$
so that
$$m(t, D) = \frac{m(D) \, m(t)}{m_0 \, |D \cup H(D)|}$$
(Assumption-4)
$$|H(D)| = \mu \cdot \mathrm{avdl}$$
Denote $\lambda = 1/(1+\mu)$; then
$$|D \cup H(D)| = \frac{\mathrm{avdl}}{\lambda} \left( (1-\lambda) + \lambda \frac{|D|}{\mathrm{avdl}} \right)$$
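A minimal sketch of the estimation step, assuming the four assumptions read as follows: term masses in a document are proportional to their collection-wide average masses, all terms in a document share one diameter, the average term mass in a document is a constant m0, and the number of hidden terms is proportional to avdl. The symbol `mu` for that proportionality factor, `lam = 1 / (1 + mu)`, and all function names are labels chosen here for illustration.

```python
def effective_length(doc_len, avdl, mu):
    """|D ∪ H(D)|, assuming |H(D)| = mu * avdl and that D and H(D) are disjoint."""
    return doc_len + mu * avdl


def length_norm(doc_len, avdl, lam):
    """(1 - lam) + lam * |D| / avdl, the familiar pivoted length normalizer."""
    return (1.0 - lam) + lam * doc_len / avdl


def term_mass(m_doc, m_term_avg, doc_len, avdl, mu, m0=1.0):
    """m(t,D) = m(D) * m(t) / (m0 * |D ∪ H(D)|)."""
    return m_doc * m_term_avg / (m0 * effective_length(doc_len, avdl, mu))


def term_diameter(di_doc, doc_len, avdl, mu):
    """di(t,D) = di(D) / |D ∪ H(D)|, since all terms have equal diameters."""
    return di_doc / effective_length(doc_len, avdl, mu)


# Hypothetical numbers: a 150-term document in a collection with avdl = 100.
mu = 0.5
lam = 1.0 / (1.0 + mu)
total_terms = effective_length(150, 100, mu)
same_via_lam = (100 / lam) * length_norm(150, 100, lam)
```

The last two lines compute the same quantity in two ways, which shows how the hidden-term assumption turns into a BM25-style length normalization.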
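Ultimate Discrete GBM Formula
The ultimate term-weighting function (c_1 is a constant):
$$F_{dis}(t, D) = c_1 \cdot \frac{m(D)}{\beta(D,\lambda)} \sum_{i=0}^{c(t,D)-1} \frac{1}{\left( 1 + (i + 1/2) \, \frac{\alpha(D)}{\beta(D,\lambda)} \right)^2} \cdot m(t)$$
where
$$\alpha(D) = \frac{\mathrm{di}(D)}{\mathrm{avdl} / \lambda} \quad \text{and} \quad \beta(D, \lambda) = (1-\lambda) + \lambda \frac{|D|}{\mathrm{avdl}}$$
m(t): the average (document-independent) mass of term t in the collection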
The mass of a document is a measure of its quality, which depends on how informative and important it is.
Relationship with PageRank? <Future work>
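Ultimate Discrete GBM Formula
If m(D) = const, di(D) = const, and
$$m(t) = \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right)$$
then a special case of the term-weighting function is:
$$F_{dis}(t, D) = \frac{1}{\beta(D,\lambda)} \sum_{i=0}^{c(t,D)-1} \frac{1}{\left( 1 + (i + 1/2) \, \frac{\alpha}{\beta(D,\lambda)} \right)^2} \cdot \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right)$$
where
$$\beta(D, \lambda) = (1-\lambda) + \lambda \frac{|D|}{\mathrm{avdl}}$$
Two parameters: α, λ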
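Continuous GBM Model
Term shape: Ideal cylinder
(Figure: document D with diameter di(D), made up of term objects t1, t2, …, tn; term t1 is drawn as a cylinder of length di(t1,D) filled with particles.)
(Figure: query Q with terms t1, t2, t3 at distance l from document D, whose term occurrences are lined up next to the matching query terms, followed by the other terms x.)
Document D is now in its optimized-term-placement state.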
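Term Weighting Formula
(Figure: query Q with terms t1, t2, t3 at distance l from document D in its optimized-term-placement state.)
The force between query term t and its i-th nearest occurrence in D:
$$F_{con}(t, D, i) = \int_{l + i \, \mathrm{di}(t,D)}^{l + (i+1) \, \mathrm{di}(t,D)} \frac{G \, m(t,Q)}{x^2} \cdot \frac{m(t,D)}{\mathrm{di}(t,D)} \, dx = \frac{G \, m(t,Q) \, m(t,D)}{\mathrm{di}(t,D)} \left( \frac{1}{l + i \, \mathrm{di}(t,D)} - \frac{1}{l + (i+1) \, \mathrm{di}(t,D)} \right)$$
The maximal (optimized) gravitational force between t and D (with l = 1):
$$F_{con}(t, D) = \sum_{i=0}^{c(t,D)-1} F_{con}(t, D, i) = \frac{G \, m(t,Q) \, m(t,D) \, c(t,D)}{1 + c(t,D) \, \mathrm{di}(t,D)}$$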
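A minimal numeric check of the continuous model, assuming each occurrence is a cylinder of length di(t,D) whose mass m(t,D) is spread evenly between l + i * di(t,D) and l + (i + 1) * di(t,D). Under that assumption the per-occurrence forces telescope, and for l = 1 the sum collapses to G * m(t,Q) * m(t,D) * c(t,D) / (1 + c(t,D) * di(t,D)). The function names and default constants are illustrative assumptions.

```python
def continuous_force_sum(tf, m_query, m_term, di_term, l=1.0, G=1.0):
    """Add the integrated force of each of the tf occurrences explicitly."""
    total = 0.0
    for i in range(tf):
        near = l + i * di_term        # near end of the i-th cylinder
        far = l + (i + 1) * di_term   # far end of the i-th cylinder
        # Integral of G * m_query * (m_term / di_term) / x**2 from near to far.
        total += G * m_query * m_term / di_term * (1.0 / near - 1.0 / far)
    return total


def continuous_force_closed(tf, m_query, m_term, di_term, G=1.0):
    """Closed form of the telescoping sum, valid for l = 1."""
    return G * m_query * m_term * tf / (1.0 + tf * di_term)


# Hypothetical values: unit masses, term diameter 0.5, three occurrences.
by_sum = continuous_force_sum(3, 1.0, 1.0, 0.5)
by_formula = continuous_force_closed(3, 1.0, 1.0, 0.5)
```

The closed form has the shape tf / (1/di + tf) up to a constant, which is where the resemblance to BM25's term-frequency component comes from.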
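Ultimate Continuous GBM Formula
By doing mass and diameter estimation, we have the ultimate term-weighting function (c_1 is a constant):
$$F_{con}(t, D) = c_1 \cdot \frac{m(D)}{\alpha(D)} \cdot \frac{c(t,D)}{\frac{1}{\alpha(D)} \, \beta(D,\lambda) + c(t,D)} \cdot m(t)$$
where
$$\alpha(D) = \frac{\mathrm{di}(D)}{\mathrm{avdl} / \lambda} \quad \text{and} \quad \beta(D, \lambda) = (1-\lambda) + \lambda \frac{|D|}{\mathrm{avdl}}$$
If m(D) = const, di(D) = const, and
$$m(t) = \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right)$$
then a special case of the above term-weighting function is:
$$F_{con}(t, D) = \frac{c(t,D)}{\frac{1}{\alpha} \, \beta(D,\lambda) + c(t,D)} \cdot \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right)$$
(Two parameters: α, λ)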
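Continuous GBM Formula vs. BM25
A special case of the continuous GBM term-weighting function:
$$F_{con}(t, D) = \frac{c(t,D)}{\frac{1}{\alpha} \, \beta(D,\lambda) + c(t,D)} \cdot \ln\left(1 + \frac{N}{\mathrm{df}(t)}\right)$$
where
$$\beta(D, \lambda) = (1-\lambda) + \lambda \frac{|D|}{\mathrm{avdl}}$$
BM25 term-weighting function:
$$w_{bm25}(t, D) = \frac{(k_1 + 1) \, c(t,D)}{k_1 \, \beta(D, b) + c(t,D)} \cdot w(t)$$
With k_1 = 1/α and b = λ, the two functions differ only by the constant factor (k_1 + 1) and the choice of the term weight w(t).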
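A minimal sketch comparing the two term-weighting functions, assuming the continuous GBM special case is c(t,D) / (β(D,λ)/α + c(t,D)) * ln(1 + N/df(t)) and BM25 is (k1 + 1) * c(t,D) / (k1 * β(D,b) + c(t,D)) * w(t), with β(D,x) = (1 - x) + x * |D| / avdl. Using the same ln(1 + N/df(t)) as BM25's w(t) is an assumption made here so the two can be compared directly. All function names and sample statistics are hypothetical.

```python
import math


def length_norm(doc_len, avdl, x):
    """beta(D, x) = (1 - x) + x * |D| / avdl."""
    return (1.0 - x) + x * doc_len / avdl


def idf_mass(n_docs, df):
    """m(t) = ln(1 + N / df(t))."""
    return math.log(1.0 + n_docs / df)


def gbm_continuous(tf, doc_len, avdl, n_docs, df, alpha, lam):
    """Special case of the continuous GBM term weight."""
    return tf / (length_norm(doc_len, avdl, lam) / alpha + tf) * idf_mass(n_docs, df)


def bm25(tf, doc_len, avdl, n_docs, df, k1=1.2, b=0.75):
    """BM25 term weight, with w(t) taken to be idf_mass (an assumption)."""
    return (k1 + 1.0) * tf / (k1 * length_norm(doc_len, avdl, b) + tf) * idf_mass(n_docs, df)


# Hypothetical statistics: tf = 3, |D| = 120, avdl = 100, N = 10000, df = 50.
k1, b = 1.2, 0.75
gbm_weight = gbm_continuous(3, 120, 100, 10000, 50, alpha=1.0 / k1, lam=b)
bm25_weight = bm25(3, 120, 100, 10000, 50, k1=k1, b=b)
ratio = bm25_weight / gbm_weight
```

Under these assumptions the ratio is the constant k1 + 1 for every term and document, so both functions produce the same ranking.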
Other Ranking Formulas Derived
Ranking formulas (highly simplified version) derived from the continuous GBM model with various gravitational-field-functions
Check with Heuristic Constraints
[Fang et al, SIGIR’04]: Some heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy:
TFC1, TFC2, TDC, M-TDC, LNC1, LNC2, TF-LNC
All our derived term-weighting functions satisfy all of the above constraints.
Experimental Setup
Preliminary Experiments
Corpora characteristics
Query-sets used in the experiments
Preliminary Experiments
Optimal performance comparison among some formulas over various corpora and tasks
(measure: mean average precision)
Experimental Results
Structured Document Retrieval
A document is said to be structured here when it contains multiple fields.
Current approaches for structured document retrieval:
Score combination
    The most commonly used and well-studied approach
    Rank combination is a special case of score combination
Term-frequency combination
    [Robertson et al, CIKM’04]: An extension of BM25
    [Ogilvie et al, SIGIR’03]: Linearly combining language models
Each approach works moderately well, but…
Score Combination Issues
For a multi-term query, a document matching a single query term over many fields could get an unreasonably higher score than another document which matches all the query terms in a few fields. (See discussions in [Robertson et al, CIKM’04])

Field    d1: tf(t1)  tf(t2)    d2: tf(t1)  tf(t2)
F1           2         0           2         2
F2           2         0           2         2
F3           2         0           0         0
...
F8           2         0           0         0

score(d1) = s + s + s + … + s = 8s
score(d2) = 2s + 2s + 0 + … + 0 = 4s
score(d1) > score(d2): Unreasonable
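A minimal sketch of the score-combination example above, assuming each field is scored independently and the field scores are simply added with equal weights. The saturating `term_score` is a stand-in chosen here for illustration; the slide only uses an abstract per-match score s.

```python
def term_score(tf):
    """Toy saturating score of one term in one field (an assumption)."""
    return tf / (tf + 1.0)


def score_combination(doc):
    """Score every field on its own, then add the field scores."""
    return sum(
        sum(term_score(tf) for tf in term_tfs.values())
        for term_tfs in doc.values()
    )


fields = [f"F{i}" for i in range(1, 9)]
# d1 matches only t1, in all eight fields.
d1 = {f: {"t1": 2, "t2": 0} for f in fields}
# d2 matches both query terms, but only in F1 and F2.
d2 = {
    f: {"t1": 2, "t2": 2} if f in ("F1", "F2") else {"t1": 0, "t2": 0}
    for f in fields
}

s = term_score(2)
score_d1 = score_combination(d1)  # eight single-term matches
score_d2 = score_combination(d2)  # four matches covering both terms
```

Because saturation is applied inside each field, repeating one term across many fields keeps adding full-size contributions, so d1 outranks d2 even though it never matches t2.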
TF Combination Issues
Consider a single-term query Q = t, and some documents with two fields (F1, F2).
Assuming: w1 = weight(F1) = 5; w2 = weight(F2) = 1

Example-1 (assuming |d1| = |d2|)
      tf(t,F1)  tf(t,F2)  Combined tf
d1       1         0      tf(t,d1) = w1 * 1 + w2 * 0 = 5
d2       0         6      tf(t,d2) = w1 * 0 + w2 * 6 = 6
score(d1) < score(d2): Reasonable

Example-2 (assuming |d3| = |d4|)
      tf(t,F1)  tf(t,F2)  Combined tf
d3       1         8      tf(t,d3) = w1 * 1 + w2 * 8 = 13
d4       0        14      tf(t,d4) = w1 * 0 + w2 * 14 = 14
score(d3) < score(d4): Unreasonable

Larger w1? It cannot remove this issue, and it risks making the case of Example-1 unreasonable.
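A minimal sketch of the term-frequency combination in the two examples, assuming the combined tf is the field-weighted sum w1 * tf(t,F1) + w2 * tf(t,F2) and that, for documents of equal length, a larger combined tf means a higher score. Note that w1 * 0 + w2 * 14 evaluates to 14 with w2 = 1. The function name and the weight sweep are illustrative assumptions.

```python
def combined_tf(field_tf, weights):
    """Field-weighted term frequency of a single term."""
    return sum(weights[field] * tf for field, tf in field_tf.items())


weights = {"F1": 5, "F2": 1}

d1 = {"F1": 1, "F2": 0}
d2 = {"F1": 0, "F2": 6}
d3 = {"F1": 1, "F2": 8}
d4 = {"F1": 0, "F2": 14}

tfs = [combined_tf(d, weights) for d in (d1, d2, d3, d4)]

# Sweep w1 to see whether a larger field weight can separate the two examples.
orderings = []
for w1 in range(1, 20):
    w = {"F1": w1, "F2": 1}
    example1_d2_wins = combined_tf(d1, w) < combined_tf(d2, w)
    example2_d4_wins = combined_tf(d3, w) < combined_tf(d4, w)
    orderings.append((w1, example1_d2_wins, example2_d4_wins))
```

In this toy setting both comparisons reduce to w1 versus 6, so any w1 large enough to put d3 ahead of d4 also puts d1 ahead of d2. This is the risk the slide points out.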
Structured Document Retrieval by GBM
A document D with 2 fields (field 1 and field 2). Term occurrences are labeled such as t1,0, t2,1, t3,2 in field 1 and t3,0, x,1 in field 2.
(Figure: the optimized term placement of document D given query Q. The query terms t1, t2, t3 are at distance l from D, and the occurrences of t1, t2, and t3 from both fields are lined up next to the matching query term.)
Experimental Results
(Figure: bar chart of average precision, y-axis from 0.000 to 0.400, for the 2003.td and 2004.mixed tasks; series: Baseline, FreqComb, ScoreComb, GBM. Bar labels in the chart: 0.277, 0.121, 0.150, 0.103, 0.32, 0.33, 0.37, 0.152.)
Performance comparison of different approaches for the combination of body and title fields
Summary
Viewing IR from a different viewpoint is as important as going deeper along traditional perspectives.
This paper may be a first step toward taking a physics viewpoint.
It is encouraging that we can really benefit from nature:
A family of effective ranking functions is derived
BM25 is given a physical interpretation
A more reasonable approach for structured document retrieval is obtained
Sorry, Sir Isaac Newton. Hope I am not abusing your laws.
The End
Gravitation-Based Model for Information Retrieval
Please send your comments to: [email protected]