Page 1: Gravitation-Based Model for Information Retrieval

SIGIR’2005

Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma

Microsoft Research Asia

shumings@microsoft.com

Image from: http://www.awesomelibrary.org/images/solar-system-nasa.jpg

Page 2: Background

A core problem in Information Retrieval (IR): determine the relevance of a document to a query.

Given a document and a query: Relevant? How relevant?

Page 3: Background: IR Models & Perspectives

IR models define the representation of documents and queries, and the relevance relationship between them.

Behind every IR model is a primary perspective on information retrieval.

Model: Perspective

Boolean model: Set theory and Boolean algebra

Vector space model: Vector and linear algebra

Probabilistic model: Probabilistic

Language model

…

Page 4: Background

Hard questions: What is the essence of information retrieval? What is the right perspective on it?

So far, we have learned more about IR each time a new perspective was adopted.

It should therefore also help to view IR problems from further new perspectives.

We try to view IR from the perspective of physics.

Page 5: Background

Newton's law of universal gravitation (1687 AD):

$$F = \frac{G\, m_1 m_2}{d^2}$$

[Figure: two masses m1 and m2 separated by distance d. From: http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html]

Page 6: Background

[Image from http://www.enterprisemission.com/hyper2a.php]

Page 7: Background

We live in a physical world that is governed by fundamental physical laws.

Can we get help from “the God” in acquiring a deeper understanding of information retrieval?

Simply start from Newton’s Universal Law of Gravitation…

Page 8: Preliminary Achievements

It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary achievements:

We build a new IR model, GBM, from which many effective ranking functions can be derived.

The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function.

Note on BM25: it was first discovered by Robertson et al., inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption. Amati and van Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1=1.2, b=0.75; or k1=2, b=0.75) can be approximated numerically. A complete theoretical derivation of the BM25 formula has been lacking.

A more reasonable approach to structured document retrieval can be obtained directly from the model. This approach is not only highly effective but also robust across various conditions.

Page 9: Outline

Background
Gravitation-based Model
    Notations & Basic Concepts
    Discrete GBM Model
    Continuous GBM Model
    Model analysis
GBM Model for Structured Document Retrieval
Summary

Page 10: GBM: Initial Idea

A mapping needs to be built from the concepts of information retrieval to those of physics:

Document → Object
Query → Object
Relevance score → Attractive force

IR concepts & notations:

|D|: Document length
df(t): Document frequency of t
avdl: Average document length in a collection
N: Total number of documents
c(t,D): Number of occurrences of t in D (also written as tf(t,D))

Physics concepts: mass, distance, …

Page 11: GBM: Notations & Basic Concepts

Particle (= atom): the basic element of any object. A particle has two attributes: mass and type.

Type: determined by the term object the particle is part of.

For two particles P1 (mass m1, type type(P1)) and P2 (mass m2, type type(P2)) at distance d:

$$F(P_1, P_2) = \begin{cases} \dfrac{G\, m_1 m_2}{d^2} & \text{if } type(P_1) = type(P_2) \\ 0 & \text{otherwise} \end{cases}$$
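
The typed particle force on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `particle_force`, the string type labels, and the choice G = 1 are assumptions made here, not something the slides specify.

```python
def particle_force(m1, type1, m2, type2, d, g=1.0):
    """Attractive force between two particles at distance d.

    Particles of the same type attract each other following the
    inverse-square law; particles of different types do not interact.
    g stands for the constant G (1.0 is an arbitrary choice made here).
    """
    if d <= 0:
        raise ValueError("distance d must be positive")
    if type1 != type2:
        return 0.0
    return g * m1 * m2 / d ** 2


# Two particles of term type "t1" attract; "t1" and "t2" do not.
same = particle_force(2.0, "t1", 3.0, "t1", d=2.0)       # 2 * 3 / 2**2 = 1.5
different = particle_force(2.0, "t1", 3.0, "t2", d=2.0)  # 0.0
```

Because the force is zero across types, only occurrences of a query term inside the document contribute to the document's score in the later formulas.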

Page 12: GBM: Notations & Basic Concepts

A term object has 4 attributes: type, shape, mass, and diameter.

H(D): hidden terms in document D.

[Figure: document D (with diameter di(D)) drawn as a sequence of term objects t1, t2, …, tn. Labels: di(tn,D), r(tn,D), "Term t1", "A particle in term t1".]

Two natural assumptions:

$$m(D) = \sum_{t \in D \cup H(D)} m(t,D)$$

$$di(D) = \sum_{t \in D \cup H(D)} di(t,D)$$

Page 13: Notation List

Page 15: Discrete GBM Model

[Figure: document D as a term sequence (t2 t1 x t3 x t1 t2 t2 t1 x) and query Q with terms t1, t2, t3 at distance l. The structure of D is changed under the attraction of query Q: occurrences of each query term line up opposite that term (t1 t1 t1 t1 x x; t2 t2 x x; t3 t3 t3 x x).]

Key Points:

1. Under the attraction of the query terms, the structure of each document is adjusted to an optimized-term-placement state.

2. The relevance of a document to a query is defined as the attractive force between them when the document is in its optimized-term-placement state.

Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized.

Page 16: Term Weighting Formula

The force between query term t and its i-th nearest occurrence in D:

$$F_{dis}(t,D,i) = \frac{G\, m(t,Q)\, m(t,D)}{\left(l + (i + 1/2)\, di(t,D)\right)^2}$$

The maximal (optimized) gravitational force between t and D (with l = 1):

$$F_{dis}(t,D) = \sum_{i=0}^{c(t,D)-1} F_{dis}(t,D,i) = \sum_{i=0}^{c(t,D)-1} \frac{G\, m(t,Q)\, m(t,D)}{\left(1 + (i + 1/2)\, di(t,D)\right)^2}$$

The attractive force between D and Q:

$$F_{dis}(Q,D) = \sum_{t \in Q} F_{dis}(t,D)$$

Unknown expressions: m(t,Q), m(t,D), and di(t,D)

Need: Mass and diameter estimation

[Figure: query Q with terms t1, t2, t3 at distance l from document D in its optimized-term-placement state (t1 t1 t1 t1 x x; t2 t2 x x; t3 t3 t3 x x).]
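
As a rough sketch of how the discrete term-weighting formula could be evaluated, the following Python sums the point-mass forces over the occurrences of each query term. The function names, the dictionary-based inputs, and the default values G = 1 and l = 1 are assumptions made for illustration; the masses and diameters are simply passed in, since their estimation is treated separately.

```python
def f_dis_term(m_tq, m_td, di_td, count, l=1.0, g=1.0):
    """Force between a query term and its `count` occurrences in D.

    The i-th nearest occurrence (i = 0, 1, ...) sits at distance
    l + (i + 1/2) * di(t,D) from the query term.
    """
    return sum(
        g * m_tq * m_td / (l + (i + 0.5) * di_td) ** 2
        for i in range(count)
    )


def f_dis(query, doc_counts, mass_q, mass_d, diam_d, l=1.0, g=1.0):
    """Attractive force between D and Q: sum over the query terms."""
    return sum(
        f_dis_term(
            mass_q[t],
            mass_d.get(t, 0.0),
            diam_d.get(t, 1.0),
            doc_counts.get(t, 0),  # terms absent from D contribute nothing
            l,
            g,
        )
        for t in query
    )


# Hypothetical toy input: t1 occurs twice in D, t2 does not occur.
score = f_dis(
    query=["t1", "t2"],
    doc_counts={"t1": 2},
    mass_q={"t1": 1.0, "t2": 1.0},
    mass_d={"t1": 1.0},
    diam_d={"t1": 1.0},
)
```

Each additional occurrence sits farther from the query term, so it adds less force than the previous one; this is where the diminishing return of term frequency comes from in the discrete model.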

Page 17: Mass and Diameter Estimation

Define a document-independent mass m(t) for each (type of) term. It denotes the average mass of term t in the whole collection.

The estimation uses four assumptions (labeled Assumption-1 to Assumption-4 on the slide):

For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection:

$$\frac{m(t_1,D)}{m(t_2,D)} = \frac{m(t_1)}{m(t_2)}$$

The average masses of the terms of a document, hidden terms included, sum to a constant m_0 times the number of terms:

$$\sum_{t \in D \cup H(D)} m(t) = m_0\, |D \cup H(D)|$$

All terms in the same document have equal diameters.

The number of hidden terms is proportional to the average document length:

$$|H(D)| = \alpha \cdot avdl$$

It follows that

$$m(t,D) = \frac{m(D)\, m(t)}{m_0\, |D \cup H(D)|}, \qquad di(t,D) = \frac{di(D)}{|D \cup H(D)|}$$

and, denoting $\beta = 1/(1+\alpha)$,

$$|D \cup H(D)| = \frac{avdl}{\beta} \left( (1-\beta) + \beta\, \frac{|D|}{avdl} \right)$$

(The Greek symbol names are not legible in this transcript; α and β are used here.)
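
The estimates on this slide can be checked numerically with a small Python sketch. It rests on assumptions that the transcript does not state explicitly: that the hidden terms are counted in addition to the visible ones, so |D ∪ H(D)| = |D| + |H(D)|; that |H(D)| = α · avdl; and that β = 1/(1+α). The symbol names α, β and all function names are chosen here for illustration.

```python
def xi(doc_len, avdl, beta):
    """Length normalizer (1 - beta) + beta * |D| / avdl."""
    return (1.0 - beta) + beta * doc_len / avdl


def total_terms(doc_len, avdl, alpha):
    """|D u H(D)|, assuming |D| visible terms plus alpha * avdl hidden terms."""
    return doc_len + alpha * avdl


def term_mass(m_doc, m_term, m0, n_terms):
    """m(t,D) = m(D) * m(t) / (m0 * |D u H(D)|)."""
    return m_doc * m_term / (m0 * n_terms)


def term_diameter(di_doc, n_terms):
    """di(t,D) = di(D) / |D u H(D)| (all terms share the same diameter)."""
    return di_doc / n_terms


# Hypothetical numbers: alpha = 1, avdl = 100, |D| = 50.
alpha, avdl, doc_len = 1.0, 100.0, 50.0
beta = 1.0 / (1.0 + alpha)
n_terms = total_terms(doc_len, avdl, alpha)           # 50 + 100 = 150
closed_form = avdl / beta * xi(doc_len, avdl, beta)   # 200 * 0.75 = 150
```

The two ways of computing |D ∪ H(D)| agree, which is what lets the document length enter the final formulas only through the normalizer `xi`.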

Page 18: Ultimate Discrete GBM Formula

The ultimate term-weighting function:

$$F_{dis}(t,D) = c_1 \cdot \frac{m(D)}{\xi(D,\beta)} \cdot \sum_{i=0}^{c(t,D)-1} \frac{1}{\left(1 + (i + \frac{1}{2})\, \dfrac{\lambda(D)}{\xi(D,\beta)}\right)^2} \cdot m(t)$$

where

$$\lambda(D) = \frac{di(D)}{avdl/\beta} \quad \text{and} \quad \xi(D,\beta) = (1-\beta) + \beta\, \frac{|D|}{avdl}$$

Here c_1 collects the factors that do not depend on D. (The Greek symbol names are not legible in this transcript; λ, β, and ξ are used here.)

m(t): the average (document-independent) mass of term t in the collection.

m(D): the mass of a document is a measure of its quality, which depends on how informative and important it is. Relationship with PageRank? <Future work>

Page 19: Ultimate Discrete GBM Formula

If m(D) = const, di(D) = const, and

$$m(t) = \ln\left(1 + \frac{N}{df(t)}\right)$$

then a special case of the term-weighting function is:

$$F_{dis}(t,D) = \frac{1}{\xi(D,\beta)} \sum_{i=0}^{c(t,D)-1} \frac{1}{\left(1 + (i + \frac{1}{2})\, \dfrac{\lambda}{\xi(D,\beta)}\right)^2} \cdot \ln\left(1 + \frac{N}{df(t)}\right)$$

where

$$\xi(D,\beta) = (1-\beta) + \beta\, \frac{|D|}{avdl}$$

Two parameters: λ, β
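
A minimal Python sketch of this special case follows, assuming the reading F_dis(t,D) = (1/ξ) · Σ_i 1/(1 + (i + 1/2)·λ/ξ)² · ln(1 + N/df(t)) with ξ = (1 − β) + β·|D|/avdl. The parameter names `lam` and `beta`, their values, and the collection statistics are hypothetical; the slides give no recommended settings.

```python
import math


def xi(doc_len, avdl, beta):
    """Length normalizer (1 - beta) + beta * |D| / avdl."""
    return (1.0 - beta) + beta * doc_len / avdl


def f_dis_special(tf, doc_len, avdl, n_docs, df, lam, beta):
    """Special-case discrete GBM weight of a term occurring tf times in D."""
    x = xi(doc_len, avdl, beta)
    # One inverse-square contribution per occurrence, nearest first.
    series = sum(1.0 / (1.0 + (i + 0.5) * lam / x) ** 2 for i in range(tf))
    return series / x * math.log(1.0 + n_docs / df)


# Hypothetical collection: 10,000 documents, term appears in 50 of them,
# document of 120 tokens against an average length of 100.
scores = [
    f_dis_special(tf, 120, 100, 10_000, 50, lam=0.5, beta=0.75)
    for tf in range(5)
]
```

The list `scores` starts at zero for tf = 0 and grows with tf, with each extra occurrence adding less than the one before.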

Page 21: Continuous GBM Model

Term shape: Ideal cylinder

[Figure: document D (with diameter di(D)) as a row of cylindrical term objects t1, t2, …, tn. Labels: di(t1,D), "Term t1", "Term t2", "A particle".]

[Figure: query Q with terms t1, t2, t3 at distance l from document D (t1 t1 t1 t1 x x; t2 t2 x x; t3 t3 t3 x x). Document D is now in its optimized-term-placement state.]

Page 22: Term Weighting Formula

The force between query term t and its i-th nearest occurrence in D:

$$F_{con}(t,D,i) = \int_{l + i\, di(t,D)}^{l + (i+1)\, di(t,D)} \frac{G\, m(t,Q)\, m(t,D)}{di(t,D)\, x^2}\, dx = \frac{G\, m(t,Q)}{di(t,D)} \left( \frac{m(t,D)}{l + i\, di(t,D)} - \frac{m(t,D)}{l + (i+1)\, di(t,D)} \right)$$

The maximal (optimized) gravitational force between t and D (with l = 1):

$$F_{con}(t,D) = \sum_{i=0}^{c(t,D)-1} F_{con}(t,D,i) = \frac{G\, m(t,Q)\, m(t,D)\, c(t,D)}{1 + di(t,D)\, c(t,D)}$$

[Figure: query Q with terms t1, t2, t3 at distance l from document D (t1 t1 t1 t1 x x; t2 t2 x x; t3 t3 t3 x x).]
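
The step from the per-occurrence force to the summed force is a telescoping sum: consecutive terms 1/(l + i·di) − 1/(l + (i+1)·di) cancel, leaving G·m(t,Q)·m(t,D)·c / (l·(l + c·di)). The Python sketch below checks this numerically; the function names and the sample values are assumptions made for illustration.

```python
def f_con_piece(m_tq, m_td, di_td, i, l=1.0, g=1.0):
    """Force on the i-th nearest occurrence, from the evaluated integral."""
    near = l + i * di_td
    far = l + (i + 1) * di_td
    return g * m_tq * m_td / di_td * (1.0 / near - 1.0 / far)


def f_con_term(m_tq, m_td, di_td, count, l=1.0, g=1.0):
    """Closed form of the sum over all `count` occurrences."""
    return g * m_tq * m_td * count / (l * (l + count * di_td))


# Hypothetical values: 4 occurrences, each of diameter 0.3 and mass 2.0.
summed = sum(f_con_piece(1.5, 2.0, 0.3, i) for i in range(4))
closed = f_con_term(1.5, 2.0, 0.3, 4)
```

With l = 1 the closed form reduces to the expression on the slide, G·m(t,Q)·m(t,D)·c(t,D) / (1 + di(t,D)·c(t,D)).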

Page 23: Ultimate Continuous GBM Formula

By doing mass and diameter estimation, we have the ultimate term-weighting function:

$$F_{con}(t,D) = c_1 \cdot \frac{m(D)}{\lambda(D)} \cdot \frac{c(t,D)}{\lambda(D)^{-1}\, \xi(D,\beta) + c(t,D)} \cdot m(t)$$

where

$$\lambda(D) = \frac{di(D)}{avdl/\beta} \quad \text{and} \quad \xi(D,\beta) = (1-\beta) + \beta\, \frac{|D|}{avdl}$$

If m(D) = const, di(D) = const, and

$$m(t) = \ln\left(1 + \frac{N}{df(t)}\right)$$

then a special case of the above term-weighting function is:

$$F_{con}(t,D) = \frac{c(t,D)}{\lambda^{-1}\, \xi(D,\beta) + c(t,D)} \cdot \ln\left(1 + \frac{N}{df(t)}\right)$$

(Two parameters: λ, β)

Page 25: Continuous GBM Formula vs. BM25

A special case of the continuous GBM term-weighting function:

$$F_{con}(t,D) = \frac{c(t,D)}{\lambda^{-1}\, \xi(D,\beta) + c(t,D)} \cdot \ln\left(1 + \frac{N}{df(t)}\right)$$

where

$$\xi(D,\beta) = (1-\beta) + \beta\, \frac{|D|}{avdl}$$

BM25 term-weighting function:

$$w_{bm25}(t,D) = \frac{(k_1 + 1)\, c(t,D)}{k_1\, \xi(D,b) + c(t,D)} \cdot w(t)$$
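
The correspondence between the two formulas can be made concrete with a short Python check. Under the reading used here, setting k1 = 1/λ, b = β, and w(t) = ln(1 + N/df(t)) makes the BM25 weight equal to (k1 + 1) times the continuous GBM weight, a constant factor that does not change the ranking. The function names, the choice of w(t), and the sample statistics are assumptions for illustration.

```python
import math


def xi(doc_len, avdl, b):
    """Length normalizer (1 - b) + b * |D| / avdl."""
    return (1.0 - b) + b * doc_len / avdl


def gbm_con(tf, doc_len, avdl, n_docs, df, lam, beta):
    """Special-case continuous GBM term weight."""
    return tf / (xi(doc_len, avdl, beta) / lam + tf) * math.log(1.0 + n_docs / df)


def bm25(tf, doc_len, avdl, k1, b, w_t):
    """BM25 term weight with a caller-supplied term weight w(t)."""
    return (k1 + 1.0) * tf / (k1 * xi(doc_len, avdl, b) + tf) * w_t


# Hypothetical statistics; k1 = 1.2 and b = 0.75 are the settings
# mentioned earlier in the talk.
k1, b = 1.2, 0.75
tf, doc_len, avdl, n_docs, df = 3, 150, 100, 10_000, 40
w_t = math.log(1.0 + n_docs / df)

gbm_score = gbm_con(tf, doc_len, avdl, n_docs, df, lam=1.0 / k1, beta=b)
bm25_score = bm25(tf, doc_len, avdl, k1, b, w_t)
ratio = bm25_score / gbm_score  # equals k1 + 1 up to rounding
```

Since the ratio is the same for every term and document, the two functions rank documents identically under these settings.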

Page 26: Other Ranking Formulas Derived

Ranking formulas (highly simplified versions) derived from the continuous GBM model with various gravitational field functions

Page 27: Check with Heuristic Constraints

[Fang et al, SIGIR’04]: Some heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy:

TFC1, TFC2
TDC
M-TDC
LNC1, LNC2
TF-LNC

All our derived term-weighting functions satisfy all the above constraints.
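
As a numeric spot check (not a proof), the Python sketch below tests three of these constraints on the special-case continuous GBM weight, under informal readings: TFC1 as "more occurrences of a query term give a higher score", TFC2 as "each extra occurrence adds less than the previous one", and LNC1 as "a longer document with the same term count does not score higher". The formula reading, parameter values, and collection statistics are assumptions made for illustration.

```python
import math


def gbm_con(tf, doc_len, avdl, n_docs, df, lam, beta):
    """Special-case continuous GBM term weight."""
    x = (1.0 - beta) + beta * doc_len / avdl
    return tf / (x / lam + tf) * math.log(1.0 + n_docs / df)


params = dict(avdl=100, n_docs=1000, df=10, lam=0.8, beta=0.75)

# Scores for tf = 1..5 in a document of average length.
scores = [gbm_con(tf, 100, **params) for tf in range(1, 6)]
gains = [b - a for a, b in zip(scores, scores[1:])]

tfc1 = all(g > 0 for g in gains)                            # increasing in tf
tfc2 = all(g2 < g1 for g1, g2 in zip(gains, gains[1:]))     # diminishing gains
lnc1 = gbm_con(3, 200, **params) <= gbm_con(3, 100, **params)  # length penalty
```

All three flags come out true for these settings; the remaining constraints (TDC, M-TDC, LNC2, TF-LNC) involve pairs of terms or scaled documents and are not covered by this sketch.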

Page 28: Preliminary Experiments: Experimental Setup

Corpora characteristics

Query-sets used in the experiments

Page 29: Preliminary Experiments: Experimental Results

Optimal performance comparison among some formulas over various corpora and tasks (measure: mean average precision)

Page 31: Structured Document Retrieval

A document is said to be structured here when it contains multiple fields.

Current approaches for structured document retrieval:

Score combination
    The most commonly used and well-studied approach
    Rank combination is a special case of score combination

Term-frequency combination
    [Robertson et al, CIKM’04]: An extension of BM25
    [Ogilvie et al, SIGIR’03]: Linearly combining language models

Each approach works moderately well, but…

Page 32: Score Combination Issues

For a multi-term query, a document matching a single query term over many fields could get an unreasonably higher score than another document that matches all the query terms in a few fields. (See the discussion in [Robertson et al, CIKM’04].)

Field   d1                          d2
F1      tf(t1,F1)=2, tf(t2,F1)=0    tf(t1,F1)=2, tf(t2,F1)=2
F2      tf(t1,F2)=2, tf(t2,F2)=0    tf(t1,F2)=2, tf(t2,F2)=2
F3      tf(t1,F3)=2, tf(t2,F3)=0    tf(t1,F3)=0, tf(t2,F3)=0
...     ...                         ...
F8      tf(t1,F8)=2, tf(t2,F8)=0    tf(t1,F8)=0, tf(t2,F8)=0

score(d1) = s + s + s + … + s = 8s
score(d2) = 2s + 2s + 0 + … + 0 = 4s

score(d1) > score(d2): unreasonable
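
The example on this slide can be reproduced with a toy Python model of score combination. The per-field scoring function below (each matched query term contributes tf / (1 + tf)) and the equal field weights are assumptions chosen only to make the arithmetic concrete; any per-field score that saturates in tf shows the same effect.

```python
def field_score(tf_by_term):
    """Toy per-field score: every matched query term adds tf / (1 + tf)."""
    return sum(tf / (1.0 + tf) for tf in tf_by_term.values())


def score_combination(fields):
    """Score combination with equal field weights: add the field scores."""
    return sum(field_score(f) for f in fields)


# d1 matches only t1, but in all 8 fields.
d1 = [{"t1": 2, "t2": 0} for _ in range(8)]
# d2 matches both query terms, but only in fields F1 and F2.
d2 = [{"t1": 2, "t2": 2} for _ in range(2)] + [{"t1": 0, "t2": 0} for _ in range(6)]

s = 2 / (1.0 + 2)                # score of one term with tf = 2 in one field
score_d1 = score_combination(d1)  # 8 * s
score_d2 = score_combination(d2)  # 4 * s
```

Because every field is scored independently, the saturation of term frequency is reset in each field, and d1 outscores d2 even though it never matches t2.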

Page 33: TF Combination Issues

Consider a single-term query Q=t, and some documents with two fields (F1, F2). Assume w1 = weight(F1) = 5 and w2 = weight(F2) = 1.

Example-1 (assuming |d1|=|d2|)

d1: tf(t,F1)=1, tf(t,F2)=0
d2: tf(t,F1)=0, tf(t,F2)=6

tf(t,d1) = w1 * 1 + w2 * 0 = 5
tf(t,d2) = w1 * 0 + w2 * 6 = 6

score(d1) < score(d2): reasonable

Example-2 (assuming |d3|=|d4|)

d3: tf(t,F1)=1, tf(t,F2)=8
d4: tf(t,F1)=0, tf(t,F2)=14

tf(t,d3) = w1 * 1 + w2 * 8 = 13
tf(t,d4) = w1 * 0 + w2 * 14 = 14

score(d3) < score(d4): unreasonable

Larger w1? It cannot remove this issue, and it risks making the case of Example-1 unreasonable.
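
The arithmetic of both examples can be replayed with a few lines of Python. The helper name `combined_tf` is an assumption made here; the field weights and term frequencies are the ones given on the slide.

```python
def combined_tf(field_tfs, weights):
    """Term-frequency combination: weighted sum of the per-field tf values."""
    return sum(w * tf for w, tf in zip(weights, field_tfs))


weights = (5, 1)  # w1 = weight(F1), w2 = weight(F2)

# Example-1: (tf in F1, tf in F2)
tf_d1 = combined_tf((1, 0), weights)   # 5 * 1 + 1 * 0 = 5
tf_d2 = combined_tf((0, 6), weights)   # 5 * 0 + 1 * 6 = 6

# Example-2
tf_d3 = combined_tf((1, 8), weights)   # 5 * 1 + 1 * 8 = 13
tf_d4 = combined_tf((0, 14), weights)  # 5 * 0 + 1 * 14 = 14
```

With equal document lengths, the score is monotone in the combined tf, so d4 outranks d3 although d3 also matches in the more heavily weighted field F1.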

Page 34: Structured Document Retrieval by GBM

[Figure: a document D with 2 fields (field 1, field 2), whose term occurrences carry labels such as t1,0 t2,1 t3,2 t1,15 x,16 t3,17 t2,18 and t3,0 x,1 t1,9 t1,16 t3,32 t1,33 t1,34 t1,97 t2,98 t3,99, and a query Q with terms t1, t2, t3 at distance l. The lower part shows the optimized term placement of document D given query Q.]

Page 35: Experimental Results

[Figure: bar chart of average precision (y-axis, 0.000 to 0.400) for Baseline, FreqComb, ScoreComb, and GBM on 2003.td and 2004.mixed. Bar labels in extraction order: 0.277, 0.121, 0.150, 0.103, 0.32, 0.33, 0.37, 0.152.]

Performance comparison of different approaches for the combination of body and title fields

Page 37: Summary

Viewing IR from a different viewpoint is as important as going deeper from traditional perspectives.

This paper may be a first step toward a physics viewpoint.

It is encouraging that we can really benefit from nature:
    A family of effective ranking functions derived
    BM25 given a physics interpretation
    A more reasonable approach for structured document retrieval obtained

Page 38

Sorry, Sir Isaac Newton. Hope I am not abusing your laws.

Page 39: The End

Gravitation-Based Model for Information Retrieval

Please send your comments to: shumings@microsoft.com