CS590I: Information Retrieval
CS-590I
Information Retrieval
Retrieval Models: Language models
Luo Si
Department of Computer Science
Purdue University
Retrieval Model: Language Model
– Introduction to language models
– Unigram language models
– Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
– Model-based feedback
Language Models: Motivation

Vector space model for information retrieval:
– Documents and queries are vectors in the term space
– Relevance is measured by the similarity between document vectors and the query vector

Problems with the vector space model:
– Ad-hoc term weighting schemes
– Ad-hoc similarity measurement
– No justification of the relationship between relevance and similarity

We need more principled retrieval models…
Introduction to Language Models

A language model can be created for any language sample:
– A document
– A collection of documents
– A sentence, paragraph, chapter, query…

The size of the language sample affects the quality of the language model:
– Long documents yield more accurate models
– Short documents yield less accurate models
– Models for a sentence, paragraph, or query may not be reliable
Introduction to Language Models

A document language model defines a probability distribution over indexed terms:
– E.g., the probability of generating a term
– The probabilities sum to 1

A query can be seen as observed data from an unknown model:
– A query also defines a language model (more on this later)

How might the models be used for IR?
– Rank documents by Pr(q | d_i)
– Rank documents by the Kullback-Leibler (KL) divergence between the language models of q and d_i (more on this later)
Language Model for IR: Example

Estimating a language model for each document:
– d_1: sport, basketball, ticket, sport → language model for d_1
– d_2: basketball, ticket, finance, ticket, sport → language model for d_2
– d_3: stock, finance, finance, stock → language model for d_3

Estimate the generation probability Pr(q | d_i) for the query q = "sport, basketball", then generate the retrieval results.
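To make this concrete, here is a minimal Python sketch (the code and names are ours, not from the slides) that estimates an MLE unigram language model for each toy document above and ranks the documents by query generation probability:

```python
from collections import Counter

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
query = ["sport", "basketball"]

def mle_lm(doc):
    """Maximum likelihood unigram model: p_i(w) = tf_i(w) / |d_i|."""
    tf = Counter(doc)
    return {w: c / len(doc) for w, c in tf.items()}

def query_likelihood(q, lm):
    """Pr(q | d_i): product of p_i(w) over query terms; 0 for unseen terms."""
    prob = 1.0
    for w in q:
        prob *= lm.get(w, 0.0)
    return prob

ranked = sorted(docs, key=lambda d: query_likelihood(query, mle_lm(docs[d])),
                reverse=True)
for d in ranked:
    print(d, query_likelihood(query, mle_lm(docs[d])))
# d1 scores 0.5 * 0.25 = 0.125; d2 scores 0.2 * 0.2 = 0.04; d3 scores 0.
```

Note that d_3 receives probability 0 because it contains no query terms; the zero-probability issue for unseen words is exactly the sparseness problem addressed later in this lecture.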
Language Models

Three basic problems for language models:
– What type of probability distribution can be used to construct language models?
– How do we estimate the parameters of the distribution of the language models?
– How do we compute the likelihood of generating queries given the language models of documents?
Multinomial/Unigram Language Models

A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary.

Example:
– Five words in the vocabulary: (sport, basketball, ticket, finance, stock)
– For a document d_i, its language model is:
  {p_i("sport"), p_i("basketball"), p_i("ticket"), p_i("finance"), p_i("stock")}

Formally, the language model is {p_i(w) for any word w in vocabulary V}, with

$$\sum_k p_i(w_k) = 1, \qquad 0 \le p_i(w_k) \le 1$$
Multinomial/Unigram Language Models

Estimating a language model for each document:
– d_1: sport, basketball, ticket, sport → multinomial model for d_1
– d_2: basketball, ticket, finance, ticket, sport → multinomial model for d_2
– d_3: stock, finance, finance, stock → multinomial model for d_3
Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation: find the model parameters that make the generation likelihood reach its maximum:

$$M^* = \arg\max_M \Pr(d_i \mid M)$$

– There are K words in the vocabulary, w_1 … w_K (e.g., 5)
– Data: one document d_i with counts tf_i(w_1), …, tf_i(w_K) and length |d_i|
– Model: multinomial M with parameters {p_i(w_k)}
– Likelihood: Pr(d_i | M)
Maximum Likelihood Estimation (MLE)

Use the Lagrange multiplier approach: set the partial derivatives to zero and obtain the maximum likelihood estimate.

$$\Pr(d_i \mid M) = \frac{|d_i|!}{tf_i(w_1)! \cdots tf_i(w_K)!} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)} \;\propto\; \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)}$$

Log-likelihood:

$$l(d_i \mid M) = \log \Pr(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k) + \text{const}$$

Add the constraint $\sum_k p_i(w_k) = 1$ with a Lagrange multiplier:

$$l'(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k) + \lambda \Big( \sum_k p_i(w_k) - 1 \Big)$$

Setting the partial derivatives to zero:

$$\frac{\partial l'}{\partial p_i(w_k)} = \frac{tf_i(w_k)}{p_i(w_k)} + \lambda = 0 \;\Rightarrow\; p_i(w_k) = -\frac{tf_i(w_k)}{\lambda}$$

Since $\sum_k p_i(w_k) = 1$, we get $\lambda = -|d_i|$, so the maximum likelihood estimate is

$$p_i(w_k) = \frac{tf_i(w_k)}{|d_i|}$$
Maximum Likelihood Estimation (MLE)

Estimating the language model for each document, as (p_sp, p_b, p_t, p_f, p_st) over (sport, basketball, ticket, finance, stock):
– d_1: sport, basketball, ticket, sport → (0.5, 0.25, 0.25, 0, 0)
– d_2: basketball, ticket, finance, ticket, sport → (0.2, 0.2, 0.4, 0.2, 0)
– d_3: stock, finance, finance, stock → (0, 0, 0, 0.5, 0.5)
Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation assigns zero probability to words unseen in a small sample.

A specific example:
– Only two words in the vocabulary (w_1 = sport, w_2 = business), like (head, tail) for a coin
– A document generates a sequence of the two words, like flipping a coin many times:

$$\Pr(d_i \mid M) \propto p_i(w_1)^{tf_i(w_1)} \, (1 - p_i(w_1))^{tf_i(w_2)}$$

Only observe two words (flip the coin twice); the MLE estimators are:
– "business sport": p_i(w_1)* = 0.5
– "sport sport": p_i(w_1)* = 1 ?
– "business business": p_i(w_1)* = 0 ?

This is the data sparseness problem.
Solutions to the Sparse Data Problem
– Maximum a posteriori (MAP) estimation
– Shrinkage
– Bayesian ensemble approach
Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: select the model that maximizes the probability of the model given the observed data:

$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M) \Pr(M)$$

Pr(M) is the prior belief/knowledge; use the prior Pr(M) to avoid zero probabilities.

A specific example:
– Only two words in the vocabulary (sport, business)
– For a document d_i:

$$\Pr(M \mid d_i) \;\propto\; p_i(w_1)^{tf_i(w_1)} \, p_i(w_2)^{tf_i(w_2)} \cdot \Pr(M)$$

where Pr(M) is the prior distribution over models.
Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: introduce a prior on the multinomial distribution.
– Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased
– Use a Dirichlet prior on p(w):

$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}, \qquad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

where the α_k are hyper-parameters, the leading factor is a normalization constant, and Γ(x) is the gamma function:

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt, \qquad \Gamma(n) = (n-1)! \;\text{ for integer } n$$
Maximum A Posteriori (MAP) Estimation

For the two-word example, use a Dirichlet (Beta) prior

$$\Pr(M) \propto p_i(w_1)^{2} \, (1 - p_i(w_1))^{2}$$

(i.e., α_1 = α_2 = 3), which favors unbiased coins.

Maximum a posteriori:

$$\Pr(d_i \mid M)\Pr(M) \;\propto\; p_i(w_1)^{tf_i(w_1)} (1 - p_i(w_1))^{tf_i(w_2)} \cdot p_i(w_1)^{\alpha_1 - 1} (1 - p_i(w_1))^{\alpha_2 - 1}$$

$$= p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$
Maximum A Posteriori (MAP) Estimation

$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$$

$$M^* = \arg\max_{p_i(w_1)} \; p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$

The hyper-parameters act as pseudo counts added to the observed counts:

$$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{tf_i(w_1) + \alpha_1 - 1 + tf_i(w_2) + \alpha_2 - 1}$$
Maximum A Posteriori (MAP) Estimation

A specific example: only observe two words (flip a coin twice):

"sport sport": is p_i(w_1)* = 1 ?

With the prior $\Pr(M) \propto p_i(w_1)^2 (1 - p_i(w_1))^2$ (α_1 = α_2 = 3):

$$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{tf_i(w_1) + \alpha_1 - 1 + tf_i(w_2) + \alpha_2 - 1} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$$
MAP Estimation: Unigram Language Model

Maximum a posteriori estimation:
– Use a Dirichlet prior for the multinomial distribution
– There are K terms in the vocabulary

Multinomial: $p_i = \{p_i(w_1), \ldots, p_i(w_K)\}, \quad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$

Dirichlet prior (with hyper-parameters α_k and a normalization constant):

$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}$$

The MAP estimate is:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big( tf_i(w_{k'}) + \alpha_{k'} - 1 \big)}$$

How should we set the parameters of the Dirichlet prior?
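A minimal Python sketch of this MAP estimate (the function name and the toy counts are ours, not from the slides):

```python
def map_estimate(tf, alpha):
    """MAP unigram estimate:
    p_i(w_k) = (tf_i(w_k) + alpha_k - 1) / sum_k' (tf_i(w_k') + alpha_k' - 1).

    tf and alpha map each vocabulary word to its observed count and its
    Dirichlet hyper-parameter, respectively.
    """
    denom = sum(tf.get(w, 0) + alpha[w] - 1 for w in alpha)
    return {w: (tf.get(w, 0) + alpha[w] - 1) / denom for w in alpha}

# The coin example above: "sport sport" with alpha_1 = alpha_2 = 3
print(map_estimate({"sport": 2, "business": 0}, {"sport": 3, "business": 3}))
# {'sport': 0.666..., 'business': 0.333...}
```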
MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model:

$$p^* = \arg\max_{p} \; \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Dropping the constant factor:

$$p^* = \arg\max_{p} \prod_{k} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0. The pseudo counts are set by the hyper-parameters:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big( tf_i(w_{k'}) + \alpha_{k'} - 1 \big)}$$
MAP Estimation: Unigram Language Model

How do we determine appropriate values for the hyper-parameters? When nothing is observed from a document (all tf_i(w_k) = 0), the MAP estimate reduces to

$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)}$$

What is the most likely p_i(w_k) without looking at the content of the document?
MAP Estimation: Unigram Language Model

The most likely p_i(w_k) without looking into the content of the document d is the unigram probability of the collection:
– {p(w_1|C), p(w_2|C), …, p(w_K|C)}

Without any other information, we guess the behavior of one member from the behavior of the whole population. So set

$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)} = p(w_k \mid C) \;\Rightarrow\; \alpha_k = \mu \, p(w_k \mid C) + 1$$

where μ is a constant. The MAP estimate becomes

$$p_i^*(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{\sum_{k'} tf_i(w_{k'}) + \mu} = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model with the collection-based prior:

$$p^* = \arg\max_{p} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \mu \, p(w_k \mid C)}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

Here μ p(w_k|C) acts as a pseudo count for w_k, and μ as a pseudo document length.
Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 0: compute the word probabilities on the whole collection (the collection unigram language model):

$$p(w_k \mid C) = \frac{\sum_i tf_i(w_k)}{\sum_i |d_i|}$$

Step 1: for each document d_i, compute its smoothed unigram language model (Dirichlet smoothing) as

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
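A minimal Python sketch of Steps 0 and 1 (function and variable names are ours; mu = 2000 is a common default in the smoothing literature, not a value specified on the slides):

```python
from collections import Counter

def collection_model(docs):
    """Step 0: p(w|C) = total count of w in the collection / total collection length."""
    total = Counter()
    length = 0
    for doc in docs:  # docs: list of token lists
        total.update(doc)
        length += len(doc)
    return {w: c / length for w, c in total.items()}

def dirichlet_lm(doc, p_c, mu=2000.0):
    """Step 1: p_i(w) = (tf_i(w) + mu * p(w|C)) / (|d_i| + mu)."""
    tf = Counter(doc)
    return {w: (tf.get(w, 0) + mu * p_c[w]) / (len(doc) + mu) for w in p_c}
```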
Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 2: for a given query q = {tf_q(w_1), …, tf_q(w_K)}, compute the likelihood for each document d_i:

$$p(q \mid d_i) = \prod_{k=1}^{K} p(w_k \mid d_i)^{tf_q(w_k)} = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

The larger the likelihood, the more relevant the document is to the query.
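Continuing the sketch above, ranking by Dirichlet-smoothed query likelihood might look like this (log probabilities are used for numerical stability, an implementation detail not on the slides; query words are assumed to occur somewhere in the collection so that p(w|C) > 0):

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, p_c, mu=2000.0):
    """Step 2: log p(q|d_i) = sum_k tf_q(w_k) * log p_i(w_k)."""
    lm = dirichlet_lm(doc, p_c, mu)
    return sum(tf_q * math.log(lm[w]) for w, tf_q in Counter(query).items())

# Rank all documents for a query (docs: list of token lists):
# p_c = collection_model(docs)
# ranked = sorted(range(len(docs)),
#                 key=lambda i: log_query_likelihood(query, docs[i], p_c),
#                 reverse=True)
```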
Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:

$$p(q \mid d_i) = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

TF-IDF weighting:

$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k) \, tf_i(w_k) \, idf(w_k) \, norm(d_i)$$

What is the relationship between the two?
Dirichlet Smoothing & TF-IDF

Take the logarithm of the Dirichlet-smoothed query likelihood:

$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \log \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

$$= \sum_{k=1}^{K} tf_q(w_k) \left[ \log \left( \mu \, p(w_k \mid C) \Big( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \Big) \right) - \log (|d_i| + \mu) \right]$$

$$= \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu) + \sum_{k=1}^{K} tf_q(w_k) \log \big( \mu \, p(w_k \mid C) \big)$$
Dirichlet Smoothing & TF-IDF

$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu) + \underbrace{\sum_{k=1}^{K} tf_q(w_k) \log \big( \mu \, p(w_k \mid C) \big)}_{\text{document-independent (irrelevant) part}}$$

Dropping the irrelevant part:

$$\log p(q \mid d_i) \;\propto\; \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu)$$

Compare with TF-IDF weighting:

$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k) \, tf_i(w_k) \, idf(w_k) \, norm(d_i)$$

The term

$$\log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right)$$

plays the role of the tf.idf weight, and $-|q| \log(|d_i| + \mu)$ acts as document length normalization.
Dirichlet Smoothing & TF-IDF

Look at the tf.idf part:

$$tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right)$$

– The weight grows with tf_i(w_k), like the tf factor
– The weight grows as p(w_k|C) shrinks, i.e., words that are rare in the collection get higher weight, like the idf factor

This comes directly from the smoothed document model

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
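A quick numeric check of the decomposition above (a sketch with made-up counts, not slide content): the full per-term log probability and the tf.idf-style decomposition should differ only in how the document-independent constant is grouped.

```python
import math

# Made-up toy numbers: one query term with tf_q = 1
tf_i, p_c, mu, doc_len = 3, 0.01, 100.0, 50

full = math.log((tf_i + mu * p_c) / (doc_len + mu))
decomposed = (math.log(1 + tf_i / (mu * p_c))   # tf.idf-like part
              - math.log(doc_len + mu)          # length normalization
              + math.log(mu * p_c))             # document-independent constant

assert abs(full - decomposed) < 1e-12
```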
Dirichlet Smoothing: Hyper-Parameter

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

– When μ is very small, the estimate approaches the MLE estimator
– When μ is very large, the estimate approaches the probability on the whole collection

How do we set an appropriate μ?
Dirichlet Smoothing: Hyper-Parameter

Leave-one-out validation: hold out one word occurrence, estimate the smoothed model from the rest of the document, and measure how well it predicts the held-out word.

Leave w_1 out:

$$p(w_1 \mid d_i \setminus w_1) = \frac{tf_i(w_1) - 1 + \mu \, p(w_1 \mid C)}{|d_i| - 1 + \mu}$$

Leave w_j out:

$$p(w_j \mid d_i \setminus w_j) = \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Choose the μ that maximizes the leave-one-out log-likelihood:

$$\mu^* = \arg\max_{\mu} \; l_{-1}(\mu, C)$$
Dirichlet Smoothing: Hyper-Parameter

Leave-one-out validation: leave out each word occurrence of a document, one by one:

$$l_{-1}(\mu, d_i) = \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Do this for all documents in the collection:

$$l_{-1}(\mu, C) = \sum_{i=1}^{|C|} \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Find the μ that maximizes this quantity.
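A minimal sketch of this selection procedure in Python (the candidate grid for mu is our choice for illustration; any 1-D optimizer over mu would do):

```python
import math
from collections import Counter

def loo_log_likelihood(docs, p_c, mu):
    """l_{-1}(mu, C): leave-one-out log-likelihood over all word occurrences."""
    total = 0.0
    for doc in docs:
        tf = Counter(doc)
        for w in doc:  # each occurrence is held out once
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (len(doc) - 1 + mu))
    return total

def select_mu(docs, p_c, candidates=(100, 500, 1000, 2000, 5000)):
    """mu* = argmax over the candidate grid of the leave-one-out log-likelihood."""
    return max(candidates, key=lambda mu: loo_log_likelihood(docs, p_c, mu))
```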
Dirichlet Smoothing: Hyper-Parameter

What type of document/collection would get a large μ?
– Most documents use similar vocabulary and wording patterns to the whole collection

What type of document/collection would get a small μ?
– Most documents use different vocabulary and wording patterns than the whole collection
Shrinkage

Maximum likelihood estimation (MLE) builds the model purely on document data and then generates the query words. The model may not be accurate when the document is short (many unseen words).

A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model).

Example: estimating P(Lung_Cancer | Smoke) for West Lafayette by consulting the more general estimates for Indiana and the U.S.
Shrinkage

Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability λ, and from the collection language model (MLE) with probability 1-λ.

This is a linear interpolation between the document language model and the collection language model.

JM smoothing:

$$p_i(w_k) = \lambda \, \frac{tf_i(w_k)}{|d_i|} + (1 - \lambda) \, p(w_k \mid C)$$

Compare with Dirichlet smoothing:

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
Shrinkage

Relationship between JM smoothing and Dirichlet smoothing: Dirichlet smoothing is JM smoothing with a document-dependent interpolation weight $\lambda_i = \frac{|d_i|}{|d_i| + \mu}$:

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}
= \frac{|d_i|}{|d_i| + \mu} \cdot \frac{tf_i(w_k)}{|d_i|} + \frac{\mu}{|d_i| + \mu} \cdot p(w_k \mid C)$$
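A one-line numeric check of this identity (a sketch with arbitrary toy numbers, in the same style as the earlier snippets):

```python
# Arbitrary toy numbers
tf, doc_len, p_c, mu = 3, 50, 0.01, 100.0

dirichlet = (tf + mu * p_c) / (doc_len + mu)
lam = doc_len / (doc_len + mu)                  # lambda_i = |d_i| / (|d_i| + mu)
jm = lam * tf / doc_len + (1 - lam) * p_c       # JM with document-dependent lambda

assert abs(dirichlet - jm) < 1e-12
```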
Model-Based Feedback

Retrieval based on query generation likelihood is equivalent to retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models.

KL divergence between two probability distributions:

$$KL(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

– It is a distance-like measure between two probability distributions (though not symmetric)
– It is always non-negative (how would you prove it?)
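A small helper in the same Python sketch style (names are ours); note KL(p‖q) is only finite when q(x) > 0 wherever p(x) > 0:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)); non-negative, 0 iff p == q."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))  # > 0
print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))  # 0.0
```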
Model-Based Feedback

Equivalence of retrieval based on query generation likelihood and on the KL divergence between query and document language models:

$$Sim(q, d_i) = -KL(q \parallel d_i) = -\sum_w q(w) \log \frac{q(w)}{p_i(w)} = \sum_w q(w) \log p_i(w) - \sum_w q(w) \log q(w)$$

– The first term is the (length-normalized) log-likelihood of the query generation probability
– The second term is a document-independent constant
– This generalizes the query representation to a distribution (fractional term weighting)
[Figure: two retrieval pipelines. Query-likelihood pipeline: estimate a language model for each document d_i, estimate the generation probability Pr(q | d_i), and produce retrieval results. KL-divergence pipeline: estimate a language model for the query q and for each document d_i, calculate KL(q || d_i), and produce retrieval results.]
Model-Based Feedback

[Figure: the KL-divergence pipeline extended with feedback. Feedback documents from the initial results are used to estimate a feedback language model q_F, which is interpolated with the original query model before the KL(q' || d_i) ranking.]

New query model:

$$q' = (1 - \alpha) \, q + \alpha \, q_F$$

– α = 0: no feedback, q' = q
– α = 1: full feedback, q' = q_F
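A sketch of the interpolation step (the default alpha = 0.5 is an arbitrary illustrative choice, not a slide value):

```python
def interpolate_query(q, q_f, alpha=0.5):
    """New query model: q'(w) = (1 - alpha) * q(w) + alpha * q_F(w)."""
    vocab = set(q) | set(q_f)
    return {w: (1 - alpha) * q.get(w, 0.0) + alpha * q_f.get(w, 0.0)
            for w in vocab}
```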
Model-Based Feedback

Assume there is a generative model that produces each word within the feedback document(s) F: for each word w, flip a coin; with probability λ the word is drawn from the topic model q_F(w) (topic words), and with probability 1-λ from the background collection model p_C(w).

Estimate q_F by maximizing the likelihood of the feedback documents:

$$q_F^* = \arg\max_{q_F} \; l(F \mid q_F) = \arg\max_{q_F} \sum_{i=1}^{n} \log \big( \lambda \, q_F(w_i) + (1 - \lambda) \, p_C(w_i) \big)$$

where w_1, …, w_n are the word occurrences in the feedback document(s).
Model-Based Feedback: Estimate q_F

For each word, there is a hidden variable telling which language model it comes from.

Background model p_C(w|C), known, with weight 1-λ = 0.8:
the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005, …

Unknown query topic model p(w|F) = q_F(w), with weight λ = 0.2, e.g., the topic "Basketball":
sport = ?, basketball = ?, game = ?, player = ?, …

If we knew the value of the hidden variable for each word, the MLE estimator would be straightforward: count only the words assigned to the topic model.
Model-Based Feedback: Estimate q_F

For each word occurrence w_i, the hidden variable is z_i = {1 (feedback/topic), 0 (background)}.

Step 1: estimate the hidden variables based on the current model parameters (Expectation):

$$p(z_i = 1 \mid w_i) = \frac{p(z_i = 1) \, p(w_i \mid z_i = 1)}{p(z_i = 1) \, p(w_i \mid z_i = 1) + p(z_i = 0) \, p(w_i \mid z_i = 0)} = \frac{\lambda \, q_F^{(t)}(w_i)}{\lambda \, q_F^{(t)}(w_i) + (1 - \lambda) \, p(w_i \mid C)}$$

E.g., the (0.1), basketball (0.7), game (0.6), is (0.2), …

Step 2: update the model parameters based on the guess in Step 1 (Maximization):
$$q_F^{(t+1)}(w_i) = \frac{c(w_i; F) \, p(z_i = 1 \mid w_i)}{\sum_j c(w_j; F) \, p(z_j = 1 \mid w_j)}$$

where c(w; F) is the count of w in the feedback documents (M-step).
Model-Based Feedback: Estimate q_F

The Expectation-Maximization (EM) algorithm:

Step 0: initialize the values of q_F^{(0)}

Step 1 (Expectation):

$$p(z_i = 1 \mid w_i) = \frac{\lambda \, q_F^{(t)}(w_i)}{\lambda \, q_F^{(t)}(w_i) + (1 - \lambda) \, p(w_i \mid C)}$$

Step 2 (Maximization):

$$q_F^{(t+1)}(w_i) = \frac{c(w_i; F) \, p(z_i = 1 \mid w_i)}{\sum_j c(w_j; F) \, p(z_j = 1 \mid w_j)}$$

Iterate Steps 1 and 2 until convergence.
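A compact Python sketch of this EM procedure (a minimal implementation under the slide's mixture assumptions; the uniform initialization and fixed iteration count are our choices):

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_c, lam=0.2, iters=50):
    """EM for q_F in the mixture p(w) = lam * q_F(w) + (1 - lam) * p_C(w).

    feedback_docs: list of token lists; p_c: collection model p(w|C),
    assumed to cover every word in the feedback documents.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)  # c(w; F)
    vocab = list(counts)
    q_f = {w: 1.0 / len(vocab) for w in vocab}  # Step 0: uniform init
    for _ in range(iters):
        # E-step: p(z=1|w) = lam*q_F(w) / (lam*q_F(w) + (1-lam)*p_C(w))
        p_z = {w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c[w])
               for w in vocab}
        # M-step: q_F(w) = c(w;F)*p(z=1|w) / sum_j c(w_j;F)*p(z=1|w_j)
        denom = sum(counts[w] * p_z[w] for w in vocab)
        q_f = {w: counts[w] * p_z[w] / denom for w in vocab}
    return q_f
```

Because common words like "the" have high p(w|C), the E-step assigns them low p(z=1|w), so the learned q_F concentrates its mass on topic words.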
Model-Based Feedback: Estimate q_F

Properties of the parameter λ (e.g., λ = 0.5):
– If λ is close to 0, most common words can be generated from the collection language model, so more topic words appear in the query language model
– If λ is close to 1, the query language model has to generate most common words itself, so fewer topic words stand out in the query language model
Retrieval Model: Language Model
– Introduction to language models
– Unigram language models
– Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
– Model-based feedback