
Page 1: CS590I: Information Retrieval

CS590I: Information Retrieval

CS-590I

Information Retrieval

Retrieval Models: Language models

Luo Si

Department of Computer Science

Purdue University

Page 2: CS590I: Information Retrieval

Retrieval Model: Language Model

Introduction to language model

Unigram language model

Document language model estimation
- Maximum Likelihood estimation
- Maximum a posteriori estimation
- Jelinek-Mercer smoothing

Model-based feedback

Page 3: CS590I: Information Retrieval

Language Models: Motivation

Vector space model for information retrieval:
- Documents and queries are vectors in the term space
- Relevance is measured by the similarity between the document vectors and the query vector

Problems with the vector space model:
- Ad-hoc term weighting schemes
- Ad-hoc similarity measurement
- No justification for the relationship between relevance and similarity

We need more principled retrieval models…

Page 4: CS590I: Information Retrieval

Introduction to Language Models:

A language model can be built from any language sample:
- A document
- A collection of documents
- A sentence, paragraph, chapter, query…

The size of the language sample affects the quality of the language model:
- Long documents yield more accurate models
- Short documents yield less accurate models
- Models for a sentence, paragraph, or query may not be reliable

Page 5: CS590I: Information Retrieval

Introduction to Language Models:

A document language model defines a probability distribution over indexed terms
- E.g., the probability of generating a term; the probabilities sum to 1

A query can be seen as data observed from an unknown model
- The query also defines a language model (more on this later)

How might the models be used for IR?
- Rank documents by $\Pr(q \mid d_i)$
- Rank documents by the Kullback-Leibler (KL) divergence between the language models of $d_i$ and $q$ (more on this later)

Page 6: CS590I: Information Retrieval

Language Model for IR: Example

Estimate a language model for each document:
- $d_1$: sport, basketball, ticket, sport → language model for $d_1$
- $d_2$: basketball, ticket, finance, ticket, sport → language model for $d_2$
- $d_3$: stock, finance, finance, stock → language model for $d_3$

For the query $q$ = (sport, basketball), estimate the generation probability $\Pr(q \mid d_i)$ and generate the retrieval results.

Page 7: CS590I: Information Retrieval

Language Models

Three basic problems for language models:
- What type of probability distribution can be used to construct language models?
- How to estimate the parameters of the distribution of the language models?
- How to compute the likelihood of generating queries given the language models of documents?

Page 8: CS590I: Information Retrieval

Multinomial/Unigram Language Models

A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary.

Example:
- Five words in the vocabulary (sport, basketball, ticket, finance, stock)
- For a document $d_i$, its language model is: {P_i("sport"), P_i("basketball"), P_i("ticket"), P_i("finance"), P_i("stock")}

Formally, the language model is $\{P_i(w) : w \in V\}$ with
$$\sum_{k} P_i(w_k) = 1, \qquad 0 \le P_i(w_k) \le 1$$
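A minimal sketch of what such a model looks like in code, assuming plain Python; the probability values are illustrative, not prescribed by the slide:

```python
# A unigram language model for the five-word vocabulary, stored as a dict.
lm_d1 = {"sport": 0.5, "basketball": 0.25, "ticket": 0.25, "finance": 0.0, "stock": 0.0}

# Check the multinomial constraints: each probability lies in [0, 1] and they sum to 1.
assert all(0.0 <= p <= 1.0 for p in lm_d1.values())
assert abs(sum(lm_d1.values()) - 1.0) < 1e-9
```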

Page 9: CS590I: Information Retrieval

Multinomial/Unigram Language Models

Estimate a language model for each document:
- $d_1$: sport, basketball, ticket, sport → multinomial model for $d_1$
- $d_2$: basketball, ticket, finance, ticket, sport → multinomial model for $d_2$
- $d_3$: stock, finance, finance, stock → multinomial model for $d_3$

Page 10: CS590I: Information Retrieval

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation: find the model parameters that maximize the generation likelihood:
$$M^* = \arg\max_M \Pr(d_i \mid M)$$

- There are K words in the vocabulary, $w_1, \ldots, w_K$ (e.g., K = 5)
- Data: one document $d_i$ (from the collection $d_1, \ldots, d_I$) with counts $tf_i(w_1), \ldots, tf_i(w_K)$ and length $|d_i|$
- Model: multinomial M with parameters $\{p_i(w_k)\}$
- Likelihood: $\Pr(d_i \mid M)$

Page 11: CS590I: Information Retrieval

Maximum Likelihood Estimation (MLE)

The likelihood of the document under the multinomial model:
$$p(d_i \mid M) = \frac{|d_i|!}{tf_i(w_1)! \cdots tf_i(w_K)!} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)}$$

$$l(d_i \mid M) = \log p(d_i \mid M) = \sum_{k} tf_i(w_k) \log p_i(w_k) + \text{const}$$

Use the Lagrange multiplier approach:
$$l'(d_i \mid M) = \sum_{k} tf_i(w_k) \log p_i(w_k) + \lambda \Big( \sum_{k} p_i(w_k) - 1 \Big)$$

Set the partial derivatives to zero:
$$\frac{\partial l'}{\partial p_i(w_k)} = \frac{tf_i(w_k)}{p_i(w_k)} + \lambda = 0 \;\Rightarrow\; p_i(w_k) = -\frac{tf_i(w_k)}{\lambda}$$

Since $\sum_k p_i(w_k) = 1$ and $\sum_k tf_i(w_k) = |d_i|$, we get $\lambda = -|d_i|$, so the maximum likelihood estimate is
$$p_i(w_k) = \frac{tf_i(w_k)}{|d_i|}$$

Page 12: CS590I: Information Retrieval

Maximum Likelihood Estimation (MLE)

Estimate a language model for each document:
- $d_1$: sport, basketball, ticket, sport → $(p_{sp}, p_b, p_t, p_f, p_{st}) = (0.5, 0.25, 0.25, 0, 0)$
- $d_2$: basketball, ticket, finance, ticket, sport → $(p_{sp}, p_b, p_t, p_f, p_{st}) = (0.2, 0.2, 0.4, 0.2, 0)$
- $d_3$: stock, finance, finance, stock → $(p_{sp}, p_b, p_t, p_f, p_{st}) = (0, 0, 0, 0.5, 0.5)$
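A minimal sketch of the MLE estimator $p_i(w_k) = tf_i(w_k)/|d_i|$, assuming plain Python and the three toy documents above; it reproduces the probability vectors on this slide:

```python
from collections import Counter

VOCAB = ["sport", "basketball", "ticket", "finance", "stock"]

def mle_unigram_lm(doc_terms):
    """Maximum likelihood estimate: p_i(w) = tf_i(w) / |d_i|."""
    tf = Counter(doc_terms)
    return {w: tf[w] / len(doc_terms) for w in VOCAB}

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
for name, doc in docs.items():
    print(name, mle_unigram_lm(doc))
# d1 -> {'sport': 0.5, 'basketball': 0.25, 'ticket': 0.25, 'finance': 0.0, 'stock': 0.0}
# d2 -> {'sport': 0.2, 'basketball': 0.2, 'ticket': 0.4, 'finance': 0.2, 'stock': 0.0}
# d3 -> {'sport': 0.0, 'basketball': 0.0, 'ticket': 0.0, 'finance': 0.5, 'stock': 0.5}
```

The zeros assigned to unseen words are exactly the data sparseness problem discussed on the next slides.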

Page 13: CS590I: Information Retrieval

Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation assigns zero probability to words unseen in a small sample.

A specific example:
- Only two words in the vocabulary (w1 = sport, w2 = business), like (head, tail) for a coin
- A document generates a sequence of the two words, like flipping the coin many times

$$\Pr(d_i \mid M) = p_i(w_1)^{tf_i(w_1)} \, \big(1 - p_i(w_1)\big)^{tf_i(w_2)}$$

Observe only two words (flip the coin twice); the MLE estimators are:
- "business sport": P_i(w1) = 0.5
- "sport sport": P_i(w1) = 1 ?
- "business business": P_i(w1) = 0 ?

Page 14: CS590I: Information Retrieval

Maximum Likelihood Estimation (MLE)

A specific example: observe only two words (flip the coin twice); the MLE estimators are:
- "business sport": P_i(w1)* = 0.5
- "sport sport": P_i(w1)* = 1 ?
- "business business": P_i(w1)* = 0 ?

Data sparseness problem

Page 15: CS590I: Information Retrieval

Solutions to Sparse Data Problems
- Maximum a posteriori (MAP) estimation
- Shrinkage
- Bayesian ensemble approach

Page 16: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: select the model that maximizes the probability of the model given the observed data:
$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\,\Pr(M)$$
where $\Pr(M)$ is the prior belief/knowledge; the prior is used to avoid zero probabilities.

A specific example:
- Only two words in the vocabulary (sport, business)
- For a document $d_i$:
$$\Pr(M \mid d_i) \propto p_i(w_1)^{tf_i(w_1)}\, p_i(w_2)^{tf_i(w_2)}\, \Pr(M)$$
with $\Pr(M)$ the prior distribution.

Page 17: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: introduce a prior on the multinomial distribution
- Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased
- Use a Dirichlet prior on p(w):

$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}, \qquad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Here $\alpha_1, \ldots, \alpha_K$ are hyper-parameters, the leading ratio of gamma functions is a normalizing constant for $p_i$, and $\Gamma(x)$ is the gamma function:
$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt, \qquad \Gamma(n) = (n-1)! \;\text{ if } n \in \mathbb{Z}^+$$

Page 18: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

For the two-word example, a Dirichlet prior:
$$\Pr(M) \propto p(w_1)^2 \, \big(1 - p(w_1)\big)^2$$

[Figure: plot of the prior density $p(w_1)^2 (1 - p(w_1))^2$, peaked at $p(w_1) = 0.5$.]

Page 19: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\,\Pr(M)$$

For the two-word example:
$$\Pr(d_i \mid M)\,\Pr(M) \propto p_i(w_1)^{tf_i(w_1)} \big(1 - p_i(w_1)\big)^{tf_i(w_2)} \cdot p_i(w_1)^{\alpha_1 - 1} \big(1 - p_i(w_1)\big)^{\alpha_2 - 1} = p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \big(1 - p_i(w_1)\big)^{tf_i(w_2) + \alpha_2 - 1}$$

$$M^* = \arg\max_{p_i(w_1)} \; p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \big(1 - p_i(w_1)\big)^{tf_i(w_2) + \alpha_2 - 1}$$

The terms $tf_i(w_k) + \alpha_k - 1$ act as pseudo counts.

Page 20: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

A specific example: observe only two words (flip a coin twice), "sport sport"; is $P_i(w_1)^* = 1$?

[Figure: the likelihood $p(w_1)^2$ multiplied by the Dirichlet prior $p(w_1)^2 (1 - p(w_1))^2$; the posterior mode is pulled away from 1.]

Page 21: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

A specific example: observe only two words (flip a coin twice), "sport sport", so the MLE would give $P_i(w_1)^* = 1$. With the Dirichlet prior ($\alpha_1 = \alpha_2 = 3$), the MAP estimate is
$$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{\big(tf_i(w_1) + \alpha_1 - 1\big) + \big(tf_i(w_2) + \alpha_2 - 1\big)} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$$
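A minimal sketch of the MAP estimate with Dirichlet pseudo counts, assuming plain Python; the function name is illustrative, and the α values match the two-word coin example above:

```python
def map_unigram_lm(term_counts, alphas):
    """MAP estimate: p(w) = (tf(w) + alpha_w - 1) / sum_k (tf(w_k) + alpha_k - 1)."""
    denom = sum(term_counts.get(w, 0) + a - 1 for w, a in alphas.items())
    return {w: (term_counts.get(w, 0) + a - 1) / denom for w, a in alphas.items()}

# "sport sport" observed, alpha_sport = alpha_business = 3: the estimate is 2/3, not the MLE value 1.
print(map_unigram_lm({"sport": 2}, {"sport": 3, "business": 3}))
# {'sport': 0.666..., 'business': 0.333...}
```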

Page 22: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

Maximum a posteriori estimation:
- Use a Dirichlet prior for the multinomial distribution
- How to set the parameters of the Dirichlet prior?

Page 23: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

Maximum a posteriori estimation: use a Dirichlet prior for the multinomial distribution.

There are K terms in the vocabulary:

Multinomial: $p_i = \{p_i(w_1), \ldots, p_i(w_K)\}$ with $\sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$

Dirichlet prior:
$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}$$

where $\alpha_1, \ldots, \alpha_K$ are hyper-parameters and the leading factor is a normalizing constant for $p_i$.

Page 24: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model:

$$p_i^* = \arg\max_{p_i} \frac{\Gamma\big(\sum_{k} \alpha_k\big)}{\prod_{k}\Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1} = \arg\max_{p_i} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1}$$
$$\text{s.t. } \sum_k p_i(w_k) = 1, \quad 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big(tf_i(w_{k'}) + \alpha_{k'} - 1\big)}$$

The pseudo counts are set by the hyper-parameters.

Page 25: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model (Lagrange multiplier, derivative set to 0):

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big(tf_i(w_{k'}) + \alpha_{k'} - 1\big)}$$

How to determine appropriate values for the hyper-parameters?

When nothing is observed from a document:
$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)}$$

What is the most likely $p_i(w_k)$ without looking at the content of the document?

Page 26: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

What is the most likely $p_i(w_k)$ without looking at the content of the document?
- The most likely $p_i(w_k)$ without looking into the content of the document d is the unigram probability of the collection: {p(w1|c), p(w2|c), …, p(wK|c)}
- Without any other information, guess the behavior of one member from the behavior of the whole population

So set the hyper-parameters such that
$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)} = p_c(w_k), \qquad \text{i.e.} \quad \alpha_k = \mu\, p_c(w_k) + 1$$
where $\mu$ is a constant.

Page 27: CS590I: Information Retrieval

MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model with $\alpha_k = \mu\, p_c(w_k) + 1$:

$$p_i^* = \arg\max_{p_i} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \mu\, p_c(w_k)}, \qquad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu}$$

Here $\mu\, p_c(w_k)$ acts as pseudo counts and $\mu$ as a pseudo document length.

Page 28: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 0: compute the collection unigram language model on the whole collection:
$$p_c(w_k) = \frac{\sum_i tf_i(w_k)}{\sum_i |d_i|}$$

Step 1: for each document $d_i$, compute its smoothed unigram language model (Dirichlet smoothing) as
$$p_i(w_k) = \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu}$$

Page 29: CS590I: Information Retrieval

Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 2: for a given query $q = \{tf_q(w_1), \ldots, tf_q(w_K)\}$, for each document $d_i$ compute the likelihood
$$p(q \mid d_i) = \prod_{k=1}^{K} p_i(w_k)^{tf_q(w_k)} = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

The larger the likelihood, the more relevant the document is to the query.
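A minimal sketch of Steps 0-2, assuming plain Python and the toy collection from the earlier example; the helper names and the value of μ are arbitrary choices, not from the slides:

```python
import math
from collections import Counter

def collection_lm(docs):
    """Step 0: collection model p_c(w) = sum_i tf_i(w) / sum_i |d_i|."""
    tf = Counter(w for d in docs for w in d)
    total = sum(len(d) for d in docs)
    return {w: c / total for w, c in tf.items()}

def dirichlet_log_likelihood(query, doc, p_c, mu=2.0):
    """Steps 1-2: log p(q|d) with the Dirichlet-smoothed document model."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        # assumes every query word occurs somewhere in the collection, so p_c(w) > 0
        p_w = (tf[w] + mu * p_c.get(w, 0.0)) / (len(doc) + mu)
        score += math.log(p_w)
    return score

docs = [["sport", "basketball", "ticket", "sport"],
        ["basketball", "ticket", "finance", "ticket", "sport"],
        ["stock", "finance", "finance", "stock"]]
p_c = collection_lm(docs)
query = ["sport", "basketball"]
scores = [dirichlet_log_likelihood(query, d, p_c) for d in docs]
print(sorted(range(len(docs)), key=lambda i: scores[i], reverse=True))  # rank documents by likelihood
```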

Page 30: CS590I: Information Retrieval

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:
$$p(q \mid d_i) = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

TF-IDF weighting:
$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k)\, tf_i(w_k)\, idf(w_k) \,/\, norm(d_i)$$

What is the relationship between the two?

Page 31: CS590I: Information Retrieval

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:
$$p(q \mid d_i) = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

TF-IDF weighting:
$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k)\, tf_i(w_k)\, idf(w_k) \,/\, norm(d_i)$$

Take the log of the query likelihood:
$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \left[ \log\!\left(1 + \frac{tf_i(w_k)}{\mu\, p_c(w_k)}\right) - \log(|d_i| + \mu) + \log\big(\mu\, p_c(w_k)\big) \right]$$

Page 32: CS590I: Information Retrieval

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:
$$p(q \mid d_i) = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \log\big(tf_i(w_k) + \mu\, p_c(w_k)\big) - \sum_{k=1}^{K} tf_q(w_k) \log(|d_i| + \mu)$$

$$= \sum_{k=1}^{K} tf_q(w_k) \left[ \log\!\left(1 + \frac{tf_i(w_k)}{\mu\, p_c(w_k)}\right) + \log\big(\mu\, p_c(w_k)\big) - \log(|d_i| + \mu) \right]$$

Page 33: CS590I: Information Retrieval

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:
$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \left[ \log\!\left(1 + \frac{tf_i(w_k)}{\mu\, p_c(w_k)}\right) - \log(|d_i| + \mu) \right] + \sum_{k=1}^{K} tf_q(w_k) \log\big(\mu\, p_c(w_k)\big)$$

The last sum is document independent: it is an irrelevant part for ranking. Dropping it gives the ranking formula
$$\log p(q \mid d_i) \;\propto\; \sum_{k=1}^{K} tf_q(w_k) \left[ \log\!\left(1 + \frac{tf_i(w_k)}{\mu\, p_c(w_k)}\right) - \log(|d_i| + \mu) \right]$$

TF-IDF weighting:
$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k)\, tf_i(w_k)\, idf(w_k) \,/\, norm(d_i)$$

Page 34: CS590I: Information Retrieval

Dirichlet Smoothing & TF-IDF

Dirichlet smoothing: look at the tf.idf part
$$\log\!\left(1 + \frac{tf_i(w_k)}{\mu\, p_c(w_k)}\right)$$

- It grows with the term frequency $tf_i(w_k)$, like a TF weight
- It shrinks as the collection probability $p_c(w_k)$ grows, so $1/p_c(w_k)$ plays the role of an IDF weight

Page 35: CS590I: Information Retrieval

Dirichlet Smoothing Hyper-Parameter

Dirichlet smoothing with hyper-parameter $\mu$:
$$p_i(w_k) = \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu}$$

- When $\mu$ is very small, the estimate approaches the MLE estimator
- When $\mu$ is very large, the estimate approaches the probability on the whole collection

How to set an appropriate $\mu$?

Page 36: CS590I: Information Retrieval

Dirichlet Smoothing Hyper-Parameter

Leave-one-out validation: remove one word occurrence from the document, and predict it with the model estimated from the remaining words.

Leave $w_1$ out:
$$p(w_1 \mid d_i / w_1) = \frac{tf_i(w_1) - 1 + \mu\, p_c(w_1)}{|d_i| - 1 + \mu}$$

Leave $w_j$ out:
$$p(w_j \mid d_i / w_j) = \frac{tf_i(w_j) - 1 + \mu\, p_c(w_j)}{|d_i| - 1 + \mu}$$

Page 37: CS590I: Information Retrieval

Dirichlet Smoothing Hyper-Parameter

Leave-one-out validation:

Leave all words out one by one for a document:
$$l_{-1}(\mu, d_i) = \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu\, p_c(w_j)}{|d_i| - 1 + \mu}$$

Do the procedure for all documents in the collection:
$$l_{-1}(\mu, C) = \sum_{i=1}^{|C|} \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu\, p_c(w_j)}{|d_i| - 1 + \mu}$$

Find the appropriate $\mu$:
$$\mu^* = \arg\max_{\mu} \, l_{-1}(\mu, C)$$
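A minimal sketch of the leave-one-out search for μ, assuming plain Python, the toy collection used earlier, and an arbitrary small grid of μ candidates:

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, p_c):
    """Leave-one-out log likelihood l_{-1}(mu, C) over the whole collection."""
    total = 0.0
    for d in docs:
        tf = Counter(d)
        for w in d:  # leave each word occurrence out in turn
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (len(d) - 1 + mu))
    return total

docs = [["sport", "basketball", "ticket", "sport"],
        ["basketball", "ticket", "finance", "ticket", "sport"],
        ["stock", "finance", "finance", "stock"]]
tf_all = Counter(w for d in docs for w in d)
n_all = sum(len(d) for d in docs)
p_c = {w: c / n_all for w, c in tf_all.items()}  # collection language model

# mu* = argmax_mu l_{-1}(mu, C), here found by a simple grid search.
best_mu = max([0.1, 0.5, 1, 2, 5, 10, 50], key=lambda m: loo_log_likelihood(m, docs, p_c))
print(best_mu)
```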

Page 38: CS590I: Information Retrieval

Dirichlet Smoothing Hyper-Parameter

What type of document/collection would get a large $\mu$?
– Most documents use similar vocabulary and wording patterns as the whole collection

What type of document/collection would get a small $\mu$?
– Most documents use different vocabulary and wording patterns from the whole collection

Page 39: CS590I: Information Retrieval

Shrinkage

Maximum likelihood estimation (MLE) builds the model purely on document data and uses it to generate query words
- The model may not be accurate when the document is short (many unseen words)

A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model)
- Example: estimate P(Lung_Cancer | Smoke) at increasingly general levels: West Lafayette, Indiana, U.S.

Page 40: CS590I: Information Retrieval

Shrinkage

Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability $\lambda$, and from the collection language model (MLE) with probability $1 - \lambda$
- Linear interpolation between the document language model and the collection language model

JM smoothing:
$$p_i(w_k) = \lambda\, \frac{tf_i(w_k)}{|d_i|} + (1 - \lambda)\, p_c(w_k)$$
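A minimal sketch of JM smoothing, assuming plain Python; the collection probabilities (from the toy collection) and λ = 0.5 are illustrative values, not from the slides:

```python
from collections import Counter

def jm_smoothed_lm(doc, p_c, lam=0.5):
    """Jelinek-Mercer smoothing: p_i(w) = lam * tf_i(w)/|d_i| + (1 - lam) * p_c(w)."""
    tf = Counter(doc)
    return {w: lam * tf[w] / len(doc) + (1 - lam) * pc_w for w, pc_w in p_c.items()}

# Toy collection model (as computed in the earlier Dirichlet-smoothing sketch).
p_c = {"sport": 3/13, "basketball": 2/13, "ticket": 3/13, "finance": 3/13, "stock": 2/13}
print(jm_smoothed_lm(["stock", "finance", "finance", "stock"], p_c, lam=0.5))
# Words unseen in the document (e.g. "sport") now get non-zero probability from the collection model.
```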

Page 41: CS590I: Information Retrieval

Shrinkage

Relationship between JM smoothing and Dirichlet smoothing:

JM smoothing:
$$p_i(w_k) = \lambda\, \frac{tf_i(w_k)}{|d_i|} + (1 - \lambda)\, p_c(w_k)$$

Dirichlet smoothing:
$$p_i(w_k) = \frac{tf_i(w_k) + \mu\, p_c(w_k)}{|d_i| + \mu} = \frac{|d_i|}{|d_i| + \mu} \cdot \frac{tf_i(w_k)}{|d_i|} + \frac{\mu}{|d_i| + \mu} \cdot p_c(w_k)$$

So Dirichlet smoothing is JM smoothing with a document-dependent coefficient $\lambda = \dfrac{|d_i|}{|d_i| + \mu}$.

Page 42: CS590I: Information Retrieval

Model Based Feedback

Equivalence of retrieval based on query generation likelihood and Kullback-Leibler (KL) divergence between the query and document language models.

Kullback-Leibler (KL) divergence between two probability distributions:
$$KL(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

- It measures the distance between two probability distributions
- It is always non-negative (how to prove it?)

Page 43: CS590I: Information Retrieval

Model Based Feedback

Equivalence of retrieval based on query generation likelihood and Kullback-Leibler (KL) Divergence between query and document language models

$$Sim(q, d_i) = -KL(q \,\|\, d_i) = \sum_{w} q(w) \log p_i(w) \;-\; \sum_{w} q(w) \log q(w)$$

The first term is the log-likelihood of the query generation probability; the second term is a document-independent constant.

This generalizes the query representation to a distribution (fractional term weighting).
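A minimal sketch of scoring by negative KL divergence, assuming plain Python; the query and document models and the `eps` floor for unseen words are illustrative assumptions, not from the slides:

```python
import math

def neg_kl_score(q_lm, d_lm, eps=1e-12):
    """Sim(q, d) = -KL(q || d) = sum_w q(w) log p_d(w) - sum_w q(w) log q(w)."""
    score = 0.0
    for w, qw in q_lm.items():
        if qw > 0:
            score += qw * (math.log(d_lm.get(w, eps)) - math.log(qw))
    return score

q_lm = {"sport": 0.5, "basketball": 0.5}                    # query model (fractional term weights)
d_lm = {"sport": 0.45, "basketball": 0.22, "ticket": 0.23,  # a smoothed document model
        "finance": 0.05, "stock": 0.05}
print(neg_kl_score(q_lm, d_lm))
```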

Page 44: CS590I: Information Retrieval

Model Based Feedback

Two equivalent retrieval pipelines:
- Query likelihood: estimate the language model for each document $d_i$ → estimate the generation probability $\Pr(q \mid d_i)$ → retrieval results
- KL divergence: estimate the query language model for $q$ and the document language model for $d_i$ → calculate $KL(q \,\|\, d_i)$ → retrieval results

Page 45: CS590I: Information Retrieval

Model Based Feedback

In the KL-divergence pipeline, feedback documents from the initial results are used to estimate a feedback language model $q_F$. The new query model interpolates the original query model and the feedback model:
$$q' = (1 - \alpha)\, q + \alpha\, q_F$$

- $\alpha = 0$: no feedback, $q' = q$
- $\alpha = 1$: full feedback, $q' = q_F$

The new query model $q'$ is then used in place of $q$ when computing $KL(q' \,\|\, d_i)$, as sketched below.
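A minimal sketch of the interpolation, assuming plain Python; the feedback model values and α = 0.5 are illustrative:

```python
def interpolate_query_model(q_lm, qf_lm, alpha=0.5):
    """New query model q' = (1 - alpha) * q + alpha * q_F over the union of the two vocabularies."""
    vocab = set(q_lm) | set(qf_lm)
    return {w: (1 - alpha) * q_lm.get(w, 0.0) + alpha * qf_lm.get(w, 0.0) for w in vocab}

q_lm = {"sport": 0.5, "basketball": 0.5}                       # original query model
qf_lm = {"basketball": 0.4, "game": 0.3, "player": 0.3}        # feedback model q_F (illustrative)
print(interpolate_query_model(q_lm, qf_lm, alpha=0.5))         # alpha = 0: no feedback; alpha = 1: full feedback
```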

Page 46: CS590I: Information Retrieval

Model Based Feedback: Estimate $q_F$

Assume a generative model produces each word within the feedback document(s). For each word $w$ in the feedback document(s), flip a coin: with probability $\lambda$ the word is generated from the topic model $q_F(w)$ (topic words), and with probability $1 - \lambda$ from the background (collection) model $p_C(w)$.

$$q_F^* = \arg\max_{q_F} l(F \mid q_F) = \arg\max_{q_F} \sum_{i=1}^{n} \log\big( \lambda\, q_F(w_i) + (1 - \lambda)\, p_C(w_i) \big)$$

Page 47: CS590I: Information Retrieval

Model Based Feedback: Estimate $q_F$

For each word, there is a hidden variable telling which language model it comes from.

- Background model $p_C(w \mid C)$, weight $1 - \lambda = 0.8$: the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005
- Unknown query topic model $p(w \mid F) = ?$ (e.g., "Basketball"), weight $\lambda = 0.2$: sport = ?, basketball = ?, game = ?, player = ?

If we knew the value of the hidden variable for each word, we could estimate $p(w \mid F)$ directly with the MLE estimator.

Page 48: CS590I: Information Retrieval

Model Based Feedback: Estimate $q_F$

For each word, the hidden variable $z_i = \{1 \text{ (feedback)}, 0 \text{ (background)}\}$.

Step 1: estimate the hidden variables based on the current model parameters (Expectation):
$$p(z_i = 1 \mid w_i) = \frac{p(z_i=1)\, p(w_i \mid z_i=1)}{p(z_i=1)\, p(w_i \mid z_i=1) + p(z_i=0)\, p(w_i \mid z_i=0)} = \frac{\lambda\, q_F^{(t)}(w_i)}{\lambda\, q_F^{(t)}(w_i) + (1-\lambda)\, p(w_i \mid C)}$$

E.g., the (0.1), basketball (0.7), game (0.6), is (0.2), ….

Step 2: update the model parameters based on the guess in Step 1 (Maximization):
$$q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$$

Page 49: CS590I: Information Retrieval

Model Based Feedback: Estimate $q_F$

Expectation-Maximization (EM) algorithm (given $\lambda = 0.5$):

Step 0: initialize the values of $q_F^{(0)}$

Step 1 (Expectation):
$$p(z_i = 1 \mid w_i) = \frac{\lambda\, q_F^{(t)}(w_i)}{\lambda\, q_F^{(t)}(w_i) + (1-\lambda)\, p(w_i \mid C)}$$

Step 2 (Maximization):
$$q_F^{(t+1)}(w_i) = \frac{c(w_i, F)\, p(z_i = 1 \mid w_i)}{\sum_j c(w_j, F)\, p(z_j = 1 \mid w_j)}$$

Iterate Steps 1 and 2 until convergence.
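A minimal sketch of the EM procedure above, assuming plain Python; the background probabilities, feedback documents, λ = 0.5, and the fixed iteration count are illustrative choices:

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_c, lam=0.5, n_iters=20):
    """EM for q_F: each word comes from q_F with probability lam, from p_C with probability 1 - lam."""
    counts = Counter(w for d in feedback_docs for w in d)          # c(w, F)
    vocab = list(counts)
    q_f = {w: 1.0 / len(vocab) for w in vocab}                     # Step 0: uniform initialization
    for _ in range(n_iters):
        # Step 1 (E): posterior probability that each word was generated by the feedback model
        post = {w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c.get(w, 1e-12)) for w in vocab}
        # Step 2 (M): re-estimate q_F from the fractional counts
        norm = sum(counts[w] * post[w] for w in vocab)
        q_f = {w: counts[w] * post[w] / norm for w in vocab}
    return q_f

p_c = {"the": 0.4, "basketball": 0.05, "game": 0.05, "player": 0.03}   # illustrative background model
feedback_docs = [["the", "basketball", "game", "the", "player"],
                 ["basketball", "player", "the"]]
print(estimate_feedback_model(feedback_docs, p_c))
# Common words like "the" are mostly explained by the background model, so q_F concentrates on topic words.
```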

Page 50: CS590I: Information Retrieval

Model Based Feedback: Estimate $q_F$

Properties of the parameter $\lambda$:
- If $\lambda$ is close to 0, most common words can be generated from the collection language model, so there are more topic words in the query (feedback) language model
- If $\lambda$ is close to 1, the query (feedback) language model has to generate most common words, so there are fewer topic words in the query language model

Page 51: CS590I: Information Retrieval

Retrieval Model: Language Model

Introduction to language model

Unigram language model

Document language model estimation
- Maximum Likelihood estimation
- Maximum a posteriori estimation
- Jelinek-Mercer smoothing

Model-based feedback