CS590I: Information Retrieval
CS-590I
Information Retrieval
Retrieval Models: Language models
Luo Si
Department of Computer Science
Purdue University
Retrieval Model: Language Model
– Introduction to language models
– Unigram language models
– Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
– Model-based feedback
Language Models: Motivation

Vector space model for information retrieval:
– Documents and queries are vectors in the term space
– Relevance is measured by the similarity between document vectors and the query vector

Problems with the vector space model:
– Ad-hoc term weighting schemes
– Ad-hoc similarity measurement
– No justification of the relationship between relevance and similarity

We need more principled retrieval models…
Introduction to Language Models

A language model can be created for any language sample:
– A document
– A collection of documents
– A sentence, paragraph, chapter, query…

The size of the language sample affects the quality of the language model:
– Long documents yield more accurate models
– Short documents yield less accurate models
– Models for a sentence, paragraph, or query may not be reliable
Introduction to Language Models

A document language model defines a probability distribution over indexed terms:
– E.g., the probability of generating a term
– The probabilities sum to 1

A query can be seen as observed data from an unknown model:
– A query also defines a language model (more on this later)

How might the models be used for IR?
– Rank documents by Pr(q | d_i)
– Rank documents by the Kullback-Leibler (KL) divergence between the language models of q and d_i (more on this later)
Language Model for IR: Example

Estimating a language model for each document:
– d_1: sport, basketball, ticket, sport → language model for d_1
– d_2: basketball, ticket, finance, ticket, sport → language model for d_2
– d_3: stock, finance, finance, stock → language model for d_3

Estimate the generation probability Pr(q | d_i) for the query q = "sport, basketball", then generate the retrieval results.
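To make this concrete, here is a minimal Python sketch (the code and names are ours, not from the slides) that estimates an MLE unigram language model for each toy document above and ranks the documents by query generation probability:

```python
from collections import Counter

docs = {
    "d1": ["sport", "basketball", "ticket", "sport"],
    "d2": ["basketball", "ticket", "finance", "ticket", "sport"],
    "d3": ["stock", "finance", "finance", "stock"],
}
query = ["sport", "basketball"]

def mle_lm(doc):
    """Maximum likelihood unigram model: p_i(w) = tf_i(w) / |d_i|."""
    tf = Counter(doc)
    return {w: c / len(doc) for w, c in tf.items()}

def query_likelihood(q, lm):
    """Pr(q | d_i): product of p_i(w) over query terms; 0 for unseen terms."""
    prob = 1.0
    for w in q:
        prob *= lm.get(w, 0.0)
    return prob

ranked = sorted(docs, key=lambda d: query_likelihood(query, mle_lm(docs[d])),
                reverse=True)
for d in ranked:
    print(d, query_likelihood(query, mle_lm(docs[d])))
# d1 scores 0.5 * 0.25 = 0.125; d2 scores 0.2 * 0.2 = 0.04; d3 scores 0.
```

Note that d_3 receives probability 0 because it contains no query terms; the zero-probability issue for unseen words is exactly the sparseness problem addressed later in this lecture.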
Language Models

Three basic problems for language models:
– What type of probability distribution can be used to construct language models?
– How do we estimate the parameters of the distribution of the language models?
– How do we compute the likelihood of generating queries given the language models of documents?
Multinomial/Unigram Language Models

A language model built from a multinomial distribution over single terms (i.e., unigrams) in the vocabulary.

Example:
– Five words in the vocabulary: (sport, basketball, ticket, finance, stock)
– For a document d_i, its language model is:
  {p_i("sport"), p_i("basketball"), p_i("ticket"), p_i("finance"), p_i("stock")}

Formally, the language model is {p_i(w) for any word w in vocabulary V}, with

$$\sum_k p_i(w_k) = 1, \qquad 0 \le p_i(w_k) \le 1$$
Multinomial/Unigram Language Models

Estimating a language model for each document:
– d_1: sport, basketball, ticket, sport → multinomial model for d_1
– d_2: basketball, ticket, finance, ticket, sport → multinomial model for d_2
– d_3: stock, finance, finance, stock → multinomial model for d_3
Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation: find the model parameters that make the generation likelihood reach its maximum:

$$M^* = \arg\max_M \Pr(d_i \mid M)$$

– There are K words in the vocabulary, w_1 … w_K (e.g., 5)
– Data: one document d_i with counts tf_i(w_1), …, tf_i(w_K) and length |d_i|
– Model: multinomial M with parameters {p_i(w_k)}
– Likelihood: Pr(d_i | M)
Maximum Likelihood Estimation (MLE)

Use the Lagrange multiplier approach: set the partial derivatives to zero and obtain the maximum likelihood estimate.

$$\Pr(d_i \mid M) = \frac{|d_i|!}{tf_i(w_1)! \cdots tf_i(w_K)!} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)} \;\propto\; \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k)}$$

Log-likelihood:

$$l(d_i \mid M) = \log \Pr(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k) + \text{const}$$

Add the constraint $\sum_k p_i(w_k) = 1$ with a Lagrange multiplier:

$$l'(d_i \mid M) = \sum_k tf_i(w_k) \log p_i(w_k) + \lambda \Big( \sum_k p_i(w_k) - 1 \Big)$$

Setting the partial derivatives to zero:

$$\frac{\partial l'}{\partial p_i(w_k)} = \frac{tf_i(w_k)}{p_i(w_k)} + \lambda = 0 \;\Rightarrow\; p_i(w_k) = -\frac{tf_i(w_k)}{\lambda}$$

Since $\sum_k p_i(w_k) = 1$, we get $\lambda = -|d_i|$, so the maximum likelihood estimate is

$$p_i(w_k) = \frac{tf_i(w_k)}{|d_i|}$$
Maximum Likelihood Estimation (MLE)

Estimating the language model for each document, as (p_sp, p_b, p_t, p_f, p_st) over (sport, basketball, ticket, finance, stock):
– d_1: sport, basketball, ticket, sport → (0.5, 0.25, 0.25, 0, 0)
– d_2: basketball, ticket, finance, ticket, sport → (0.2, 0.2, 0.4, 0.2, 0)
– d_3: stock, finance, finance, stock → (0, 0, 0, 0.5, 0.5)
Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation assigns zero probability to words unseen in a small sample.

A specific example:
– Only two words in the vocabulary (w_1 = sport, w_2 = business), like (head, tail) for a coin
– A document generates a sequence of the two words, like flipping a coin many times:

$$\Pr(d_i \mid M) \propto p_i(w_1)^{tf_i(w_1)} \, (1 - p_i(w_1))^{tf_i(w_2)}$$

Only observe two words (flip the coin twice); the MLE estimators are:
– "business sport": p_i(w_1)* = 0.5
– "sport sport": p_i(w_1)* = 1 ?
– "business business": p_i(w_1)* = 0 ?

This is the data sparseness problem.
Solutions to the Sparse Data Problem
– Maximum a posteriori (MAP) estimation
– Shrinkage
– Bayesian ensemble approach
Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: select the model that maximizes the probability of the model given the observed data:

$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M) \Pr(M)$$

Pr(M) is the prior belief/knowledge; use the prior Pr(M) to avoid zero probabilities.

A specific example:
– Only two words in the vocabulary (sport, business)
– For a document d_i:

$$\Pr(M \mid d_i) \;\propto\; p_i(w_1)^{tf_i(w_1)} \, p_i(w_2)^{tf_i(w_2)} \cdot \Pr(M)$$

where Pr(M) is the prior distribution over models.
Maximum A Posteriori (MAP) Estimation

Maximum a posteriori estimation: introduce a prior on the multinomial distribution.
– Use the prior Pr(M) to avoid zero probabilities; most coins are more or less unbiased
– Use a Dirichlet prior on p(w):

$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}, \qquad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

where the α_k are hyper-parameters, the leading factor is a normalization constant, and Γ(x) is the gamma function:

$$\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t} \, dt, \qquad \Gamma(n) = (n-1)! \;\text{ for integer } n$$
Maximum A Posteriori (MAP) Estimation

For the two-word example, use a Dirichlet (Beta) prior

$$\Pr(M) \propto p_i(w_1)^{2} \, (1 - p_i(w_1))^{2}$$

(i.e., α_1 = α_2 = 3), which favors unbiased coins.

Maximum a posteriori:

$$\Pr(d_i \mid M)\Pr(M) \;\propto\; p_i(w_1)^{tf_i(w_1)} (1 - p_i(w_1))^{tf_i(w_2)} \cdot p_i(w_1)^{\alpha_1 - 1} (1 - p_i(w_1))^{\alpha_2 - 1}$$

$$= p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$
Maximum A Posteriori (MAP) Estimation

$$M^* = \arg\max_M \Pr(M \mid D) = \arg\max_M \Pr(D \mid M)\Pr(M)$$

$$M^* = \arg\max_{p_i(w_1)} \; p_i(w_1)^{tf_i(w_1) + \alpha_1 - 1} \, (1 - p_i(w_1))^{tf_i(w_2) + \alpha_2 - 1}$$

The hyper-parameters act as pseudo counts added to the observed counts:

$$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{tf_i(w_1) + \alpha_1 - 1 + tf_i(w_2) + \alpha_2 - 1}$$
Maximum A Posteriori (MAP) Estimation

A specific example: only observe two words (flip a coin twice):

"sport sport": is p_i(w_1)* = 1 ?

With the prior $\Pr(M) \propto p_i(w_1)^2 (1 - p_i(w_1))^2$ (α_1 = α_2 = 3):

$$p_i(w_1)^* = \frac{tf_i(w_1) + \alpha_1 - 1}{tf_i(w_1) + \alpha_1 - 1 + tf_i(w_2) + \alpha_2 - 1} = \frac{2 + 3 - 1}{(2 + 3 - 1) + (0 + 3 - 1)} = \frac{4}{6} = \frac{2}{3}$$
MAP Estimation: Unigram Language Model

Maximum a posteriori estimation:
– Use a Dirichlet prior for the multinomial distribution
– There are K terms in the vocabulary

Multinomial: $p_i = \{p_i(w_1), \ldots, p_i(w_K)\}, \quad \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$

Dirichlet prior (with hyper-parameters α_k and a normalization constant):

$$Dir(p_i \mid \alpha_1, \ldots, \alpha_K) = \frac{\Gamma\!\big(\sum_{k=1}^{K} \alpha_k\big)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{\alpha_k - 1}$$

The MAP estimate is:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big( tf_i(w_{k'}) + \alpha_{k'} - 1 \big)}$$

How should we set the parameters of the Dirichlet prior?
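A minimal Python sketch of this MAP estimate (the function name and the toy counts are ours, not from the slides):

```python
def map_estimate(tf, alpha):
    """MAP unigram estimate:
    p_i(w_k) = (tf_i(w_k) + alpha_k - 1) / sum_k' (tf_i(w_k') + alpha_k' - 1).

    tf and alpha map each vocabulary word to its observed count and its
    Dirichlet hyper-parameter, respectively.
    """
    denom = sum(tf.get(w, 0) + alpha[w] - 1 for w in alpha)
    return {w: (tf.get(w, 0) + alpha[w] - 1) / denom for w in alpha}

# The coin example above: "sport sport" with alpha_1 = alpha_2 = 3
print(map_estimate({"sport": 2, "business": 0}, {"sport": 3, "business": 3}))
# {'sport': 0.666..., 'business': 0.333...}
```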
MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model:

$$p^* = \arg\max_{p} \; \frac{\Gamma\!\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Dropping the constant factor:

$$p^* = \arg\max_{p} \prod_{k} p_i(w_k)^{tf_i(w_k) + \alpha_k - 1}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0. The pseudo counts are set by the hyper-parameters:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \alpha_k - 1}{\sum_{k'} \big( tf_i(w_{k'}) + \alpha_{k'} - 1 \big)}$$
MAP Estimation: Unigram Language Model

How do we determine appropriate values for the hyper-parameters? When nothing is observed from a document (all tf_i(w_k) = 0), the MAP estimate reduces to

$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)}$$

What is the most likely p_i(w_k) without looking at the content of the document?
MAP Estimation: Unigram Language Model

The most likely p_i(w_k) without looking into the content of the document d is the unigram probability of the collection:
– {p(w_1|C), p(w_2|C), …, p(w_K|C)}

Without any other information, we guess the behavior of one member from the behavior of the whole population. So set

$$p_i^*(w_k) = \frac{\alpha_k - 1}{\sum_{k'} (\alpha_{k'} - 1)} = p(w_k \mid C) \;\Rightarrow\; \alpha_k = \mu \, p(w_k \mid C) + 1$$

where μ is a constant. The MAP estimate becomes

$$p_i^*(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{\sum_{k'} tf_i(w_{k'}) + \mu} = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
MAP Estimation: Unigram Language Model

MAP estimation for the unigram language model with the collection-based prior:

$$p^* = \arg\max_{p} \prod_{k=1}^{K} p_i(w_k)^{tf_i(w_k) + \mu \, p(w_k \mid C)}
\quad \text{s.t. } \sum_k p_i(w_k) = 1, \; 0 \le p_i(w_k) \le 1$$

Use a Lagrange multiplier and set the derivative to 0:

$$p_i^*(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

Here μ p(w_k|C) acts as a pseudo count for w_k, and μ as a pseudo document length.
Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 0: compute the word probabilities on the whole collection (the collection unigram language model):

$$p(w_k \mid C) = \frac{\sum_i tf_i(w_k)}{\sum_i |d_i|}$$

Step 1: for each document d_i, compute its smoothed unigram language model (Dirichlet smoothing) as

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
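A minimal Python sketch of Steps 0 and 1 (function and variable names are ours; mu = 2000 is a common default in the smoothing literature, not a value specified on the slides):

```python
from collections import Counter

def collection_model(docs):
    """Step 0: p(w|C) = total count of w in the collection / total collection length."""
    total = Counter()
    length = 0
    for doc in docs:  # docs: list of token lists
        total.update(doc)
        length += len(doc)
    return {w: c / length for w, c in total.items()}

def dirichlet_lm(doc, p_c, mu=2000.0):
    """Step 1: p_i(w) = (tf_i(w) + mu * p(w|C)) / (|d_i| + mu)."""
    tf = Counter(doc)
    return {w: (tf.get(w, 0) + mu * p_c[w]) / (len(doc) + mu) for w in p_c}
```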
Maximum A Posteriori (MAP) Estimation

Dirichlet MAP estimation for the unigram language model:

Step 2: for a given query q = {tf_q(w_1), …, tf_q(w_K)}, compute the likelihood for each document d_i:

$$p(q \mid d_i) = \prod_{k=1}^{K} p(w_k \mid d_i)^{tf_q(w_k)} = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

The larger the likelihood, the more relevant the document is to the query.
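Continuing the sketch above, ranking by Dirichlet-smoothed query likelihood might look like this (log probabilities are used for numerical stability, an implementation detail not on the slides; query words are assumed to occur somewhere in the collection so that p(w|C) > 0):

```python
import math
from collections import Counter

def log_query_likelihood(query, doc, p_c, mu=2000.0):
    """Step 2: log p(q|d_i) = sum_k tf_q(w_k) * log p_i(w_k)."""
    lm = dirichlet_lm(doc, p_c, mu)
    return sum(tf_q * math.log(lm[w]) for w, tf_q in Counter(query).items())

# Rank all documents for a query (docs: list of token lists):
# p_c = collection_model(docs)
# ranked = sorted(range(len(docs)),
#                 key=lambda i: log_query_likelihood(query, docs[i], p_c),
#                 reverse=True)
```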
Dirichlet Smoothing & TF-IDF

Dirichlet smoothing:

$$p(q \mid d_i) = \prod_{k=1}^{K} \left( \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu} \right)^{tf_q(w_k)}$$

TF-IDF weighting:

$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k) \, tf_i(w_k) \, idf(w_k) \, norm(d_i)$$

What is the relationship between the two?
Dirichlet Smoothing & TF-IDF

Take the logarithm of the Dirichlet-smoothed query likelihood:

$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \log \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

$$= \sum_{k=1}^{K} tf_q(w_k) \left[ \log \left( \mu \, p(w_k \mid C) \Big( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \Big) \right) - \log (|d_i| + \mu) \right]$$

$$= \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu) + \sum_{k=1}^{K} tf_q(w_k) \log \big( \mu \, p(w_k \mid C) \big)$$
Dirichlet Smoothing & TF-IDF

$$\log p(q \mid d_i) = \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu) + \underbrace{\sum_{k=1}^{K} tf_q(w_k) \log \big( \mu \, p(w_k \mid C) \big)}_{\text{document-independent (irrelevant) part}}$$

Dropping the irrelevant part:

$$\log p(q \mid d_i) \;\propto\; \sum_{k=1}^{K} tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right) - |q| \log (|d_i| + \mu)$$

Compare with TF-IDF weighting:

$$sim(q, d_i) = \sum_{k=1}^{K} tf_q(w_k) \, tf_i(w_k) \, idf(w_k) \, norm(d_i)$$

The term

$$\log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right)$$

plays the role of the tf.idf weight, and $-|q| \log(|d_i| + \mu)$ acts as document length normalization.
Dirichlet Smoothing & TF-IDF

Look at the tf.idf part:

$$tf_q(w_k) \log \left( 1 + \frac{tf_i(w_k)}{\mu \, p(w_k \mid C)} \right)$$

– The weight grows with tf_i(w_k), like the tf factor
– The weight grows as p(w_k|C) shrinks, i.e., words that are rare in the collection get higher weight, like the idf factor

This comes directly from the smoothed document model

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
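A quick numeric check of the decomposition above (a sketch with made-up counts, not slide content): the full per-term log probability and the tf.idf-style decomposition should differ only in how the document-independent constant is grouped.

```python
import math

# Made-up toy numbers: one query term with tf_q = 1
tf_i, p_c, mu, doc_len = 3, 0.01, 100.0, 50

full = math.log((tf_i + mu * p_c) / (doc_len + mu))
decomposed = (math.log(1 + tf_i / (mu * p_c))   # tf.idf-like part
              - math.log(doc_len + mu)          # length normalization
              + math.log(mu * p_c))             # document-independent constant

assert abs(full - decomposed) < 1e-12
```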
Dirichlet Smoothing: Hyper-Parameter

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$

– When μ is very small, the estimate approaches the MLE estimator
– When μ is very large, the estimate approaches the probability on the whole collection

How do we set an appropriate μ?
Dirichlet Smoothing: Hyper-Parameter

Leave-one-out validation: hold out one word occurrence, estimate the smoothed model from the rest of the document, and measure how well it predicts the held-out word.

Leave w_1 out:

$$p(w_1 \mid d_i \setminus w_1) = \frac{tf_i(w_1) - 1 + \mu \, p(w_1 \mid C)}{|d_i| - 1 + \mu}$$

Leave w_j out:

$$p(w_j \mid d_i \setminus w_j) = \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Choose the μ that maximizes the leave-one-out log-likelihood:

$$\mu^* = \arg\max_{\mu} \; l_{-1}(\mu, C)$$
Dirichlet Smoothing: Hyper-Parameter

Leave-one-out validation: leave out each word occurrence of a document, one by one:

$$l_{-1}(\mu, d_i) = \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Do this for all documents in the collection:

$$l_{-1}(\mu, C) = \sum_{i=1}^{|C|} \sum_{j=1}^{|d_i|} \log \frac{tf_i(w_j) - 1 + \mu \, p(w_j \mid C)}{|d_i| - 1 + \mu}$$

Find the μ that maximizes this quantity.
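A minimal sketch of this selection procedure in Python (the candidate grid for mu is our choice for illustration; any 1-D optimizer over mu would do):

```python
import math
from collections import Counter

def loo_log_likelihood(docs, p_c, mu):
    """l_{-1}(mu, C): leave-one-out log-likelihood over all word occurrences."""
    total = 0.0
    for doc in docs:
        tf = Counter(doc)
        for w in doc:  # each occurrence is held out once
            total += math.log((tf[w] - 1 + mu * p_c[w]) / (len(doc) - 1 + mu))
    return total

def select_mu(docs, p_c, candidates=(100, 500, 1000, 2000, 5000)):
    """mu* = argmax over the candidate grid of the leave-one-out log-likelihood."""
    return max(candidates, key=lambda mu: loo_log_likelihood(docs, p_c, mu))
```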
Dirichlet Smoothing: Hyper-Parameter

What type of document/collection would get a large μ?
– Most documents use similar vocabulary and wording patterns to the whole collection

What type of document/collection would get a small μ?
– Most documents use different vocabulary and wording patterns than the whole collection
Shrinkage

Maximum likelihood estimation (MLE) builds the model purely on document data and then generates the query words. The model may not be accurate when the document is short (many unseen words).

A shrinkage estimator builds a more reliable model by consulting more general models (e.g., the collection language model).

Example: estimating P(Lung_Cancer | Smoke) for West Lafayette by consulting the more general estimates for Indiana and the U.S.
Shrinkage

Jelinek-Mercer smoothing: assume each word is generated from the document language model (MLE) with probability λ, and from the collection language model (MLE) with probability 1-λ.

This is a linear interpolation between the document language model and the collection language model.

JM smoothing:

$$p_i(w_k) = \lambda \, \frac{tf_i(w_k)}{|d_i|} + (1 - \lambda) \, p(w_k \mid C)$$

Compare with Dirichlet smoothing:

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}$$
Shrinkage

Relationship between JM smoothing and Dirichlet smoothing: Dirichlet smoothing is JM smoothing with a document-dependent interpolation weight $\lambda_i = \frac{|d_i|}{|d_i| + \mu}$:

$$p_i(w_k) = \frac{tf_i(w_k) + \mu \, p(w_k \mid C)}{|d_i| + \mu}
= \frac{|d_i|}{|d_i| + \mu} \cdot \frac{tf_i(w_k)}{|d_i|} + \frac{\mu}{|d_i| + \mu} \cdot p(w_k \mid C)$$
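A one-line numeric check of this identity (a sketch with arbitrary toy numbers, in the same style as the earlier snippets):

```python
# Arbitrary toy numbers
tf, doc_len, p_c, mu = 3, 50, 0.01, 100.0

dirichlet = (tf + mu * p_c) / (doc_len + mu)
lam = doc_len / (doc_len + mu)                  # lambda_i = |d_i| / (|d_i| + mu)
jm = lam * tf / doc_len + (1 - lam) * p_c       # JM with document-dependent lambda

assert abs(dirichlet - jm) < 1e-12
```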
Model-Based Feedback

Retrieval based on query generation likelihood is equivalent to retrieval based on the Kullback-Leibler (KL) divergence between the query and document language models.

KL divergence between two probability distributions:

$$KL(p \parallel q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$

– It is a distance-like measure between two probability distributions (though not symmetric)
– It is always non-negative (how would you prove it?)
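A small helper in the same Python sketch style (names are ours); note KL(p‖q) is only finite when q(x) > 0 wherever p(x) > 0:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)); non-negative, 0 iff p == q."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}))  # > 0
print(kl_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))  # 0.0
```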
Model-Based Feedback

Equivalence of retrieval based on query generation likelihood and on the KL divergence between query and document language models:

$$Sim(q, d_i) = -KL(q \parallel d_i) = -\sum_w q(w) \log \frac{q(w)}{p_i(w)} = \sum_w q(w) \log p_i(w) - \sum_w q(w) \log q(w)$$

– The first term is the (length-normalized) log-likelihood of the query generation probability
– The second term is a document-independent constant
– This generalizes the query representation to a distribution (fractional term weighting)
[Figure: two retrieval pipelines. Query-likelihood pipeline: estimate a language model for each document d_i, estimate the generation probability Pr(q | d_i), and produce retrieval results. KL-divergence pipeline: estimate a language model for the query q and for each document d_i, calculate KL(q || d_i), and produce retrieval results.]
Model-Based Feedback

[Figure: the KL-divergence pipeline extended with feedback. Feedback documents from the initial results are used to estimate a feedback language model q_F, which is interpolated with the original query model before the KL(q' || d_i) ranking.]

New query model:

$$q' = (1 - \alpha) \, q + \alpha \, q_F$$

– α = 0: no feedback, q' = q
– α = 1: full feedback, q' = q_F
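A sketch of the interpolation step (the default alpha = 0.5 is an arbitrary illustrative choice, not a slide value):

```python
def interpolate_query(q, q_f, alpha=0.5):
    """New query model: q'(w) = (1 - alpha) * q(w) + alpha * q_F(w)."""
    vocab = set(q) | set(q_f)
    return {w: (1 - alpha) * q.get(w, 0.0) + alpha * q_f.get(w, 0.0)
            for w in vocab}
```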
Model-Based Feedback

Assume there is a generative model that produces each word within the feedback document(s) F: for each word w, flip a coin; with probability λ the word is drawn from the topic model q_F(w) (topic words), and with probability 1-λ from the background collection model p_C(w).

Estimate q_F by maximizing the likelihood of the feedback documents:

$$q_F^* = \arg\max_{q_F} \; l(F \mid q_F) = \arg\max_{q_F} \sum_{i=1}^{n} \log \big( \lambda \, q_F(w_i) + (1 - \lambda) \, p_C(w_i) \big)$$

where w_1, …, w_n are the word occurrences in the feedback document(s).
Model-Based Feedback: Estimate q_F

For each word, there is a hidden variable telling which language model it comes from.

Background model p_C(w|C), known, with weight 1-λ = 0.8:
the 0.12, to 0.05, it 0.04, a 0.02, …, sport 0.0001, basketball 0.00005, …

Unknown query topic model p(w|F) = q_F(w), with weight λ = 0.2, e.g., the topic "Basketball":
sport = ?, basketball = ?, game = ?, player = ?, …

If we knew the value of the hidden variable for each word, the MLE estimator would be straightforward: count only the words assigned to the topic model.
Model-Based Feedback: Estimate q_F

For each word occurrence w_i, the hidden variable is z_i = {1 (feedback/topic), 0 (background)}.

Step 1: estimate the hidden variables based on the current model parameters (Expectation):

$$p(z_i = 1 \mid w_i) = \frac{p(z_i = 1) \, p(w_i \mid z_i = 1)}{p(z_i = 1) \, p(w_i \mid z_i = 1) + p(z_i = 0) \, p(w_i \mid z_i = 0)} = \frac{\lambda \, q_F^{(t)}(w_i)}{\lambda \, q_F^{(t)}(w_i) + (1 - \lambda) \, p(w_i \mid C)}$$

E.g., the (0.1), basketball (0.7), game (0.6), is (0.2), …

Step 2: update the model parameters based on the guess in Step 1 (Maximization):
$$q_F^{(t+1)}(w_i) = \frac{c(w_i; F) \, p(z_i = 1 \mid w_i)}{\sum_j c(w_j; F) \, p(z_j = 1 \mid w_j)}$$

where c(w; F) is the count of w in the feedback documents (M-step).
Model-Based Feedback: Estimate q_F

The Expectation-Maximization (EM) algorithm:

Step 0: initialize the values of q_F^{(0)}

Step 1 (Expectation):

$$p(z_i = 1 \mid w_i) = \frac{\lambda \, q_F^{(t)}(w_i)}{\lambda \, q_F^{(t)}(w_i) + (1 - \lambda) \, p(w_i \mid C)}$$

Step 2 (Maximization):

$$q_F^{(t+1)}(w_i) = \frac{c(w_i; F) \, p(z_i = 1 \mid w_i)}{\sum_j c(w_j; F) \, p(z_j = 1 \mid w_j)}$$

Iterate Steps 1 and 2 until convergence.
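A compact Python sketch of this EM procedure (a minimal implementation under the slide's mixture assumptions; the uniform initialization and fixed iteration count are our choices):

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_c, lam=0.2, iters=50):
    """EM for q_F in the mixture p(w) = lam * q_F(w) + (1 - lam) * p_C(w).

    feedback_docs: list of token lists; p_c: collection model p(w|C),
    assumed to cover every word in the feedback documents.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)  # c(w; F)
    vocab = list(counts)
    q_f = {w: 1.0 / len(vocab) for w in vocab}  # Step 0: uniform init
    for _ in range(iters):
        # E-step: p(z=1|w) = lam*q_F(w) / (lam*q_F(w) + (1-lam)*p_C(w))
        p_z = {w: lam * q_f[w] / (lam * q_f[w] + (1 - lam) * p_c[w])
               for w in vocab}
        # M-step: q_F(w) = c(w;F)*p(z=1|w) / sum_j c(w_j;F)*p(z=1|w_j)
        denom = sum(counts[w] * p_z[w] for w in vocab)
        q_f = {w: counts[w] * p_z[w] / denom for w in vocab}
    return q_f
```

Because common words like "the" have high p(w|C), the E-step assigns them low p(z=1|w), so the learned q_F concentrates its mass on topic words.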
Model-Based Feedback: Estimate q_F

Properties of the parameter λ (e.g., λ = 0.5):
– If λ is close to 0, most common words can be generated from the collection language model, so more topic words appear in the query language model
– If λ is close to 1, the query language model has to generate most common words itself, so fewer topic words stand out in the query language model
Retrieval Model: Language Model
– Introduction to language models
– Unigram language models
– Document language model estimation
  – Maximum likelihood estimation
  – Maximum a posteriori estimation
  – Jelinek-Mercer smoothing
– Model-based feedback