
Upload: agnes-harrell

Post on 19-Dec-2015


Language Model

CSC4170 Web Intelligence and Social Computing
Tutorial 8

Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk


Outline

Language models
  Finite automata and language models
  Types of language models
  Multinomial distributions over words
Query likelihood model
Application
Q&A
Reference


Language Models (LMs)

How can we come up with good queries? Think of words that would likely appear in a relevant document.

Idea of LM: A document is a good match to a query if the document model is likely to generate the query.


Language Models (LMs)

Generative model: An automaton that can recognize or generate strings.

The full set of strings that can be generated is called the language of the automaton.

Language model: A function that puts a probability measure over strings drawn from some vocabulary.


Language Models (LMs)

Example 1: Calculate the probability of a word sequence.

Multiply the probabilities that the model gives to each word in the sequence, together with the probability of continuing after each word except the last and the probability of stopping after the last word.

P(frog said that toad likes frog) = (0.01 * 0.03 * 0.04 * 0.01 * 0.02 * 0.01) * (0.8 * 0.8 * 0.8 * 0.8 * 0.8 * 0.2) ≈ 0.000000000001573

Most of the time, we omit the STOP and (1 - STOP) probabilities.
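
The calculation in Example 1 can be sketched in a few lines of Python. The word-to-probability mapping is read off the factors in the example in word order, and a STOP probability of 0.2 is assumed from the 0.8 and 0.2 factors; the function and variable names are illustrative.

```python
import math

# Emission probabilities read off the factors in the example, in word order.
WORD_PROB = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
P_STOP = 0.2  # probability of stopping after a word; continuing is 1 - P_STOP


def sequence_probability(words, word_prob=WORD_PROB, p_stop=P_STOP):
    """Probability of generating exactly `words` and then stopping."""
    if not words:
        raise ValueError("expected at least one word")
    emit = math.prod(word_prob[w] for w in words)
    # Continue after every word except the last, then stop once.
    control = (1.0 - p_stop) ** (len(words) - 1) * p_stop
    return emit * control


p = sequence_probability("frog said that toad likes frog".split())
print(f"{p:.3e}")  # about 1.573e-12
```

Dropping the `control` factor gives the simplified product that is used when the STOP probabilities are omitted.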


Language Models (LMs)

Example 2:

P(s|M1)>P(s|M2)


Language Models (LMs)

Basic LM using chain rule:

Unigram language model: Throws away all conditioning context. Most used in Information Retrieval.

Bigram language model: Condition on the previous term.
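
As a sketch of what these three models compute, the standard factorizations can be written out for a four-term sequence $t_1 t_2 t_3 t_4$. The sequence length and the subscripts $\mathrm{uni}$ and $\mathrm{bi}$ are illustrative choices, not notation from the slides.

```latex
% Chain rule (no independence assumption)
P(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_1 t_2)\,P(t_4 \mid t_1 t_2 t_3)

% Unigram: drop all conditioning context
P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2)\,P(t_3)\,P(t_4)

% Bigram: condition only on the previous term
P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) = P(t_1)\,P(t_2 \mid t_1)\,P(t_3 \mid t_2)\,P(t_4 \mid t_3)
```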


Language Models (LMs)

Unigram LM: Bag-of-words model. Multinomial distribution over words.

$L_d = \sum_{1 \le i \le M} \mathrm{tf}_{t_i,d}$ is the length of document d. M is the size of the vocabulary.

The multinomial coefficient can be left out in practical calculations.
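
For reference, the multinomial probability of a document under a unigram model is usually written as below. This is the standard textbook expression, given here as an assumption about the formula the slide refers to, with $\mathrm{tf}_{t_i,d}$ the count of term $t_i$ in $d$ and $L_d$ the length of $d$.

```latex
P(d) = \frac{L_d!}{\mathrm{tf}_{t_1,d}!\,\mathrm{tf}_{t_2,d}! \cdots \mathrm{tf}_{t_M,d}!}
       \; P(t_1)^{\mathrm{tf}_{t_1,d}}\, P(t_2)^{\mathrm{tf}_{t_2,d}} \cdots P(t_M)^{\mathrm{tf}_{t_M,d}}
```

The leading fraction is the multinomial coefficient. It depends only on the term counts of d, not on the term probabilities, so it is the same constant under every model that scores the same document.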


Query Likelihood Model

Query likelihood model: Rank documents by P(d|q), the likelihood that document d is relevant to the query.

Using Bayes' rule:

$$P(d \mid q) = \frac{P(q \mid d)\,P(d)}{P(q)}$$

P(q) is the same for all documents.

P(d) is treated as uniform across all d.

$$P(d \mid q) \propto P(q \mid d)$$


Query Likelihood Model

Multinomial + Unigram:

Retrieve based on a language model:
  Infer an LM for each document.
  Estimate P(q|Mdi).
  Rank the documents according to these probabilities.

The multinomial coefficient for the query q can be ignored.
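
A sketch of the multinomial unigram query likelihood that this ranking relies on, assuming the standard textbook form. The symbols $K_q$, $\mathrm{tf}_{t,q}$ (count of term $t$ in $q$), $L_q$ (query length), and $V$ (vocabulary) are notation introduced here for illustration.

```latex
P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}_{t,q}},
\qquad
K_q = \frac{L_q!}{\mathrm{tf}_{t_1,q}!\,\mathrm{tf}_{t_2,q}! \cdots \mathrm{tf}_{t_M,q}!}
```

Because $K_q$ depends only on the query, it is identical for every document and does not change the ranking.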


Query Likelihood Model

Estimating the query generation probability: Maximum Likelihood Estimation (MLE) + unigram LM

Limitations:
  If a query term t does not occur in d, the estimate is P(t|Md) = 0, so a document gives a query nonzero probability only if all of the query terms appear in the document.
  Occurring words are also poorly estimated: the probability of a word that occurs once in the document is overestimated, because its one occurrence was partly by chance.
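
The zero-probability limitation is easy to reproduce. The sketch below assumes the usual MLE estimate P(t|Md) = tf(t, d) / L_d. The toy document and the function names are hypothetical and are not taken from the slides.

```python
from collections import Counter


def mle_unigram(doc_tokens):
    """MLE unigram model: P(t | M_d) = tf(t, d) / L_d."""
    counts = Counter(doc_tokens)
    length = len(doc_tokens)
    return {term: tf / length for term, tf in counts.items()}


def query_likelihood(query_tokens, model):
    """Product of per-term probabilities; unseen terms contribute 0."""
    p = 1.0
    for term in query_tokens:
        p *= model.get(term, 0.0)
    return p


# Hypothetical toy document, not taken from the slides.
doc = "the cat sat on the mat".split()
model = mle_unigram(doc)

p_seen = query_likelihood(["cat", "mat"], model)    # both terms occur in doc
p_unseen = query_likelihood(["cat", "dog"], model)  # "dog" never occurs in doc
print(p_seen, p_unseen)
```

A single missing query term drives the whole product to zero, which is the motivation for smoothing.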


Query Likelihood Model

Estimating the query generation probability: Maximum Likelihood Estimation (MLE) + unigram LM

Smoothing: Use the whole collection to smooth.

Linear Interpolation (Jelinek-Mercer Smoothing)

Bayesian Smoothing
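
The two smoothing schemes are commonly written as below. This is a sketch assuming the standard textbook forms: $M_c$ is a model built from the whole collection, $\lambda \in (0, 1)$ and $\alpha > 0$ are tuning parameters, $\mathrm{tf}_{t,d}$ is the count of $t$ in $d$, and $L_d$ is the length of $d$.

```latex
% Linear interpolation (Jelinek-Mercer)
\hat{P}(t \mid d) = \lambda\,\hat{P}_{\mathrm{mle}}(t \mid M_d) + (1 - \lambda)\,\hat{P}_{\mathrm{mle}}(t \mid M_c)

% Bayesian smoothing (Dirichlet prior)
\hat{P}(t \mid d) = \frac{\mathrm{tf}_{t,d} + \alpha\,\hat{P}(t \mid M_c)}{L_d + \alpha}
```

In both cases a term that is absent from the document still receives some probability mass from the collection model.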


Query Likelihood Model

Query likelihood model with linear interpolation:

$$P(d \mid q) \propto P(d) \prod_{t \in q} \bigl((1 - \lambda)\,P(t \mid M_c) + \lambda\,P(t \mid M_d)\bigr)$$

Query likelihood model with Bayesian smoothing:
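
A minimal sketch of the Bayesian-smoothed score, assuming the common Dirichlet-prior form P(t|d) = (tf(t, d) + α P(t|Mc)) / (L_d + α) and a uniform P(d). The function name, the value α = 10, and the toy documents are illustrative assumptions, not taken from the slides.

```python
from collections import Counter


def dirichlet_score(query, doc, collection, alpha=10.0):
    """Query likelihood with P(t|d) = (tf + alpha * P(t|M_c)) / (L_d + alpha)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 1.0
    for term in query:
        p_coll = coll_tf[term] / len(collection)  # MLE collection model
        score *= (doc_tf[term] + alpha * p_coll) / (len(doc) + alpha)
    return score


# Hypothetical toy collection of two short documents.
docs = {
    "a": "the cat sat on the mat".split(),
    "b": "the dog chased the cat".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
query = ["cat", "dog"]

scores = {name: dirichlet_score(query, tokens, collection) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Document "a" does not contain "dog", yet its score stays above zero because the collection model supplies a small probability for the missing term.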


Query Likelihood Model

Example using unigram + MLE + linear interpolation:
  d1: Xyzzy reports a profit but revenue is down
  d2: Quorus narrows quarter loss but revenue decreases further
  λ = 1/2
  query: revenue down

ranking: d1>d2
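
The ranking can be checked with exact fractions. This sketch assumes whitespace tokenization with lowercasing, a collection made up of just d1 and d2, and a uniform P(d); the function name `jm_score` is illustrative.

```python
from collections import Counter
from fractions import Fraction

docs = {
    "d1": "Xyzzy reports a profit but revenue is down".lower().split(),
    "d2": "Quorus narrows quarter loss but revenue decreases further".lower().split(),
}
query = "revenue down".split()
lam = Fraction(1, 2)

# The collection model is estimated from both documents together.
collection = [t for tokens in docs.values() for t in tokens]
coll_tf = Counter(collection)


def jm_score(query, doc):
    """Product over query terms of lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c)."""
    doc_tf = Counter(doc)
    score = Fraction(1)
    for term in query:
        p_doc = Fraction(doc_tf[term], len(doc))
        p_coll = Fraction(coll_tf[term], len(collection))
        score *= lam * p_doc + (1 - lam) * p_coll
    return score


scores = {name: jm_score(query, tokens) for name, tokens in docs.items()}
for name, s in scores.items():
    print(name, s)  # d1 3/256, then d2 1/256
```

Both documents have 8 tokens and the collection has 16. For d1 the two factors are 1/8 and 3/32; for d2 they are 1/8 and 1/32, so d1 scores three times higher than d2.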


Application

Community-based Question Answering (CQA) system: Question search. Given a queried question, find a semantically equivalent question.

General search engine: Given a query, rank documents.


Questions?


Reference

Multinomial distribution: http://en.wikipedia.org/wiki/Multinomial_distribution

Likelihood function: http://en.wikipedia.org/wiki/Likelihood

Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood