Statistical Translation and Web Search Ranking Jianfeng Gao Natural language processing, MSR July 22, 2011


Page 1:

Statistical Translation and Web Search Ranking

Jianfeng Gao
Natural language processing, MSR

July 22, 2011

Page 2:

Who should be here?

• Interested in statistical machine translation and Web search ranking
• Interested in modeling technologies
• Looking for topics for your master's/PhD thesis
  – A difficult topic: very hard to beat a simple baseline
  – An easy topic: others cannot beat it either

Page 3:

Outline

• Probability
• Statistical Machine Translation (SMT)
• SMT for Web search ranking

Page 4:

Probability (1/2)

• Probability space: 0 ≤ P(x) ≤ 1
  – Cannot say, e.g., P(x) = 1.2
• Joint probability: P(x, y)
  – Probability that x and y are both true
• Conditional probability: P(y|x)
  – Probability that y is true when we already know x is true
• Independence: P(x, y) = P(x) P(y)
  – x and y are independent

Page 5:

Probability (2/2)

• H: the assumptions on which the probabilities are based
• Product rule – from the definition of conditional probability
  P(x, y|H) = P(x|y, H) P(y|H) = P(y|x, H) P(x|H)
• Sum rule – a rewrite of the definition of marginal probability
  P(x|H) = Σy P(x, y|H)
• Bayes rule – from the product rule
  P(y|x, H) = P(x|y, H) P(y|H) / P(x|H)

Page 6:

An example: Statistical Language Modeling

Page 7:

Statistical Language Modeling (SLM)

• Model form
  – capture language structure via a probabilistic model
• Model parameters
  – estimation of free parameters using training data

Page 8:

Model Form

• How to incorporate language structure into a probabilistic model
• Task: next word prediction
  – Fill in the blank: “The dog of our neighbor ___”
• Starting point: word n-gram model
  – Very simple, yet surprisingly effective
  – Words are generated from left to right
  – Assumes no structure other than the words themselves

Page 9:

Word N-gram Model

• Word-based model
  – Apply the chain rule to a word's history (= the preceding words):

  P(w1 w2 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn−1)

Page 10:

Word N-gram Model

• How do we get probability estimates?
  – Get text and count!
• Problem of using the whole history
  – Rare events: unreliable probability estimates
  – Assuming a vocabulary of 20,000 words:

  model      conditional probability   # parameters
  unigram    P(w1)                     20,000
  bigram     P(w2|w1)                  400 million
  trigram    P(w3|w1 w2)               8 × 10^12
  fourgram   P(w4|w1 w2 w3)            1.6 × 10^17

From Manning and Schütze 1999: 194
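The parameter counts in the table are just |V| raised to the n-gram order; a quick sanity check (an illustrative script, not from the slides):

```python
# Each n-gram model over a |V|-word vocabulary has |V|**n conditional
# probabilities to estimate; V = 20,000 matches the table above.
V = 20_000

for n, name in enumerate(["unigram", "bigram", "trigram", "fourgram"], start=1):
    print(f"{name}: {V ** n:.1e}")
```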

Page 11:

Word N-gram Model

• Markov independence assumption
  – A word depends only on the N−1 preceding words
  – N = 3 → word trigram model
• Reduce the number of parameters in the model
  – By forming equivalence classes
• Word trigram model

  P(wi | <s> w1 w2 … wi−2 wi−1) = P(wi | wi−2 wi−1)

Page 12:

Model Parameters

• Bayesian estimation paradigm
• Maximum likelihood estimation (MLE)
• Smoothing in N-gram language models

Page 13:

Bayesian Paradigm

  P(θ|D) = P(D|θ) P(θ) / P(D)

  – P(θ|D) – posterior probability
  – P(D|θ) – likelihood
  – P(θ) – prior probability
  – P(D) – marginal probability

• Likelihood versus probability
  – For fixed θ, P(D|θ) defines a probability over D
  – For fixed D, P(D|θ) defines the likelihood of θ
• Never say “the likelihood of the data”
• Always say “the likelihood of the parameters given the data”

Page 14:

Maximum Likelihood Estimation (MLE)

• θ: model; D: data

  θ̂ = argmaxθ P(θ|D) = argmaxθ P(D|θ) P(θ) / P(D) = argmaxθ P(D|θ)

  – Assume a uniform prior P(θ)
  – P(D) is independent of θ, and is dropped
  – where P(D|θ) is the likelihood of the parameters θ

• Key difference between MLE and Bayesian estimation
  – MLE assumes that θ is fixed but unknown
  – Bayesian estimation assumes that θ itself is a random variable with a prior distribution P(θ)

Page 15:

MLE for Trigram LM

• It is easy – let us get some real text and start to count:

  P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)

But, why is this the MLE solution?
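The count-and-divide recipe can be sketched in a few lines of Python (a toy illustration; the function and example sentence are mine, not the talk's):

```python
from collections import Counter

def trigram_mle(tokens):
    """Relative-frequency (MLE) trigram estimates:
    P(w3|w1 w2) = Count(w1 w2 w3) / Count(w1 w2)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {t: c / bi[t[:2]] for t, c in tri.items()}

tokens = "the dog of our neighbor barks at the dog of our landlord".split()
probs = trigram_mle(tokens)
print(probs[("the", "dog", "of")])  # 1.0: "the dog" is always followed by "of" here
```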

Page 16:

Derivation of MLE for N-gram

• Homework – an interview question at MSR
• Hints
  – This is a constrained optimization problem
  – Use the log likelihood as the objective function
  – Assume a multinomial distribution for the LM
  – Introduce Lagrange multipliers for the constraints

Page 17:

Sparse Data Problem

• Say our vocabulary size is |V|
• There are |V|^3 parameters in the trigram LM
  – |V| = 20,000 → 20,000^3 = 8 × 10^12 parameters
• Most trigrams have a zero count, even in a large text corpus
  – oops…

Page 18:

Smoothing: Adding One

• Add-one smoothing (from the Bayesian paradigm)

  P(w3 | w1 w2) = (Count(w1 w2 w3) + 1) / (Count(w1 w2) + |V|)

  – But it works very badly – do not use this
• Add-delta smoothing: add δ < 1 instead of 1 (and δ|V| in the denominator)
  – Still very bad – do not use this
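To make the formulas concrete, here is what add-one/add-delta smoothing computes (a toy sketch; as the slide says, neither is recommended in practice):

```python
def add_delta_prob(c_trigram, c_history, V, delta=1.0):
    """Add-delta smoothed trigram probability (delta = 1 gives add-one):
    P(w3|w1 w2) = (Count(w1 w2 w3) + delta) / (Count(w1 w2) + delta * |V|)."""
    return (c_trigram + delta) / (c_history + delta * V)

# An unseen trigram no longer gets probability zero:
print(add_delta_prob(0, 4, V=10))  # add-one: 1/14
```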

Page 19:

Smoothing: Backoff

• Backoff trigram to bigram, bigram to unigram

  P(w3 | w1 w2) = (Count(w1 w2 w3) − D) / Count(w1 w2)   if Count(w1 w2 w3) > 0
  P(w3 | w1 w2) = α(w1 w2) P(w3 | w2)                    otherwise

  – D ∈ (0, 1) is a discount constant – absolute discount
  – α is calculated so that the probabilities sum to 1 (homework)

• Simple and effective – use this one!
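A minimal sketch of absolute-discount backoff, written for the bigram-to-unigram case (the trigram-to-bigram case on the slide is analogous); the toy counts are mine:

```python
def backoff_bigram(bigram_counts, unigram_counts, D=0.5):
    """Absolute-discount backoff from bigram to unigram (a sketch of the
    slide's idea; the trigram-to-bigram case is analogous)."""
    total = sum(unigram_counts.values())

    def p_uni(w):
        return unigram_counts[w] / total

    def p(w, prev):
        c_prev = sum(c for (a, _), c in bigram_counts.items() if a == prev)
        seen = [b for (a, b) in bigram_counts if a == prev]
        c_bi = bigram_counts.get((prev, w), 0)
        if c_bi > 0:
            return (c_bi - D) / c_prev            # discounted ML estimate
        alpha = D * len(seen) / c_prev            # mass freed by discounting
        unseen_mass = 1 - sum(p_uni(b) for b in seen)
        return alpha * p_uni(w) / unseen_mass     # redistribute via unigram

    return p

bigrams = {("a", "b"): 2, ("a", "c"): 1}
unigrams = {"a": 1, "b": 2, "c": 1, "d": 1}
p = backoff_bigram(bigrams, unigrams)
print(sum(p(w, "a") for w in unigrams))  # sums to 1 (up to rounding)
```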

Page 20:

Outline

• Probability
• SMT and translation models
• SMT for web search ranking

Page 21:

SMT

C: 救援 人员 在 倒塌的 房屋 里 寻找 生还者
E: Rescue workers search for survivors in collapsed houses

Page 22:

P(E|C)

• Translation process (generative story)
  – C is broken into translation units
  – Each unit is translated into English
  – Glue the translated units to form E
• Translation models
  – Word-based models
  – Phrase-based models
  – Syntax-based models

Page 23:

Generative Modeling

  Art         → Story
  Science     → Math
  Engineering → Code

Page 24:

Generative Modeling for P(E|C)

• Story making
  – how a target sentence is generated from a source sentence, step by step
• Mathematical formulation
  – modeling each generation step in the generative story using a probability distribution
• Parameter estimation
  – implementing an effective way of estimating the probability distributions from training data

Page 25:

Word-Based Models: IBM Model 1

• We first choose the length J for the target sentence E, according to the distribution P(J|C).
• Then, for each position j in the target sentence, we choose a position a_j in the source sentence from which to generate the j-th target word, according to the distribution P(a_j|C).
• Finally, we generate the target word e_j by translating c_{a_j}, according to the distribution t(e_j | c_{a_j}).

Page 26:

Page 27:

Mathematical Formulation

• Assume that the choice of the length J is independent of C and of the generated words: P(J|C) = ε
• Assume that all I + 1 positions in the source sentence (including a NULL word c0) are equally likely to be chosen: P(a_j|C) = 1/(I + 1)
• Assume that each target word e_j is generated independently of the others:

  P(E|C) = ε / (I + 1)^J × Π_{j=1..J} Σ_{i=0..I} t(e_j | c_i)
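Under these assumptions the model reduces to a product of per-word sums over source positions; a toy scorer (the translation table values below are made up for illustration):

```python
import math

def ibm1_logprob(target, source, t, eps=1.0):
    """log P(E|C) under IBM Model 1: constant length probability eps,
    uniform alignment 1/(I+1) (slot 0 is NULL), and independently
    translated target words. t[(e, c)] is a toy translation table."""
    src = ["NULL"] + list(source)
    lp = math.log(eps) - len(target) * math.log(len(src))
    for e in target:
        lp += math.log(sum(t.get((e, c), 0.0) for c in src))
    return lp

t = {("house", "maison"): 0.8, ("house", "NULL"): 0.2}
print(ibm1_logprob(["house"], ["maison"], t))  # -log 2, from the 1/(I+1) factor
```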

Page 28:

Parameter Estimation

• Model form: the word translation probabilities t(e|c)
• MLE on word-aligned training data
  – t(e|c) = Count(c aligned to e) / Count(c)
• Don’t forget smoothing

Page 29:

Phrase-Based Models

Page 30:

Mathematical Formulation

• Assume a uniform probability over segmentations
• Use the maximum approximation to the sum over segmentations
• Assume each phrase is translated independently, and use a distance-based reordering model

Page 31:

Parameter Estimation

MLE: Don’t forget smoothing

Page 32:

Syntax-Based Models

Page 33:

Story

• Parse an input Chinese sentence into a parse tree
• Translate each Chinese constituent into English
  – VP(PP 寻找 NP , search for NP PP)
• Glue these English constituents into a well-formed English sentence.

Page 34:

Other Two Tasks?

• Mathematical formulation
  – Based on synchronous context-free grammar (SCFG)
• Parameter estimation
  – Learning the SCFG from data
• Homework
• Let us go through an example (thanks to Michel Galley)
  – Hierarchical phrase model
  – Linguistically syntax-based models

Page 35:

[Word-alignment figure: the English words rescue, workers, search, for, survivors, in, collapsed, houses aligned to 救援 人员 在 倒塌 的 房屋 里 寻找 生还者]

Extracted phrase pair: 倒塌 的 房屋 ↔ collapsed houses

Page 36:

[Word-alignment figure, as on the previous slide]

Extracted phrase pair: 在 倒塌 的 房屋 里 寻找 生还者 ↔ search for survivors in collapsed houses

Page 37:

[Word-alignment figure and phrase pair repeated from the previous slide]

Page 38:

A synchronous rule

  X → ⟨ 在 X1 里 寻找 X2 , search for X2 in X1 ⟩

• Phrase-based translation unit
• Discontinuous translation unit
• Control on reordering

Page 39:

A synchronous grammar

  X → ⟨ 在 X1 里 寻找 X2 , search for X2 in X1 ⟩
  X → ⟨ 倒塌 的 房屋 , collapsed houses ⟩
  X → ⟨ 生还者 , survivors ⟩

Context-free derivation:

  ⟨ 在 X1 里 寻找 X2 , search for X2 in X1 ⟩
  → ⟨ 在 倒塌 的 房屋 里 寻找 X2 , search for X2 in collapsed houses ⟩
  → ⟨ 在 倒塌 的 房屋 里 寻找 生还者 , search for survivors in collapsed houses ⟩

Page 40:

A synchronous grammar

  X → ⟨ 在 X1 里 寻找 X2 , search for X2 in X1 ⟩
  X → ⟨ 倒塌 的 房屋 , collapsed houses ⟩
  X → ⟨ 生还者 , survivors ⟩

Recognizes:
  search for survivors in collapsed houses
  search for collapsed houses in survivors
  search for survivors collapsed houses in

Page 41:

[Tree-to-string alignment figure: the English parse tree of “Rescue workers search for survivors in collapsed houses.” (S → NP VP, with NP, VP, PP, VBP, NNS, JJ, IN, NN nodes) aligned to 救援 人员 在 倒塌 的 房屋 里 寻找 生还者; word-by-word gloss: rescue staff in collapse of house in search survivors]

Page 42:

[Tree-to-string alignment figure repeated from the previous slide]

Page 43:

[Tree-to-string alignment figure repeated from the previous slide]

Page 44:

[Tree-to-string alignment figure, highlighting the VP]

Extracted rule: VP → ⟨ PP 寻找 NP , search for NP PP ⟩

Page 45:

[Tree-to-string alignment figure, highlighting the VP]

SCFG rule: VP-234 → ⟨ PP-32 寻找 NP-57 , search for NP-57 PP-32 ⟩

Page 46:

[Tree-to-string alignment figure repeated from the previous slides]

Page 47:

Outline

• Probability
• SMT and translation models
• SMT for web search ranking

Page 48:

Web Documents and Search Queries

• cold home remedy
• cold remeedy
• flu treatment
• how to deal with stuffy nose?

Page 49:

Map Queries to Documents

• Fuzzy keyword matching
  – Q: cold home remedy
  – D: best home remedies for cold and flu
• Spelling correction
  – Q: cold remeedies
  – D: best home remedies for cold and flu
• Query alteration
  – Q: flu treatment
  – D: best home remedies for cold and flu
• Query/document rewriting
  – Q: how to deal with stuffy nose
  – D: best home remedies for cold and flu
• Where are we now?

Page 50:

Research Agenda (Gao et al. 2010, 2011)

• Model documents and queries as different languages (Gao et al., 2010)

• Cast mapping queries to documents as bridging the language gap via translation

• Leverage statistical machine translation (SMT) technologies and infrastructures to improve search relevance

Page 51:

Are Queries and Docs just Different Languages?

• A large-scale analysis, extending (Huang et al. 2010)
• Divide the web collection into different fields, e.g., queries, anchor text, titles, etc.
• Develop a set of language models, each on an n-gram dataset from a different field
• Measure the language difference between fields (queries/docs) via perplexity
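Perplexity here is the usual exponentiated average negative log probability of the test set; a minimal sketch:

```python
import math

def perplexity(logprobs):
    """Per-token perplexity from per-token log2 probabilities:
    2 ** (-average log2 prob). A lower value means the field's LM
    predicts the test queries better."""
    return 2 ** (-sum(logprobs) / len(logprobs))

# A model assigning every token probability 1/8 has perplexity 8:
print(perplexity([math.log2(1 / 8)] * 4))  # 8.0
```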

Page 52:

Microsoft Web N-gram Model Collection (cutoff = 0)

• Microsoft web n-gram services. http://research.microsoft.com/web-ngram

Page 53:

Perplexity Results

• Test set
  – 733,147 queries from the May 2009 query log
• Summary
  – The Query LM is most predictive of test queries
  – Title is better than Anchor at lower orders, but worse at higher orders
  – Body is in a different league

Page 54:

SMT for Document Ranking

• Given a query q, a document d can be ranked by how likely it is that q is rewritten from d: P(q|d)
• An example: phrasal statistical translation for Web document ranking

how to deal with stuffy nose?

Page 55:

Phrasal Statistical Translation for Ranking

  d: “cold home remedies”            title
  S: [“cold”, “home remedies”]       segmentation
  T: [“stuffy nose”, “deal with”]    translation
  M: (1 → 2, 2 → 1)                  permutation
  q: “deal with stuffy nose”         query

• Uniform probability over segmentations S
• Maximum approximation: replace the sum over (S, T, M) with the max
• Max-probability assignment via dynamic programming
• Model training on query-doc pairs

Page 56:

Mine Query-Document Pairs from User Logs

[Click graph linking the queries “how to deal with stuffy nose?” (no click), “stuffy nose treatment” (no click), and “cold home remedies” to http://www.agelessherbs.com/BestHomeRemediesColdFlu.html]

Page 57:

Mine Query-Document Pairs from User Logs

[Click graph repeated from the previous slide]

Page 58:

Mine Query-Document Pairs from User Logs

  QUERY (Q)                        TITLE (T)
  how to deal with stuffy nose     best home remedies for cold and flu
  stuffy nose treatment            best home remedies for cold and flu
  cold home remedies               best home remedies for cold and flu
  …                                …
  go israel forums                 goisrael community
  skate at wholesale at pr         wholesale skates southeastern skate supply
  breastfeeding nursing blister    baby clogged milk ducts babycenter
  thank you teacher song           lyrics for teaching educational children s music
  immigration canada lacolle       cbsa office detailed information

• 178 million pairs mined from half a year of query logs

Page 59:

Evaluation Methodology

• Measurement: NDCG, t-test
• Test set:
  – 12,071 English queries sampled from a 1-year query log
  – 5-level relevance label for each query-doc pair
  – On a tail document set (the click field is empty)
• Training data for translation models:
  – 82,834,648 query-title pairs
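For reference, one common formulation of NDCG with graded relevance (a sketch; not necessarily the exact variant used in these experiments):

```python
import math

def dcg(grades, k):
    # graded-relevance DCG with the common (2**rel - 1) gain
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(grades, k=10):
    """NDCG@k for one query; grades are the 5-level relevance labels (0-4)
    of the returned docs in ranked order."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0

print(ndcg([4, 2, 0], k=3))  # a perfectly ordered ranking scores 1.0
print(ndcg([0, 2, 4], k=3))  # the reversed ranking scores lower
```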

Page 60:

Baseline: Word-Based Models (Berger & Lafferty, 99)

• Basic model: P(q|d) = Π_{w∈q} Σ_{v∈d} P(w|v) P(v|d)
• Mixture model: interpolate with a background unigram model P(w|C)
• Learning the translation probabilities P(w|v) from clickthrough data
  – IBM Model 1 with EM
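A compact sketch of that EM loop for IBM Model 1 over (query, title) pairs; the toy pairs and variable names are mine, not data from the talk:

```python
from collections import defaultdict

def ibm1_em(pairs, iterations=10):
    """EM training of IBM Model 1 translation probabilities t(q_word|title_word)
    on (query, title) pairs; a toy sketch of the approach named on the slide."""
    q_vocab = {w for q, _ in pairs for w in q}
    t = defaultdict(lambda: 1.0 / len(q_vocab))       # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                    # expected counts (E-step)
        total = defaultdict(float)
        for q, title in pairs:
            d = ["NULL"] + title
            for e in q:
                norm = sum(t[(e, c)] for c in d)
                for c in d:
                    frac = t[(e, c)] / norm           # posterior alignment prob
                    count[(e, c)] += frac
                    total[c] += frac
        t = defaultdict(float,                        # M-step: renormalize
                        {ec: count[ec] / total[ec[1]] for ec in count})
    return t

pairs = [
    (["stuffy", "nose"], ["cold", "remedies"]),
    (["cold", "remedy"], ["cold", "remedies"]),
    (["cold"], ["cold"]),
]
t = ibm1_em(pairs)
```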

Page 61:

Results

Sample IBM-1 word translation probabilities after EM training on the query-title pairs

Page 62:

Bilingual Phrases

• Notice that with context information, we have less ambiguous translations

Page 63:

Results

• Ranking results
  – All features
  – Only phrase translation features

Page 64:

Why Do Bi-Phrases Help?

• Length distribution

• Good/bad examples

Page 65:

Generative Topic Models

• Probabilistic Latent Semantic Analysis (PLSA)
  – d is assigned a single most likely topic vector
  – q is generated from the topic vectors
• Latent Dirichlet Allocation (LDA) generalizes PLSA
  – a posterior distribution over topic vectors is used
  – PLSA = LDA with MAP inference

Q: stuffy nose treatment    D: cold home remedies
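The scoring rule shared by these topic models is P(q|d) = Π_{w∈q} Σ_z P(w|z) P(z|d); a toy sketch with made-up topic distributions:

```python
def topic_score(query, p_w_z, p_z_d):
    """P(q|d) under a PLSA-style topic model:
    P(q|d) = prod over query words w of sum over topics z of P(w|z) P(z|d)."""
    score = 1.0
    for w in query:
        score *= sum(p_w_z[z].get(w, 0.0) * pz for z, pz in p_z_d.items())
    return score

p_w_z = {"health": {"nose": 0.5, "cold": 0.3, "remedies": 0.2}}
p_z_d = {"health": 1.0}  # d assigned a single topic, as in PLSA with MAP
print(topic_score(["nose", "cold"], p_w_z, p_z_d))  # 0.5 * 0.3 = 0.15
```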

Page 66:

Bilingual Topic Model

• For each topic z: draw a pair of word distributions (one for the query language, one for the document language)
• For each q-d pair: draw a shared topic distribution
• Each query word q is generated from the topic distribution and the query-side word distributions
• Each document word w is generated from the topic distribution and the document-side word distributions

Page 67:

Log-likelihood of LDA Given Data

• The Dirichlet prior is a distribution over distributions
• LDA requires an integral over the topic distribution
• This is the MAP approximation to LDA

Page 68:

MAP Estimation via EM

• Estimate by maximizing the joint log likelihood of the q-d pairs and the parameters
• E-step: compute the posterior topic probabilities
• M-step: update the parameters using the posterior probabilities

Page 69:

Posterior Regularization (PR)

• q and its clicked d are relevant, thus they
  – Share the same prior distribution over topics (MAP)
  – Weight each topic similarly (PR)
• Model training via modified EM
  – E-step: for each q-d pair, project the posterior topic distributions onto a constrained set, where the expected fraction of each topic is equal in q and d
  – M-step: update the parameters using the projected posterior probabilities

Page 70:

Topic Models for Doc Ranking

Page 71:

Evaluation Methodology

• Measurement: NDCG, t-test
• Test set:
  – 16,510 English queries sampled from a 1-year query log
  – Each query is associated with 15 docs
  – 5-level relevance label for each query-doc pair

• Training data for translation models:– 82,834,648 query-title pairs

Page 72:

Topic Model Results

Page 73:

Summary

• Probability
  – Basics
  – A case study of a probabilistic model: the N-gram language model
• Statistical Machine Translation (SMT)
  – Generative modeling (story → math → code)
  – Word/phrase/syntax-based models
• SMT for web search ranking
  – View query and doc as different languages
  – Doc ranking via P(q|d)
  – Word/phrase/topic-based models

• Slides/doc will be available at http://research.microsoft.com/~jfgao/

Page 74:

Main Reference

• Berger, A., and Lafferty, J. 1999. Information retrieval as statistical translation. In SIGIR, pp. 222-229.
• Gao, J., He, X., and Nie, J-Y. 2010. Clickthrough-based translation models for web search: from word models to phrase models. In CIKM, pp. 1139-1148.
• Gao, J., Toutanova, K., and Yih, W-T. 2011. Clickthrough-based latent semantic models for web search. In SIGIR.
• Huang, J., Gao, J., Miao, J., Li, X., Wang, K., and Behr, F. 2010. Exploring web scale language models for search query processing. In WWW, pp. 451-460.
• MacKay, D. J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
• Manning, C., and Schütze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
• Koehn, P. 2009. Statistical Machine Translation. Cambridge University Press.