
Page 1

*Introduction to Natural Language Processing (600.465)

Statistical Machine Translation

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 2

The Main Idea

• Treat translation as a noisy channel problem:
  Input (source) → the channel (adds “noise”) → Output (target)
  E: English words... → F: Les mots anglais...

• The Model: P(E|F) = P(F|E) · P(E) / P(F)
• Interested in rediscovering E given F;
  after the usual simplification (P(F) is fixed):
  argmax_E P(E|F) = argmax_E P(F|E) · P(E) !
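
A minimal sketch of this noisy-channel argmax, assuming hypothetical lm_logprob and tm_logprob scoring functions and an externally supplied list of candidate translations:

    def channel_score(E, F, lm_logprob, tm_logprob):
        # Noisy-channel score log P(F|E) + log P(E); P(F) is constant for a
        # fixed observed F, so it drops out of the argmax.
        return tm_logprob(F, E) + lm_logprob(E)

    def best_translation(candidates, F, lm_logprob, tm_logprob):
        # argmax_E P(F|E) * P(E) over a candidate list (hypothetical source).
        return max(candidates,
                   key=lambda E: channel_score(E, F, lm_logprob, tm_logprob))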

Page 3

The Necessities

• Language Model (LM): P(E)
• Translation Model (TM): target given source, P(F|E)
• Search procedure:
  – given F, find the best E using the LM and TM distributions.
• Usual problem: sparse data
  – we cannot create a “sentence dictionary” E ↔ F
  – typically, we do not see a sentence even twice!

Page 4

The Language Model

• Any LM will do:
  – 3-gram LM
  – 3-gram class-based LM
  – decision-tree LM with hierarchical classes
• Does not necessarily operate on word forms:
  – cf. the “analysis” and “generation” procedures later
  – for simplicity, imagine for now that it does operate on word forms

Page 5

The Translation Models

• Do not care about correct strings of English words (that’s the task of the LM)

• Therefore, we can make more independence assumptions:
  – for a start, use the “tagging” approach:
    • 1 English word (“tag”) ~ 1 French word (“word”)
  – not realistic: the number of words is rarely even the same in the two sentences (let alone a 1:1 correspondence!)
  – therefore, use “alignment”.

Page 6

The Alignment

• Example (positions 0–6 for E, 0–7 for F; e0, f0 are the empty words):
  E: e0 And the program has been implemented
  F: f0 Le programme a été mis en application
• Linear notation (each word followed by the position(s) of its counterpart):
  – f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
  – e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)

Page 7

Alignment Mapping

• In general:
  – |F| = m, |E| = l (sentence lengths)
  – l·m possible connections (each French word to any English word)
  – 2^(l·m) different alignments for any pair (E,F) (any subset of connections)
• In practice:
  – align from English to French:
    • each English word has 1–n connections (n = empirical maximum fertility?)
    • each French word has exactly 1 connection
  – therefore, “only” (l+1)^m alignments (≪ 2^(l·m))
  – notation: a_j = i (the link from the j-th French word goes to the i-th English word)
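
To make these counts concrete, a quick check in Python for the example pair above (l = 6, m = 7):

    l, m = 6, 7                    # |E| = 6 English words, |F| = 7 French words
    connections = l * m            # 42 possible French-to-English links
    unrestricted = 2 ** (l * m)    # 2^42 alignments as arbitrary subsets of links
    restricted = (l + 1) ** m      # 7^7 alignments when each French word picks one
                                   # English word or the empty word e_0
    print(connections, unrestricted, restricted)   # 42 4398046511104 823543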

Page 8

Elements of Translation Model(s)

• Basic distribution:
  – P(F,A,E): the joint distribution of the English sentence, the alignment, and the French sentence (of length m)
• Interested also in marginal distributions:
  P(F,E) = Σ_A P(F,A,E)
  P(F|E) = P(F,E) / P(E) = Σ_A P(F,A,E) / Σ_{A,F} P(F,A,E) = Σ_A P(F,A|E)
• Useful decomposition (one of several possible):
  P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)

Page 9

Decomposition

• The decomposition formula again:
  P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)
  – m: the length of the French sentence
  – a_j: the alignment (single connection) going from the j-th French word
  – f_j: the j-th French word of F
  – a_1^{j-1}: the sequence of alignments a_i up to the word preceding f_j
  – a_1^j: the sequence of alignments a_i up to and including the word f_j
  – f_1^{j-1}: the sequence of French words up to the word preceding f_j

Page 10

Decomposition and the Generative Model

• ...and again:
  P(F,A|E) = P(m|E) · ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^j, f_1^{j-1}, m, E)

• Generate:
  – first, the length m of the French sentence, given the English words E;
  – then, the link from the first position in F (not knowing the actual French word yet); now we know the English word it links to;
  – then, given the link (and thus the English word), the French word at the current position;
  – then, move to the next position in F, until all m positions are filled.
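
A minimal sketch of this generative story, with sample_length, sample_link and sample_word standing in for the three model distributions P(m|E), P(a_j|...) and P(f_j|...) (hypothetical interfaces, not the course's code):

    def generate_F(E, sample_length, sample_link, sample_word):
        # Draw (F, A) from P(F,A|E), factor by factor, as described above.
        m = sample_length(E)                       # 1) m ~ P(m|E)
        A, F = [], []
        for j in range(m):                         # positions 1..m of F
            a_j = sample_link(A, F, m, E)          # 2) link first; English word known now
            f_j = sample_word(A + [a_j], F, m, E)  # 3) French word, given the link
            A.append(a_j)
            F.append(f_j)
        return F, A                                # 4) stop once all m positions are filled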

Page 11

Approximations

• Still too many parameters:
  – a situation similar to an n-gram model with “unlimited” n
  – impossible to estimate reliably.
• Use 5 models, from the simplest to the most complex (i.e., from heavy independence assumptions to light ones).
• Parameter estimation:
  – estimate the parameters of Model 1; use them as an initial estimate for the Model 2 parameters; etc.

Page 12

Model 1

• Approximations:
  – French length: P(m|E) is constant (a small ε)
  – alignment link distribution: P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) depends only on the English length l (= 1/(l+1))
  – the French word distribution depends only on the English and French words connected by the link a_j.
• Model 1 distribution:
  P(F,A|E) = ε / (l+1)^m · ∏_{j=1..m} p(f_j | e_{a_j})
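
A sketch of these formulas in Python, with t a translation-probability table keyed by (French word, English word) pairs (a hypothetical data layout); the second function also checks, by brute-force enumeration, that the sum over all (l+1)^m alignments factorizes per French word:

    from itertools import product

    def model1_joint(F, A, E, t, eps=1.0):
        # P(F,A|E) = eps / (l+1)^m * prod_j t[(f_j, e_{a_j})]; E[0] is the empty word e_0.
        l, m = len(E) - 1, len(F)
        p = eps / (l + 1) ** m
        for f_j, a_j in zip(F, A):
            p *= t[(f_j, E[a_j])]
        return p

    def model1_marginal(F, E, t, eps=1.0):
        # P(F|E) = sum_A P(F,A|E): brute force vs. the per-word factorized form.
        l, m = len(E) - 1, len(F)
        brute = sum(model1_joint(F, A, E, t, eps)
                    for A in product(range(l + 1), repeat=m))
        closed = eps
        for f_j in F:
            closed *= sum(t[(f_j, e)] for e in E) / (l + 1)
        assert abs(brute - closed) < 1e-12
        return closed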

Page 13

Models 2-5

• Model 2:
  – adds more detail to P(a_j|...): more “vertical” links are preferred
• Model 3:
  – adds “fertility”: the number of links for a given English word is explicitly modeled, P(n|e_i)
  – “distortion” replaces the alignment probabilities of Model 2
• Model 4:
  – the notion of “distortion” is extended to chunks of words
• Model 5:
  – Model 4, but not deficient (does not waste probability mass on non-strings)

Page 14

The Search Procedure

• “Decoder”:
  – given the “output” (French), discover the “input” (English)
• The translation model goes in the opposite direction: p(f|e) = ....
• Naive methods do not work.
• Possible solution (roughly):
  – generate English words one by one, keeping only an n-best list (with variable n); also account for the different lengths of the English sentence candidates!

Page 15

Analysis - Translation - Generation (ATG)

• Word forms: too sparse.
• Use four basic analysis/generation steps:
  – tagging
  – lemmatization
  – word-sense disambiguation
  – noun-phrase “chunks” (non-compositional translations)
• Translation proper:
  – use chunks as “words”

Page 16

Training vs. Test with ATG

• Training:
  – analyze both languages using all four analysis steps
  – train the TM(s) on the result (i.e., on chunks, tags, etc.)
  – train the LM on the analyzed source (English)
• Runtime/Test:
  – analyze the given sentence (French) using tools identical to those used in training
  – translate using the trained translation/language model(s)
  – generate the source (English) by reversing the analysis process

Page 17

Analysis: Tagging and Morphology

• Replace word forms by morphologically processed text:
  – lemmas
  – tags
  – original approach: mix them into the text, call them “words”
  – e.g., She bought two books. → she buy VBP two book NNS .
• Tagging: yes, but in reversed order:
  – tag first, then lemmatize [NB: this does not work for inflective languages]
  – technically easy
• Hand-written deterministic rules for tag+form → lemma (see the toy sketch below)
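
A toy illustration of such deterministic rules (the suffix rules and the exception table are invented for this example, not taken from the course):

    EXCEPTIONS = {('bought', 'VBD'): 'buy'}      # irregular forms need a lookup table

    def lemmatize(form, tag):
        # Deterministic tag+form -> lemma mapping: exceptions first, then suffix rules.
        if (form, tag) in EXCEPTIONS:
            return EXCEPTIONS[(form, tag)]
        if tag == 'NNS' and form.endswith('s'):
            return form[:-1]                     # books -> book
        if tag in ('VBD', 'VBN') and form.endswith('ed'):
            return form[:-2]                     # implemented -> implement
        return form                              # default: the form is its own lemma

    print(lemmatize('books', 'NNS'), lemmatize('bought', 'VBD'))   # book buy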

Page 18

Word Sense Disambiguation, Word Chunking

• Sets of senses for each English and French word:
  – e.g., book-1, book-2, ..., book-n
  – prepositions (de-1, de-2, de-3, ...), many others
• Senses derived automatically using the TM:
  – translation probabilities measured on senses: p(de-3|from-5)
• Result:
  – a statistical model for assigning senses monolingually based on context (a MaxEnt model is also used here, one per word)
• Chunks: group words for non-compositional translation

Page 19

Generation

• The inverse of analysis. Much simpler:
  – chunks → words (lemmas) with senses (trivial)
  – words (lemmas) with senses → words (lemmas) (trivial)
  – words (lemmas) + tags → word forms
• Additional step:
  – source-language ambiguity:
    • electric vs. electrical, hath vs. has, you vs. thou: treated as a single unit during translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.

Page 20

*Introduction to Natural Language Processing (600.465)

Statistical Translation: Alignment and Parameter Estimation

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

[email protected]

www.cs.jhu.edu/~hajic

Page 21

Alignment

• Assumed available corpus:
  – parallel text (translations E ↔ F)
• No alignment present (day marks only)!
• Sentence alignment:
  – sentence detection
  – sentence alignment
• Word alignment:
  – tokenization
  – word alignment (with restrictions)

Page 22

Sentence Boundary Detection

• Rules, lists:
  – sentence breaks:
    • paragraphs (if marked)
    • certain characters: ?, !, ; (...almost sure)
  – the problem: the period “.”
    • could be the end of a sentence (... left yesterday. He was heading to...)
    • decimal point: 3.6 (three point six)
    • thousands separator: 3.200 (three thousand two hundred)
    • abbreviations, never at the end of a sentence: cf., e.g., Calif., Mt., Mr.
    • ellipsis: ...
    • other languages: ordinal number indication (2nd ~ 2.)
    • initials: A. B. Smith
• Statistical methods: e.g., Maximum Entropy
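
A rough rule-based sketch of the period disambiguation above (the abbreviation list and the patterns are illustrative only, not a complete rule set):

    import re

    ABBREV = {'cf.', 'e.g.', 'Calif.', 'Mt.', 'Mr.'}

    def period_ends_sentence(token, next_token=None):
        # Decide whether a token ending in '.' closes the sentence.
        if token in ABBREV:
            return False                   # abbreviations never end a sentence here
        if re.fullmatch(r'\d+\.\d+', token):
            return False                   # decimal point (3.6) or thousands (3.200)
        if re.fullmatch(r'[A-Z]\.', token):
            return False                   # initials: A. B. Smith
        if token == '...':
            return False                   # ellipsis
        return next_token is None or next_token[:1].isupper()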

Page 23

Sentence Alignment

• The problem: only sentence boundaries have been detected, in E and in F separately.
• Desired output: a segmentation with an equal number of segments on both sides, spanning the whole text continuously.
• The original sentence boundaries are kept.
• Example alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
• The new segments are called “sentences” from now on.

Page 24

Alignment Methods

• Several methods (probabilistic and not):
  – character-length based
  – word-length based
  – “cognates” (word identity used):
    • using an existing dictionary (F: prendre ~ E: make, take)
    • using word “distance” (similarity): names, numbers, borrowed words, words of Latin origin, ...
• Best performing:
  – statistical, word- or character-length based (perhaps with some words as well)

Page 25

Length-based Alignment

• First, define the problem probabilistically:
  argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E,F fixed)
• Define a “bead”: a pair of aligned segments, e.g. two English sentences aligned with two French sentences (a 2:2 bead).
• Approximate:
  P(A,E,F) ≈ ∏_{i=1..n} P(B_i),
  where B_i is a bead; P(B_i) does not depend on the rest of E,F.

Page 26

The Alignment Task

• Given the model definition
  P(A,E,F) ≈ ∏_{i=1..n} P(B_i),
  find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.
• Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2}
  – p:q describes the type of alignment.
• We want to use some sort of dynamic programming:
  – define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).

Page 27

Recursive Definition

• Initialize: Pref(0,0) = 1.
• Pref(i,j) = max (
    Pref(i,j-1) · P(0:1_k),
    Pref(i-1,j) · P(1:0_k),
    Pref(i-1,j-1) · P(1:1_k),
    Pref(i-1,j-2) · P(1:2_k),
    Pref(i-2,j-1) · P(2:1_k),
    Pref(i-2,j-2) · P(2:2_k) )

• This is enough for a Viterbi-like search.

[Figure: the cell (i,j) of the (E,F) alignment grid and its six predecessor cells, one per bead type: (i,j-1), (i-1,j), (i-1,j-1), (i-1,j-2), (i-2,j-1), (i-2,j-2)]
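
A sketch of this Viterbi-like search over the grid, where bead_prob(p, q, i, j) is a stand-in for P(p:q_k) of the bead ending at English sentence i and French sentence j (a hypothetical interface; one way to define it appears two slides below):

    BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

    def align(n_e, n_f, bead_prob):
        # Dynamic programming over prefix probabilities Pref(i,j).
        pref = {(0, 0): 1.0}                     # probability of the empty prefix
        back = {}
        for i in range(n_e + 1):
            for j in range(n_f + 1):
                if (i, j) == (0, 0):
                    continue
                best, arg = 0.0, None
                for p, q in BEAD_TYPES:
                    prev = pref.get((i - p, j - q))
                    if prev is not None:
                        cand = prev * bead_prob(p, q, i, j)
                        if cand > best:
                            best, arg = cand, (p, q)
                pref[(i, j)] = best
                back[(i, j)] = arg
        beads, i, j = [], n_e, n_f               # trace the best bead sequence back
        while (i, j) != (0, 0):
            p, q = back[(i, j)]
            beads.append((p, q))
            i, j = i - p, j - q
        return list(reversed(beads)), pref[(n_e, n_f)]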

Page 28

Probability of a Bead

• It remains to define P(p:q_k) (the bead probabilities in the recursion above):
  – k refers to the “next” bead, with segments of p and q sentences and lengths l_{k,e} and l_{k,f}.
• Use a normal distribution for the length variation:
  P(p:q_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≈ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) · P(p:q)
  δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} − μ·l_{k,e}) / √(l_{k,e}·σ²)
• Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
• Words etc. might be used as better clues in the definition of P(p:q_k).
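
A sketch under these definitions, assuming a standard normal density for δ; the default parameter values and the prior table are placeholders to be estimated, not values from the slides:

    import math

    def delta(l_e, l_f, mu, sigma2):
        # Normalized length discrepancy: (l_f - mu*l_e) / sqrt(l_e * sigma2).
        return (l_f - mu * l_e) / math.sqrt(l_e * sigma2)

    def bead_prob(l_e, l_f, p, q, prior, mu=1.0, sigma2=6.8):
        # P(p:q_k) ~= P(delta) * P(p:q); prior maps (p, q) -> P(p:q).
        d = delta(max(l_e, 1), l_f, mu, sigma2)   # guard l_e = 0 for 0:1 beads
        length_term = math.exp(-d * d / 2) / math.sqrt(2 * math.pi)
        return length_term * prior[(p, q)]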

Page 29

Saving time

• For long texts (> 10⁴ sentences), even Viterbi (in the version needed) is not effective (O(S²) time).
• Go paragraph by paragraph if they are aligned 1:1.
• What if they are not? Apply the same method to paragraphs first!
  – identify paragraphs roughly in both languages
  – run the algorithm to get aligned paragraph-like segments
  – then run it on the sentences within paragraphs.
• Performs well if there are not many consecutive 1:0 or 0:1 beads.

Page 30

Word alignment

• Length alone does not help any more:
  – mainly because words can be swapped, and mutual translations often have vastly different lengths.
• ...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily.
• Idea:
  – assume some (simple) translation model (such as Model 1);
  – find its parameters by considering virtually all alignments;
  – once we have the parameters, find the best alignment given those parameters.

Page 31

Word Alignment Algorithm

• Start with a sentence-aligned corpus.
• Let (E,F) be a pair of sentences (actually, a bead).
• Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E.
• Compute expected counts over the corpus:
  c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e)
  (for each aligned pair (E,F), check whether e is in E and f is in F; if so, add p(f|e))
• Re-estimate:
  p(f|e) = c(f,e) / c(e)   [where c(e) = Σ_f c(f,e)]
• Iterate until the change in p(f|e) is small.
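
A compact EM sketch of this loop in Python, using the standard per-word normalization of the expected counts (a slightly refined version of the “add p(f|e)” rule above) and uniform rather than random initialization:

    from collections import defaultdict

    def train_model1(corpus, iterations=10):
        # corpus: a list of sentence pairs (E, F), each a list of words.
        vocab_f = {f for _, F in corpus for f in F}
        t = defaultdict(lambda: 1.0 / len(vocab_f))    # p(f|e), uniform start
        for _ in range(iterations):
            count = defaultdict(float)                 # expected counts c(f,e)
            total = defaultdict(float)                 # c(e) = sum_f c(f,e)
            for E, F in corpus:
                for f in F:
                    z = sum(t[(f, e)] for e in E)      # normalization for this pair
                    for e in E:
                        frac = t[(f, e)] / z
                        count[(f, e)] += frac
                        total[e] += frac
            for (f, e), c in count.items():            # re-estimation step
                t[(f, e)] = c / total[e]
        return t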

Page 32

Best Alignment

• Select, for each (E,F):
  A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F) = argmax_A P(F,A|E)
    = argmax_A ( ε / (l+1)^m · ∏_{j=1..m} p(f_j|e_{a_j}) )
    = argmax_A ∏_{j=1..m} p(f_j|e_{a_j})   (IBM Model 1)
• Again, use dynamic programming, a Viterbi-like algorithm.
• Recompute p(f|e) based on the best alignment (only if you are inclined to do so; the “original” summed-over-all-alignments distribution might perform better).
• Note: we have also obtained all the Model 1 parameters.
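
Note that under Model 1 the product above factorizes over positions, so the maximization decomposes and each link can be chosen independently. A sketch, assuming E includes the empty word e_0 at index 0 and t is the table from the training sketch above:

    def best_alignment(E, F, t):
        # a_j = argmax_i p(f_j | e_i), independently for each French position j.
        return [max(range(len(E)), key=lambda i: t[(f_j, E[i])]) for f_j in F]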