
12/07/1999 JHU CS 600.465/Jan Hajic 1

Introduction to Natural Language Processing (600.465)

Statistical Machine Translation

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

hajic@cs.jhu.edu

www.cs.jhu.edu/~hajic

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 2

The Main Idea

• Treat translation as a noisy channel problem:

  Input (source): E: English words ...  →  the channel (adds “noise”)  →  Output (target): F: Les mots Anglais ...

• The Model: P(E|F) = P(F|E) P(E) / P(F)

• Interested in rediscovering E given F:

After the usual simplification (P(F) fixed):

argmax_E P(E|F) = argmax_E P(F|E) P(E) !
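
Read the argmax as a decision rule: score every candidate English sentence by the product of the translation model and the language model, and pick the best. A minimal sketch, assuming hypothetical lm and tm scoring functions and an externally supplied candidate list (all names are illustrative):

```python
# A minimal sketch of the noisy-channel decision rule. lm(E) ~ P(E), tm(F, E) ~ P(F|E)
# and the candidate list are hypothetical inputs, not part of the lecture's system.
def decode(F, candidates, lm, tm):
    """Return argmax_E P(F|E) * P(E) over candidate English sentences."""
    return max(candidates, key=lambda E: tm(F, E) * lm(E))
```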

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 3

The Necessities

• Language Model (LM): P(E)

• Translation Model (TM): Target given source P(F|E)

• Search procedure
  – given F, find the best E using the LM and TM distributions.

• Usual problem: sparse data
  – we cannot create a “sentence dictionary” E ↔ F

– Typically, we do not see a sentence even twice!

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 4

The Language Model

• Any LM will do (a toy 3-gram sketch follows below):
  – 3-gram LM

– 3-gram class-based LM

– decision tree LM with hierarchical classes

• Does not necessarily operate on word forms:
  – cf. the “analysis” and “generation” procedures later

– for simplicity, imagine now it does operate on word forms
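
Just to make P(E) concrete, here is a toy unsmoothed 3-gram LM over word forms; real systems would smooth the counts, and the corpus format (a list of token lists) is an assumption:

```python
# Toy unsmoothed 3-gram LM (MLE counts); sentences is an assumed list of token lists.
from collections import defaultdict

def train_trigram_lm(sentences):
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        words = ["<s>", "<s>"] + s + ["</s>"]
        for i in range(2, len(words)):
            tri[(words[i-2], words[i-1], words[i])] += 1
            bi[(words[i-2], words[i-1])] += 1
    def p(w, u, v):
        # P(w | u v), unsmoothed maximum-likelihood estimate
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return p
```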

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 5

The Translation Models

• Do not care about correct strings of English words (that’s the task of the LM)

• Therefore, we can make more independence assumptions:
  – for a start, use the “tagging” approach:
    • 1 English word (“tag”) ~ 1 French word (“word”)
  – not realistic: rarely is even the number of words the same in both sentences (let alone a 1:1 correspondence!)

• use “Alignment”.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 6

The Alignment

• English (positions 0–6):  e0 And the program has been implemented
• French (positions 0–7):  f0 Le programme a été mis en application
• Linear notation:
  – f: f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
  – e: e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 7

Alignment Mapping

• In general:
  – |F| = m, |E| = l (sentence lengths):
    • l·m possible connections (each French word to any English word),
    • 2^(l·m) different alignments for any pair (E,F) (any subset of connections)
• In practice:
  – from English to French:
    • each English word: 1–n connections (n – empirical max. fertility?)
    • each French word: exactly 1 connection
  – therefore, “only” (l+1)^m alignments ( << 2^(l·m) )
  – a_j = i (the link from the j-th French word goes to the i-th English word)
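
To get a feel for the reduction, here is a quick numeric check for the example pair above (l = 6 English words, m = 7 French words):

```python
# Quick numeric check for the example sentence pair above:
l, m = 6, 7
print(2 ** (l * m))    # any subset of the l*m connections: 2^42 = 4,398,046,511,104
print((l + 1) ** m)    # exactly one link per French word:  7^7  = 823,543
```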

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 8

Elements of Translation Model(s)

• Basic distribution:
  – P(F,A,E) – the joint distribution of the English sentence, the alignment, and the French sentence (of length m)
• Interested also in marginal distributions:
  P(F,E) = Σ_A P(F,A,E)
  P(F|E) = P(F,E) / P(E) = Σ_A P(F,A,E) / Σ_{A,F} P(F,A,E) = Σ_A P(F,A|E)
• Useful decomposition [one of the possible decompositions]:
  P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 9

Decomposition

• Decomposition formula again:

P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)

m – length of the French sentence

a_j – the alignment (single connection) going from the j-th French word

f_j – the j-th French word of F

a_1^{j-1} – the sequence of alignments a_i up to the word preceding f_j

a_1^{j} – the sequence of alignments a_i up to and including the word f_j

f_1^{j-1} – the sequence of French words up to the word preceding f_j

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 10

Decomposition and the Generative Model

• ...and again:

P(F,A|E) = P(m|E) ∏_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) · P(f_j | a_1^{j}, f_1^{j-1}, m, E)

• Generate:
  – first, the length m of the French sentence, given the English words E;
  – then, the link from the first position in F (not yet knowing the actual French word); now we know the English word it connects to;
  – then, given the link (and thus the English word), generate the French word at the current position;
  – then, move on to the next position in F, until all m positions are filled.
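
A minimal sketch of this generative story, instantiated with Model-1-style choices (uniform links) purely for concreteness; the length distribution length_p and translation table t are assumed inputs, not the lecture's actual parameters:

```python
# Sketch of the generative process: pick m, then for each position pick a link
# (uniformly here), then generate the French word from the linked English word.
import random

def generate_french(E, length_p, t):
    """E: list of English words, E[0] being the empty word e0 (assumption)."""
    m = random.choices(list(length_p), weights=length_p.values())[0]  # 1. length m
    F, A = [], []
    for j in range(m):
        i = random.randrange(len(E))          # 2. link a_j, uniform over 0..l
        e = E[i]                              #    ...which fixes the English word
        f = random.choices(list(t[e]), weights=t[e].values())[0]  # 3. f_j given e_{a_j}
        A.append(i); F.append(f)              # 4. move to the next position
    return F, A
```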

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 11

Approximations

• Still too many parameters:
  – similar to the situation in an n-gram model with “unlimited” n
  – impossible to estimate reliably.
• Use 5 models, from the simplest to the most complex (i.e. from heavy independence assumptions to light ones).
• Parameter estimation:
  – estimate the parameters of Model 1; use them as an initial estimate for estimating the Model 2 parameters; etc.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 12

Model 1

• Approximations:
  – the French length distribution P(m|E) is constant (a small ε)
  – the alignment link distribution P(a_j | a_1^{j-1}, f_1^{j-1}, m, E) depends only on the English length l (= 1/(l+1))
  – the French word distribution depends only on the English and French words connected by the link a_j.
• Model 1 distribution:

P(F,A|E) = ε / (l+1)^m · ∏_{j=1..m} p(f_j | e_{a_j})
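
A small sketch that scores one (F, A, E) triple with the Model 1 formula above; the translation table t and the constant ε are assumed to be given:

```python
def model1_joint(F, A, E, t, eps=1e-4):
    # P(F,A|E) = eps/(l+1)^m * prod_j p(f_j | e_{a_j});  E[0] is the empty word e0
    l, m = len(E) - 1, len(F)
    p = eps / (l + 1) ** m
    for j, i in enumerate(A):                      # A[j] = i: f_j linked to e_i
        p *= t.get(E[i], {}).get(F[j], 0.0)
    return p
```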

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 13

Models 2-5

• Model 2
  – adds more detail to P(a_j|...): more “vertical” links are preferred
• Model 3
  – adds “fertility” (the number of links for a given English word is explicitly modeled): P(n | e_i)
  – “distortion” replaces the alignment probabilities of Model 2
• Model 4
  – the notion of “distortion” is extended to chunks of words
• Model 5 is Model 4, but not deficient (it does not waste probability mass on non-strings)

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 14

The Search Procedure

• “Decoder”:
  – given the “output” (French), discover the “input” (English)
• The translation model goes in the opposite direction: p(f|e) = ....
• Naive methods do not work.
• Possible solution (roughly):
  – generate English words one by one, keeping only an n-best list (variable n); also, account for the different lengths of the English sentence candidates!
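
A very rough sketch of the n-best idea, not the lecture's actual decoder: grow English hypotheses word by word, score each with LM × (Model 1 likelihood of F), keep only the best n, and remember the best-scoring hypothesis across different lengths. The vocabulary, lm, and translation table t are assumed inputs:

```python
import heapq

def model1_likelihood(F, E, t, eps=1e-4):
    # P(F|E) under Model 1: eps/(l+1)^m * prod_j sum_i p(f_j|e_i), "<null>" = empty word
    l = len(E)
    p = eps / (l + 1) ** len(F)
    for f in F:
        p *= sum(t.get(e, {}).get(f, 0.0) for e in ["<null>"] + E)
    return p

def nbest_decode(F, vocab, lm, t, n=10, max_len=10):
    beam, best = [[]], ([], 0.0)
    for _ in range(max_len):
        grown = [h + [w] for h in beam for w in vocab]        # extend each hypothesis
        scored = heapq.nlargest(n, ((lm(h) * model1_likelihood(F, h, t), h) for h in grown),
                                key=lambda x: x[0])           # keep only the n best
        beam = [h for _, h in scored]
        if scored and scored[0][0] > best[1]:
            best = (scored[0][1], scored[0][0])               # best over all lengths
    return best[0]
```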

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 15

Analysis - Translation - Generation (ATG)

• Word forms: too sparse
• Use four basic analysis/generation steps:

– tagging

– lemmatization

– word-sense disambiguation

– noun-phrase “chunks” (non-compositional translations)

• Translation proper:
  – use chunks as “words”

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 16

Training vs. Test with ATG

• Training:
  – analyze both languages using all four analysis steps
  – train the TM(s) on the result (i.e. on chunks, tags, etc.)
  – train the LM on the analyzed source (English)
• Runtime/Test:
  – analyze the given (French) sentence using the identical tools as in training
  – translate using the trained Translation/Language model(s)
  – generate the source (English), reversing the analysis process

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 17

Analysis: Tagging and Morphology

• Replace word forms by morphologically processed text:
  – lemmas
  – tags
  – original approach: mix them into the text and call them “words”
  – e.g. She bought two books. → she buy VBP two book NNS .
• Tagging: yes
  – but in reversed order: tag first, then lemmatize [NB: this does not work for inflective languages]
  – technically easy
• Hand-written deterministic rules for tag + form → lemma

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 18

Word Sense Disambiguation, Word Chunking

• Sets of senses for each E, F word:
  – e.g. book-1, book-2, ..., book-n
  – prepositions (de-1, de-2, de-3, ...), many others
• Senses derived automatically using the TM
  – translation probabilities measured on senses: p(de-3 | from-5)
• Result:
  – a statistical model for assigning senses monolingually based on context (a MaxEnt model is also used here for each word)
• Chunks: group words for non-compositional translation

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 19

Generation

• Inverse of analysis
• Much simpler:
  – chunks → words (lemmas) with senses (trivial)
  – words (lemmas) with senses → words (lemmas) (trivial)
  – words (lemmas) + tags → word forms
• Additional step:
  – source-language ambiguity:
    • electric vs. electrical, hath vs. has, you vs. thou: treated as a single unit during translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.

12/07/1999 JHU CS 600.465/Jan Hajic 20

Introduction to Natural Language Processing (600.465)

Statistical Translation: Alignment and Parameter Estimation

Dr. Jan Hajič

CS Dept., Johns Hopkins Univ.

hajic@cs.jhu.edu

www.cs.jhu.edu/~hajic

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 21

Alignment

• Available corpus assumed:
  – parallel text (translation E ↔ F)
• No alignment present (day marks only)!
• Sentence alignment:
  – sentence detection
  – sentence alignment
• Word alignment:
  – tokenization
  – word alignment (with restrictions)

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 22

Sentence Boundary Detection

• Rules, lists (a toy sketch follows below):
  – Sentence breaks:
    • paragraphs (if marked)
    • certain characters: ?, !, ; (...almost sure)
• The Problem: the period “.”
  – could be the end of a sentence (... left yesterday. He was heading to...)
  – decimal point: 3.6 (three point six)
  – thousands separator: 3.200 (three thousand two hundred)
  – abbreviations that never end a sentence: cf., e.g., Calif., Mt., Mr.
  – ellipsis: ...
  – other languages: ordinal number indication (2nd ~ 2.)
  – initials: A. B. Smith

• Statistical methods: e.g., Maximum Entropy
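
A toy sketch of the rule/list approach; the abbreviation list and the exact rules are illustrative assumptions, not the lecture's actual detector:

```python
# Decide whether a token ending in '.' closes a sentence, using simple rules/lists.
import re

ABBREVIATIONS = {"cf.", "e.g.", "Calif.", "Mt.", "Mr."}

def is_sentence_end(tokens, i):
    """Is tokens[i] (a token ending in '.') a sentence boundary?"""
    tok = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    if tok in ABBREVIATIONS:                    # known abbreviation: no break
        return False
    if re.fullmatch(r"\d+\.\d+", tok):          # decimal point / 3.200-style number
        return False
    if re.fullmatch(r"[A-Z]\.", tok):           # initials: A. B. Smith
        return False
    return nxt is None or nxt[0].isupper()      # otherwise break before a capital
```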

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 23

Sentence Alignment

• The Problem: only sentence boundaries are detected (the slide illustrates two independently segmented texts E and F).
• Desired output: a segmentation with an equal number of segments, spanning the whole text continuously.
• Original sentence boundaries are kept.
• Alignments obtained (in the slide's example): 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
• The new segments are called “sentences” from now on.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 24

Alignment Methods

• Several methods (probabilistic and not):
  – character-length based
  – word-length based
  – “cognates” (word identity used)
    • using an existing dictionary (F: prendre ~ E: make, take)
    • using word “distance” (similarity): names, numbers, borrowed words, words of Latin origin, ...
• Best performing:
  – statistical, word- or character-length based (perhaps with some words as additional clues)

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 25

Length-based Alignment

• First, define the problem probabilistically:
  argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E, F fixed)
• Define a “bead”: a group of aligned sentences (the slide shows a 2:2 bead spanning two E and two F sentences).
• Approximate:

P(A,E,F) ≈ ∏_{i=1..n} P(B_i),

where B_i is a bead; P(B_i) does not depend on the rest of E, F.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 26

The Alignment Task

• Given the model definition

P(A,E,F) ≈ ∏_{i=1..n} P(B_i),

find the partitioning of (E,F) into n beads B_i, i = 1..n, that maximizes P(A,E,F) over the training data.

• Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2}
  – this describes the type of alignment.
• Want to use some sort of dynamic programming:
  – define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 27

Recursive Definition

• Initialize: Pref(0,0) = 1.
• Pref(i,j) = max (

  Pref(i,j-1) · P(0:1_k),  Pref(i-1,j) · P(1:0_k),  Pref(i-1,j-1) · P(1:1_k),

  Pref(i-1,j-2) · P(1:2_k),  Pref(i-2,j-1) · P(2:1_k),  Pref(i-2,j-2) · P(2:2_k) )

• This is enough for a Viterbi-like search (a compact sketch follows below).

(The slide shows the six predecessor cells of Pref(i,j) in the (i,j) grid, one per bead type.)
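
A compact sketch of this recursion; bead_p(p, q, i, j) is an assumed callback returning P(p:q_k) for the bead ending at English sentence i and French sentence j (see the next slide), and E_len/F_len are the sentence counts:

```python
BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_sentences(E_len, F_len, bead_p):
    pref = {(0, 0): 1.0}                 # Pref(0,0) = 1
    back = {}
    for i in range(E_len + 1):
        for j in range(F_len + 1):
            if (i, j) == (0, 0):
                continue
            best, arg = 0.0, None
            for p, q in BEAD_TYPES:
                prev = pref.get((i - p, j - q))   # skip cells outside the grid
                if prev is not None:
                    score = prev * bead_p(p, q, i, j)
                    if score > best:
                        best, arg = score, (p, q)
            pref[(i, j)], back[(i, j)] = best, arg
    # Trace back the sequence of bead types from (E_len, F_len)
    beads, i, j = [], E_len, F_len
    while (i, j) != (0, 0) and back.get((i, j)):
        p, q = back[(i, j)]
        beads.append((p, q))
        i, j = i - p, j - q
    return list(reversed(beads))
```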

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 28

Probability of a Bead

• It remains to define P(p:q_k) (the term used in the recursion above):
  – k refers to the “next” bead, with segments of p and q sentences, of lengths l_k,e and l_k,f.
• Use a normal distribution for length variation:

P(p:q_k) = P(δ(l_k,e, l_k,f, μ, σ²), p:q) ≅ P(δ(l_k,e, l_k,f, μ, σ²)) · P(p:q)

δ(l_k,e, l_k,f, μ, σ²) = (l_k,f − μ·l_k,e) / √(l_k,e·σ²)

• Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.
• Words etc. might be used as better clues in the definition of P(p:q_k).
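
A sketch of one way to realize P(p:q_k) under the normal-length assumption: score the standardized length difference δ with a Gaussian density and multiply by a prior over bead types. The values of μ, σ², and the prior below are illustrative assumptions:

```python
import math

# Illustrative prior over bead types (assumed values, roughly normalized).
BEAD_PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
              (1, 2): 0.045, (2, 1): 0.045, (2, 2): 0.01}

def bead_probability(le, lf, p, q, mu=1.0, sigma2=6.8):
    """P(p:q_k) for a bead with English length le and French length lf."""
    le = max(le, 1)                                      # guard for 0:1 beads
    delta = (lf - mu * le) / math.sqrt(le * sigma2)      # standardized length difference
    density = math.exp(-0.5 * delta ** 2) / math.sqrt(2 * math.pi)
    return density * BEAD_PRIOR.get((p, q), 0.0)
```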

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 29

Saving time

• For long texts (> 10^4 sentences), even Viterbi (in the version needed) is not effective (O(S²) time).
• Go paragraph by paragraph if they are aligned 1:1.
• What if not? Apply the same method to paragraphs first!
  – identify paragraphs roughly in both languages
  – run the algorithm to get aligned paragraph-like segments
  – then run on the sentences within the paragraphs.
• Performs well if there are not many consecutive 1:0 or 0:1 beads.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 30

Word alignment

• Length alone does not help anymore:
  – mainly because words can be swapped, and mutual translations often have vastly different lengths.
• ...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily.
• Idea:
  – assume some (simple) translation model (such as Model 1).
  – find its parameters by considering virtually all alignments.
  – after we have the parameters, find the best alignment given those parameters.

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 31

Word Alignment Algorithm

• Start with sentence-aligned corpus.

• Let (E,F) be a pair of sentences (actually, a bead).

• Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E.

• Compute expected counts over the corpus:

c(f,e) = Σ_{(E,F); e∈E, f∈F} p(f|e)

  i.e. for each aligned pair (E,F), check whether e occurs in E and f in F; if so, add p(f|e).

• Re-estimate:

p(f|e) = c(f,e) / c(e)   [where c(e) = Σ_f c(f,e)]

• Iterate until the change in p(f|e) is small.
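
A minimal sketch of this loop, exactly as simplified on the slide (expected counts without per-sentence normalization), run for a fixed number of iterations instead of an explicit convergence test; corpus is assumed to be a list of sentence-aligned (E, F) word-list pairs:

```python
from collections import defaultdict

def estimate_translation_table(corpus, iterations=10):
    # initialize p(f|e) uniformly over the French vocabulary
    f_vocab = {f for _, F in corpus for f in F}
    p = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)                 # expected counts c(f,e)
        for E, F in corpus:
            for e in E:
                for f in F:
                    count[(f, e)] += p[(f, e)]     # e in E and f in F: add p(f|e)
        total = defaultdict(float)                 # c(e) = sum_f c(f,e)
        for (f, e), c in count.items():
            total[e] += c
        for (f, e), c in count.items():            # re-estimate p(f|e) = c(f,e)/c(e)
            p[(f, e)] = c / total[e]
    return p
```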

12/07/1999 JHU CS 600.465/ Intro to NLP/Jan Hajic 32

Best Alignment

• Select, for each (E,F),

A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F|E) =
    argmax_A P(F,A|E) = argmax_A ( ε/(l+1)^m ∏_{j=1..m} p(f_j|e_{a_j}) ) =
    argmax_A ∏_{j=1..m} p(f_j|e_{a_j})   (IBM Model 1)

• Again, use dynamic programming, a Viterbi-like algorithm.
• Recompute p(f|e) based on the best alignment (only if you are inclined to do so; the “original” summed-over-all distribution might perform better).

• Note: we have also got all Model 1 parameters.
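
Under Model 1 the product decomposes per French position, so the best alignment simply picks, for each f_j, the English position (including the empty word e0) with the highest p(f_j|e_i). A one-line sketch, reusing the (f, e)-keyed p table from the estimation sketch above:

```python
def best_alignment(E, F, p):
    # E[0] is assumed to be the empty word; p is keyed by (f, e) pairs as above.
    return [max(range(len(E)), key=lambda i: p.get((f, E[i]), 0.0)) for f in F]
```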
