Statistical Machine Translation Raghav Bashyal

Post on 17-Jan-2016

Statistical Machine Translation

Raghav Bashyal

Statistical Machine Translation

• Uses pre-translated text (corpora)

• Compare translated text to original

• Notice patterns, associate words

SMT Process

• Knight – A Statistical MT Tutorial Workbook

• Basic probabilities

– P(word)

• Conditional probabilities

– P(word | word)

• …

• Pick the most probable translation
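The final step, picking the most probable translation, can be sketched as a noisy-channel decision: score each candidate English sentence e by P(e) * P(f | e) and keep the highest-scoring one. This is only a minimal sketch. The candidate sentences and both probability tables are invented for illustration and are not values from the project.

```python
# Toy noisy-channel selection: argmax over e of P(e) * P(f | e).
# All numbers below are invented for illustration only.

language_model = {          # P(e): how plausible the English candidate is
    "the house is small": 0.020,
    "small the is house": 0.0001,
    "the home is little": 0.005,
}

translation_model = {       # P(f | e) for one fixed Spanish input f
    "the house is small": 0.30,
    "small the is house": 0.30,
    "the home is little": 0.10,
}


def pick_translation(lm, tm):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    return max(lm, key=lambda e: lm[e] * tm.get(e, 0.0))


best = pick_translation(language_model, translation_model)
print(best)
```

The language model penalizes the scrambled word order even though its translation-model score ties with the fluent candidate, which is why the two factors are multiplied rather than used alone.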

SMT process

http://isoft.postech.ac.kr/research/SMT/images/math.jpg

Project

• Translate basic text from Spanish to English

• Test effectiveness

– with/without hard-coded components (syntax)

– Specific procedures/algorithms that add speed

Literature

• Guides on Statistical Machine Translation

• Most research projects follow the same procedure as outlined by Knight

• “state of the art” implementation

– Google

Literature

• NLTK

– Christina Wallin

• UC Berkeley

– Modifications

– Larger corpora more useful

• Syntax based

– hard-code

– Higher translation quality when used with SMT

Procedure

• NLTK – Natural Language Toolkit

– Python

– Made from natural language processing projects

• Current procedure

– Read the SMT worksheet

– Code along with worksheet

Development

• Create corpora

• Tokenization

– Clean string

• Probability

– P(word) in corpora
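The tokenization and P(word) steps above might look roughly like the following. This is a sketch under stated assumptions: it uses only the Python standard library rather than NLTK, the helper names `tokenize` and `unigram_probabilities` are hypothetical, and the two-sentence Spanish corpus is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Clean the string: lowercase it and keep only runs of letters/digits.

    Punctuation such as '.', '?' and the Spanish '¿' is dropped;
    accented letters are kept because str patterns are Unicode-aware.
    """
    return re.findall(r"[^\W_]+", text.lower())


def unigram_probabilities(tokens):
    """Estimate P(word) as count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Made-up miniature corpus, for illustration only.
corpus = "La casa es pequeña. ¿Dónde está la casa?"
tokens = tokenize(corpus)
p_word = unigram_probabilities(tokens)
print(tokens)
print(p_word["casa"])
```

Relative frequency is the simplest possible estimate; any word that never appears in the corpora gets no entry at all, which is the gap that smoothing is meant to cover.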

Smoothing

• Coefficients used to modify probability

– Large coefficients for trigrams

– Small for bigrams and single words

• Normalizes the weight of all the words/phrases

– Trigrams are more valuable
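One common way to realize this weighting is linear interpolation: the smoothed trigram probability is a weighted sum of trigram, bigram and single-word estimates plus a small constant, with the weights summing to 1. The sketch below assumes that scheme. The specific coefficients (0.95, 0.04, 0.008, 0.002) and the function names are illustrative assumptions, not values confirmed by the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(tokens):
    """Count unigrams, bigrams and trigrams in one pass each."""
    return {n: Counter(ngrams(tokens, n)) for n in (1, 2, 3)}


def smoothed_trigram(x, y, z, counts, total,
                     weights=(0.95, 0.04, 0.008, 0.002)):
    """Interpolated estimate of P(z | x y).

    weights = (trigram, bigram, unigram, constant floor); they are
    assumed values chosen so that trigrams dominate and the sum is 1.
    """
    w3, w2, w1, w0 = weights

    def ratio(num, den):
        return num / den if den else 0.0

    tri = ratio(counts[3][(x, y, z)], counts[2][(x, y)])
    bi = ratio(counts[2][(y, z)], counts[1][(y,)])
    uni = ratio(counts[1][(z,)], total)
    return w3 * tri + w2 * bi + w1 * uni + w0


# Made-up token sequence, for illustration only.
tokens = "the house is small and the house is old".split()
counts = build_counts(tokens)
seen = smoothed_trigram("the", "house", "is", counts, len(tokens))
unseen = smoothed_trigram("the", "house", "green", counts, len(tokens))
print(seen, unseen)
```

Because of the constant floor, a word sequence never seen in the corpora still gets a small nonzero probability instead of zeroing out the whole sentence score.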

Algorithm

For translation, IBM Model 3 is used:

1. For each English word ei indexed by i = 1, 2, ..., l, choose fertility phi-i with probability n(phi-i | ei)

2. Choose the number phi-0 of "spurious" French words to be generated from e0 = NULL, using probability p1 and the sum of fertilities from step 1

3. Let m be the sum of fertilities for all words, including NULL

4. For each i = 0, 1, 2, ..., l, and each k = 1, 2, ..., phi-i, choose a French word tau-ik with probability t(tau-ik | ei)

5. For each i = 1, 2, ..., l, and each k = 1, 2, ..., phi-i, choose target French position pi-ik with probability d(pi-ik | i, l, m)

6. For each k = 1, 2, ..., phi-0, choose a position pi-0k from the phi-0 - k + 1 remaining vacant positions in 1, 2, ..., m, for a total probability of 1/phi-0!

7. Output the French sentence with words tau-ik in positions pi-ik (0 <= i <= l, 1 <= k <= phi-i)
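The seven steps can be read as a recipe for sampling a French sentence from an English one. The sketch below walks through them with toy tables. Everything in it is an assumption for illustration: the fertility and translation tables, the value of p1, and the names `FERTILITY`, `TRANSLATION`, `P1` and `generate`. It also simplifies step 5 by placing words uniformly at random in vacant positions instead of using a distortion table d(pi-ik | i, l, m).

```python
import random

# Hypothetical toy parameters, invented for illustration.
FERTILITY = {                 # n(phi | e)
    "the": {1: 1.0},
    "house": {1: 0.9, 2: 0.1},
}
TRANSLATION = {               # t(f | e); None stands for e0 = NULL
    None: {"de": 1.0},
    "the": {"la": 1.0},
    "house": {"maison": 1.0},
}
P1 = 0.1                      # chance of one spurious word per real word


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


def generate(english, rng):
    # Step 1: choose a fertility for each English word.
    phis = [sample(FERTILITY[e], rng) for e in english]
    # Step 2: spurious-word count, one coin flip (p1) per generated word.
    phi0 = sum(rng.random() < P1 for _ in range(sum(phis)))
    # Step 3: m is the French length, including NULL-generated words.
    m = sum(phis) + phi0
    # Step 4: choose French words for NULL (index 0) and each real word.
    words = [[sample(TRANSLATION[None], rng) for _ in range(phi0)]]
    words += [[sample(TRANSLATION[e], rng) for _ in range(phi)]
              for e, phi in zip(english, phis)]
    # Step 5 (simplified): uniform random vacant positions, not d(...).
    vacant = list(range(m))
    rng.shuffle(vacant)
    french = [None] * m
    for group in words[1:]:
        for word in group:
            french[vacant.pop()] = word
    # Step 6: spurious words take the remaining vacant positions.
    for word in words[0]:
        french[vacant.pop()] = word
    # Step 7: output the French sentence.
    return french


print(generate(["the", "house"], random.Random(0)))
```

A real translator runs this story in reverse: the tables are estimated from the corpora, and decoding searches for the English sentence most likely to have generated the observed foreign sentence.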

Expected Results

• Probably will be very basic translation

– Usually performs better with “sample” text than “real” text

• Highlighted errors

– Program should use reference data to find some errors

• Error frequency plots for certain words

• Test the effectiveness of adjustments

– Hard coding, other algorithms

GUI