Hindi To English Statistical Machine Translation
Naveed Khan
[Under the guidance of]
Prof. Pushpak Bhattacharyya
CFILT, IIT-Bombay
Presentation Outline
Overview of Statistical Approach
Language Model
Translation Model
Components of SMT
Moses and Giza++
Word based alignments
Parallel alignments
Phrase based SMT
Moses Steps
Evaluation of results
Conclusion and Future Work
Overview of Statistical Approach
“Find the English translation e corresponding to a given Foreign sentence f”
Thus, we seek $e_{best}$ such that
$e_{best} = \arg\max_e P(e \mid f) = \arg\max_e [P(e) \cdot P(f \mid e)]$
Language Model – P(e)
Translation Model – P(f |e)
Translations are produced on the basis of a statistical model
Parameters are estimated using bilingual parallel corpora
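The argmax above can be sketched in a few lines of Python. This is only an illustration: the candidate list, the probability values and the function name `best_translation` are made up for the example, and a real decoder searches a far larger space than a fixed candidate list.

```python
import math

def best_translation(candidates, lm_logprob, tm_logprob, f):
    """Return the candidate e that maximises log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Toy, made-up probabilities purely for illustration.
LM = {"ram eats rice": 0.02, "rice eats ram": 0.001}          # P(e)
TM = {("f", "ram eats rice"): 0.3, ("f", "rice eats ram"): 0.3}  # P(f|e)

e_best = best_translation(
    list(LM),
    lambda e: math.log(LM[e]),
    lambda f, e: math.log(TM[(f, e)]),
    "f",  # placeholder for the foreign sentence
)
print(e_best)
```

Both candidates explain the foreign sentence equally well under the toy translation model, so the language model decides in favour of the fluent word order.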
Language Model
The goal is to find a highly fluent English sentence. For a given sentence $s_1 s_2 \ldots s_n$:
$\Pr(s_1 s_2 \ldots s_n) = \Pr(s_1) \cdot \Pr(s_2 \mid s_1) \cdots \Pr(s_n \mid s_1 s_2 \ldots s_{n-1})$
Here $\Pr(s_n \mid s_1 s_2 \ldots s_{n-1})$ is the probability that word $s_n$ follows the word string $s_1 s_2 \ldots s_{n-1}$
N-gram model probability
Trigram model probability calculation
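A trigram model approximates each factor of the chain rule by conditioning only on the two preceding words. The sketch below estimates trigram probabilities by plain relative frequency on a two-sentence toy corpus. The corpus, the `<s>`/`</s>` padding symbols and the function names are assumptions for illustration, and no smoothing is applied, which a practical language model would need.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts in a tokenised corpus."""
    tri, ctx = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            ctx[tuple(toks[i - 2:i])] += 1
    return tri, ctx

def trigram_prob(tri, ctx, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    c = ctx[(w1, w2)]
    return tri[(w1, w2, w3)] / c if c else 0.0

def sentence_prob(tri, ctx, sentence):
    """Product of trigram probabilities over the padded sentence."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        p *= trigram_prob(tri, ctx, toks[i - 2], toks[i - 1], toks[i])
    return p

tri, ctx = train_trigram(["ram eats rice", "ram eats bread"])
p_rice = trigram_prob(tri, ctx, "ram", "eats", "rice")
p_sentence = sentence_prob(tri, ctx, "ram eats rice")
```

In this toy corpus "ram eats" is followed by "rice" in one of its two occurrences, so the only factor below 1 is P(rice | ram eats).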
Translation Model [1/2]
It is a generative model: given a Hindi sentence, it tries to find a highly fluent English sentence
Whenever it faces an English sentence, it reasons backward and tries to identify which Hindi sentence is likely to have produced this English sentence
Since the number of possible sentences is infinite, it is not possible to estimate Pr(f, e) for all pairs of sentences, so the concept of alignment is introduced
$\Pr(f \mid e) = \sum_a \Pr(f, a \mid e)$
Translation Model [2/2]
Alignment is the mapping of individual words in aligned sentence pairs
A = {a1, a2, a3, a4, ..., am} is termed an alignment, where aj is the set of positions in the English sentence to which the jth word of the foreign-language sentence is aligned.
Without loss of generality we can say that
$\Pr(f, a \mid e) = \Pr(m \mid e) \prod_{j=1}^{m} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \, \Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$
Choose the length m of the foreign-language string, given e
Choose the alignment a, given e and m
Choose the identity of each foreign word f, given e, m and a
Components Of SMT
Moses and Giza++
GIZA++ is a freely available implementation of the IBM Models. We need it as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++ plus some additional alignment points from the union of the two runs.
Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (a parallel corpus). An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.
These tools can be obtained as Debian packages from http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/pool/jaunty/nlp/
Word Based Alignment
For each word in the source language, align the words from the target language that this word possibly produces
Based on IBM models 1-5
Model 1 is the simplest
As we go from models 1 to 5, models get more complex but more realistic
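To make the simplest of these models concrete, here is a minimal sketch of IBM Model 1 training with EM. It is not GIZA++ itself: the romanised toy sentence pairs, the function name `train_ibm1` and the iteration count are assumptions for illustration only.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM for IBM Model 1. pairs: list of (foreign_tokens, english_tokens).
    Returns t with t[(f, e)] = t(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for fs, es in pairs:
            es = ["NULL"] + es  # allow foreign words to align to nothing
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Made-up romanised toy corpus (foreign side first, English side second).
pairs = [
    (["yah", "ghar"], ["this", "house"]),
    (["yah", "kitab"], ["this", "book"]),
    (["ek", "kitab"], ["a", "book"]),
]
t = train_ibm1(pairs)
```

Because "yah" co-occurs with "this" in two sentence pairs while "ghar" does so in only one, EM shifts probability mass towards t(yah | this) with each iteration.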
Parallel Alignments
Hindi to English Alignments
# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36
इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .
NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })
English to Hindi Alignments
# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21
the area has got the digamber jain temple which houses the birds hospital .
NULL ({ 5 }) इस ({ 1 }) क्षे�त्र ({ 2 }) में� ({ }) दिगम्बर ({ 3 4 6 }) जै�न ({ 7 }) में�दिर ({ 8 }) प्रा�प्त ({ }) कर ({ }) लि�या� ({ }) है� ({ }) जै� ({ 9 }) , ({ }) बर्ड्� �स ({ 12 }) अस्पत�� ({ 13 }) क� ({ 11 }) स�चया ({ 10 }) करत� ({ }) है� ({ }) . ({ 14 })
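The alignment lines above list each token followed by the 1-based positions it is aligned to in the other sentence. A small parser along these lines can turn such a line into alignment links. The function names and the regular expression are my own, and the sample line is a shortened prefix of the Hindi to English line shown above.

```python
import re

# One entry looks like: token ({ 4 8 9 })   or   token ({ })
A3_ENTRY = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3_line(line):
    """Return [(token, [1-based positions in the other sentence]), ...]."""
    return [(tok, [int(i) for i in idx.split()])
            for tok, idx in A3_ENTRY.findall(line)]

def to_links(entries):
    """Convert to 0-based (other_sentence_index, listed_token_index) pairs,
    skipping the NULL entry at listed position 0."""
    links = set()
    for listed_pos, (_, positions) in enumerate(entries):
        if listed_pos == 0:
            continue
        for p in positions:
            links.add((p - 1, listed_pos - 1))
    return links

line = "NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 })"
entries = parse_a3_line(line)
links = to_links(entries)
```

The resulting pairs use the same 0-based "hindi-english" convention as the alignment points that appear later in the training output.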
Phrase-Based SMT
Consider the translation of the sentence "राम चम्मच से चावल खाता है"
राम चम्मच से चावल खाता है
Ram eats rice with a spoon
Hindi phrase | English phrase | Probability
राम | Ram | 0.5
राम ने | Ram | 0.5
चम्मच से | with a spoon | 1.0
चावल | rice | 1.0
खाती है | eats | 0.6
खाता है | eats | 0.4
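The probabilities in the table above are consistent with relative-frequency estimates normalised over the English phrase, that is count(hindi, english) / count(english). The sketch below reproduces them from made-up counts. The counts, the romanised stand-ins for the Hindi phrases and the function name `phrase_probs` are assumptions chosen only to match the table.

```python
from collections import Counter

def phrase_probs(extracted_pairs):
    """Return phi(hindi | english) = count(hindi, english) / count(english)."""
    pair_counts = Counter(extracted_pairs)
    english_counts = Counter(e for _, e in extracted_pairs)
    return {(h, e): c / english_counts[e] for (h, e), c in pair_counts.items()}

# Hypothetical extraction counts, romanised for readability.
pairs = (
    [("raam", "Ram")] * 1
    + [("raam ne", "Ram")] * 1
    + [("khaatii hai", "eats")] * 3
    + [("khaataa hai", "eats")] * 2
)
phi = phrase_probs(pairs)
```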
Moses Steps [1/4]
Training
[../train-factored-phrase-model.perl -scripts-root-dir ../scripts -root-dir . -corpus filename.clean -e en -f hi -lm 0:3:../filename.lm:0]
Preparation of data
Run GIZA++
train.hi
1 इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .
2 स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .
3 छे�ट� बच्चों= क� में�दिर= में� �� जै�या� जै�त� है� और उनक� परिरचया ब3ध्दि@ एव� ज्ञा�न क, �व& , सरस्वत& क� आग� वर्ण�में��� क� अक्षेर= स� करव�या� जै�त� है� .
train.en
1 the area has got the digamber jain temple which houses the birds hospital .
2 places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
3 young children are taken to the temples and are introduced to the letters of the alphabet in front of saraswati , the goddess of wisdom and learning .
hi-en.A3.final
# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36
इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .
NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })
# Sentence pair (2) source length 27 target length 33 alignment score : 2.45498e-47
स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .
NULL ({ 11 }) places ({ 1 }) where ({ 2 }) a ({ 3 }) tourist ({ 4 }) can ({ }) whiz ({ }) past ({ }) his ({ 5 }) worries ({ 6 7 }) include ({ }) gulmarg ({ 17 }) in ({ 16 }) jammu ({ 13 }) and ({ 14 }) kashmir ({ 15 }) , ({ 18 }) auli ({ 19 }) in ({ 20 }) grawhal ({ 8 9 10 21 }) , ({ 12 22 }) kufri ({ 26 }) and ({ 27 }) narkanda ({ 28 29 30 31 32 }) in ({ 25 }) himachal ({ 23 }) pradesh ({ 24 }) . ({ 33 })
en-hi.A3.final
# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21
the area has got the digamber jain temple which houses the birds hospital .
NULL ({ 5 }) इस ({ 1 }) क्षे�त्र ({ 2 }) में� ({ }) दिगम्बर ({ 3 4 6 }) जै�न ({ 7 }) में�दिर ({ 8 }) प्रा�प्त ({ }) कर ({ }) लि�या� ({ }) है� ({ }) जै� ({ 9 }) , ({ }) बर्ड्� �स ({ 12 }) अस्पत�� ({ 13 }) क� ({ 11 }) स�चया ({ 10 }) करत� ({ }) है� ({ }) . ({ 14 })
# Sentence pair (2) source length 33 target length 27 alignment score : 4.73882e-36
places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
NULL ({ }) स्था�न ({ 1 }) जैहै�� ({ 2 }) एक ({ 3 }) पया�टक ({ 4 }) उसक, ({ 8 }) लिचन्त�ओं ({ 6 7 9 }) क� ({ }) प&छे� ({ }) छे�र्ड् ({ }) सकत� ({ 5 }) है� ({ }) , ({ }) जैम्में0 ({ 13 }) और ({ 14 }) कश्में&र ({ 15 }) में� ({ 12 }) ग3�मेंग� ({ 10 11 }) , ({ 16 }) गढव�� ({ }) में� ({ }) औ�& ({ 17 18 19 }) , ({ 20 }) हिहैमें�च� ({ 25 }) प्रा�श ({ 26 }) में� ({ 24 }) क3 फ्री, ({ 21 }) और ({ 22 }) न�रूण्र्ड्� ({ 23 }) क� ({ }) सम्मिम्मेंलि�त ({ }) करत� ({ }) हुया� ({ }) . ({ 27 })
Moses Steps [2/4]
Align words
To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points.
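The idea of starting from the intersection and growing towards the union can be sketched as follows. This is a simplified illustration, not the Moses implementation: it covers only the neighbour-growing step, omits the "final" step of grow-diag-final, and the function name and toy link sets are assumptions.

```python
# Horizontal, vertical and diagonal neighbours of an alignment point.
NEIGHBOURS = [(-1, 0), (0, -1), (1, 0), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag(f2e, e2f):
    """f2e, e2f: sets of (hindi_idx, english_idx) links from the two runs.
    Start with their intersection and add neighbouring union points
    whenever one of the two words involved is still unaligned."""
    alignment = set(f2e & e2f)
    union = f2e | e2f
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in NEIGHBOURS:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    hindi_free = all(a != cand[0] for a, _ in alignment)
                    english_free = all(b != cand[1] for _, b in alignment)
                    if hindi_free or english_free:
                        alignment.add(cand)
                        added = True
    return alignment

f2e = {(0, 0), (1, 1), (2, 1)}
e2f = {(0, 0), (1, 1), (3, 3)}
result = grow_diag(f2e, e2f)
```

Here (2, 1) is adopted because it neighbours (1, 1) and Hindi word 2 is still unaligned, while (3, 3) is left out because it does not touch any accepted point.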
Get lexical translation table
aligned.grow-diag-final
1. इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .
the area has got the digamber jain temple which houses the birds hospital .
0-0 1-1 3-2 9-2 3-3 6-3 2-4 3-5 7-5 8-5 16-5 4-6 5-7 10-8 15-9 11-10 14-10 12-11 13-12 18-13
2. स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .
places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .
0-0 1-1 2-2 3-3 9-4 5-5 5-6 4-7 5-8 6-8 16-9 16-10 15-11 12-12 13-13 14-14 17-15 18-16 19-17 7-18 8-18 20-18 11-19 21-19 25-20 26-21 27-22 28-22 29-22 30-22 31-22 24-23 22-24 23-25 32-26
model/lex.h2e
बEक banking 0.0588235
बEक bank 0.2571429
बEक several 0.0116279
बEक banks 0.1269841
बEक sterling 0.0526316
बEक paperwork 0.2857143
या0हिनयान union 0.1142857
अन्तिन्तमें success 0.0909091
अन्तिन्तमें final 0.1111111
अन्तिन्तमें eighties 0.1428571
अन्तिन्तमें last 0.0933333
अन्तिन्तमें terminus 0.0476190
Moses Steps [3/4]
Extract Phrases
The extract file contains, on each line, a Hindi phrase, an English phrase and the alignment points. Alignment points are pairs (hindi, english). Also, an inverted alignment file extract.inv is generated.
model/extract.0-0
इस ||| the ||| 0-0
इस क्षे�त्र ||| the area ||| 0-0 1-1
क्षे�त्र ||| area ||| 0-0
में� ||| the ||| 0-0
जै�न ||| jain ||| 0-0
जै�न में�दिर ||| jain temple ||| 0-0 1-1
में�दिर ||| temple ||| 0-0
जै� ||| which ||| 0-0
जै� , बर्ड्� �स अस्पत�� क� स�चया ||| which houses the birds hospital ||| 0-0 5-1 1-2 4-2 2-3 3-4
स�चया ||| houses ||| 0-0
, बर्ड्� �स अस्पत�� क� स�चया ||| houses the birds hospital ||| 4-0 0-1 3-1 1-2 2-3
, बर्ड्� �स अस्पत�� क� ||| the birds hospital ||| 0-0 3-0 1-1 2-2
बर्ड्� �स ||| birds ||| 0-0
बर्ड्� �स अस्पत�� ||| birds hospital ||| 0-0 1-1
अस्पत�� ||| hospital ||| 0-0
. ||| . ||| 0-0
है� . ||| . ||| 1-0
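Phrase pairs like those above are the ones that are consistent with the word alignment: no word inside the pair is aligned to a word outside it. The sketch below implements that consistency check in its simplest form. It is not the Moses extractor (it does not extend phrases over unaligned boundary words), and the romanised tokens, the alignment links and the function name are assumptions based on the earlier "Ram eats rice with a spoon" example.

```python
def extract_phrases(hindi, english, links, max_len=7):
    """Return (hindi_phrase, english_phrase) pairs consistent with links,
    where links is a set of 0-based (hindi_idx, english_idx) pairs."""
    phrases = []
    for h_start in range(len(hindi)):
        for h_end in range(h_start, min(h_start + max_len, len(hindi))):
            # English positions linked to the Hindi span.
            e_points = [j for i, j in links if h_start <= i <= h_end]
            if not e_points:
                continue
            e_start, e_end = min(e_points), max(e_points)
            if e_end - e_start >= max_len:
                continue
            # Consistency: nothing in the English span links outside the Hindi span.
            if all(h_start <= i <= h_end for i, j in links if e_start <= j <= e_end):
                phrases.append((" ".join(hindi[h_start:h_end + 1]),
                                " ".join(english[e_start:e_end + 1])))
    return phrases

hindi = ["raam", "chammach", "se", "chaaval", "khaataa", "hai"]  # romanised
english = ["Ram", "eats", "rice", "with", "a", "spoon"]
# Hypothetical word alignment; "a" is left unaligned.
links = {(0, 0), (1, 5), (2, 3), (3, 2), (4, 1), (5, 1)}
phrases = extract_phrases(hindi, english, links)
```

"khaataa" alone cannot be paired with "eats" because "eats" is also linked to "hai", so only the two-word phrase "khaataa hai" is extracted for it.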
Moses Steps [4/4]
Score Phrases
A translation table is created from the stored phrase translation pairs.
जै�न ||| jain ||| (0) ||| (0) ||| 1 0.981818 0.857143 0.915254 2.718
क्षे�त्र ||| area ||| (0) ||| (0) ||| 0.8375 0.671779 0.503759 0.376936 2.718
बर्ड्� �स ||| birds ||| (0) ||| (0) ||| 0.0175439 0.0147059 1 1 2.718
बर्ड्� �स अस्पत�� ||| birds hospital ||| (0) (1) ||| (0) (1) ||| 1 0.0026738 1 0.5 2.718
अस्पत�� ||| hospital ||| (0) ||| (0) ||| 0.4 0.181818 1 0.5 2.718
स�चया ||| houses ||| (0) ||| (0) ||| 0.0327869 0.0134529 1 0.5 2.718
में�दिर ||| temple ||| (0) ||| (0) ||| 0.864903 0.768421 0.763838 0.760417 2.718
The five scores on each line are, in order:
Phrase translation probability (f|e)
Lexical weighting lex(f|e)
Phrase translation probability (e|f)
Lexical weighting lex(e|f)
Phrase penalty, always exp(1) = 2.718
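Reading one translation-table line into named scores might look like the sketch below. It assumes the five-field `|||`-separated layout and the score order shown above. The key names and the function name are my own, and `SRC` is a placeholder for the Hindi phrase.

```python
SCORE_NAMES = [
    "phi_f_given_e",   # phrase translation probability (f|e)
    "lex_f_given_e",   # lexical weighting lex(f|e)
    "phi_e_given_f",   # phrase translation probability (e|f)
    "lex_e_given_f",   # lexical weighting lex(e|f)
    "phrase_penalty",  # always exp(1) = 2.718
]

def parse_phrase_table_line(line):
    """Split 'src ||| tgt ||| align ||| align ||| scores' into its parts."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    scores = [float(x) for x in fields[-1].split()]
    return source, target, dict(zip(SCORE_NAMES, scores))

src, tgt, scores = parse_phrase_table_line(
    "SRC ||| jain ||| (0) ||| (0) ||| 1 0.981818 0.857143 0.915254 2.718"
)
```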
Decoding
Hindi sentence: राम चम्मच से चावल खाता है
Phrase table entry [ राम Ram]
h= * चम्मच से चावल खाता है e= Ram
Probability=p1 p1=p(राम|Ram)*pLM(Ram|<start>)*d(0)
Phrase table entry [ खाता है eats]
h= * चम्मच से चावल * * e= Ram eats
Probability=p1*p2 p2=p(खाता है|eats)*pLM(eats|Ram<start>)*d(2)
Phrase table entry [ चावल rice]
h= * चम्मच से * * * e= Ram eats rice
Probability=p1*p2*p3 p3=p(चावल|rice)*pLM(rice|eats<start>)*d(2)
Phrase table entry [ चम्मच से with a spoon]
h= * * * * * * e= Ram eats rice with a spoon
Probability=p1*p2*p3*p4 p4=p(चम्मच से|with a spoon)*pLM(with a spoon|rice<start>)*d(2)
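The running product p1*p2*p3*p4 can be accumulated in log space, as decoders usually do to avoid underflow. In the sketch below the phrase probabilities are taken from the earlier phrase table example and the jump distances from the d(.) terms above. The language model values, the exponential form of the distortion penalty and its base `alpha` are assumptions made only for this illustration.

```python
import math

def distortion(distance, alpha=0.5):
    """Assumed exponential distortion penalty d(x) = alpha ** |x|."""
    return alpha ** abs(distance)

def score_hypothesis(steps):
    """steps: list of (phrase_prob, lm_prob, jump_distance), one per phrase.
    Returns the log of the product of all three factors over all steps."""
    log_score = 0.0
    for phrase_prob, lm_prob, jump in steps:
        log_score += (math.log(phrase_prob) + math.log(lm_prob)
                      + math.log(distortion(jump)))
    return log_score

steps = [
    (0.5, 0.10, 0),  # Ram          (LM value made up)
    (0.4, 0.20, 2),  # eats         (LM value made up)
    (1.0, 0.30, 2),  # rice         (LM value made up)
    (1.0, 0.05, 2),  # with a spoon (LM value made up)
]
log_p = score_hypothesis(steps)
```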
Some Positive Results
H: शब्शहै� धमें�श��� क� मेंत�ब पहिवत्र शरर्णस्था�� हैE .
E: dharamshala literally means ' the holy refuge .
H: फत3हैप3र स&कर& ��� ब�3आ पत्थर में� एक मेंहै�क�व्य है� .
E: fatehpur sikri is an epic in red sandstone .
H: क3 ल्�3 घा�टN भी& व��& ओह्फ ग�ह्र्ड्�स क� न�में स� प्राचलि�त है� .
E: the kullu valley also known as the valley of the gods .
H: वस्त3ओं क, ग3र्णवत्ता� परिरवत�नश&� है� , पर�त3 आपक� अच्छा� अस�& सT� मिमें� सकत� है� .
E: the quality of goods varies , but you may well find a genuine bargain .
H: हिहैमें�च� प्रा�श क, र�जैध�न& लिशमें�� क� पहै�र्ड्& स्ट�शन= क, र�न& कहै� जै�त� हैE .
E: shimla the capital of himachal pradesh , called the queen of hill stations .
H: क्व&न हिवक्ट�रिरया� न� ब्��कफ्रीयास� हिWजै क� श3भी�र�भी नवम्बर 1869 में� हिकया� .
E: queen victoria opened blackfriars bridge in november 1869 .
H: र�जैघा�ट यामें3न� क� हिकन�र� मेंहै�त्में� ग��ध& क� श��त स्में�रक है� .
E: on the banks of yamuna raj ghat is the serene memorial of mahatma gandhi .
Error Analysis
H: ब�हिकघामें प���स मेंहै�र�न& तथा� या3वर�जै हिफलि�प्स क� ��न हिनव�स है� .
E: queen and prince philip buckingham palace is the london home of the .
Error: The translated sentence follows the wrong word order.
H: व�ज्ञा�हिनक तर&क� स� एक दिव्य स�र क� लि�ए त�र�मेंण्र्ड्� आए� .
E: scientific a celestial trip to आए� planetarium .
Error: Since 'आए�' is not present in the phrase table, the word is left unaltered. Phrases are selected from the phrase table in the decoding step, and for the span [ आए� ; 9-9] this selection has not been executed.
H: ऊं� ट सफ�रिरया�� अपन& उत्पत्तित्ता क� भी�रत एव� च&न क� ब&च व्य�प�र क� समेंया में� लिचध्दिन्हैत करत& हैE जैब ऊं� ट क�रव= मेंस��= , जैर्ड्&ब0दिटया= एव� रत्न= स� �� हुए स्था�हिपत व्य�प�र में�ग\ क� स�था या�त्र� करत� था� .
E: camel safaris india and china its origin to the time of trade between mark of when camel caravans spices and herbs , precious stones , from established trade routes laden with travel and
Error: Every word in the sentence is translated properly, but the word order is wrong.
Evaluation Criteria
Automatic Evaluation
BLEU: measures the n-gram precision of a translation with respect to given reference translations
A higher score indicates a better translation
Subjective Evaluation
Translations are judged by human evaluators on fluency and adequacy on a scale of 1 to 5
Subjective Evaluation
Fluency
Level Interpretation
5 Flawless English, with no grammatical errors whatsoever
4 Good English, with a few minor errors in morphology
3 Non-native English, possibly a few minor grammatical errors
2 Disfluent English, with most phrases correct, but ungrammatical overall
1 Incomprehensible
Adequacy
Level Interpretation
5 All meaning is conveyed
4 Most of the meaning is conveyed
3 Much of the meaning is conveyed
2 Little meaning is conveyed
1 None of the meaning is conveyed
BLEU Score Evaluation
Input Type BLEU
Baseline(wx-input) 26.06
Short Sentences(wx-input) 28.73
Baseline(Unicode-input) 26.12
Short Sentences(Unicode-input) 26.59
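BLEU scores such as those above are built on clipped (modified) n-gram precision. The sketch below computes only that ingredient, not full BLEU: it leaves out the brevity penalty and the geometric mean over n-gram orders. The candidate is the wrongly ordered output from the error analysis, while the reference sentence is an assumed one written for this illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

candidate = "queen and prince philip buckingham palace is the london home of the .".split()
# Assumed reference translation, not taken from the test set.
reference = "buckingham palace is the london home of the queen and prince philip .".split()

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Every word of the candidate appears in the reference, so unigram precision is perfect, but the misplaced phrases break two of the twelve bigrams, which is how word order errors lower the score.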
Results
The 400 Hindi-to-English translated test sentences were manually sorted into four categories
Category | Count | Description
Excellent | 17 | The sentences are completely fluent and adequate
Good | 73 | The majority of the sentences would make complete sense if the word order were corrected
Mediocre | 171 | Word order problems, with some words not being translated
Bad | 139 | Word order issues, words not being translated, and some words skipped
Conclusion and Future Work
Shorter sentences yield better BLEU scores when translated
POS tagging, morphological analysis and chunking of the Hindi sentences, followed by the application of reordering rules, is an experiment that is still in progress
A significant improvement in word order is expected
Thank You
Extra-Slides
Chunk-Level Reordering
Reordering Rules
Tokenizing the input sentence
POS tagging done to the sentence
Morphological analysis performed on the sentence
Chunking is done to the input sentence that is tokenized + POS-tagged + morph-analysed
Determining the subject, object and verb chunks
SOV to SVO reordering
Reordering the prepositions
Modifier reordering
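The SOV to SVO step and the preposition reordering can be illustrated on pre-labelled chunks. The sketch below is hypothetical: the role labels (`S`, `O`, `V`, `PP`), the romanised chunks and both function names are assumptions, and it presumes the tagging, morphological analysis and chunking stages have already run.

```python
def move_postposition(chunk_text):
    """Move the chunk-final postposition to the front, e.g.
    'chammach se' (spoon with) -> 'se chammach' (with spoon)."""
    tokens = chunk_text.split()
    return " ".join(tokens[-1:] + tokens[:-1])

def sov_to_svo(chunks):
    """chunks: list of (role, text). Returns the chunk texts in
    subject, verb, object order, followed by the remaining chunks."""
    subject = [text for role, text in chunks if role == "S"]
    verb = [text for role, text in chunks if role == "V"]
    obj = [text for role, text in chunks if role == "O"]
    others = [move_postposition(text) if role == "PP" else text
              for role, text in chunks if role not in ("S", "V", "O")]
    return subject + verb + obj + others

# Romanised chunks for "Ram eats rice with a spoon" in Hindi (SOV) order.
chunks = [("S", "raam"), ("PP", "chammach se"), ("O", "chaaval"), ("V", "khaataa hai")]
reordered = sov_to_svo(chunks)
```

After reordering, the Hindi chunks follow the English order "Ram / eats / rice / with spoon", so the phrase-based decoder needs much less long-distance reordering.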