hindi to english statistical machine translation naveed khan [under the guidance of] prof. pushpak...

Hindi To English Statistical Machine Translation

Naveed Khan

[Under the guidance of]

Prof. Pushpak Bhattacharyya

CFILT, IIT-Bombay

Presentation Outline

Overview of Statistical Approach

Language Model

Translation Model

Components of SMT

Moses and Giza++

Word based alignments

Parallel alignments

Phrase based SMT

Moses Steps

Evaluation of results

Conclusion and Future Work

Overview of Statistical Approach

“Find the English translation e corresponding to a given Foreign sentence f”

Thus, we seek $e_{best}$ such that

$e_{best} = \arg\max_e P(e \mid f) = \arg\max_e [P(e) \cdot P(f \mid e)]$

Language Model – P(e)

Translation Model – P(f |e)

Translations are produced on the basis of a statistical model

Parameters are estimated using bilingual parallel corpora
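The argmax above can be sketched as a search over a small candidate list. This is only an illustration: the probability tables, the romanized Hindi string and the function name `best_translation` are hypothetical and do not come from the slides.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
F = "raam chaaval khaataa hai"  # romanized placeholder for the Hindi input
LM = {  # language model P(e)
    "ram eats rice": 0.02,
    "rice eats ram": 0.001,
    "ram rice eats": 0.0001,
}
TM = {  # translation model P(f | e)
    (F, "ram eats rice"): 0.3,
    (F, "rice eats ram"): 0.3,
    (F, "ram rice eats"): 0.4,
}

def best_translation(f, candidates):
    """Return the candidate e that maximises P(e) * P(f | e), in log space."""
    def score(e):
        return math.log(LM[e]) + math.log(TM[(f, e)])
    return max(candidates, key=score)

print(best_translation(F, list(LM)))
```

The translation model alone would prefer the ungrammatical "ram rice eats"; the language model term is what pushes the fluent candidate to the top.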

Language Model

The goal is to prefer highly fluent English sentences. For a given sentence $s_1 s_2 \ldots s_n$:

$\Pr(s_1 s_2 \ldots s_n) = \Pr(s_1) \cdot \Pr(s_2 \mid s_1) \cdots \Pr(s_n \mid s_1 s_2 \ldots s_{n-1})$

Here $\Pr(s_n \mid s_1 s_2 \ldots s_{n-1})$ is the probability that word $s_n$ follows the word string $s_1 s_2 \ldots s_{n-1}$

N-gram model probability

Trigram model probability calculation
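A minimal sketch of a trigram probability calculation, assuming unsmoothed maximum-likelihood estimates, that is count(w1 w2 w3) / count(w1 w2). The toy corpus, the `<s>` and `</s>` padding symbols and the function names are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1 w2); 0.0 for unseen contexts."""
    context = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / context if context else 0.0

def sentence_prob(tri, bi, sentence):
    """Product of trigram probabilities over the padded sentence."""
    toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for i in range(2, len(toks)):
        p *= trigram_prob(tri, bi, *toks[i - 2:i + 1])
    return p

tri, bi = train_trigram(["ram eats rice", "ram eats fruit"])
print(sentence_prob(tri, bi, "ram eats rice"))
```

A practical language model would add smoothing so that unseen trigrams do not zero out the whole sentence.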

Translation Model [1/2]

It is a generative model: given a Hindi sentence, it tries to find a highly fluent English sentence

Whenever it faces a Hindi sentence f, it reasons backward and tries to identify which English sentence e is likely to have produced it

Since the set of possible sentences is infinite, Pr(f|e) cannot be estimated directly for all pairs of sentences, so the concept of alignment is introduced

$\Pr(f \mid e) = \sum_{a} \Pr(f, a \mid e)$

Translation Model [2/2]

Alignment is the mapping of individual words in aligned sentence pairs

$a = \{a_1, a_2, a_3, a_4, \ldots, a_m\}$ is termed an alignment, where $a_j$ is the set of positions in the English sentence to which the $j$th word of the foreign-language sentence is aligned

Without loss of generality we can say that

$\Pr(f, a \mid e) = \Pr(m \mid e) \prod_{j=1}^{m} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \, \Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$

$\Pr(m \mid e)$: choose the length m of the foreign-language string given e

$\Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e)$: choose the alignment a given e, m

$\Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$: choose the identity of the foreign word f given e, m, a

Components Of SMT

Moses and Giza++

GIZA++ is a freely available implementation of the IBM Models. It is needed as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++ plus some additional alignment points from the union of the two runs.

Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices.

These tools can be obtained as Debian packages from http://cl.aist-nara.ac.jp/~eric-n/ubuntu-nlp/pool/jaunty/nlp/

Word Based Alignment

For each word in the source language, align the words from the target language that this word possibly produces

Based on IBM models 1-5

Model 1 is the simplest

As we go from models 1 to 5, models get more complex but more realistic
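To make Model 1 concrete, here is a minimal sketch of its EM training loop, estimating word translation probabilities t(f|e) from sentence pairs. It is not the GIZA++ implementation; the romanized toy corpus, the iteration count and the function name `ibm_model1` are assumptions made for illustration.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """pairs: list of (foreign_tokens, english_tokens).
    Returns a dict mapping (f, e) to the estimate of t(f | e)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for fs, es in pairs:
            es_null = ["NULL"] + es  # the empty word can also generate f
            for f in fs:
                z = sum(t[(f, e)] for e in es_null)  # normaliser for this f
                for e in es_null:
                    c = t[(f, e)] / z  # expected count (E-step)
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per English word
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return dict(t)

# Hypothetical romanized toy corpus.
PAIRS = [
    (["ghar"], ["house"]),
    (["yah", "ghar"], ["this", "house"]),
    (["yah", "kitaab"], ["this", "book"]),
]
t = ibm_model1(PAIRS)
print(round(t[("ghar", "house")], 3), round(t[("yah", "house")], 3))
```

Model 1 ignores word positions entirely; the later models add distortion and fertility, which is where the extra complexity comes from.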

Parallel Alignments

Hindi to English Alignments

# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36

इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .

NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })

English to Hindi Alignments

# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21

the area has got the digamber jain temple which houses the birds hospital .

NULL ({ 5 }) इस ({ 1 }) क्षे�त्र ({ 2 }) में� ({ }) दिगम्बर ({ 3 4 6 }) जै�न ({ 7 }) में�दिर ({ 8 }) प्रा�प्त ({ }) कर ({ }) लि�या� ({ }) है� ({ }) जै� ({ 9 }) , ({ }) बर्ड्� �स ({ 12 }) अस्पत�� ({ 13 }) क� ({ 11 }) स�चया ({ 10 }) करत� ({ }) है� ({ }) . ({ 14 })
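The alignment lines above follow the pattern `token ({ positions })`. A minimal sketch of reading such a line into Python structures, assuming exactly that layout; the regular expression and the function name `parse_a3_alignment` are illustrative and not part of GIZA++.

```python
import re

# One token followed by "({ ... })" holding zero or more 1-based positions.
TOKEN_RE = re.compile(r"(\S+) \(\{([\d ]*)\}\)")

def parse_a3_alignment(line):
    """Return a list of (token, [positions]) pairs from one alignment line."""
    return [(tok, [int(p) for p in positions.split()])
            for tok, positions in TOKEN_RE.findall(line)]

line = "NULL ({ 18 }) the ({ 1 }) houses ({ }) digamber ({ 4 8 9 }) . ({ 19 })"
for token, positions in parse_a3_alignment(line):
    print(token, positions)
```

Tokens with an empty position list, such as "houses" here, are the ones left unaligned in that direction.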

Phrase-Based SMT

Consider the translation of the sentence ”र�में चम्मेंच स� च�व� खा�त� है�”

र�में चम्मेंच स� च�व� खा�त� है�

Ram eats rice with a spoon

Hindi Phrase English phrase Probability

र�में Ram 0.5

र�में न� Ram 0.5

चम्मेंच स� with a spoon 1.0

च�व� rice 1.0

खा�त& है� eats 0.6

खा�त� है� eats 0.4

Moses Steps [1/4]

Training

[../train-factored-phrase-model.perl -scripts-root-dir ../scripts -root-dir . -corpus filename.clean -e en -f hi -lm 0:3:../filename.lm:0]

Preparation of data

Run GIZA++

train.hi

1 इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .

2 स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .

3 छे�ट� बच्चों= क� में�दिर= में� �� जै�या� जै�त� है� और उनक� परिरचया ब3ध्दि@ एव� ज्ञा�न क, �व& , सरस्वत& क� आग� वर्ण�में�� क� अक्षेर= स� करव�या� जै�त� है� .

train.en

1 the area has got the digamber jain temple which houses the birds hospital .

2 places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .

3 young children are taken to the temples and are introduced to the letters of the alphabet in front of saraswati , the goddess of wisdom and learning .

hi-en.A3.final

# Sentence pair (1) source length 14 target length 19 alignment score : 8.99895e-36

इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .

NULL ({ 18 }) the ({ 1 }) area ({ 2 }) has ({ 10 }) got ({ 7 }) the ({ 3 }) digamber ({ 4 8 9 13 14 16 17 }) jain ({ 5 }) temple ({ 6 }) which ({ 11 }) houses ({ }) the ({ 12 }) birds ({ 15 }) hospital ({ }) . ({ 19 })

# Sentence pair (2) source length 27 target length 33 alignment score : 2.45498e-47

स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .

NULL ({ 11 }) places ({ 1 }) where ({ 2 }) a ({ 3 }) tourist ({ 4 }) can ({ }) whiz ({ }) past ({ }) his ({ 5 }) worries ({ 6 7 }) include ({ }) gulmarg ({ 17 }) in ({ 16 }) jammu ({ 13 }) and ({ 14 }) kashmir ({ 15 }) , ({ 18 }) auli ({ 19 }) in ({ 20 }) grawhal ({ 8 9 10 21 }) , ({ 12 22 }) kufri ({ 26 }) and ({ 27 }) narkanda ({ 28 29 30 31 32 }) in ({ 25 }) himachal ({ 23 }) pradesh ({ 24 }) . ({ 33 })

en-hi.A3.final

# Sentence pair (1) source length 19 target length 14 alignment score : 3.37018e-21

the area has got the digamber jain temple which houses the birds hospital .

NULL ({ 5 }) इस ({ 1 }) क्षे�त्र ({ 2 }) में� ({ }) दिगम्बर ({ 3 4 6 }) जै�न ({ 7 }) में�दिर ({ 8 }) प्रा�प्त ({ }) कर ({ }) लि�या� ({ }) है� ({ }) जै� ({ 9 }) , ({ }) बर्ड्� �स ({ 12 }) अस्पत�� ({ 13 }) क� ({ 11 }) स�चया ({ 10 }) करत� ({ }) है� ({ }) . ({ 14 })

# Sentence pair (2) source length 33 target length 27 alignment score : 4.73882e-36

places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .

NULL ({ }) स्था�न ({ 1 }) जैहै�� ({ 2 }) एक ({ 3 }) पया�टक ({ 4 }) उसक, ({ 8 }) लिचन्त�ओं ({ 6 7 9 }) क� ({ }) प&छे� ({ }) छे�र्ड् ({ }) सकत� ({ 5 }) है� ({ }) , ({ }) जैम्में0 ({ 13 }) और ({ 14 }) कश्में&र ({ 15 }) में� ({ 12 }) ग3�मेंग� ({ 10 11 }) , ({ 16 }) गढव�� ({ }) में� ({ }) औ�& ({ 17 18 19 }) , ({ 20 }) हिहैमें�च� ({ 25 }) प्रा�श ({ 26 }) में� ({ 24 }) क3 फ्री, ({ 21 }) और ({ 22 }) न�रूण्र्ड्� ({ 23 }) क� ({ }) सम्मिम्मेंलि�त ({ }) करत� ({ }) हुया� ({ }) . ({ 27 })

Moses Steps [2/4]

Align words

To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic grow-diag-final starts with the intersection of the two alignments and then adds additional alignment points.
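The idea behind the heuristic can be sketched as follows. This is a simplified "grow-diag" step only, without the final pass, and it is not the Moses implementation; the function name, the neighbourhood constant and the toy alignment sets are assumptions made for illustration.

```python
# Neighbouring positions, including diagonals, used when growing.
NEIGHBOURS = [(-1, 0), (0, -1), (1, 0), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def grow_diag(e2f, f2e):
    """Start from the intersection of two alignments, given as sets of
    (source_index, target_index) pairs, and repeatedly add neighbouring
    points from the union when they cover a so-far unaligned word."""
    union = e2f | f2e
    alignment = set(e2f & f2e)
    added = True
    while added:
        added = False
        for i, j in sorted(alignment):
            for di, dj in NEIGHBOURS:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    src_free = all(a[0] != cand[0] for a in alignment)
                    tgt_free = all(a[1] != cand[1] for a in alignment)
                    if src_free or tgt_free:
                        alignment.add(cand)
                        added = True
    return alignment

e2f = {(0, 0), (1, 1), (2, 1), (5, 5)}
f2e = {(0, 0), (1, 1), (1, 2)}
print(sorted(grow_diag(e2f, f2e)))
```

In this toy case the isolated point (5, 5) stays out because it has no neighbour in the growing alignment; handling such leftovers is the job of the final pass that this sketch omits.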

Get lexical translation table

aligned.grow-diag-final

1. इस क्षे�त्र में� दिगम्बर जै�न में�दिर प्रा�प्त कर लि�या� है� जै� , बर्ड्� �स अस्पत�� क� स�चया करत� है� .

the area has got the digamber jain temple which houses the birds hospital .

0-0 1-1 3-2 9-2 3-3 6-3 2-4 3-5 7-5 8-5 16-5 4-6 5-7 10-8 15-9 11-10 14-10 12-11 13-12 18-13

2. स्था�न जैहै�� एक पया�टक उसक, लिचन्त�ओं क� प&छे� छे�र्ड् सकत� है� , जैम्में0 और कश्में&र में� ग3�मेंग� , गढव�� में� औ�& , हिहैमें�च� प्रा�श में� क3 फ्री, और न�रूण्र्ड्� क� सम्मिम्मेंलि�त करत� हुया� .

places where a tourist can whiz past his worries include gulmarg in jammu and kashmir , auli in grawhal , kufri and narkanda in himachal pradesh .

0-0 1-1 2-2 3-3 9-4 5-5 5-6 4-7 5-8 6-8 16-9 16-10 15-11 12-12 13-13 14-14 17-15 18-16 19-17 7-18 8-18 20-18 11-19 21-19 25-20 26-21 27-22 28-22 29-22 30-22 31-22 24-23 22-24 23-25 32-26

model/lex.h2e

बEक banking 0.0588235

बEक bank 0.2571429

बEक several 0.0116279

बEक banks 0.1269841

बEक sterling 0.0526316

बEक paperwork 0.2857143

या0हिनयान union 0.1142857

अन्तिन्तमें success 0.0909091

अन्तिन्तमें final 0.1111111

अन्तिन्तमें eighties 0.1428571

अन्तिन्तमें last 0.0933333

अन्तिन्तमें terminus 0.0476190
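A lexical translation table of this kind can be sketched as relative frequencies over alignment links. The snippet below is an assumption about how such values arise, not the Moses code; the romanized tokens, the data layout and the function name `lexical_table` are hypothetical.

```python
from collections import Counter

def lexical_table(sentence_pairs):
    """sentence_pairs: iterable of (hindi_tokens, english_tokens, links),
    where links are 0-based (hindi_index, english_index) pairs.
    Returns {(hindi_word, english_word): count(h, e) / count(h)}."""
    pair_counts, hindi_counts = Counter(), Counter()
    for hi, en, links in sentence_pairs:
        for i, j in links:
            pair_counts[(hi[i], en[j])] += 1
            hindi_counts[hi[i]] += 1
    return {(h, e): c / hindi_counts[h] for (h, e), c in pair_counts.items()}

# Hypothetical romanized toy data.
DATA = [
    (["baink"], ["bank"], [(0, 0)]),
    (["baink"], ["banks"], [(0, 0)]),
    (["baink", "band"], ["bank", "closed"], [(0, 0), (1, 1)]),
]
for (h, e), p in sorted(lexical_table(DATA).items()):
    print(h, e, round(p, 4))
```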

Moses Steps [3/4]

Extract Phrases

The extract file contains one entry per line: the Hindi phrase, the English phrase and the alignment points. Alignment points are pairs (Hindi, English). Also, an inverted alignment file extract.inv is generated.

model/extract.0-0

इस ||| the ||| 0-0

इस क्षे�त्र ||| the area ||| 0-0 1-1

क्षे�त्र ||| area ||| 0-0

में� ||| the ||| 0-0

जै�न ||| jain ||| 0-0

जै�न में�दिर ||| jain temple ||| 0-0 1-1

में�दिर ||| temple ||| 0-0

जै� ||| which ||| 0-0

जै� , बर्ड्� �स अस्पत�� क� स�चया ||| which houses the birds hospital ||| 0-0 5-1 1-2 4-2 2-3 3-4

स�चया ||| houses ||| 0-0

, बर्ड्� �स अस्पत�� क� स�चया ||| houses the birds hospital ||| 4-0 0-1 3-1 1-2 2-3

, बर्ड्� �स अस्पत�� क� ||| the birds hospital ||| 0-0 3-0 1-1 2-2

बर्ड्� �स ||| birds ||| 0-0

बर्ड्� �स अस्पत�� ||| birds hospital ||| 0-0 1-1

अस्पत�� ||| hospital ||| 0-0

. ||| . ||| 0-0

है� . ||| . ||| 1-0
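Phrase pairs like the ones listed can be obtained by keeping only spans that are consistent with the word alignment. The following is a minimal sketch of that consistency check, not the Moses extractor: it does not extend phrases over unaligned target words, and the romanized tokens, the `max_len` default and the function name are assumptions.

```python
def extract_phrases(src, tgt, links, max_len=7):
    """Return (source_phrase, target_phrase) pairs consistent with links,
    a set of 0-based (source_index, target_index) alignment points."""
    phrases = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to the chosen source span.
            tgt_pos = [j for (i, j) in links if s_start <= i <= s_end]
            if not tgt_pos:
                continue
            t_start, t_end = min(tgt_pos), max(tgt_pos)
            if t_end - t_start >= max_len:
                continue
            # Reject the pair if a target word inside the span is linked
            # to a source word outside the span.
            if any(t_start <= j <= t_end and not (s_start <= i <= s_end)
                   for (i, j) in links):
                continue
            phrases.append((" ".join(src[s_start:s_end + 1]),
                            " ".join(tgt[t_start:t_end + 1])))
    return phrases

print(extract_phrases(["jain", "mandir"], ["jain", "temple"], {(0, 0), (1, 1)}))
```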

Moses Steps [4/4]

Score Phrases

A translation table is created from the stored phrase translation pairs.

जै�न ||| jain ||| (0) ||| (0) ||| 1 0.981818 0.857143 0.915254 2.718

क्षे�त्र ||| area ||| (0) ||| (0) ||| 0.8375 0.671779 0.503759 0.376936 2.718

बर्ड्� �स ||| birds ||| (0) ||| (0) ||| 0.0175439 0.0147059 1 1 2.718

बर्ड्� �स अस्पत�� ||| birds hospital ||| (0) (1) ||| (0) (1) ||| 1 0.0026738 1 0.5 2.718

अस्पत�� ||| hospital ||| (0) ||| (0) ||| 0.4 0.181818 1 0.5 2.718

स�चया ||| houses ||| (0) ||| (0) ||| 0.0327869 0.0134529 1 0.5 2.718

में�दिर ||| temple ||| (0) ||| (0) ||| 0.864903 0.768421 0.763838 0.760417 2.718

The five scores of each entry are, in order:

Phrase translation probability (f|e)

Lexical Weighting lex(f|e)

Phrase translation probability (e|f)

Lexical Weighting lex(e|f)

Phrase Penalty: always exp(1)=2.718
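A minimal sketch of reading one translation-table entry into named fields, assuming the `|||`-separated layout and the five-score order shown on this slide. The class name, the field names and the romanized source phrase are hypothetical.

```python
from typing import NamedTuple

class PhraseEntry(NamedTuple):
    source: str
    target: str
    p_f_given_e: float     # phrase translation probability (f|e)
    lex_f_given_e: float   # lexical weighting lex(f|e)
    p_e_given_f: float     # phrase translation probability (e|f)
    lex_e_given_f: float   # lexical weighting lex(e|f)
    phrase_penalty: float  # constant, exp(1) rounded to 2.718

def parse_phrase_table_line(line):
    """Split on '|||' and convert the last field into five float scores."""
    fields = [part.strip() for part in line.split("|||")]
    scores = [float(x) for x in fields[-1].split()]
    return PhraseEntry(fields[0], fields[1], *scores)

entry = parse_phrase_table_line(
    "mandir ||| temple ||| (0) ||| (0) ||| 0.864903 0.768421 0.763838 0.760417 2.718"
)
print(entry.target, entry.p_e_given_f)
```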

Decoding

Hindi sentence: र�में चम्मेंच स� च�व� खा�त� है�

Phrase table entry [ र�में Ram]

Probability=p1 p1=p(र�में|Ram)*pLM(Ram|<start>)*d(0)

h= * चम्मेंच स� च�व� खा�त� है� e= Ram

Phrase table entry [ खा�त� है� eats]

Probability=p1*p2 p2=p( खा�त� है�|eats)*pLM(eats|Ram<start>)*d(2)

h= * चम्मेंच स� च�व� * * e= Ram eats

Phrase table entry [ च�व� rice]

Probability=p1*p2*p3 p3=p(च�व�|rice)*pLM(rice|eats<start>)*d(2)

h= * चम्मेंच स� * * * e= Ram eats rice

Phrase table entry [ चम्मेंच स� with a spoon]

Probability=p1*p2*p3*p4 p4=p( चम्मेंच स�|with a spoon)*pLM(with a spoon|rice<start>)*d(2)

h= * * * * * * e= Ram eats rice with a spoon
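The running product p1*p2*p3*p4 can be sketched in a few lines. The phrase probabilities below are taken from the phrase table on the Phrase-Based SMT slide and the distortion distances from the d(.) terms above; the language model probabilities and the exponential distortion form alpha^|distance| with alpha = 0.5 are hypothetical values chosen only for illustration.

```python
import math

def distortion(distance, alpha=0.5):
    """Hypothetical distortion model: alpha raised to the jump distance."""
    return alpha ** abs(distance)

def hypothesis_score(steps):
    """steps: list of (phrase_prob, lm_prob, distortion_distance).
    Multiplies all factors, accumulating in log space."""
    log_p = 0.0
    for phrase_prob, lm_prob, dist in steps:
        log_p += (math.log(phrase_prob) + math.log(lm_prob)
                  + math.log(distortion(dist)))
    return math.exp(log_p)

STEPS = [
    (0.5, 0.10, 0),  # p1: Ram
    (0.4, 0.20, 2),  # p2: eats
    (1.0, 0.30, 2),  # p3: rice
    (1.0, 0.25, 2),  # p4: with a spoon
]
print(hypothesis_score(STEPS))
```

A real decoder keeps many such partial hypotheses at once and prunes the low-scoring ones; this sketch only scores the single path shown above.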

Some Positive Results

H: शब्शहै� धमें�श�� क� मेंत�ब पहिवत्र शरर्णस्था�� हैE .

E: dharamshala literally means ' the holy refuge .

H: फत3हैप3र स&कर& �� ब�3आ पत्थर में� एक मेंहै�क�व्य है� .

E: fatehpur sikri is an epic in red sandstone .

H: क3 ल्�3 घा�टN भी& व��& ओह्फ ग�ह्र्ड्�स क� न�में स� प्राचलि�त है� .

E: the kullu valley also known as the valley of the gods .

H: वस्त3ओं क, ग3र्णवत्ता� परिरवत�नश&� है� , पर�त3 आपक� अच्छा� अस�& सT� मिमें� सकत� है� .

E: the quality of goods varies , but you may well find a genuine bargain .

H: हिहैमें�च� प्रा�श क, र�जैध�न& लिशमें�� क� पहै�र्ड्& स्ट�शन= क, र�न& कहै� जै�त� हैE .

E: shimla the capital of himachal pradesh , called the queen of hill stations .

H: क्व&न हिवक्ट�रिरया� न� ब्��कफ्रीयास� हिWजै क� श3भी�र�भी नवम्बर 1869 में� हिकया� .

E: queen victoria opened blackfriars bridge in november 1869 .

H: र�जैघा�ट यामें3न� क� हिकन�र� मेंहै�त्में� ग��ध& क� श��त स्में�रक है� .

E: on the banks of yamuna raj ghat is the serene memorial of mahatma gandhi .

Error Analysis

H: ब�हिकघामें प��स मेंहै�र�न& तथा� या3वर�जै हिफलि�प्स क� ��न हिनव�स है� .

E: queen and prince philip buckingham palace is the london home of the .

Error: The translated sentence follows a wrong word order.

H: व�ज्ञा�हिनक तर&क� स� एक दिव्य स�र क� लि�ए त�र�मेंण्र्ड्� आए� .

E: scientific a celestial trip to आए� planetarium .

Error: Since 'आए�' is not present in the phrase table, the word is left unaltered. Phrases are selected from the phrase table in the decoding step, and for [ आए� ; 9-9] this selection was not carried out.

H: ऊं� ट सफ�रिरया�� अपन& उत्पत्तित्ता क� भी�रत एव� च&न क� ब&च व्य�प�र क� समेंया में� लिचध्दिन्हैत करत& हैE जैब ऊं� ट क�रव= मेंस��= , जैर्ड्&ब0दिटया= एव� रत्न= स� �� हुए स्था�हिपत व्य�प�र में�ग\ क� स�था या�त्र� करत� था� .

E: camel safaris india and china its origin to the time of trade between mark of when camel caravans spices and herbs , precious stones , from established trade routes laden with travel and

Error: Every word in the sentence is translated properly, but the word order is wrong.

Evaluation Criteria

Automatic Evaluation

BLEU: measures n-gram precision of a translation with respect to given reference translations

Higher score indicates better translation
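A minimal sketch of the BLEU idea for a single sentence and a single reference: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It returns a value between 0 and 1 and is not the evaluation script used for the scores in this presentation; the function names and the reference sentence in the example are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

hyp = "the kullu valley also known as the valley of the gods ."
ref = "the kullu valley is also known as the valley of the gods ."
print(bleu(hyp, ref))
```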

Subjective Evaluation

Translations are judged by human evaluators on fluency and adequacy on a scale of 1 to 5

Subjective Evaluation

Fluency

Level Interpretation

5 Flawless English, with no grammatical errors whatsoever

4 Good English, with a few minor errors in morphology

3 Non-native English, possibly a few minor grammatical errors

2 Disfluent English, with most phrases correct, but ungrammatical overall

1 Incomprehensible

Adequacy

Level Interpretation

5 All meaning is conveyed

4 Most of the meaning is conveyed

3 Much of the meaning is conveyed

2 Little meaning is conveyed

1 None of the meaning is conveyed

BLEU Score Evaluation

Input Type BLEU

Baseline(wx-input) 26.06

Short Sentences(wx-input) 28.73

Baseline(Unicode-input) 26.12

Short Sentences(Unicode-input) 26.59

Results

The 400 Hindi-to-English translated test sentences were manually sorted into four categories

Category Count Description

Excellent 17 The sentences are completely fluent and adequate

Good 73 Majority of the sentences would make complete sense if the word order is corrected

Mediocre 171 Word order problem, with some words not being translated

Bad 139 Word order issue, words not being translated and skipping of some words

Conclusion and Future Work

Shorter sentences give a better BLEU score when translated

POS tagging, morphological analysis and chunking of the Hindi sentences, followed by the application of reordering rules, is an experiment that is still in progress

Significant improvement in the word order is expected


Thank You

Extra-Slides

Chunk-Level Reordering

Reordering Rules

Tokenizing the input sentence

POS tagging done to the sentence

Morphological analysis performed on the sentence

Chunking is done to the input sentence that is tokenized + POS-tagged + morph-analysed

Determining the subject, object and verb chunks

SOV to SVO reordering

Reordering the prepositions

Modifier reordering
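The SOV to SVO step can be sketched on already-labelled chunks. This is a toy illustration under strong assumptions: the chunk roles "S", "O" and "V" are taken as given by the earlier steps, preposition and modifier reordering are left out, and the romanized chunks and the function name are hypothetical.

```python
def sov_to_svo(chunks):
    """chunks: list of (role, text) with roles such as "S", "O", "V".
    Moves all verb chunks to directly after the first subject chunk;
    returns the input order unchanged if there is no subject or no verb."""
    roles = [role for role, _ in chunks]
    if "S" not in roles or "V" not in roles:
        return list(chunks)
    verbs = [c for c in chunks if c[0] == "V"]
    rest = [c for c in chunks if c[0] != "V"]
    first_subject = next(i for i, c in enumerate(rest) if c[0] == "S")
    return rest[:first_subject + 1] + verbs + rest[first_subject + 1:]

chunks = [("S", "raam"), ("O", "chaaval"), ("V", "khaataa hai")]
print(" ".join(text for _, text in sov_to_svo(chunks)))
```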
