A SEMINAR REPORT ON English-Assamese Statistical Machine Translation using Moses by KALYANEE KANCHAN BARUAAH


TRANSCRIPT

1. A SEMINAR REPORT ON English-Assamese Statistical Machine Translation using Moses by KALYANEE KANCHAN BARUAAH

2. Contents
Introduction; Literature Review; Implementation; Problems and proposed solutions; Results and Evaluation; Conclusion and future work; References

3. Introduction
Natural Language Processing; Machine Translation; Need for Machine Translation; Problems in MT; Approaches to machine translation: Direct-based MT, Rule-based MT, Corpus-based MT, Knowledge-based MT; Some existing MT systems; Statistical Machine Translation

4. Introduction
NLP (Natural Language Processing) deals with understanding and developing computational theories of human language. Such theories allow us to understand the structure of a language and to build computer software that can process it. NLP plays a major role in human-machine as well as human-human communication. Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

5. Machine Translation
Machine translation is the application of computer and language sciences to the development of translation systems that answer practical needs. Need for Machine Translation: to translate literary works from any language into native languages; most of the information available is in English, which is understood by only about 3% of the population; to make rich sources of literature available to people across the world. Problems in Machine Translation: translation is not straightforward; automating it is not easy; idioms; ambiguity.

6. Machine Translation
MT Approaches (figure)

7. MT Approaches
Direct MT is the most basic form of MT. It translates the individual words in a sentence from one language to another using a two-way dictionary.
It makes use of very simple grammar rules: little analysis of the source language, no parsing, and reliance on a large two-way dictionary. Rule-Based Machine Translation (RBMT, also known as Knowledge-Based Machine Translation, the classical approach to MT) is a general term for machine translation systems based on linguistic information about the source and target languages, retrieved mainly from (bilingual) dictionaries and grammars that cover the main semantic, morphological, and syntactic regularities of each language.

8. MT Approaches
Interlingua-based Machine Translation (Fig 2.3: Multilingual MT system with the interlingua approach). Interlingua machine translation converts words into a universal intermediate language created purely for MT, and translates from it into more than one target language. It can be used in applications like information retrieval, and it is more practical when several languages are involved, since each source text needs to be analysed only once.

9. MT Approaches
Transfer-based MT follows the same idea as interlingua: to produce a translation, an intermediate representation that captures the "meaning" of the original sentence is needed in order to generate the correct translation. In interlingua-based MT this intermediate representation must be independent of the languages in question, whereas in transfer-based MT it has some dependence on the language pair involved [9]. Knowledge-based MT: semantic-based approaches to language analysis were introduced by AI researchers. These approaches require a large knowledge base that includes both ontological and lexical knowledge [6].

10. MT Approaches
Corpus-based machine translation is classified into statistical and example-based machine translation. Example-based MT systems use previous translation examples to generate a translation for a given input.
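The direct word-for-word approach described in slide 7 amounts to a dictionary lookup per token. A minimal sketch (the tiny English-to-romanized-Assamese dictionary below is invented for illustration, not taken from the report's corpus):

```python
# Direct MT sketch: translate word by word with a two-way dictionary,
# keeping the source word order. Unknown (OOV) words pass through unchanged.
# Dictionary entries are illustrative romanizations, not real project data.

def direct_translate(sentence, dictionary):
    return " ".join(dictionary.get(word.lower(), word) for word in sentence.split())

en_as = {          # hypothetical English -> romanized Assamese entries
    "water": "pani",
    "house": "ghor",
    "big": "dangor",
}

print(direct_translate("big house", en_as))   # -> dangor ghor
print(direct_translate("big palace", en_as))  # -> dangor palace (OOV kept)
```

Such a system inherits the source language's word order and cannot resolve ambiguity, which is exactly why "little analysis, no parsing" makes direct MT so weak.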
When an input sentence is presented to the system, it retrieves a similar source sentence from the example base together with its translation. Statistical Machine Translation (SMT) requires less human effort to undertake translation; it is a machine translation paradigm where translations are generated on the basis of statistical models. Example-based MT systems use a variety of linguistic resources such as dictionaries and thesauri to translate text, whereas statistical MT uses purely statistical methods to align words and generate text.

11. SOME EXISTING MT SYSTEMS
Google Translate; Systran; Bing Translator; Babel Fish

12. Statistical Machine Translation
Statistical machine translation consists of a Language Model (LM), a Translation Model (TM), and a decoder. The language model encourages fluent output, the translation model encourages similarity between input and output, and the decoder searches for the target-language sentence that maximizes the probability of the translation. SMT is based on ideas from information theory, in particular Shannon's noisy-channel model, whose purpose is to recover a message that was transmitted through a communication channel and is hence prone to errors due to the channel's quality.

13. Statistical Machine Translation
A parallel corpus C is a collection of text chunks and their translations (a by-product of human translation). Given a source sentence f, select the target sentence e* = argmax_e p(e|f) = argmax_e p(e) * p(f|e), where E(f) is the set of hypothesized translations of f. Modelling p(f|e) is hard because the two languages diverge in word order, morphology, syntactic relations, and idiomatic ways of expression, and the model must be estimated from sparse datasets.

14. Statistical Machine Translation (figure)

15. SMT - Language Model
A language model gives the probability of a sentence. The probability is computed using an n-gram model.
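The noisy-channel decision rule of slide 13 — pick the e that maximizes p(e) * p(f|e) — can be illustrated with made-up probabilities (all numbers below are invented; a real decoder searches a huge hypothesis space rather than scoring a fixed list):

```python
# Toy illustration of the noisy-channel decision rule:
# e* = argmax_e p(e) * p(f|e) over a hypothesized candidate set E(f).
# The candidates and probabilities are made-up numbers for illustration.

candidates = {            # e : (p(e) from the LM, p(f|e) from the TM)
    "the house is big":   (0.020, 0.30),
    "house the big is":   (0.00001, 0.35),   # disfluent: the LM penalizes it
    "the horse is big":   (0.015, 0.01),     # fluent but unfaithful: the TM penalizes it
}

best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)  # -> the house is big
```

The disfluent candidate is eliminated by the language-model term and the unfaithful one by the translation-model term, which is exactly the division of labour slide 12 describes.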
The language model can be viewed as computing the probability of a single word given all of the words that precede it in a sentence. A sentence is decomposed into a product of conditional probabilities: by the chain rule, the probability of a sentence P(S) is broken down into the probabilities of its individual words P(w). An n-gram model simplifies the task by approximating the probability of a word given all the previous words with its probability given only the last n-1 of them.

16. SMT - Language Model (figure)

17. SMT - Translation Model
Example: "Udaipur is a famous city". The translation model helps to compute the conditional probability P(T|S).

18. Implementation
Install all packages in Moses; install GIZA++; install IRSTLM; training; tuning; generate output (decoding).

19. TRAINING THE MOSES DECODER
Prepare data; run GIZA++; align words; get the lexical translation table; extract phrases; score phrases; build the lexicalized reordering model; build generation models; create the configuration file.

20. PREPARING DATA
Tokenising: inserting spaces between words and punctuation. Truecasing: setting the case of the first word in each sentence. Cleaning: removing empty lines, redundant spaces, and lines that are too short or too long.

21. Sample of Parallel Corpus (eng-ass1.en / eng-ass1.as)
Shopping in Udaipur is always a delightful experience and it displays excellent handicrafts and works developed by local traders. September to March is the best season to visit Udaipur. The Shilpagram is designed on the concept of a village with little emphasis on the modern concept. A part of the City Palace is now converted into a museum that displays some of the best forms of art and culture.

22. Sample output (English sentences as input, corresponding output in Assamese)
Kanak Vrindavan is a popular picnic spot in Jaipur. City Palace is a synthesis of Mughal and Rajasthani architecture. Jama Masjid is the largest mosque in India. A part of the City Palace is now converted into a museum.

23. Block diagram showing SMT using Moses (figure)
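The chain-rule and n-gram description of slides 15-16 can be made concrete with a toy bigram model (the two-sentence training "corpus" below is invented, echoing the report's sample sentences):

```python
# Bigram language model sketch: P(S) is decomposed by the chain rule and
# approximated as a product of P(w_i | w_{i-1}), estimated by counting.
# The tiny training corpus is invented for illustration.

from collections import Counter

corpus = [
    "<s> udaipur is a famous city </s>",
    "<s> jaipur is a city </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def sentence_prob(sentence):
    """P(S) as the product of bigram conditional probabilities (MLE, no smoothing)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(sentence_prob("jaipur is a famous city"))  # -> 0.25
```

Real language models (such as the IRSTLM models Moses uses) add smoothing so that unseen n-grams do not force the sentence probability to zero.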
24. Results and Evaluation
The output of the experiment was evaluated using BLEU (Bilingual Evaluation Understudy).

Source/Target      BLEU score
English-Assamese   4.71

If we do Assamese-English translation using the same parallel corpus, a BLEU score of 5.72 is obtained. These scores are very small, which may be because we have used a very small data set. BLEU scores are not commensurable even between different corpora in the same translation direction; BLEU is really only comparable across different systems or system variants on the exact same data.

25. Problems and proposed solution
Problems: we have used a limited amount of English-Assamese parallel corpus, so the translation model is less effective, since its quality improves with the amount of data (parallel corpus) available for training. It is also hard to obtain a good translation for OOV (out-of-vocabulary) words, since the Moses tool either passes OOV words through unchanged or drops them. OOV words are words that are not present in the corpus, such as some proper nouns. We are trying to implement transliteration for these OOV words.

26. Problems and proposed solution
Solution: transliteration, e.g. English-Assamese transliteration (ku-ma-r) -> (Raj-ku-ma-r). For example, when translating the sentence "panaji is a city", we have used the following command to incorporate transliteration into translation:
echo 'panaji is a city' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl
This gives us the output:

27. Conclusion and Future Work
The work can be extended in the following directions. The system can be put in a web-based portal to translate the content of a web page from English to Assamese. We will try to increase the corpus for better training and better efficiency. We will try to develop the translation system on our own instead of using the Moses MT system.
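The BLEU metric used in the evaluation above can be sketched in simplified form. Real BLEU uses clipped n-gram precisions up to n = 4 aggregated over a whole test set; this toy version uses n = 1..2 on a single sentence pair, so its numbers are not comparable to the report's scores:

```python
# Simplified BLEU sketch: geometric mean of clipped n-gram precisions
# times a brevity penalty. Toy version: n = 1..2, single sentence pair.

import math
from collections import Counter

def ngrams(words, n):
    return Counter(zip(*(words[i:] for i in range(n))))

def simple_bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(count, r[gram]) for gram, count in c.items())
        precisions.append(clipped / max(1, sum(c.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(simple_bleu("a part of the palace", "a part of the city palace"), 3))  # -> 0.709
```

The example pair is taken from the report's sample sentences; dropping one word costs both bigram precision and brevity penalty, which is why even near-miss translations score well below 1.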
Since all Indian languages follow SOV order and are relatively rich in morphology, the methodology presented should be applicable to English-to-Indian-language SMT in general. Since morphological and parsing tools are not widely available for Indian languages, an approach like this, which minimizes the use of such tools for the target language, would be quite handy.

28. Conclusion and Future Work
We will try to obtain more corpora from different domains so that a wider vocabulary is covered. Since BLEU is not very informative for rough translations, we also need other evaluation techniques. We should try incorporating shallow syntactic information (POS tags) into our discriminative model to boost translation performance.

29. References
[1] Antony P. J., Machine Translation Approaches and Survey for Indian Languages.
[2] G. Singh and G. Singh Lehal, A Punjabi to Hindi Machine Translation System, Coling 2008: Companion volume - Posters and demonstrations, Manchester, August 2008.
[3] F. J. Och, GIZA++: Training of statistical translation models. [Online]. Available: http://fjoch.com/GIZA++.html
[4] Moses Manual.
[5] Natural language processing, Wikipedia, the free encyclopedia. [Online].
[6] D. D. Rao, Machine Translation: A Gentle Introduction, RESONANCE, July 1998.
[7] Statistical machine translation. [Online]. Available: http://en.wikipedia.org/wiki/Statistical_machine_translation
[8] S. K. Dwivedi and P. P. Sukadeve, Machine Translation System in Indian Perspectives, Journal of Computer Science, Vol. 6, No. 10, pp. 1082-1087, May 2010.
[9] Machine Translation. [Online]. Available: http://www.ida.liu.se/~729G11/projekt/studentpapper-10/maria-hedblom.pdf
[10] Machine Translation. [Online]. Available: http://faculty.ksu.edu.sa/homiedan/Publications/Machine%20Translation.pdf