
02/19/13 English-Indian Language MT (Phase-II)

English – Indian Language Machine Translation

Anuvadaksh Phase – II

- The SMT Team, CDAC Mumbai


English-Indian Language Machine Translation (MT)

Anuvadaksh (Phase-1 Background)

DIT Funded: Consortium Mode Project (10 institutes)

Objective
  Deploy an English-Indian language MT system using 4 engines
  Develop language resources and tools for 2 domains

Phase-I: Achievements
  Deployed an MT system for 3 pairs (English-Hindi, English-Marathi, English-Bengali)
  Two of the four engines gave comparable translations
    CDAC Mumbai: Statistical Machine Translation Engine (first SMT engine to be developed in India under TDIL purview)
    CDAC Pune (Consortium Leader): Tree Adjoining Grammar Engine
  CDAC Mumbai: Language Resources
    15000 sentence corpora developed (Eng-Mar)
    3000 word synset creation

Test Report evaluated by GIST, CDAC Pune


English-Indian Language Machine Translation (MT)

Anuvadaksh (Phase-2) [Jul 2010 - Jul 2013]

CDACM Objective
  Extend the MT system deployed in Phase-I (esp. the SMT engine)
  Improve the SMT engine using reordering and factored models
  Introduce language knowledge with the help of language verticals (conceptualising the hybrid approach)
  Develop language resources in the form of a bilingual corpus for the health domain

Team Members: Rajnath Patel, Rohit Gupta, Ritesh Shah


Anuvadaksh-II: Financial Status

Total Budget Outlay: Rs. 14,99,20,000
CDACM Budget: Rs. 98,79,000 for 3 years (2010-2013)
Total funds released up to 31st Dec 2011: Rs. 31,68,000
Total expenditure up to 31st Dec 2011: Rs. 52,46,767
Deficit incurred: Rs. 20,78,767 (to be adjusted against grant-in-aid, 2012-13)


Anuvadaksh-II: Tasks completed (1)

  Existing web-service mode changed
  Integration of the improved SMT subsystem with the Anuvadaksh system completed successfully
  Development of consistent APIs for easy integration with the EILMT system
  Reordered models for Marathi added
  Integration of all three language modules



  Multi-word expressions (MWE) annotation task: classification of about 1000 words completed
  Wordnet-based dictionary extraction completed
  Report on a pipeline-like architecture for overall improvement of the system prepared
  Consolidation of all the components represented as factors by various language verticals added
  Roles and responsibilities for the respective institutes assigned
  TDIL task: submitted a reference (equivalent to 25 books) to bilingually aligned corpora from the Sahitya Akademi website


Anuvadaksh-II: Tasks completed (3)

SMT engine: Advanced TM module (components or factors could vary across languages)

[Figure: architecture diagram of the advanced TM module. Labels recoverable from the slide:]
  Inputs: bilingual corpus, corpus resources
  Language-vertical modules and the data they produce:
    Morphological Analysis -> Morph tagged data
    POS Tagger -> POS tagged data
    Name Entity Recognition -> NE tagged data
    Multi Word Expression Extraction -> MWE data
    Word Sense Disambiguation -> WSD processed data
    UNL Tagger -> UNL tagged data
    TAG module -> TAG annotated data
    Clause marker -> Clause marked data
    Syntactic reordering component -> Reordered data
  Core TM estimation module -> TM probability phrase table
  Decoder
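The idea of each language vertical contributing one factor per token can be sketched as follows. This is a minimal illustration, not the Anuvadaksh implementation: it assumes the pipe-separated token notation commonly used by factored SMT toolkits such as Moses, and the factor names, their order, the `NULL` placeholder and the example tags are all hypothetical.

```python
from typing import Dict, List

# Hypothetical factor order; as the slide notes, the actual set of
# factors could vary across languages.
FACTOR_ORDER = ["surface", "lemma", "pos", "ne"]


def to_factored_token(factors: Dict[str, str],
                      order: List[str] = FACTOR_ORDER) -> str:
    """Join one token's factors with '|' in a fixed order.

    Missing factors are filled with the placeholder 'NULL' (an assumption
    made here so that every token carries the same number of factors).
    """
    values = [factors.get(name, "NULL") for name in order]
    for value in values:
        # '|' and whitespace are structural in this notation.
        if "|" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"factor value not representable: {value!r}")
    return "|".join(values)


def to_factored_sentence(tokens: List[Dict[str, str]]) -> str:
    """Render a sentence as space-separated factored tokens."""
    return " ".join(to_factored_token(tok) for tok in tokens)


# Illustrative annotations, as if merged from separate MA, POS and NER outputs.
sentence = [
    {"surface": "temples", "lemma": "temple", "pos": "NNS", "ne": "O"},
    {"surface": "in", "lemma": "in", "pos": "IN", "ne": "O"},
    {"surface": "Mumbai", "lemma": "Mumbai", "pos": "NNP", "ne": "LOC"},
]
print(to_factored_sentence(sentence))
```

Keeping the factor order in one place means a language pair that lacks a given vertical can drop that factor without changing the merging code.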


Enhancement of SMT engine (C-DAC Mumbai & IIT-Bombay)

Source pre-processing and responsible institutes:

Source pre-processing      Institutes responsible
MA                         JU
POS                        Stanford POS
NER                        IIT-B NER
MWE                        IIT-B, CDAC-M, JU
WSD                        IIT-B
UNL semantic mapping       IIT-B
TAG parsed output          CDAC-P
Clause boundary marking    IIIT-H



Enhancement of SMT engine (Contd…) (C-DAC Mumbai, IIT-Bombay)

Target pre-processing and language model

Language pair   MA (segmentation &     POS                    Remaining columns (*)
(E = English)   case marker)
E-Hindi         IIIT-H (ILMT)          IIIT-H (ILMT)          IIIT-H
E-Marathi       IIT-B (ILMT)           IIT-B (ILMT)           IIT-B
E-Bangla        JU (ILMT)              JU (ILMT)              JU
E-Tamil         AU (ILMT)              AU (ILMT)              AU
E-Urdu          IIIT-A (ILMT)          IIIT-A (ILMT)          IIIT-A
E-Oriya         UU, CDAC-P             UU (IIIT-BHU, CLIA)    UU, CDAC-P
E-Gujarati      DDU                    DDU (DICT, CLIA)       DDU
E-Bodo          NEHU, CDAC-P           NEHU, (CLIA)           NEHU, CDAC-P

(*) The slide lists the same entry in each of the six remaining columns: NER (JU), MWE (IIT-B), WSD (IIT-B), Source re-ordering (CDAC-M) and Transliteration (IIIT-A) under target pre-processing, and LM Development (CDAC-M) under language model.




LMs created using various smoothing techniques
  Hindi (15000 sentences + BBC monolingual corpus)
  Marathi (13000 sentences)
  Bengali (14000 sentences)
  Tamil (14000 sentences)
  Gujarati (2000 sentences)
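The slide does not name the smoothing techniques that were used. As a rough illustration of what smoothing does in an n-gram language model, the sketch below implements add-k (Laplace) smoothing for a bigram model. The class name `AddKBigramLM`, the toy corpus and the choice of add-k are assumptions for illustration only; production LMs typically rely on stronger methods such as Kneser-Ney.

```python
from collections import Counter
from typing import List


class AddKBigramLM:
    """Bigram language model with add-k smoothing (illustrative only)."""

    def __init__(self, sentences: List[List[str]], k: float = 1.0):
        self.k = k
        self.history = Counter()   # how often each token occurs as a bigram history
        self.bigrams = Counter()   # (previous, current) counts
        vocab = set()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.history[prev] += 1
                self.bigrams[(prev, cur)] += 1
        self.vocab = vocab

    def prob(self, prev: str, cur: str) -> float:
        """P(cur | prev) with k added to every bigram count.

        Out-of-vocabulary handling is deliberately omitted in this sketch.
        """
        numerator = self.bigrams[(prev, cur)] + self.k
        denominator = self.history[prev] + self.k * len(self.vocab)
        return numerator / denominator


# Toy corpus standing in for a monolingual target-language corpus.
lm = AddKBigramLM([["a", "b"], ["a", "c"]], k=1.0)
print(lm.prob("a", "b"))      # seen bigram
print(lm.prob("a", "</s>"))   # unseen bigram still gets non-zero probability
```

The point of smoothing is visible in the last line: a bigram never observed in training still receives some probability mass, which matters for small corpora such as the 2000-sentence Gujarati set.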


Anuvadaksh-II: Achievements

Reordered + factored (improvements for Hindi)
  Source side factor (POS)
  BLEU (Non-Factored): 32.45
  BLEU (Factored): 32.93

Reordered Baseline (good for Marathi)

Standardized XML log format updated as per the requirements
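For readers unfamiliar with the BLEU figures quoted for the factored and non-factored Hindi systems, the following is a minimal sketch of corpus-level BLEU with one reference per sentence: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. The slide does not say which BLEU implementation or tokenisation produced its scores, so this is an assumption-laden illustration rather than the evaluation script that was used.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[List[str]],
                references: List[List[str]],
                max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale, single reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


hyp = [["the", "cat", "sat", "on", "the", "mat"]]
ref = [["the", "cat", "sat", "on", "a", "mat"]]
print(round(corpus_bleu(hyp, ref), 2))
```

Because the score is computed over the whole test set, small differences such as 32.45 versus 32.93 reflect aggregate n-gram overlap rather than per-sentence quality.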


Publication:

Learning Improved Models for Urdu, Farsi and Italian using SMT. Rohit Gupta, Raj N. Patel and Ritesh Shah. Proceedings of the First Workshop on Reordering for Statistical Machine Translation, COLING 2012, Mumbai, India, December 8-15, 2012.

  Applying statistical MT techniques to learn improved reordering models
  Study of correlation between reordering and distortion parameters for the English-Urdu pair, among others
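The slide does not define the distortion parameters it refers to. In phrase-based SMT, distortion is commonly measured as the distance a decoder jumps in the source sentence between consecutively translated phrases, and a distortion limit caps that jump. The sketch below shows this standard distance; the function names and the example spans are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) source positions, 0-based, inclusive


def jump_distances(spans: List[Span]) -> List[int]:
    """Distance |start_i - end_(i-1) - 1| for each phrase, in translation order."""
    distances = []
    prev_end = -1  # position just before the first source word
    for start, end in spans:
        distances.append(abs(start - prev_end - 1))
        prev_end = end
    return distances


def distortion_cost(spans: List[Span]) -> int:
    """Total distortion of one phrase ordering; 0 means monotone translation."""
    return sum(jump_distances(spans))


def within_limit(spans: List[Span], limit: int) -> bool:
    """True if no single jump exceeds the distortion limit."""
    return all(d <= limit for d in jump_distances(spans))


monotone = [(0, 1), (2, 2), (3, 5)]
reordered = [(0, 0), (3, 4), (1, 2)]   # e.g. moving a verb phrase to the end
print(distortion_cost(monotone))    # → 0
print(distortion_cost(reordered))   # → 6
print(within_limit(reordered, 3))   # → False
```

For a pair such as English-Urdu, where the verb typically moves from the middle of the sentence to the end, a tight distortion limit can rule out the required jumps, which is one reason source-side reordering and distortion settings interact.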



SMT Improvements (Hindi)

Corpus: EILMT Tourism Corpus (approx. 15000 sentences)

Grade point (0-4)    Version 2.0 (Feb 2013)    Version 1.0 (July 2012)
4 (>=80%)            39%                       14%
3 (60%-79%)          26%                       18%
2 (40%-59%)          25%                       26%
1 (20%-39%)          10%                       37%
0 (<20%)             0%                        5%
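One way to compare the two Hindi versions at a glance is to collapse each grade distribution into a weighted mean grade. The mean is not a figure reported on the slide; it is a summary introduced here for illustration, computed only from the percentages given for Version 1.0 and Version 2.0.

```python
from typing import Dict

# Share of evaluated sentences (in percent) per grade point, as given on the slide.
GRADE_SHARE: Dict[str, Dict[int, int]] = {
    "Version 1.0 (July 2012)": {4: 14, 3: 18, 2: 26, 1: 37, 0: 5},
    "Version 2.0 (Feb 2013)": {4: 39, 3: 26, 2: 25, 1: 10, 0: 0},
}


def mean_grade(share_percent: Dict[int, int]) -> float:
    """Weighted average grade point for one distribution."""
    total = sum(share_percent.values())
    return sum(grade * pct for grade, pct in share_percent.items()) / total


for version, shares in GRADE_SHARE.items():
    assert sum(shares.values()) == 100  # sanity check on the transcribed table
    print(f"{version}: mean grade {mean_grade(shares):.2f}")
# → Version 1.0 (July 2012): mean grade 1.99
# → Version 2.0 (Feb 2013): mean grade 2.94
```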




SMT Improvements (Marathi)

Corpus: EILMT Tourism Corpus (approx. 13000 sentences)

Grade point (0-4)    Baseline (Eval 1)    Baseline (Eval 2)    Reordered
4 (>=80%)            24%                  20%                  10%
3 (60%-79%)          26%                  23%                  31%
2 (40%-59%)          15%                  25%                  43%
1 (20%-39%)          34%                  25%                  16%
0 (<20%)             1%                   7%                   0%


Anuvadaksh-II: Future Plan

Use factored model in the Statistical MT engine to enhance translations for all languages in the tourism domain

For the health domain specifically, obtain translations using existing resources and evaluate basic coverage of grammar for this domain

The entire system with its hybrid approach has to be deployed efficiently and the outputs have to be sent to the testing team at CDAC Pune.


Thank you
