02/19/13 English-Indian Language MT (Phase-II) 1
English – Indian Language Machine Translation
Anuvadaksh Phase – II
- The SMT Team, CDAC Mumbai
English-Indian Language Machine Translation (MT)
Anuvadaksh (Phase-I Background)
DIT-funded consortium-mode project (10 institutes)
Objective:
  Deploy an English-Indian language MT system using 4 engines
  Develop language resources and tools for 2 domains
Phase-I achievements:
  Deployed an MT system for 3 language pairs (English-Hindi, English-Marathi, English-Bengali)
  Two of the four engines gave comparable translations:
    CDAC Mumbai: Statistical Machine Translation engine (first SMT engine to be developed in India under TDIL purview)
    CDAC Pune (consortium leader): Tree Adjoining Grammar engine
CDAC Mumbai language resources:
  15000-sentence corpus developed (English-Marathi)
  3000-word synset creation
Test report evaluated by GIST, CDAC Pune
English-Indian Language Machine Translation (MT)
Anuvadaksh (Phase-II) [Jul 2010 - Jul 2013]
CDACM objectives:
  Extend the MT system deployed in Phase-I (especially the SMT engine)
  Improve the SMT engine using reordering and factored models
  Introduce language knowledge with the help of language verticals (conceptualising the hybrid approach)
  Develop language resources in the form of a bilingual corpus for the health domain
Team members: Rajnath Patel, Rohit Gupta, Ritesh Shah
Anuvadaksh-II: Financial Status
  Total budget outlay: Rs. 14,99,20,000
  CDACM budget: Rs. 98,79,000 for 3 years (2010-2013)
  Total funds released up to 31st Dec 2011: Rs. 31,68,000
  Total expenditure up to 31st Dec 2011: Rs. 52,46,767
  Deficit incurred: Rs. 20,78,767 (to be adjusted against grant-in-aid, 2012-13)
Anuvadaksh-II: Tasks completed (1)
  Existing web-service mode changed
  Integration of the improved SMT subsystem with the Anuvadaksh system completed successfully
  Development of consistent APIs for easy integration with the EILMT system
  Reordered models for Marathi added
  Integration of all three language modules
Anuvadaksh-II: Tasks completed (2)
  Multi-word expressions (MWE) annotation task: classification of about 1000 words completed
  Wordnet-based dictionary extraction completed
  Report on a pipeline-like architecture for overall improvement of the system prepared
  Consolidation of all the components represented as factors by various language verticals added
  Roles and responsibilities for the respective institutes assigned
  TDIL task: submitted a reference (equivalent to 25 books) to bilingually aligned corpora from the Sahitya Akademi website
Anuvadaksh-II: Tasks completed (3)
SMT engine: Advanced TM module (components or factors could vary across languages)
[Architecture diagram] Corpus resources (bilingual corpus) pass through the following components, each producing one annotated data layer:
  Morphological Analysis -> Morph tagged data
  POS Tagger -> POS tagged data
  Named Entity Recognition -> NE tagged data
  Multi Word Expression Extraction -> MWE data
  Word Sense Disambiguation -> WSD processed data
  UNL Tagger -> UNL tagged data
  TAG module -> TAG annotated data
  Clause marker -> Clause marked data
  Syntactic reordering component -> Reordered data
Core TM estimation module -> TM probability phrase table -> Decoder
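Each component in the diagram contributes one annotation layer ("factor") per token. The sketch below shows one way such layers could be merged into the pipe-separated token format that factored phrase-based toolkits such as Moses read. The function name `to_factored`, the factor order, and the sample tags are illustrative assumptions; the slides do not describe the actual Anuvadaksh data format.

```python
from typing import List, Sequence


def to_factored(tokens: Sequence[str], *factor_layers: Sequence[str], sep: str = "|") -> str:
    """Join surface tokens with parallel annotation layers into one factored line.

    Hypothetical helper: each layer (e.g. POS tags, lemmas) must have exactly
    one entry per token.
    """
    for layer in factor_layers:
        if len(layer) != len(tokens):
            raise ValueError("every factor layer must have one entry per token")

    factored: List[str] = []
    for i, tok in enumerate(tokens):
        parts = [tok] + [layer[i] for layer in factor_layers]
        # The separator is reserved, so it must not appear inside any factor.
        if any(sep in part for part in parts):
            raise ValueError(f"separator {sep!r} must not occur inside a factor")
        factored.append(sep.join(parts))
    return " ".join(factored)


# Toy sentence with two assumed factors: POS tag and lemma.
tokens = ["the", "temple", "is", "old"]
pos = ["DT", "NN", "VBZ", "JJ"]
lemma = ["the", "temple", "be", "old"]

line = to_factored(tokens, pos, lemma)
print(line)  # the|DT|the temple|NN|temple is|VBZ|be old|JJ|old
```

Because the layers are passed as separate parallel sequences, a language for which a given tool is unavailable can simply omit that layer, which matches the note that factors may vary across languages.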
Anuvadaksh-II: Tasks completed (4)
Enhancement of SMT engine (C-DAC Mumbai & IIT-Bombay)
Source pre-processing and responsible institutes:

Source pre-processing | Institute(s) responsible
MA | JU
POS | Stanford POS
NER | IIT-B NER
MWE | IIT-B, CDAC-M, JU
WSD | IIT-B
UNL semantic mapping | IIT-B
TAG parsed output | CDAC-P
Clause boundary marking | IIIT-H
Anuvadaksh-II: Tasks completed (5)
Enhancement of SMT engine (Contd…) (C-DAC Mumbai, IIT-Bombay)
Target pre-processing and language model: responsible institutes per language pair

Language pair | MA (segmentation & case marker) | POS | NER (JU) | MWE (IIT-B) | WSD (IIT-B) | Source re-ordering (CDAC-M) | Transliteration (IIIT-A) | LM development (CDAC-M)
English | | | | | | | |
E-Hindi | IIIT-H (ILMT) | IIIT-H (ILMT) | IIIT-H | IIIT-H | IIIT-H | IIIT-H | IIIT-H | IIIT-H
E-Marathi | IIT-B (ILMT) | IIT-B (ILMT) | IIT-B | IIT-B | IIT-B | IIT-B | IIT-B | IIT-B
E-Bangla | JU (ILMT) | JU (ILMT) | JU | JU | JU | JU | JU | JU
E-Tamil | AU (ILMT) | AU (ILMT) | AU | AU | AU | AU | AU | AU
E-Urdu | IIIT-A (ILMT) | IIIT-A (ILMT) | IIIT-A | IIIT-A | IIIT-A | IIIT-A | IIIT-A | IIIT-A
E-Oriya | UU, CDAC-P | UU (IIIT-BHU, CLIA) | UU, CDAC-P | UU, CDAC-P | UU, CDAC-P | UU, CDAC-P | UU, CDAC-P | UU, CDAC-P
E-Gujarati | DDU | DDU (DICT, CLIA) | DDU | DDU | DDU | DDU | DDU | DDU
E-Bodo | NEHU, CDAC-P | NEHU (CLIA) | NEHU, CDAC-P | NEHU, CDAC-P | NEHU, CDAC-P | NEHU, CDAC-P | NEHU, CDAC-P | NEHU, CDAC-P
Anuvadaksh-II: Tasks completed (6)
LMs created using various smoothing techniques:
  Hindi (15000 sentences + BBC monolingual corpus)
  Marathi (13000 sentences)
  Bengali (14000 sentences)
  Tamil (14000 sentences)
  Gujarati (2000 sentences)
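The slide does not say which smoothing techniques were used. As a rough illustration of what smoothing does in an n-gram language model, here is a minimal add-k bigram model. This is the simplest scheme and is shown only as a sketch; production LM toolkits typically offer stronger methods such as Kneser-Ney. The class name `AddKBigramLM` and the toy romanised sentences are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List


class AddKBigramLM:
    """Toy bigram language model with add-k smoothing (illustrative only)."""

    def __init__(self, sentences: Iterable[List[str]], k: float = 1.0):
        self.k = k
        self.history_counts = Counter()  # how often each word occurs as a history
        self.bigram_counts = Counter()   # how often each (history, word) pair occurs
        self.vocab = {"</s>"}            # words that can be predicted
        for sent in sentences:
            padded = ["<s>"] + list(sent) + ["</s>"]
            self.vocab.update(sent)
            for prev, cur in zip(padded, padded[1:]):
                self.history_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev: str, cur: str) -> float:
        """P(cur | prev): every bigram gets k pseudo-counts, so none is zero."""
        v = len(self.vocab)
        return (self.bigram_counts[(prev, cur)] + self.k) / (
            self.history_counts[prev] + self.k * v
        )

    def sentence_logprob(self, sent: List[str]) -> float:
        """Natural-log probability of a whole sentence, including end marker."""
        padded = ["<s>"] + list(sent) + ["</s>"]
        return sum(math.log(self.prob(p, c)) for p, c in zip(padded, padded[1:]))


# Toy corpus (assumed data, not from the EILMT corpus).
corpus = [["mandir", "purana", "hai"], ["kila", "purana", "hai"]]
lm = AddKBigramLM(corpus, k=1.0)

print(lm.prob("purana", "hai"))   # seen bigram
print(lm.prob("purana", "kila"))  # unseen bigram, still non-zero
print(lm.sentence_logprob(["kila", "purana", "hai"]))
```

With small corpora such as the 2000-sentence Gujarati set, most plausible word pairs are never observed, so the choice of smoothing has a large effect on how the decoder scores candidate translations.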
Anuvadaksh-II: Achievements
Reordered + factored (improvements for Hindi)
  Source-side factor: POS
  BLEU (non-factored): 32.45
  BLEU (factored): 32.93
Reordered baseline (good for Marathi)
Standardized XML log format updated as per the requirements
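The slides do not spell out the reordering rules. To illustrate the general idea of source-side reordering, the sketch below applies a single toy rule to POS-tagged English: verbs are moved to the end of the clause so that the word order is closer to the verb-final order of Hindi or Marathi. The function name, the Penn Treebank-style tags, and the single-clause restriction are all assumptions; a real component would work on parse trees and handle many more constructions.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS tag) pairs; Penn Treebank-style tags assumed


def is_verbal(tag: str) -> bool:
    """True for verb tags (VB, VBD, VBZ, ...) and modals (MD)."""
    return tag.startswith("VB") or tag == "MD"


def move_verbs_to_end(sentence: Tagged) -> Tagged:
    """Toy pre-ordering rule for a single clause: SVO -> SOV.

    Verbal tokens keep their relative order and are placed after all other
    tokens; sentence-final punctuation (tag ".") stays last.
    """
    punct = [t for t in sentence if t[1] == "."]
    body = [t for t in sentence if t[1] != "."]
    verbs = [t for t in body if is_verbal(t[1])]
    others = [t for t in body if not is_verbal(t[1])]
    return others + verbs + punct


tagged = [("tourists", "NNS"), ("visit", "VBP"), ("the", "DT"), ("temple", "NN"), (".", ".")]
reordered = move_verbs_to_end(tagged)
print(" ".join(word for word, _ in reordered))  # tourists the temple visit .
```

Training the translation model on source text reordered this way reduces the amount of long-distance movement the decoder has to model on its own.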
Anuvadaksh-II: Achievements
Publication:
Learning Improved Models for Urdu, Farsi and Italian using SMT. Rohit Gupta, Raj N. Patel and Ritesh Shah. Proceedings of the First Workshop on Reordering for Statistical Machine Translation, COLING 2012, Mumbai, India, December 8-15, 2012.
  Applying statistical MT techniques to learn improved reordering models
  Study of correlation between reordering and distortion parameters for the English-Urdu pair, among others
Anuvadaksh-II: Achievements
SMT improvements (Hindi)
Corpus: EILMT Tourism Corpus (approx. 15000 sentences)

Grade point (0-4) | Version 2.0 (Feb 2013) | Version 1.0 (July 2012)
4 (>=80%) | 39% | 14%
3 (60%-79%) | 26% | 18%
2 (40%-59%) | 25% | 26%
1 (20%-39%) | 10% | 37%
0 (<20%) | 0 | 5%
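One way to compare the two Hindi versions at a glance is to collapse each grade distribution into a mean grade point. The slides report only the distributions; the mean and the "grade 3 or above" share computed below are derived summaries added here for illustration, and the function names are hypothetical.

```python
from typing import Dict

# Percentage of evaluated sentences at each grade point, copied from the Hindi results.
hindi: Dict[str, Dict[int, float]] = {
    "Version 2.0 (Feb 2013)": {4: 39, 3: 26, 2: 25, 1: 10, 0: 0},
    "Version 1.0 (July 2012)": {4: 14, 3: 18, 2: 26, 1: 37, 0: 5},
}


def mean_grade(dist: Dict[int, float]) -> float:
    """Weighted average grade point for a percentage distribution."""
    total = sum(dist.values())
    return sum(grade * share for grade, share in dist.items()) / total


def share_at_least(dist: Dict[int, float], threshold: int) -> float:
    """Percentage of sentences graded at or above the threshold."""
    total = sum(dist.values())
    return 100.0 * sum(s for g, s in dist.items() if g >= threshold) / total


for version, dist in hindi.items():
    print(f"{version}: mean grade {mean_grade(dist):.2f}, "
          f"grade >= 3 for {share_at_least(dist, 3):.0f}% of sentences")
# Version 2.0 (Feb 2013): mean grade 2.94, grade >= 3 for 65% of sentences
# Version 1.0 (July 2012): mean grade 1.99, grade >= 3 for 32% of sentences
```

Dividing by the column total rather than assuming it is exactly 100 keeps the function usable if a distribution has been rounded.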
Anuvadaksh-II: Achievements
SMT improvements (Marathi)
Corpus: EILMT Tourism Corpus (approx. 13000 sentences)

Grade point (0-4) | Baseline (Eval 1) | Baseline (Eval 2) | Reordered
4 (>=80%) | 24% | 20% | 10%
3 (60%-79%) | 26% | 23% | 31%
2 (40%-59%) | 15% | 25% | 43%
1 (20%-39%) | 34% | 25% | 16%
0 (<20%) | 1% | 7% | 0
Anuvadaksh-II: Future Plan
Use factored model in the Statistical MT engine to enhance translations for all languages in the tourism domain
For the health domain specifically, obtain translations using existing resources and evaluate basic coverage of grammar for this domain
The entire system with its hybrid approach has to be deployed efficiently and the outputs have to be sent to the testing team at CDAC Pune.
Thank you