CURRENT STATUS OF ILMT A perspective of translation from Marathi to Hindi

Upload: tyrone-townsend

Post on 24-Dec-2015

TRANSCRIPT

  • Slide 1
  • CURRENT STATUS OF ILMT A perspective of translation from Marathi to Hindi
  • Slide 2
  • Architecture
  • Slide 3
  • Example Flow
  • Slide 4
  • Modules Extensively Improved: Morph Analyzer, Lexical Transfer
  • Slide 5
  • Marathi MA changes
      • The morphological analyzer was modified to resolve the issues found while testing its output.
      • The resources were updated by adding 16,000 new roots to the lexicon and by creating several new SRRs. The lexicon now covers all the words in the Marathi WordNet.
      • TAM labels were revised.
      • Methods were developed for handling Taddhitas (i.e. words derived from nouns, adjectives and adverbs) and compounds, but they have not been integrated into the ILMT pipeline.
      • Current accuracy is 95% on ILMT data.
      • The stand-alone morphological analyzer also reports the derivational process.
  • Slide 6
  • Marathi Compounding
      • In linguistics, a compound word is a lexeme that consists of more than one stem.
      • Compounds are a kind of multiword expression (MWE), but their properties are easier to predict than those of MWEs in general.
      • Example: {mamamami} {uncle-aunt (maternal)} (a noun).
      • Marathi compounds mostly have two stems; three-stem cases are rare.
      • {bhaubahin} {brother-sister} has a Hindi equivalent, {bhai-behen}. The individual components are translated directly, which is an advantage for close languages like Marathi and Hindi.
  • Slide 7
  • Problem Definition
      • Given a word containing two components (and hence two roots) a and b, inflected and appended with suffixes, identify each component and provide its linguistic information and the category of the compound word: Field 1 : Field 2 : ;;.
      • "fs af" means feature structure in abbreviated form.
      • CGNPTAM means grammatical category, gender, number, person, tense, aspect and modality.
      • Fincat: grammatical category of the resultant word.
      • If there are no features, give only the root words with a short description.
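The core of the problem statement above, finding the two roots a and b inside a compound, can be sketched as a lexicon-driven split. This is only a minimal illustration and not the method used in the ILMT system: the `LEXICON` set and the `split_compound` function are hypothetical names, the roots are the romanized examples from the slides, and inflection, suffixes and feature structures are ignored entirely.

```python
from typing import Optional, Set, Tuple

# Hypothetical toy lexicon of romanized roots taken from the slide examples.
# A real system would consult the morphological analyzer's lexicon instead.
LEXICON: Set[str] = {"mama", "mami", "bhau", "bahin"}


def split_compound(word: str, lexicon: Set[str]) -> Optional[Tuple[str, str]]:
    """Return (a, b) if word == a + b and both parts are known roots.

    Tries every split point from left to right and returns the first
    match, or None when no split yields two lexicon entries.
    """
    for i in range(1, len(word)):
        a, b = word[:i], word[i:]
        if a in lexicon and b in lexicon:
            return a, b
    return None


if __name__ == "__main__":
    for compound in ("mamamami", "bhaubahin", "pencil"):
        print(compound, "->", split_compound(compound, LEXICON))
```

A split that succeeds gives the two roots whose features would then be reported in the two fields; handling inflected forms would need the analyzer's suffix rules, which this sketch does not model.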
  • Slide 8
  • Taxonomy of Compounds
      • Compound words
          • Words with both components meaningful
              • No duplication: {mamamami} {uncle-aunt (maternal)}
              • Partial reduplication: {garmagaram} {very hot}
          • Words with only one component meaningful
              • First component meaningful: {aardha murdha} {halfway}
              • Second component meaningful: {sispencil} {penpencil}
          • Negation words: {ayogya} {inappropriate}
          • Reduplication words: {tukdetukde} {in pieces}
          • Sandhi words: {atyurja} {too much energy}
          • Echo words: {rastorasti} {every road}
  • Slide 9
  • Results
      No. | Type | Input word count | Split count | Analysed count | Percentage correctly analysed
      1 | Both roots distinct and meaningful | 234 | 231 | 225 | 96%
      2 | Partial reduplication | 25 | | | 100%
      3 | Only first root meaningful | 35 | | 30 | 85.7%
      4 | Only second root meaningful | 11 | 8 | 8 | 72.7%
      5 | Negation words | 12 | | | 100%
      6 | Reduplication words | 59 | | | 100%
      7 | Echo words | 54 | | | 100%
      8 | Sandhi words | 31 | | 28 | 90.3%
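The percentages in the results appear to be the analysed count divided by the input word count. The short check below recomputes them under that assumption, reading the run-together digits as 234/231/225, 35/30, 11/8/8 and 31/28; the `ROWS` table and the `percent_analysed` helper are illustrative names, not part of the original material.

```python
# (input word count, analysed count) per compound type, as read from the
# results slide; the pairing of digits to columns is an assumption.
ROWS = {
    "Both roots distinct and meaningful": (234, 225),
    "Only first root meaningful": (35, 30),
    "Only second root meaningful": (11, 8),
    "Sandhi words": (31, 28),
}


def percent_analysed(input_count: int, analysed_count: int) -> float:
    """Percentage of input words analysed correctly, to one decimal place."""
    return round(100.0 * analysed_count / input_count, 1)


if __name__ == "__main__":
    for name, (n_input, n_analysed) in ROWS.items():
        print(f"{name}: {percent_analysed(n_input, n_analysed)}%")
```

Under this reading the recomputed values agree with the reported 85.7%, 72.7% and 90.3%, and the first row comes to about 96.2%, consistent with the 96% on the slide.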
  • Slide 10
  • Marathi Synset Linkage
      • Total number of synsets for which words were cross-linked: 18,000. These are now reflected in the bilingual dictionary used for lexical transfer.
      • Total Marathi synsets: 26,557
      • Total unique words: 36,394
      • Total linked synsets: 23,967
  • Slide 11
  • Corpus Statistics
      • Corpus size: Tourism 240,000 words; Health 255,000 words; General 30.7 million words (news domain).
      • POS-annotated corpus (tagged and cross-checked): General domain 263,037 words; Tourism domain 136,640 words; Health domain 44,202 words (Set 1) and 21,345 words (Set 2).
  • Slide 12
  • Lexical Transfer Module changes
      • The dictionary currently has 316 Akhyata pairs, 68 Kridanta pairs, and 40 entries for irregular mappings.
      • A number of bugs involving the transfer of the base forms of verbs have been eliminated.
      • Bugs that caused sudden system crashes due to improper coding have been eliminated.
      • The lexical transfer module now selects the first synset in sequence corresponding to the given word.
      • Transfer of ordinals, conjunctions, etc. has also been included.
      • The features of the NER module are now properly used for transliterating the necessary named entities.
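The "first synset in sequence" rule can be pictured as a simple dictionary lookup. The sketch below is an assumption-laden illustration, not the ILMT module itself: the `BILINGUAL` structure, the placeholder second sense, the choice of the first word inside the synset, and the pass-through for unknown words are all hypothetical; only the bhau/bhai and bahin/behen pairs come from the slides.

```python
from typing import Dict, List

# Hypothetical bilingual dictionary: each Marathi root maps to an ordered
# list of linked Hindi synsets, and each synset is a list of Hindi words.
BILINGUAL: Dict[str, List[List[str]]] = {
    "bhau": [["bhai"], ["<other-sense-placeholder>"]],
    "bahin": [["behen"]],
}


def transfer(word: str, dictionary: Dict[str, List[List[str]]]) -> str:
    """Pick the first word of the first linked synset for `word`.

    Unknown words are passed through unchanged (an assumption made here
    for illustration; the real module may transliterate them instead).
    """
    synsets = dictionary.get(word)
    if not synsets or not synsets[0]:
        return word
    return synsets[0][0]


if __name__ == "__main__":
    # Component-wise transfer of the compound bhau + bahin.
    print("-".join(transfer(w, BILINGUAL) for w in ("bhau", "bahin")))
```

Always taking the first synset avoids any sense disambiguation step, at the cost of picking the wrong sense whenever the intended one is not listed first.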
  • Slide 13
  • Current Status: Results by CDAC Pune
      • Health: Comprehensibility/Adequacy 81%, Fluency 53%
      • Tourism: Comprehensibility/Adequacy 78%, Fluency 52%
  • Slide 14
  • Evaluation Method
      • S5: number of score-5 sentences; S4: number of score-4 sentences; S3: number of score-3 sentences; N: total number of sentences.
      • Score = (formula not preserved in this transcript)
      • Score 5: Correct translation
      • Score 4: Understandable with minor errors
      • Score 3: Understandable with major errors
      • Score 2: Not understandable
      • Score 1: Nonsense translation
      • Linguists give each sentence a score out of 5 without foreknowledge of its meaning. The score reflects the subjective quality of the sentence.
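The quantities S5, S4, S3 and N can be tallied directly from a list of per-sentence scores. Because the formula that combines them into the final score does not survive in this transcript, the sketch below stops at the tally and does not guess at it; the `tally_scores` function and the sample scores are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def tally_scores(scores: List[int]) -> Dict[str, int]:
    """Count sentences rated 5, 4 and 3, plus the total number of sentences.

    Every score must be an integer from 1 to 5, matching the five-point
    scale used by the evaluators.
    """
    for s in scores:
        if s not in (1, 2, 3, 4, 5):
            raise ValueError(f"score out of range: {s!r}")
    counts = Counter(scores)
    return {"S5": counts[5], "S4": counts[4], "S3": counts[3], "N": len(scores)}


if __name__ == "__main__":
    # Hypothetical ratings for eight sentences, for illustration only.
    sample = [5, 4, 4, 3, 2, 5, 1, 3]
    print(tally_scores(sample))
```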
  • Slide 15