
CS712: Knowledge Based MT

Pushpak Bhattacharyya, CSE Department

IIT Bombay, 4/3/13

Empiricism vs Rationalism

• Ken Church, "A Pendulum Swung too Far", LILT 2011
  – Availability of huge amounts of data: what to do with it?
  – 1950s: Empiricism (Shannon, Skinner, Firth, Harris)
  – 1970s: Rationalism (Chomsky, Minsky)
  – 1990s: Empiricism (IBM Speech Group, AT&T)
  – 2010s: Return of Rationalism

Resource generation will play a vital role in this revival of rationalism.

Introduction
• Machine Translation (MT) is the task of translating text from one natural language to another natural language using a machine
• Translated text should have two desired properties:
  – Adequacy: the meaning should be conveyed correctly
  – Fluency: the text should be fluent in the target language
• Translation between distant languages is a difficult task
  – Handling language divergence is a major challenge

3

4

Kinds of MT Systems (point of entry from source to the target text)

[Figure: pyramid of MT strategies by point of entry. Levels of analysis: graphemic (text), syntagmatic (tagged text), morpho-syntactic (C-structures: constituent), syntactico-functional (F-structures: functional), logico-semantic (SPA-structures: semantic & predicate-argument), interlingual (semantico-linguistic interlingua, ontological interlingua), deep understanding. Corresponding strategies: direct translation, semi-direct translation, syntactic transfer (surface), syntactic transfer (deep), semantic transfer, multilevel transfer (mixing levels / multilevel description), conceptual transfer.]

5

Why is MT difficult? Language Divergence

• One of the main complexities of MT: language divergence
• Languages have different ways of expressing meaning
  – Lexico-semantic divergence
  – Structural divergence

English-IL language divergence, with illustrations from Hindi (Dave, Parikh, Bhattacharyya, Journal of MT, 2002)

6

Language Divergence Theory Lexico-Semantic Divergences

• Conflational divergence
  – F: vomir, E: to be sick
  – E: stab, H: churaa se maaranaa (knife-with hit)
  – S: Utrymningsplan, E: escape plan
• Structural divergence
  – E: SVO, H: SOV
• Categorial divergence
  – Change is in POS category (many examples discussed)

7

Language Divergence Theory Lexico-Semantic Divergences (cntd)

• Head swapping divergence
  – E: Prime Minister of India, H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
• Lexical divergence
  – E: advise, H: paraamarsh denaa (advice give); noun incorporation is a very common Indian language phenomenon

8

Language Divergence Theory Syntactic Divergences

• Constituent order divergence
  – E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii singh … (India-of PM Singh …)
• Adjunction divergence
  – E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
• Preposition-stranding divergence
  – E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho (who with …)

9

Language Divergence Theory Syntactic Divergences

• Null subject divergence
  – E: I will go; H: jaauMgaa (subject dropped)
• Pleonastic divergence
  – E: It is raining; H: baarish ho rahii hai (rain happening is); there is no translation of "it"

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English: case marking and postpositions do not transfer easily)

• सनिनित भत (immediate past)
  – कधी आलास यत इतकच (M)
  – कब आय बस अभ आय (H)
  – When did you come? Just now (I came). (E)
  – kakhan ele? ei elaam/eshchhi (B)
• निसशय भनिषय (certainty in future)
  – आत त मार खात खास (M)
  – अब मार खायगा (H)
  – He is in for a thrashing. (E)
  – ekhan o maar khaabei/khaachhei (B)
• आशवास (assurance)
  – मा तमला उदया भटत (M)
  – मा आप स कला मिमालात ह (H)
  – I will see you tomorrow. (E)
  – aami kaal tomaar saathe dekhaa korbo/korchhi (B)

11

Interlingua Based MT

12

KBMT

• Nirenburg, Knowledge Based Machine Translation, Machine Translation, 1989

• Forerunner of many interlingua-based MT efforts

13

IL based MT schematic

14

Specification of KBMT

• Source languages: English and Japanese
• Target languages: English and Japanese
• Translation paradigm: interlingua
• Computational architecture: a distributed, coarsely parallel system
• Subworld (domain) of translation: personal computer installation and maintenance manuals

15

Knowledge base of KBMT (1/2)

• An ontology (domain model) of about 1500 concepts
• Analysis lexicons: about 800 lexical units of Japanese and about 900 units of English
• Generation lexicons: about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (2/2)

• Analysis grammars for English and Japanese
• Generation grammars for English and Japanese
• Specialized syntax–semantics structural mapping rules

17

Underlying formalisms

• The knowledge representation system FrameKit
• A language for representing domain models (a semantic extension of FrameKit)
• Specialized grammar formalisms based on Lexical-Functional Grammar
• A specially constructed language for representing text meanings (the interlingua)
• The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

• A syntactic parser with a semantic constraint interpreter
• A semantic mapper for treating additional types of semantic constraints
• An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression: language typology

22

Language and dialects

• There are 6,909 living languages (2009 census)
• Dialects far outnumber the languages
• Language varieties are called dialects:
  – if they have no standard or codified form
  – if the speakers of the given language do not have a state of their own
  – if they are rarely or never used in writing (outside reported speech)
  – if they lack prestige with respect to some other, often standardised, variety

23

"Linguistic" interlingua

• Three notable attempts at creating an interlingua:
  – "Interlingua" by IALA (International Auxiliary Language Association)
  – "Esperanto"
  – "Ido"

24

The Lord's Prayer: Esperanto version, Interlingua version, Latin version, English version (traditional)

Esperanto: 1 Patro Nia, kiu estas en la ĉielo, via nomo estu sanktigita. 2 Venu via regno, 3 plenumiĝu via volo, kiel en la ĉielo, tiel ankaŭ sur la tero. 4 Nian panon ĉiutagan donu al ni hodiaŭ. 5 Kaj pardonu al ni niajn ŝuldojn, kiel ankaŭ ni pardonas al niaj ŝuldantoj. 6 Kaj ne konduku nin en tenton, sed liberigu nin de la malbono. Amen.

Interlingua: 1 Patre nostre, qui es in le celos, que tu nomine sia sanctificate. 2 Que tu regno veni, 3 que tu voluntate sia facite como in le celo, etiam super le terra. 4 Da nos hodie nostre pan quotidian, 5 e pardona a nos nostre debitas como etiam nos los pardona a nostre debitores. 6 E non induce nos in tentation, sed libera nos del mal. Amen.

Latin: 1 Pater noster, qui es in caelis, sanctificetur nomen tuum. 2 Adveniat regnum tuum. 3 Fiat voluntas tua, sicut in caelo et in terra. 4 Panem nostrum quotidianum da nobis hodie, 5 et dimitte nobis debita nostra, sicut et nos dimittimus debitoribus nostris. 6 Et ne nos inducas in tentationem, sed libera nos a malo. Amen.

English (traditional): 1 Our Father, who art in heaven, hallowed be thy name. 2 Thy kingdom come, 3 thy will be done on earth as it is in heaven. 4 Give us this day our daily bread, 5 and forgive us our debts as we have forgiven our debtors. 6 And lead us not into temptation, but deliver us from evil. Amen.

25

"Interlingua" and "Esperanto"

• Control languages for "Interlingua": French, Italian, Spanish, Portuguese, English, German and Russian
• Natural words in "Interlingua"; "manufactured" words in Esperanto, built by heavy agglutination
• Esperanto word for "hospital": mal·san·ul·ej·o = mal (opposite) + san (health) + ul (person) + ej (place) + o (noun)

26

27

Language Universals vs Language Typology

• "Universals" is concerned with what human languages have in common, while the study of typology deals with the ways in which languages differ from each other

28

Typology: basic word order

• SOV (Japanese, Tamil, Turkish, etc.)
• SVO (Fula, Chinese, English, etc.)
• VSO (Arabic, Tongan, Welsh, etc.)

"Subjects tend strongly to precede objects."

29

Typology: morphotactics of singular and plural

• No expression: Japanese hito "person", pl. hito
• Function word: Tagalog bato "stone", pl. mga bato
• Affixation: Turkish ev "house", pl. ev-ler; Swahili m-toto "child", pl. wa-toto
• Sound change: English man, pl. men; Arabic rajulun "man", pl. rijalun
• Reduplication: Malay anak "child", pl. anak-anak

30

Typology: implication of word order

Due to its SVO nature, English has:
  – preposition + noun (in the house)
  – genitive + noun (Tom's house) or noun + genitive (the house of Tom)
  – auxiliary + verb (will come)
  – noun + relative clause (the cat that ate the rat)
  – adjective + standard of comparison (better than Tom)

31

Typology: motion verbs (1/3)

• Motion is expressed differently in different languages, and the differences turn out to be highly significant
• There are two large types:
  – verb-framed languages, and
  – satellite-framed languages

32

Typology: motion verbs (2/3)

• In satellite-framed languages like English, the motion verb typically also expresses manner or cause
  – The bottle floated out of the cave (manner)
  – The napkin blew off the table (cause)
  – The water rippled down the stairs (manner)

33

Typology: motion verbs (3/3)

• Verb-framed languages express manner and cause not in the verb but in a more peripheral element
• Spanish
  – La botella salió de la cueva flotando
  – "The bottle exited the cave floatingly"
• Japanese
  – Bin-ga dookutsu-kara nagara-te de-ta
  – bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT = EnConversion + Deconversion

[Figure: analysis from source languages (English, French, Hindi, Chinese, …) into the Interlingua (UNL), and generation from UNL into the target languages]

Challenges of interlingua generation

The mother of NLP problems: extract the meaning of a sentence.

Almost all NLP problems are sub-problems of it: named entity recognition, POS tagging, chunking, parsing, word sense disambiguation, multiword identification, and the list goes on.

UNL a United Nations project

• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Currently active language groups:
  – UNL_French (GETA-CLIPS, IMAG)
  – UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  – UNL_Italian (Univ. of Pisa)
  – UNL_Portugese (Univ. of Sao Paolo, Brazil)
  – UNL_Russian (Institute of Linguistics, Moscow)
  – UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language-independent meaning representation

[Figure: UNL at the hub, connected to English, Russian, Japanese, Hindi, Spanish, Marathi and other languages]

39

UNL represents knowledge: "John eats rice with a spoon"

[Figure: UNL graph of the sentence, built from universal words linked by semantic relations and decorated with attributes]

Repository of 42 semantic relations and 84 attribute labels

40
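For illustration, a plausible UNL rendering of this sentence (a reconstruction for this transcript, not copied from the slide's figure) is:

agt(eat.@entry.@present, John)
obj(eat.@entry.@present, rice)
ins(eat.@entry.@present, spoon)

Here agt, obj and ins are the agent, object and instrument semantic relations, and .@entry and .@present are attribute labels on the universal word eat.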

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V,VOA) <E,0,0>

[mail] "mail(icl>message)" (N,PHSCL,INANI) <E,0,0>

[minister] "minister(icl>person)" (N,ANIMT,PHSCL,PRSN) <E,0,0>

Format: [headword] "universal word" (attributes)

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] "he" (PRON,SUB,SING,3RD) <E,0,0>

[the] "the" (ART,THE) <E,0,0>

[to] "to" (PRE,TO) <E,0,0>

Format: [headword] "universal word" (attributes)

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve centre is the verb
• English sentences have the shape:
  – A <verb> B, or
  – A <verb>

[Figure: the verb as the hub of the UNL graph, connected to A and B by semantic relations]

Recurse on A and B

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

[Figure: pipeline. The input sentence passes through NER, the Stanford Dependency Parser, WSD and a Clause Marker; the Simplifier splits it into simple sentences, each handled by a Simple Sentence Enconverter (Stanford Dependency Parser, XLE Parser, feature generation, attribute generation, relation generation); a Merger combines the resulting UNL graphs.]

Complexity of handling long sentences

Problem: As the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
  – A sentence can only be broken at the clausal level: simplification
  – Once broken, the pieces need to be joined: merging
• Reduce the reliance of the enconverter on the XLE parser
  – The rules for relation and attribute generation are formed on Stanford dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example
  – Sentence: John went to play and Jane went to school
  – Simplified: John went to play. Jane went to school.
• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser (a toy illustration follows)
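A minimal sketch of the clause-splitting idea, using spaCy as a stand-in for the Stanford-parser-based clause marker described above (the library choice and the single-conjunction rule are illustrative assumptions, not the system's actual implementation):

import spacy

nlp = spacy.load("en_core_web_sm")

def simplify(sentence):
    # Toy clause splitter: cut a compound sentence at the first conjunct
    # whose head is the root verb; drop the coordinating conjunction.
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")
    conjuncts = [t for t in root.children if t.dep_ == "conj"]
    if not conjuncts:
        return [sentence]
    cut = min(t.left_edge.i for t in conjuncts)
    left = " ".join(t.text for t in doc[:cut] if t.dep_ != "cc")
    right = doc[cut:].text
    return [left, right]

print(simplify("John went to play and Jane went to school"))
# expected under this toy rule: ['John went to play', 'Jane went to school']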

UNL Merging

• The Merger takes multiple UNL graphs and merges them to form one UNL graph
• Example
  – Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  – Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases, based on how the sentence was simplified; the Merger uses rules for each of the cases to merge them (a sketch of the simplest case follows)
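A sketch of the simple-conjunction case only; the helper function and its string handling are illustrative assumptions, while the real Merger applies case-specific rules:

def merge_unl(unl1, unl2, connector="and"):
    # Entry UW of the first graph, e.g. "go" in "agt(go.@entry, John)".
    head = unl1[0].split("(")[1].split(".")[0]
    # Push every relation of the second graph into scope :01.
    scoped = [rel.replace("(", ":01(", 1) for rel in unl2]
    return unl1 + ["%s(%s, :01)" % (connector, head)] + scoped

print(merge_unl(["agt(go.@entry, John)", "pur(go.@entry, play)"],
                ["agt(go.@entry, Mary)", "gol(go.@entry, school)"]))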

NLP tools and resources for UNL generation

Tools:
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources:
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic processing:
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic processing:
• Stems of words
• WSD
• Nouns originating from verbs
• Feature generation (noun features and verb features)

SYNTACTIC PROCESSING

Syntactic Processing: NER

• NER tags the named entities in the sentence with their types
• Types covered:
  – Person
  – Place
  – Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river
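As a runnable stand-in for the Stanford NER referred to above, Stanza (the Python successor of the Stanford CoreNLP tools; an assumption for illustration, not the tool the slides used) produces the same kind of tagging:

import stanza

stanza.download("en")                     # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
doc = nlp("Will Smith was eating an apple with a fork on the bank of the river")
for ent in doc.ents:
    print(ent.text, ent.type)             # e.g. Will Smith -> PERSON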

Syntactic Processing: Stanford Dependency Parser (1/2)

• The Stanford Dependency Parser parses the sentence and produces:
  – POS tags of the words in the sentence
  – The dependency parse of the sentence

Syntactic Processing: Stanford Dependency Parser (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1)
aux(eating-3, was-2)
det(apple-5, an-4)
dobj(eating-3, apple-5)
prep(eating-3, with-6)
det(fork-8, a-7)
pobj(with-6, fork-8)
prep(eating-3, on-9)
det(bank-11, the-10)
pobj(on-9, bank-11)
prep(bank-11, of-12)
det(river-14, the-13)
pobj(of-12, river-14)
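Approximately the same information can be obtained today with spaCy (again a stand-in, and its label set differs slightly from the old Stanford basic dependencies shown above):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Will-Smith was eating an apple with a fork on the bank of the river")

print([(t.text, t.tag_) for t in doc])        # POS tags
for t in doc:                                 # dependency triples
    if t.dep_ != "ROOT":
        print(f"{t.dep_}({t.head.text}-{t.head.i + 1}, {t.text}-{t.i + 1})")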

Syntactic Processing: XLE Parser

• The XLE parser generates dependency relations and some other important information

Input: Will-Smith was eating an apple with a fork on the bank of the river

Dependency relations:
OBJ(eat8, apple16), SUBJ(eat8, will-smith0), ADJUNCT(apple16, on20), OBJ(on20, bank22), ADJUNCT(apple16, with17), OBJ(with17, fork19), ADJUNCT(bank22, of23), OBJ(of23, river25)

Important information:
PASSIVE(eat8, -), PROG(eat8, +), TENSE(eat8, past), VTYPE(eat8, main)

SEMANTIC PROCESSING

Semantic Processing: Finding stems

• WordNet is used for finding the stems of the words

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word → stem:
was → be
eating → eat
apple → apple
fork → fork
bank → bank
river → river
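With NLTK's WordNet interface, this stemming step looks roughly like the following (a minimal sketch, not the project's own code):

from nltk.corpus import wordnet as wn

for word, pos in [("was", wn.VERB), ("eating", wn.VERB),
                  ("apple", wn.NOUN), ("river", wn.NOUN)]:
    # morphy maps an inflected form to its WordNet base form
    print(word, "->", wn.morphy(word, pos))   # was -> be, eating -> eat, ...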

Semantic Processing: WSD

Word → POS:
Will-Smith → proper noun, was → verb, eating → verb, an → determiner, apple → noun, with → preposition, a → determiner, fork → noun, on → preposition, the → determiner, bank → noun, of → preposition, the → determiner, river → noun

WSD output — Word | Synset-id | Sense-id:
Will-Smith | - | -
was | 02610777 | be%2:42:03::
eating | 01170802 | eat%2:34:00::
apple | 07755101 | apple%1:13:00::
fork | 03388794 | fork%1:06:00::
bank | 09236472 | bank%1:17:01::
river | 09434308 | river%1:17:00::
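NLTK's simplified Lesk algorithm gives a rough, freely available substitute for the IITB WSD system assumed here; the sense it picks for an ambiguous word such as "bank" may well differ from the gold senses above:

from nltk import word_tokenize
from nltk.wsd import lesk

tokens = word_tokenize("Will Smith was eating an apple with a fork "
                       "on the bank of the river")
for word, pos in [("eating", "v"), ("fork", "n"), ("bank", "n"), ("river", "n")]:
    synset = lesk(tokens, word, pos)
    print(word, synset, synset.definition() if synset else None)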

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word → Universal Word:
Will-Smith → Will-Smith(iof>PERSON)
eating → eat(icl>consume, equ>consumption)
apple → apple(icl>edible fruit>thing)
fork → fork(icl>cutlery>thing)
bank → bank(icl>slope>thing)
river → river(icl>stream>thing)

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs
• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of any relation between noun1 and noun2

Algorithm:
• Traverse the hypernym path of the noun to its root
• If any word on the path is "action", then the noun originates from a verb (a WordNet-based sketch follows)

Examples:
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
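A WordNet-based sketch of this test; the keyword set and the restriction to the first sense are simplifying assumptions:

from nltk.corpus import wordnet as wn

def originates_from_verb(noun):
    # Walk every hypernym path of the noun's first sense and look for
    # a synset whose lemmas include "act" or "action".
    synsets = wn.synsets(noun, wn.NOUN)
    if not synsets:
        return False
    return any({"act", "action"} & set(s.lemma_names())
               for path in synsets[0].hypernym_paths() for s in path)

print(originates_from_verb("removal"), originates_from_verb("apple"))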

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun and verb features

Relation | Word | Required property
agt(uw1, uw2) | uw1 | Volitional verb
met(uw1, uw2) | uw2 | Abstract thing
dur(uw1, uw2) | uw2 | Time period
ins(uw1, uw2) | uw2 | Concrete thing

Semantic Processing: Noun Features

Algorithm:
• Add word.NE_type to word.features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated (see the sketch below)

Noun → features:
Will-Smith → PERSON, ANIMATE
bank → PLACE
fork → CONCRETE
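A sketch of the keyword match on the hypernym path; the keyword-to-feature table is an illustrative assumption, not the system's actual list:

from nltk.corpus import wordnet as wn

KEYWORD_FEATURES = {"person": "ANIMATE", "animal": "ANIMATE",
                    "location": "PLACE", "artifact": "CONCRETE"}

def noun_features(noun, ne_type=None):
    feats = {ne_type} if ne_type else set()       # add word.NE_type
    for synset in wn.synsets(noun, wn.NOUN)[:1]:  # first sense only
        for path in synset.hypernym_paths():
            for s in path:
                for lemma in s.lemma_names():
                    if lemma in KEYWORD_FEATURES:
                        feats.add(KEYWORD_FEATURES[lemma])
    return feats

print(noun_features("fork"), noun_features("Will-Smith", "PERSON"))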

Types of verbs

• Verb features are based on the types of verbs
• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily or just happen without any performer
  – Be: verbs which tell about the existence of something or about some fact
• We classify these three into "do" and "not-do"

Semantic Processing: Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not
• "do" verbs are volitional verbs, which are performed voluntarily
• Heuristic:
  – If a frame is of the form "Something/somebody …s something/somebody", then the verb is of "do" type
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do" (a rough WordNet-based version appears below)

Word → type → feature:
was → not do → -
eating → do → do
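A rough rendering of the heuristic using WordNet sentence frames (an approximation: it checks only the first sense and only whether a frame starts with "Somebody", so it over- and under-generates exactly as the slide warns):

from nltk.corpus import wordnet as wn

def is_do_verb(verb):
    synsets = wn.synsets(verb, wn.VERB)
    if not synsets:
        return False
    frames = [f for lemma in synsets[0].lemmas() for f in lemma.frame_strings()]
    return any(f.startswith("Somebody") for f in frames)

print(is_do_verb("eat"), is_do_verb("rain"))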

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations handled separately
• Similar relations handled in clusters
• Rule-based approach
• Rules are applied on:
  – Stanford dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → relation generated:
• nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is of "do" type and active → agt(eat, Will-Smith)
• dobj(eating, apple); eat is active → obj(eat, apple)
• prep(eating, with), pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
• prep(eating, on), pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
• prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)

(One of these rules is written out as code below.)
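A hypothetical rendering of the first rule in the table; the relation name follows the table, while the exact feature predicates are assumptions for illustration:

def agt_rule(dep, verb_features, noun_features):
    # nsubj(V, N) + N is ANIMATE + V is volitional ("do") and active -> agt(V, N)
    rel, verb, noun = dep                  # e.g. ("nsubj", "eat", "Will-Smith")
    if (rel == "nsubj" and "ANIMATE" in noun_features
            and "do" in verb_features and "active" in verb_features):
        return ("agt", verb, noun)
    return None

print(agt_rule(("nsubj", "eat", "Will-Smith"),
               {"do", "active"}, {"PERSON", "ANIMATE"}))   # ('agt', 'eat', 'Will-Smith')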

Attribute Generation (1/2)

• Attributes handled in clusters
• Attributes are generated by rules
• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → attributes generated:
• eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past
• det(apple, an) → apple.@indef
• det(fork, a) → fork.@indef
• det(bank, the) → bank.@def
• det(river, the) → river.@def

(A sketch of the determiner rules follows.)
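The determiner rules in the table reduce to something like the following; again a hypothetical sketch of the rule format, not the system's code:

INDEF, DEF = {"a", "an"}, {"the"}

def det_attributes(dep):
    # det(N, D) -> N.@indef or N.@def depending on the determiner D
    rel, noun, det = dep
    if rel == "det" and det.lower() in INDEF:
        return noun + ".@indef"
    if rel == "det" and det.lower() in DEF:
        return noun + ".@def"
    return None

print(det_attributes(("det", "apple", "an")), det_attributes(("det", "bank", "the")))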

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff:
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
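The relation-wise precision, recall and F-score reported in the following slides can be computed from these matched tuples roughly as follows (a generic sketch, not the project's evaluation script):

def prf(generated, gold):
    # generated, gold: sets of (relation, UW1, UW2) tuples; a generated
    # tuple counts as correct only if all three components match.
    tp = len(generated & gold)
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f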

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy for Scenarios 1–4, y-axis from 0 to 0.5]

Results: Time taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus

[Chart: time taken for Scenarios 1–4, y-axis from 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: false positives per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Missed Relations

[Chart: false negatives per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• George, the president of the football club, James, the coach of the club, and the players went to watch their rival's game

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua
• More general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibilities

The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations:
agt(represent, latter), aoj(represent, former), obj(dosage, body)

Generated relations:
aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module → output (UNL expression: obj(contact(…):
• Lexeme selection: सपक+ निकस य आप कषतर तलक मा चर खाट (contact farmer this you region taluka manchar khatav)
• Case identification: सपक+ निकस य आप कषतर तलक मा चर खाट (contact farmer this you region taluka manchar khatav)
• Morphology generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट (contact-imperative farmer-pl this you region taluka manchar khatav)
• Function word insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट (contact-imperative farmers-of this-for you region or taluka-of manchar khatav)
• Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए (this-for you manchar region or khatav taluka-of farmers contact)

[Figure: the UNL graph of the example, with contact as entry; agt you, obj farmer, pur this, plc region; region linked by nam/or to manchar, khatav, taluka]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VOA-COMM,VLTN,TMP,CJNCTN-V,linkV,a)
[पच कवयलि7] "contact(icl>representative)" (N,ANIMT,FAUNA,MML,PRSN,Na)

Lexical choice is unambiguous.

[Figure: the UNL graph with contact सपक+ linked by agt to you आप, by obj to farmer निकस, and by pur to this य]

Case Marking

Relation | Parent+ | Parent− | Child+ | Child−
obj | V | VINT | N |
agt | V | past | N |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from the head to its modifiers

[Figure: the same UNL graph (contact, you, this, farmer)]

Morphology Generation: Nouns

Suffix → attribute values:
uoM → N, NUM, pl, oblique
U → N, NUM, sg, oblique
I → N, NIF, sg, oblique
iyoM → N, NIF, pl, oblique
oM → N, NANOTCHF, pl, oblique

Examples:
The boy saw me: लड़क माझ दखा
Boys saw me: लड़क माझ दखा
The King saw me: र2 माझ दखा
Kings saw me: र2ऒ माझ दखा

Verb Morphology

Suffix | Tense | Aspect | Mood | N | Gen | P | VE
-e rahaa thaa | past | progress | - | sg | male | 3rd | e
-taa hai | present | custom | - | sg | male | 3rd | -
-iyaa thaa | past | complete | - | sg | male | 3rd | I
saktii hain | present | - | ability | pl | female | 3rd | A

After Morphology Generation

[Figure: the UNL graph before and after morphology generation: contact सपक+ → सपक+ क0जि2ए, farmer निकस → निकस4, this य → इस, you आप unchanged]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

Rel | Par+ | Par− | Chi+ | Chi− | Child FW
obj | V | VINT | N, ANIMT | topic | क

[Figure: the graph with function words inserted: farmer → निकस4 क, this → इसक लिलए]

Linearization

इस आप माचर कषतर (this you manchar region) खाट तलक निकस4 (khatav taluka farmers) सपक+ क0जि2ए (contact)

[Figure: the UNL graph (contact सपक+ क0जि2ए as entry; agt you, obj farmer, pur this, plc region; region linked by nam/or to manchar, khatav, taluka)]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  – Semantic independence: the semantic properties of the relata
  – Context independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local ordering: the rest of the expression

Pairwise ordering table for relations sharing a head (L/R = relative position of the two relata):
      agt  aoj  obj  ins
agt        L    L    L
aoj             L    L
obj                  R
ins

[Figure: the UNL subgraph of the running example (contact with agt you, obj farmer, pur this)]

Syntax Planning: Strategy

• Divide a node's relations into:
  – Before_set = {obj, pur, agt}
  – After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

(A minimal sketch of this step follows.)
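A minimal sketch of the before/after grouping; the Before_set and the within-group order are taken from the example above, while the real algorithm orders each group with the pairwise table from the previous slide:

BEFORE = ["pur", "agt", "obj", "plc"]    # relations whose relatum precedes the head

def linearize(node, graph):
    # graph: dict mapping a node to its list of (relation, child) pairs
    children = graph.get(node, [])
    before = sorted((c for c in children if c[0] in BEFORE),
                    key=lambda c: BEFORE.index(c[0]))
    after = [c for c in children if c[0] not in BEFORE]
    words = []
    for _, child in before:
        words += linearize(child, graph)
    words.append(node)
    for _, child in after:
        words += linearize(child, graph)
    return words

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph))       # ['this', 'you', 'farmer', 'contact']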

Syntax Planning: Algorithm

[Table: a worked trace of the algorithm on the running example, with columns Stack, Before-Current, After-Current and Output]

[Figure: the UNL graph of the running example]

All Together (UNL → Hindi)

[Table: the full UNL → Hindi step-through for the running example, repeating the module outputs from "Step-through the Deconverter": lexeme selection and case identification (contact farmer this you region taluka manchar khatav), morphology generation (contact-imperative farmer-pl this you region taluka manchar khatav), function word insertion (contact farmers-of this-for you region or taluka-of manchar khatav), and syntax linearization (this-for you manchar region or khatav taluka-of farmers contact)]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output | Fluency | Adequacy
कक क करण आमा क0 2क पततिय यदिद झलस र 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए | 2 | 3
परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए | 3 | 4
माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत | 1 | 1
2णविTक सकरमाण स इसक0 2ड़O परभनित त | 4 | 4
इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात | 2 | 3

Results

 | BLEU | Fluency | Adequacy
Geometric average | 0.34 | 2.54 | 2.84
Arithmetic average | 0.41 | 2.71 | 3.00
Standard deviation | 0.25 | 0.89 | 0.89
Pearson corr. (BLEU) | 1.00 | 0.59 | 0.50
Pearson corr. (Fluency) | 0.59 | 1.00 | 0.68

Fluency vs Adequacy

[Chart: number of sentences at each fluency score (1–4), broken down by adequacy score (1–4)]

• Good correlation between fluency and BLEU
• Strong correlation between fluency and adequacy
• Can do large-scale evaluation using fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: the same pyramid of MT strategies shown earlier, from direct translation through the transfer levels to the interlingua]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives

• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems
  – Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (scale 0–1) for the morph analyzer, POS tagger, chunker, vibhakti computation, WSD, lexical transfer and word generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participles (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (verb → noun):
च vaach (read) → चण vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (verb → adjective):
च chav (bite) → चणर chaavaNaara (one who bites)
खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (verb → adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → बसना basun (after sitting)

Kridanta Types

Kridanta type | Example (with word-for-word gloss) | Aspect
Ne-kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] | Perfective
laa-kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] | Perfective
Taanaa-kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] | Durative
Lela-kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] | Perfective
Un-kridanta | pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] | Completive
Nara-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] | Stative
ve-kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] | Inceptive
taa-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall (roughly 0.86–0.98) of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools


Empiricism vs Rationalism

bull Ken Church ldquoA Pendulum Swung too Farrdquo LILT 2011ndash Availability of huge amount of data what to do

with itndash 1950s Empiricism (Shannon Skinner Firth

Harris)ndash 1970s Rationalism (Chomsky Minsky)ndash 1990s Empiricism (IBM Speech Group AT amp

T)ndash 2010s Return of Rationalism

Resource generation will play a vital role in this revival of rationalism

Introductionbull Machine Translation (MT) is a technique to translate

texts from one natural language to another natural language using a machine

bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target

languagebull Translation between distant languages is a difficult

taskndash Handling Language Divergence is a major

challenge

3

4

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

5

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

6

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

7

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) Noun Incorporation- very common Indian Language Phenomenon

8

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aap jaanaa chaahate ho (who withhellip)

9

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergence ndash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull सनिनित भत (immediate past)ndash कधी आलास यत इतकच (M)ndash कब आय बस अभ आय (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull निसशय भनिषय (certainty in future)ndash आत त मार खात खास (M)ndash अब मार खायगा (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आशवास (assurance)ndash मा तमला उदया भटत (M)ndash माamp आप स कला मिमालात ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing: WSD

Word         POS            Synset-id   Sense-id
Will-Smith   Proper noun    -           -
was          Verb           02610777    be%2:42:03::
eating       Verb           01170802    eat%2:34:00::
an           Determiner     -           -
apple        Noun           07755101    apple%1:13:00::
with         Preposition    -           -
a            Determiner     -           -
fork         Noun           03388794    fork%1:06:00::
on           Preposition    -           -
the          Determiner     -           -
bank         Noun           09236472    bank%1:17:01::
of           Preposition    -           -
the          Determiner     -           -
river        Noun           09434308    river%1:17:00::

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word         Universal Word
Will-Smith   Will-Smith(iof>PERSON)
eating       eat(icl>consume, equ>consumption)
apple        apple(icl>edible fruit>thing)
fork         fork(icl>cutlery>thing)
bank         bank(icl>slope>thing)
river        river(icl>stream>thing)

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs
• In the frame "noun1 of noun2", where noun1 originates from verb1, the system should generate obj(verb1, noun2) instead of a relation between noun1 and noun2

Algorithm
• Traverse the hypernym path of the noun to its root
• If any word on the path is "action", then the noun has originated from a verb (a sketch follows the examples)

Examples
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
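A hedged sketch of the deverbal-noun test with NLTK's WordNet interface: walk the hypernym paths of the noun's senses and look for an "act"/"action"-like ancestor. Sense selection and the keyword check are simplifications of whatever the actual system does.

from nltk.corpus import wordnet as wn

def originates_from_verb(noun):
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for ancestor in path:
                if any(l.name() in ("act", "action") for l in ancestor.lemmas()):
                    return True
    return False

for w in ["removal", "collection", "fork"]:
    print(w, originates_from_verb(w))
# expected: removal True, collection True, fork False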

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun and verb features

Relation         Word   Property
agt(uw1, uw2)    uw1    Volitional verb
met(uw1, uw2)    uw2    Abstract thing
dur(uw1, uw2)    uw2    Time period
ins(uw1, uw2)    uw2    Concrete thing

Semantic Processing: Noun Features

Algorithm
• Add word.NE_type to word.features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated (see the sketch below)

Noun         Feature
Will-Smith   PERSON, ANIMATE
bank         PLACE
fork         CONCRETE
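A sketch of noun feature generation. The keyword-to-feature map here is hypothetical (the real system's keyword list and NE handling are richer); the hypernym traversal uses NLTK's WordNet interface.

from nltk.corpus import wordnet as wn

KEYWORD_FEATURES = {          # hypothetical keyword -> feature map
    "person": "ANIMATE",
    "animal": "ANIMATE",
    "location": "PLACE",
    "artifact": "CONCRETE",
    "abstraction": "ABSTRACT",
    "time_period": "TIME",
}

def noun_features(word, ne_type=None):
    features = set()
    if ne_type:                       # NE type from the NER step becomes a feature
        features.add(ne_type)
        if ne_type == "PERSON":
            features.add("ANIMATE")
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for ancestor in path:
                for lemma in ancestor.lemmas():
                    feat = KEYWORD_FEATURES.get(lemma.name())
                    if feat:
                        features.add(feat)
    return features

print(noun_features("fork"))                           # should include 'CONCRETE'
print(noun_features("Will-Smith", ne_type="PERSON"))   # PERSON and ANIMATE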

Types of verbs

• Verb features are based on the types of verbs
• There are three types
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily, or just happen without any performer
  – Be: verbs that tell about the existence of something or about some fact
• We classify these three into "do" and "not-do"

Semantic Processing: Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not
• "do" verbs are volitional verbs, which are performed voluntarily
• Heuristic
  – If a frame is of the form "Something/somebody …s something/somebody", then the verb is "do" type (see the sketch below)
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do"

Word     Type     Feature
was      not-do   -
eating   do       do
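A rough sketch of the heuristic, assuming NLTK's WordNet interface exposes verb sentence frames through Lemma.frame_strings(); since the exact frame wording varies, the pattern below is deliberately loose.

import re
from nltk.corpus import wordnet as wn

# A frame with a somebody/something subject and a something/somebody object.
DO_FRAME = re.compile(r"^(somebody|something)\b.*\b(something|somebody)\s*$", re.I)

def is_do_verb(verb):
    for synset in wn.synsets(verb, pos=wn.VERB):
        for lemma in synset.lemmas():
            if lemma.name().lower() != verb.lower():
                continue
            for frame in lemma.frame_strings():
                if DO_FRAME.match(frame):
                    return True
    return False

print(is_do_verb("eat"))  # expected: True for a volitional transitive verb
print(is_do_verb("die"))  # expected: False for a verb with no agent+object frame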

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately
• Similar relations are handled in clusters
• Rule-based approach
• Rules are applied on
  – Stanford Dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions                                                              Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE;                agt(eat, Will-Smith)
  eat is "do" type and active
dobj(eating, apple); eat is active                                      obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE           ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE                  plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun;                       mod(bank, river)
  river.feature is not PERSON
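An illustrative sketch of a few relation-generation rules mirroring the table above; the input structures (dependency dictionary, feature sets, verb info) are assumptions for the example, not the system's actual data types or rule engine.

def generate_relations(deps, features, verb_info):
    """deps: dict like {'nsubj': [('eat', 'Will-Smith')], ...}
    features: dict word -> set of features
    verb_info: dict verb -> {'type': 'do'/'not-do', 'voice': 'active'/'passive'}"""
    unl = []
    for verb, noun in deps.get("nsubj", []):
        info = verb_info.get(verb, {})
        if "ANIMATE" in features.get(noun, set()) and info.get("type") == "do" \
                and info.get("voice") == "active":
            unl.append(("agt", verb, noun))
    for verb, noun in deps.get("dobj", []):
        if verb_info.get(verb, {}).get("voice") == "active":
            unl.append(("obj", verb, noun))
    for verb, prep, noun in deps.get("prep_pobj", []):   # collapsed prep+pobj pairs
        nf = features.get(noun, set())
        if prep == "with" and "CONCRETE" in nf:
            unl.append(("ins", verb, noun))
        elif prep == "on" and "PLACE" in nf:
            unl.append(("plc", verb, noun))
    return unl

deps = {"nsubj": [("eat", "Will-Smith")], "dobj": [("eat", "apple")],
        "prep_pobj": [("eat", "with", "fork"), ("eat", "on", "bank")]}
features = {"Will-Smith": {"PERSON", "ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
verb_info = {"eat": {"type": "do", "voice": "active"}}
print(generate_relations(deps, features, verb_info))
# [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'),
#  ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank')]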

Attribute Generation (1/2)

• Attributes are handled in clusters
• Attributes are generated by rules
• Rules are applied on
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions                                                              Attributes generated
eat is the verb of the main clause; VTYPE(eat, main);                   eat.@entry.@progress.@past
  PROG(eat, +); TENSE(eat, past)
det(apple, an)                                                          apple.@indef
det(fork, a)                                                            fork.@indef
det(bank, the)                                                          bank.@def
det(river, the)                                                         river.@def
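Illustrative attribute rules in the same spirit, again a sketch with assumed input structures rather than the system's real data types.

def generate_attributes(xle_info, dets, main_verb):
    attrs = {}                                    # word -> list of attributes
    v = main_verb
    attrs[v] = ["@entry"]
    if xle_info.get(("PROG", v)) == "+":
        attrs[v].append("@progress")
    attrs[v].append("@" + xle_info.get(("TENSE", v), "present"))
    for noun, det in dets:                        # from det(noun, det) dependencies
        attrs.setdefault(noun, []).append("@indef" if det in ("a", "an") else "@def")
    return attrs

xle = {("PROG", "eat"): "+", ("TENSE", "eat"): "past", ("VTYPE", "eat"): "main"}
print(generate_attributes(xle, [("apple", "an"), ("bank", "the")], "eat"))
# {'eat': ['@entry', '@progress', '@past'], 'apple': ['@indef'], 'bank': ['@def']}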

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated in 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 – (Simplifier + Merger)
  – Scenario 4: Scenario 2 – XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
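A sketch of the triple and attribute matching metric as read from the definitions above (my own rendering, not the official evaluation script).

def relation_score(gen, gold):
    """gen, gold: sets of triples (relation, uw1, uw2). Returns precision, recall, F-score."""
    matched = len(gen & gold)
    p = matched / len(gen) if gen else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def attribute_accuracy(gen_attrs, gold_attrs):
    """gen_attrs, gold_attrs: dict UW -> set of attributes, for UWs where UW_gen = UW_gold."""
    matched = sum(len(gen_attrs[uw] & gold_attrs.get(uw, set())) for uw in gen_attrs)
    total = sum(len(gold_attrs[uw]) for uw in gold_attrs)
    return matched / total if total else 0.0

gen = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("ins", "eat", "fork")}
gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
print(relation_score(gen, gold))   # (0.666..., 0.666..., 0.666...)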

Results: Relation-wise accuracy

Results: Overall accuracy

Results: Relation

Results: Relation-wise Precision

Results: Relation-wise Recall

Results: Relation-wise F-score

Results: Attribute

[Bar chart: attribute accuracy (y-axis 0–0.5) for Scenario 1 through Scenario 4]

Results: Time taken

• Time taken by the systems to process 7000 sentences from the EOLSS corpus

[Bar chart: time taken (y-axis 0–40) for Scenario 1 through Scenario 4]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart: wrongly generated relations (false positives, 0–35,000) per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Bar chart: missed relations (false negatives, 0–50,000) per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• Example: James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• Example: George the president of the football club, James the coach of the club, and the players went to watch their rivals' game

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: rules are too general
  – agt–aoj: non-uniformity in the corpus
  – gol–obj: multiple relation-generation possibilities

Example: The former however represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms

Corpus relations:                    Generated relations:
aoj(represent, latter)               agt(represent, latter)
aoj(represent, former)               aoj(represent, former)
gol(dosage, body)                    obj(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module                     Output

(Input UNL expression)     obj(contact(…
Lexeme Selection           संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
                           (contact farmer this you region taluka Manchar Khatav)
Case Identification        संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
                           (contact farmer this you region taluka Manchar Khatav)
Morphology Generation      संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
                           (contact-imperative farmers this you region taluka Manchar Khatav)
Function Word Insertion    संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
                           (contact farmers this-for you region or taluka-of Manchar Khatav)
Linearization              इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।
                           (this-for you Manchar region or Khatav taluka-of farmers contact)

[Figure: UNL graph of the example – contact linked by agt (you), obj (farmer), pur (this), and plc (region); the region node carries nam links to Manchar and Khatav taluka under an or scope :01]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCT, N-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous

[Figure: lexeme choices attached to the graph – contact → संपर्क, you → आप, farmer → किसान, this → यह]

Case Marking

Relation   Parent+    Parent–   Child+   Child–
obj        V          VINT      N        –
agt        V, past    –         N        –

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[Figure: the example graph after case identification – contact(संपर्क), you(आप), farmer(किसान), this(यह)]

Morphology Generation: Nouns

Suffix    Attribute values
-oM       N, NUM=pl, oblique
-U        N, NUM=sg, oblique
-I        N, NIF, sg, oblique
-iyoM     N, NIF, pl, oblique
-oM       N, N-ANOTCHF, pl, oblique

The boy saw me      लड़के ने मुझे देखा
Boys saw me         लड़कों ने मुझे देखा
The King saw me     राजा ने मुझे देखा
Kings saw me        राजाओं ने मुझे देखा

Verb Morphology

Suffix          Tense     Aspect     Mood      N    Gen      P     VE
-e rahaa thaa   past      progress   –         sg   male     3rd   e
-taa hai        present   custom     –         sg   male     3rd   –
-iyaa thaa      past      complete   –         sg   male     3rd   I
saktii hain     present   –          ability   pl   female   3rd   A
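A sketch of suffix selection keyed on morphological features, mirroring the table above; the feature names and the tiny paradigm are illustrative only.

VERB_SUFFIXES = {
    # (tense, aspect, mood, number, gender, person) -> suffix
    ("past", "progressive", None, "sg", "m", "3"): "-e rahaa thaa",
    ("present", "habitual", None, "sg", "m", "3"): "-taa hai",
    ("past", "perfective", None, "sg", "m", "3"): "-iyaa thaa",
    ("present", None, "ability", "pl", "f", "3"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="m", person="3"):
    return VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person), "")

print(verb_suffix("past", aspect="progressive"))   # -e rahaa thaa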

After Morphology Generation

Before: contact(संपर्क) —agt→ you(आप), —obj→ farmer(किसान), —pur→ this(यह)
After:  contact(संपर्क कीजिए) —agt→ you(आप), —obj→ farmer(किसानों), —pur→ this(इस)

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

[Figure: contact(संपर्क कीजिए) —agt→ you(आप), —obj→ farmer(किसानों को), —pur→ this(इसके लिए)]

Rel   Par+   Par–   Chi+       Chi–    ChFW
obj   V      VINT   N, ANIMT   topic   को

Linearization

इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए
(this-for you Manchar region or Khatav taluka-of farmers contact)

[Figure: the UNL graph of the example with the function-word-inserted lexemes, traversed to produce the linear order above]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

[Figure: fragments of the example graph – the pur edge (contact → this) alone, and the agt/obj/pur edges around contact]

Pairwise precedence of relations sharing a head (L = goes to the left of, R = to the right of):

        agt   aoj   obj   ins
agt      –     L     L     L
aoj            –     L     L
obj                  –     R
ins                        –

[Figure: resulting order of you, this, and farmer around contact]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group using the precedence table: pur agt obj → this you farmer
• Final order: this you farmer contact (the region subtree – Manchar/Khatav taluka – is linearized the same way; a sketch follows)
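A hedged sketch of the before/after-set linearization idea: child relations of a head are split into those ordered before and after the head, each group is ordered by a pairwise precedence table, and subtrees are linearized recursively. The precedence values below are illustrative.

# precedence[(r1, r2)] = "L" means a child attached by r1 precedes one attached by r2
PRECEDENCE = {("agt", "obj"): "L", ("agt", "aoj"): "L", ("agt", "ins"): "L",
              ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R",
              ("pur", "agt"): "L", ("pur", "obj"): "L"}
AFTER_HEAD = set()              # relations whose children follow the head (none here)

def before(r1, r2):
    if (r1, r2) in PRECEDENCE:
        return PRECEDENCE[(r1, r2)] == "L"
    if (r2, r1) in PRECEDENCE:
        return PRECEDENCE[(r2, r1)] == "R"
    return True                 # unspecified pairs keep their given order

def order(relations):
    # insertion sort with the (partial) precedence relation as comparator
    out = []
    for rel, child in relations:
        i = 0
        while i < len(out) and before(out[i][0], rel):
            i += 1
        out.insert(i, (rel, child))
    return out

def linearize(node, children):
    """children: dict node -> list of (relation, child-node)."""
    pre = [c for c in children.get(node, []) if c[0] not in AFTER_HEAD]
    post = [c for c in children.get(node, []) if c[0] in AFTER_HEAD]
    words = [w for _, ch in order(pre) for w in linearize(ch, children)]
    words.append(node)
    words += [w for _, ch in order(post) for w in linearize(ch, children)]
    return words

tree = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(" ".join(linearize("contact", tree)))   # this you farmer contact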

Syntax Planning Algo

[Table: step-by-step trace of the planning algorithm on the running example, with columns Stack, Before Current, After Current, and Output]


All Together (UNL → Hindi)

Module                     Output (gloss)
(Input UNL expression)     obj(contact(…
Lexeme Selection           contact farmer this you region taluka Manchar Khatav
Case Identification        contact farmer this you region taluka Manchar Khatav
Morphology Generation      contact-imperative farmers this you region taluka Manchar Khatav
Function Word Insertion    contact farmers this-for you region or taluka-of Manchar Khatav
Syntax Linearization       this-for you Manchar region or Khatav taluka-of farmers contact

(The Hindi output at each step is as shown in the step-through slides above.)

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output                                                                                      Fluency   Adequacy
कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए              2         3
परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए                               3         4
माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत                                   1         1
2णविTक सकरमाण स इसक0 2ड़O परभनित त amp                                                             4         4
इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp                                 2         3

Results

                           BLEU    Fluency   Adequacy
Geometric average          0.34    2.54      2.84
Arithmetic average         0.41    2.71      3.00
Standard deviation         0.25    0.89      0.89
Pearson corr. (BLEU)       1.00    0.59      0.50
Pearson corr. (Fluency)    0.59    1.00      0.68

Fluency vs Adequacy

[Stacked bar chart: number of sentences (0–200) at each fluency level 1–4, broken down by adequacy levels 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: Vauquois triangle of MT approaches – direct translation, surface and deep syntactic transfer, semantic and multilevel transfer, and interlingua (as shown earlier)]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs
  – Bidirectional
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage
  – One additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS tagging & chunking engine, H2 Morph analyser engine, H3 Generator engine, H4 Lexical disambiguation engine, H5 Named entity engine, H6 Dictionary standards, H7 Corpora annotation standards, H8 Evaluation of output (comprehensibility), H9 Testing & integration

Vertical Tasks for Each Language

V1 POS tagger & chunker, V2 Morph analyzer, V3 Generator, V4 Named entity recognizer, V5 Bilingual dictionary (bidirectional), V6 Transfer grammar, V7 Annotated corpus, V8 Evaluation, V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade   Description
0       Nonsense (the sentence doesn't make any sense at all – like someone speaking to you in a language you don't know)
1       Some parts make sense but it is not comprehensible overall (e.g. listening to a language with lots of words borrowed from your language – you understand those words but nothing more)
2       Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3       Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4       Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (0–1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer, and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger, and chunker give more than 90% precision, but the transfer, WSD, and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar, and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
  वाच vaach/read → वाचणे vaachaNe/reading
  उतर utara/climb down → उतरण utaraN/downward slope

Adjectives (Verb → Adjective):
  चाव chav/bite → चावणारा chaavaNaara/one who bites
  खा khaa/eat → खाल्लेले khallele/something that is eaten

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
  पळ paL/run → पळताना paLataanaa/while running
  बस bas/sit → बसून basun/after sitting

Kridanta Types

Kridanta Type           Example                                             Aspect
"णे" Ne-Kridanta         vaachNyaasaaThee pustak de                          Perfective
                        (Give me a book for reading)
                        [For reading book give]
"ल्या" laa-Kridanta       Lekh vaachalyaavar saaMgen                          Perfective
                        (I will tell you that after reading the article)
                        [Article after reading will tell]
"ताना" Taanaa-Kridanta    Pustak vaachtaanaa te lakShaat aale                 Durative
                        (I noticed it while reading the book)
                        [Book while reading it in mind came]
"लेले" Lela-Kridanta      kaal vaachlele pustak de                            Perfective
                        (Give me the book that (I/you) read yesterday)
                        [Yesterday read book give]
"ऊन" Un-Kridanta         pustak vaachun parat kar                            Completive
                        (Return the book after reading it)
                        [Book after reading back do]
"णारा" Nara-Kridanta      pustake vaachNaaRyaalaa dnyaan miLte                Stative
                        (The one who reads books gets knowledge)
                        [Books to the one who reads knowledge gets]
"वे" ve-Kridanta         he pustak pratyekaane vaachaave                     Inceptive
                        (Everyone should read this book)
                        [This book everyone should read]
"ता" taa-Kridanta        to pustak vaachtaa vaachtaa zopee gelaa             Stative
                        (He fell asleep while reading a book)
                        [He book while reading to sleep went]

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall (0.86–0.98) of kridanta processing for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon-, and Va-Kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources, and tools


Introductionbull Machine Translation (MT) is a technique to translate

texts from one natural language to another natural language using a machine

bull Translated text should have two desired propertiesndash Adequacy Meaning should be conveyed correctlyndash Fluency Text should be fluent in the target

languagebull Translation between distant languages is a difficult

taskndash Handling Language Divergence is a major

challenge

3

4

Kinds of MT Systems(point of entry from source to the target text)

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

5

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

6

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

7

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) Noun Incorporation- very common Indian Language Phenomenon

8

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aap jaanaa chaahate ho (who withhellip)

9

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergence ndash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull सनिनित भत (immediate past)ndash कधी आलास यत इतकच (M)ndash कब आय बस अभ आय (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull निसशय भनिषय (certainty in future)ndash आत त मार खात खास (M)ndash अब मार खायगा (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आशवास (assurance)ndash मा तमला उदया भटत (M)ndash माamp आप स कला मिमालात ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from the scores assigned for translation quality
– Accuracy: 65.32%

• Result analysis
– The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
– Morph disambiguation, parsing, transfer grammar and function-word (FW) disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (verb → noun):
– vaach (read) → vaachaNe (reading)
– utara (climb down) → utaraN (downward slope)

Adjectives (verb → adjective):
– chav (bite) → chaavaNaara (one who bites)
– khaa (eat) → khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (verb → adverb):
– paL (run) → paLataanaa (while running)
– bas (sit) → basun (after sitting)

Kridanta Types (type – example – word-by-word gloss – aspect):

– Ne-kridanta: vaachNyaasaaThee pustak de (Give me a book for reading) – for-reading book give – Perfective
– laa-kridanta: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) – article after-reading will-tell – Perfective
– Taanaa-kridanta: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) – book while-reading it in-mind came – Durative
– Lela-kridanta: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) – yesterday read book give – Perfective
– Un-kridanta: pustak vaachun parat kar (Return the book after reading it) – book after-reading back do – Completive
– Nara-kridanta: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) – books to-the-one-who-reads knowledge gets – Stative
– ve-kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) – this book everyone should-read – Inceptive
– taa-kridanta: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) – he book while-reading to-sleep went – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Figure: morphotactics FSM for kridanta processing
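The FSM itself appears only as a figure in the slides. As a rough illustration of the morphotactics it encodes (suffix spellings approximated from the transliterations in the kridanta table above; the function and data names are hypothetical, and the real analyser works over Devanagari with a full stem lexicon), a suffix-stripping analyser could look like this:

```python
# Illustrative sketch of kridanta suffix analysis (not the actual Marathi morph analyser).
KRIDANTA_SUFFIXES = {                 # transliterated suffix -> (kridanta type, aspect)
    "NyaasaaThee": ("Ne-kridanta",     "perfective"),
    "lyaavar":     ("laa-kridanta",    "perfective"),
    "taanaa":      ("Taanaa-kridanta", "durative"),
    "lele":        ("Lela-kridanta",   "perfective"),
    "un":          ("Un-kridanta",     "completive"),
    "Naara":       ("Nara-kridanta",   "stative"),
    "aave":        ("ve-kridanta",     "inceptive"),
    "taa":         ("taa-kridanta",    "stative"),
}

def analyse_kridanta(word, verb_stems):
    """Return (stem, type, aspect) if word = known verb stem + kridanta suffix."""
    # Try longer suffixes first so that e.g. 'taanaa' is not split as stem + 'taa'.
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in verb_stems:                    # accept only known verb stems
                ktype, aspect = KRIDANTA_SUFFIXES[suffix]
                return stem, ktype, aspect
    return None                                       # not recognised as a kridanta

verb_stems = {"vaach", "paL", "bas"}
print(analyse_kridanta("vaachtaanaa", verb_stems))    # ('vaach', 'Taanaa-kridanta', 'durative')
print(analyse_kridanta("vaachlele", verb_stems))      # ('vaach', 'Lela-kridanta', 'perfective')
```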

Accuracy of Kridanta Processing: Direct Evaluation

Figure: precision and recall per kridanta type (Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-kridanta), all roughly in the range 0.86 to 0.98.

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

5

Why is MT difficult Language Divergence

bull One of the main complexities of MT Language Divergence

bull Languages have different ways of expressing meaningndash Lexico-Semantic Divergencendash Structural Divergence

English-IL Language Divergence with illustrations from Hindi(Dave Parikh Bhattacharyya Journal of MT 2002)

6

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

7

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) Noun Incorporation- very common Indian Language Phenomenon

8

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aap jaanaa chaahate ho (who withhellip)

9

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergence ndash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull सनिनित भत (immediate past)ndash कधी आलास यत इतकच (M)ndash कब आय बस अभ आय (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull निसशय भनिषय (certainty in future)ndash आत त मार खात खास (M)ndash अब मार खायगा (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आशवास (assurance)ndash मा तमला उदया भटत (M)ndash माamp आप स कला मिमालात ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

"Deepa claimed that she had composed a poem."

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41
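A small reader for such bracketed listings is often handy when experimenting with UNL. The sketch below is illustrative only (not the UNL Centre's toolkit); it assumes the usual rel[:scope](arg1, arg2) line shape with attributes attached as .@ suffixes, as in the example above.

```python
import re

# Minimal UNL expression reader: relation[:scope](arg1, arg2), attributes as .@attr.
LINE = re.compile(r"^(?P<rel>\w+)(?::(?P<scope>\w+))?\((?P<a1>[^,]+),\s*(?P<a2>.+)\)$")

def parse_arg(arg):
    """Split 'compose.@past.@entry' into the headword and its attribute list."""
    head, *attrs = arg.strip().split(".@")
    return head, attrs

def parse_unl(lines):
    triples = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip [UNL] markers and blank lines
        a1, attrs1 = parse_arg(m.group("a1"))
        a2, attrs2 = parse_arg(m.group("a2"))
        triples.append((m.group("rel"), m.group("scope"), (a1, attrs1), (a2, attrs2)))
    return triples

example = [
    "agt(claim.@entry.@past, Deepa)",
    "obj(claim.@entry.@past, :01)",
    "agt:01(compose.@past.@entry.@complete, she)",
    "obj:01(compose.@past.@entry.@complete, poem.@indef)",
]
for t in parse_unl(example):
    print(t)
```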

The Lexicon

Content words:

Headword    Universal Word            Attributes
[forward]   "forward(icl>send)"       (V, VOA)                   <E, 0, 0>
[mail]      "mail(icl>message)"       (N, PHSCL, INANI)          <E, 0, 0>
[minister]  "minister(icl>person)"    (N, ANIMT, PHSCL, PRSN)    <E, 0, 0>

He forwarded the mail to the minister.

42

The Lexicon (cntd)

Function words:

Headword   Universal Word   Attributes
[he]       "he"             (PRON, SUB, SING, 3RD)   <E, 0, 0>
[the]      "the"            (ART, THE)               <E, 0, 0>
[to]       "to"             (PRE, TO)                <E, 0, 0>

He forwarded the mail to the minister.

43

How to obtain UNL expressions

• The UNL nerve centre is the verb
• English sentences have the form
  – A <verb> B, or
  – A <verb>

(Figure: the verb as the central node, linked by relations rel to A and B; recurse on A and B)

Enconversion

Long sequence of masters theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita

Many publications

System Architecture

(Figure: Simple Sentence Enconverter, where the Stanford Dependency Parser, XLE Parser, WSD and NER feed Feature Generation, followed by Attribute Generation and Relation Generation)

(Figure: overall pipeline, where the Clause Marker and Simplifier break the input into simple sentences, several Simple Enconverters process them in parallel, and the Merger combines their outputs into one UNL)

Complexity of handling long sentences

• Problem: as the length (number of words) of a sentence increases, the XLE parser fails to capture the dependency relations precisely
• The UNL enconversion system is tightly coupled with the XLE parser
• The result is a fall in accuracy

Solution

• Break long sentences into smaller ones
  – A sentence can only be broken at the clausal level: Simplification
  – Once broken, the pieces need to be joined: Merging
• Reduce the reliance of the Enconverter on the XLE parser
  – The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example
  – Sentence: John went to play and Jane went to school.
  – Simplified: John went to play. Jane went to school.
• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser (a minimal splitting sketch follows below)
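The sketch below shows the idea of clause-level splitting only; it is not the IIT-B simplifier. It assumes Stanford-style dependency triples are already available, and the relation names and indices in the example are illustrative.

```python
# Toy clause splitter over Stanford-style dependencies (illustration only).
def split_on_clause_heads(tokens, deps):
    """tokens: list of words; deps: list of (relation, head_index, dep_index)."""
    clause_heads = {h for (rel, h, d) in deps if rel == "nsubj"}     # verbs with their own subject
    conj_pairs = [(h, d) for (rel, h, d) in deps
                  if rel == "conj" and h in clause_heads and d in clause_heads]
    if not conj_pairs:
        return [" ".join(tokens)]
    cc_positions = [d for (rel, h, d) in deps if rel == "cc"]
    if not cc_positions:
        return [" ".join(tokens)]
    cut = min(cc_positions)                 # split at the coordinating conjunction
    return [" ".join(tokens[:cut]), " ".join(tokens[cut + 1:])]      # drop the conjunction

# Example: "John went to play and Jane went to school"
tokens = "John went to play and Jane went to school".split()
deps = [("nsubj", 1, 0), ("conj", 1, 6), ("cc", 1, 4), ("nsubj", 6, 5)]
print(split_on_clause_heads(tokens, deps))
# ['John went to play', 'Jane went to school']
```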

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL
• Example
  – Input:
    • agt(go.@entry, John), pur(go.@entry, play)
    • agt(go.@entry, Mary), gol(go.@entry, school)
  – Output:
    • agt(go.@entry, John), pur(go.@entry, play)
    • and(go, :01)
    • agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases, depending on how the sentence was simplified; the Merger uses a rule for each case (a toy version is sketched below)
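A rough sketch of the conjunction case only, assuming relations are kept as tuples; this is an illustration of the idea, not the system's Merger, and it handles none of the other simplification cases.

```python
# Illustrative merge of two simple-sentence UNL relation lists (conjunction case).
# Each relation is (name, scope, arg1, arg2); scope None means the main graph.
def merge_conjoined(unl1, unl2, head1, scope=":01"):
    """Put unl2 into a subordinate scope and link the clause heads with 'and'."""
    merged = list(unl1)
    merged.append(("and", None, head1, scope))          # and(go, :01)
    for (rel, _, a1, a2) in unl2:
        merged.append((rel, scope, a1, a2))             # agt:01(go.@entry, Mary), ...
    return merged

unl1 = [("agt", None, "go.@entry", "John"), ("pur", None, "go.@entry", "play")]
unl2 = [("agt", None, "go.@entry", "Mary"), ("gol", None, "go.@entry", "school")]
for rel, scope, a1, a2 in merge_conjoined(unl1, unl2, "go"):
    print(f"{rel}{scope or ''}({a1}, {a2})")
```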

NLP tools and resources for UNL generation

Tools
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources
• Wordnet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing
• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation: noun features and verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types
• Types covered
  – Person
  – Place
  – Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river

Syntactic Processing: Stanford Dependency Parser (1/2)

• The Stanford Dependency Parser parses the sentence and produces
  – POS tags of the words in the sentence
  – the dependency parse of the sentence

Syntactic Processing: Stanford Dependency Parser (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1), aux(eating-3, was-2), det(apple-5, an-4), dobj(eating-3, apple-5), prep(eating-3, with-6), det(fork-8, a-7), pobj(with-6, fork-8), prep(eating-3, on-9), det(bank-11, the-10), pobj(on-9, bank-11), prep(bank-11, of-12), det(river-14, the-13), pobj(of-12, river-14)

Syntactic Processing XLE Parser

• The XLE parser generates dependency relations and some other important information

Input: Will-Smith was eating an apple with a fork on the bank of the river

Dependency relations:
OBJ(eat:8, apple:16), SUBJ(eat:8, will-smith:0), ADJUNCT(apple:16, on:20), OBJ(on:20, bank:22), ADJUNCT(apple:16, with:17), OBJ(with:17, fork:19), ADJUNCT(bank:22, of:23), OBJ(of:23, river:25)

Important information:
PASSIVE(eat:8, -), PROG(eat:8, +), TENSE(eat:8, past), VTYPE(eat:8, main)

SEMANTIC PROCESSING

Semantic Processing Finding stems

• Wordnet is used for finding the stems of the words (a lookup sketch follows below)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word     Stem
was      be
eating   eat
apple    apple
fork     fork
bank     bank
river    river
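Since WordNet is the stated source of the stems, the lookup can be done with NLTK's WordNet interface; morphy applies WordNet's rules of detachment and its exception lists. A minimal sketch, assuming NLTK and its WordNet data are installed (not the system's own code):

```python
from nltk.corpus import wordnet as wn   # requires: pip install nltk; nltk.download('wordnet')

POS_HINTS = {"was": wn.VERB, "eating": wn.VERB}   # in the real system the POS comes from the parser

def wordnet_stem(word, pos=None):
    """Return WordNet's base form of the word, falling back to the word itself."""
    return wn.morphy(word.lower(), pos) or word

for w in ["was", "eating", "apple", "fork", "bank", "river"]:
    print(w, "->", wordnet_stem(w, POS_HINTS.get(w)))
# was -> be, eating -> eat, apple -> apple, fork -> fork, bank -> bank, river -> river
```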

Semantic Processing: WSD

Word         Synset-id   Sense-id
Will-Smith   -           -
was          02610777    be%2:42:03::
eating       01170802    eat%2:34:00::
an           -           -
apple        07755101    apple%1:13:00::
with         -           -
a            -           -
fork         03388794    fork%1:06:00::
on           -           -
the          -           -
bank         09236472    bank%1:17:01::
of           -           -
the          -           -
river        09434308    river%1:17:00::

Word         POS
Will-Smith   Proper noun
was          Verb
eating       Verb
an           Determiner
apple        Noun
with         Preposition
a            Determiner
fork         Noun
on           Preposition
the          Determiner
bank         Noun
of           Preposition
the          Determiner
river        Noun

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word         Universal Word
Will-Smith   Will-Smith(iof>PERSON)
eating       eat(icl>consume, equ>consumption)
apple        apple(icl>edible fruit>thing)
fork         fork(icl>cutlery>thing)
bank         bank(icl>slope>thing)
river        river(icl>stream>thing)

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs
• For the frame "noun1 of noun2", where noun1 originates from verb1, the relation obj(verb1, noun2) should be generated instead of a relation between noun1 and noun2

Algorithm
• Traverse the hypernym path of the noun to its root
• If any word on the path is "action", the noun originated from a verb
(a hedged WordNet sketch of this test follows below)

Examples
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
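The hypernym-path test maps directly onto WordNet's hypernym closure. The sketch below is one possible coding of it with NLTK, not the system's implementation: it checks whether the chosen noun sense sits under the act/action subtree. The same traversal, with keywords like person or location, is what the noun-feature step further down relies on. The sense would normally come from the WSD step; here it is picked by hand.

```python
from nltk.corpus import wordnet as wn

def hypernym_names(synset):
    """All lemma names on every hypernym path from the synset to the root."""
    names = set()
    for path in synset.hypernym_paths():
        for s in path:
            names.update(s.lemma_names())
    return names

def comes_from_verb(synset):
    """Heuristic: the noun sense is deverbal if its hypernym closure contains 'act'."""
    return bool({"act", "action", "activity"} & hypernym_names(synset))

print(comes_from_verb(wn.synset("removal.n.01")))   # expected True: "the act of removing"
print(comes_from_verb(wn.synset("fork.n.01")))      # expected False: cutlery is an artifact
```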

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun features and verb features

Relation         Word   Property
agt(uw1, uw2)    uw1    Volitional verb
met(uw1, uw2)    uw2    Abstract thing
dur(uw1, uw2)    uw2    Time period
ins(uw1, uw2)    uw2    Concrete thing

Semantic Processing Noun Features

Algorithm
• Add the word's NE type to its features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun         Feature
Will-Smith   PERSON, ANIMATE
bank         PLACE
fork         CONCRETE

Types of verbs

• Verb features are based on the types of verbs
• There are three types
  – Do: verbs that are performed volitionally
  – Occur: verbs that happen involuntarily, without any performer
  – Be: verbs that state the existence of something or some fact
• We classify these three into "do" and "not-do"

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is of the "do" type or not
• "do" verbs are volitional verbs, performed voluntarily
• Heuristic: if a frame has the form "Something/somebody …s something/somebody", the verb is of "do" type (a sketch follows below)
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do"

Word     Type     Feature
was      not-do   -
eating   do       do
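One way to read the heuristic as code is through WordNet's verb sentence frames, which NLTK exposes via Lemma.frame_strings(); the regular expression below and the choice of test verbs are assumptions for illustration, not the system's rule.

```python
import re
from nltk.corpus import wordnet as wn

# A frame such as "Somebody eat something" marks a volitional ("do") verb under
# the slide's heuristic; frames like "It is raining" or "Something ----s" do not.
DO_FRAME = re.compile(r"^(Somebody|Something) \S+ (somebody|something)\b", re.IGNORECASE)

def is_do_verb(verb):
    for synset in wn.synsets(verb, pos=wn.VERB):
        for lemma in synset.lemmas():
            if any(DO_FRAME.match(frame) for frame in lemma.frame_strings()):
                return True
    return False

print("eat  ->", "do" if is_do_verb("eat") else "not-do")
print("rain ->", "do" if is_do_verb("rain") else "not-do")
```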

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately
• Similar relations are handled in clusters
• Rule-based approach
• Rules are applied on
  – Stanford Dependency relations
  – noun features
  – verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions                                                                          Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active   agt(eat, Will-Smith)
dobj(eating, apple); eat is active                                                     obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE                          ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE                                 plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON          mod(bank, river)

(a toy rule sketch follows below)
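The rows of this table are naturally written as condition-action rules over the dependency triples and the noun/verb features. The sketch below hard-codes a few such rules for the running example, using the collapsed prep_* form of the prepositional dependencies for brevity; it illustrates the rule format only and is not the system's rule base.

```python
# Toy rule-based relation generation over Stanford-style dependencies and UW features.
deps = {("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep_with", "eat", "fork"), ("prep_on", "eat", "bank"),
        ("prep_of", "bank", "river")}
feats = {"Will-Smith": {"ANIMATE", "PERSON"}, "fork": {"CONCRETE"},
         "bank": {"PLACE"}, "river": set(), "eat": {"do", "active"}}

def generate_relations(deps, feats):
    unl = []
    for rel, head, dep in deps:
        if rel == "nsubj" and "ANIMATE" in feats[dep] and {"do", "active"} <= feats[head]:
            unl.append(("agt", head, dep))
        elif rel == "dobj" and "active" in feats[head]:
            unl.append(("obj", head, dep))
        elif rel == "prep_with" and "CONCRETE" in feats[dep]:
            unl.append(("ins", head, dep))
        elif rel == "prep_on" and "PLACE" in feats[dep]:
            unl.append(("plc", head, dep))
        elif rel == "prep_of" and "PERSON" not in feats[dep]:
            unl.append(("mod", head, dep))
    return unl

for r in sorted(generate_relations(deps, feats)):
    print(f"{r[0]}({r[1]}, {r[2]})")
# agt(eat, Will-Smith)  ins(eat, fork)  mod(bank, river)  obj(eat, apple)  plc(eat, bank)
```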

Attribute Generation (1/2)

• Attributes are handled in clusters
• Attributes are generated by rules
• Rules are applied on
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions                                                                             Attributes generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past)   eat.@entry.@progress.@past
det(apple, an)                                                                         apple.@indef
det(fork, a)                                                                           fork.@indef
det(bank, the)                                                                         bank.@def
det(river, the)                                                                        river.@def
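Attribute rules have the same condition-action flavour. A small, purely illustrative sketch that derives the attributes of the example from XLE-style tense/aspect facts and the determiners:

```python
# Toy attribute generation from determiners and XLE-style facts (illustration only).
xle = {"eat": {"VTYPE": "main", "PROG": "+", "TENSE": "past"}}
dets = {"apple": "an", "fork": "a", "bank": "the", "river": "the"}

def generate_attributes(xle, dets, main_verb="eat"):
    attrs = {w: [] for w in list(xle) + list(dets)}
    v = xle[main_verb]
    if v.get("VTYPE") == "main":
        attrs[main_verb].append("@entry")
    if v.get("PROG") == "+":
        attrs[main_verb].append("@progress")
    if v.get("TENSE") == "past":
        attrs[main_verb].append("@past")
    for noun, det in dets.items():
        attrs[noun].append("@indef" if det in ("a", "an") else "@def")
    return attrs

for word, a in generate_attributes(xle, dets).items():
    print(word, ".".join(a))
# eat @entry.@progress.@past   apple @indef   fork @indef   bank @def   river @def
```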

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger
• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the attributes matched between UW_gen and UW_gold
• count(UW_x) = the attributes in UW_x
(a sketch of these metrics follows below)
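Read as code, the metric counts a generated triple as correct only when the relation and both UWs match a gold triple, and scores attributes by the amatch/count overlap. The sketch below is one way to compute relation precision/recall/F and an attribute accuracy under that definition; the actual evaluation scripts are not shown in the slides, so the denominators are assumptions.

```python
def relation_prf(gen, gold):
    """gen, gold: sets of (relation, uw1, uw2) triples."""
    tp = len(gen & gold)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def attribute_accuracy(gen_attrs, gold_attrs):
    """gen_attrs, gold_attrs: dicts UW -> set of attributes, for UWs present in both."""
    matched = total = 0
    for uw in gen_attrs.keys() & gold_attrs.keys():
        matched += len(gen_attrs[uw] & gold_attrs[uw])   # amatch(UWgen, UWgold)
        total += len(gold_attrs[uw])                     # count(UWgold)
    return matched / total if total else 0.0

gen = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("ins", "eat", "fork")}
print(relation_prf(gen, gold))   # (0.667, 0.667, 0.667) approximately
print(attribute_accuracy({"eat": {"@entry", "@past"}},
                         {"eat": {"@entry", "@past", "@progress"}}))   # 0.667 approximately
```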

Results: Relation-wise accuracy / Overall accuracy / Relation-wise Precision / Relation-wise Recall / Relation-wise F-score
(Figures: bar charts comparing Scenarios 1–4 on these measures)

Results: Attribute
(Figure: attribute accuracy for Scenarios 1–4, values between 0 and 0.5)

Results: Time taken

• Time taken by the systems to process the 7000 EOLSS sentences

(Figure: bar chart of the time taken by Scenarios 1–4, on a 0–40 scale)

ERROR ANALYSIS

Wrong Relations Generated

(Figure: false-positive counts per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall)

Missed Relations

(Figure: false-negative counts per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall)

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• George the president of the football club, James the coach of the club and the players went to watch their rival's game

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: the rules are more general
  – agt–aoj: non-uniformity in the corpus
  – gol–obj: multiple relation-generation possibilities

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module                    Output
UNL expression            obj(contact(…
Lexeme Selection          संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
                          (contact farmer this you region taluka Manchar Khatav)
Case Identification       संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
                          (contact farmer this you region taluka Manchar Khatav)
Morphology Generation     संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
                          (contact-imperative farmers this you region taluka Manchar Khatav)
Function Word Insertion   संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
                          (contact-imperative farmers-to this-for you region or taluka-of Manchar Khatav)
Linearization             इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।
                          (For this, you contact the farmers of Manchar region or Khatav taluka.)

(Figure: the UNL graph of the sentence: contact with agt → you, obj → farmer, pur → this and plc → region; region is linked by nam to Manchar and by or to the Khatav taluka subgraph, scope :01)

Lexeme Selection

[संपर्क]  "contact(icl>communicate(agt>person,obj>person))"  (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-V, linkV, a)

[पच कवयलि7]  "contact(icl>representative)"  (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous

(Figure: partial UNL graph: agt → you (आप), obj → farmer (किसान), pur → this (यह), around contact (संपर्क))

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation   Parent+    Parent–   Child+   Child–
Obj        V          V,INT     N        -
Agt        V, @past   -         N        -

(Figure: partial UNL graph: agt → you (आप), obj → farmer (किसान), pur → this (यह), around contact (संपर्क))

Morphology Generation Nouns

Suffix   Attribute values
-uoM     N, NUM, pl, oblique
-U       N, NUM, sg, oblique
-I       N, NIF, sg, oblique
-iyoM    N, NIF, pl, oblique
-oM      N, NANOTCHF, pl, oblique

The boy saw me     लड़के ने मुझे देखा
Boys saw me        लड़कों ने मुझे देखा
The King saw me    राजा ने मुझे देखा
Kings saw me       राजाओं ने मुझे देखा

(a toy suffix-lookup sketch follows below)
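The suffix table is essentially a lookup keyed on the noun's gender, number and case. The sketch below is a toy version in transliteration, assuming the compressed attribute labels encode exactly those three properties; the real generator also consults the noun's declension class in the lexicon.

```python
# Toy lookup for oblique plural suffixes from the table above (illustration only).
SUFFIX = {("masc", "pl", "oblique"): "oM",       # laDakaa -> laDakoM
          ("fem",  "pl", "oblique"): "iyoM",     # laDakii -> laDakiyoM
          ("fem",  "sg", "oblique"): "I"}

def inflect(stem, gender, number, case):
    """Attach the oblique suffix for the given agreement features."""
    return stem + SUFFIX.get((gender, number, case), "")

print(inflect("laDak", "masc", "pl", "oblique") + " ne mujhe dekhaa")   # Boys saw me
print(inflect("raajaa", "masc", "pl", "oblique") + " ne mujhe dekhaa")  # Kings saw me
```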

Verb Morphology

Suffix          Tense     Aspect     Mood      N    Gen      P     VE
-e rahaa thaa   past      progress   -         sg   male     3rd   e
-taa hai        present   custom     -         sg   male     3rd   -
-iyaa thaa      past      complete   -         sg   male     3rd   I
saktii hain     present   -          ability   pl   female   3rd   A

After Morphology Generation

(Figure: the partial UNL graph before and after morphology generation: you stays आप, this becomes इस, farmer becomes किसानों, contact becomes संपर्क कीजिए; the agt, obj and pur arcs are unchanged)

Function Word Insertion

Before: संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

(Figure: the graph after insertion: this → इसके लिए, farmer → किसानों को, contact → संपर्क कीजिए)

Rule example:
Rel   Par+   Par–    Chi+       Chi–     Child FW
Obj   V      V,INT   N,ANIMT    @topic   को

Linearization

इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए
(This-for you Manchar region or Khatav taluka of farmers contact)

(Figure: the full UNL graph, contact/संपर्क कीजिए with its agt, obj, pur and plc arcs and the region/Manchar/Khatav/taluka subgraph, being linearized into the Hindi word order)

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – the semantic properties of the relata (Semantic Independence)
  – the rest of the expression (Context Independence)
• The relative word order of the various relations sharing a relatum does not depend on the rest of the expression (Local Ordering)

Pairwise ordering table (L = the row relation's child is placed to the left of the column relation's child, R = to the right):

        agt   aoj   obj   ins
agt      -     L     L     L
aoj            -     L     L
obj                  -     R
ins                        -

(Figure: the partial graphs for pur → this and for agt/obj/pur around contact)

Syntax Planning Strategy

• Divide a node's relations into
  – Before_set = {obj, pur, agt}
  – After_set = { }
• Topologically sort each group using the pairwise table: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact
(a toy linearizer is sketched below)
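The strategy can be sketched as a small recursive linearizer: split a node's relations into Before/After sets, order the Before set by a priority derived from the pairwise table, then emit children, the node, and any After children. The relation priorities below are read off the example, and the code is an illustration, not the HinD planner.

```python
# Illustrative Before/After syntax planning on the example subgraph.
# ORDER[r] == "L" means the child reached via relation r is placed before its head.
ORDER = {"agt": "L", "obj": "L", "pur": "L", "plc": "L", "ins": "L", "mod": "L"}
PRIORITY = {"pur": 0, "agt": 1, "plc": 2, "obj": 3}   # relative order inside the Before set

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}

def linearize(node):
    children = graph.get(node, [])
    before = sorted((c for c in children if ORDER[c[0]] == "L"),
                    key=lambda c: PRIORITY[c[0]])
    after = [c for c in children if ORDER[c[0]] == "R"]
    out = []
    for _, child in before:
        out += linearize(child)
    out.append(node)
    for _, child in after:
        out += linearize(child)
    return out

print(linearize("contact"))    # ['this', 'you', 'farmer', 'contact']
```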

Syntax Planning: Algorithm

(Table: a worked trace of the planning algorithm on the example, showing at each step the stack of nodes, the Before/After sets of the current node, and the output built so far, growing from "this you" through "this you manchar region" to the full order ending in "… taluka farmer contact")

(Figure: the UNL graph of the example sentence, as before)

All Together (UNL → Hindi)

(Table: the same module-by-module step-through as above: Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion and Syntax Linearization, ending in the Hindi sentence इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए)

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                         BLEU   Fluency   Adequacy
Geometric Average        0.34   2.54      2.84
Arithmetic Average       0.41   2.71      3.00
Standard Deviation       0.25   0.89      0.89
Pearson Cor. (BLEU)      1.00   0.59      0.50
Pearson Cor. (Fluency)   0.59   1.00      0.68
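The correlations in the table are plain Pearson coefficients over per-sentence scores. A minimal sketch of the computation with made-up numbers (the per-sentence data behind the table are not in the slides):

```python
import numpy as np

# Hypothetical per-sentence scores, only to show the computation;
# the study computed these over BLEU, fluency and adequacy per sentence.
bleu = np.array([0.31, 0.55, 0.12, 0.78, 0.40])
fluency = np.array([2, 3, 1, 4, 2])
adequacy = np.array([3, 4, 1, 4, 3])

print("BLEU vs fluency    :", np.corrcoef(bleu, fluency)[0, 1])
print("Fluency vs adequacy:", np.corrcoef(fluency, adequacy)[0, 1])
```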

Fluency vs Adequacy

(Figure: for each fluency level 1–4, the number of sentences at each adequacy level 1–4)

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

(Figure: the Vauquois triangle again, from text and tagged text through syntactic and semantic transfer up to the interlinguas, locating transfer-based MT between direct translation and interlingua-based MT)

Sampark: IL–IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; the domains are tourism and pilgrimage plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil–Hindi, Telugu–Hindi, Marathi–Hindi, Bengali–Hindi, Tamil–Telugu, Urdu–Hindi, Kannada–Hindi, Punjabi–Hindi, Malayalam–Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine; H2 morph analyser engine; H3 generator engine; H4 lexical disambiguation engine; H5 named entity engine; H6 dictionary standards; H7 corpora annotation standards; H8 evaluation of output (comprehensibility); H9 testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker; V2 morph analyzer; V3 generator; V4 named entity recognizer; V5 bilingual dictionary (bidirectional); V6 transfer grammar; V7 annotated corpus; V8 evaluation; V9 co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale

Grade   Description
0       Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1       Some parts make sense but it is not comprehensible overall (e.g. listening to a language with many words borrowed from your own: you understand those words but nothing more)
2       Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; you can still make sense of what is being said)
3       Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4       Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

(Figure: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator modules)

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M–H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
• vaach (read) → vaachaNe (reading)
• utara (climb down) → utaraN (downward slope)

Adjectives (Verb → Adjective)
• chav (bite) → chaavaNaaraa (one who bites)
• khaa (eat) → khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
• paL (run) → paLataanaa (while running)
• bas (sit) → basun (after sitting)

Kridanta Types

Kridanta type       Example                                                                                            Aspect
Ne-kridanta         vaachNyaasaaThee pustak de (for-reading book give): Give me a book for reading.                    Perfective
laa-kridanta        lekh vaachalyaavar saaMgen (article after-reading will-tell): I will tell you after reading the article.   Perfective
taanaa-kridanta     pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came): I noticed it while reading the book.   Durative
lela-kridanta       kaal vaachlele pustak de (yesterday read book give): Give me the book that (I/you) read yesterday.   Perfective
Un-kridanta         pustak vaachun parat kar (book after-reading back do): Return the book after reading it.            Completive
Nara-kridanta       pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets): The one who reads books gets knowledge.   Stative
ve-kridanta         he pustak pratyekaane vaachaave (this book everyone should-read): Everyone should read this book.   Inceptive
taa-kridanta        to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went): He fell asleep while reading a book.   Stative

Participial Suffixes in Other Agglutinative Languages

Kannada
  muridiruwaa kombe jennu esee
  broken to-branch throw
  "Throw away the broken branch": similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu
  ame padutunnappudoo nenoo panichesanoo
  she singing I work
  "I worked while she was singing": similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish
  hazirlanmis plan
  prepare-past plan
  "The plan which has been prepared": equivalent Marathi form lelaa

Morphological Processing of Kridanta forms (cont.)

(Figure: morphotactics FSM for kridanta processing; a toy suffix-stripping sketch follows below)
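The FSM peels a participial suffix off the verb form and hands the stem to the regular verb morphology. The tiny sketch below captures only that idea, in transliteration, with longest-suffix-first matching; the suffix list is taken from the kridanta table above and is illustrative, not the full analyser.

```python
# Toy kridanta analysis: peel a participial suffix off a Marathi verb form.
# Longest suffix first, so e.g. "taanaa" is tried before "taa".
KRIDANTA_SUFFIXES = {
    "NyaasaaThee": "Ne-kridanta", "lyaavar": "laa-kridanta", "taanaa": "taanaa-kridanta",
    "lele": "lela-kridanta", "Naaraa": "Nara-kridanta", "un": "Un-kridanta",
    "aave": "ve-kridanta", "taa": "taa-kridanta",
}

def analyse_kridanta(form):
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)], KRIDANTA_SUFFIXES[suffix]
    return form, None

for f in ["vaachtaanaa", "vaachlele", "vaachun", "vaachNaaraa"]:
    print(f, "->", analyse_kridanta(f))
# vaachtaanaa -> ('vaach', 'taanaa-kridanta'), vaachlele -> ('vaach', 'lela-kridanta'), ...
```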

Accuracy of Kridanta Processing Direct Evaluation

(Figure: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridantas, all between 0.86 and 0.98)

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Would an interlingua be better?
• Web sentences are being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule-governed
• A high level of linguistic expertise is needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools


6

Language Divergence Theory Lexico-Semantic Divergences

bull Conflational divergencendash F vomir E to be sickndash E stab H churaa se maaranaa (knife-with hit)ndash S Utrymningsplan E escape plan

bull Structural divergencendash E SVO H SOV

bull Categorial divergencendash Change is in POS category (many examples

discussed)

7

Language Divergence Theory Lexico-Semantic Divergences (cntd)

bull Head swapping divergencendash E Prime Minister of India H bhaarat ke pradhaan

mantrii (India-of Prime Minister)bull Lexical divergence

ndash E advise H paraamarsh denaa (advice give) Noun Incorporation- very common Indian Language Phenomenon

8

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aap jaanaa chaahate ho (who withhellip)

9

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergence ndash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull सनिनित भत (immediate past)ndash कधी आलास यत इतकच (M)ndash कब आय बस अभ आय (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull निसशय भनिषय (certainty in future)ndash आत त मार खात खास (M)ndash अब मार खायगा (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आशवास (assurance)ndash मा तमला उदया भटत (M)ndash माamp आप स कला मिमालात ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: Vauquois diagram of MT levels, from direct translation over text and tagged text, through surface and deep syntactic transfer (C-structures, F-structures), semantic and multilevel transfer (SPA structures), up to conceptual transfer via semantico-linguistic and ontological interlinguas]

• Consortium mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
– POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
– Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
– Evaluated on 500 web sentences

[Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator modules]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from scores given according to translation quality
– Accuracy: 65.32%

• Result analysis
– Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
– Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb -> Noun)
– vaach (read) -> vaachaNe (reading)
– utara (climb down) -> utaraN (downward slope)

Adjectives (Verb -> Adjective)
– chav (bite) -> chaavaNaara (one who bites)
– khaa (eat) -> khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb -> Adverb)
– paL (run) -> paLataanaa (while running)
– bas (sit) -> basun (after sitting)

Kridanta Types

Kridanta Type | Example (word-for-word gloss; translation) | Aspect
"Ne"-kridanta | vaachNyaasaaThee pustak de (for-reading book give; "Give me a book for reading") | Perfective
"laa"-kridanta | Lekh vaachalyaavar saaMgen (article after-reading will-tell; "I will tell you that after reading the article") | Perfective
"taanaa"-kridanta | Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came; "I noticed it while reading the book") | Durative
"lela"-kridanta | kaal vaachlele pustak de (yesterday read book give; "Give me the book that (I/you) read yesterday") | Perfective
"un"-kridanta | pustak vaachun parat kar (book after-reading back do; "Return the book after reading it") | Completive
"Nara"-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets; "The one who reads books gets knowledge") | Stative
"ve"-kridanta | he pustak pratyekaane vaachaave (this book everyone should-read; "Everyone should read this book") | Inceptive
"taa"-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went; "He fell asleep while reading a book") | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig: Morphotactics FSM for Kridanta Processing
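The morphotactics FSM itself appears only as a figure in the slides; below is a hedged toy sketch of the idea, a longest-suffix-first kridanta recognizer over romanized forms, with an illustrative suffix list and stem lexicon rather than the actual analyser.

```python
# Toy finite-state style kridanta analyser (romanized forms; illustrative only).
# A real morphotactics FSM works over Devanagari and also handles sandhi,
# stem alternations and inflection after the participial suffix.

KRIDANTA_SUFFIXES = [              # longest match first
    ("Nyaasaathee", "Ne-kridanta (for V-ing)"),
    ("taanaa",      "taanaa-kridanta (while V-ing)"),
    ("lele",        "lela-kridanta (that which was V-ed)"),
    ("Naaraa",      "Nara-kridanta (one who V-s)"),
    ("lyaavar",     "laa-kridanta (after V-ing)"),
    ("un",          "un-kridanta (having V-ed)"),
]

VERB_STEMS = {"vaach", "paL", "bas", "khaa", "chaav"}   # toy stem lexicon

def analyse(word):
    """Return (stem, kridanta label) if word = verb stem + kridanta suffix."""
    for suffix, label in KRIDANTA_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in VERB_STEMS:
                return stem, label
    return None

print(analyse("vaachtaanaa"))   # ('vaach', 'taanaa-kridanta (while V-ing)')
print(analyse("basun"))         # ('bas', 'un-kridanta (having V-ed)')
```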

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall of kridanta processing, roughly in the 0.86-0.98 range, for the Ne-, laa-, Nara-, lela-, taanaa-, taa-, un- and ve-kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

8

Language Divergence Theory Syntactic Divergences

bull Constituent Order divergencendash E Singh the PM of India will address the nation

today H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)

bull Adjunction Divergencendash E She will visit here in the summer H vah yahaa

garmii meM aayegii (she here summer-in will come)bull Preposition-Stranding divergence

ndash E Who do you want to go with H kisake saath aap jaanaa chaahate ho (who withhellip)

9

Language Divergence Theory Syntactic Divergences

bull Null Subject Divergence ndash E I will go H jaauMgaa (subject dropped)

bull Pleonastic Divergencendash E It is raining H baarish ho rahii haai (rain

happening is no translation of it)

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

bull सनिनित भत (immediate past)ndash कधी आलास यत इतकच (M)ndash कब आय बस अभ आय (H)ndash When did you come Just now (I came) (H)ndash kakhan ele ei elaameshchhi (B)

bull निसशय भनिषय (certainty in future)ndash आत त मार खात खास (M)ndash अब मार खायगा (H)ndash He is in for a thrashing (E)ndash ekhan o maar khaabeikhaachhei

bull आशवास (assurance)ndash मा तमला उदया भटत (M)ndash माamp आप स कला मिमालात ह (H)ndash I will see you tomorrow (E)ndash aami kaal tomaar saathe dekhaa korbokorchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT: EnConversion + Deconversion

[Figure: English, French, Hindi and Chinese connected to the Interlingua (UNL) hub; analysis into UNL, generation out of UNL]

Challenges of interlingua generation

The mother of NLP problems: extract the meaning from a sentence.

Almost all NLP problems are sub-problems of this: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on.

UNL a United Nations project

• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Currently active language groups:
– UNL_French (GETA-CLIPS, IMAG)
– UNL_Hindi (IIT Bombay, with additional work on UNL_English)
– UNL_Italian (Univ. of Pisa)
– UNL_Portugese (Univ. of Sao Paolo, Brazil)
– UNL_Russian (Institute of Linguistics, Moscow)
– UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language-independent meaning representation

[Figure: UNL hub connecting English, Russian, Japanese, Hindi, Spanish, Marathi and other languages]

39

UNL represents knowledge John eats rice with a spoon

[Figure: UNL hypergraph for the sentence, showing universal words, semantic relations and attributes; the repository has 42 semantic relations and 84 attribute labels]

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[UNL]
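A UNL expression like the one above is just a list of relation lines. The minimal Python sketch below (our own illustration, not part of any UNL toolkit) parses such lines into (relation, scope, universal word, universal word) tuples; the regex and function names are assumptions made for this example.

```python
import re

# Matches lines such as agt:01(compose.@past.@entry.@complete, she)
UNL_LINE = re.compile(r'^(?P<rel>[a-z]+)(?::(?P<scope>\d+))?\((?P<arg1>[^,]+),\s*(?P<arg2>.+)\)$')

def parse_uw(token):
    """Split a universal word like 'claim.@entry.@past' into (headword, attributes)."""
    parts = token.strip().split('.@')
    return parts[0], parts[1:]

def parse_unl(lines):
    triples = []
    for line in lines:
        m = UNL_LINE.match(line.strip())
        if not m:
            continue  # skip [UNL] markers, blank lines, etc.
        rel, scope = m.group('rel'), m.group('scope') or '00'
        triples.append((rel, scope, parse_uw(m.group('arg1')), parse_uw(m.group('arg2'))))
    return triples

example = [
    "agt(claim.@entry.@past, Deepa)",
    "obj(claim.@entry.@past, :01)",
    "agt:01(compose.@past.@entry.@complete, she)",
    "obj:01(compose.@past.@entry.@complete, poem.@indef)",
]
for t in parse_unl(example):
    print(t)
```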

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V,VOA) <E,0,0>

[mail] "mail(icl>message)" (N,PHSCL,INANI) <E,0,0>

[minister] "minister(icl>person)" (N,ANIMT,PHSCL,PRSN) <E,0,0>

Headword | Universal Word | Attributes

He forwarded the mail to the minister.

42

The Lexicon (cntd)

function words

[he] "he" (PRON,SUB,SING,3RD) <E,0,0>

[the] "the" (ART,THE) <E,0,0>

[to] "to" (PRE,TO) <E,0,0>

Headword | Universal Word | Attributes

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve centre is the verb.
• English sentences have the form A <verb> B, or A <verb>.

[Figure: the verb as the root node, linked by relations (rel) to A and B]

Recurse on A and B
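To make the "recurse on A and B" idea concrete, here is a toy Python sketch; the clause representation (A, verb, B) and the function name are assumptions for illustration, not the actual enconverter. An embedded clause receives its own scope, reproducing the "Deepa claimed that she had composed a poem" example.

```python
from itertools import count

def enconvert(clause):
    """Toy sketch of 'recurse on A and B': each clause is a triple (A, verb, B);
    an embedded clause gets its own UNL scope :0k."""
    scopes = count(1)
    def walk(node, scope=''):
        a, verb, b = node
        rels = []
        for rel, arg in (('agt', a), ('obj', b)):
            if arg is None:
                continue
            if isinstance(arg, tuple):          # embedded clause
                sub = ':%02d' % next(scopes)
                rels.append((rel + scope, verb, sub))
                rels += walk(arg, sub)
            else:
                rels.append((rel + scope, verb, arg))
        return rels
    return walk(clause)

# "Deepa claimed that she had composed a poem"
clause = ('Deepa', 'claim', ('she', 'compose', 'poem'))
for r in enconvert(clause):
    print(r)
```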

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

[Figure: system architecture. The full pipeline uses NER, the Stanford Dependency Parser, WSD and a Clause Marker; a Simplifier feeds several Simple Sentence Enconverters, whose outputs a Merger combines into one UNL. Each Simple Sentence Enconverter uses the Stanford Dependency Parser, the XLE Parser, Feature Generation, Attribute Generation and Relation Generation.]

Complexity of handling long sentences

Problem: as sentence length (number of words) increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
– A sentence can only be broken at the clausal level: Simplification
– Once broken, the pieces need to be joined back: Merging

• Reduce the reliance of the Enconverter on the XLE parser
– The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence.

• Example:
– Sentence: John went to play and Jane went to school.
– Simplified: John went to play. Jane went to school.

• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser.

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL.

• Example:
– Input:
  agt(go.@entry, John), pur(go.@entry, play)
  agt(go.@entry, Mary), gol(go.@entry, school)
– Output:
  agt(go.@entry, John), pur(go.@entry, play)
  and(go, :01)
  agt:01(go.@entry, Mary), gol:01(go.@entry, school)

• There are several cases, depending on how the sentence was simplified; the Merger uses rules for each case to merge them (one case is sketched below).
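The following sketch shows the conjunction case in Python, assuming the simple-sentence UNLs are available as lists of triples; the function name and the restriction to this one case are illustrative, not the Merger's actual interface.

```python
def merge_conjunction(unl1, unl2, head1):
    """Sketch (assumed helper, not the actual Merger): join two simple-sentence
    UNLs with an 'and' relation, pushing the second sentence into scope :01."""
    merged = list(unl1)
    merged.append(('and', head1, ':01'))
    merged += [(rel + ':01', a1, a2) for rel, a1, a2 in unl2]
    return merged

s1 = [('agt', 'go.@entry', 'John'), ('pur', 'go.@entry', 'play')]
s2 = [('agt', 'go.@entry', 'Mary'), ('gol', 'go.@entry', 'school')]
for triple in merge_conjunction(s1, s2, 'go'):
    print(triple)
```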

NLP tools and resources for UNL generation

Tools:
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources:
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing:
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing:
• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation (noun features, verb features)

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types.

• Types covered:
– Person
– Place
– Organization

• Input: Will Smith was eating an apple with a fork on the bank of the river.

• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river.

Syntactic Processing Stanford Dependency Parser (12)

• The Stanford Dependency Parser parses the sentence and produces:

– POS tags of the words in the sentence

– the dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

• The XLE parser generates dependency relations and some other important information.

Dependency relations:
OBJ(eat8, apple16), SUBJ(eat8, will-smith0), ADJUNCT(apple16, on20), OBJ(on20, bank22), ADJUNCT(apple16, with17), OBJ(with17, fork19), ADJUNCT(bank22, of23), OBJ(of23, river25)

Important information:
PASSIVE(eat8, -), PROG(eat8, +), TENSE(eat8, past), VTYPE(eat8, main)

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word → Stem
was → be
eating → eat
apple → apple
fork → fork
bank → bank
river → river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing: WSD

Word | Synset-id | Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word | Universal Word
Will-Smith | Will-Smith(iof>PERSON)

eating | eat(icl>consume, equ>consumption)

apple | apple(icl>edible fruit>thing)

fork | fork(icl>cutlery>thing)

bank | bank(icl>slope>thing)

river | river(icl>stream>thing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs.

• In the frame "noun1 of noun2", where noun1 originates from verb1, the system should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2.

Algorithm:
– Traverse the hypernym path of the noun up to its root.
– If any word on the path is "action", the noun has originated from a verb.

Examples:
– removal of garbage = removing garbage
– collection of stamps = collecting stamps
– wastage of food = wasting food
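A hedged sketch of this test using NLTK's WordNet interface; the keyword set {'act', 'action', 'activity'} is an assumption for illustration, not the system's actual list.

```python
from nltk.corpus import wordnet as wn

def noun_from_verb(noun):
    """Walk the hypernym paths of the noun's senses; if an 'act'/'action'-like
    synset appears on any path, treat the noun as verb-derived."""
    targets = {'act', 'action', 'activity'}
    for syn in wn.synsets(noun, pos=wn.NOUN):
        for path in syn.hypernym_paths():
            for node in path:
                if any(lemma in targets for lemma in node.lemma_names()):
                    return True
    return False

for n in ['removal', 'collection', 'wastage', 'river']:
    print(n, noun_from_verb(n))
```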

Features of UWs

• Each UNL relation requires certain properties of its UWs to be satisfied.

• We capture these properties as features.

• We classify the features into noun features and verb features.

Relation | Word | Property

agt(uw1, uw2) | uw1 | volitional verb

met(uw1, uw2) | uw2 | abstract thing

dur(uw1, uw2) | uw2 | time period

ins(uw1, uw2) | uw2 | concrete thing

Semantic Processing Noun Features

Algorithm:
• Add the word's NE type to the word's features.

• Traverse the hypernym path of the noun up to its root.

• If any word on the path matches a keyword, the corresponding feature is generated.

Noun | Feature
Will-Smith | PERSON, ANIMATE
bank | PLACE
fork | CONCRETE
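A similar hypernym walk can generate noun features. In the sketch below the keyword-to-feature map is illustrative, not the system's actual list, and the NE type is assumed to come from the NER module.

```python
from nltk.corpus import wordnet as wn

# Illustrative keyword -> feature map (an assumption, not the real keyword list).
KEYWORDS = {'person': 'ANIMATE', 'animal': 'ANIMATE', 'location': 'PLACE',
            'physical_entity': 'CONCRETE', 'abstraction': 'ABSTRACT'}

def noun_features(noun, ne_type=None):
    """Collect features from the NE type and from keywords on hypernym paths."""
    feats = set()
    if ne_type:                      # e.g. the NER module returned PERSON
        feats.add(ne_type)
    for syn in wn.synsets(noun, pos=wn.NOUN):
        for path in syn.hypernym_paths():
            for node in path:
                for lemma in node.lemma_names():
                    if lemma in KEYWORDS:
                        feats.add(KEYWORDS[lemma])
    return feats

for word, ne in [('fork', None), ('bank', None), ('Will-Smith', 'PERSON')]:
    print(word, noun_features(word, ne))
```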

Types of verbs

• Verb features are based on the types of verbs.

• There are three types:
– Do: verbs that are performed volitionally
– Occur: verbs that happen involuntarily, without any performer
– Be: verbs that state the existence of something or some fact

• We classify these three into "do" and "not-do".

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do"-type verb or not.

• "do" verbs are volitional verbs, performed voluntarily.

• Heuristic: if a frame has the form "Something/somebody …s something/somebody", mark the verb as "do" type.

• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do".

Word | Type | Feature
was | not do | -
eating | do | do
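The heuristic can be approximated with NLTK's WordNet sentence frames, as in this sketch; the regular expression over frame strings is our own approximation of the "Something/somebody …s something/somebody" test.

```python
import re
from nltk.corpus import wordnet as wn

# Frames with both a subject and an object placeholder, e.g. "Somebody ...s something".
TRANSITIVE_FRAME = re.compile(r'^(Somebody|Something)\b.*\b(somebody|something)\b')

def is_do_verb(verb):
    """Mark a verb as volitional 'do' if any of its WordNet sentence frames
    contains both a subject and an object placeholder (a rough heuristic)."""
    for syn in wn.synsets(verb, pos=wn.VERB):
        for lemma in syn.lemmas():
            for frame in lemma.frame_strings():
                if TRANSITIVE_FRAME.match(frame):
                    return True
    return False

for v in ['eat', 'be', 'rain']:
    print(v, is_do_verb(v))
```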

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations are handled separately.

• Similar relations are handled in clusters.

• Rule-based approach.

• Rules are applied on:
– Stanford Dependency relations
– noun features
– verb features
– POS tags
– XLE-generated information

Relation Generation (22)

Conditions → Relation generated

nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active → agt(eat, Will-Smith)

dobj(eating, apple); eat is active → obj(eat, apple)

prep(eating, with), pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)

prep(eating, on), pobj(on, bank); bank.feature = PLACE → plc(eat, bank)

prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)

Input: Will-Smith was eating an apple with a fork on the bank of the river.
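A minimal rule-engine sketch for a few of the rules above; the data structures are illustrative, and the real system applies many more rules and conditions.

```python
def generate_relations(deps, feats, verb_info):
    """deps: (relation, governor, dependent) triples; feats: word -> feature set;
    verb_info: verb -> (do/not-do, voice). Returns UNL relation triples."""
    rels = []
    pobj = {g: d for r, g, d in deps if r == 'pobj'}       # preposition -> its object
    for rel, gov, dep in deps:
        if rel == 'nsubj' and 'ANIMATE' in feats.get(dep, ()) \
                and verb_info.get(gov) == ('do', 'active'):
            rels.append(('agt', gov, dep))
        elif rel == 'dobj' and verb_info.get(gov, ('', ''))[1] == 'active':
            rels.append(('obj', gov, dep))
        elif rel == 'prep':
            obj = pobj.get(dep)
            if obj is None:
                continue
            if 'CONCRETE' in feats.get(obj, ()):
                rels.append(('ins', gov, obj))
            elif 'PLACE' in feats.get(obj, ()):
                rels.append(('plc', gov, obj))
    return rels

deps = [('nsubj', 'eat', 'Will-Smith'), ('dobj', 'eat', 'apple'),
        ('prep', 'eat', 'with'), ('pobj', 'with', 'fork'),
        ('prep', 'eat', 'on'), ('pobj', 'on', 'bank')]
feats = {'Will-Smith': {'ANIMATE'}, 'fork': {'CONCRETE'}, 'bank': {'PLACE'}}
verb_info = {'eat': ('do', 'active')}
print(generate_relations(deps, feats, verb_info))
# -> [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank')]
```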

Attribute Generation (12)

• Attributes are handled in clusters.

• Attributes are generated by rules.

• Rules are applied on:
– Stanford dependency relations
– XLE information
– POS tags

Attribute Generation (22)

Conditions → Attributes generated

eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past

det(apple, an) → apple.@indef

det(fork, a) → fork.@indef

det(bank, the) → bank.@def

det(river, the) → river.@def

Input: Will-Smith was eating an apple with a fork on the bank of the river.
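A corresponding sketch for attribute generation, again with assumed input formats for the XLE facts and the determiner dependencies.

```python
def generate_attributes(xle_facts, deps, main_verb):
    """xle_facts: (fact, (word, value)) pairs; deps: dependency triples."""
    attrs = {}
    def add(word, attr):
        attrs.setdefault(word, []).append(attr)

    if main_verb:
        add(main_verb, '@entry')
    for fact, (word, value) in xle_facts:
        if fact == 'PROG' and value == '+':
            add(word, '@progress')
        elif fact == 'TENSE':
            add(word, '@' + value)            # e.g. past -> @past
    for rel, gov, dep in deps:
        if rel == 'det':
            add(gov, '@indef' if dep in ('a', 'an') else '@def')
    return attrs

xle = [('PROG', ('eat', '+')), ('TENSE', ('eat', 'past')), ('VTYPE', ('eat', 'main'))]
deps = [('det', 'apple', 'an'), ('det', 'fork', 'a'),
        ('det', 'bank', 'the'), ('det', 'river', 'the')]
print(generate_attributes(xle, deps, main_verb='eat'))
# -> {'eat': ['@entry', '@progress', '@past'], 'apple': ['@indef'], ...}
```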

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
– 7000 sentences from the EOLSS corpus

• Evaluated in 4 scenarios:
– Scenario 1: previous system (tightly coupled with the XLE parser)
– Scenario 2: current system (tightly coupled with the Stanford parser)
– Scenario 3: Scenario 2 minus (Simplifier + Merger)
– Scenario 4: Scenario 2 minus the XLE parser

• Scenario 3 shows the impact of the Simplifier and Merger.

• Scenario 4 shows the impact of the XLE parser.

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)

• t_gold = relation_gold(UW1_gold, UW2_gold)

• t_gen = t_gold iff
– relation_gen = relation_gold
– UW1_gen = UW1_gold
– UW2_gen = UW2_gold

• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold.

• count(UW_x) = number of attributes in UW_x.
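One plausible way to turn these definitions into scores is sketched below; the exact aggregation used in the thesis evaluation may differ.

```python
def relation_scores(gen, gold):
    """gen, gold: sets of (relation, uw1, uw2) with attributes stripped from the UWs."""
    correct = len(gen & gold)
    precision = correct / len(gen) if gen else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def attribute_accuracy(pairs):
    """pairs: (gen_attrs, gold_attrs) for UWs whose headwords matched;
    accuracy = matched attributes / attributes in the gold UWs (one plausible choice)."""
    matched = sum(len(set(g) & set(h)) for g, h in pairs)
    total = sum(len(h) for _, h in pairs)
    return matched / total if total else 0.0

gen  = {('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork')}
gold = {('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('plc', 'eat', 'bank')}
print(relation_scores(gen, gold))
print(attribute_accuracy([(['@entry', '@past'], ['@entry', '@progress', '@past'])]))
```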

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results: Attribute

[Figure: attribute accuracy (0 to 0.5) for Scenario 1 to Scenario 4]

Results Time-taken

• Time taken by the systems to process the 7000 sentences from the EOLSS corpus.

[Figure: time taken (0 to 40) for Scenario 1 to Scenario 4]

ERROR ANALYSIS

Wrong Relations Generated

[Figure: false positives (wrongly generated relations) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Figure: false negatives (missed relations) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (1/3)

• Conjunctive relations
– Wrong apposition detection

• "James loves eating red water-melon, apples peeled nicely and bananas."

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
– Non-uniform relation generation

• "George the president of the football club, James the coach of the club, and the players went to watch their rivals' game."

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
– mod and qua: the rules are too general
– agt vs aoj: non-uniformity in the corpus
– gol vs obj: multiple relation-generation possibilities

"The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms."

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

[Figure: UNL graph for the running example, with nodes contact, you, this, farmer, region, taluka, Manchar, Khatav and relations agt, obj, pur, plc, nam, or; scope :01]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous.

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

• Depends on the UNL relation and the properties of the nodes.
• Case gets transferred from the head to its modifiers.

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix | Tense | Aspect | Mood | N | Gen | P | VE

-e rahaa thaa | past | progress | - | sg | male | 3rd | e

-taa hai | present | custom | - | sg | male | 3rd | -

-iyaa thaa | past | complete | - | sg | male | 3rd | I

saktii hain | present | - | ability | pl | female | 3rd | A
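A table-driven sketch of suffix selection from these rows; this is an illustrative subset only, and the real generator covers many more paradigm cells.

```python
# Lookup keyed by (tense, aspect, mood, number, gender, person); values are the
# Hindi verb-suffix strings from the table above (romanized).
VERB_SUFFIXES = {
    ('past',    'progress', None,      'sg', 'male',   '3rd'): '-e rahaa thaa',
    ('present', 'custom',   None,      'sg', 'male',   '3rd'): '-taa hai',
    ('past',    'complete', None,      'sg', 'male',   '3rd'): '-iyaa thaa',
    ('present', None,       'ability', 'pl', 'female', '3rd'): 'saktii hain',
}

def verb_suffix(tense, aspect=None, mood=None, num='sg', gen='male', person='3rd'):
    return VERB_SUFFIXES.get((tense, aspect, mood, num, gen, person))

print(verb_suffix('past', aspect='progress'))   # -e rahaa thaa
print(verb_suffix('present', mood='ability', num='pl', gen='female'))
```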

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

[Figure: the UNL graph with the generated Hindi word forms attached to each node (the contact node carries the imperative form); nodes you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, nam, or; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
– Semantic independence: the semantic properties of the relata
– Context independence: the rest of the expression

• The relative word order of various relations sharing a relatum does not depend on:
– Local ordering: the rest of the expression

[Figure: example sub-graphs of contact with relations pur, agt, obj]

Pairwise precedence matrix (L = the row relation's child precedes the column relation's child, R = it follows):

      agt  aoj  obj  ins
agt    -    L    L    L
aoj    -    -    L    L
obj    -    -    -    R
ins    -    -    -    -

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}.

• Topologically sort each group: pur, agt, obj → this, you, farmer.

• Final order: this you farmer contact.

(The same procedure then linearizes the remaining nodes: region, Khatav, Manchar, taluka.)
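The local ordering can be implemented as a topological sort over the pairwise precedence matrix. The sketch below uses Python's graphlib and adds assumed "pur" entries (not shown in the matrix above) so that the running example is fully ordered.

```python
from graphlib import TopologicalSorter

# 'L' means the first relation's child precedes the second relation's child.
ORDER = {('agt', 'aoj'): 'L', ('agt', 'obj'): 'L', ('agt', 'ins'): 'L',
         ('aoj', 'obj'): 'L', ('aoj', 'ins'): 'L', ('obj', 'ins'): 'R',
         ('pur', 'agt'): 'L', ('pur', 'obj'): 'L'}   # pur rows assumed for the example

def order_children(relations):
    """relations: dict relation -> child word. Returns the children in surface order."""
    ts = TopologicalSorter({r: set() for r in relations})
    for (a, b), side in ORDER.items():
        if a in relations and b in relations:
            first, second = (a, b) if side == 'L' else (b, a)
            ts.add(second, first)          # 'second' must come after 'first'
    return [relations[r] for r in ts.static_order()]

rels = {'pur': 'this', 'agt': 'you', 'obj': 'farmer'}
print(order_children(rels) + ['contact'])   # ['this', 'you', 'farmer', 'contact']
```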

Syntax Planning: Algorithm

[Table: step-by-step trace of the planner, showing at each step the stack, the Before/After sets of the current node, and the output, as the example graph (contact, farmer, region, taluka, Manchar, this, you) is linearized]

[Figure: the UNL graph of the running example, repeated for reference]

All Together (UNL → Hindi)

[Table: module-by-module outputs (Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization) for the running example; the Hindi outputs and English glosses are the same as in the step-through slides above]

How to Evaluate UNL Deconversion

• UNL → Hindi.
• A reference translation needs to be generated from the UNL.
– This needs expertise.

• Compromise: generate the reference translation from the original English sentence from which the UNL was generated.
– This works if you assume that the UNL generation was perfect.

• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation?
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                         BLEU   Fluency   Adequacy
Geometric average        0.34     2.54      2.84
Arithmetic average       0.41     2.71      3.00
Standard deviation       0.25     0.89      0.89
Pearson corr. (BLEU)     1.00     0.59      0.50
Pearson corr. (Fluency)  0.59     1.00      0.68

Fluency vs Adequacy

[Figure: number of sentences (0 to 200) at each fluency level 1 to 4, broken down by adequacy levels 1 to 4]

• Good correlation between fluency and BLEU.
• Strong correlation between fluency and adequacy.
• Large-scale evaluation can therefore be done using fluency alone.

Caution: domain diversity, speaker diversity.

Transfer Based MT

Marathi-Hindi

106

[Figure: pyramid of MT strategies, from direct translation through syntactic and semantic transfer up to interlingua-based translation, shown again to situate transfer-based MT]

• Consortium-mode project

• Funded by DeitY

• 11 participating institutes

• Nine language pairs

• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages
– POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
– bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation scale (grade: description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)

1: Some parts make sense, but it is not comprehensible overall (e.g. listening to a language that has lots of words borrowed from your language: you understand those words but nothing more)

2: Comprehensible, but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)

3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)

4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences.

[Figure: module-wise precision and recall (scale 0 to 1.2) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from scores given according to translation quality
– Accuracy: 65.32%

• Result analysis
– The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
– Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories.

Nouns (verb → noun):
– वाच vaach (read) → वाचणे vaachaNe (reading)
– उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (verb → adjective):
– चाव chav (bite) → चावणारा chaavaNaara (one who bites)
– खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (verb → adverb):
– पळ paL (run) → पळताना paLataanaa (while running)
– बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta type | Example (with word-by-word gloss) | Aspect

"Ne"-kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] | Perfective

"laa"-kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] | Perfective

"Taanaa"-kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] | Durative

"Lela"-kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] | Perfective

"Un"-kridanta | pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] | Completive

"Nara"-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] | Stative

"ve"-kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] | Inceptive

"taa"-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

[Figure: precision and recall (0.86 to 0.98) of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins.
• Relatively easier problem to solve.
• Will interlingua be better?
• Web sentences are being used to test the performance.
• Rule governed.
• Needs a high level of linguistic expertise.
• Will be an important contribution to IL MT.

130

131

Summary interlingua based MT

• English to Hindi.
• Rule governed.
• High level of linguistic expertise needed.
• Takes a long time to build (since 1996).
• But produces great insight, resources and tools.



translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

10

Language Divergence exists even for close cousins

(Marathi-Hindi-English case marking and postpositions do not transfer easily)

• सनिनित भत (immediate past)
  – कधी आलास यत इतकच (M)
  – कब आय बस अभ आय (H)
  – When did you come? Just now (I came). (E)
  – kakhan ele? ei elaam/eshchhi (B)

• निसशय भनिषय (certainty in future)
  – आत त मार खात खास (M)
  – अब मार खायगा (H)
  – He is in for a thrashing. (E)
  – ekhan o maar khaabei/khaachhei (B)

• आशवास (assurance)
  – मा तमला उदया भटत (M)
  – माamp आप स कला मिमालात ह (H)
  – I will see you tomorrow. (E)
  – aami kaal tomaar saathe dekhaa korbo/korchhi (B)

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formalisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caelis sanctificetur nomen tuum 2 Adveniat regnum tuum 3 Fiat voluntas tua sicut in caelo et in terra Panem nostrum 4 quotidianum da nobis hodie 5 et dimitte nobis debita nostra sicut et nos dimittimus debitoribus nostris 6 Et ne nos inducas in tentationem sed libera nos a malo Amen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

• Control languages for "Interlingua": French, Italian, Spanish, Portuguese, English, German and Russian
• Natural words in "Interlingua"
• "Manufactured" words in Esperanto, using heavy agglutination
• Esperanto word for "hospital": mal·san·ul·ej·o = mal (opposite) + san (health) + ul (person) + ej (place) + o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

• Spanish
  – La botella salió de la cueva flotando
  – The bottle exited the cave floatingly
• Japanese
  – Bin-ga dookutsu-kara nagare-te de-ta
  – bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT: EnConversion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V,VOA) <E00>
[mail] "mail(icl>message)" (N,PHSCL,INANI) <E00>
[minister] "minister(icl>person)" (N,ANIMT,PHSCL,PRSN) <E00>

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] "he" (PRON,SUB,SING,3RD) <E00>
[the] "the" (ART,THE) <E00>
[to] "to" (PRE,TO) <E00>

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve center is the verb
• English sentences
  – A <verb> B, or
  – A <verb>

verb

BA

A

verb

rel relrel

Recurse on A and B
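A minimal Python sketch of the idea on this slide, with a toy tree representation (the node format below is an illustrative assumption, not the enconverter's actual data structure): relations are emitted around the verb, and the procedure recurses on its arguments A and B. The John/rice/spoon example is the one used earlier for UNL.

```python
# Toy sketch: build UNL-style relation triples by recursing on a verb's arguments.
def build_relations(node, relations=None):
    """node: {'word': str, 'args': [(relation_label, child_node), ...]}"""
    if relations is None:
        relations = []
    for rel, child in node.get("args", []):
        relations.append((rel, node["word"], child["word"]))
        build_relations(child, relations)   # recurse on A and B
    return relations

sentence = {"word": "eat", "args": [
    ("agt", {"word": "John"}),
    ("obj", {"word": "rice"}),
    ("ins", {"word": "spoon"}),
]}
print(build_relations(sentence))
# [('agt', 'eat', 'John'), ('obj', 'eat', 'rice'), ('ins', 'eat', 'spoon')]
```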

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem:
As the length (number of words) of a sentence increases, the XLE parser fails to capture the dependency relations precisely. The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
  – Sentences can only be broken at the clausal level: Simplification
  – Once broken, they need to be joined: Merging
• Reduce the reliance of the Enconverter on the XLE parser
  – The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

• Example
  – Sentence: John went to play and Jane went to school
  – Simplified: John went to play. Jane went to school.
• The new simplifier is developed on the clause markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

• Example
  – Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  – Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them
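A small sketch of the merging idea above, assuming UNL relations are kept as plain strings; the real Merger applies case-specific rules depending on how the sentence was simplified, so this only shows the coordination case from the example.

```python
# Sketch: merge two simple-sentence UNLs by scoping the second as :01 and
# linking the two entry verbs with a connective relation (here "and").
def merge_unl(unl1, unl2, connective="and"):
    scoped = [rel.replace("(", ":01(", 1) for rel in unl2]    # agt(...) -> agt:01(...)
    entry1 = unl1[0].split("(")[1].split(",")[0].split(".")[0]  # head word of first UNL
    link = f"{connective}({entry1}, :01)"
    return unl1 + [link] + scoped

u1 = ["agt(go.@entry, John)", "pur(go.@entry, play)"]
u2 = ["agt(go.@entry, Mary)", "gol(go.@entry, school)"]
for rel in merge_unl(u1, u2):
    print(rel)
# agt(go.@entry, John), pur(go.@entry, play), and(go, :01),
# agt:01(go.@entry, Mary), gol:01(go.@entry, school)
```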

NLP tools and resources for UNL generation

Tools

• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resource

• Wordnet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing

• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing

• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation
  – Noun features
  – Verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

POS tags

nsubj(eating-3, Will-Smith-1)
aux(eating-3, was-2)
det(apple-5, an-4)
dobj(eating-3, apple-5)
prep(eating-3, with-6)
det(fork-8, a-7)
pobj(with-6, fork-8)
prep(eating-3, on-9)
det(bank-11, the-10)
pobj(on-9, bank-11)
prep(bank-11, of-12)
det(river-14, the-13)
pobj(of-12, river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

'OBJ(eat8, apple16), SUBJ(eat8, will-smith0), ADJUNCT(apple16, on20), 'OBJ(on20, bank22), ADJUNCT(apple16, with17), OBJ(with17, fork19), ADJUNCT(bank22, of23), 'OBJ(of23, river25)

Dependency Relations

PASSIVE(eat8, -), PROG(eat8, +), TENSE(eat8, past), VTYPE(eat8, main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput
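The slides only say that Wordnet is used to find stems; one convenient way to do this in Python is NLTK's wordnet.morphy, shown below as a hedged sketch (the use of NLTK here is an assumption, not necessarily the tool used in the system).

```python
# Sketch: find the stem (base form) of a word with WordNet's morphy.
from nltk.corpus import wordnet as wn

def stem(word, pos):
    base = wn.morphy(word.lower(), pos)
    return base if base is not None else word.lower()

for word, pos in [("was", wn.VERB), ("eating", wn.VERB),
                  ("apple", wn.NOUN), ("river", wn.NOUN)]:
    print(word, "->", stem(word, pos))
# was -> be, eating -> eat, apple -> apple, river -> river
```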

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word | Universal Word
Will-Smith | Will-Smith(iof>PERSON)
eating | eat(icl>consume, equ>consumption)
apple | apple(icl>edible fruit>thing)
fork | fork(icl>cutlery>thing)
bank | bank(icl>slope>thing)
river | river(icl>stream>thing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym path of the noun to its root. If any word on the path is "action", then the noun has originated from a verb.

Examples:
removal of garbage = removing garbage
collection of stamps = collecting stamps
wastage of food = wasting food
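A hedged Python sketch of the algorithm just described, using NLTK's WordNet interface (an assumption for illustration): walk each hypernym path of the noun and report whether "act"/"action" appears on the way to the root.

```python
# Sketch: detect nouns that originate from verbs via WordNet hypernym paths.
from nltk.corpus import wordnet as wn

def derived_from_verb(noun):
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():           # root ... -> synset
            for node in path:
                if any(name in ("act", "action") for name in node.lemma_names()):
                    return True
    return False

for noun in ["removal", "collection", "wastage", "fork"]:
    print(noun, derived_from_verb(noun))
```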

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

• Add the word's NE type to the word's features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm
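A sketch of noun-feature generation as described above: start from the NE type and add a feature whenever a keyword appears on a WordNet hypernym path. The keyword-to-feature map below is an illustrative assumption, not the system's actual list.

```python
# Sketch: noun features from NE type plus hypernym keywords.
from nltk.corpus import wordnet as wn

KEYWORD_FEATURES = {"person": "ANIMATE", "location": "PLACE",
                    "artifact": "CONCRETE", "abstraction": "ABSTRACT"}

def noun_features(word, ne_type=None):
    features = {ne_type} if ne_type else set()
    for synset in wn.synsets(word, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for node in path:
                for lemma in node.lemma_names():
                    if lemma in KEYWORD_FEATURES:
                        features.add(KEYWORD_FEATURES[lemma])
    return features

print(noun_features("fork"))                   # features from hypernyms, e.g. CONCRETE
print(noun_features("Will-Smith", "PERSON"))   # NE type carried over
```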

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

• Heuristic: if a frame is of the form "Something/somebody ...s something/somebody", then the verb is of the "do" type
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do"

Word | Type | Feature
was | not-do | -
eating | do | do
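A small sketch of the "do"-type heuristic above, applied to WordNet-style sentence frames such as "Somebody ----s something". The frames are passed in explicitly so the sketch stays independent of any particular WordNet API.

```python
# Sketch: mark a verb as "do" type if one of its sentence frames has both a
# subject and an object slot of the form Somebody/Something ... somebody/something.
import re

DO_FRAME = re.compile(r"^(Somebody|Something) [-]+s (somebody|something)", re.IGNORECASE)

def is_do_verb(frames):
    return any(DO_FRAME.match(frame) for frame in frames)

print(is_do_verb(["Somebody ----s something", "Somebody ----s"]))   # True  (volitional)
print(is_do_verb(["It is ----ing", "Something ----s"]))             # False (not-do)
```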

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

• Rules are applied on
  – Stanford Dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (22)

Conditions | Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is do-type and active | agt(eat, Will-Smith)
dobj(eating, apple); eat is active | obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE | ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE | plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON | mod(bank, river)

Will-Smith was eating an apple with a fork on the bank of the riverInput
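A condensed Python sketch of the rule table above: dependency relations plus noun/verb features are mapped to UNL relations. Only a few rules are shown, and the encoding (folding prep+pobj into a single labelled edge) is a simplification assumed for illustration.

```python
# Sketch: rule-based UNL relation generation from dependencies and features.
def generate_relations(deps, feats, do_verbs):
    rels = []
    for name, head, dep in deps:
        if name == "nsubj" and "ANIMATE" in feats.get(dep, set()) and head in do_verbs:
            rels.append(("agt", head, dep))
        elif name == "dobj":
            rels.append(("obj", head, dep))
        elif name == "pobj_with" and "CONCRETE" in feats.get(dep, set()):
            rels.append(("ins", head, dep))
        elif name == "pobj_on" and "PLACE" in feats.get(dep, set()):
            rels.append(("plc", head, dep))
        elif name == "pobj_of":
            rels.append(("mod", head, dep))
    return rels

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("pobj_with", "eat", "fork"), ("pobj_on", "eat", "bank"),
        ("pobj_of", "bank", "river")]
feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
print(generate_relations(deps, feats, do_verbs={"eat"}))
```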

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

• Rules are applied on
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (22)

Conditions | Attributes generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) | eat.@entry.@progress.@past
det(apple, an) | apple.@indef
det(fork, a) | fork.@indef
det(bank, the) | bank.@def
det(river, the) | river.@def

Will-Smith was eating an apple with a fork on the bank of the riverInput
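A companion sketch for attribute generation: tense/aspect information from the parser yields verb attributes, and determiners yield @def/@indef. The input format is an assumption chosen to mirror the table above.

```python
# Sketch: rule-based UNL attribute generation.
def generate_attributes(verb_info, determiners):
    attrs = {}
    verb = verb_info["verb"]
    attrs[verb] = ["@entry"]                     # main-clause verb is the entry node
    if verb_info.get("PROG") == "+":
        attrs[verb].append("@progress")
    if verb_info.get("TENSE") == "past":
        attrs[verb].append("@past")
    for noun, det in determiners.items():        # det(apple, an), det(bank, the), ...
        attrs[noun] = ["@indef" if det in ("a", "an") else "@def"]
    return attrs

print(generate_attributes(
    {"verb": "eat", "PROG": "+", "TENSE": "past", "VTYPE": "main"},
    {"apple": "an", "fork": "a", "bank": "the", "river": "the"}))
# {'eat': ['@entry', '@progress', '@past'], 'apple': ['@indef'], ...}
```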

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus: 7000 sentences
• Evaluated on 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger
• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched in UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
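A minimal sketch of how the matching criterion above turns into precision and recall over relation triples: a generated triple counts as correct only if the relation name and both UWs match a gold triple.

```python
# Sketch: precision/recall over UNL relation triples.
def precision_recall(generated, gold):
    generated, gold = set(generated), set(gold)
    correct = len(generated & gold)
    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

gen  = [("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("mod", "bank", "river")]
gold = [("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("ins", "eat", "fork")]
print(precision_recall(gen, gold))   # (0.666..., 0.666...)
```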

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy for Scenarios 1-4; y-axis 0 to 0.5]

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

[Chart: time taken by Scenarios 1-4 to process the corpus; y-axis 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: wrong relations generated (false positives) per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Missed Relations

[Chart: missed relations (false negatives) per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Case Study (13)

• Conjunctive relations
  – Wrong apposition detection
    Example: James loves eating red water-melon, apples peeled nicely and bananas
    amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (23)
• Conjunctive relations
  – Non-uniform relation generation
    Example: George the president of the football club, James the coach of the club, and the players went to watch their rival's game
    appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (33)
• Low-precision frequent relations
  – mod and qua: more general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibilities

Example: The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topo-sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact
(other nodes in the example graph: region, khatav, manchar taluka)
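A hedged Python sketch of this planning step (see also the algorithm trace below): relations of a node are split into before/after groups using a left/right table, each group is ordered by a fixed relation priority, and children are emitted around the head. The SIDE and ORDER tables below are assumptions chosen to reproduce the example; the real system derives them from its relation-ordering matrix.

```python
# Sketch: syntax planning by before/after partition plus ordering within groups.
SIDE = {"agt": "L", "aoj": "L", "obj": "L", "ins": "L", "pur": "L"}   # L = before head
ORDER = ["pur", "agt", "obj", "ins"]                                  # order within a group

def linearize(node):
    """node: (word, [(relation, child_node), ...])"""
    word, children = node
    before = sorted([c for c in children if SIDE.get(c[0], "R") == "L"],
                    key=lambda c: ORDER.index(c[0]))
    after = [c for c in children if SIDE.get(c[0], "R") == "R"]
    words = [w for _, child in before for w in linearize(child)]
    words.append(word)
    words += [w for _, child in after for w in linearize(child)]
    return words

contact = ("contact", [("agt", ("you", [])), ("obj", ("farmer", [])), ("pur", ("this", []))])
print(" ".join(linearize(contact)))   # this you farmer contact
```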

Syntax Planning Algo
[Table: step-by-step trace of the stack-based syntax planning algorithm on the example, with columns Stack, BeforeCurrent, AfterCurrent and Output, as the nodes contact, farmer, region, taluka, manchar, this and you are linearized]
[UNL graph of the example: the nodes contact, you, this, farmer, region, taluka, manchar and khatav, connected by the relations agt, obj, pur, plc, or and nam]

All Together (UNL -> Hindi)

Module | Output

UNL Expression: obj(contact(…

Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट
(contact farmer this you region taluka manchar khatav)

Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट
(contact farmer this you region taluka manchar khatav)

Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
(contact imperative farmer-pl this you region taluka manchar khatav)

Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
(contact farmers this-for you region or taluka of Manchar Khatav)

Syntax Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
(this-for you Manchar region or Khatav taluka of farmers contact)

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                        BLEU   Fluency   Adequacy
Geometric Average       0.34   2.54      2.84
Arithmetic Average      0.41   2.71      3.00
Standard Deviation      0.25   0.89      0.89
Pearson Cor. (BLEU)     1.00   0.59      0.50
Pearson Cor. (Fluency)  0.59   1.00      0.68
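A small sketch of how the BLEU-fluency correlation above can be computed from per-sentence scores; the score lists here are made-up illustrations, not the actual evaluation data.

```python
# Sketch: Pearson correlation between per-sentence BLEU and fluency scores.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

bleu    = [0.12, 0.34, 0.55, 0.41, 0.28]   # hypothetical per-sentence BLEU
fluency = [1, 3, 4, 3, 2]                  # hypothetical per-sentence fluency (1-4)
print(round(pearson(bleu, fluency), 2))
```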

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semantic & predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g., listening to a language which has lots of borrowed words from your language; you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g., someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3: Comprehensible, occasional errors (e.g., someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g., someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

[Chart: module-wise precision and recall (0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, and Word Generator]

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation: morphology processing (kridanta)

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun)
  वाच vaach (read) → वाचणे vaachaNe (reading)
  उतर utar (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
  चाव chaav (bite) → चावणारा chaavaNaara (one who bites)
  खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta Type | Example | Word-for-word gloss | Aspect
Ne-Kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) | for-reading book give | Perfective
laa-Kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) | article after-reading will-tell | Perfective
Taanaa-Kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) | book while-reading it in-mind came | Durative
Lela-Kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) | yesterday read book give | Perfective
Un-Kridanta | pustak vaachun parat kar (Return the book after reading it) | book after-reading back do | Completive
Nara-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) | books to-the-one-who-reads knowledge gets | Stative
ve-Kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) | this book everyone should-read | Inceptive
taa-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) | he book while-reading to-sleep went | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
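A toy Python sketch of the morphotactics idea behind the figure above: recognize a kridanta form by stripping the longest matching participial suffix (romanized here). The suffix list and stem lexicon are illustrative assumptions, not the actual FSM used in the paper.

```python
# Sketch: kridanta analysis by longest-suffix stripping over romanized forms.
SUFFIXES = {  # suffix -> kridanta type
    "Nyaasaathee": "Ne", "lyaavar": "laa", "taanaa": "Taanaa", "lele": "Lela",
    "un": "Un", "NaaRyaalaa": "Nara", "aave": "ve", "taa": "taa",
}

def analyse(form, stems=("vaach", "paL", "bas", "khaa")):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):   # longest match first
        if form.endswith(suffix):
            stem = form[: -len(suffix)]
            if stem in stems:
                return stem, SUFFIXES[suffix]
    return form, None

for form in ["vaachtaanaa", "vaachlele", "vaachun", "vaachtaa"]:
    print(form, "->", analyse(form))
```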

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall (0.86 to 0.98) of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon, and Va kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources, and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

11

Interlingua Based MT

12

KBMT

bull Nirenberg Knowledge Based Machine Translation Machine Translation 1989

bull Forerunner of many interlingua based MT efforts

13

IL based MT schematic

14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita

Many publications

System Architecture

(Diagram: the simple-sentence enconverter combines NER, the Stanford Dependency Parser, WSD and the XLE Parser with feature, relation and attribute generation. Long sentences pass through a Clause Marker and a Simplifier, each simple sentence goes through its own enconverter, and a Merger combines the resulting UNLs.)

Complexity of handling long sentences

Problem: As the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
– Sentences can only be broken at the clausal level: Simplification
– Once broken, they need to be joined: Merging
• Reduce the reliance of the Enconverter on the XLE parser
– The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example:
– Sentence: John went to play and Jane went to school.
– Simplified: John went to play. Jane went to school.
• The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL
• Example:
– Input:
• agt(go.@entry, John), pur(go.@entry, play)
• agt(go.@entry, Mary), gol(go.@entry, school)
– Output:
• agt(go.@entry, John), pur(go.@entry, play)
• and(go, :01)
• agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases based on how the sentence was simplified; the Merger uses rules for each of the cases to merge them
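A toy sketch of the merging step for the conjunction case shown above; the tuple representation and the way the entry node is picked are assumptions of this sketch, not the system's data structures:

def merge_conjoined(unl_a, unl_b, conj="and"):
    """Merge two simple-sentence UNLs produced by the simplifier, following
    the pattern on this slide: keep the first UNL as the main graph, push the
    second into scope :01, and link the two entry nodes with the conjunction."""
    def entry_of(unl):
        # the first relation's first argument is taken as the entry node here
        return unl[0][1]
    merged = list(unl_a)
    merged.append((conj, entry_of(unl_a), ":01"))
    merged += [(rel + ":01", a, b) for (rel, a, b) in unl_b]
    return merged

unl_john = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
unl_mary = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
for triple in merge_conjoined(unl_john, unl_mary):
    print(triple)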

NLP tools and resources for UNL generation

Tools

• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources

• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing

• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing

• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation
• Noun features
• Verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types
• Types covered:
– Person
– Place
– Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

• The Stanford Dependency Parser parses the sentence and produces
– POS tags of the words in the sentence
– a dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

POS tags

nsubj(eating-3, Will-Smith-1)
aux(eating-3, was-2)
det(apple-5, an-4)
dobj(eating-3, apple-5)
prep(eating-3, with-6)
det(fork-8, a-7)
pobj(with-6, fork-8)
prep(eating-3, on-9)
det(bank-11, the-10)
pobj(on-9, bank-11)
prep(bank-11, of-12)
det(river-14, the-13)
pobj(of-12, river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

• The XLE parser generates dependency relations and some other important information

Dependency relations:
OBJ(eat:8, apple:16)
SUBJ(eat:8, will-smith:0)
ADJUNCT(apple:16, on:20)
OBJ(on:20, bank:22)
ADJUNCT(apple:16, with:17)
OBJ(with:17, fork:19)
ADJUNCT(bank:22, of:23)
OBJ(of:23, river:25)

Important information:
PASSIVE(eat:8, -)
PROG(eat:8, +)
TENSE(eat:8, past)
VTYPE(eat:8, main)

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

• WordNet is used for finding the stems of the words

Word – Stem
was – be
eating – eat
apple – apple
fork – fork
bank – bank
river – river

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing: WSD

Word – Synset-id – Sense-id
Will-Smith – – –
was – 02610777 – be24203
eating – 01170802 – eat23400
an – – –
apple – 07755101 – apple11300
with – – –
a – – –
fork – 03388794 – fork10600
on – – –
the – – –
bank – 09236472 – bank11701
of – – –
the – – –
river – 09434308 – river11700

Word – POS
Will-Smith – Proper noun
was – Verb
eating – Verb
an – Determiner
apple – Noun
with – Preposition
a – Determiner
fork – Noun
on – Preposition
the – Determiner
bank – Noun
of – Preposition
the – Determiner
river – Noun

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word – Universal Word
Will-Smith – Will-Smith(iof>PERSON)
eating – eat(icl>consume, equ>consumption)
apple – apple(icl>edible fruit>thing)
fork – fork(icl>cutlery>thing)
bank – bank(icl>slope>thing)
river – river(icl>stream>thing)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs
• The frame "noun1 of noun2", where noun1 has originated from verb1, should generate the relation obj(verb1, noun2) instead of any relation between noun1 and noun2

Algorithm:
– Traverse the hypernym path of the noun to its root
– If any word on the path is "action", then the noun has originated from a verb

Examples:
– removal of garbage = removing garbage
– Collection of stamps = collecting stamps
– Wastage of food = wasting food
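A rough rendering of this hypernym-path test with NLTK's WordNet interface; the keyword list that decides what counts as an "action" node is an assumption to be tuned against the real rule:

# Requires the NLTK WordNet data (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def is_deverbal(noun, keywords=("act", "action", "activity")):
    """True if some hypernym path of the noun passes through an 'action' node."""
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for hyper in path:
                if any(k in hyper.lemma_names() for k in keywords):
                    return True
    return False

for w in ["removal", "collection", "wastage", "spoon"]:
    print(w, is_deverbal(w))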

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into Noun and Verb features

Relation – Word – Property
agt(uw1, uw2) – uw1 – Volitional verb
met(uw1, uw2) – uw2 – Abstract thing
dur(uw1, uw2) – uw2 – Time period
ins(uw1, uw2) – uw2 – Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add word.NE_type to word.features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun – Features
Will-Smith – PERSON, ANIMATE
bank – PLACE
fork – CONCRETE
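The noun-feature step can be sketched the same way; the keyword-to-feature map below is illustrative and much smaller than the system's actual table:

from nltk.corpus import wordnet as wn

# Illustrative keyword -> feature map (keywords are matched against the lemma
# names of the synsets on the noun's hypernym paths).
KEYWORD_FEATURES = {"person": "ANIMATE", "animal": "ANIMATE",
                    "location": "PLACE", "artifact": "CONCRETE",
                    "physical_entity": "CONCRETE", "abstraction": "ABSTRACT"}

def noun_features(noun, ne_type=None):
    feats = {ne_type} if ne_type else set()
    if ne_type == "PERSON":
        feats.add("ANIMATE")               # NE type contributes a feature
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for hyper in path:
                for kw, feat in KEYWORD_FEATURES.items():
                    if kw in hyper.lemma_names():
                        feats.add(feat)
    return feats

print(noun_features("fork"))                          # includes CONCRETE (artifact on a path)
print(noun_features("Will-Smith", ne_type="PERSON"))  # no WordNet entry: {'PERSON', 'ANIMATE'}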

Types of verbs

• Verb features are based on the types of verbs
• There are three types:
– Do: verbs that are performed volitionally
– Occur: verbs that are performed involuntarily or just happen without any performer
– Be: verbs which tell about the existence of something or about some fact
• We classify these three into do and not-do

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not
• "do" verbs are volitional verbs which are performed voluntarily
• Heuristic:
– If a frame is of the form "Something/somebody …s something/somebody", then the verb is of "do" type
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do"

Word – Type – Feature
was – not-do – -
eating – do – do
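A small sketch of the frame heuristic; WordNet verb frames are passed in as plain strings (in their usual "Somebody ----s something" shape) so the sketch does not depend on a particular WordNet API:

import re

def is_do_verb(frames):
    """True if some frame matches 'Somebody/Something ----s somebody/something'."""
    pattern = re.compile(r"^some(body|thing) ----s some(body|thing)", re.IGNORECASE)
    return any(pattern.match(f) for f in frames)

print(is_do_verb(["Somebody ----s something", "Somebody ----s"]))   # eat-like frames -> True ("do")
print(is_do_verb(["It is ----ing"]))                                # rain-like frames -> False ("not-do")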

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations handled separately
• Similar relations handled in clusters
• Rule-based approach
• Rules are applied on:
– Stanford Dependency relations
– Noun features
– Verb features
– POS tags
– XLE-generated information

Relation Generation (22)

Conditions → Relation generated

nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is do-type and active
→ agt(eat, Will-Smith)

dobj(eating, apple); eat is active
→ obj(eat, apple)

prep(eating, with), pobj(with, fork); fork.feature = CONCRETE
→ ins(eat, fork)

prep(eating, on), pobj(on, bank); bank.feature = PLACE
→ plc(eat, bank)

prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON
→ mod(bank, river)

Input: Will-Smith was eating an apple with a fork on the bank of the river
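A minimal sketch of two of the rules in the table above; the data structures and rule conditions are simplified, and the real system has many more rules and also consults XLE information:

def generate_relations(deps, feats, verb_type, voice="active"):
    """deps: Stanford-style triples (label, head, dependent);
    feats: noun -> set of features; verb_type: verb -> 'do'/'not-do'."""
    unl = []
    pobj = {h: d for (lbl, h, d) in deps if lbl == "pobj"}   # prep word -> its object
    for lbl, head, dep in deps:
        if lbl == "nsubj" and "ANIMATE" in feats.get(dep, set()) \
                and verb_type.get(head) == "do" and voice == "active":
            unl.append(("agt", head, dep))
        elif lbl == "dobj" and voice == "active":
            unl.append(("obj", head, dep))
        elif lbl == "prep" and dep == "with" \
                and "CONCRETE" in feats.get(pobj.get(dep, ""), set()):
            unl.append(("ins", head, pobj[dep]))
    return unl

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork")]
feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}}
print(generate_relations(deps, feats, {"eat": "do"}))
# -> [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork')]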

Attribute Generation (12)

• Attributes handled in clusters
• Attributes are generated by rules
• Rules are applied on:
– Stanford dependency relations
– XLE information
– POS tags

Attribute Generation (22)

Conditions → Attributes generated

eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past)
→ eat.@entry.@progress.@past

det(apple, an) → apple.@indef
det(fork, a) → fork.@indef
det(bank, the) → bank.@def
det(river, the) → river.@def

Input: Will-Smith was eating an apple with a fork on the bank of the river
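A corresponding sketch for the determiner and tense/aspect rules above; attribute names use the restored "@" notation, and the rule set is illustrative:

def generate_attributes(deps, xle_info, main_verb):
    """deps: Stanford triples; xle_info: features for the main verb from XLE."""
    attrs = {}                                      # word -> set of attributes
    add = lambda w, a: attrs.setdefault(w, set()).add(a)
    for lbl, head, dep in deps:
        if lbl == "det":
            add(head, "@indef" if dep in ("a", "an") else "@def")
    add(main_verb, "@entry")
    if xle_info.get("TENSE") == "past":
        add(main_verb, "@past")
    if xle_info.get("PROG") == "+":
        add(main_verb, "@progress")
    return attrs

deps = [("det", "apple", "an"), ("det", "fork", "a"),
        ("det", "bank", "the"), ("det", "river", "the")]
print(generate_attributes(deps, {"TENSE": "past", "PROG": "+"}, "eat"))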

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
– 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios:
– Scenario 1: previous system (tightly coupled with the XLE parser)
– Scenario 2: current system (tightly coupled with the Stanford parser)
– Scenario 3: Scenario 2 minus (Simplifier + Merger)
– Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger
• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if
– relation_gen = relation_gold
– UW1_gen = UW1_gold
– UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the attributes matched between UW_gen and UW_gold
• count(UW_x) = the attributes in UW_x
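Under these definitions, relation-level scores and attribute accuracy can be computed as below; treating them as set-based precision/recall over triples and as a ratio of matched to gold attributes is an assumption about how the numbers on the following slides were obtained:

def relation_prf(generated, gold):
    """Precision/recall/F over UNL relation triples, using t_gen = t_gold matching."""
    gen, gld = set(generated), set(gold)
    tp = len(gen & gld)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(gld) if gld else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f

def attribute_accuracy(pairs):
    """pairs: list of (|amatch(UW_gen, UW_gold)|, count(UW_gold)) per matched UW."""
    matched = sum(m for m, _ in pairs)
    total = sum(c for _, c in pairs)
    return matched / total if total else 0.0

gen = [("agt", "eat", "Will-Smith"), ("ins", "eat", "fork")]
gold = [("agt", "eat", "Will-Smith"), ("plc", "eat", "bank")]
print(relation_prf(gen, gold))      # (0.5, 0.5, 0.5)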

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results: Attribute

(Chart: attribute accuracy for Scenarios 1–4; y-axis from 0 to 0.5.)

Results: Time taken

• Time taken for the systems to evaluate the 7000 sentences from the EOLSS corpus

(Chart: time taken for Scenarios 1–4; y-axis from 0 to 40.)

ERROR ANALYSIS

Wrong Relations Generated

(Chart: false positives per relation — obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or — and overall.)

Missed Relations

(Chart: false negatives per relation — obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or — and overall.)

Case Study (1/3)

• Conjunctive relations
– Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
– Non-uniform relation generation
• George the president of the football club, James the coach of the club, and the players went to watch their rival's game

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
– mod and qua: more general rules
– agt–aoj: non-uniformity in the corpus
– gol–obj: multiple relation generation possibilities

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module → Output (UNL expression: obj(contact(…)

Lexeme Selection:
सपक+ निकस य आप कषतर तलक मा चर खाट
(contact farmer this you region taluka manchar khatav)

Case Identification:
सपक+ निकस य आप कषतर तलक मा चर खाट
(contact farmer this you region taluka manchar khatav)

Morphology Generation:
सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
(contact-imperative farmer-pl this you region taluka manchar khatav)

Function Word Insertion:
सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
(contact farmers this-for you region or taluka-of Manchar Khatav)

Linearization:
इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
(this-for you Manchar region or Khatav taluka-of farmers contact)

(Graph: contact is the entry node, with agt → you, obj → farmer, pur → this and plc → region; the region node is linked by nam and or to Manchar and to the Khatav taluka under scope :01.)

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous

(Graph: agt(contact/सपक+, you/आप), obj(contact, farmer/निकस), pur(contact, this/य).)

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation – conditions (Parent+, Parent−, Child+, Child−)
Obj – V, VINT, N
Agt – V, past, N

(Graph: the same example, with agt, obj and pur links from contact.)

Morphology Generation Nouns

Suffix – Attribute values
uoM – N, NUM pl, oblique
U – N, NUM sg, oblique
I – N, NIF sg, oblique
iyoM – N, NIF pl, oblique
oM – N, NANOTCHF pl, oblique

Examples:
The boy saw me – लड़क माझ दखा
Boys saw me – लड़क माझ दखा
The King saw me – र2 माझ दखा
Kings saw me – र2ऒ माझ दखा

Verb Morphology

Suffix – Tense – Aspect – Mood – N – Gen – P – VE
-e rahaa thaa – past – progress – - – sg – male – 3rd – e
-taa hai – present – custom – - – sg – male – 3rd – -
-iyaa thaa – past – complete – - – sg – male – 3rd – I
saktii hain – present – - – ability – pl – female – 3rd – A

After Morphology Generation

(Graphs: before morphology — contact/सपक+ with agt → you/आप, obj → farmer/निकस, pur → this/य; after morphology — contact/सपक+ क0जि2ए with agt → you/आप, obj → farmer/निकस4, pur → this/इस.)

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

(Graph: after insertion — contact/सपक+ क0जि2ए with agt → you/आप, obj → farmer/निकस4 क, pur → this/इसक लिलए.)

Rel – Par+ – Par− – Chi+ – Chi− – Child FW
Obj – V – VINT – N, ANIMT – topic – क

Linearization

इस आप माचर कषतर – This you manchar region
खाट तलक निकस4 – khatav taluka farmers
सपक+ – contact

(Graph: the UNL graph of the running example, with contact/सपक+ क0जि2ए as the entry node and nam/or links to the region, Manchar, Khatav and taluka nodes under scope :01.)

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
– Semantic Independence: the semantic properties of the relata
– Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
– Local Ordering: the rest of the expression

Pairwise ordering matrix:
       agt  aoj  obj  ins
Agt     ·    L    L    L
Aoj     ·    ·    L    L
Obj     ·    ·    ·    R
Ins     ·    ·    ·    ·

(Graphs: the pur(contact, this) sub-relation and the full agt/obj/pur graph of the running example.)

Syntax Planning Strategy

• Divide a node's relations into
Before_set = {obj, pur, agt}
After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

(Diagram: the same graph, with the region, Khatav, Manchar and taluka nodes hanging off contact.)
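A sketch of this strategy; the before/after partition and the priority order among the "before" relations are taken from the example on this slide, not from the full ordering tables:

# Illustrative before/after sets and priorities (lower number = placed earlier).
BEFORE = {"pur": 0, "plc": 1, "agt": 2, "obj": 3}
AFTER = {}                                          # empty in this example

def linearize(node, relations):
    """relations: list of (rel, head, child); returns the surface word order."""
    children = [(rel, c) for (rel, h, c) in relations if h == node]
    before = sorted((rc for rc in children if rc[0] in BEFORE),
                    key=lambda rc: BEFORE[rc[0]])
    after = sorted((rc for rc in children if rc[0] in AFTER),
                   key=lambda rc: AFTER[rc[0]])
    order = []
    for _, child in before:
        order += linearize(child, relations)
    order.append(node)
    for _, child in after:
        order += linearize(child, relations)
    return order

rels = [("agt", "contact", "you"), ("obj", "contact", "farmer"),
        ("pur", "contact", "this")]
print(linearize("contact", rels))   # ['this', 'you', 'farmer', 'contact']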

Syntax Planning Algo

(Worked trace: the algorithm maintains Stack, BeforeCurrent, AfterCurrent and Output columns, expanding contact, farmer, region, taluka and manchar in turn until the full order of the example sentence is produced.)

(Graph: the UNL graph of the running example, as before.)

All Together (UNL → Hindi)

(Module-by-module output for the running example, as in the step-through above: Lexeme Selection and Case Identification yield "contact farmer this you region taluka manchar khatav"; Morphology Generation yields "contact-imperative farmer-pl this you region taluka manchar khatav"; Function Word Insertion yields "contact farmers this-for you region or taluka-of Manchar Khatav"; Syntax Linearization yields "this-for you Manchar region or Khatav taluka-of farmers contact".)

How to Evaluate UNL Deconversion

• UNL → Hindi
• The reference translation needs to be generated from the UNL
– Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
– Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output – Fluency – Adequacy
कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए – 2 – 3
परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए – 3 – 4
माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत – 1 – 1
2णविTक सकरमाण स इसक0 2ड़O परभनित त amp – 4 – 4
इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp – 2 – 3

Results

– BLEU – Fluency – Adequacy
Geometric Average – 0.34 – 2.54 – 2.84
Arithmetic Average – 0.41 – 2.71 – 3.00
Standard Deviation – 0.25 – 0.89 – 0.89
Pearson Cor. (BLEU) – 1.00 – 0.59 – 0.50
Pearson Cor. (Fluency) – 0.59 – 1.00 – 0.68

Fluency vs Adequacy

(Chart: number of sentences at each fluency level 1–4, broken down by adequacy levels 1–4.)

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

(Vauquois diagram: levels of analysis and transfer, from direct translation at the graphemic level up through syntactic and semantic transfer to the interlingual level, as shown earlier in the course.)

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
– POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
– Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade – Description)

0 – Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 – Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2 – Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3 – Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 – Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
– Evaluated on 500 web sentences

(Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator; y-axis from 0 to 1.2.)

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from the scores given for translation quality
– Accuracy: 65.32%
• Result analysis
– Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
– Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation:

Morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun)
– च vaach (read) → चण vaachaNe (reading)
– उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
– च chav (bite) → चणर chaavaNaara (one who bites)
– खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb)
– पळ paL (run) → पळताना paLataanaa (while running)
– बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

Kridanta Type – Example – Aspect

"ण" (Ne-)Kridanta – vaachNyaasaaThee pustak de (For-reading book give: "Give me a book for reading") – Perfective
"ल" (laa-)Kridanta – Lekh vaachalyaavar saaMgen (Article after-reading will-tell: "I will tell you that after reading the article") – Perfective
"त" (Taanaa-)Kridanta – Pustak vaachtaanaa te lakShaat aale (Book while-reading it in-mind came: "I noticed it while reading the book") – Durative
"लल" (Lela-)Kridanta – kaal vaachlele pustak de (Yesterday read book give: "Give me the book that (I/you) read yesterday") – Perfective
"ऊ" (Un-)Kridanta – pustak vaachun parat kar (Book after-reading back do: "Return the book after reading it") – Completive
"णर" (Nara-)Kridanta – pustake vaachNaaRyaalaa dnyaan miLte (Books to-the-one-who-reads knowledge gets: "The one who reads books gets knowledge") – Stative
(ve-)Kridanta – he pustak pratyekaane vaachaave (This book everyone should-read: "Everyone should read this book") – Inceptive
"त" (taa-)Kridanta – to pustak vaachtaa vaachtaa zopee gelaa (He book while-reading to-sleep went: "He fell asleep while reading a book") – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
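A toy suffix-stripping version of such a morphotactics FSM, in Roman transliteration; the suffix spellings and type labels below are illustrative stand-ins for the Devanagari FSM in the paper:

# Longest-match-first list of (suffix, kridanta type) pairs, loosely based on
# the examples in the table above; a real analyser works over Devanagari
# morphotactics with proper sandhi handling.
KRIDANTA_SUFFIXES = [
    ("taanaa", "taanaa-kridanta"), ("lele", "lelaa-kridanta"),
    ("Naaraa", "Nara-kridanta"), ("Nyaa", "Ne-kridanta"),
    ("lyaa", "laa-kridanta"), ("un", "Un-kridanta"),
    ("aave", "ve-kridanta"), ("taa", "taa-kridanta"),
]

def analyse_kridanta(word):
    """Return (stem, kridanta type) if a known participial suffix is found."""
    for suffix, ktype in KRIDANTA_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)], ktype
    return word, None

for w in ["vaachtaanaa", "vaachlele", "vaachun", "vaachNaaraa"]:
    print(w, "->", analyse_kridanta(w))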

Accuracy of Kridanta Processing Direct Evaluation

(Chart: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types; values range from about 0.86 to 0.98.)

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation scale (grade: description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense, but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

[Figure: module-wise precision and recall (0-1 scale) for the morph analyzer, POS tagger, chunker, vibhakti computation, WSD, lexical transfer, and word generator modules]
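Per-module precision and recall of the kind plotted above are obtained by comparing each module's output against a gold-standard annotation. A minimal sketch with made-up data (not the actual 500-sentence test set):

```python
def precision_recall(predicted, gold):
    """Precision and recall of a module's output items against the gold items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall    = tp / len(gold)      if gold      else 0.0
    return precision, recall

# Example: POS-tagger output vs gold for one sentence, as (token_index, tag) pairs.
pred = [(0, "NN"), (1, "VM"), (2, "PSP")]
gold = [(0, "NN"), (1, "VM"), (2, "NST")]
print(precision_recall(pred, gold))   # (0.666..., 0.666...)
```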

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  - Evaluated on 500 web sentences
  - Accuracy calculated from the scores assigned for translation quality
  - Accuracy: 65.32%
• Result analysis
  - The morph analyser, POS tagger, and chunker give more than 90% precision, but the transfer, WSD, and generator modules are below 80%, which degrades MT quality
  - Morph disambiguation, parsing, transfer grammar, and function-word (FW) disambiguation modules are also required to improve accuracy
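The slides do not state the exact formula behind the 65.32% figure. A common convention, assumed here purely for illustration, is to report the average human grade as a percentage of the maximum grade on the 0-4 scale:

```python
def subjective_accuracy(grades, max_grade=4):
    """Average human grade as a percentage of the maximum grade.
    Note: this normalisation is an assumption, not the documented ILMT formula."""
    return 100.0 * sum(grades) / (max_grade * len(grades))

# Example: grades assigned to a handful of output sentences on the 0-4 scale.
print(subjective_accuracy([3, 2, 4, 3, 1]))   # 65.0
```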

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (verb → noun):
• वाच vaach 'read' → वाचणे vaachaNe 'reading'
• उतर utara 'climb down' → उतरण utaraN 'downward slope'

Adjectives (verb → adjective):
• चाव chaav 'bite' → चावणारा chaavaNaaraa 'one who bites'
• खा khaa 'eat' → खाल्लेले khaallele 'something that is eaten'

Kridantas derived from verbs (cont.)

Adverbs (verb → adverb):
• पळ paL 'run' → पळताना paLataanaa 'while running'
• बस bas 'sit' → बसून basun 'after sitting'
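These derivations are regular enough to state as suffix rules. The rule table and function below are a toy illustration over transliterated forms; the inventory is hypothetical and far smaller than the real kridanta system.

```python
# Toy suffix table for a few kridanta forms (transliterated Marathi).
# Each entry: suffix -> (derived POS category, rough meaning template).
KRIDANTA_RULES = {
    "Ne":     ("noun",      "act of STEM-ing"),   # vaach -> vaachaNe 'reading'
    "Naaraa": ("adjective", "one who STEM-s"),    # chaav -> chaavaNaaraa 'one who bites'
    "taanaa": ("adverb",    "while STEM-ing"),    # paL   -> paLataanaa 'while running'
    "un":     ("adverb",    "after STEM-ing"),    # bas   -> basun 'after sitting'
}

def derive(stem: str):
    """List the kridanta forms this toy rule table derives from a verb stem."""
    return [(stem + suffix, category, template.replace("STEM", stem))
            for suffix, (category, template) in KRIDANTA_RULES.items()]

for form, category, meaning in derive("vaach"):
    print(f"{form:>14}  {category:<10} {meaning}")
```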

Kridanta Types

-Ne kridanta (aspect: perfective)
  Example: vaachNyaasaaThee pustak de (Give me a book for reading)
  Gloss: for-reading book give

-laa kridanta (aspect: perfective)
  Example: lekh vaachalyaavar saaMgen (I will tell you that after reading the article)
  Gloss: article after-reading will-tell

-taanaa kridanta (aspect: durative)
  Example: pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)
  Gloss: book while-reading it in-mind came

-lela kridanta (aspect: perfective)
  Example: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday)
  Gloss: yesterday read book give

-un kridanta (aspect: completive)
  Example: pustak vaachun parat kar (Return the book after reading it)
  Gloss: book after-reading back do

-Nara kridanta (aspect: stative)
  Example: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)
  Gloss: books to-the-one-who-reads knowledge gets

-ve kridanta (aspect: inceptive)
  Example: he pustak pratyekaane vaachaave (Everyone should read this book)
  Gloss: this book everyone should-read

-taa kridanta (aspect: stative)
  Example: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)
  Gloss: he book while-reading to-sleep went

Participial Suffixes in Other Agglutinative Languages

Kannada:
  muridiruwaa kombe jennu esee
  (broken to branch throw)
  'Throw away the broken branch'
  Similar to the -lela form frequently used in Marathi.

Telugu:
  ame padutunnappudoo nenoo panichesanoo
  (she singing I work)
  'I worked while she was singing'
  Similar to the -taanaa form frequently used in Marathi.

Turkish:
  hazirlanmis plan
  (prepare-PAST plan)
  'The plan which has been prepared'
  Equivalent to the Marathi -lelaa form.

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
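The morphotactics FSM itself appears only as a figure; as a concrete illustration, the sketch below is a deliberately simplified, hypothetical finite-state recogniser for stem + kridanta-suffix (+ optional case marker) sequences.

```python
# Toy finite-state recogniser for segmented Marathi forms:
# START -> STEM -> KRIDANTA (-> CASE). All inventories are illustrative only.
STEMS    = {"vaach", "paL", "bas", "khaa"}
SUFFIXES = {"Ne", "lele", "taanaa", "un", "Naaraa"}
CASES    = {"laa", "var", "saaThee"}

def accepts(segments):
    """Accept sequences of the shape stem + kridanta suffix (+ case marker)."""
    state = "START"
    for seg in segments:
        if state == "START" and seg in STEMS:
            state = "STEM"
        elif state == "STEM" and seg in SUFFIXES:
            state = "KRIDANTA"
        elif state == "KRIDANTA" and seg in CASES:
            state = "CASE"
        else:
            return False
    return state in {"KRIDANTA", "CASE"}

print(accepts(["vaach", "Ne", "saaThee"]))  # True  (roughly vaachNyaasaaThee 'for reading')
print(accepts(["vaach", "laa"]))            # False (case marker without a kridanta suffix)
```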

Accuracy of Kridanta Processing: Direct Evaluation

[Figure: precision and recall per kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, and Va kridantas), all roughly in the range 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT




14

Specification of KBMT

bull Source languages English and Japanese

bull Target languages English and Japanese bull Translation paradigm Interlingua bull Computational architecture A

distributed coarsely parallel systembull Subworld (domain) of translation

personal computer installation and maintenance manuals

15

Knowledge base of KBMT (12)

bull An ontology (domain model) of about 1500 concepts

bull Analysis lexicons about 800 lexical units of Japanese and about 900 units of English

bull Generation lexicons about 800 lexical units of Japanese and about 900 units of English

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing: WSD

Word        Synset-id   Sense-id
Will-Smith  -           -
was         02610777    be24203
eating      01170802    eat23400
an          -           -
apple       07755101    apple11300
with        -           -
a           -           -
fork        03388794    fork10600
on          -           -
the         -           -
bank        09236472    bank11701
of          -           -
the         -           -
river       09434308    river11700

Word        POS
Will-Smith  Proper noun
was         Verb
eating      Verb
an          Determiner
apple       Noun
with        Preposition
a           Determiner
fork        Noun
on          Preposition
the         Determiner
bank        Noun
of          Preposition
the         Determiner
river       Noun

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++).

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word        Universal Word
Will-Smith  Will-Smith(iof>PERSON)
eating      eat(icl>consume, equ>consumption)
apple       apple(icl>edible fruit>thing)
fork        fork(icl>cutlery>thing)
bank        bank(icl>slope>thing)
river       river(icl>stream>thing)

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs.
• For the frame noun1 of noun2, where noun1 originates from verb1, the system should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2.

Algorithm:
• Traverse the hypernym path of the noun up to its root.
• If any word on the path is "action", then the noun originated from a verb.

Examples:
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
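
A sketch of this hypernym-path test with NLTK's WordNet interface; treating "act"/"action"/"activity" as the trigger keywords is my reading of the slide's rule, so the keyword set is an assumption:

    from nltk.corpus import wordnet as wn

    # A noun is treated as deverbal if some hypernym path from one of its noun
    # senses passes through an "act"-like concept.
    def is_deverbal(noun, keywords=("act", "action", "activity")):
        for synset in wn.synsets(noun, pos=wn.NOUN):
            for path in synset.hypernym_paths():
                for ancestor in path:
                    if any(lemma in keywords for lemma in ancestor.lemma_names()):
                        return True
        return False

    for n in ["removal", "collection", "apple"]:
        print(n, is_deverbal(n))
    # removal True, collection True, apple False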

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied.
• We capture these properties as features.
• We classify the features into noun features and verb features.

Relation        Word   Property
agt(uw1, uw2)   uw1    Volitional verb
met(uw1, uw2)   uw2    Abstract thing
dur(uw1, uw2)   uw2    Time period
ins(uw1, uw2)   uw2    Concrete thing

Semantic Processing: Noun Features

Algorithm:
• Add the word's NE type to its features.
• Traverse the hypernym path of the noun up to its root.
• If any word on the path matches a keyword, the corresponding feature is generated.

Noun        Feature
Will-Smith  PERSON, ANIMATE
bank        PLACE
fork        CONCRETE
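
Continuing the same hypernym-path idea, a hedged sketch of noun-feature generation; the keyword-to-feature map below is a guess at the spirit of the rule, not the system's actual keyword list:

    from nltk.corpus import wordnet as wn

    # Illustrative keyword -> feature map (assumed, not the system's real list).
    KEYWORD_FEATURES = {"person": "ANIMATE", "animal": "ANIMATE",
                        "location": "PLACE", "geological_formation": "PLACE",
                        "artifact": "CONCRETE"}

    def noun_features(noun, ne_type=None):
        features = set()
        if ne_type:                      # feature coming from the NER tag, e.g. PERSON
            features.add(ne_type)
        for synset in wn.synsets(noun, pos=wn.NOUN):
            for path in synset.hypernym_paths():
                for ancestor in path:
                    for lemma in ancestor.lemma_names():
                        if lemma in KEYWORD_FEATURES:
                            features.add(KEYWORD_FEATURES[lemma])
        return features

    print(noun_features("fork"))    # includes 'CONCRETE' via the artifact ancestor
    print(noun_features("bank"))    # includes 'PLACE' via geological_formation
    print(noun_features("Will-Smith", ne_type="PERSON"))   # {'PERSON'}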

Types of verbs

• Verb features are based on the types of verbs.
• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily, or that just happen without any performer
  – Be: verbs which tell about the existence of something or about some fact
• We classify these three into do and not-do.

Semantic Processing: Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not.
• "do" verbs are volitional verbs, which are performed voluntarily.
• Heuristic: if a frame is of the form "Something/somebody …s something/somebody", then the verb is of "do" type.
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do".

Word    Type    Feature
was     not-do  -
eating  do      do
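
WordNet exposes these sentence frames, so the heuristic can be sketched as below; the exact frame patterns accepted as "do" are my assumption:

    import re
    from nltk.corpus import wordnet as wn

    # Heuristic sketch: a verb sense is "do" type if one of its WordNet sentence
    # frames looks like "Somebody/Something <verb>s somebody/something".
    DO_FRAME = re.compile(r"^(Somebody|Something) \S+ (somebody|something)", re.IGNORECASE)

    def is_do_verb(synset):
        for lemma in synset.lemmas():
            for frame in lemma.frame_strings():
                if DO_FRAME.match(frame):
                    return True
        return False

    print(is_do_verb(wn.synset("eat.v.01")))   # True: frame "Somebody eat something"
    print(is_do_verb(wn.synset("be.v.01")))    # False: copular frames do not fit the pattern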

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately.
• Similar relations are handled in clusters.
• Rule-based approach.
• Rules are applied on:
  – Stanford Dependency relations
  – noun features
  – verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → relation generated:
• nsubj(eating, Will-Smith), Will-Smith.feature = ANIMATE, eat is "do" type and active → agt(eat, Will-Smith)
• dobj(eating, apple), eat is active → obj(eat, apple)
• prep(eating, with), pobj(with, fork), fork.feature = CONCRETE → ins(eat, fork)
• prep(eating, on), pobj(on, bank), bank.feature = PLACE → plc(eat, bank)
• prep(bank, of), pobj(of, river), bank.POS = noun, river.feature is not PERSON → mod(bank, river)
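
Pulling the earlier pieces together, here is a toy version of such rules operating on parsed triples and feature sets; the rule bodies mirror the table above, but the data structures are illustrative assumptions:

    # Toy rule application: dependency triples + noun/verb features -> UNL relations.
    def generate_relations(deps, noun_feats, do_verbs):
        rels, prep_obj = [], {}
        for rel, head, dep in deps:
            if rel == "pobj":
                prep_obj[head] = dep                 # remember the object of each preposition
        for rel, head, dep in deps:
            if rel == "nsubj" and "ANIMATE" in noun_feats.get(dep, set()) and head in do_verbs:
                rels.append(("agt", head, dep))
            elif rel == "dobj":
                rels.append(("obj", head, dep))
            elif rel == "prep" and prep_obj.get(dep):
                obj = prep_obj[dep]
                feats = noun_feats.get(obj, set())
                if "CONCRETE" in feats:
                    rels.append(("ins", head, obj))
                elif "PLACE" in feats:
                    rels.append(("plc", head, obj))
                else:
                    rels.append(("mod", head, obj))
        return rels

    deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
            ("prep", "eat", "with"), ("pobj", "with", "fork"),
            ("prep", "eat", "on"), ("pobj", "on", "bank"),
            ("prep", "bank", "of"), ("pobj", "of", "river")]
    noun_feats = {"Will-Smith": {"PERSON", "ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
    print(generate_relations(deps, noun_feats, do_verbs={"eat"}))
    # [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'),
    #  ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank'), ('mod', 'bank', 'river')]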

Attribute Generation (1/2)

• Attributes are handled in clusters.
• Attributes are generated by rules.
• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → attributes generated:
• eat is the verb of the main clause, VTYPE(eat, main), PROG(eat, +), TENSE(eat, past) → eat.@entry.@progress.@past
• det(apple, an) → apple.@indef
• det(fork, a) → fork.@indef
• det(bank, the) → bank.@def
• det(river, the) → river.@def
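
A corresponding sketch for the attribute rules, again over assumed data structures rather than the system's real ones:

    # Toy attribute generation from determiner dependencies and XLE facts.
    def generate_attributes(deps, xle_facts, main_verb):
        attrs = {main_verb: ["@entry"]}
        if xle_facts.get(("PROG", main_verb)) == "+":
            attrs[main_verb].append("@progress")
        if xle_facts.get(("TENSE", main_verb)) == "past":
            attrs[main_verb].append("@past")
        for rel, head, dep in deps:
            if rel == "det":
                attrs.setdefault(head, []).append("@indef" if dep in ("a", "an") else "@def")
        return attrs

    deps = [("det", "apple", "an"), ("det", "fork", "a"),
            ("det", "bank", "the"), ("det", "river", "the")]
    xle = {("PROG", "eat"): "+", ("TENSE", "eat"): "past", ("VTYPE", "eat"): "main"}
    print(generate_attributes(deps, xle, main_verb="eat"))
    # {'eat': ['@entry', '@progress', '@past'], 'apple': ['@indef'],
    #  'fork': ['@indef'], 'bank': ['@def'], 'river': ['@def']}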

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger.
• Scenario 4 lets us know the impact of the XLE parser.

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if:
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold.
• count(UW_x) = number of attributes in UW_x.
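
Under these definitions, relation precision/recall and attribute accuracy can be computed as below; this is my reading of the metric, shown only to make the definitions concrete:

    # Relation-level precision/recall/F and attribute accuracy, following the
    # metric definitions above (my interpretation, for illustration).
    def relation_prf(generated, gold):
        gen, gld = set(generated), set(gold)      # triples (relation, UW1, UW2)
        tp = len(gen & gld)
        precision = tp / len(gen) if gen else 0.0
        recall = tp / len(gld) if gld else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    def attribute_accuracy(attrs_gen, attrs_gold):
        # attrs_*: dict UW -> set of attributes, for UWs where UW_gen = UW_gold
        matched = sum(len(attrs_gen[uw] & attrs_gold[uw]) for uw in attrs_gold if uw in attrs_gen)
        total = sum(len(a) for a in attrs_gold.values())
        return matched / total if total else 0.0

    gen = [("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("mod", "bank", "river")]
    gold = [("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")]
    print(relation_prf(gen, gold))   # (0.666..., 0.666..., 0.666...)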

Results: Relation-wise accuracy

Results: Overall accuracy

Results: Relation

Results: Relation-wise precision

Results: Relation-wise recall

Results: Relation-wise F-score

Results: Attribute

[Bar chart: attribute accuracy for Scenarios 1–4; y-axis 0–0.5]

Results: Time taken

• Time taken for the systems to evaluate the 7000 sentences from the EOLSS corpus.

[Bar chart: time taken for Scenarios 1–4; y-axis 0–40]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart: false positives per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Missed Relations

[Bar chart: false negatives (missed relations) per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas.

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• George the president of the football club, James the coach of the club and the players went to watch their rivals' game.

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt/aoj: non-uniformity in the corpus
  – gol/obj: multiple relation generation possibilities

Example: The former however represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module                     Output
UNL expression             obj(contact(…
Lexeme Selection           संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव
                           (contact farmer this you region taluka Manchar Khatav)
Case Identification        संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव
                           (contact farmer this you region taluka Manchar Khatav)
Morphology Generation      संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खटाव
                           (contact-imperative farmer-pl this you region taluka Manchar Khatav)
Function Word Insertion    संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव
                           (contact farmers this-for you region or taluka-of Manchar Khatav)
Linearization              इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।
                           (this-for you Manchar region or Khatav taluka-of farmers contact)

[Figure: UNL graph of the input: contact (entry) with agt (you), obj (farmer), pur (this), plc (region); region linked by "or" to Manchar taluka and Khatav taluka via nam, scope :01]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCT, N-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

• Lexical choice is unambiguous.

[Figure: UNL graph fragment: contact linked by agt to you, obj to farmer, pur to this]

Case Marking

• Depends on the UNL relation and the properties of the nodes.
• Case gets transferred from head to modifiers.

Relation   Parent+    Parent-   Child+   Child-
obj        V          VINT      N        -
agt        V, past    -         N        -

[Figure: UNL graph fragment repeated: contact linked by agt to you, obj to farmer, pur to this]

Morphology Generation: Nouns

Suffix   Attribute values
uoM      N, NUM=pl, oblique
U        N, NUM=sg, oblique
I        N, NI, F, sg, oblique
iyoM     N, NI, F, pl, oblique
oM       N, NA, NOT-CHF, pl, oblique

The boy saw me: लड़क माझ दखा
Boys saw me: लड़क माझ दखा
The King saw me: र2 माझ दखा
Kings saw me: र2ऒ माझ दखा

Verb Morphology

Suffix        Tense     Aspect     Mood     N    Gen      P     VE
-e rahaa thaa past      progress   -        sg   male     3rd   e
-taa hai      present   custom     -        sg   male     3rd   -
-iyaa thaa    past      complete   -        sg   male     3rd   I
saktii hain   present   -          ability  pl   female   3rd   A

After Morphology Generation

[Figure: UNL graph before and after morphology generation: contact/संपर्क becomes संपर्क कीजिए, farmer/किसान becomes किसानों, this/यह becomes इस; the agt (you/आप), obj (farmer) and pur (this) relations are unchanged]

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मांचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव

[Figure: UNL graph after function word insertion: you = आप, this = इसके लिए, farmer = किसानों को, contact = संपर्क कीजिए]

Rel   Par+   Par-   Chi+       Chi-    ChFW
obj   V      VINT   N, ANIMT   topic   को

Linearization

इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए
(this-for you Manchar region or Khatav taluka-of farmers contact)

[Figure: the UNL graph with nodes placed in the final linear order: this, you, Manchar, region, Khatav, taluka, farmers, contact (संपर्क कीजिए)]

Syntax Planning: Assumptions

• Semantic Independence: the relative word order of a UNL relation's relata does not depend on the semantic properties of the relata.
• Context Independence: the relative word order of a UNL relation's relata does not depend on the rest of the expression.
• Local Ordering: the relative word order of various relations sharing a relatum does not depend on the rest of the expression.

Pairwise ordering of relations sharing a relatum:

        agt   aoj   obj   ins
agt      -     L     L     L
aoj            -     L     L
obj                  -     R
ins                        -

[Figure: UNL graph fragments illustrating the ordering of agt, obj and pur around contact]

Syntax Planning: Strategy

• Divide a node's relations into
  – Before_set = {obj, pur, agt}
  – After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

[Figure: the corresponding UNL subgraph (contact with obj, pur, agt children; region, Khatav, Manchar taluka)]

Syntax Planning: Algorithm

[Table: step-by-step trace of the planner with columns Stack, Before-Current, After-Current and Output, as the nodes contact, farmer, taluka, region and manchar are expanded to yield the order this, you, manchar, region, taluka, farmers, contact]

[Figure: the UNL graph being linearized]
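
The before/after grouping plus topological sort can be sketched as follows; the pairwise precedence table is my reading of the ordering matrix above, and the graph encoding is illustrative:

    from graphlib import TopologicalSorter   # Python 3.9+

    # Toy syntax planner: children of a head are ordered by pairwise precedence
    # constraints within their group (here, all children go before the head).
    PRECEDES = {("pur", "agt"), ("agt", "obj")}   # assumed: pur < agt < obj

    def order_children(children):
        # children: list of (relation, word)
        graph = {rel: set() for rel, _ in children}
        for a, b in PRECEDES:
            if a in graph and b in graph:
                graph[b].add(a)                   # b depends on a, so a comes first
        order = list(TopologicalSorter(graph).static_order())
        words = dict(children)
        return [words[rel] for rel in order]

    children = [("obj", "farmer"), ("pur", "this"), ("agt", "you")]
    print(order_children(children) + ["contact"])
    # ['this', 'you', 'farmer', 'contact']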

All Together (UNL → Hindi)

[Table: the complete walk-through, module by module (UNL expression, Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization), with the same outputs and glosses as in the step-through above, ending in the full Hindi sentence]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output (with fluency and adequacy scores):
1. कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए
   Fluency: 2, Adequacy: 3
2. परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए
   Fluency: 3, Adequacy: 4
3. माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत
   Fluency: 1, Adequacy: 1
4. 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp
   Fluency: 4, Adequacy: 4
5. इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp
   Fluency: 2, Adequacy: 3

Results

                          BLEU   Fluency   Adequacy
Geometric average         0.34   2.54      2.84
Arithmetic average        0.41   2.71      3.00
Standard deviation        0.25   0.89      0.89
Pearson corr. (BLEU)      1.00   0.59      0.50
Pearson corr. (Fluency)   0.59   1.00      0.68
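
The reported correlations are plain Pearson coefficients over per-sentence scores; for illustration, they can be computed like this (the score lists below are made-up placeholders):

    from scipy.stats import pearsonr

    # Hypothetical per-sentence scores, only to show how the reported
    # correlations are obtained.
    bleu     = [0.12, 0.40, 0.55, 0.33, 0.70]
    fluency  = [1, 3, 4, 2, 4]
    adequacy = [2, 3, 4, 3, 4]

    r_bf, _ = pearsonr(bleu, fluency)        # correlation between BLEU and fluency
    r_fa, _ = pearsonr(fluency, adequacy)    # correlation between fluency and adequacy
    print(r_bf, r_fa)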

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each fluency level (1–4), broken down by adequacy level (1–4)]

• Good correlation between fluency and BLEU
• Strong correlation between fluency and adequacy
• Can do large-scale evaluation using fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


[Figure: the classical pyramid of MT approaches, from direct translation at the graphemic (text) level, through surface and deep syntactic transfer, semantic and multilevel transfer, up to conceptual transfer at the interlingual level; analysis/generation levels: tagged text, C-structures (constituent), F-structures (functional), SPA structures (semantic & predicate-argument), semantico-linguistic interlingua, ontological interlingua; mixing levels gives multilevel description and semi-direct translation]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions:
IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs
  – Bidirectional
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language pairs (bidirectional):
Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal tasks:
H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical tasks for each language:
V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules.
• Regular internal subjective evaluation is carried out at IIIT Hyderabad.
• Third-party evaluation by C-DAC Pune.

Evaluation scale:
Grade  Description
0      Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator; y-axis 0–1.2]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to the translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy.

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participles (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
• वाच vaach (read) → वाचणे vaachaNe (reading)
• उतर utar (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• चाव chaav (bite) → चावणारा chaavaNaara (one who bites)
• खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → बसून basun (after sitting)

Kridanta Types

• Ne-Kridanta (-णे): vaachNyaasaaThee pustak de [for-reading book give] "Give me a book for reading." Aspect: perfective
• laa-Kridanta (-ल): Lekh vaachalyaavar saaMgen [article after-reading will-tell] "I will tell you that after reading the article." Aspect: perfective
• Taanaa-Kridanta (-ताना): Pustak vaachtaanaa te lakShaat aale [book while-reading it in-mind came] "I noticed it while reading the book." Aspect: durative
• Lela-Kridanta (-लेले): kaal vaachlele pustak de [yesterday read book give] "Give me the book that (I/you) read yesterday." Aspect: perfective
• Un-Kridanta (-ऊन): pustak vaachun parat kar [book after-reading back do] "Return the book after reading it." Aspect: completive
• Nara-Kridanta (-णारा): pustake vaachNaaRyaalaa dnyaan miLte [books to-the-one-who-reads knowledge gets] "The one who reads books gets knowledge." Aspect: stative
• ve-Kridanta (-वे): he pustak pratyekaane vaachaave [this book everyone should-read] "Everyone should read this book." Aspect: inceptive
• taa-Kridanta (-ता): to pustak vaachtaa vaachtaa zopee gelaa [he book while-reading to-sleep went] "He fell asleep while reading a book." Aspect: stative

Participial Suffixes in Other Agglutinative Languages

Kannada:
  muridiruwaa kombe jennu esee
  broken to branch throw
  "Throw away the broken branch."
  – similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu:
  ame padutunnappudoo nenoo panichesanoo
  she singing I work
  "I worked while she was singing."
  – similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish:
  hazirlanmis plan
  prepare-past plan
  "The plan which has been prepared."
  – equivalent to Marathi lelaa

Morphological Processing of Kridanta Forms (cont.)

[Figure: morphotactics FSM for kridanta processing]

Accuracy of Kridanta Processing: Direct Evaluation

[Bar chart: precision and recall for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta; y-axis 0.86–0.98]

Summary of M-H Transfer-Based MT

• Marathi and Hindi are close cousins.
• Relatively easier problem to solve.
• Will interlingua be better?
• Web sentences are being used to test the performance.
• Rule-governed.
• Needs a high level of linguistic expertise.
• Will be an important contribution to IL MT.

Summary: Interlingua-Based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL → Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

  • Summary of M-H transfer based MT
  • Summary interlingua based MT

16

Knowledge base of KBMT (22)

bull Analysis grammars for English and Japanese

bull Generation grammars for English and Japanese

bull Specialized syntax-semantics structural mapping rules

17

Underlying formalisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

• There are 6909 living languages (2009 census)
• Dialects far outnumber the languages
• Language varieties are called dialects:
  - if they have no standard or codified form
  - if the speakers of the given language do not have a state of their own
  - if they are rarely or never used in writing (outside reported speech)
  - if they lack prestige with respect to some other, often standardised, variety

23

ldquoLinguisticrdquo interlingua

• Three notable attempts at creating an interlingua:
  - "Interlingua" by IALA (International Auxiliary Language Association)
  - "Esperanto"
  - "Ido"

24

The Lord's Prayer in the Esperanto, Interlingua, Latin and English (traditional) versions

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

• Control languages for "Interlingua": French, Italian, Spanish, Portuguese, English, German and Russian
• Natural words in "Interlingua"
• "Manufactured" words in Esperanto, using heavy agglutination
• Esperanto word for "hospital": mal·san·ul·ej·o = mal (opposite) + san (health) + ul (person) + ej (place) + o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature, English has:
  - preposition + noun (in the house)
  - genitive + noun (Tom's house) or noun + genitive (the house of Tom)
  - auxiliary + verb (will come)
  - noun + relative clause (the cat that ate the rat)
  - adjective + standard of comparison (better than Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

• In satellite-framed languages like English, the motion verb typically also expresses manner or cause:
  - The bottle floated out of the cave (Manner)
  - The napkin blew off the table (Cause)
  - The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

• Verb-framed languages express manner and cause not in the verb but in a more peripheral element
• Spanish:
  - La botella salió de la cueva flotando
  - The bottle exited the cave floatingly
• Japanese:
  - Bin-ga dookutsu-kara nagara-te de-ta
  - bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT = EnConversion + Deconversion

[Fig.: Interlingua (UNL) as the hub; analysis from English, French, Hindi, Chinese into UNL, and generation from UNL back into these languages]

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

• Started in 1996
• 10 year programme
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups:
  - UNL_French (GETA-CLIPS, IMAG)
  - UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  - UNL_Italian (Univ. of Pisa)
  - UNL_Portugese (Univ. of Sao Paolo, Brazil)
  - UNL_Russian (Institute of Linguistics, Moscow)
  - UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

[Fig.: UNL as a hub linking English, Russian, Japanese, Hindi, Spanish, Marathi and other languages]

• Language-independent meaning representation

39

UNL represents knowledge: "John eats rice with a spoon"

[Fig.: UNL hypergraph for the sentence, with universal words as nodes, semantic relations as labelled arcs, and attributes attached to the nodes]

Repository of 42 semantic relations and 84 attribute labels
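The graph in the figure corresponds to a UNL expression of roughly the following form (the relation labels agt, obj, ins are standard UNL; the universal-word restrictions in parentheses are illustrative, not copied from the slide):

[UNL]
agt(eat(icl>do).@entry.@present, John(iof>person))
obj(eat(icl>do).@entry.@present, rice(icl>food))
ins(eat(icl>do).@entry.@present, spoon(icl>artifact))
[/UNL]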

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V,VOA) <E,0,0>

[mail] "mail(icl>message)" (N,PHSCL,INANI) <E,0,0>

[minister] "minister(icl>person)" (N,ANIMT,PHSCL,PRSN) <E,0,0>

(Headword, Universal Word, Attributes)

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] "he" (PRON,SUB,SING,3RD) <E,0,0>

[the] "the" (ART,THE) <E,0,0>

[to] "to" (PRE,TO) <E,0,0>

(Headword, Universal Word, Attributes)

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve centre is the verb
• English sentences are of the form:
  - A <verb> B, or
  - A <verb>

[Fig.: the verb becomes the central node, linked by relations to A and B; recurse on A and B]

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

[Fig.: System architecture. The input sentence passes through NER, the Stanford Dependency Parser, a Clause Marker and a Simplifier; each resulting simple sentence goes through a Simple Sentence Enconverter (Stanford Dependency Parser, XLE Parser, WSD, Feature Generation, Relation Generation, Attribute Generation); a Merger then combines the per-clause UNL graphs]

Complexity of handling long sentences

Problem: as sentence length (number of words) increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, so accuracy falls.

Solution

• Break long sentences into smaller ones
  - A sentence can only be broken at the clausal level: Simplification
  - Once broken, the pieces need to be joined again: Merging
• Reduce the reliance of the Enconverter on the XLE parser
  - The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example:
  - Sentence: John went to play and Jane went to school
  - Simplified: John went to play. Jane went to school.
• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL
• Example:
  - Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  - Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases based on how the sentence was simplified; the Merger uses rules for each case (a sketch of the scope handling follows)
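A minimal sketch of the scope handling in such a merge, assuming UNL relations are kept as (relation, arg1, arg2) tuples; the connective and the renumbering follow the example above, while the function name and data layout are ours:

    def merge_unl(main_graph, sub_graph, connective='and', scope=':01'):
        """Merge two simple-sentence UNL graphs: keep the first as the main scope,
        push the second into a numbered sub-scope and link it with a connective."""
        merged = list(main_graph)
        # Link the entry word of the main graph to the new scope, e.g. and(go, :01).
        main_entry = next(arg1 for rel, arg1, arg2 in main_graph if '@entry' in arg1)
        merged.append((connective, main_entry.split('.')[0], scope))
        # Re-scope every relation of the second graph: agt -> agt:01, gol -> gol:01, ...
        for rel, arg1, arg2 in sub_graph:
            merged.append((rel + scope, arg1, arg2))
        return merged

    # merge_unl([('agt', 'go.@entry', 'John'), ('pur', 'go.@entry', 'play')],
    #           [('agt', 'go.@entry', 'Mary'), ('gol', 'go.@entry', 'school')])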

NLP tools and resources for UNL generation

Tools:
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources:
• Wordnet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing:
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing:
• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation (noun features, verb features)

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types
• Types covered:
  - Person
  - Place
  - Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1)
aux(eating-3, was-2)
det(apple-5, an-4)
dobj(eating-3, apple-5)
prep(eating-3, with-6)
det(fork-8, a-7)
pobj(with-6, fork-8)
prep(eating-3, on-9)
det(bank-11, the-10)
pobj(on-9, bank-11)
prep(bank-11, of-12)
det(river-14, the-13)
pobj(of-12, river-14)

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

Dependency relations:
OBJ(eat:8, apple:16)
SUBJ(eat:8, will-smith:0)
ADJUNCT(apple:16, on:20)
OBJ(on:20, bank:22)
ADJUNCT(apple:16, with:17)
OBJ(with:17, fork:19)
ADJUNCT(bank:22, of:23)
OBJ(of:23, river:25)

Important information:
PASSIVE(eat:8, -)
PROG(eat:8, +)
TENSE(eat:8, past)
VTYPE(eat:8, main)

Input: Will-Smith was eating an apple with a fork on the bank of the river

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word → Stem:
was → be
eating → eat
apple → apple
fork → fork
bank → bank
river → river

Input: Will-Smith was eating an apple with a fork on the bank of the river
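A minimal sketch of this stemming step, assuming NLTK's WordNet interface (the slide only says Wordnet is used; the helper name is ours):

    from nltk.corpus import wordnet as wn

    def find_stem(word, pos=wn.VERB):
        # wn.morphy applies WordNet's exception lists (was -> be) and
        # detachment rules (eating -> eat); fall back to the surface form.
        return wn.morphy(word, pos) or word

    # find_stem('was') -> 'be', find_stem('eating') -> 'eat', find_stem('fork', wn.NOUN) -> 'fork'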

Semantic Processing: WSD

Word | POS | Synset-id | Sense-id
Will-Smith | Proper noun | - | -
was | Verb | 02610777 | be%2:42:03
eating | Verb | 01170802 | eat%2:34:00
an | Determiner | - | -
apple | Noun | 07755101 | apple%1:13:00
with | Preposition | - | -
a | Determiner | - | -
fork | Noun | 03388794 | fork%1:06:00
on | Preposition | - | -
the | Determiner | - | -
bank | Noun | 09236472 | bank%1:17:01
of | Preposition | - | -
the | Determiner | - | -
river | Noun | 09434308 | river%1:17:00

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word → Universal Word
Will-Smith → Will-Smith(iof>PERSON)
eating → eat(icl>consume, equ>consumption)
apple → apple(icl>edible fruit>thing)
fork → fork(icl>cutlery>thing)
bank → bank(icl>slope>thing)
river → river(icl>stream>thing)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs
• For the frame "noun1 of noun2", where noun1 originates from verb1, the system should generate obj(verb1, noun2) instead of a relation between noun1 and noun2

Algorithm:
- Traverse the hypernym path of the noun to its root
- If any word on the path is "action", the noun originated from a verb (see the sketch below)

Examples:
- removal of garbage = removing garbage
- collection of stamps = collecting stamps
- wastage of food = wasting food
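A minimal sketch of the hypernym-path test, assuming NLTK's WordNet API (the predicate name and the extra keyword "act" are ours):

    from nltk.corpus import wordnet as wn

    def derived_from_verb(noun):
        # Walk the hypernym closure of each noun sense and look for the
        # keyword "action" (or "act") on the path to the root.
        for syn in wn.synsets(noun, pos=wn.NOUN):
            for hyper in syn.closure(lambda s: s.hypernyms()):
                if any(l.name() in ('action', 'act') for l in hyper.lemmas()):
                    return True
        return False

    # Expected: derived_from_verb('removal') -> True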

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add the word's named-entity type to its feature set
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun | Features
Will-Smith | PERSON, ANIMATE
bank | PLACE
fork | CONCRETE

Types of verbs

• Verb features are based on the types of verbs
• There are three types:
  - Do: verbs that are performed volitionally
  - Occur: verbs that happen involuntarily, without any performer
  - Be: verbs that state the existence of something or some fact
• We classify these three into do and not-do

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is of "do" type
• "do" verbs are volitional verbs, performed voluntarily
• Heuristic: if the verb has a frame of the form "Something/Somebody ...s something/somebody", mark it as "do" (a sketch follows)
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do"

Word | Type | Feature
was | not-do | -
eating | do | do
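A minimal sketch of the frame heuristic, assuming the verb frames exposed by NLTK's WordNet interface (lemma.frame_strings(); the function name and the first/last-token test are ours):

    from nltk.corpus import wordnet as wn

    def is_do_verb(verb):
        # "do" if some WordNet sentence frame of the verb has an explicit
        # agent and patient, i.e. looks like "Somebody <verb> something".
        for syn in wn.synsets(verb, pos=wn.VERB):
            for lemma in syn.lemmas():
                for frame in lemma.frame_strings():
                    tokens = frame.split()
                    if tokens and tokens[0] in ('Somebody', 'Something') \
                            and tokens[-1] in ('something', 'somebody'):
                        return True
        return False

    # Expected: is_do_verb('eat') -> True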

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

• Rules are applied on:
  - Stanford Dependency relations
  - Noun features
  - Verb features
  - POS tags
  - XLE-generated information

Relation Generation (22)

Conditions → Relation generated

nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active → agt(eat, Will-Smith)

dobj(eating, apple); eat is active → obj(eat, apple)

prep(eating, with); pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)

prep(eating, on); pobj(on, bank); bank.feature = PLACE → plc(eat, bank)

prep(bank, of); pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)

Input: Will-Smith was eating an apple with a fork on the bank of the river
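A minimal sketch of how two of these relation rules could be encoded, assuming dependencies come in as (label, head, dependent) triples and features, verb types and voice are simple dictionaries (all names and the data layout are ours; the conditions are the ones in the table above):

    def generate_relations(deps, feats, verb_type, voice):
        relations = []
        for label, head, dep in deps:
            # nsubj + ANIMATE subject + volitional ("do") active verb -> agt
            if label == 'nsubj' and 'ANIMATE' in feats.get(dep, ()) \
                    and verb_type.get(head) == 'do' and voice.get(head) == 'active':
                relations.append(('agt', head, dep))
            # dobj of an active verb -> obj
            elif label == 'dobj' and voice.get(head) == 'active':
                relations.append(('obj', head, dep))
        return relations

    # generate_relations([('nsubj', 'eat', 'Will-Smith'), ('dobj', 'eat', 'apple')],
    #                    {'Will-Smith': ('ANIMATE', 'PERSON')}, {'eat': 'do'}, {'eat': 'active'})
    # -> [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple')]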

Attribute Generation (12)

• Attributes are handled in clusters
• Attributes are generated by rules
• Rules are applied on:
  - Stanford dependency relations
  - XLE information
  - POS tags

Attribute Generation (22)

Conditions → Attributes generated

eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past

det(apple, an) → apple.@indef

det(fork, a) → fork.@indef

det(bank, the) → bank.@def

det(river, the) → river.@def

Input: Will-Smith was eating an apple with a fork on the bank of the river
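A minimal sketch of the determiner and tense/aspect attribute rules, assuming the XLE facts arrive as a per-verb dictionary (names and data layout are ours; the mappings are the ones in the table above):

    def generate_attributes(deps, xle):
        attrs = {}
        # Determiner cluster: a/an -> @indef, the -> @def
        for label, head, dep in deps:
            if label == 'det' and dep in ('a', 'an'):
                attrs.setdefault(head, []).append('@indef')
            elif label == 'det' and dep == 'the':
                attrs.setdefault(head, []).append('@def')
        # Verb cluster: main-clause verb gets @entry, plus aspect/tense from XLE
        for verb, info in xle.items():
            if info.get('VTYPE') == 'main':
                attrs.setdefault(verb, []).append('@entry')
            if info.get('PROG') == '+':
                attrs.setdefault(verb, []).append('@progress')
            if info.get('TENSE') == 'past':
                attrs.setdefault(verb, []).append('@past')
        return attrs

    # generate_attributes([('det', 'apple', 'an')], {'eat': {'VTYPE': 'main', 'PROG': '+', 'TENSE': 'past'}})
    # -> {'apple': ['@indef'], 'eat': ['@entry', '@progress', '@past']}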

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus (7000 sentences)
• Evaluated on 4 scenarios:
  - Scenario 1: previous system (tightly coupled with the XLE parser)
  - Scenario 2: current system (tightly coupled with the Stanford parser)
  - Scenario 3: Scenario 2 minus (Simplifier + Merger)
  - Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics

• A generated tuple t_gen = relation_gen(UW1_gen, UW2_gen)
• A gold tuple t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen matches t_gold iff:
  - relation_gen = relation_gold
  - UW1_gen = UW1_gold
  - UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
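A small worked sketch of how these counts would feed the reported scores; the aggregation into precision/recall and the attribute-accuracy formula are our reading of the metric, since the slide only defines the per-tuple match and the attribute counts:

    def relation_scores(generated, gold):
        # generated/gold: sets of (relation, uw1, uw2) tuples for one sentence.
        matched = len(generated & gold)
        precision = matched / len(generated) if generated else 0.0
        recall = matched / len(gold) if gold else 0.0
        return precision, recall

    def attribute_accuracy(amatch, count_gold):
        # One plausible reading: matched attributes over gold attributes.
        return amatch / count_gold if count_gold else 1.0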

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Fig.: Attribute accuracy for Scenarios 1-4; y-axis from 0 to 0.5]

Results Time-taken

• Time taken for the systems to process 7000 sentences from the EOLSS corpus

[Fig.: Time taken for Scenarios 1-4; y-axis ticks from 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Fig.: False positives per relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall; y-axis ticks from 0 to 35000]

Missed Relations

[Fig.: False negatives per relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall; y-axis ticks from 0 to 50000]

Case Study (1/3)

• Conjunctive relations
  - Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  - Non-uniform relation generation
• George the president of the football club, James the coach of the club and the players went to watch their rival's game

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision, frequent relations:
  - mod and qua
• More general rules:
  - agt vs aoj: non-uniformity in the corpus
  - gol vs obj: multiple relation generation possibilities
• The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module-by-module output for the input UNL expression obj(contact(… (the Devanagari output strings are not legible in the transcript; the English glosses from the slide are kept):

1. Lexeme Selection: contact, farmer, this, you, region, taluka, Manchar, Khatav
2. Case Identification: the same lexemes, now with case identified
3. Morphology Generation (gloss): contact-imperative, farmer-pl, this, you, region, taluka, Manchar, Khatav
4. Function Word Insertion (gloss): contact farmers this-for you region or taluka-of Manchar Khatav
5. Linearization (gloss): this-for you Manchar region or Khatav taluka-of farmers contact

[Fig.: the input UNL graph over the nodes contact (entry), you, this, farmer, region, taluka, Manchar, Khatav, connected by the relations agt, obj, pur, plc, or and nam]

Lexeme Selection

Two dictionary entries for "contact" (the Hindi headwords are not legible in the transcript):

"contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

"contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.

[Fig.: partial graph: agt(contact, you), obj(contact, farmer), pur(contact, this)]

Case Marking

Relation | Parent+ | Parent- | Child+ | Child-
Obj | V | VINT | N | -
Agt | V, past | - | N | -

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[Fig.: the same partial graph (agt: you, obj: farmer, pur: this around contact), now with case identified]

Morphology Generation Nouns

Suffix | Attribute values
uoM | N, NUM=pl, oblique
U | N, NUM=sg, oblique
I | N, NIF=sg, oblique
iyoM | N, NIF=pl, oblique
oM | N, NANOTCHF=pl, oblique

Examples: The boy saw me; Boys saw me; The King saw me; Kings saw me (the Hindi forms are not reproduced here)
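A minimal sketch of attribute-driven suffix selection of this kind (the class labels, function name and abridged mapping are ours; only the suffixes from the table above are used):

    # Abridged from the table above: oblique-noun suffix by (noun class, number).
    OBLIQUE_SUFFIXES = {
        ('I-class', 'sg'): 'I',
        ('I-class', 'pl'): 'iyoM',
        ('U-class', 'sg'): 'U',
        ('U-class', 'pl'): 'uoM',
        ('other',   'pl'): 'oM',
    }

    def noun_surface(stem, cls, number, oblique):
        """Attach the oblique suffix dictated by the noun's class and number."""
        if oblique and (cls, number) in OBLIQUE_SUFFIXES:
            return stem + OBLIQUE_SUFFIXES[(cls, number)]
        return stem

    # noun_surface('raajaa', 'other', 'pl', True) -> 'raajaaoM'   (cf. "Kings saw me")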

Verb Morphology

Suffix | Tense | Aspect | Mood | Num | Gen | Person | VE
-e rahaa thaa | past | progress | - | sg | male | 3rd | e
-taa hai | present | custom | - | sg | male | 3rd | -
-iyaa thaa | past | complete | - | sg | male | 3rd | I
saktii hain | present | - | ability | pl | female | 3rd | A

After Morphology Generation

[Fig.: the partial graph before and after morphology generation; contact now carries the imperative form and farmer the plural oblique form]

Function Word Insertion

[The Hindi string before and after function word insertion; gloss: "contact farmers this you region taluka Manchar Khatav" → "contact farmers this-for you region or taluka-of Manchar Khatav"]

[Fig.: the partial graph with function words attached to farmer and this]

Rel | Par+ | Par- | Chi+ | Chi- | Child FW
Obj | V | VINT | N,ANIMT | topic | क

Linearization

[Fig.: linearization of the graph into the order glossed as: this, you, Manchar region, Khatav taluka, farmers, contact]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  - Semantic Independence: the semantic properties of the relata
  - Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  - Local Ordering: the rest of the expression

[Fig.: partial graphs for the contact example, plus a pairwise ordering table over relations:
      agt  aoj  obj  ins
agt        L    L    L
aoj             L    L
obj                  R
ins]

Syntax Planning Strategy

• Divide a node's relations into a Before set and an After set: Before = {obj, pur, agt}, After = {}
• Topologically sort each group using the pairwise order: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

[Fig.: the same worked example on the contact graph, with region, taluka, Manchar and Khatav ordered within their subtree]

Syntax Planning Algo

[Fig.: a worked trace of the planning algorithm on the contact graph, showing the stack of nodes, the Before/After sets of the current node, and the output string growing left to right toward the final linearized order; a sketch of the ordering step follows]
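A minimal sketch of the Before/After ordering step described above, assuming a hand-written priority list per relation (the priorities, function name and data layout are ours; the slide's example uses Before = {pur, agt, obj} ordered as pur, agt, obj):

    # Relations whose relata precede the head word, in their relative order.
    BEFORE_PRIORITY = ['pur', 'agt', 'obj', 'ins', 'plc']
    AFTER_PRIORITY = []          # none needed for the worked example

    def linearize(head, children):
        """children: list of (relation, already-linearized child string).
        Relations not listed in either priority table are simply dropped in this sketch."""
        before = sorted([c for c in children if c[0] in BEFORE_PRIORITY],
                        key=lambda c: BEFORE_PRIORITY.index(c[0]))
        after = sorted([c for c in children if c[0] in AFTER_PRIORITY],
                       key=lambda c: AFTER_PRIORITY.index(c[0]))
        return ' '.join([s for _, s in before] + [head] + [s for _, s in after])

    # linearize('contact', [('obj', 'farmer'), ('pur', 'this'), ('agt', 'you')])
    # -> 'this you farmer contact'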

All Together (UNL -> Hindi)

[The same module-by-module trace as in "Step-through the Deconverter" above (lexeme selection, case identification, morphology generation, function word insertion, syntax linearization); the Devanagari in this copy of the table is not legible in the transcript and is not reproduced]

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
  - Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  - Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output | Fluency | Adequacy

(five sample Hindi outputs; the Devanagari is not legible in the transcript)
Sample 1 | 2 | 3
Sample 2 | 3 | 4
Sample 3 | 1 | 1
Sample 4 | 4 | 4
Sample 5 | 2 | 3

Results

Measure | BLEU | Fluency | Adequacy
Geometric average | 0.34 | 2.54 | 2.84
Arithmetic average | 0.41 | 2.71 | 3.00
Standard deviation | 0.25 | 0.89 | 0.89
Pearson corr. with BLEU | 1.00 | 0.59 | 0.50
Pearson corr. with Fluency | 0.59 | 1.00 | 0.68

Fluency vs Adequacy

[Fig.: number of sentences per Fluency score (1-4), broken down by Adequacy score (1-4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Fig.: Vauquois diagram of MT strategies (direct translation, syntactic transfer, semantic transfer, multilevel transfer, interlingua), repeated here to situate transfer-based MT]

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems
  - Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  - POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  - Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine
H2 Morph analyser engine
H3 Generator engine
H4 Lexical disambiguation engine
H5 Named entity engine
H6 Dictionary standards
H7 Corpora annotation standards
H8 Evaluation of output (comprehensibility)
H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker
V2 Morph analyzer
V3 Generator
V4 Named entity recognizer
V5 Bilingual dictionary (bidirectional)
V6 Transfer grammar
V7 Annotated corpus
V8 Evaluation
V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (grade: description)

0: Nonsense (the sentence doesn't make any sense at all; like someone speaking to you in a language you don't know)

1: Some parts make sense but it is not comprehensible overall (like listening to a language with lots of words borrowed from your language; you understand those words but nothing more)

2: Comprehensible but has quite a few errors (like someone who can speak your language but makes lots of errors; you can still make sense of what is being said)

3: Comprehensible, occasional errors (like someone speaking Hindi getting all its genders wrong)

4: Perfect (like someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

[Fig.: Module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator; y-axis from 0 to 1.2]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  - Evaluated on 500 web sentences
  - Accuracy calculated from the scores assigned to translation quality
  - Accuracy: 65.32%
• Result analysis
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

17

Underlying formlisms

bull The knowledge representation system FrameKit

bull A language for representing domain models (a semantic extension of FRAMEKIT)

bull Specialized grammar formalisms based on Lexical-Functional Grammar

bull A specially constructed language for representing text meanings (the interlingua)

bull The languages of analysis and generation lexicon entries and of the structural mapping rules

18

Procedural components

bull A syntactic parser with a semantic constraint interpreter

bull A semantic mapper for treating additional types of semantic constraints

bull An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (1/3)

• Conjunctive relations – wrong apposition detection
• Sentence: James loves eating red water-melon, apples peeled nicely and bananas

  amod(water-melon-5, red-4)
  appos(water-melon-5, apples-7)
  partmod(apples-7, peeled-8)
  advmod(peeled-8, nicely-9)
  cc(nicely-9, and-10)
  conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations – non-uniform relation generation
• Sentence: George the president of the football club, James the coach of the club and the players went to watch their rival's game

  appos(George-1, president-4)
  conj(club-8, James-10)
  conj(club-8, coach-13)
  cc(club-8, and-17)
  conj(club-8, players-19)

Case Study (3/3)

• Low precision for frequent relations
  – mod and qua: rules are too general
  – agt vs aoj: non-uniformity in the corpus
  – gol vs obj: multiple relation generation possibilities
• Sentence: The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

  Corpus relations:            Generated relations:
  agt(represent, latter)       aoj(represent, latter)
  aoj(represent, former)       aoj(represent, former)
  obj(dosage, body)            gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module                      Output (with gloss)

UNL Expression              obj(contact(…
Lexeme Selection            संपर्क किसान यह आप क्षेत्र तालुका मांचर खाटव
                            (contact farmer this you region taluka Manchar Khatav)
Case Identification         संपर्क किसान यह आप क्षेत्र तालुका मांचर खाटव
                            (contact farmer this you region taluka Manchar Khatav)
Morphology Generation       संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खाटव
                            (contact-imperative farmers this you region taluka Manchar Khatav)
Function Word Insertion     संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खाटव
                            (contact-imperative farmers this-for you region or taluka-of Manchar Khatav)
Linearization               इसके लिए आप मांचर क्षेत्र या खाटव तालुका के किसानों को संपर्क कीजिए ।
                            (this-for you Manchar region or Khatav taluka-of farmers contact-imperative)

[Figure: UNL graph of the example – node contact with agt you, obj farmer, pur this, plc region; region linked via nam/or to Manchar, Khatav and taluka (scope :01)]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous

[Figure: lexemes attached to the sub-graph – you → आप, this → यह, farmer → किसान, contact → संपर्क]

Case Marking

Relation   Parent+   Parent–   Child+   Child–
obj        V         VINT      N
agt        V         past      N

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[Figure: case marked on the same sub-graph – you → आप, this → यह, farmer → किसान, contact → संपर्क]

Morphology Generation: Nouns

Suffix    Attribute values
-uoM      N, NUM=pl, oblique
-U        N, NUM=sg, oblique
-I        N, NIF=sg, oblique
-iyoM     N, NIF=pl, oblique
-oM       N, ANOTCHF=pl, oblique

The boy saw me   – लड़के ने मुझे देखा
Boys saw me      – लड़कों ने मुझे देखा
The King saw me  – राजा ने मुझे देखा
Kings saw me     – राजाओं ने मुझे देखा

Verb Morphology

Suffix          Tense     Aspect     Mood      N    Gen      P     VE
-e rahaa thaa   past      progress   –         sg   male     3rd   e
-taa hai        present   custom     –         sg   male     3rd   –
-iyaa thaa      past      complete   –         sg   male     3rd   I
saktii hain     present   –          ability   pl   female   3rd   A
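Suffix selection of this kind is essentially a keyed lookup over (tense, aspect, mood, number, gender, person); a minimal sketch containing only the rows shown above (the table and helper are illustrative):

# Hedged sketch: pick a Hindi verb suffix from TAM + agreement features,
# mirroring the table above (only the four rows shown there).
VERB_SUFFIX = {
    ("past", "progress", None, "sg", "male", "3rd"): "-e rahaa thaa",
    ("present", "custom", None, "sg", "male", "3rd"): "-taa hai",
    ("past", "complete", None, "sg", "male", "3rd"): "-iyaa thaa",
    ("present", None, "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, num="sg", gen="male", person="3rd"):
    return VERB_SUFFIX.get((tense, aspect, mood, num, gen, person))

print(verb_suffix("past", aspect="progress"))   # -e rahaa thaa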

After Morphology Generation

[Figure: the sub-graph before and after morphology generation – this यह → इस (oblique), farmer किसान → किसानों (plural oblique), contact संपर्क → संपर्क कीजिए (imperative); आप unchanged]

Function Word Insertion

Before (after morphology): संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खाटव
After (function words inserted): संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खाटव

[Figure: function words attached in the sub-graph – this → इसके लिए, farmer → किसानों को, contact → संपर्क कीजिए]

Rel   Par+   Par–   Chi+       Chi–    ChFW
obj   V      VINT   N, ANIMT   topic   को
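Function-word choice is again a table lookup keyed on the relation and node properties; a sketch using only the rule row shown above (rule set and helper name are illustrative):

# Hedged sketch: insert a case/function word on the child of a UNL relation,
# following the single rule row shown above (obj + animate child -> 'ko').
FW_RULES = [
    # (relation, parent_has, parent_lacks, child_has, child_lacks, function_word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "को"),
]

def child_function_word(rel, parent_feats, child_feats):
    for r, p_has, p_lacks, c_has, c_lacks, fw in FW_RULES:
        if (rel == r and p_has <= parent_feats and not (p_lacks & parent_feats)
                and c_has <= child_feats and not (c_lacks & child_feats)):
            return fw
    return None

print(child_function_word("obj", {"V"}, {"N", "ANIMT", "PRSN"}))   # को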

Linearization

Final order: इसके लिए आप मांचर क्षेत्र या खाटव तालुका के किसानों को संपर्क कीजिए
             (this-for you Manchar region or Khatav taluka-of farmers contact-imperative)

[Figure: linearized UNL graph – contact संपर्क कीजिए with agt you, obj farmer, pur this, plc region; region linked via nam/or to Manchar, Khatav and taluka (scope :01)]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

Pairwise precedence matrix (L = the row relation's child goes to the left of the column relation's child, R = to the right):

       agt   aoj   obj   ins
agt          L     L     L
aoj                L     L
obj                      R
ins

[Figure: the contact sub-graph (agt you, obj farmer, pur this) used to illustrate the assumptions]

Syntax Planning: Strategy

• Divide a node's relations into
  Before_set = {obj, pur, agt}
  After_set  = {}
• Topologically sort each group using the precedence matrix: pur, agt, obj
  → children in that order: this, you, farmer
• Final order: this you farmer contact

[The region / Khatav / Manchar-taluka sub-graph is planned in the same way]
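The before/after split plus an ordering consistent with the pairwise matrix can be sketched as follows (matrix entries and relation names as on the slides; the brute-force search is illustrative, not the actual planner):

# Hedged sketch: order the children of a node using the pairwise precedence
# matrix ('L' = row relation's child goes LEFT of the column relation's child,
# 'R' = right), as in the matrix above.
from itertools import permutations

PREC = {("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
        ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R"}

def violates(r_left, r_right):
    """True if placing r_left's child before r_right's child breaks PREC."""
    return PREC.get((r_left, r_right)) == "R" or PREC.get((r_right, r_left)) == "L"

def order_children(children):
    """children: list of (relation, word); brute-force a PREC-consistent order."""
    for seq in permutations(children):
        if not any(violates(seq[i][0], seq[j][0])
                   for i in range(len(seq)) for j in range(i + 1, len(seq))):
            return [w for _, w in seq]
    return [w for _, w in children]   # fall back to the given order

before_set = [("obj", "farmer"), ("pur", "this"), ("agt", "you")]
print(order_children(before_set) + ["contact"])   # ['this', 'you', 'farmer', 'contact']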

Syntax Planning: Algo

[Worked trace: a stack holds the current node with its Before/After sets, and children are expanded depth-first; columns Stack | BeforeCurrent | AfterCurrent | Output. Starting from contact, the output grows this → this you → this you Manchar → this you Manchar region → … until the full order is emitted]

[Figure: the full UNL graph of the example, as before]

All Together (UNL -> Hindi)

The same worked example end to end; module outputs (glosses) as in the step-through above:

Module                      Output (gloss)
UNL Expression              obj(contact(…
Lexeme Selection            contact farmer this you region taluka Manchar Khatav
Case Identification         contact farmer this you region taluka Manchar Khatav
Morphology Generation       contact-imperative farmers this you region taluka Manchar Khatav
Function Word Insertion     contact-imperative farmers this-for you region or taluka-of Manchar Khatav
Syntax Linearization        this-for you Manchar region or Khatav taluka-of farmers contact-imperative

How to Evaluate UNL Deconversion

• UNL -> Hindi
• The reference translation needs to be generated from UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken – understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                         BLEU   Fluency   Adequacy
Geometric Average        0.34   2.54      2.84
Arithmetic Average       0.41   2.71      3.00
Standard Deviation       0.25   0.89      0.89
Pearson Cor. (BLEU)      1.00   0.59      0.50
Pearson Cor. (Fluency)   0.59   1.00      0.68

Fluency vs Adequacy

[Chart: number of sentences at each Fluency level (1–4), broken down by Adequacy 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large scale evaluation using Fluency alone
  – Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: the analysis–transfer–generation pyramid – from direct translation at the graphemic/tagged-text level, through surface and deep syntactic transfer (C-structures, F-structures), semantic and multilevel transfer (SPA-structures), up to conceptual transfer via semantico-linguistic or ontological interlingua]

Sampark: IL–IL MT systems

• Consortium mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives

• Develop general purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain specific versions of the MT systems
  – Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary – bidirectional; V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third party evaluation by C-DAC Pune

Evaluation Scale

Grade  Description
0      Nonsense (the sentence doesn't make any sense at all – like someone speaking to you in a language you don't know)
1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language with lots of words borrowed from your language – you understand those words but nothing more)
2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

[Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on the score given according to the translation quality
  – Accuracy: 65.32%

• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
  वाच vaach (read) → वाचणे vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
  चाव chaav (bite) → चावणारा chaavaNaara (one who bites)
  खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types (type – example – aspect)

• "णे" Ne-Kridanta – vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] – Perfective
• "ला" laa-Kridanta – Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] – Perfective
• "ताना" Taanaa-Kridanta – Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] – Durative
• "लेले" Lela-Kridanta – kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] – Perfective
• "ऊन" Un-Kridanta – pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] – Completive
• "णारा" Nara-Kridanta – pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] – Stative
• "वे" ve-Kridanta – he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] – Inceptive
• "ता" taa-Kridanta – to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada
  muridiruwaa kombe jennu esee
  (gloss: broken to branch throw)
  "Throw away the broken branch"
  – similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu
  ame padutunnappudoo nenoo panichesanoo
  (gloss: she singing I work)
  "I worked while she was singing"
  – similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish
  hazirlanmis plan
  (gloss: prepare-past plan)
  "The plan which has been prepared"
  – equivalent to Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
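The morphotactics FSM can be pictured as states over stem + kridanta suffix; a toy segmentation sketch (the suffix inventory and helper are illustrative – the actual analyser's FSM is richer):

# Hedged sketch: toy segmentation of a transliterated Marathi kridanta form
# into stem + participle suffix, using the suffix forms from the examples above.
KRIDANTA_SUFFIXES = ["NyaasaaThee", "lyaavar", "taanaa", "lele", "un",
                     "NaaRyaalaa", "aave", "taa"]

def segment_kridanta(word):
    """Return (stem, suffix) if the word ends in a known kridanta suffix."""
    for suf in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, None

print(segment_kridanta("vaachtaanaa"))   # ('vaach', 'taanaa')
print(segment_kridanta("vaachlele"))     # ('vaach', 'lele')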

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall (0.86–0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

18

Procedural components

• A syntactic parser with a semantic constraint interpreter
• A semantic mapper for treating additional types of semantic constraints
• An interactive augmentor for treating residual ambiguities

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

"Linguistic" interlingua

• Three notable attempts at creating an interlingua
  – "Interlingua" by IALA (International Auxiliary Language Association)
  – "Esperanto"
  – "Ido"

24

The Lord's Prayer: Esperanto, Interlingua, Latin and English (traditional) versions

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caelis, sanctificetur nomen tuum 2 Adveniat regnum tuum 3 Fiat voluntas tua sicut in caelo et in terra 4 Panem nostrum quotidianum da nobis hodie 5 et dimitte nobis debita nostra sicut et nos dimittimus debitoribus nostris 6 Et ne nos inducas in tentationem sed libera nos a malo Amen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

"Interlingua" and "Esperanto"

• Control languages for "Interlingua": French, Italian, Spanish, Portuguese, English, German and Russian
• Natural words in "Interlingua"; "manufactured" words in Esperanto using heavy agglutination
• Esperanto word for "hospital":
  mal·san·ul·ej·o = mal (opposite) + san (health) + ul (person) + ej (place) + o (noun)

26

27

Language Universals vs Language Typology

• "Universals" is concerned with what human languages have in common, while the study of typology deals with the ways in which languages differ from each other

28

Typology: basic word order

• SOV (Japanese, Tamil, Turkish, etc.)
• SVO (Fula, Chinese, English, etc.)
• VSO (Arabic, Tongan, Welsh, etc.)

"Subjects tend strongly to precede objects"

29

Typology Morphotactics of singular and plural

• No expression: Japanese hito "person", pl. hito
• Function word: Tagalog bato "stone", pl. mga bato
• Affixation: Turkish ev "house", pl. ev-ler; Swahili m-toto "child", pl. wa-toto
• Sound change: English man, pl. men; Arabic rajulun "man", pl. rijalun
• Reduplication: Malay anak "child", pl. anak-anak

30

Typology Implication of word order

Due to its SVO nature, English has
  – preposition + noun (in the house)
  – genitive + noun (Tom's house) or noun + genitive (the house of Tom)
  – auxiliary + verb (will come)
  – noun + relative clause (the cat that ate the rat)
  – adjective + standard of comparison (better than Tom)

31

Typology: motion verbs (1/3)

• Motion is expressed differently in different languages, and the differences turn out to be highly significant
• There are two large types
  – verb-framed and
  – satellite-framed languages

32

Typology: motion verbs (2/3)

• In satellite-framed languages like English, the motion verb typically also expresses manner or cause
  – The bottle floated out of the cave (Manner)
  – The napkin blew off the table (Cause)
  – The water rippled down the stairs (Manner)

33

Typology: motion verbs (3/3)

• Verb-framed languages express manner and cause not in the verb but in a more peripheral element
• Spanish
  – La botella salió de la cueva flotando
  – "The bottle exited the cave floatingly"
• Japanese
  – Bin-ga dookutsu-kara nagara-te de-ta
  – bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT: EnConversion + Deconversion

[Figure: languages (English, French, Hindi, Chinese, …) linked to the Interlingua (UNL) – analysis into UNL, generation out of UNL]

Challenges of interlingua generation

• Mother of NLP problems – extract meaning from a sentence
• Almost all NLP problems are sub-problems: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on

UNL: a United Nations project

• Started in 1996
• 10 year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups
  – UNL_French (GETA-CLIPS, IMAG)
  – UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  – UNL_Italian (Univ. of Pisa)
  – UNL_Portugese (Univ. of Sao Paolo, Brazil)
  – UNL_Russian (Institute of Linguistics, Moscow)
  – UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language independent meaning representation

[Figure: UNL at the hub, connected to English, Russian, Japanese, Hindi, Spanish, Marathi and others]

39

UNL represents knowledge John eats rice with a spoon

[Figure: UNL graph of the sentence, showing semantic relations, attributes and universal words; repository of 42 semantic relations and 84 attribute labels]

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41

The Lexicon

Content words

[forward]   "forward(icl>send)"        (V,VOA)                <E,0,0>
[mail]      "mail(icl>message)"        (N,PHSCL,INANI)        <E,0,0>
[minister]  "minister(icl>person)"     (N,ANIMT,PHSCL,PRSN)   <E,0,0>

Headword – Universal Word – Attributes

He forwarded the mail to the minister

42

The Lexicon (contd.)

Function words

[he]   "he"    (PRON,SUB,SING,3RD)   <E,0,0>
[the]  "the"   (ART,THE)             <E,0,0>
[to]   "to"    (PRE,TO)              <E,0,0>

Headword – Universal Word – Attributes

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve centre is the verb
• English sentences are of the form
  – A <verb> B, or
  – A <verb>

[Figure: the verb as the root node, with relations (rel) to A and B]

• Recurse on A and B

Enconversion

Long sequence of masters theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita

Many publications

System Architecture

[Figure: system architecture – NER, Stanford Dependency Parser, WSD and Clause Marker feed a Simplifier; each simple sentence goes through a Simple-Sentence Enconverter (Stanford Dependency Parser, XLE Parser, Feature Generation, Relation Generation, Attribute Generation); the resulting UNLs are combined by the Merger]

Complexity of handling long sentences

Problem: as the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with the XLE parser

This results in a fall in accuracy

Solution

• Break long sentences into smaller ones
  – A sentence can only be broken at the clausal level: Simplification
  – Once broken, the pieces need to be joined: Merging
• Reduce the reliance of the Enconverter on the XLE parser
  – The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example
  – Sentence: John went to play and Jane went to school
  – Simplified: John went to play. Jane went to school.
• The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL
• Example
  – Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  – Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases based on how the sentence was simplified; the Merger uses rules for each of the cases to merge them
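A toy merger for this two-clause case might look as follows (the triple representation and scope handling are simplified assumptions; the real merger's rule dispatch is richer):

# Hedged sketch: merge the UNL of a second simplified clause into the first
# by pushing it into scope :01 and linking the two entry verbs with 'and'.
def merge_two(unl1, unl2, entry1):
    """unl1/unl2: lists of (relation, arg1, arg2) triples; entry1: main verb of clause 1."""
    merged = list(unl1)
    merged.append(("and", entry1, ":01"))
    # re-label the second clause's relations into scope :01
    merged.extend((rel + ":01", a1, a2) for rel, a1, a2 in unl2)
    return merged

unl1 = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
unl2 = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
for triple in merge_two(unl1, unl2, "go"):
    print(triple)
# ... ('and', 'go', ':01'), ('agt:01', 'go.@entry', 'Mary'), ('gol:01', 'go.@entry', 'school')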

NLP tools and resources for UNL generation

Tools
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing
• Stems of words
• WSD
• Nouns originating from verbs
• Feature generation
  – Noun features
  – Verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types
• Types covered
  – Person
  – Place
  – Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river

Syntactic Processing: Stanford Dependency Parser (1/2)

• The Stanford Dependency Parser parses the sentence and produces
  – POS tags of the words in the sentence
  – the dependency parse of the sentence

Syntactic Processing: Stanford Dependency Parser (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1)
aux(eating-3, was-2)
det(apple-5, an-4)
dobj(eating-3, apple-5)
prep(eating-3, with-6)
det(fork-8, a-7)
pobj(with-6, fork-8)
prep(eating-3, on-9)
det(bank-11, the-10)
pobj(on-9, bank-11)
prep(bank-11, of-12)
det(river-14, the-13)
pobj(of-12, river-14)

Syntactic Processing: XLE Parser

• The XLE parser generates dependency relations and some other important information

Input: Will-Smith was eating an apple with a fork on the bank of the river

Dependency relations:
OBJ(eat:8, apple:16)
SUBJ(eat:8, will-smith:0)
ADJUNCT(apple:16, on:20)
OBJ(on:20, bank:22)
ADJUNCT(apple:16, with:17)
OBJ(with:17, fork:19)
ADJUNCT(bank:22, of:23)
OBJ(of:23, river:25)

Important information:
PASSIVE(eat:8, -)
PROG(eat:8, +)
TENSE(eat:8, past)
VTYPE(eat:8, main)

SEMANTIC PROCESSING

Semantic Processing: Finding Stems

• WordNet is used for finding the stems of the words

Word     Stem
was      be
eating   eat
apple    apple
fork     fork
bank     bank
river    river

Input: Will-Smith was eating an apple with a fork on the bank of the river
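With NLTK's WordNet interface this is essentially a morphy lookup; a minimal sketch (the wrapper is illustrative):

# Hedged sketch: WordNet-based stemming as described above, via NLTK's morphy.
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def wordnet_stem(word, pos=None):
    """Return the WordNet base form, falling back to the word itself."""
    return wn.morphy(word, pos) or word

for w in ["was", "eating", "apple", "fork", "bank", "river"]:
    print(w, "->", wordnet_stem(w))   # e.g. was -> be, eating -> eat (per the table above)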

Semantic Processing: WSD

Word         POS            Synset-id   Sense-id
Will-Smith   Proper noun    -           -
was          Verb           02610777    be%2:42:03
eating       Verb           01170802    eat%2:34:00
an           Determiner     -           -
apple        Noun           07755101    apple%1:13:00
with         Preposition    -           -
a            Determiner     -           -
fork         Noun           03388794    fork%1:06:00
on           Preposition    -           -
the          Determiner     -           -
bank         Noun           09236472    bank%1:17:01
of           Preposition    -           -
the          Determiner     -           -
river        Noun           09434308    river%1:17:00

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word         Universal Word
Will-Smith   Will-Smith(iof>PERSON)
eating       eat(icl>consume, equ>consumption)
apple        apple(icl>edible fruit>thing)
fork         fork(icl>cutlery>thing)
bank         bank(icl>slope>thing)
river        river(icl>stream>thing)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs
• For the frame "noun1 of noun2", where noun1 originated from verb1, the relation obj(verb1, noun2) should be generated instead of a relation between noun1 and noun2

Algorithm
• Traverse the hypernym path of the noun to its root
• If any word on the path is "action", then the noun originated from a verb

Examples
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
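A minimal sketch of the hypernym-path test using NLTK's WordNet (the helper name and the act/action keyword set are illustrative assumptions):

# Hedged sketch: decide whether a noun originates from a verb by walking its
# WordNet hypernym paths and looking for the "act"/"action" region of the
# noun hierarchy, as the algorithm above describes.
from nltk.corpus import wordnet as wn

def originates_from_verb(noun):
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            names = {lemma for s in path for lemma in s.lemma_names()}
            if "action" in names or "act" in names:
                return True
    return False

print(originates_from_verb("removal"))   # expected True
print(originates_from_verb("fork"))      # expected False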

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun and verb features

Relation        Word   Property
agt(uw1, uw2)   uw1    Volitional verb
met(uw1, uw2)   uw2    Abstract thing
dur(uw1, uw2)   uw2    Time period
ins(uw1, uw2)   uw2    Concrete thing

Semantic Processing: Noun Features

Algorithm
• Add word.NE_type to word.features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun         Feature
Will-Smith   PERSON, ANIMATE
bank         PLACE
fork         CONCRETE

Types of verbs

• Verb features are based on the types of verbs
• There are three types
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily, or that just happen without any performer
  – Be: verbs which tell about the existence of something or about some fact
• We classify these three into "do" and "not-do"

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

19

Architecture of KBMT

20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology: implications of word order

Due to its SVO nature, English has
  – preposition + noun (in the house)
  – genitive + noun (Tom's house) or noun + genitive (the house of Tom)
  – auxiliary + verb (will come)
  – noun + relative clause (the cat that ate the rat)
  – adjective + standard of comparison (better than Tom)

31

Typology: motion verbs (1/3)

• Motion is expressed differently in different languages, and the differences turn out to be highly significant.

• There are two large types:
  – verb-framed and
  – satellite-framed languages

32

Typology: motion verbs (2/3)

• In satellite-framed languages like English, the motion verb typically also expresses manner or cause:
  – The bottle floated out of the cave (Manner)
  – The napkin blew off the table (Cause)
  – The water rippled down the stairs (Manner)

33

Typology: motion verbs (3/3)

• Verb-framed languages express manner and cause not in the verb but in a more peripheral element.

• Spanish:
  – La botella salió de la cueva flotando
  – 'The bottle exited the cave floatingly'

• Japanese:
  – Bin-ga dookutsu-kara nagare-te de-ta
  – bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT = EnConversion + Deconversion

[Diagram: analysis of source languages (English, French, Hindi, Chinese, ...) into the interlingua (UNL), and generation from UNL into the target languages]

Challenges of interlingua generation

The mother of NLP problems: extract the meaning of a sentence.

Almost all NLP problems are sub-problems: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on.

UNL a United Nations project

• Started in 1996
• A 10-year programme
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups:
  – UNL_French (GETA-CLIPS, IMAG)
  – UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  – UNL_Italian (Univ. of Pisa)
  – UNL_Portuguese (Univ. of São Paulo, Brazil)
  – UNL_Russian (Institute of Linguistics, Moscow)
  – UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language-independent meaning representation

[Diagram: UNL at the hub, with English, Russian, Japanese, Hindi, Spanish, Marathi and other languages connected to it]

39

UNL represents knowledge: "John eats rice with a spoon"

[Diagram: the UNL graph for the sentence, built from universal words linked by semantic relations and decorated with attributes; UNL provides a repository of 42 semantic relations and 84 attribute labels]

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)

[UNL]
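A UNL expression like the one above is just a set of labelled relations, so it can be held programmatically as tuples. The sketch below is illustrative only (it is not part of any UNL toolkit) and assumes the reconstructed rel:scope(uw1, uw2) notation used above.

    import re

    # A UNL relation line: rel[:scope](uw1, uw2),
    # e.g. agt(claim.@entry.@past, Deepa) or agt:01(compose.@entry, she)
    UNL_LINE = re.compile(
        r"(?P<rel>[a-z]{3})(?::(?P<scope>\d{2}))?"
        r"\((?P<arg1>[^,]+),\s*(?P<arg2>[^)]+)\)")

    def parse_unl(lines):
        """Return (relation, scope, first UW, second UW) tuples."""
        triples = []
        for line in lines:
            m = UNL_LINE.match(line.strip())
            if m:
                triples.append((m["rel"], m["scope"] or "00",
                                m["arg1"].strip(), m["arg2"].strip()))
        return triples

    example = [
        "agt(claim.@entry.@past, Deepa)",
        "obj(claim.@entry.@past, :01)",
        "agt:01(compose.@past.@entry.@complete, she)",
        "obj:01(compose.@past.@entry.@complete, poem.@indef)",
    ]
    for t in parse_unl(example):
        print(t)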

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V, VOA) <E,0,0>

[mail] "mail(icl>message)" (N, PHSCL, INANI) <E,0,0>

[minister] "minister(icl>person)" (N, ANIMT, PHSCL, PRSN) <E,0,0>

Headword | Universal Word | Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] "he" (PRON, SUB, SING, 3RD) <E,0,0>

[the] "the" (ART, THE) <E,0,0>

[to] "to" (PRE, TO) <E,0,0>

Headword | Universal Word | Attributes

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve centre is the verb.
• English sentences have the shape
  – A <verb> B, or
  – A <verb>

The verb becomes the hub of the graph, with A and B attached to it by semantic relations; recurse on A and B. (A sketch of this recursion follows.)
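The following toy sketch is illustrative only: the real enconverter is rule-based and assigns scope labels to embedded clauses, both of which are omitted here. It shows the recursion itself: the verb is the hub, its arguments hang off it, and any argument that is itself a clause is processed the same way.

    def enconvert(node, relations):
        """node is a word (str) or a clause tuple (verb, A, B); B may be None.
        Appends (relation, head, dependent) triples and returns the node's head.
        The default agt/obj labels stand in for the rule-chosen relations."""
        if isinstance(node, str):
            return node
        verb, a, b = node
        relations.append(("agt", verb, enconvert(a, relations)))
        if b is not None:
            relations.append(("obj", verb, enconvert(b, relations)))
        return verb

    rels = []
    enconvert(("eat", "John", "rice"), rels)
    print(rels)   # [('agt', 'eat', 'John'), ('obj', 'eat', 'rice')]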

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

[Pipeline diagram: NER, the Stanford Dependency Parser, WSD and a Clause Marker feed a Simplifier; each simple sentence goes through a Simple Sentence Enconverter (Stanford Dependency Parser, XLE Parser, Feature Generation, Relation Generation, Attribute Generation); a Merger combines the per-clause UNLs]

Complexity of handling long sentences

Problem: as the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
  – A sentence can only be broken at the clausal level: Simplification
  – Once broken, the pieces need to be joined again: Merging

• Reduce the reliance of the Enconverter on the XLE parser
  – The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence.

• Example
  – Sentence: John went to play and Jane went to school.
  – Simplified: John went to play. Jane went to school.

• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser. (A minimal sketch follows.)
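A minimal sketch of the splitting step, assuming the clause boundaries have already been found (in the real system they come from the parser's clause markings); the span indices are hypothetical inputs:

    def simplify(tokens, clause_spans):
        """clause_spans: list of (start, end) token indices, one span per clause."""
        return [" ".join(tokens[start:end]) for start, end in clause_spans]

    sentence = "John went to play and Jane went to school".split()
    print(simplify(sentence, [(0, 4), (5, 10)]))
    # ['John went to play', 'Jane went to school']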

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL.

• Example
  – Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  – Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)

• There are several cases, based on how the sentence was simplified; the Merger uses rules for each case to merge the pieces. (A sketch of the "and" case follows.)
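A minimal sketch of the "and" case of merging, using the example above; the scope label :01 and the relation renaming follow the output shown, while the entry word ("go") is passed in explicitly for simplicity:

    def merge_and(unl1, unl2, entry_word="go", scope=":01"):
        merged = list(unl1)                          # first clause stays in the main scope
        merged.append(("and", entry_word, scope))    # link the two clauses
        for rel, head, dep in unl2:                  # push clause 2 into scope :01
            merged.append((rel + scope, head, dep))  # agt -> agt:01, gol -> gol:01
        return merged

    unl_a = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
    unl_b = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
    for triple in merge_and(unl_a, unl_b):
        print(triple)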

NLP tools and resources for UNL generation

Tools
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing
• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation (noun features, verb features)

SYNTACTIC PROCESSING

Syntactic Processing: NER

• NER tags the named entities in the sentence with their types.

• Types covered:
  – Person
  – Place
  – Organization

• Input: Will Smith was eating an apple with a fork on the bank of the river.

• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river.

Syntactic Processing: Stanford Dependency Parser (1/2)

• The Stanford Dependency Parser parses the sentence and produces
  – POS tags of the words in the sentence
  – the dependency parse of the sentence

Syntactic Processing: Stanford Dependency Parser (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags: Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations: nsubj(eating-3, Will-Smith-1), aux(eating-3, was-2), det(apple-5, an-4), dobj(eating-3, apple-5), prep(eating-3, with-6), det(fork-8, a-7), pobj(with-6, fork-8), prep(eating-3, on-9), det(bank-11, the-10), pobj(on-9, bank-11), prep(bank-11, of-12), det(river-14, the-13), pobj(of-12, river-14)

Syntactic Processing: XLE Parser

• The XLE parser generates dependency relations and some other important information.

Input: Will-Smith was eating an apple with a fork on the bank of the river

Dependency relations: OBJ(eat:8, apple:16), SUBJ(eat:8, will-smith:0), ADJUNCT(apple:16, on:20), OBJ(on:20, bank:22), ADJUNCT(apple:16, with:17), OBJ(with:17, fork:19), ADJUNCT(bank:22, of:23), OBJ(of:23, river:25)

Important information: PASSIVE(eat:8, -), PROG(eat:8, +), TENSE(eat:8, past), VTYPE(eat:8, main)

SEMANTIC PROCESSING

Semantic Processing: Finding stems

• WordNet is used for finding the stems of the words.

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word → Stem: was → be, eating → eat, apple → apple, fork → fork, bank → bank, river → river

(A minimal lemmatization sketch follows.)
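A minimal sketch of this step with NLTK's WordNet lemmatizer (assumes NLTK and its wordnet data are installed; the POS tags would come from the parser in the real pipeline):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = [("was", "v"), ("eating", "v"), ("apple", "n"),
              ("fork", "n"), ("bank", "n"), ("river", "n")]
    for word, pos in tagged:
        print(word, "->", lemmatizer.lemmatize(word, pos))
    # was -> be, eating -> eat, apple -> apple, ...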

Semantic Processing: WSD

Word | Synset-id | Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++).

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word → Universal Word
Will-Smith → Will-Smith(iof>PERSON)
eating → eat(icl>consume, equ>consumption)
apple → apple(icl>edible fruit>thing)
fork → fork(icl>cutlery>thing)
bank → bank(icl>slope>thing)
river → river(icl>stream>thing)

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs.

• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2.

Algorithm: traverse the hypernym path of the noun to its root; if any word on the path is "action"/"act", the noun originated from a verb.

Examples:
  removal of garbage = removing garbage
  collection of stamps = collecting stamps
  wastage of food = wasting food

(A sketch of the hypernym test follows.)
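A minimal sketch of the test over NLTK's WordNet interface; the synset passed in would come from the WSD step, and the "act"/"action" keyword check is an assumption about how the hypernym test is realised:

    from nltk.corpus import wordnet as wn

    def is_deverbal(noun_synset):
        """True if some synset on a hypernym path of the noun is an act/action node."""
        for path in noun_synset.hypernym_paths():
            for syn in path:
                if "act" in syn.lemma_names() or "action" in syn.lemma_names():
                    return True
        return False

    print(is_deverbal(wn.synset("removal.n.01")))   # expected: True  (removal of garbage)
    print(is_deverbal(wn.synset("apple.n.01")))     # expected: False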

Features of UWs

• Each UNL relation requires certain properties of its UWs to be satisfied.

• We capture these properties as features.

• We classify the features into noun features and verb features.

Relation | Word | Required property
agt(uw1, uw2) | uw1 | Volitional verb
met(uw1, uw2) | uw2 | Abstract thing
dur(uw1, uw2) | uw2 | Time period
ins(uw1, uw2) | uw2 | Concrete thing

Semantic Processing: Noun Features

Algorithm:
• Add the word's NE type to the word's features.
• Traverse the hypernym path of the noun to its root.
• If any word on the path matches a keyword, the corresponding feature is generated.

Noun | Features
Will-Smith | PERSON, ANIMATE
bank | PLACE
fork | CONCRETE

(A sketch of this feature assignment follows.)
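A minimal sketch of the feature assignment; the keyword-to-feature table below is illustrative, not the system's actual list:

    from nltk.corpus import wordnet as wn

    KEYWORDS = {"person": "ANIMATE", "animal": "ANIMATE", "location": "PLACE",
                "artifact": "CONCRETE", "abstraction": "ABSTRACT"}

    def noun_features(synset, ne_type=None):
        feats = set()
        if ne_type:                    # e.g. PERSON, from the NER step
            feats.add(ne_type)
        for path in synset.hypernym_paths():
            for syn in path:
                for lemma in syn.lemma_names():
                    if lemma in KEYWORDS:
                        feats.add(KEYWORDS[lemma])
        return feats

    print(noun_features(wn.synset("fork.n.01")))              # {'CONCRETE'}
    print(noun_features(wn.synset("person.n.01"), "PERSON"))  # contains ANIMATE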

Types of verbs

• Verb features are based on the types of verbs.

• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that happen involuntarily, or without any performer
  – Be: verbs that tell about the existence of something or about some fact

• We collapse these three into "do" and "not-do".

Semantic Processing: Verb Features

• The sentence frames of verbs are inspected to find whether a verb is of the "do" type or not.

• "Do" verbs are volitional verbs, performed voluntarily.

• Heuristic: if a frame is "Something/somebody …s something/somebody", then the verb is of the "do" type.

• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do". (A sketch over WordNet verb frames follows the table.)

Word | Type | Feature
was | not do | -
eating | do | do
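A minimal sketch over WordNet's verb frames via NLTK; the assumption (to be checked against the frame list) is that frame ids 8-11 correspond to the "Somebody/Something ----s somebody/something" patterns the heuristic looks for:

    from nltk.corpus import wordnet as wn

    TRANSITIVE_FRAMES = {8, 9, 10, 11}   # assumed ids of the subject-object frames

    def is_do_verb(verb_synset):
        frame_ids = {fid for lemma in verb_synset.lemmas()
                     for fid in lemma.frame_ids()}
        return bool(frame_ids & TRANSITIVE_FRAMES)

    print(is_do_verb(wn.synset("eat.v.01")))   # expected: True  -> "do"
    print(is_do_verb(wn.synset("be.v.01")))    # expected: False -> "not-do"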

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately.

• Similar relations are handled in clusters.

• Rule-based approach.

• Rules are applied on
  – Stanford Dependency relations
  – noun features
  – verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active → agt(eat, Will-Smith)
dobj(eating, apple); eat is active → obj(eat, apple)
prep(eating, with); pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
prep(eating, on); pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
prep(bank, of); pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)

(A sketch of these rules follows.)
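A minimal sketch of the rule style; the rules below mirror the rows above and are illustrative only (prepositions are shown collapsed, e.g. prep_with, for brevity):

    def generate_relations(deps, feats, verb_info):
        """deps: (dep relation, head, dependent); feats: word -> feature set;
        verb_info: verb -> (type, voice).  Returns UNL relation triples."""
        rels = []
        for rel, head, dep in deps:
            if rel == "nsubj" and "ANIMATE" in feats.get(dep, set()) \
                    and verb_info.get(head) == ("do", "active"):
                rels.append(("agt", head, dep))
            elif rel == "dobj" and verb_info.get(head, ("", ""))[1] == "active":
                rels.append(("obj", head, dep))
            elif rel == "prep_with" and "CONCRETE" in feats.get(dep, set()):
                rels.append(("ins", head, dep))
            elif rel == "prep_on" and "PLACE" in feats.get(dep, set()):
                rels.append(("plc", head, dep))
        return rels

    deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
            ("prep_with", "eat", "fork"), ("prep_on", "eat", "bank")]
    feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
    verb_info = {"eat": ("do", "active")}
    print(generate_relations(deps, feats, verb_info))
    # [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'),
    #  ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank')]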

Attribute Generation (1/2)

• Attributes are handled in clusters.

• Attributes are generated by rules.

• Rules are applied on
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Attributes generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past
det(apple, an) → apple.@indef
det(fork, a) → fork.@indef
det(bank, the) → bank.@def
det(river, the) → river.@def

(A sketch of these rules follows.)
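A minimal sketch of the attribute rules for this example (determiners plus the XLE tense/aspect facts; the real rule set is of course far larger):

    def generate_attributes(determiners, xle_facts, main_verb):
        attrs = {main_verb: ["@entry"]}
        if xle_facts.get(("PROG", main_verb)) == "+":
            attrs[main_verb].append("@progress")
        if xle_facts.get(("TENSE", main_verb)) == "past":
            attrs[main_verb].append("@past")
        for noun, det in determiners:
            attrs.setdefault(noun, []).append("@def" if det == "the" else "@indef")
        return attrs

    dets = [("apple", "an"), ("fork", "a"), ("bank", "the"), ("river", "the")]
    xle = {("PROG", "eat"): "+", ("TENSE", "eat"): "past"}
    print(generate_attributes(dets, xle, "eat"))
    # {'eat': ['@entry', '@progress', '@past'], 'apple': ['@indef'],
    #  'fork': ['@indef'], 'bank': ['@def'], 'river': ['@def']}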

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  – 7,000 sentences from the EOLSS corpus

• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser

• Scenario 3 tells us the impact of the Simplifier and Merger.

• Scenario 4 tells us the impact of the XLE parser.

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the number of attributes matched between UW_gen and UW_gold
• count(UW_x) = the number of attributes in UW_x

(A sketch of the relation-level scoring follows.)
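A minimal sketch of the relation-level scoring implied by these definitions: a generated triple counts only if the relation label and both UWs match a gold triple, and precision/recall/F-score follow from the counts.

    def relation_prf(generated, gold):
        generated, gold = set(generated), set(gold)
        correct = len(generated & gold)
        p = correct / len(generated) if generated else 0.0
        r = correct / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    gen  = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("mod", "bank", "river")}
    gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
    print(relation_prf(gen, gold))   # roughly (0.667, 0.667, 0.667)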

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Bar chart: attribute accuracy for Scenarios 1-4; y-axis from 0 to 0.5]

Results: Time taken

• Time taken by the systems to process the 7,000 EOLSS sentences

[Bar chart: time taken for Scenarios 1-4; y-axis from 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart: false positives (wrong relations generated) for obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Missed Relations

[Bar chart: false negatives (missed relations) for obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection

• "James loves eating red water-melon, apples peeled nicely and bananas."

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation

• "George the president of the football club, James the coach of the club and the players went to watch their rivals' game."

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt vs. aoj: non-uniformity in the corpus
  – gol vs. obj: multiple relation-generation possibilities

"The former, however, represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms."

Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

• Depends on the UNL relation and the properties of the nodes.
• Case gets transferred from head to modifiers.

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic independence: the semantic properties of the relata
  – Context independence: the rest of the expression

• The relative word order of various relations sharing a relatum does not depend on
  – Local ordering: the rest of the expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

Pairwise relative-order table (row relation vs. column relation; L = row's relatum to the left, R = to the right):
       agt  aoj  obj  ins
  agt   -    L    L    L
  aoj        -    L    L
  obj             -    R
  ins                  -

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

• Divide a node's relations into
  Before_set = {obj, pur, agt}
  After_set = { }

• Topologically sort each group: pur, agt, obj → this, you, farmer

• Final order: this you farmer contact

[Diagram: the contact node with its pur (this), agt (you), obj (farmer) and plc (region: Manchar taluka, Khatav) arguments]

Syntax Planning: Algorithm (worked trace; columns: Stack, Before Current, After Current, Output)

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

[Diagram: the full UNL graph for the example, with contact as the entry node and its agt (you), obj (farmer), pur (this) and plc (region, named Manchar taluka / Khatav) arguments]

(A minimal code sketch of the linearization strategy follows.)
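A minimal sketch of the ordering strategy; the side and priority tables below are illustrative stand-ins for the system's relative-order table (agt, obj and pur placed before the verb, as in the Hindi example):

    SIDE = {"agt": "L", "obj": "L", "pur": "L", "plc": "L"}   # L = before the head
    PRIORITY = {"pur": 0, "agt": 1, "obj": 2, "plc": 3}       # order inside a group

    def linearize(head, tree):
        """tree: head -> list of (relation, child).  Returns a flat word list."""
        children = sorted(tree.get(head, []), key=lambda rc: PRIORITY.get(rc[0], 99))
        before = [w for rel, child in children if SIDE.get(rel, "R") == "L"
                  for w in linearize(child, tree)]
        after = [w for rel, child in children if SIDE.get(rel, "R") == "R"
                 for w in linearize(child, tree)]
        return before + [head] + after

    tree = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
    print(linearize("contact", tree))   # ['this', 'you', 'farmer', 'contact']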

All Together (UNL → Hindi)

[Table: the module-by-module outputs (lexeme selection, case identification, morphology generation, function word insertion, syntax linearization), repeating the step-through of the Deconverter shown above on the same example]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency: the fluency of the given translation is
  (4) Perfect: good grammar
  (3) Fair: easy to understand, but flawed grammar
  (2) Acceptable: broken, understandable with effort
  (1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation
  (4) All: no loss of meaning
  (3) Most: most of the meaning is conveyed
  (2) Some: some of the meaning is conveyed
  (1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                         BLEU   Fluency   Adequacy
Geometric average        0.34    2.54      2.84
Arithmetic average       0.41    2.71      3.00
Standard deviation       0.25    0.89      0.89
Pearson corr. (BLEU)     1.00    0.59      0.50
Pearson corr. (Fluency)  0.59    1.00      0.68

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each fluency level (1-4), broken down by adequacy level (1-4)]

• Good correlation between fluency and BLEU
• Strong correlation between fluency and adequacy
• Can do large-scale evaluation using fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Vauquois diagram: levels of transfer, from direct translation through syntactic and semantic transfer up to interlingua, as shown earlier for the kinds of MT systems]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; the domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal tasks
  H1 POS tagging & chunking engine
  H2 Morph analyser engine
  H3 Generator engine
  H4 Lexical disambiguation engine
  H5 Named entity engine
  H6 Dictionary standards
  H7 Corpora annotation standards
  H8 Evaluation of output (comprehensibility)
  H9 Testing & integration

Vertical tasks for each language
  V1 POS tagger & chunker
  V2 Morph analyzer
  V3 Generator
  V4 Named entity recognizer
  V5 Bilingual dictionary (bidirectional)
  V6 Transfer grammar
  V7 Annotated corpus
  V8 Evaluation
  V9 Co-ordination

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules.

Regular internal subjective evaluation is carried out at IIIT Hyderabad.

Third-party evaluation is done by C-DAC Pune.

Evaluation scale (grade: description)
  0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
  1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language with lots of words borrowed from your language; you understand those words but nothing more)
  2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; you can still make sense of what is being said)
  3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
  4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%

• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
  च vaach 'read' → चण vaachaNe 'reading'
  उतर utara 'climb down' → उतरण utaraN 'downward slope'

Adjectives (Verb → Adjective)
  च chav 'bite' → चणर chaavaNaara 'one who bites'
  खा khaa 'eat' → खाललल khallele 'something that is eaten'

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb)
  पळ paL 'run' → पळताना paLataanaa 'while running'
  बस bas 'sit' → बसना basun 'after sitting'

Kridanta Types

Kridanta type | Example (with gloss) | Aspect
"ण (Ne)"-Kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] | Perfective
"ल (laa)"-Kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] | Perfective
"त (Taanaa)"-Kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] | Durative
"लल (Lela)"-Kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] | Perfective
"ऊ (Un)"-Kridanta | pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] | Completive
"णर (Nara)"-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] | Stative
"(ve)"-Kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] | Inceptive
"त (taa)"-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi 'lelaa' form.

Morphological Processing of Kridanta forms (cont.)

[Figure: Morphotactics FSM for Kridanta processing]

(A minimal sketch of the idea follows.)
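A minimal sketch of the morphotactics idea behind the FSM: peel a participial suffix off a verb form and report the kridanta type. The suffixes are given in transliteration and taken loosely from the table above; the real FSM works on Devanagari stems and handles sandhi.

    KRIDANTA_SUFFIXES = [                 # longest suffixes first
        ("lyaavar", "laa-Kridanta"), ("taanaa", "Taanaa-Kridanta"),
        ("NaaRyaa", "Nara-Kridanta"), ("lele", "Lela-Kridanta"),
        ("Nyaa", "Ne-Kridanta"), ("aave", "ve-Kridanta"),
        ("taa", "taa-Kridanta"), ("un", "Un-Kridanta"),
    ]

    def analyse_kridanta(word_form):
        for suffix, kridanta_type in KRIDANTA_SUFFIXES:
            if word_form.endswith(suffix):
                return word_form[:-len(suffix)], kridanta_type
        return word_form, None

    print(analyse_kridanta("vaachtaanaa"))   # ('vaach', 'Taanaa-Kridanta')
    print(analyse_kridanta("vaachun"))       # ('vaach', 'Un-Kridanta')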

Accuracy of Kridanta Processing: Direct Evaluation

[Bar chart: precision and recall of kridanta processing per kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, Va), all in the range 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• A relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools


20

Digression language typology

22

Language and dialects

bull There are 6909 living languages (2009 census)

bull Dialects far outnumber the languagesbull Language varieties are called dialects

ndash if they have no standard or codified formndash if the speakers of the given language do not have a

state of their ownndash if they are rarely or never used in writing (outside

reported speech)ndash if they lack prestige with respect to some other often

standardised variety

23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination
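The horizontal engines and per-language vertical modules chain into one analysis-transfer-generation pipeline per language pair. The sketch below only shows that wiring; the function signature and module granularity are assumptions for illustration, not the actual Sampark code or APIs.

def translate(sentence, src_modules, transfer_modules, tgt_modules):
    # Source-side analysis (POS tagging and chunking, morph analysis, NER, ...),
    # then transfer (bilingual dictionary lookup, transfer grammar),
    # then target-side generation (agreement, word generator).
    analysis = sentence
    for module in src_modules:
        analysis = module(analysis)
    for module in transfer_modules:
        analysis = module(analysis)
    for module in tgt_modules:
        analysis = module(analysis)
    return analysis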

SSF Example-1

SSF Example-2
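The SSF example slides are figures whose content did not survive text extraction. As a rough, assumed illustration of the general shape of the Shakti Standard Format (the tokens, tags, and feature values below are invented and may differ from the slides): a sentence is a sequence of numbered chunks and tokens, with each line carrying an address, the token, its category, and a feature structure.

<Sentence id="1">
1       ((      NP
1.1     raama   NNP     <fs af='raama,n,m,sg,3,d,0,0'>
        ))
2       ((      VGF
2.1     gayaa   VM      <fs af='jaa,v,m,sg,3,,yaa,yaa'>
        ))
</Sentence>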

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)

1: Some parts make sense but it is not comprehensible overall (e.g., listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)

2: Comprehensible but has quite a few errors (e.g., someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)

3: Comprehensible, occasional errors (e.g., someone speaking Hindi getting all its genders wrong)

4: Perfect (e.g., someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

[Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, and Word Generator modules (scale 0 to 1.2).]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger, and chunker give more than 90% precision, but the transfer, WSD, and generator modules are below 80%, which degrades MT quality (see the sketch below)
  – Also, morph disambiguation, parsing, transfer grammar, and FW disambiguation modules are required to improve accuracy
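A small illustration of the module-level check behind this analysis is given below; the precision and recall values are placeholders standing in for the chart, not the project's reported figures.

# Placeholder (precision, recall) per module; illustrative values only.
modules = {
    "morph":      (0.95, 0.93),
    "pos_tagger": (0.94, 0.94),
    "chunker":    (0.92, 0.90),
    "wsd":        (0.75, 0.72),
    "transfer":   (0.78, 0.74),
    "generator":  (0.79, 0.80),
}

def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

for name, (p, r) in modules.items():
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1(p, r):.2f}")

# Flag modules whose precision falls below the 80% threshold mentioned above.
print("Below 80% precision:", [m for m, (p, r) in modules.items() if p < 0.80])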

Important challenge of M-H Translation: morphology processing (kridanta)

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories

Nouns (Verb -> Noun):
वाच vaach (read) -> वाचणे vaachaNe (reading)
उतर utara (climb down) -> उतरण utaraN (downward slope)

Adjectives (Verb -> Adjective):
चाव chaav (bite) -> चावणारा chaavaNaara (one who bites)
खा khaa (eat) -> खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb -> Adverb):
पळ paL (run) -> पळताना paLataanaa (while running)
बस bas (sit) -> बसून basun (after sitting)
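Derivations like these can be pictured as stem plus kridanta suffix. The toy rule table below is invented for the examples above and is far simpler than real Marathi morphotactics; it only illustrates the idea of suffix-driven derivation with a POS change.

# Toy kridanta derivation by suffixation; illustrative only.
RULES = {
    "Ne":     ("Ne",     "noun",      "act of V-ing"),
    "Nara":   ("Naaraa", "adjective", "one who V-s"),
    "taanaa": ("taanaa", "adverb",    "while V-ing"),
    "un":     ("un",     "adverb",    "after V-ing"),
}

def derive(stem, kridanta_type):
    suffix, pos, gloss = RULES[kridanta_type]
    return stem + suffix, pos, gloss

print(derive("vaach", "Ne"))     # ('vaachNe', 'noun', 'act of V-ing')
print(derive("paL", "taanaa"))   # ('paLtaanaa', 'adverb', 'while V-ing')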

Kridanta Types (Kridanta Type | Example | Aspect)

Ne-Kridanta: vaachNyaasaaThee pustak de (Give me a book for reading) [for reading - book - give]. Aspect: Perfective

laa-Kridanta: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article - after reading - will tell]. Aspect: Perfective

Taanaa-Kridanta: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book - while reading - it - in mind - came]. Aspect: Durative

Lela-Kridanta: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday - read - book - give]. Aspect: Perfective

Un-Kridanta: pustak vaachun parat kar (Return the book after reading it) [book - after reading - back - do]. Aspect: Completive

Nara-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books - to the one who reads - knowledge - gets]. Aspect: Stative

ve-Kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) [this book - everyone - should read]. Aspect: Inceptive

taa-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he - book - while reading - to sleep - went]. Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
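The FSM figure itself did not survive extraction. The sketch below is an assumed, simplified picture of such morphotactics (verb root, then a kridanta suffix, then optional case marking); the states and transition inventory are illustrative, not the FSM of the cited paper.

# Illustrative finite-state morphotactics for kridanta processing.
TRANSITIONS = {
    ("STEM",      "verb_root"):       "ROOT",
    ("ROOT",      "kridanta_suffix"): "KRIDANTA",   # -Ne, -taanaa, -lele, -un, ...
    ("KRIDANTA",  "case_suffix"):     "INFLECTED",  # oblique/dative marking
    ("KRIDANTA",  "end"):             "ACCEPT",
    ("INFLECTED", "end"):             "ACCEPT",
}

def accepts(symbols):
    # Check whether a sequence of morph-class symbols is a valid kridanta form.
    state = "STEM"
    for sym in symbols + ["end"]:
        state = TRANSITIONS.get((state, sym))
        if state is None:
            return False
    return state == "ACCEPT"

print(accepts(["verb_root", "kridanta_suffix"]))                 # True
print(accepts(["verb_root", "kridanta_suffix", "case_suffix"]))  # True
print(accepts(["verb_root", "case_suffix"]))                     # False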

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon, and Va kridanta types; values range roughly from 0.86 to 0.98.]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources, and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


23

ldquoLinguisticrdquo interlingua

bull Three notable attempts at creating interlinguandash ldquoInterlinguardquo by IALA (International Auxiliary

Language Association)ndash ldquoEsperantordquondash ldquoIdordquo

24

The Lords Prayer in Esperanto Interlingua version Latin version English version

(traditional)

1 Patro Nia kiu estas en la ĉielovia nomo estu sanktigita2 Venu via regno3 plenumiĝu via volokiel en la ĉielo tiel ankaŭ sur la tero4 Nian panon ĉiutagan donu al ni hodiaŭ5 Kaj pardonu al ni niajn ŝuldojnkiel ankaŭ ni pardonas al niaj ŝuldantoj6 Kaj ne konduku nin en tentonsed liberigu nin de la malbonoAmen

1 Patre nostre qui es in le celosque tu nomine sia sanctificate2 que tu regno veni3 que tu voluntate sia facitecomo in le celo etiam super le terra4 Da nos hodie nostre pan quotidian5 e pardona a nos nostre debitascomo etiam nos los pardona a nostre debitores6 E non induce nos in tentationsed libera nos del malAmen

1 Pater noster qui es in caeliglissanctificetur nomen tuum2 Adveniat regnum tuum3 Fiat voluntas tuasicut in caeliglo et in terraPanem nostrum 4 quotidianum da nobis hodie5 et dimitte nobis debita nostrasicut et nos dimittimus debitoribus nostris6 Et ne nos inducas in tentationemsed libera nos a maloAmen

1 Our Father who art in heavenhallowed be thy name2 thy kingdom come3 thy will be doneon earth as it is in heaven4 Give us this day our daily bread5 and forgive us our debtsas we have forgiven our debtors6 And lead us not into temptationbut deliver us from evilAmen

25

ldquoInterlinguardquo and ldquoEsperantordquo

bull Control languages for ldquoInterlinguardquo French Italian Spanish Portuguese English German and Russian

bull Natural words in ldquointerlinguardquobull ldquomanufacturedrdquo words in Esperanto using

heavy agglutinationbull Esperanto word for hospitalldquo

malmiddotsanmiddotulmiddotejmiddoto mal (opposite) san (health) ul (person) ej (place) o (noun)

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied

• We capture these properties as features

• We classify the features into noun features and verb features

  Relation        Word   Property
  agt(uw1, uw2)   uw1    Volitional verb
  met(uw1, uw2)   uw2    Abstract thing
  dur(uw1, uw2)   uw2    Time period
  ins(uw1, uw2)   uw2    Concrete thing

Semantic Processing Noun Features

Algorithm:
  – Add the word's NE type to the word's features
  – Traverse the hypernym path of the noun up to its root
  – If any word on the path matches a keyword, the corresponding feature is generated

  Noun         Features
  Will-Smith   PERSON, ANIMATE
  bank         PLACE
  fork         CONCRETE
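A sketch of the feature generator under the same hypernym-path idea; the keyword-to-feature map below is illustrative (the slide does not list the actual keywords), and mapping the PERSON NE type to ANIMATE mirrors the Will-Smith row above:

```python
from nltk.corpus import wordnet as wn

# Illustrative keyword list; the real system's keywords are not given here.
KEYWORD_FEATURES = {"person": "ANIMATE", "animal": "ANIMATE",
                    "location": "PLACE", "artifact": "CONCRETE",
                    "physical_entity": "CONCRETE", "abstraction": "ABSTRACT"}

def noun_features(noun: str, ne_type: str = "") -> set:
    """Features = NE type (if any) + keywords met on the hypernym paths."""
    feats = {ne_type} if ne_type else set()
    if ne_type == "PERSON":
        feats.add("ANIMATE")
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for lemma_name in (l for hyp in path for l in hyp.lemma_names()):
                if lemma_name in KEYWORD_FEATURES:
                    feats.add(KEYWORD_FEATURES[lemma_name])
    return feats

print(noun_features("fork"))                          # includes CONCRETE
print(noun_features("Will-Smith", ne_type="PERSON"))  # {'PERSON', 'ANIMATE'}
```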

Types of verbs

• Verb features are based on the types of verbs

• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily, or just happen without any performer
  – Be: verbs which tell about the existence of something or about some fact

• We classify these three into "do" and "not-do"

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not

• "do" verbs are volitional verbs, which are performed voluntarily

• Heuristic: if a frame has the form "Something/somebody …s something/somebody", then the verb is "do" type

• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do"

  Word     Type     Feature
  was      not-do   –
  eating   do       do
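A sketch of that frame heuristic using NLTK's Lemma.frame_strings(), which substitutes the lemma into WordNet's generic sentence frames; the regular expression and the second test word ("rain", as an obviously non-volitional verb) are our own choices:

```python
import re
from nltk.corpus import wordnet as wn

FRAME = re.compile(r"^Somebody \S+ (something|somebody)\b")

def is_do_verb(verb: str) -> bool:
    """'do' (volitional) if some sense of the verb has a sentence frame of
    the shape 'Somebody <verb> something/somebody'."""
    for synset in wn.synsets(verb, pos=wn.VERB):
        for lemma in synset.lemmas():
            if lemma.name() == verb and any(FRAME.match(f) for f in lemma.frame_strings()):
                return True
    return False

print(is_do_verb("eat"))   # True  -> feature 'do' (the slide tags 'eating' as do)
print(is_do_verb("rain"))  # False -> 'not-do' (weather frames only)
```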

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations are handled separately

• Similar relations are handled in clusters

• Rule-based approach

• Rules are applied on:
  – Stanford dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river

  Conditions                                                           Relation generated
  nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE;             agt(eat, Will-Smith)
    eat is "do" type and active
  dobj(eating, apple); eat is active                                   obj(eat, apple)
  prep(eating, with); pobj(with, fork); fork.feature = CONCRETE        ins(eat, fork)
  prep(eating, on); pobj(on, bank); bank.feature = PLACE               plc(eat, bank)
  prep(bank, of); pobj(of, river); bank.POS = noun;                    mod(bank, river)
    river.feature is not PERSON
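The rule table lends itself to a direct implementation. A sketch over (relation, governor, dependent) triples covering only the five rules shown above (the data-structure names are ours, stems are assumed to have been applied already, and the POS check on the mod rule is omitted for brevity):

```python
def generate_relations(deps, feats, verb_info):
    """deps: (rel, gov, dep) triples; feats: word -> feature set;
    verb_info: verb -> flags such as {'do', 'active'}."""
    pobj = {g: d for r, g, d in deps if r == "pobj"}   # preposition -> its object
    unl = []
    for rel, gov, dep in deps:
        if rel == "nsubj" and "ANIMATE" in feats.get(dep, set()) \
                and {"do", "active"} <= verb_info.get(gov, set()):
            unl.append(("agt", gov, dep))
        elif rel == "dobj" and "active" in verb_info.get(gov, set()):
            unl.append(("obj", gov, dep))
        elif rel == "prep" and dep == "with" and "CONCRETE" in feats.get(pobj.get(dep, ""), set()):
            unl.append(("ins", gov, pobj[dep]))
        elif rel == "prep" and dep == "on" and "PLACE" in feats.get(pobj.get(dep, ""), set()):
            unl.append(("plc", gov, pobj[dep]))
        elif rel == "prep" and dep == "of" and "PERSON" not in feats.get(pobj.get(dep, ""), set()):
            unl.append(("mod", gov, pobj.get(dep, "")))
    return unl

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork"),
        ("prep", "eat", "on"), ("pobj", "on", "bank"),
        ("prep", "bank", "of"), ("pobj", "of", "river")]
feats = {"Will-Smith": {"ANIMATE", "PERSON"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
verb_info = {"eat": {"do", "active"}}
print(generate_relations(deps, feats, verb_info))
# [('agt','eat','Will-Smith'), ('obj','eat','apple'), ('ins','eat','fork'),
#  ('plc','eat','bank'), ('mod','bank','river')]
```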

Attribute Generation (12)

• Attributes are handled in clusters

• Attributes are generated by rules

• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river

  Conditions                                                   Attributes generated
  eat is the verb of the main clause; VTYPE(eat, main);        eat.@entry.@progress.@past
    PROG(eat, +); TENSE(eat, past)
  det(apple, an)                                               apple.@indef
  det(fork, a)                                                 fork.@indef
  det(bank, the)                                               bank.@def
  det(river, the)                                              river.@def
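A matching sketch for the attribute rules above (same hedges as before: the function and argument names are ours):

```python
def generate_attributes(deps, xle_info, main_verb):
    """Determiners give @def/@indef, XLE tense/aspect facts give
    @past/@progress, and the main-clause verb gets @entry."""
    attrs = {main_verb: ["entry"]}
    for rel, gov, dep in deps:
        if rel == "det":
            attrs.setdefault(gov, []).append("def" if dep == "the" else "indef")
    if xle_info.get("PROG") == "+":
        attrs[main_verb].append("progress")
    if xle_info.get("TENSE") == "past":
        attrs[main_verb].append("past")
    return attrs

deps = [("det", "apple", "an"), ("det", "fork", "a"),
        ("det", "bank", "the"), ("det", "river", "the")]
print(generate_attributes(deps, {"PROG": "+", "TENSE": "past"}, "eat"))
# {'eat': ['entry', 'progress', 'past'], 'apple': ['indef'], 'fork': ['indef'],
#  'bank': ['def'], 'river': ['def']}
```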

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus

• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 – (Simplifier + Merger)
  – Scenario 4: Scenario 2 – XLE parser

• Scenario 3 lets us know the impact of the Simplifier and Merger

• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)

• t_gold = relation_gold(UW1_gold, UW2_gold)

• t_gen = t_gold if
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold

• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the attributes matched between UW_gen and UW_gold

• count(UW_x) = the number of attributes in UW_x
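The slide stops short of the final accuracy formula; one natural reading (ours, so treat the denominator choice as an assumption) scores a relation triple as an exact match and scores attributes as |amatch| / count(UW_gold):

```python
def triple_correct(t_gen, t_gold):
    """Exact match on (relation, UW1, UW2), per the definition above."""
    return t_gen == t_gold

def attribute_accuracy(attrs_gen, attrs_gold):
    """|amatch| / count(UW_gold); returns 1.0 when the gold UW has no attributes."""
    amatch = set(attrs_gen) & set(attrs_gold)
    return len(amatch) / len(attrs_gold) if attrs_gold else 1.0

print(triple_correct(("agt", "eat", "Will-Smith"), ("agt", "eat", "Will-Smith")))       # True
print(round(attribute_accuracy(["entry", "past"], ["entry", "past", "progress"]), 2))   # 0.67
```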

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy for Scenario 1–4; y-axis from 0 to 0.5]

Results Time-taken

• Time taken for the systems to process 7000 sentences from the EOLSS corpus

[Chart: time taken for Scenario 1–4; y-axis from 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: false positives (wrong relations generated) per relation type – obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or – and overall]

Missed Relations

[Chart: false negatives (missed relations) per relation type – obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or – and overall]

Case Study (1/3)

• Conjunctive relations – wrong apposition detection

• James loves eating red water-melon, apples peeled nicely and bananas

  amod(water-melon-5, red-4)
  appos(water-melon-5, apples-7)
  partmod(apples-7, peeled-8)
  advmod(peeled-8, nicely-9)
  cc(nicely-9, and-10)
  conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations – non-uniform relation generation

• George the president of the football club, James the coach of the club and the players went to watch their rival's game

  appos(George-1, president-4)
  conj(club-8, James-10)
  conj(club-8, coach-13)
  cc(club-8, and-17)
  conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt vs aoj: non-uniformity in the corpus
  – gol vs obj: multiple relation-generation possibilities

• The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

  Corpus relations:      agt(represent, latter), aoj(represent, former), obj(dosage, body)
  Generated relations:   aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from the head to its modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

  Suffix         Tense    Aspect    Mood     N   Gen     P    VE
  -e rahaa thaa  past     progress  –        sg  male    3rd  e
  -taa hai       present  custom    –        sg  male    3rd  –
  -iyaa thaa     past     complete  –        sg  male    3rd  I
  saktii hain    present  –         ability  pl  female  3rd  A
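Read as a lookup from a feature bundle to a Hindi verb suffix, the table can be encoded directly; a hedged sketch covering just the four rows shown (names and tuple layout are ours):

```python
# (tense, aspect, mood, number, gender, person) -> transliterated Hindi suffix
VERB_SUFFIX = {
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="male", person="3rd"):
    return VERB_SUFFIX.get((tense, aspect, mood, number, gender, person))

print(verb_suffix("past", aspect="progress"))                                 # -e rahaa thaa
print(verb_suffix("present", mood="ability", number="pl", gender="female"))   # saktii hain
```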

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

  Rel   Par+   Par–     Chi+        Chi–    Child FW
  Obj   V      V.INT    N, ANIMT    topic   क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression

• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

  Relative order (L = row relation placed to the left of the column relation, R = to its right):

         agt   aoj   obj   ins
  agt          L     L     L
  aoj                L     L
  obj                      R
  ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

• Divide a node's relations into
    Before_set = {obj, pur, agt}
    After_set  = { }

• Topologically sort each group: pur, agt, obj  →  this, you, farmer

• Final order: this you farmer contact
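A compact sketch of this strategy: children are split into a Before and an After group, and the Before group is ordered by a fixed relation priority taken from the worked example (pur before agt before obj). The real deconverter uses the pairwise precedence table above, so treat the priority list here as an assumption:

```python
BEFORE = {"pur", "plc", "agt", "aoj", "ins", "obj"}          # relations placed left of the head word
PRIORITY = {"pur": 0, "plc": 1, "agt": 2, "aoj": 3, "ins": 4, "obj": 5}

def linearize(node, graph):
    """graph: node -> list of (relation, child); returns words left to right."""
    edges = graph.get(node, [])
    before = sorted((e for e in edges if e[0] in BEFORE), key=lambda e: PRIORITY[e[0]])
    after = [e for e in edges if e[0] not in BEFORE]
    words = []
    for _, child in before:
        words += linearize(child, graph)          # recurse on left-placed children
    words.append(node)                            # then the head word itself
    for _, child in after:
        words += linearize(child, graph)          # then right-placed children
    return words

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph))                # ['this', 'you', 'farmer', 'contact']
```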

Syntax Planning Algo

[Worked trace of the algorithm on the example UNL graph, with columns Stack, Before-Current, After-Current and Output; it steps through contact, farmer, region, taluka, manchar and khatav, ending with the word order "this(-for) you manchar region or khatav taluka(-of) farmers contact"]

All Together (UNL -> Hindi)

[Module-by-module outputs for the same example as in the step-through above; only the English glosses survived extraction on this slide]

  UNL expression:           obj(contact(…
  Lexeme Selection:         contact farmer this you region taluka manchar khatav
  Case Identification:      contact farmer this you region taluka manchar khatav
  Morphology Generation:    contact-imperative farmer-pl this you region taluka manchar khatav
  Function Word Insertion:  contact farmers this-for you region or taluka-of manchar khatav
  Syntax Linearization:     this-for you manchar region or khatav taluka-of farmers contact

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
  (4) Perfect: good grammar
  (3) Fair: easy to understand but flawed grammar
  (2) Acceptable: broken – understandable with effort
  (1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
  (4) All: no loss of meaning
  (3) Most: most of the meaning is conveyed
  (2) Some: some of the meaning is conveyed
  (1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                          BLEU   Fluency   Adequacy
  Geometric average       0.34   2.54      2.84
  Arithmetic average      0.41   2.71      3.00
  Standard deviation      0.25   0.89      0.89
  Pearson cor. (BLEU)     1.00   0.59      0.50
  Pearson cor. (Fluency)  0.59   1.00      0.68

Fluency vs Adequacy

[Chart: number of sentences at each fluency score (1–4), broken down by adequacy score (1–4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
  – Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Vauquois diagram, as on the earlier "Kinds of MT Systems" slide: levels from direct translation through syntactic and semantic transfer up to interlingua]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
  – Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
  – Deliver domain-specific versions of the MT systems; the domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
  – By-products: basic tools and lexical resources for Indian languages – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

  Grade  Description
  0      Nonsense (the sentence doesn't make any sense at all – it is like someone speaking to you in a language you don't know)
  1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language – you understand those words but nothing more)
  2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
  3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
  4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on the score given according to the translation quality
  – Accuracy: 65.32%

• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun)
  vaach (read) → vaachaNe (reading)
  utara (climb down) → utaraN (downward slope)

Adjectives (Verb → Adjective)
  chaav (bite) → chaavaNaara (one who bites)
  khaa (eat) → khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  paL (run) → paLataanaa (while running)
  bas (sit) → basun (after sitting)

Kridanta Types

  "-Ne" Kridanta – Perfective
    vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give]

  "-laa" Kridanta – Perfective
    lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell]

  "-taanaa" Kridanta – Durative
    pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came]

  "-lela" Kridanta – Perfective
    kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give]

  "-Un" Kridanta – Completive
    pustak vaachun parat kar (Return the book after reading it) [book after-reading back do]

  "-Nara" Kridanta – Stative
    pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets]

  "-ve" Kridanta – Inceptive
    he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read]

  "-taa" Kridanta – Stative
    to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went]

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall (range 0.86–0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

25

"Interlingua" and "Esperanto"

• Control languages for "Interlingua": French, Italian, Spanish, Portuguese, English, German and Russian
• Natural words in "Interlingua"; "manufactured" words in Esperanto using heavy agglutination
• Esperanto word for "hospital": mal·san·ul·ej·o = mal (opposite) + san (health) + ul (person) + ej (place) + o (noun)

26

27

Language Universals vs Language Typology

• "Universals" is concerned with what human languages have in common, while the study of typology deals with the ways in which languages differ from each other.

28

Typology basic word order

• SOV (Japanese, Tamil, Turkish, etc.)
• SVO (Fula, Chinese, English, etc.)
• VSO (Arabic, Tongan, Welsh, etc.)

"Subjects tend strongly to precede objects."

29

Typology Morphotactics of singular and plural

• No expression: Japanese hito "person", pl. hito
• Function word: Tagalog bato "stone", pl. mga bato
• Affixation: Turkish ev "house", pl. ev-ler; Swahili m-toto "child", pl. wa-toto
• Sound change: English man, pl. men; Arabic rajulun "man", pl. rijalun
• Reduplication: Malay anak "child", pl. anak-anak

30

Typology Implication of word order

Due to its SVO nature, English has:
– preposition + noun (in the house)
– genitive + noun (Tom's house) or noun + genitive (the house of Tom)
– auxiliary + verb (will come)
– noun + relative clause (the cat that ate the rat)
– adjective + standard of comparison (better than Tom)

31

Typology motion verbs (13)

• Motion is expressed differently in different languages, and the differences turn out to be highly significant.
• There are two large types:
– verb-framed languages, and
– satellite-framed languages

32

Typology motion verbs (23)

• In satellite-framed languages like English, the motion verb typically also expresses manner or cause:
– The bottle floated out of the cave. (Manner)
– The napkin blew off the table. (Cause)
– The water rippled down the stairs. (Manner)

33

Typology motion verbs (33)

• Verb-framed languages express manner and cause not in the verb but in a more peripheral element.
• Spanish:
– La botella salió de la cueva flotando
– "The bottle exited the cave floatingly"
• Japanese:
– Bin-ga dookutsu-kara nagara-te de-ta
– bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT: EnConversion + Deconversion

[Diagram: English, French, Hindi, Chinese, etc. arranged around a central Interlingua (UNL) node; analysis maps a source sentence into UNL, generation maps UNL into the target language]

Challenges of interlingua generation

The mother of NLP problems: extract meaning from a sentence.

Almost all NLP problems are sub-problems of this: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on.

UNL a United Nations project

• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups:
– UNL_French (GETA-CLIPS, IMAG)
– UNL_Hindi (IIT Bombay, with additional work on UNL_English)
– UNL_Italian (Univ. of Pisa)
– UNL_Portugese (Univ. of Sao Paolo, Brazil)
– UNL_Russian (Institute of Linguistics, Moscow)
– UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language independent meaning representation

[Diagram: UNL as the hub connecting English, Russian, Japanese, Hindi, Spanish, Marathi and other languages]

39

UNL represents knowledge John eats rice with a spoon

[Diagram: UNL graph for the sentence, built from universal words linked by semantic relations and decorated with attributes; the repository has 42 semantic relations and 84 attribute labels]

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)

[UNL]

41

The Lexicon

Content words

[forward] "forward(icl>send)" (V, VOA) <E, 0, 0>
[mail] "mail(icl>message)" (N, PHSCL, INANI) <E, 0, 0>
[minister] "minister(icl>person)" (N, ANIMT, PHSCL, PRSN) <E, 0, 0>

(format: [Headword] "Universal Word" (Attributes))

He forwarded the mail to the minister

42

The Lexicon (cntd)

Function words

[he] "he" (PRON, SUB, SING, 3RD) <E, 0, 0>
[the] "the" (ART, THE) <E, 0, 0>
[to] "to" (PRE, TO) <E, 0, 0>

(format: [Headword] "Universal Word" (Attributes))

He forwarded the mail to the minister

43

How to obtain UNL expresssions

• The UNL nerve center is the verb.
• English sentences are of the form:
– A <verb> B, or
– A <verb>

[Diagram: the verb as the central node, connected by relations (rel) to A and B]

Recurse on A and B.

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

[Architecture diagram: the input sentence passes through NER, the Stanford Dependency Parser, WSD and a Clause Marker; a Simplifier splits it into simple sentences, each handled by a Simple Sentence Enconverter (Stanford Dependency Parser, XLE Parser, feature generation, relation generation, attribute generation); a Merger combines the resulting UNL graphs]

Complexity of handling long sentences

Problem: as the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
– A sentence can only be broken at the clausal level: Simplification
– Once broken, the pieces need to be joined back: Merging
• Reduce the reliance of the Enconverter on the XLE parser
– The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence.
• Example:
– Sentence: John went to play and Jane went to school.
– Simplified: John went to play. Jane went to school.
• The new simplifier is developed on the clause markings and the inter-clause relations provided by the Stanford dependency parser (a minimal sketch follows).
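To make the clause-splitting idea concrete, here is a minimal sketch in Python, assuming the Stanford-style dependency triples are already available as plain tuples; the function and variable names are illustrative and not the ones used in the actual simplifier.

# A minimal sketch of clause-level simplification over Stanford-style
# dependency triples (relation, head_index, dependent_index).
def split_clauses(tokens, deps):
    """tokens: list of (index, word); returns one list of token indices per simple clause."""
    # Clause heads: the root verb plus verbs coordinated with it via conj.
    clause_heads = {d for (rel, h, d) in deps if rel == "root"}
    clause_heads |= {d for (rel, h, d) in deps if rel == "conj" and h in clause_heads}

    head_of = {d: h for (rel, h, d) in deps}
    def clause_of(i):                     # climb to the nearest clause head
        while i not in clause_heads and i in head_of:
            i = head_of[i]
        return i

    clauses = {h: [] for h in clause_heads}
    for i, _ in tokens:
        c = clause_of(i)
        if c in clauses:
            clauses[c].append(i)
    cc = {d for (rel, h, d) in deps if rel == "cc"}   # drop the conjunction itself
    return [[i for i in sorted(idxs) if i not in cc] for idxs in clauses.values()]

# "John went to play and Jane went to school"
tokens = list(enumerate("John went to play and Jane went to school".split(), 1))
deps = [("root", 0, 2), ("nsubj", 2, 1), ("aux", 4, 3), ("xcomp", 2, 4),
        ("cc", 2, 5), ("conj", 2, 7), ("nsubj", 7, 6), ("prep", 7, 8), ("pobj", 8, 9)]
for clause in split_clauses(tokens, deps):
    print(" ".join(dict(tokens)[i] for i in clause))
# prints: "John went to play" and "Jane went to school"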

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL.
• Example:
– Input:
  agt(go.@entry, John), pur(go.@entry, play)
  agt(go.@entry, Mary), gol(go.@entry, school)
– Output:
  agt(go.@entry, John), pur(go.@entry, play)
  and(go, :01)
  agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases, based on how the sentence was simplified; the Merger uses rules for each of the cases (see the sketch below).
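A minimal sketch of the coordination case of merging, under the assumption that each simple-sentence UNL arrives as (relation, arg1, arg2) triples; the scope handling mirrors the example above, and the function name is made up for illustration.

def merge_coordinate(unl_main, unl_sub, conjunction="and", scope=":01"):
    """Put the second UNL graph into a scope and coordinate it with the first."""
    def entry_word(triples):
        # The entry node is the UW carrying the @entry attribute.
        for _, a1, _ in triples:
            if "@entry" in a1:
                return a1
        return triples[0][1]

    merged = list(unl_main)
    # Coordinate the main entry (attributes stripped) with the scoped sub-graph.
    merged.append((conjunction, entry_word(unl_main).split(".")[0], scope))
    # Re-label every relation of the sub-sentence with the scope id.
    merged.extend((rel + scope, a1, a2) for rel, a1, a2 in unl_sub)
    return merged

main = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
sub  = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
for rel, a1, a2 in merge_coordinate(main, sub):
    print(f"{rel}({a1}, {a2})")
# agt(go.@entry, John), pur(go.@entry, play), and(go, :01),
# agt:01(go.@entry, Mary), gol:01(go.@entry, school)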

NLP tools and resources for UNL generation

Tools
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing
• Stems of words
• WSD
• Nouns originating from verbs
• Feature generation: noun features, verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types.
• Types covered: Person, Place, Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river.
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river.

Syntactic Processing Stanford Dependency Parser (12)

• The Stanford Dependency Parser parses the sentence and produces:
– POS tags of the words in the sentence
– the dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river.

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1), aux(eating-3, was-2), det(apple-5, an-4), dobj(eating-3, apple-5), prep(eating-3, with-6), det(fork-8, a-7), pobj(with-6, fork-8), prep(eating-3, on-9), det(bank-11, the-10), pobj(on-9, bank-11), prep(bank-11, of-12), det(river-14, the-13), pobj(of-12, river-14)

Syntactic Processing XLE Parser

• The XLE parser generates dependency relations and some other important information.

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Dependency relations:
OBJ(eat:8, apple:16), SUBJ(eat:8, will-smith:0), ADJUNCT(apple:16, on:20), OBJ(on:20, bank:22), ADJUNCT(apple:16, with:17), OBJ(with:17, fork:19), ADJUNCT(bank:22, of:23), OBJ(of:23, river:25)

Important information:
PASSIVE(eat:8, -), PROG(eat:8, +), TENSE(eat:8, past), VTYPE(eat:8, main)

SEMANTIC PROCESSING

Semantic Processing Finding stems

• WordNet is used for finding the stems of the words.

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Word      Stem
was       be
eating    eat
apple     apple
fork      fork
bank      bank
river     river

Semantic Processing: WSD

Word        POS           Synset-id   Sense-id
Will-Smith  Proper noun   -           -
was         Verb          02610777    be24203
eating      Verb          01170802    eat23400
an          Determiner    -           -
apple       Noun          07755101    apple11300
with        Preposition   -           -
a           Determiner    -           -
fork        Noun          03388794    fork10600
on          Preposition   -           -
the         Determiner    -           -
bank        Noun          09236472    bank11701
of          Preposition   -           -
the         Determiner    -           -
river       Noun          09434308    river11700

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++).

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Word        Universal Word
Will-Smith  Will-Smith(iof>PERSON)
eating      eat(icl>consume, equ>consumption)
apple       apple(icl>edible fruit>thing)
fork        fork(icl>cutlery>thing)
bank        bank(icl>slope>thing)
river       river(icl>stream>thing)

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs.
• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of any relation between noun1 and noun2.

Algorithm:
– Traverse the hypernym path of the noun up to its root.
– If any word on the path is "action", then the noun has originated from a verb.

Examples:
– removal of garbage = removing garbage
– collection of stamps = collecting stamps
– wastage of food = wasting food

(A minimal sketch of this test follows.)
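Here is a minimal sketch of the test using NLTK's WordNet interface (assumes the WordNet corpus has been downloaded with nltk.download('wordnet')). Checking for the act/action lemmas in the hypernym closure is an assumption about how the slide's "action on the path" test is realised, and derivationally_related_forms is just one possible way to recover the source verb, not necessarily the system's.

from nltk.corpus import wordnet as wn

ACTION_MARKERS = {"act", "action", "activity"}     # assumed keyword set

def is_deverbal_noun(noun):
    """True if some sense of `noun` has act/action among its hypernyms."""
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for hyper in synset.closure(lambda s: s.hypernyms()):
            if ACTION_MARKERS & set(hyper.lemma_names()):
                return True
    return False

def source_verbs(noun):
    """Possible verbs the noun is derived from, via WordNet derivational links."""
    verbs = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == "v":
                    verbs.add(related.name())
    return verbs

for noun in ["removal", "collection", "wastage", "table"]:
    print(noun, is_deverbal_noun(noun), source_verbs(noun))
# e.g. "removal of garbage" would then be enconverted as obj(remove, garbage)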

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied.
• We capture these properties as features.
• We classify the features into noun features and verb features.

Relation         Word   Property
agt(uw1, uw2)    uw1    Volitional verb
met(uw1, uw2)    uw2    Abstract thing
dur(uw1, uw2)    uw2    Time period
ins(uw1, uw2)    uw2    Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add word.NE_type to word.features.
• Traverse the hypernym path of the noun up to its root.
• If any word on the path matches a keyword, the corresponding feature is generated.

Noun        Features
Will-Smith  PERSON, ANIMATE
bank        PLACE
fork        CONCRETE

(A minimal sketch follows.)
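A minimal sketch of noun feature generation: seed the features with the NE tag, then walk the WordNet hypernym closure and fire a feature whenever a keyword lemma is met. The keyword-to-feature table below is illustrative; the real system has its own keyword list.

from nltk.corpus import wordnet as wn

KEYWORD_FEATURES = {                 # hypernym lemma -> noun features (assumed table)
    "person": ["PERSON", "ANIMATE"],
    "animal": ["ANIMATE"],
    "location": ["PLACE"],
    "physical_object": ["CONCRETE"],
    "abstraction": ["ABSTRACT"],
    "time_period": ["TIME"],
}

def noun_features(word, ne_type=None):
    features = set()
    if ne_type:                      # e.g. the NER said PERSON / PLACE / ORGANIZATION
        features.add(ne_type)
    for synset in wn.synsets(word, pos=wn.NOUN):
        for hyper in synset.closure(lambda s: s.hypernyms()):
            for lemma in hyper.lemma_names():
                features.update(KEYWORD_FEATURES.get(lemma, []))
    return features

for w, ne in [("Will-Smith", "PERSON"), ("fork", None), ("bank", None)]:
    print(w, noun_features(w, ne))   # exact output depends on the senses WordNet lists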

Types of verbs

• Verb features are based on the types of verbs.
• There are three types:
– Do: verbs that are performed volitionally
– Occur: verbs that are performed involuntarily, or just happen without any performer
– Be: verbs which tell about the existence of something or about some fact
• We classify these three into do and not-do.

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do"-type verb or not.
• "do" verbs are volitional verbs, performed voluntarily.
• Heuristic: if a frame is of the form "Something/somebody ...s something/somebody", mark the verb as "do" type.
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do".

Word     Type     Feature
was      not do   -
eating   do       do

(A minimal sketch of the frame check follows.)
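A minimal sketch of the do / not-do test via WordNet verb sentence frames, using NLTK's frame_strings(). Treating a "Somebody ...s something/somebody" frame as volitional is the heuristic itself, not a guarantee, and the exact frame wording is whatever NLTK returns.

from nltk.corpus import wordnet as wn

def is_do_verb(verb):
    """Heuristic: a verb is 'do' if some frame has a Somebody subject and an object slot."""
    for synset in wn.synsets(verb, pos=wn.VERB):
        for lemma in synset.lemmas():
            for frame in lemma.frame_strings():
                words = frame.lower().split()
                if words and words[0] == "somebody" and \
                        any(w.startswith(("something", "somebody")) for w in words[1:]):
                    return True          # e.g. a frame like "Somebody eat something"
    return False

for verb in ["eat", "sleep", "rain"]:
    print(verb, "do" if is_do_verb(verb) else "not-do")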

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations are handled separately.
• Similar relations are handled in clusters.
• Rule-based approach.
• Rules are applied on:
– Stanford Dependency relations
– Noun features
– Verb features
– POS tags
– XLE-generated information

Relation Generation (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Conditions                                                          Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE;            agt(eat, Will-Smith)
  eat is do type and active
dobj(eating, apple); eat is active                                  obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE       ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE              plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun;                   mod(bank, river)
  river.feature is not PERSON

(A minimal sketch of a few of these rules follows.)
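A minimal sketch of rule-based relation generation over Stanford-style dependency triples plus the noun/verb features computed earlier. The rule conditions follow the table above; the data structures, and the use of stemmed words in the triples, are illustrative simplifications.

def generate_relations(deps, noun_feats, verb_feats, voice="active"):
    """deps: (rel, head, dep) triples; *_feats: word -> set of features."""
    unl = []
    for rel, head, dep in deps:
        if rel == "nsubj" and voice == "active" \
                and "ANIMATE" in noun_feats.get(dep, set()) \
                and "do" in verb_feats.get(head, set()):
            unl.append(("agt", head, dep))              # volitional agent
        elif rel == "dobj" and voice == "active":
            unl.append(("obj", head, dep))
        elif rel == "prep":
            for rel2, head2, obj in deps:               # the pobj under this preposition
                if rel2 == "pobj" and head2 == dep:
                    feats = noun_feats.get(obj, set())
                    if dep == "with" and "CONCRETE" in feats:
                        unl.append(("ins", head, obj))
                    elif dep == "on" and "PLACE" in feats:
                        unl.append(("plc", head, obj))
    return unl

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork"),
        ("prep", "eat", "on"), ("pobj", "on", "bank")]
noun_feats = {"Will-Smith": {"ANIMATE", "PERSON"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
verb_feats = {"eat": {"do"}}
print(generate_relations(deps, noun_feats, verb_feats))
# [('agt','eat','Will-Smith'), ('obj','eat','apple'), ('ins','eat','fork'), ('plc','eat','bank')]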

Attribute Generation (12)

• Attributes are handled in clusters.
• Attributes are generated by rules.
• Rules are applied on:
– Stanford dependency relations
– XLE information
– POS tags

Attribute Generation (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Conditions                                                   Attributes generated
eat is the verb of the main clause; VTYPE(eat, main);        eat.@entry.@progress.@past
  PROG(eat, +); TENSE(eat, past)
det(apple, an)                                               apple.@indef
det(fork, a)                                                 fork.@indef
det(bank, the)                                               bank.@def
det(river, the)                                              river.@def

(A companion sketch follows.)
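A companion sketch for attribute generation: rules keyed on determiners and on the XLE tense/aspect facts. The rule set here is a small illustrative subset of the real one.

def generate_attributes(deps, xle_facts, main_verb):
    attrs = {}                                    # word -> set of attributes
    add = lambda w, a: attrs.setdefault(w, set()).add(a)
    add(main_verb, "@entry")                      # verb of the main clause
    for fact, word, value in xle_facts:
        if fact == "TENSE":
            add(word, "@" + value)                # e.g. @past
        elif fact == "PROG" and value == "+":
            add(word, "@progress")
    for rel, head, dep in deps:
        if rel == "det" and dep.lower() in ("a", "an"):
            add(head, "@indef")
        elif rel == "det" and dep.lower() == "the":
            add(head, "@def")
    return attrs

deps = [("det", "apple", "an"), ("det", "fork", "a"), ("det", "bank", "the")]
xle_facts = [("TENSE", "eat", "past"), ("PROG", "eat", "+")]
print(generate_attributes(deps, xle_facts, "eat"))
# {'eat': {'@entry', '@past', '@progress'}, 'apple': {'@indef'}, 'fork': {'@indef'}, 'bank': {'@def'}}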

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus: 7000 sentences.
• Evaluated on 4 scenarios:
– Scenario 1: previous system (tightly coupled with the XLE parser)
– Scenario 2: current system (tightly coupled with the Stanford parser)
– Scenario 3: Scenario 2 minus (Simplifier + Merger)
– Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger.
• Scenario 4 shows the impact of the XLE parser.

Results: Metrics

• A generated triple: tgen = relation_gen(UW1_gen, UW2_gen)
• A gold triple: tgold = relation_gold(UW1_gold, UW2_gold)
• tgen = tgold iff:
– relation_gen = relation_gold
– UW1_gen = UW1_gold
– UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the number of attributes matched between UW_gen and UW_gold.
• count(UW_x) = the number of attributes on UW_x.

(The formulas below spell out how these quantities combine.)
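Spelling the matching criterion out as formulas, consistent with the definitions above; the slide does not state which denominator the reported attribute accuracy uses, so that choice is an assumption.

\[
P_{rel} = \frac{|T_{gen} \cap T_{gold}|}{|T_{gen}|}, \qquad
R_{rel} = \frac{|T_{gen} \cap T_{gold}|}{|T_{gold}|}, \qquad
F_{rel} = \frac{2\,P_{rel}\,R_{rel}}{P_{rel} + R_{rel}}
\]
\[
\text{Attribute accuracy} \;=\;
\frac{\sum_{UW_{gen} = UW_{gold}} amatch(UW_{gen}, UW_{gold})}
     {\sum count(UW_{gold})}
\]

Here \(T_{gen}\) and \(T_{gold}\) are the sets of generated and gold triples defined above.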

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Bar chart: attribute accuracy for Scenarios 1-4; y-axis from 0 to 0.5]

Results Time-taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus.

[Bar chart: time taken for Scenarios 1-4; y-axis from 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart: false positives (wrong relations generated) by relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Missed Relations

[Bar chart: false negatives (missed relations) by relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall]

Case Study (13)

• Conjunctive relations: wrong apposition detection
– James loves eating red water-melon, apples peeled nicely and bananas.
– amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations: non-uniform relation generation
– George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game.
– appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations:
– mod and qua: more general rules
– agt vs. aoj: non-uniformity in the corpus
– gol vs. obj: multiple relation-generation possibilities

Example: The former however represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Input UNL expression: obj(contact(...), ...) ...

Module-by-module output (English gloss in parentheses):
• Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट (contact farmer this you region taluka Manchar Khatav)
• Case Identification: the same lexemes, now with case information assigned
• Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट (contact-imperative farmer-pl this you region taluka Manchar Khatav)
• Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट (contact farmers-of this-for you region or taluka-of Manchar Khatav)
• Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए (this-for you Manchar region or Khatav taluka of farmers contact)

[UNL graph: contact with agt (you), obj (farmer), pur (this) and plc (region) relations; the region is elaborated via nam/or links to the Manchar and Khatav talukas in scope :01]

Lexeme Selection

Two dictionary entries for the UW "contact":
[सपक+] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous.

[Diagram: agt(contact, you), obj(contact, farmer), pur(contact, this), with the selected lexemes attached]

Case Marking

Rule table (column alignment as reconstructed from the slide):
Relation   Parent+   Parent-   Child+   Child-
Obj        V         VINT      N        -
Agt        V         past      N        -

• Depends on the UNL relation and the properties of the nodes.
• Case gets transferred from the head to its modifiers.

[Diagram: the agt/obj/pur graph around "contact", repeated]

(A minimal sketch of this style of rule lookup follows.)
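A minimal sketch of rule-table lookup shared by case marking and function word insertion: each rule is keyed on the UNL relation plus required (+) and forbidden (-) properties of the parent and child nodes. The two example rows follow the slides; my column reading of those rows is approximate, and the data structures are illustrative.

RULES = [
    # (relation, parent+, parent-, child+, child-, child function word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, set(), "क"),   # value copied from the FW-insertion slide
    ("agt", {"V"}, {"past"}, {"N"},          set(), None),  # case-marking row, reading approximate
]

def apply_rules(relation, parent_props, child_props):
    """Return the function word (or None) prescribed by the first matching rule."""
    for rel, p_plus, p_minus, c_plus, c_minus, fw in RULES:
        if (rel == relation
                and p_plus <= parent_props and not (p_minus & parent_props)
                and c_plus <= child_props and not (c_minus & child_props)):
            return fw
    return None

# obj(contact, farmer): contact is a (non-intransitive) verb, farmer an animate noun
print(apply_rules("obj", {"V"}, {"N", "ANIMT"}))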

Morphology Generation Nouns

Suffix   Attribute values
uoM      NNUMploblique
U        NNUMsgoblique
I        NNIFsgoblique
iyoM     NNIFploblique
oM       NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix         Tense     Aspect     Mood      Number   Gender   Person   VE
-e rahaa thaa  past      progress   -         sg       male     3rd      e
-taa hai       present   custom     -         sg       male     3rd      -
-iyaa thaa     past      complete   -         sg       male     3rd      I
saktii hain    present   -          ability   pl       female   3rd      A

(A minimal lookup sketch follows.)
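A minimal sketch of how such a suffix table can drive generation: the rows become a dictionary keyed on the TAM and agreement features, and generation is a lookup. Keys and values are copied from the table above; the noun suffix table on the previous slide can be handled by the same mechanism.

VERB_SUFFIX = {
    # (tense, aspect, mood, number, gender, person) -> suffix / auxiliary string
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="male", person="3rd"):
    """Return the suffix string for the given feature bundle ('' if no rule matches)."""
    return VERB_SUFFIX.get((tense, aspect, mood, number, gender, person), "")

print(verb_suffix("past", aspect="progress"))                                  # -e rahaa thaa
print(verb_suffix("present", mood="ability", number="pl", gender="female"))   # saktii hain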

After Morphology Generation

[Diagram: the graph before morphology generation (contact सपक+, farmer निकस, this य, you आप) and after it (contact सपक+ क0जि2ए, farmer निकस4, this इस, you आप), with the agt, obj and pur relations unchanged]

Function Word Insertion

• Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
• After: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

Rule row (as on the slide):
Rel   Par+   Par-   Chi+       Chi-    Child function word
Obj   V      VINT   N, ANIMT   topic   क

[Diagram: the graph with function words attached, e.g. farmer निकस4 क, this इसक लिलए]

Linearization

Linearized order: इस आप माचर कषतर (this you Manchar region), खाट तलक निकस4 (Khatav taluka farmers), सपक+ (contact)

[Diagram: the UNL graph with its agt, obj, pur, plc, nam and or relations, repeated for reference]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
– Semantic Independence: the semantic properties of the relata
– Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
– Local Ordering: the rest of the expression

Pairwise ordering table (the row relation is placed Left/Right of the column relation):
        agt   aoj   obj   ins
agt      .     L     L     L
aoj            .     L     L
obj                  .     R
ins                        .

[Diagram: the agt/obj/pur example graph around "contact", repeated for reference]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {} (for the node "contact").
• Topologically sort each group: pur, agt, obj, giving this, you, farmer.
• Final order: this you farmer contact.

(The same procedure is applied recursively to region, khatav, manchar and taluka; a minimal sketch follows.)
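A minimal sketch of the syntax-planning step: split a node's relations into a Before set, order them with the pairwise priority table, and recurse depth-first. The agt/aoj/obj/ins entries follow the matrix on the earlier slide; the pur entries and the Before-set membership are assumptions made so that the worked example reproduces the slide's order.

PRIORITY = {             # (row, column) -> 'L' (left of) or 'R' (right of)
    ("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
    ("aoj", "obj"): "L", ("aoj", "ins"): "L",
    ("obj", "ins"): "R",
    ("pur", "agt"): "L", ("pur", "obj"): "L",   # assumed entries (not on the slide)
}
BEFORE_HEAD = {"agt", "aoj", "obj", "ins", "pur", "plc"}   # assumed SOV-style defaults

def order_relations(rels):
    """Order sibling relations using the pairwise L/R table (simple insertion)."""
    ordered = []
    for r in rels:
        pos = len(ordered)
        for i, q in enumerate(ordered):
            if PRIORITY.get((r, q)) == "L" or PRIORITY.get((q, r)) == "R":
                pos = i
                break
        ordered.insert(pos, r)
    return ordered

def linearize(node, graph):
    """graph: node -> list of (relation, child); returns the word order."""
    children = graph.get(node, [])
    before = [(r, c) for r, c in children if r in BEFORE_HEAD]
    after = [(r, c) for r, c in children if r not in BEFORE_HEAD]
    words = []
    for rel in order_relations([r for r, _ in before]):
        for r, c in before:
            if r == rel:
                words += linearize(c, graph)
    words += [node]
    for _, c in after:
        words += linearize(c, graph)
    return words

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph))     # ['this', 'you', 'farmer', 'contact'], as on the slide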

Syntax Planning: Algorithm

[Table: step-by-step trace of the algorithm on the "contact" example, showing the stack contents, the Before/After sets of the current node, and the growing output (this, you, then the Manchar/Khatav region and taluka phrase, then farmer, then contact)]

All Together (UNL to Hindi)

[Table: the same module-by-module walkthrough as in "Step-through the Deconverter" (Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization), with glosses: contact farmer this you region taluka Manchar Khatav, ..., this for you Manchar region or Khatav taluka of farmers contact]

How to Evaluate UNL Deconversion

• UNL to Hindi.
• The reference translation needs to be generated from the UNL, which needs expertise.
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated.
– This works if you assume that the UNL generation was perfect.
• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much of the meaning of the reference sentence is conveyed in the translation?
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output, with Fluency and Adequacy scores listed after each sentence:

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                        BLEU    Fluency   Adequacy
Geometric Average       0.34    2.54      2.84
Arithmetic Average      0.41    2.71      3.00
Standard Deviation      0.25    0.89      0.89
Pearson Cor. (BLEU)     1.00    0.59      0.50
Pearson Cor. (Fluency)  0.59    1.00      0.68

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each fluency level (1-4), broken down by adequacy level (1-4)]

• Good correlation between Fluency and BLEU.
• Strong correlation between Fluency and Adequacy.
• Large-scale evaluation can therefore be done using Fluency alone.

Caution: domain diversity, speaker diversity.

Transfer Based MT

Marathi-Hindi

106

[Diagram: the Vauquois-style hierarchy of MT approaches, from direct translation at the graphemic/tagged-text level, through surface and deep syntactic transfer (C-structures, F-structures), semantic and multilevel transfer (SPA-structures), up to conceptual transfer over a semantico-linguistic or ontological interlingua; repeated here to situate transfer-based MT]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark: IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida.

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages
– POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
– Bidirectional bilingual dictionaries, annotated corpora, etc.

Language pairs (bidirectional): Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil.

Sampark Architecture

Horizontal tasks:
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration.

Vertical tasks for each language:
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination.

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules.
• Regular internal subjective evaluation is carried out at IIIT Hyderabad.
• Third-party evaluation by C-DAC Pune.

Evaluation scale:
Grade 0 (Nonsense): the sentence doesn't make any sense at all (like someone speaking to you in a language you don't know).
Grade 1: some parts make sense, but it is not comprehensible overall (like listening to a language with many words borrowed from your language: you understand those words but nothing more).
Grade 2: comprehensible, but has quite a few errors (like someone who can speak your language but makes lots of errors; you can still make sense of what is said).
Grade 3: comprehensible, occasional errors (like someone speaking Hindi getting all its genders wrong).
Grade 4: perfect (like someone who knows the language).

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation, on 500 web sentences.

[Bar chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality:
– Evaluated on 500 web sentences.
– Accuracy calculated from the scores given for translation quality.
– Accuracy: 65.32%.
• Result analysis:
– The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
– Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.

Important challenge of M-H Translation-

Morphology processing: kridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories:

Nouns (Verb to Noun):
– च vaach (read) to चण vaachaNe (reading)
– उतर utara (climb down) to उतरण utaraN (downward slope)

Adjectives (Verb to Adjective):
– च chav (bite) to चणर chaavaNaara (one who bites)
– खा khaa (eat) to खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb to Adverb):
– पळ paL (run) to पळताना paLataanaa (while running)
– बस bas (sit) to बसना basun (after sitting)

Kridanta Types

Kridanta type, example (with gloss) and aspect:
• "Ne"-kridanta: vaachNyaasaaThee pustak de ("Give me a book for reading"; gloss: for-reading book give). Aspect: Perfective
• "laa"-kridanta: Lekh vaachalyaavar saaMgen ("I will tell you that after reading the article"; gloss: article after-reading will-tell). Aspect: Perfective
• "Taanaa"-kridanta: Pustak vaachtaanaa te lakShaat aale ("I noticed it while reading the book"; gloss: book while-reading it in-mind came). Aspect: Durative
• "Lela"-kridanta: kaal vaachlele pustak de ("Give me the book that (I/you) read yesterday"; gloss: yesterday read book give). Aspect: Perfective
• "Un"-kridanta: pustak vaachun parat kar ("Return the book after reading it"; gloss: book after-reading back do). Aspect: Completive
• "Nara"-kridanta: pustake vaachNaaRyaalaa dnyaan miLte ("The one who reads books gets knowledge"; gloss: books to-the-one-who-reads knowledge gets). Aspect: Stative
• "ve"-kridanta: he pustak pratyekaane vaachaave ("Everyone should read this book"; gloss: this book everyone should-read). Aspect: Inceptive
• "taa"-kridanta: to pustak vaachtaa vaachtaa zopee gelaa ("He fell asleep while reading a book"; gloss: he book while-reading to-sleep went). Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for kridanta processing (a minimal suffix-stripping sketch follows)
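A minimal sketch of kridanta analysis as suffix stripping over the forms listed above. The real system uses a morphotactics FSM as in the figure; the transliterated suffixes, their type names and aspect labels are taken from the Kridanta Types table, while the segmentation points and the matching logic are illustrative assumptions.

KRIDANTA_SUFFIXES = [       # (transliterated suffix, kridanta type, aspect per the table)
    ("NyaasaaThee", "Ne-kridanta",     "perfective"),
    ("lyaavar",     "laa-kridanta",    "perfective"),
    ("taanaa",      "taanaa-kridanta", "durative"),
    ("lele",        "lela-kridanta",   "perfective"),
    ("NaaRyaa",     "Nara-kridanta",   "stative"),
    ("aave",        "ve-kridanta",     "inceptive"),
    ("taa",         "taa-kridanta",    "stative"),
    ("un",          "Un-kridanta",     "completive"),
]

def analyse_kridanta(form):
    """Return (stem, kridanta type, aspect) for a transliterated Marathi form."""
    for suffix, ktype, aspect in sorted(KRIDANTA_SUFFIXES, key=lambda row: -len(row[0])):
        if form.endswith(suffix) and len(form) > len(suffix):
            return form[: -len(suffix)], ktype, aspect
    return form, None, None          # no kridanta suffix recognised

for form in ["vaachtaanaa", "vaachlele", "vaachun", "vaachtaa"]:
    print(form, "->", analyse_kridanta(form))
# each of these strips down to the stem "vaach" (read) with its kridanta type and aspect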

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types; values roughly between 0.86 and 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins.
• Relatively easier problem to solve.
• Will interlingua be better?
• Web sentences are being used to test the performance.
• Rule governed.
• Needs a high level of linguistic expertise.
• Will be an important contribution to IL MT.

130

131

Summary interlingua based MT

• English to Hindi.
• Rule governed.
• High level of linguistic expertise needed.
• Takes a long time to build (since 1996).
• But produces great insight, resources and tools.

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

26

27

Language Universals vs Language Typology

bull ldquoUniversalsrdquo is concerned with what human languages have in common while the study of typology deals with ways in which languages differ from each other

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade / Description
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language; you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (scale 0 to 1.2) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules]
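For any single module whose output can be compared against a gold annotation, the precision and recall plotted above are computed roughly as in the sketch below; the tag sets in the example are invented for illustration.

```python
def precision_recall(predicted, gold):
    # predicted, gold: sets of (token_position, label) pairs for one module
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented POS-tagger output vs. gold annotation for a 5-token sentence
pred = {(0, "PRP"), (1, "NN"), (2, "VM"), (3, "PSP"), (4, "SYM")}
gold = {(0, "PRP"), (1, "NN"), (2, "VM"), (3, "NST"), (4, "SYM")}
print(precision_recall(pred, gold))   # (0.8, 0.8)
```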

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation on translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on the score given according to the translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy
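The slide does not spell out how the single accuracy figure is derived from the per-sentence scores; one plausible reading, sketched below purely as an assumption, is the average comprehensibility grade normalised to a percentage of the maximum grade.

```python
def subjective_accuracy(grades, max_grade=4):
    # grades: one 0-4 comprehensibility grade per evaluated sentence
    # Assumed formula: mean grade expressed as a percentage of the maximum
    return 100.0 * sum(grades) / (max_grade * len(grades))

# Hypothetical grades for a handful of sentences (not the real 500-sentence data)
print(round(subjective_accuracy([3, 2, 4, 3, 1, 3, 2, 4]), 2))  # 68.75
```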

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
  च vaach (read) → चण vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
  च chav (bite) → चणर chaavaNaara (one who bites)
  खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → ब सना basun (after sitting)

Kridanta Types (type, example, aspect)

Ne-Kridanta ("ण"): Perfective
  vaachNyaasaaThee pustak de (Give me a book for reading)
  [For reading book give]

laa-Kridanta ("ल"): Perfective
  Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)
  [Article after reading will tell]

Taanaa-Kridanta ("त"): Durative
  Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)
  [Book while reading it in mind came]

Lela-Kridanta ("लल"): Perfective
  kaal vaachlele pustak de (Give me the book that (I/you) read yesterday)
  [Yesterday read book give]

Un-Kridanta ("ऊ"): Completive
  pustak vaachun parat kar (Return the book after reading it)
  [Book after reading back do]

Nara-Kridanta ("णर"): Stative
  pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)
  [Books to the one who reads knowledge gets]

ve-Kridanta: Inceptive
  he pustak pratyekaane vaachaave (Everyone should read this book)
  [This book everyone should read]

taa-Kridanta ("त"): Stative
  to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)
  [He book while reading to sleep went]

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for Kridanta processing
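The morphotactics FSM is shown only as a figure here; as a toy stand-in for the idea, the Python sketch below recognises kridanta forms by longest-suffix match over the transliterated suffix-to-aspect table from the earlier slides. The suffix spellings and the stemming are simplified assumptions, not the behaviour of the actual Marathi morph analyser.

```python
# Toy kridanta recogniser: map a verb form to (stem, kridanta type, aspect).
# Suffix forms are rough transliterations; the real FSM operates on Devanagari
# morphotactics and handles sandhi, which is ignored here.
KRIDANTA_SUFFIXES = [               # longest-match-first
    ("NyaasaaThee", "Ne-Kridanta",     "Perfective"),
    ("lyaavar",     "laa-Kridanta",    "Perfective"),
    ("taanaa",      "Taanaa-Kridanta", "Durative"),
    ("lele",        "Lela-Kridanta",   "Perfective"),
    ("Naaraa",      "Nara-Kridanta",   "Stative"),
    ("aave",        "ve-Kridanta",     "Inceptive"),
    ("un",          "Un-Kridanta",     "Completive"),
    ("taa",         "taa-Kridanta",    "Stative"),
]

def analyse_kridanta(word):
    for suffix, ktype, aspect in KRIDANTA_SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)], ktype, aspect
    return word, None, None          # not a kridanta form

print(analyse_kridanta("vaachtaanaa"))  # ('vaach', 'Taanaa-Kridanta', 'Durative')
print(analyse_kridanta("vaachun"))      # ('vaach', 'Un-Kridanta', 'Completive')
```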

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall, roughly in the range 0.86 to 0.98, for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta processing]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

28

Typology basic word order

bull SOV (Japanese Tamil Turkish etc) bull SVO (Fula Chinese English etc) bull VSO (Arabic Tongan Welsh etc)

ldquoSubjects tend strongly to precede objectsrdquo

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

UNL expression: obj(contact(…

Module: output (gloss)

Lexeme Selection:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Case Identification:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Morphology Generation:
  सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact imperative farmer.pl this you region taluka manchar Khatav)

Function Word Insertion:
  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this for you region or taluka of Manchar Khatav)

Linearization:
  इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this for you Manchar region or Khatav taluka of farmers contact)

[Figure: UNL graph of the example: nodes contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, nam, or; scope :01]

Lexeme Selection

[सपक+ ] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous.

[UNL graph fragment: contact (सपक+), agt → you (आप), obj → farmer (निकस), pur → this (य)]

Case Marking

• Depends on the UNL relation and the properties of the nodes (a rule-lookup sketch follows below)
• Case gets transferred from head to modifiers

[UNL graph fragment: contact (सपक+), agt → you (आप), obj → farmer (निकस), pur → this (य)]

Relation   Parent+   Parent-   Child+   Child-
Obj        V         VINT      N
Agt        V         past      N
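A minimal sketch of how such a rule table can be applied; the rule encoding and the returned tags are illustrative assumptions, since the slide does not show the actual case values.

# Hedged sketch: deciding whether a case-marking rule fires, keyed on the UNL
# relation and on required (+) / forbidden (-) properties of the parent and
# child nodes, as in the table above. The rule rows mirror that table; the
# returned tag is a placeholder because the slide does not show case values.

CASE_RULES = [
    # (relation, parent+, parent-, child+, child-, case_tag)
    ("obj", {"V"}, {"VINT"}, {"N"}, set(), "CASE_OBJ"),
    ("agt", {"V"}, {"past"}, {"N"}, set(), "CASE_AGT"),
]

def case_tag(relation, parent_props, child_props):
    for rel, p_plus, p_minus, c_plus, c_minus, tag in CASE_RULES:
        if (rel == relation
                and p_plus <= parent_props and not (p_minus & parent_props)
                and c_plus <= child_props and not (c_minus & child_props)):
            return tag          # the case then propagates from head to modifiers
    return None

# obj(contact, farmer): parent is a (non-intransitive) verb, child an animate noun
print(case_tag("obj", {"V"}, {"N", "ANIMT"}))   # CASE_OBJ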

Morphology Generation: Nouns

Suffix   Attribute values
uoM      NNUMploblique
U        NNUMsgoblique
I        NNIFsgoblique
iyoM     NNIFploblique
oM       NNANOTCHFploblique

Examples:
  The boy saw me:   लड़क माझ दखा
  Boys saw me:      लड़क माझ दखा
  The King saw me:  र2 माझ दखा
  Kings saw me:     र2ऒ माझ दखा

Verb Morphology

Suffix          Tense     Aspect     Mood      N    Gen      P     VE
-e rahaa thaa   past      progress   -         sg   male     3rd   e
-taa hai        present   custom     -         sg   male     3rd   -
-iyaa thaa      past      complete   -         sg   male     3rd   I
saktii hain     present   -          ability   pl   female   3rd   A
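A minimal sketch of suffix selection as a lookup over this table; the function and key layout are assumptions made for illustration.

# Hedged sketch: suffix selection for Hindi verb morphology as a lookup on
# (tense, aspect, mood, number, gender, person). The rows mirror the table
# above; the function and key layout are illustrative assumptions.

VERB_SUFFIXES = {
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="male", person="3rd"):
    return VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person))

print(verb_suffix("past", aspect="progress"))          # -e rahaa thaa
print(verb_suffix("present", mood="ability",
                  number="pl", gender="female"))       # saktii hain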

After Morphology Generation

[UNL graph before: contact (सपक+), agt → you (आप), obj → farmer (निकस), pur → this (य)]
[UNL graph after: contact (सपक+ क0जि2ए), agt → you (आप), obj → farmer (निकस4), pur → this (इस)]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

[UNL graph after: contact (सपक+ क0जि2ए), agt → you (आप), obj → farmer (निकस4 क), pur → this (इसक लिलए)]

Rel   Par+   Par-   Chi+      Chi-    ChFW
Obj   V      VINT   NANIMT    topic   क

Linearization

इस आप माचर कषतर   (this you manchar region)
खाट तलक निकस4     (khatav taluka farmers)
सपक+              (contact)

[Figure: UNL graph of the example, with contact realized as सपक+ क0जि2ए]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

Relative ordering (L/R) between relations sharing a relatum:

      agt  aoj  obj  ins
Agt        L    L    L
Aoj             L    L
Obj                  R
Ins

[UNL graph fragments of the running example]

Syntax Planning: Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
  (for the example node contact, all of obj, pur and agt precede the head)
• Topologically sort each group: pur agt obj → this you farmer
• Final order: this you farmer contact (a code sketch of this strategy follows below)

[Remaining graph fragment: region, khatav, manchar taluka]
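A minimal sketch of this strategy; the priority entries for pur and the data layout are assumptions added for illustration, chosen so that the sketch reproduces the order shown on this slide.

# Hedged sketch: order a head's dependents using a Before/After split and a
# pairwise priority table like the one on the assumptions slide. The entries
# for pur and the data layout are assumptions chosen to reproduce the order
# shown on this slide.

# 'L' means the first relation's child is placed to the left of the second's.
PRIORITY = {("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
            ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R",
            ("pur", "agt"): "L", ("pur", "obj"): "L"}   # pur entries: assumption

BEFORE_RELATIONS = {"obj", "pur", "agt"}   # relations realized before the head

def order_children(head, children):
    """children: list of (relation, word). Returns the word order around head."""
    before = [c for c in children if c[0] in BEFORE_RELATIONS]
    after = [c for c in children if c[0] not in BEFORE_RELATIONS]

    def sort_group(group):
        ordered = []                       # insertion sort driven by PRIORITY
        for rel, word in group:
            pos = len(ordered)
            for i, (r2, _) in enumerate(ordered):
                if PRIORITY.get((rel, r2)) == "L" or PRIORITY.get((r2, rel)) == "R":
                    pos = i
                    break
            ordered.insert(pos, (rel, word))
        return ordered

    return [w for _, w in sort_group(before)] + [head] + [w for _, w in sort_group(after)]

print(order_children("contact", [("obj", "farmer"), ("pur", "this"), ("agt", "you")]))
# ['this', 'you', 'farmer', 'contact']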

Syntax Planning: Algo

[Worked trace of the algorithm, with columns Stack | Before(Current) | After(Current) | Output: the stack starts with contact and grows as farmer, region and taluka are expanded (e.g. region farmer contact; manchar region taluka farmer contact; taluka farmer contact), while the output accumulates "this you", then "this you manchar", then "this you manchar region".]

[Figure: UNL graph of the example, with contact realized as सपक+ क0जि2ए]

All Together (UNL → Hindi)

The full pipeline on the running example, starting from the UNL expression obj(contact(… (module outputs as in the step-through above):

Lexeme Selection:         contact farmer this you region taluka manchar khatav
Case Identification:      contact farmer this you region taluka manchar khatav
Morphology Generation:    contact imperative farmer.pl this you region taluka manchar Khatav
Function Word Insertion:  contact farmers this for you region or taluka of Manchar Khatav
Syntax Linearization:     this for you Manchar region or Khatav taluka of farmers contact

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation?
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations (Hindi output, with Fluency and Adequacy scores)

• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (Fluency 2, Adequacy 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (Fluency 3, Adequacy 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (Fluency 1, Adequacy 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (Fluency 4, Adequacy 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (Fluency 2, Adequacy 3)

Results

                         BLEU    Fluency   Adequacy
Geometric Average        0.34    2.54      2.84
Arithmetic Average       0.41    2.71      3.00
Standard Deviation       0.25    0.89      0.89
Pearson Cor. (BLEU)      1.00    0.59      0.50
Pearson Cor. (Fluency)   0.59    1.00      0.68
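For reference, the Pearson correlations in this table can be computed from per-sentence scores as in the following sketch; the score lists here are made-up placeholders, not the actual evaluation data.

# Hedged sketch: Pearson correlation between per-sentence BLEU and fluency
# scores. The score lists below are made-up placeholders, not the actual
# evaluation data behind the table above.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu    = [0.34, 0.12, 0.55, 0.40, 0.28]   # placeholder per-sentence BLEU
fluency = [3,    2,    4,    3,    2]      # placeholder per-sentence fluency (1-4)
print(round(pearson(bleu, fluency), 2))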

Fluency vs Adequacy

[Chart: number of sentences at each fluency level (1 to 4), broken down by adequacy level (Adequacy 1 to 4); y-axis from 0 to 200]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: the Vauquois triangle, levels of MT from direct translation through syntactic and semantic transfer up to interlingua, as shown earlier]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeiTY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine, H2 Morph analyser engine, H3 Generator engine, H4 Lexical disambiguation engine, H5 Named entity engine, H6 Dictionary standards, H7 Corpora annotation standards, H8 Evaluation of output (comprehensibility), H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker, V2 Morph analyzer, V3 Generator, V4 Named entity recognizer, V5 Bilingual dictionary (bidirectional), V6 Transfer grammar, V7 Annotated corpus, V8 Evaluation, V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade  Description
0      Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (y-axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
  च vaach (read) → चण vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
  च chav (bite) → चणर chaavaNaara (one who bites)
  खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → ब सना basun (after sitting)

Kridanta Types (kridanta type: example, aspect)

• Ne-Kridanta (ण): vaachNyaasaaThee pustak de (Give me a book for reading) [for reading book give]. Aspect: Perfective
• laa-Kridanta (ल): Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after reading will tell]. Aspect: Perfective
• Taanaa-Kridanta (त): Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while reading it in mind came]. Aspect: Durative
• Lela-Kridanta (लल): kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give]. Aspect: Perfective
• Un-Kridanta (ऊ): pustak vaachun parat kar (Return the book after reading it) [book after reading back do]. Aspect: Completive
• Nara-Kridanta (णर): pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to the one who reads knowledge gets]. Aspect: Stative
• ve-Kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should read]. Aspect: Inceptive
• taa-Kridanta (त): to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while reading to sleep went]. Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

[Figure: Morphotactics FSM for Kridanta processing]
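As a rough illustration, kridanta morphotactics can be modelled as a small finite-state machine over verb stem plus participial suffix; the states, stem list and transliterated suffixes below are simplifying assumptions, not the FSM from the ICON 2011 paper.

# Hedged sketch: a tiny FSM for kridanta morphotactics, using transliterated
# suffixes from the table above. States, transitions and the stem list are
# simplifying assumptions, not the FSM from the ICON 2011 paper.

KRIDANTA_SUFFIXES = {
    "Nyaa": "Ne-Kridanta",   "lyaavar": "laa-Kridanta", "taanaa": "Taanaa-Kridanta",
    "lele": "Lela-Kridanta", "un": "Un-Kridanta",       "Naaraa": "Nara-Kridanta",
    "aave": "ve-Kridanta",   "taa": "taa-Kridanta",
}
VERB_STEMS = {"vaach", "paL", "bas", "khaa"}

def analyse(word):
    """Two-state FSM: STEM --(known stem)--> SUFFIX --(known suffix)--> ACCEPT."""
    for stem in VERB_STEMS:                                   # state 1: match a verb stem
        if word.startswith(stem):
            rest = word[len(stem):]
            for suffix, ktype in KRIDANTA_SUFFIXES.items():   # state 2: match a suffix
                if rest.startswith(suffix):
                    return {"stem": stem, "suffix": suffix, "type": ktype}
    return None                                               # reject: not a kridanta form

print(analyse("vaachtaanaa"))   # {'stem': 'vaach', 'suffix': 'taanaa', 'type': 'Taanaa-Kridanta'}
print(analyse("paLtaanaa"))     # {'stem': 'paL', 'suffix': 'taanaa', 'type': 'Taanaa-Kridanta'}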

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall (y-axis 0.86 to 0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

29

Typology Morphotactics of singular and plural

bull No expression Japanese hito person pl hito

bull Function word Tagalog bato stone pl mga bato

bull Affixation Turkish ev house pl ev-ler Swahili m-toto child pl wa-toto

bull Sound change English man pl men Arabic rajulun man pl rijalun

bull Reduplication Malay anak child pl anak-anak

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

30

Typology Implication of word order

Due to its SVO nature English hasndash preposition+noun (in the house) ndash genitive+noun (Toms house) or noun+genitive

(the house of Tom) ndash auxiliary+verb (will come) ndash noun+relative clause (the cat that ate the rat) ndash adjective+standard of comparison (better than

Tom)

31

Typology motion verbs (13)

bull Motion is expressed differently in different languages and the differences turn out to be highly significant

bull There are two large typesndash verb-framed and ndash satellite-framed languages

32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing: WSD

Word         Synset-id   Sense-id
Will-Smith   -           -
was          02610777    be%2:42:03
eating       01170802    eat%2:34:00
an           -           -
apple        07755101    apple%1:13:00
with         -           -
a            -           -
fork         03388794    fork%1:06:00
on           -           -
the          -           -
bank         09236472    bank%1:17:01
of           -           -
the          -           -
river        09434308    river%1:17:00

Word         POS
Will-Smith   Proper noun
was          Verb
eating       Verb
an           Determiner
apple        Noun
with         Preposition
a            Determiner
fork         Noun
on           Preposition
the          Determiner
bank         Noun
of           Preposition
the          Determiner
river        Noun

Semantic Processing: Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word         Universal Word
Will-Smith   Will-Smith(iof>PERSON)
eating       eat(icl>consume, equ>consumption)
apple        apple(icl>edible fruit>thing)
fork         fork(icl>cutlery>thing)
bank         bank(icl>slope>thing)
river        river(icl>stream>thing)

Semantic Processing: Nouns originating from verbs

• Some nouns originate from verbs
• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2

Algorithm
• Traverse the hypernym path of the noun to its root
• If any word on the path is "action", then the noun originates from a verb

Examples
• removal of garbage = removing garbage
• collection of stamps = collecting stamps
• wastage of food = wasting food
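A minimal sketch of the hypernym-path test above, using NLTK's WordNet interface; the keyword set {"act", "action"} and the function name are illustrative assumptions rather than the thesis code.

from nltk.corpus import wordnet as wn

def is_deverbal_noun(noun):
    """True if some hypernym path of the noun passes through an 'act'/'action' synset."""
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():          # each path runs up to the root entity.n.01
            for hyper in path:
                if {"act", "action"} & set(hyper.lemma_names()):
                    return True
    return False

# 'removal of garbage' would then be treated as obj(remove, garbage)
print(is_deverbal_noun("removal"), is_deverbal_noun("apple"))   # expected: True False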

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into Noun and Verb features

Relation         Word   Property
agt(uw1, uw2)    uw1    Volitional verb
met(uw1, uw2)    uw2    Abstract thing
dur(uw1, uw2)    uw2    Time period
ins(uw1, uw2)    uw2    Concrete thing

Semantic Processing: Noun Features

Algorithm
• Add the word's NE type to the word's features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun         Feature
Will-Smith   PERSON, ANIMATE
bank         PLACE
fork         CONCRETE

Types of verbs

• Verb features are based on the types of verbs
• There are three types
  – Do: verbs that are performed volitionally
  – Occur: verbs that happen involuntarily, or just happen without any performer
  – Be: verbs which tell about the existence of something or about some fact
• We classify these three into "do" and "not-do"

Semantic Processing: Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type of verb or not
• "do" verbs are volitional verbs, which are performed voluntarily
• Heuristic
  – If a frame has the form "Something/somebody …s something/somebody", then the verb is "do" type
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do"

Word     Type     Feature
was      not-do   -
eating   do       do
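A small sketch of the heuristic above over WordNet-style sentence frames (e.g. "Somebody ----s something"); the regular expression and function name are assumptions, not the authors' implementation.

import re

# frames of the form "Something/Somebody ----s something/somebody" signal a volitional ("do") verb
_DO_FRAME = re.compile(r"^some(thing|body)\s+----s\s+some(thing|body)$", re.IGNORECASE)

def is_do_verb(frames):
    return any(_DO_FRAME.match(f.strip()) for f in frames)

print(is_do_verb(["Somebody ----s something"]))          # eat  -> True  (marked "do")
print(is_do_verb(["It is ----ing", "Something ----s"]))  # rain -> False (marked "not-do")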

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately
• Similar relations are handled in clusters
• Rule-based approach
• Rules are applied on
  – Stanford Dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Relation generated
• nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is do type and active → agt(eat, Will-Smith)
• dobj(eating, apple); eat is active → obj(eat, apple)
• prep(eating, with); pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
• prep(eating, on); pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
• prep(bank, of); pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)
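A minimal sketch of the first rule in the table above (the agt rule): if the Stanford relation is nsubj(V, N), the noun has the ANIMATE feature, and the verb is "do"-type and active, emit agt(V, N). The data structures and names are illustrative assumptions; the real system applies many such rules per relation cluster.

def agt_rule(dep_relations, noun_features, verb_info):
    """dep_relations: iterable of (name, head, dependent) triples from the Stanford parser."""
    generated = []
    for name, head, dep in dep_relations:
        if (name == "nsubj"
                and "ANIMATE" in noun_features.get(dep, set())
                and verb_info.get(head, {}).get("type") == "do"
                and verb_info.get(head, {}).get("voice") == "active"):
            generated.append(("agt", head, dep))
    return generated

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple")]
print(agt_rule(deps, {"Will-Smith": {"ANIMATE", "PERSON"}}, {"eat": {"type": "do", "voice": "active"}}))
# -> [('agt', 'eat', 'Will-Smith')]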

Attribute Generation (1/2)

• Attributes are handled in clusters
• Attributes are generated by rules
• Rules are applied on
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Attributes generated
• eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past
• det(apple, an) → apple.@indef
• det(fork, a) → fork.@indef
• det(bank, the) → bank.@def
• det(river, the) → river.@def
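A short sketch of two attribute rules from the table above (determiners, and XLE tense/aspect facts); the rule encoding and names are illustrative assumptions.

def determiner_attributes(dep_relations):
    """dep_relations: (name, head, dependent) triples; yields (word, attribute) pairs."""
    for name, head, dep in dep_relations:
        if name == "det":
            yield head, "@indef" if dep.lower() in ("a", "an") else "@def"

def tense_aspect_attributes(xle_facts, verb):
    """xle_facts: dict such as {'TENSE': 'past', 'PROG': '+'} for the verb."""
    attrs = []
    if xle_facts.get("TENSE") == "past":
        attrs.append("@past")
    if xle_facts.get("PROG") == "+":
        attrs.append("@progress")
    return verb, attrs

print(list(determiner_attributes([("det", "apple", "an"), ("det", "bank", "the")])))
# -> [('apple', '@indef'), ('bank', '@def')]
print(tense_aspect_attributes({"TENSE": "past", "PROG": "+"}, "eat"))
# -> ('eat', ['@past', '@progress'])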

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger
• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = the attributes matched between UW_gen and UW_gold
• count(UW_x) = the attributes in UW_x
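The relation-wise precision, recall and F-score reported below are not spelled out on the slides; under the matching criterion above they are presumably the standard set-overlap measures, with attribute accuracy built from amatch and count. A hedged sketch in LaTeX, where T_gen and T_gold are the sets of generated and gold relation triples:

P = \frac{|T_{gen} \cap T_{gold}|}{|T_{gen}|}, \qquad
R = \frac{|T_{gen} \cap T_{gold}|}{|T_{gold}|}, \qquad
F = \frac{2PR}{P + R}

\mathrm{Acc}_{attr} = \frac{\sum |amatch(UW_{gen}, UW_{gold})|}{\sum count(UW_{gold})}
\quad \text{(summed over UW pairs with } UW_{gen} = UW_{gold}\text{)}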

Results: Relation-wise accuracy

Results: Overall accuracy

Results: Relation

Results: Relation-wise Precision

Results: Relation-wise Recall

Results: Relation-wise F-score

Results: Attribute

[Chart: Attribute Accuracy for Scenarios 1-4]

Results: Time taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus

[Chart: Time taken for Scenarios 1-4]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: false positives per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Chart: false negatives per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• George the president of the football club, James the coach of the club and the players went to watch their rival's game

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibilities

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module: Output (UNL expression: obj(contact(…)

Lexeme Selection:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Case Identification:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Morphology Generation:
  सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact-imperative farmer-pl this you region taluka manchar Khatav)

Function Word Insertion:
  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this for you region or taluka of Manchar Khatav)

Linearization:
  इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this for you manchar region or khatav taluka of farmers contact)

[Figure: UNL graph of the example; nodes contact, you, this, farmer, region, khatav, manchar, taluka; relations agt, obj, pur, plc, nam, or; scope :01]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous

[UNL subgraph with chosen lexemes: contact = सपक+, you = आप, this = य, farmer = निकस; relations agt, obj, pur]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from the head to the modifiers

Relation   Parent+   Parent-   Child+   Child-
obj        V         VINT      N
agt        V         past      N

[UNL subgraph with chosen lexemes: contact = सपक+, you = आप, this = य, farmer = निकस; relations agt, obj, pur]

Morphology Generation: Nouns

Suffix   Attribute values
uoM      NNUMploblique
U        NNUMsgoblique
I        NNIFsgoblique
iyoM     NNIFploblique
oM       NNANOTCHFploblique

The boy saw me    – लड़क माझ दखा
Boys saw me       – लड़क माझ दखा
The King saw me   – र2 माझ दखा
Kings saw me      – र2ऒ माझ दखा

Verb Morphology

Suffix          Tense     Aspect     Mood      N    Gen      P     VE
-e rahaa thaa   past      progress   -         sg   male     3rd   e
-taa hai        present   custom     -         sg   male     3rd   -
-iyaa thaa      past      complete   -         sg   male     3rd   I
saktii hain     present   -          ability   pl   female   3rd   A
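A small sketch of how the verb-morphology table above can be used as a lookup keyed on tense/aspect/mood and agreement features; the dictionary and function names are illustrative assumptions, and the stem-final vowel (VE) handling is omitted.

SUFFIX_TABLE = {
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, num="sg", gen="male", person="3rd"):
    return SUFFIX_TABLE.get((tense, aspect, mood, num, gen, person))

print(verb_suffix("past", aspect="progress"))   # -> '-e rahaa thaa'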

After Morphology Generation

[UNL subgraph before: contact = सपक+, you = आप, this = य, farmer = निकस; relations agt, obj, pur]

[UNL subgraph after: contact = सपक+ क0जि2ए, you = आप, this = इस, farmer = निकस4; relations agt, obj, pur]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

[UNL subgraph with function words: contact = सपक+ क0जि2ए, you = आप, this = इसक लिलए, farmer = निकस4 क; relations agt, obj, pur]

Rel   Par+   Par-   Chi+       Chi-    ChFW
obj   V      VINT   N,ANIMT    topic   क
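A sketch of the function-word rule shown above: for an obj relation whose parent is a verb (but not intransitive) and whose child is an animate noun without the topic attribute, a postposition (shown as क in the extracted table, presumably Hindi ko) is attached to the child. The rule encoding and names are assumptions.

RULES = [
    # (relation, parent_has, parent_lacks, child_has, child_lacks, child_function_word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "ko"),
]

def insert_function_word(relation, parent_feats, child_feats):
    for rel, p_plus, p_minus, c_plus, c_minus, fw in RULES:
        if (relation == rel and p_plus <= parent_feats and not (p_minus & parent_feats)
                and c_plus <= child_feats and not (c_minus & child_feats)):
            return fw
    return None

print(insert_function_word("obj", {"V"}, {"N", "ANIMT"}))   # -> 'ko'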

Linearization

इस आप माचर कषतर खाट तलक निकस4 सपक+
(this you manchar region khatav taluka farmers contact)

[Figure: UNL graph of the example with lexicalized nodes (contact = सपक+ क0जि2ए); nodes you, this, farmer, region, khatav, manchar, taluka; relations agt, obj, pur, plc, nam, or; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

[UNL subgraph: contact with relations agt (you), obj (farmer), pur (this)]

Pairwise ordering table (L = row relation is placed to the left of the column relation, R = to the right):
       agt   aoj   obj   ins
agt          L     L     L
aoj                L     L
obj                      R
ins

Syntax Planning Strategy

• Divide a node's relations into
  – Before_set = {obj, pur, agt}
  – After_set = {}
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

[UNL subgraph: contact with agt (you), obj (farmer), pur (this); plc (region: khatav / manchar taluka)]
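A minimal sketch of this before/after ordering (not the deconverter's code): relation types are assigned a side and a priority consistent with the small example here, children are sorted within each side, and the head is emitted in between. The side/priority values are assumptions beyond what the slides state.

PRIORITY = {"pur": 0, "agt": 1, "aoj": 2, "obj": 3, "ins": 4}        # lower = further left
SIDE = {"pur": "L", "agt": "L", "aoj": "L", "obj": "L", "ins": "R"}  # assumed Before/After assignment

def linearize(head, children):
    """children: list of (relation, node) pairs attached to the head."""
    before = sorted((c for c in children if SIDE[c[0]] == "L"), key=lambda c: PRIORITY[c[0]])
    after = sorted((c for c in children if SIDE[c[0]] == "R"), key=lambda c: PRIORITY[c[0]])
    return [n for _, n in before] + [head] + [n for _, n in after]

print(linearize("contact", [("obj", "farmer"), ("pur", "this"), ("agt", "you")]))
# -> ['this', 'you', 'farmer', 'contact']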

Syntax Planning Algo

[Worked trace of the stack-based linearization over the example; columns: Stack, Before Current, After Current, Output]

[Figure: UNL graph of the example; nodes contact, you, this, farmer, region, khatav, manchar, taluka; relations agt, obj, pur, plc, nam, or; scope :01]

All Together (UNL → Hindi)

Module: Output (UNL expression: obj(contact(…); glosses as in the step-through above)

Lexeme Selection:        contact farmer this you region taluka manchar khatav
Case Identification:     contact farmer this you region taluka manchar khatav
Morphology Generation:   contact-imperative farmer-pl this you region taluka manchar Khatav
Function Word Insertion: contact farmers this for you region or taluka of Manchar Khatav
Syntax Linearization:    this for you manchar region or khatav taluka of farmers contact

How to Evaluate UNL Deconversion?

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken; understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output (Fluency, Adequacy)
• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (Fluency 2, Adequacy 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (Fluency 3, Adequacy 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (Fluency 1, Adequacy 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (Fluency 4, Adequacy 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (Fluency 2, Adequacy 3)

Results

                        BLEU   Fluency   Adequacy
Geometric Average       0.34   2.54      2.84
Arithmetic Average      0.41   2.71      3.00
Standard Deviation      0.25   0.89      0.89
Pearson Cor. (BLEU)     1.00   0.59      0.50
Pearson Cor. (Fluency)  0.59   1.00      0.68

Fluency vs Adequacy

[Stacked bar chart: number of sentences per Fluency score (1-4), broken down by Adequacy score (1-4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: Vauquois triangle of MT approaches, from direct translation at the graphemic/tagged-text level, through surface and deep syntactic transfer, semantic and multilevel transfer, up to semantico-linguistic and ontological interlinguas]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)
Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine, H2 Morph analyser engine, H3 Generator engine, H4 Lexical disambiguation engine, H5 Named entity engine, H6 Dictionary standards, H7 Corpora annotation standards, H8 Evaluation of output (comprehensibility), H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker, V2 Morph analyzer, V3 Generator, V4 Named entity recognizer, V5 Bilingual dictionary (bidirectional), V6 Transfer grammar, V7 Annotated corpus, V8 Evaluation, V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale
Grade  Description
0      Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language: you understand those words but nothing more)
2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on the score given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
• च vaach (read) → चण vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
• च chav (bite) → चणर chaavaNaara (one who bites)
• खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

Kridanta type, example and aspect:

• ण (Ne)-Kridanta – Perfective
  vaachNyaasaaThee pustak de (Give me a book for reading)
  [for reading book give]
• ल (laa)-Kridanta – Perfective
  Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)
  [article after reading will tell]
• त (Taanaa)-Kridanta – Durative
  Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)
  [book while reading it in mind came]
• लल (Lela)-Kridanta – Perfective
  kaal vaachlele pustak de (Give me the book that (I/you) read yesterday)
  [yesterday read book give]
• ऊ (Un)-Kridanta – Completive
  pustak vaachun parat kar (Return the book after reading it)
  [book after reading back do]
• णर (Nara)-Kridanta – Stative
  pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)
  [books to the one who reads knowledge gets]
• (ve)-Kridanta – Inceptive
  he pustak pratyekaane vaachaave (Everyone should read this book)
  [this book everyone should read]
• त (taa)-Kridanta – Stative
  to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)
  [he book while reading to sleep went]
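An illustrative sketch of suffix-driven kridanta identification over transliterated Marathi forms, based on the types listed above. The suffix strings and stem splitting are simplified assumptions; the real system uses a morphotactics FSM over Devanagari forms (see the figure below).

KRIDANTA_SUFFIXES = [
    ("NyaasaaThee", "Ne-Kridanta"),
    ("lyaavar",     "laa-Kridanta"),
    ("taanaa",      "Taanaa-Kridanta"),
    ("lele",        "Lela-Kridanta"),
    ("NaaRyaalaa",  "Nara-Kridanta"),
    ("aave",        "ve-Kridanta"),
    ("un",          "Un-Kridanta"),
]

def kridanta_type(word):
    for suffix, ktype in KRIDANTA_SUFFIXES:   # most specific suffixes are listed first
        if word.endswith(suffix):
            return word[:-len(suffix)], ktype
    return word, None

for w in ["vaachNyaasaaThee", "vaachtaanaa", "vaachlele", "vaachun"]:
    print(w, "->", kridanta_type(w))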

Participial Suffixes in Other Agglutinative Languages

Kannada:
  muridiruwaa kombe jennu esee
  [Broken to branch throw]
  "Throw away the broken branch"
  – similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu:
  ame padutunnappudoo nenoo panichesanoo
  [she singing I work]
  "I worked while she was singing"
  – similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish:
  hazirlanmis plan
  [prepare-PAST plan]
  "The plan which has been prepared"
  – equivalent to Marathi lelaa

Morphological Processing of Kridanta Forms (cont.)

[Figure: Morphotactics FSM for Kridanta Processing]

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall per kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, Va); values in the 0.86-0.98 range]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT



32

Typology motion verbs (23)

bull In satellite-framed languages like English the motion verb typically also expresses manner or causendash The bottle floated out of the cave (Manner)ndash The napkin blew off the table (Cause)ndash The water rippled down the stairs (Manner)

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Current active language groups:
  - UNL_French (GETA-CLIPS, IMAG)
  - UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  - UNL_Italian (Univ. of Pisa)
  - UNL_Portugese (Univ. of Sao Paolo, Brazil)
  - UNL_Russian (Institute of Linguistics, Moscow)
  - UNL_Spanish (UPM, Madrid)

38

World-wide Universal Networking Language (UNL) Project

• Language-independent meaning representation

(Diagram: UNL at the hub, connected to English, Russian, Japanese, Hindi, Spanish, Marathi and other languages.)

39

UNL represents knowledge John eats rice with a spoon

Semantic relations, attributes, Universal Words.

Repository of 42 semantic relations and 84 attribute labels.
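
For concreteness, a plausible UNL expression for this sentence, written in the relation notation used on the following slides (the exact attribute choices here are an assumption, not taken from the slide):

agt(eat.@entry.@present, John)
obj(eat.@entry.@present, rice)
ins(eat.@entry.@present, spoon)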

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41

The Lexicon

Content words

[forward]  "forward(icl>send)"    (V,VOA)                <E,0,0>
[mail]     "mail(icl>message)"    (N,PHSCL,INANI)        <E,0,0>
[minister] "minister(icl>person)" (N,ANIMT,PHSCL,PRSN)   <E,0,0>

Format: Headword, Universal Word, Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he]  "he"  (PRON,SUB,SING,3RD) <E,0,0>
[the] "the" (ART,THE)           <E,0,0>
[to]  "to"  (PRE,TO)            <E,0,0>

Format: Headword, Universal Word, Attributes

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve center is the verb.
• English sentences have the form:
  - A <verb> B, or
  - A <verb>

(Diagram: the verb as the root node, linked by relations to A and B; recurse on A and B.)

Enconversion

Long sequence of masters' theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

(Architecture diagram: the input sentence passes through NER, the Stanford Dependency Parser, a Clause Marker and a Simplifier; each simple sentence goes to a Simple Sentence Enconverter, which uses the Stanford Dependency Parser, the XLE Parser, WSD, Feature Generation, Relation Generation and Attribute Generation; a Merger then combines the per-sentence UNLs into one UNL.)

Complexity of handling long sentences

Problem: as the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones.
  - A sentence can only be broken at the clausal level: Simplification.
  - Once broken, the pieces need to be joined: Merging.
• Reduce the reliance of the Enconverter on the XLE parser.
  - The rules for relation and attribute generation are formed on Stanford Dependency relations.

Text Simplifier

• It simplifies a complex or compound sentence.
• Example:
  - Sentence: John went to play and Jane went to school.
  - Simplified: John went to play. Jane went to school.
• The new simplifier is developed on the clause markings and the inter-clause relations provided by the Stanford dependency parser.

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL.
• Example:
  - Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  - Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases based on how the sentence was simplified; the Merger uses rules for each of these cases to merge the UNLs.
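
A minimal sketch of the scope-renumbering idea, assuming each simple-sentence UNL is held as a list of (relation, arg1, arg2) triples and that the main verb carries .@entry; the helper names and the connective handling are illustrative, not the actual Merger code.

def entry_word(unl):
    """Return the headword of the .@entry node, e.g. 'go' from 'go.@entry'."""
    for _, a, b in unl:
        for arg in (a, b):
            if ".@entry" in arg:
                return arg.split(".")[0]

def merge_unls(main, sub, connective="and", scope="01"):
    merged = [(rel, a, b) for rel, a, b in main]          # main clause stays as-is
    # link the two clauses, e.g. and(go, :01)
    merged.append((connective, entry_word(main), ":" + scope))
    # relations of the second clause move into scope :01 -> agt:01(...), gol:01(...)
    merged += [(rel + ":" + scope, a, b) for rel, a, b in sub]
    return merged

main = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
sub  = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
print(merge_unls(main, sub))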

NLP tools and resources for UNL generation

Tools:
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources:
• Wordnet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing:
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing:
• Stems of words
• WSD
• Nouns originating from verbs
• Feature Generation (noun features, verb features)

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types.
• Types covered:
  - Person
  - Place
  - Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river.
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river.

Syntactic Processing: Stanford Dependency Parser (1/2)

• The Stanford Dependency Parser parses the sentence and produces:
  - POS tags of the words in the sentence
  - the dependency parse of the sentence

Syntactic Processing: Stanford Dependency Parser (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1), aux(eating-3, was-2), det(apple-5, an-4), dobj(eating-3, apple-5), prep(eating-3, with-6), det(fork-8, a-7), pobj(with-6, fork-8), prep(eating-3, on-9), det(bank-11, the-10), pobj(on-9, bank-11), prep(bank-11, of-12), det(river-14, the-13), pobj(of-12, river-14)

Syntactic Processing: XLE Parser

• The XLE parser generates dependency relations and some other important information.

Input: Will-Smith was eating an apple with a fork on the bank of the river

Dependency relations:
OBJ(eat8, apple16), SUBJ(eat8, will-smith0), ADJUNCT(apple16, on20), OBJ(on20, bank22), ADJUNCT(apple16, with17), OBJ(with17, fork19), ADJUNCT(bank22, of23), OBJ(of23, river25)

Important information:
PASSIVE(eat8, -), PROG(eat8, +), TENSE(eat8, past), VTYPE(eat8, main)

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

• Wordnet is used for finding the stems of the words.

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word    -> Stem
was     -> be
eating  -> eat
apple   -> apple
fork    -> fork
bank    -> bank
river   -> river
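
A sketch of how WordNet can supply the stems, here via NLTK's morphy interface; the slides only say that Wordnet is used, so the specific tooling is an assumption.

from nltk.corpus import wordnet as wn

def stem(word, pos=None):
    """Return the WordNet base form of a word, falling back to the word itself."""
    # morphy applies WordNet's detachment rules and exception lists,
    # e.g. 'eating' -> 'eat', 'was' -> 'be'
    base = wn.morphy(word.lower(), pos)
    return base if base else word.lower()

for w, p in [("was", wn.VERB), ("eating", wn.VERB), ("apple", wn.NOUN)]:
    print(w, "->", stem(w, p))   # be, eat, apple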

Semantic Processing: WSD

WSD output:
Word        Synset-id  Sense-id
Will-Smith  -          -
was         02610777   be24203
eating      01170802   eat23400
an          -          -
apple       07755101   apple11300
with        -          -
a           -          -
fork        03388794   fork10600
on          -          -
the         -          -
bank        09236472   bank11701
of          -          -
the         -          -
river       09434308   river11700

POS tags:
Word        POS
Will-Smith  Proper noun
was         Verb
eating      Verb
an          Determiner
apple       Noun
with        Preposition
a           Determiner
fork        Noun
on          Preposition
the         Determiner
bank        Noun
of          Preposition
the         Determiner
river       Noun

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++).

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word        Universal Word
Will-Smith  Will-Smith(iof>PERSON)
eating      eat(icl>consume, equ>consumption)
apple       apple(icl>edible fruit>thing)
fork        fork(icl>cutlery>thing)
bank        bank(icl>slope>thing)
river       river(icl>stream>thing)
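
A sketch of the lookup, assuming UW++ has been loaded into an in-memory dictionary keyed by the WSD synset id; the entries shown are just the ones from the table above, and the named-entity shortcut is an assumption.

# Hypothetical in-memory view of UW++, keyed by the WSD output (synset id).
UW_DICT = {
    "01170802": "eat(icl>consume,equ>consumption)",
    "07755101": "apple(icl>edible fruit>thing)",
    "03388794": "fork(icl>cutlery>thing)",
    "09236472": "bank(icl>slope>thing)",
    "09434308": "river(icl>stream>thing)",
}

def universal_word(word, synset_id, ne_type=None):
    if ne_type:                                 # named entities bypass the dictionary
        return "%s(iof>%s)" % (word, ne_type)
    return UW_DICT.get(synset_id, word)         # fall back to the surface word

print(universal_word("Will-Smith", None, ne_type="PERSON"))
print(universal_word("eating", "01170802"))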

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs.
• For the frame "noun1 of noun2", where noun1 originates from verb1, the system should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2.

Algorithm: traverse the hypernym path of the noun up to its root; if any word on the path is "action", then the noun originated from a verb.

Examples:
  removal of garbage   = removing garbage
  collection of stamps = collecting stamps
  wastage of food      = wasting food
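
A minimal sketch of the hypernym-path test using NLTK's WordNet interface; the use of NLTK is an assumption, as is checking both "act" and "action" as path lemmas.

from nltk.corpus import wordnet as wn

def is_deverbal(noun):
    """True if some noun sense of `noun` has act/action on its hypernym path."""
    for syn in wn.synsets(noun, pos=wn.NOUN):
        for path in syn.hypernym_paths():                  # paths up to the root
            names = {lemma for s in path for lemma in s.lemma_names()}
            if "action" in names or "act" in names:
                return True
    return False

print(is_deverbal("removal"), is_deverbal("apple"))   # typically: True False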

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied.
• We capture these properties as features.
• We classify the features into noun and verb features.

Relation       Word  Property
agt(uw1, uw2)  uw1   Volitional verb
met(uw1, uw2)  uw2   Abstract thing
dur(uw1, uw2)  uw2   Time period
ins(uw1, uw2)  uw2   Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add word.NE_type to word.features.
• Traverse the hypernym path of the noun up to its root.
• If any word on the path matches a keyword, the corresponding feature is generated.

Noun        Feature
Will-Smith  PERSON, ANIMATE
bank        PLACE
fork        CONCRETE
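
A sketch of the feature-generation step, assuming NLTK's WordNet and a small keyword-to-feature map; the keyword list here is illustrative (the slides only name features such as PERSON, ANIMATE, PLACE and CONCRETE).

from nltk.corpus import wordnet as wn

# Keyword on the hypernym path -> feature; the exact list is an assumption.
KEYWORDS = {"person": "ANIMATE", "animal": "ANIMATE", "location": "PLACE",
            "artifact": "CONCRETE", "physical_entity": "CONCRETE",
            "abstraction": "ABSTRACT", "time_period": "TIME"}

def noun_features(word, ne_type=None):
    feats = set()
    if ne_type:                        # NER output, e.g. PERSON
        feats.add(ne_type)
        if ne_type == "PERSON":
            feats.add("ANIMATE")
    for syn in wn.synsets(word, pos=wn.NOUN):
        for path in syn.hypernym_paths():
            for s in path:
                for lemma in s.lemma_names():
                    if lemma in KEYWORDS:
                        feats.add(KEYWORDS[lemma])
    return feats

print(noun_features("fork"))                   # includes CONCRETE (via artifact); other senses may add more
print(noun_features("Will-Smith", "PERSON"))   # {'PERSON', 'ANIMATE'}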

Types of verbs

• Verb features are based on the types of verbs.
• There are three types:
  - Do: verbs that are performed volitionally.
  - Occur: verbs that are performed involuntarily, or that just happen without any performer.
  - Be: verbs which tell about the existence of something or about some fact.
• We classify these three into "do" and "not-do".

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do"-type verb or not.
• "do" verbs are volitional verbs, which are performed voluntarily.
• Heuristic: if a frame has the form "Something/somebody ...s something/somebody", then the verb is of "do" type.
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do".

Word    Type    Feature
was     not-do  -
eating  do      do
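
A sketch of the frame heuristic; the frame strings are hard-coded here for illustration (in practice they would come from the verb's WordNet-style sentence frames), and the regular expression is one way to render "Something/somebody ...s something/somebody".

import re

FRAMES = {
    "eat": ["Somebody ----s something", "Somebody ----s"],
    "be":  ["It ----s that CLAUSE", "Something ----s"],
}

DO_PATTERN = re.compile(r"^(Somebody|Something) ----s (somebody|something)$")

def verb_type(verb):
    """'do' if some frame matches 'Something/somebody ----s something/somebody'."""
    if any(DO_PATTERN.match(f) for f in FRAMES.get(verb, [])):
        return "do"
    return "not-do"

print(verb_type("eat"), verb_type("be"))   # do not-do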

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately.
• Similar relations are handled in clusters.
• Rule-based approach.
• Rules are applied on:
  - Stanford Dependency relations
  - noun features
  - verb features
  - POS tags
  - XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions -> Relation generated
• nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active -> agt(eat, Will-Smith)
• dobj(eating, apple); eat is active -> obj(eat, apple)
• prep(eating, with), pobj(with, fork); fork.feature = CONCRETE -> ins(eat, fork)
• prep(eating, on), pobj(on, bank); bank.feature = PLACE -> plc(eat, bank)
• prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON -> mod(bank, river)
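
A sketch of one cluster of such rules in Python, assuming the Stanford dependencies, noun features and verb information have already been collected into simple data structures; the rule bodies mirror the table above, but this is not the system's actual rule format.

def generate_relations(deps, feats, verb_info):
    """deps: list of (label, head, dep); feats: word -> set of features;
    verb_info: verb -> {'type': 'do'/'not-do', 'voice': 'active'/'passive'}."""
    rels = []
    for label, head, dep in deps:
        vinfo = verb_info.get(head, {})
        if label == "nsubj" and "ANIMATE" in feats.get(dep, ()) \
           and vinfo.get("type") == "do" and vinfo.get("voice") == "active":
            rels.append(("agt", head, dep))
        elif label == "dobj" and vinfo.get("voice") == "active":
            rels.append(("obj", head, dep))
        elif label == "prep":
            # find the pobj hanging off this preposition
            for l2, h2, d2 in deps:
                if l2 == "pobj" and h2 == dep:
                    if "CONCRETE" in feats.get(d2, ()):
                        rels.append(("ins", head, d2))
                    elif "PLACE" in feats.get(d2, ()):
                        rels.append(("plc", head, d2))
                    else:
                        rels.append(("mod", head, d2))
    return rels

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork"),
        ("prep", "eat", "on"), ("pobj", "on", "bank")]
feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
verb_info = {"eat": {"type": "do", "voice": "active"}}
print(generate_relations(deps, feats, verb_info))
# [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank')]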

Attribute Generation (1/2)

• Attributes are handled in clusters.
• Attributes are generated by rules.
• Rules are applied on:
  - Stanford dependency relations
  - XLE information
  - POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions -> Attributes generated
• eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) -> eat.@entry.@progress.@past
• det(apple, an) -> apple.@indef
• det(fork, a) -> fork.@indef
• det(bank, the) -> bank.@def
• det(river, the) -> river.@def
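
A sketch of the determiner and tense/aspect rules, with the XLE facts passed in as a plain dictionary; the representation is an assumption, the mappings follow the table above.

def generate_attributes(word, dets, xle):
    """dets: word -> determiner ('a'/'an'/'the'); xle: verb -> dict of XLE facts."""
    attrs = []
    det = dets.get(word)
    if det in ("a", "an"):
        attrs.append("@indef")
    elif det == "the":
        attrs.append("@def")
    facts = xle.get(word, {})
    if facts.get("main_clause"):
        attrs.append("@entry")
    if facts.get("PROG") == "+":
        attrs.append("@progress")
    if facts.get("TENSE") == "past":
        attrs.append("@past")
    return attrs

xle = {"eat": {"main_clause": True, "PROG": "+", "TENSE": "past"}}
dets = {"apple": "an", "fork": "a", "bank": "the", "river": "the"}
print("eat", generate_attributes("eat", dets, xle))      # ['@entry', '@progress', '@past']
print("apple", generate_attributes("apple", dets, xle))  # ['@indef']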

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus: 7000 sentences.
• Evaluated on 4 scenarios:
  - Scenario1: previous system (tightly coupled with the XLE parser)
  - Scenario2: current system (tightly coupled with the Stanford parser)
  - Scenario3: Scenario2 minus (Simplifier + Merger)
  - Scenario4: Scenario2 minus the XLE parser
• Scenario3 lets us know the impact of the Simplifier and Merger.
• Scenario4 lets us know the impact of the XLE parser.

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff:
  - relation_gen = relation_gold
  - UW1_gen = UW1_gold
  - UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold.
• count(UW_x) = number of attributes in UW_x.
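
A sketch of how these metrics can be computed; the attribute normaliser (max of the two counts) is an assumption, since the slide only defines amatch and count.

def relation_scores(gen, gold):
    """gen/gold: sets of (relation, UW1, UW2) triples; a generated triple is
    correct only if the relation and both UWs match exactly."""
    correct = len(gen & gold)
    precision = correct / len(gen) if gen else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def attribute_accuracy(pairs):
    """pairs: list of (attrs_gen, attrs_gold) for UWs where UW_gen == UW_gold."""
    scores = []
    for gen, gold in pairs:
        amatch = len(set(gen) & set(gold))
        denom = max(len(gen), len(gold)) or 1
        scores.append(amatch / denom)
    return sum(scores) / len(scores) if scores else 0.0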

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

(Bar chart: attribute accuracy for Scenario1-Scenario4; accuracy axis from 0 to 0.5.)

Results Time-taken

• Time taken for the systems to process the 7000 sentences from the EOLSS corpus.

(Bar chart: time taken by Scenario1-Scenario4; axis from 0 to 40.)

ERROR ANALYSIS

Wrong Relations Generated

(Bar chart: false positives for the relations obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall.)

Missed Relations

(Bar chart: false negatives, i.e. missed relations, for obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall.)

Case Study (1/3)

• Conjunctive relations: wrong apposition detection.
• James loves eating red water-melon, apples peeled nicely and bananas.

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations: non-uniform relation generation.
• George the president of the football club, James the coach of the club and the players went to watch their rivals' game.

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations:
  - mod and qua: more general rules.
  - agt vs. aoj: non-uniformity in the corpus.
  - gol vs. obj: multiple relation-generation possibilities.

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms.

Corpus relations:    agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module-by-module outputs for the example:

Input UNL expression: obj(contact(…

Lexeme Selection:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Case Identification:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Morphology Generation:
  सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact-imperative farmer-pl this you region taluka manchar khatav)

Function Word Insertion:
  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this-for you region or taluka-of Manchar Khatav)

Linearization:
  इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this-for you Manchar region or Khatav taluka-of farmers contact)

(UNL graph of the example: contact is the entry node, with agt -> you, obj -> farmer, pur -> this and plc -> region; Manchar and Khatav are attached by nam, and the region/taluka alternatives are joined by or.)

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VOA-COMM,VLTN,TMP,CJNCTN-VlinkVa)
[पच कवयलि7] "contact(icl>representative)" (N,ANIMT,FAUNA,MML,PRSN,Na)

Lexical choice is unambiguous.

(Graph fragment: contact/सपक+ with agt -> you/आप, obj -> farmer/निकस, pur -> this/य.)

Case Marking

• Case marking depends on the UNL relation and the properties of the nodes.
• Case gets transferred from the head to its modifiers.

Relation  Parent+  Parent-  Child+  Child-
Obj       V        VINT     N       -
Agt       V, past  -        N       -

(Graph fragment: same example graph as above.)

Morphology Generation: Nouns

Suffix  Attribute values
uoM     N NUM pl oblique
U       N NUM sg oblique
I       N NIF sg oblique
iyoM    N NIF pl oblique
oM      N ANOTCHF pl oblique

Examples:
  The boy saw me   -> लड़क माझ दखा
  Boys saw me      -> लड़क माझ दखा
  The King saw me  -> र2 माझ दखा
  Kings saw me     -> र2ऒ माझ दखा
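
A sketch of the suffix-table lookup; the attribute keys follow the table above, and the empty-suffix fallback is an assumption.

# (noun class, number, case) -> suffix, following the table above.
NOUN_SUFFIX = {
    ("NUM", "pl", "oblique"): "uoM",
    ("NUM", "sg", "oblique"): "U",
    ("NIF", "sg", "oblique"): "I",
    ("NIF", "pl", "oblique"): "iyoM",
    ("ANOTCHF", "pl", "oblique"): "oM",
}

def inflect_noun(stem, cls, number, case):
    return stem + NOUN_SUFFIX.get((cls, number, case), "")

# 'Kings saw me': the oblique plural of raajaa takes -oM (raajaaoM) before the ergative marker
print(inflect_noun("raajaa", "ANOTCHF", "pl", "oblique"))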

Verb Morphology

Suffix         Tense    Aspect    Mood     N   Gen     P    VE
-e rahaa thaa  past     progress  -        sg  male    3rd  e
-taa hai       present  custom    -        sg  male    3rd  -
-iyaa thaa     past     complete  -        sg  male    3rd  I
saktii hain    present  -         ability  pl  female  3rd  A

After Morphology Generation

(Graph fragments: before morphology generation, contact/सपक+ with agt -> you/आप, obj -> farmer/निकस, pur -> this/य; after morphology generation, contact/सपक+ क0जि2ए with agt -> you/आप, obj -> farmer/निकस4, pur -> this/इस.)

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

(Graph fragment: contact/सपक+ क0जि2ए with agt -> you/आप, obj -> farmer/निकस4 क, pur -> this/इसक लिलए.)

Rule:
Rel  Par+  Par-  Chi+     Chi-   Child FW
Obj  V     VINT  N,ANIMT  topic  क

Linearization

इस आप माचर कषतर खाट तलक निकस4 सपक+
(this you manchar region khatav taluka farmers contact)

(UNL graph of the example, as before: contact/सपक+ क0जि2ए is the entry node, with agt -> you, obj -> farmer, pur -> this and plc -> region; Manchar and Khatav name the region/taluka alternatives joined by or.)

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  - Semantic Independence: the semantic properties of the relata.
  - Context Independence: the rest of the expression.
• The relative word order of various relations sharing a relatum does not depend on (Local Ordering) the rest of the expression.

Pairwise order matrix (row relation's child relative to column relation's child):
      agt  aoj  obj  ins
Agt   -    L    L    L
Aoj   -    -    L    L
Obj   -    -    -    R
Ins   -    -    -    -

(Graph fragments illustrating the assumptions on the running example.)

Syntax Planning Strategy

• Divide a node's relations into a Before_set and an After_set; here Before_set = {obj, pur, agt} and After_set = {}.
• Topologically sort each group using the pairwise order matrix: pur agt obj, i.e. this you farmer.
• Final order: this you farmer contact (with the region / Khatav / Manchar / taluka material ordered the same way higher up in the graph).
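
A sketch of the per-node ordering, assuming the pairwise matrix is available as a dictionary; the "pur" entries are added for the running example (an assumption), and the insertion sort is just one way to realise the topological ordering.

# ORDER[(r1, r2)] == 'L' means a child attached by r1 goes to the left of one attached by r2.
ORDER = {("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
         ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R",
         ("pur", "agt"): "L", ("pur", "obj"): "L"}

def left_of(r1, r2):
    if (r1, r2) in ORDER:
        return ORDER[(r1, r2)] == "L"
    if (r2, r1) in ORDER:
        return ORDER[(r2, r1)] == "R"
    return False            # unspecified pairs keep their input order

def order_children(children):
    """children: list of (relation, word); stable insertion sort by the table."""
    ordered = []
    for rel, word in children:
        i = len(ordered)
        while i > 0 and left_of(rel, ordered[i - 1][0]):
            i -= 1
        ordered.insert(i, (rel, word))
    return ordered

kids = [("obj", "farmer"), ("pur", "this"), ("agt", "you")]
print([w for _, w in order_children(kids)] + ["contact"])
# ['this', 'you', 'farmer', 'contact']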

Syntax Planning Algo

(Worked trace of the algorithm on the example: a stack of nodes is maintained with their Before/Current/After sets and an output list; contact, farmer, taluka, region and manchar are expanded in turn until the final order "this you manchar region or khatav taluka farmers contact" is produced.)

All Together (UNL -> Hindi)

(This slide repeats the step-through table above, showing the Hindi output after each module: Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion and Syntax Linearization; the final sentence is glossed "this-for you Manchar region or Khatav taluka-of farmers contact".)

How to Evaluate UNL Deconversion

• UNL -> Hindi.
• A reference translation needs to be generated from the UNL, which needs expertise.
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated. This works if you assume that the UNL generation was perfect.
• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar.
(3) Fair: easy to understand, but flawed grammar.
(2) Acceptable: broken, understandable with effort.
(1) Nonsense: incomprehensible.

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning.
(3) Most: most of the meaning is conveyed.
(2) Some: some of the meaning is conveyed.
(1) None: hardly any meaning is conveyed.

Sample Translations

Hindi Output (Fluency, Adequacy):
1. कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए  (2, 3)
2. परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए  (3, 4)
3. माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत  (1, 1)
4. 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp  (4, 4)
5. इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp  (2, 3)

Results

                         BLEU   Fluency  Adequacy
Geometric average        0.34   2.54     2.84
Arithmetic average       0.41   2.71     3.00
Standard deviation       0.25   0.89     0.89
Pearson corr. (BLEU)     1.00   0.59     0.50
Pearson corr. (Fluency)  0.59   1.00     0.68

Fluency vs Adequacy

(Chart: number of sentences by Fluency score (1-4), broken down by Adequacy score (1-4).)

• Good correlation between Fluency and BLEU.
• Strong correlation between Fluency and Adequacy.
• Large-scale evaluation can therefore be done using Fluency alone.

Caution: domain diversity, speaker diversity.

Transfer Based MT

Marathi-Hindi

106

(Vauquois diagram: levels of translation from direct translation at the graphemic level, through syntactic transfer (surface and deep), semantic transfer and multilevel transfer, up to interlingua-based translation; the corresponding representations range from tagged text and C-structures/F-structures to SPA-structures, semantico-linguistic interlingua and ontological interlingua.)

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida.

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages:
  - POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  - bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks:
H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language:
V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description):
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know).
1: Some parts make sense, but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more).
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said).
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong).
4: Perfect (e.g. someone who knows the language).

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences.

(Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator modules.)

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality:
  - Evaluated on 500 web sentences.
  - Accuracy calculated from scores given according to translation quality.
  - Accuracy: 65.32%.
• Result analysis:
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality.
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories.

Nouns (Verb -> Noun):
  च vaach (read) -> चण vaachaNe (reading)
  उतर utara (climb down) -> उतरण utaraN (downward slope)

Adjectives (Verb -> Adjective):
  च chav (bite) -> चणर chaavaNaara (one who bites)
  खा khaa (eat) -> खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb -> Adverb):
  पळ paL (run) -> पळताना paLataanaa (while running)
  बस bas (sit) -> ब सना basun (after sitting)

Kridanta Types

• ण (Ne)-Kridanta: vaachNyaasaaThee pustak de (lit. "for-reading book give"; Give me a book for reading). Aspect: Perfective.
• ल (laa)-Kridanta: Lekh vaachalyaavar saaMgen (lit. "article after-reading will-tell"; I will tell you that after reading the article). Aspect: Perfective.
• त (Taanaa)-Kridanta: Pustak vaachtaanaa te lakShaat aale (lit. "book while-reading it in-mind came"; I noticed it while reading the book). Aspect: Durative.
• लल (Lela)-Kridanta: kaal vaachlele pustak de (lit. "yesterday read book give"; Give me the book that (I/you) read yesterday). Aspect: Perfective.
• ऊ (Un)-Kridanta: pustak vaachun parat kar (lit. "book after-reading back do"; Return the book after reading it). Aspect: Completive.
• णर (Nara)-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (lit. "books to-the-one-who-reads knowledge gets"; The one who reads books gets knowledge). Aspect: Stative.
• (ve)-Kridanta: he pustak pratyekaane vaachaave (lit. "this book everyone should-read"; Everyone should read this book). Aspect: Inceptive.
• त (taa)-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (lit. "he book while-reading to-sleep went"; He fell asleep while reading a book). Aspect: Stative.

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for Kridanta Processing
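
A sketch of the kridanta step of such an FSM, rendered as longest-match suffix stripping over transliterated forms; the suffix inventory follows the kridanta table above, while the stem lexicon and the outputs are illustrative only.

KRIDANTA = {              # surface suffix -> (kridanta type, aspect), per the table above
    "NyaasaaThee": ("Ne", "perfective"),
    "lyaavar":     ("laa", "perfective"),
    "taanaa":      ("taanaa", "durative"),
    "lele":        ("lela", "perfective"),
    "un":          ("Un", "completive"),
    "NaaRyaalaa":  ("Nara", "stative"),
    "aave":        ("ve", "inceptive"),
    "taa":         ("taa", "stative"),
}
STEMS = {"vaach"}         # verb stems known to the analyser (illustrative)

def analyse(form):
    for suffix in sorted(KRIDANTA, key=len, reverse=True):   # longest match first
        if form.endswith(suffix) and form[:-len(suffix)] in STEMS:
            ktype, aspect = KRIDANTA[suffix]
            return {"stem": form[:-len(suffix)], "kridanta": ktype, "aspect": aspect}
    return None

print(analyse("vaachtaanaa"))   # stem vaach, taanaa-kridanta, durative
print(analyse("vaachlele"))     # stem vaach, lela-kridanta, perfective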

Accuracy of Kridanta Processing Direct Evaluation

(Chart: precision and recall of kridanta processing for the Ne, laa, Nara, Lela, taanaa, taa, Un and ve kridanta types; all values between 0.86 and 0.98.)

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins.
• Relatively easier problem to solve.
• Will interlingua be better?
• Web sentences are being used to test the performance.
• Rule-governed.
• Needs a high level of linguistic expertise.
• Will be an important contribution to IL MT.

130

131

Summary interlingua based MT

• English to Hindi.
• Rule-governed.
• High level of linguistic expertise needed.
• Takes a long time to build (since 1996).
• But produces great insight, resources and tools.

33

Typology motion verbs (33)

bull Verb-framed languages express manner and cause not in the verb but in a more peripheral element

bull Spanish ndash La botella salioacute de la cueva flotandondash The bottle exited the cave floatingly

bull Japanesendash Bin-ga dookutsu-kara nagara-te de -ta ndash bottle-NOM cave-from float-GER exit-PAST

34

IITB effort at KBMT

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

35

MT EnConverion + Deconversion

Interlingua(UNL)

English

French

Hindi

Chinese

generation

Analysis

Challenges of interlingua generation

Mother of NLP problems - Extract meaning from a sentence

Almost all NLP problems are sub-problems Named Entity Recognition POS tagging Chunking Parsing Word Sense Disambiguation Multiword identification and the list goes on

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

0: Nonsense (the sentence does not make any sense at all; it is like someone speaking to you in a language you do not know)
1: Some parts make sense but it is not comprehensible overall (e.g., listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g., someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g., someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g., someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

[Chart: module-wise precision and recall (scale 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  - Evaluated on 500 web sentences
  - Accuracy calculated from scores assigned according to translation quality
  - Accuracy: 65.32%
• Result analysis
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H Translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
वाच vaach (read) → वाचणे vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
चाव chav (bite) → चावणारा chaavaNaara (one who bites)
खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta Type | Example (word-by-word gloss; translation) | Aspect
"Ne"-Kridanta | vaachNyaasaaThee pustak de (for-reading book give; "Give me a book for reading") | Perfective
"laa"-Kridanta | Lekh vaachalyaavar saaMgen (article after-reading will-tell; "I will tell you that after reading the article") | Perfective
"Taanaa"-Kridanta | Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came; "I noticed it while reading the book") | Durative
"Lela"-Kridanta | kaal vaachlele pustak de (yesterday read book give; "Give me the book that (I/you) read yesterday") | Perfective
"Un"-Kridanta | pustak vaachun parat kar (book after-reading back do; "Return the book after reading it") | Completive
"Nara"-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets; "The one who reads books gets knowledge") | Stative
"ve"-Kridanta | he pustak pratyekaane vaachaave (this book everyone should-read; "Everyone should read this book") | Inceptive
"taa"-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went; "He fell asleep while reading a book") | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for Kridanta Processing
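The FSM figure itself is not reproduced in this transcript. As a rough illustration of the morphotactics idea, the sketch below splits a transliterated word into a verb stem and a kridanta suffix with a simple two-state acceptor. The stem list and suffix inventory are toy assumptions for illustration, not the actual lexicon or FSM of the system.

    # Toy kridanta analyser: verb stem followed by a participial suffix (transliterated forms)
    STEMS = {"vaach", "utar", "chav", "khaa", "paL", "bas"}          # assumed toy stem lexicon
    KRIDANTA_SUFFIXES = {                                            # suffix -> kridanta type (assumed forms)
        "Ne": "Ne-kridanta", "laa": "laa-kridanta", "taanaa": "taanaa-kridanta",
        "lele": "lela-kridanta", "un": "un-kridanta", "Naaraa": "Nara-kridanta",
        "taa": "taa-kridanta", "ve": "ve-kridanta",
    }

    def analyse(word):
        # Try every split point: the prefix must be a known stem, the remainder a kridanta suffix
        for i in range(1, len(word)):
            stem, suffix = word[:i], word[i:]
            if stem in STEMS and suffix in KRIDANTA_SUFFIXES:
                return stem, KRIDANTA_SUFFIXES[suffix]
        return None

    print(analyse("vaachNe"))      # ('vaach', 'Ne-kridanta')
    print(analyse("paLtaanaa"))    # ('paL', 'taanaa-kridanta')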

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall (range roughly 0.86 to 0.98) per kridanta type: Ne, La, Nara, Lela, Tana, T, Oon, Va]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Challenges of interlingua generation

Mother of NLP problems: extract meaning from a sentence.

Almost all NLP problems are sub-problems: Named Entity Recognition, POS tagging, Chunking, Parsing, Word Sense Disambiguation, Multiword identification, and the list goes on.

UNL a United Nations project

• Started in 1996
• 10-year program
• 15 research groups across continents
• First goal: generators
• Next goal: analysers (needs solving various ambiguity problems)
• Currently active language groups:
  - UNL_French (GETA-CLIPS, IMAG)
  - UNL_Hindi (IIT Bombay, with additional work on UNL_English)
  - UNL_Italian (Univ. of Pisa)
  - UNL_Portugese (Univ. of Sao Paolo, Brazil)
  - UNL_Russian (Institute of Linguistics, Moscow)
  - UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

[Diagram: UNL as a language-independent meaning representation, with converters to and from English, Russian, Japanese, Hindi, Spanish, Marathi and other languages]

UNL represents knowledge John eats rice with a spoon

[Diagram: the UNL graph for the sentence, with universal words as nodes, semantic relations as arc labels, and attributes attached to the nodes]

Repository of 42 semantic relations and 84 attribute labels
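For reference, a sketch of the UNL relations such a graph encodes for this sentence; this is the standard example used in the UNL literature, with the restrictions on the universal words omitted here:

    agt(eat.@entry.@present, John)
    obj(eat.@entry.@present, rice)
    ins(eat.@entry.@present, spoon)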

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]
agt(claim.@entry.@past, Deepa)
obj(claim.@entry.@past, :01)
agt:01(compose.@past.@entry.@complete, she)
obj:01(compose.@past.@entry.@complete, poem.@indef)
[/UNL]

41
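As a small illustration of the notation (relation name, optional scope id, two arguments carrying ".@" attributes), here is a sketch that parses UNL relation lines like the ones above into Python tuples. It assumes the cleaned-up punctuation shown above.

    import re

    # Matches lines like: agt:01(compose.@past.@entry.@complete, she)
    UNL_LINE = re.compile(r"^(\w+)(?::(\d+))?\(\s*([^,]+?)\s*,\s*(.+?)\s*\)$")

    def parse_unl_line(line):
        m = UNL_LINE.match(line.strip())
        if not m:
            return None
        rel, scope, arg1, arg2 = m.groups()
        split = lambda uw: (uw.split(".@")[0], uw.split(".@")[1:])   # (head word, attribute list)
        return rel, scope, split(arg1), split(arg2)

    example = [
        "agt(claim.@entry.@past, Deepa)",
        "obj(claim.@entry.@past, :01)",
        "agt:01(compose.@past.@entry.@complete, she)",
        "obj:01(compose.@past.@entry.@complete, poem.@indef)",
    ]
    for line in example:
        print(parse_unl_line(line))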

The Lexicon

Content words:

[forward]  "forward(icl>send)"      (V, VOA)                 <E, 0, 0>
[mail]     "mail(icl>message)"      (N, PHSCL, INANI)        <E, 0, 0>
[minister] "minister(icl>person)"   (N, ANIMT, PHSCL, PRSN)  <E, 0, 0>

(Headword, Universal Word, Attributes)

He forwarded the mail to the minister

42

The Lexicon (cntd)

Function words:

[he]   "he"   (PRON, SUB, SING, 3RD)  <E, 0, 0>
[the]  "the"  (ART, THE)              <E, 0, 0>
[to]   "to"   (PRE, TO)               <E, 0, 0>

(Headword, Universal Word, Attributes)

He forwarded the mail to the minister

43

How to obtain UNL expressions

• The UNL nerve center is the verb
• English sentences have the form
  - A <verb> B, or
  - A <verb>

[Diagram: the verb becomes the root node, linked by relations (rel) to A and B]

Recurse on A and B

Enconversion

A long sequence of master's theses: Anupama, Krishna, Sandip, Abhishek, Gourab, Subhajit, Janardhan; in collaboration with Rajat, Jaya, Gajanan, Rajita.

Many publications.

System Architecture

[Architecture diagram: the input sentence passes through NER, the Stanford Dependency Parser, WSD and a Clause Marker; the Simplifier splits it into simple sentences, each handled by a Simple Sentence Enconverter (Stanford Dependency Parser, XLE Parser, Feature Generation, Relation Generation, Attribute Generation); the Merger then combines the resulting UNLs]

Complexity of handling long sentences

Problem: as the length (number of words) of the sentence increases, the XLE parser fails to capture the dependency relations precisely.

The UNL enconversion system is tightly coupled with the XLE parser, resulting in a fall in accuracy.

Solution

• Break long sentences into smaller ones
  - A sentence can only be broken at the clausal level: Simplification
  - Once broken, the pieces need to be joined back: Merging
• Reduce the reliance of the Enconverter on the XLE parser
  - The rules for relation and attribute generation are formed on Stanford Dependency relations

Text Simplifier

• It simplifies a complex or compound sentence
• Example
  - Sentence: John went to play and Jane went to school.
  - Simplified: John went to play. Jane went to school.
• The new simplifier is built on the clause markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

• The Merger takes multiple UNLs and merges them to form one UNL
• Example
  - Input:
    agt(go.@entry, John), pur(go.@entry, play)
    agt(go.@entry, Mary), gol(go.@entry, school)
  - Output:
    agt(go.@entry, John), pur(go.@entry, play)
    and(go, :01)
    agt:01(go.@entry, Mary), gol:01(go.@entry, school)
• There are several cases, based on how the sentence was simplified; the Merger uses rules for each of the cases to merge them
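A minimal sketch of the "and" case shown above: the second simple sentence's relations are pushed into a new scope and connected with an and() relation. The triple representation and helper names are assumptions for illustration, not the actual Merger implementation.

    def merge_and(unl1, unl2, head1, scope_id="01"):
        """Merge two simple-sentence UNL relation lists for a coordinate ('and') sentence.
        unl1/unl2: lists of (relation, arg1, arg2); head1: entry UW of the first part."""
        merged = list(unl1)                                   # first clause stays in the main scope
        merged.append(("and", head1, ":" + scope_id))         # link the main clause to the new scope
        for rel, a1, a2 in unl2:                              # second clause moves into scope :01
            merged.append((rel + ":" + scope_id, a1, a2))
        return merged

    s1 = [("agt", "go.@entry", "John"), ("pur", "go.@entry", "play")]
    s2 = [("agt", "go.@entry", "Mary"), ("gol", "go.@entry", "school")]
    for t in merge_and(s1, s2, "go"):
        print(t)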

NLP tools and resources for UNL generation

Tools
• Stanford Named Entity Recognizer (NER)
• Stanford Dependency Parser
• Word Sense Disambiguation (WSD) system
• Xerox Linguistic Environment (XLE) Parser

Resources
• WordNet
• Universal Word Dictionary (UW++)

System Processing Units

Syntactic Processing
• NER
• Stanford Dependency Parser
• XLE Parser

Semantic Processing
• Stems of words
• WSD
• Nouns originating from verbs
• Feature generation: noun features, verb features

SYNTACTIC PROCESSING

Syntactic Processing NER

• NER tags the named entities in the sentence with their types
• Types covered
  - Person
  - Place
  - Organization
• Input: Will Smith was eating an apple with a fork on the bank of the river
• Output: <PERSON>Will Smith</PERSON> was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

• The Stanford Dependency Parser parses the sentence and produces
  - POS tags of the words in the sentence
  - a dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Input: Will-Smith was eating an apple with a fork on the bank of the river

POS tags:
Will-Smith/NNP was/VBD eating/VBG an/DT apple/NN with/IN a/DT fork/NN on/IN the/DT bank/NN of/IN the/DT river/NN

Dependency relations:
nsubj(eating-3, Will-Smith-1), aux(eating-3, was-2), det(apple-5, an-4), dobj(eating-3, apple-5), prep(eating-3, with-6), det(fork-8, a-7), pobj(with-6, fork-8), prep(eating-3, on-9), det(bank-11, the-10), pobj(on-9, bank-11), prep(bank-11, of-12), det(river-14, the-13), pobj(of-12, river-14)
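For later rule application it helps to hold these dependencies as tuples. A small pure-Python sketch, assuming the "rel(head-i, dep-j)" string format shown above:

    import re

    DEP = re.compile(r"^(\w+)\((.+?)-(\d+),\s*(.+?)-(\d+)\)$")

    def parse_dep(s):
        # "nsubj(eating-3, Will-Smith-1)" -> ('nsubj', 'eating', 3, 'Will-Smith', 1)
        rel, head, hi, dep, di = DEP.match(s.strip()).groups()
        return rel, head, int(hi), dep, int(di)

    deps = ["nsubj(eating-3, Will-Smith-1)", "dobj(eating-3, apple-5)",
            "prep(eating-3, with-6)", "pobj(with-6, fork-8)"]
    print([parse_dep(d) for d in deps])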

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

Dependency relations:
OBJ(eat:8, apple:16), SUBJ(eat:8, will-smith:0), ADJUNCT(apple:16, on:20), OBJ(on:20, bank:22), ADJUNCT(apple:16, with:17), OBJ(with:17, fork:19), ADJUNCT(bank:22, of:23), OBJ(of:23, river:25)

Other important information:
PASSIVE(eat:8, -), PROG(eat:8, +), TENSE(eat:8, past), VTYPE(eat:8, main)

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word | Stem
was | be
eating | eat
apple | apple
fork | fork
bank | bank
river | river

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing: WSD

Word | Synset-id | Sense-id
Will-Smith | - | -
was | 02610777 | be24203
eating | 01170802 | eat23400
an | - | -
apple | 07755101 | apple11300
with | - | -
a | - | -
fork | 03388794 | fork10600
on | - | -
the | - | -
bank | 09236472 | bank11701
of | - | -
the | - | -
river | 09434308 | river11700

Word | POS
Will-Smith | Proper noun
was | Verb
eating | Verb
an | Determiner
apple | Noun
with | Preposition
a | Determiner
fork | Noun
on | Preposition
the | Determiner
bank | Noun
of | Preposition
the | Determiner
river | Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word | Universal Word
Will-Smith | Will-Smith(iof>PERSON)
eating | eat(icl>consume, equ>consumption)
apple | apple(icl>edible fruit>thing)
fork | fork(icl>cutlery>thing)
bank | bank(icl>slope>thing)
river | river(icl>stream>thing)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs
• For the frame "noun1 of noun2", where noun1 originates from verb1, the system should generate the relation obj(verb1, noun2) instead of a relation between noun1 and noun2

Algorithm:
Traverse the hypernym path of the noun to its root; if any word on the path is "action", then the noun originated from a verb.

Examples:
removal of garbage = removing garbage
collection of stamps = collecting stamps
wastage of food = wasting food

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun and verb features

Relation | Word | Property
agt(uw1, uw2) | uw1 | Volitional verb
met(uw1, uw2) | uw2 | Abstract thing
dur(uw1, uw2) | uw2 | Time period
ins(uw1, uw2) | uw2 | Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add word.NE_type to word.features
• Traverse the hypernym path of the noun to its root
• If any word on the path matches a keyword, the corresponding feature is generated

Noun | Features
Will-Smith | PERSON, ANIMATE
bank | PLACE
fork | CONCRETE
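A sketch of the hypernym-path keyword test using NLTK's WordNet interface (requires the NLTK WordNet data to be downloaded). The keyword-to-feature map is an assumed toy version, not the system's actual list; the same traversal also covers the "noun originating from a verb" test above, by looking for "act" on the path.

    from nltk.corpus import wordnet as wn   # needs: nltk.download('wordnet')

    # Assumed toy keyword -> feature map (the real system uses its own keyword list)
    KEYWORDS = {"person": "ANIMATE", "location": "PLACE", "artifact": "CONCRETE",
                "act": "DEVERBAL", "time_period": "TIME"}

    def noun_features(word):
        feats = set()
        for syn in wn.synsets(word, pos=wn.NOUN):
            for path in syn.hypernym_paths():            # list of synsets from the root down
                for node in path:
                    head = node.name().split(".")[0]     # e.g. 'artifact.n.01' -> 'artifact'
                    if head in KEYWORDS:
                        feats.add(KEYWORDS[head])
        return feats

    print("fork:", noun_features("fork"))        # has an artifact sense, so CONCRETE is expected
    print("removal:", noun_features("removal"))  # an act noun, so DEVERBAL is expected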

Types of verbs

• Verb features are based on the types of verbs
• There are three types
  - Do: verbs that are performed volitionally
  - Occur: verbs that are performed involuntarily or just happen without any performer
  - Be: verbs which tell about the existence of something or about some fact
• We classify these three into "do" and "not-do"

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is a "do" type verb or not
• "do" verbs are volitional verbs, performed voluntarily
• Heuristic: if a frame has the form "Something/somebody ...s something/somebody", then the verb is of "do" type
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do"

Word | Type | Feature
was | not-do | -
eating | do | do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations handled separately
• Similar relations handled in clusters
• Rule-based approach
• Rules are applied on
  - Stanford Dependency relations
  - Noun features
  - Verb features
  - POS tags
  - XLE-generated information

Relation Generation (22)

Conditions | Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active | agt(eat, Will-Smith)
dobj(eating, apple); eat is active | obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE | ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE | plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON | mod(bank, river)

Input: Will-Smith was eating an apple with a fork on the bank of the river
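A condensed sketch of how rules of this kind can be applied to the parsed dependencies and word features. The data structures and the rules encoded below are simplified restatements of the table above, not the full rule base (for instance, the voice check is omitted).

    # Simplified inputs for the running example
    deps = {
        ("nsubj", "eating", "Will-Smith"), ("dobj", "eating", "apple"),
        ("prep", "eating", "with"), ("pobj", "with", "fork"),
        ("prep", "eating", "on"), ("pobj", "on", "bank"),
    }
    stem = {"eating": "eat"}                               # from the stemming step
    feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
    verb_type = {"eat": "do"}                              # from the verb-feature step

    def relations(deps, stem, feats, verb_type):
        out = []
        for rel, head, dep in deps:
            v = stem.get(head, head)
            if rel == "nsubj" and verb_type.get(v) == "do" and "ANIMATE" in feats.get(dep, ()):
                out.append(("agt", v, dep))
            elif rel == "dobj":
                out.append(("obj", v, dep))
            elif rel == "prep":
                # find the object of the preposition, then decide by its features
                for r2, h2, d2 in deps:
                    if r2 == "pobj" and h2 == dep:
                        if "CONCRETE" in feats.get(d2, ()):
                            out.append(("ins", v, d2))
                        elif "PLACE" in feats.get(d2, ()):
                            out.append(("plc", v, d2))
        return out

    print(relations(deps, stem, feats, verb_type))
    # e.g. [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork'), ('plc', 'eat', 'bank')]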

Attribute Generation (12)

• Attributes handled in clusters
• Attributes are generated by rules
• Rules are applied on
  - Stanford dependency relations
  - XLE information
  - POS tags

Attribute Generation (22)

Conditions | Attributes generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) | eat.@entry.@progress.@past
det(apple, an) | apple.@indef
det(fork, a) | fork.@indef
det(bank, the) | bank.@def
det(river, the) | river.@def

Input: Will-Smith was eating an apple with a fork on the bank of the river
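In the same spirit, a small sketch of attribute generation from the determiners and the XLE tense/aspect facts for the example; again a simplified illustration, not the actual rule set.

    dets = {"apple": "an", "fork": "a", "bank": "the", "river": "the"}
    xle = {"eat": {"TENSE": "past", "PROG": "+", "VTYPE": "main"}}

    def attributes(dets, xle, main_verb="eat"):
        attrs = {}
        for noun, det in dets.items():
            attrs.setdefault(noun, []).append("@indef" if det in ("a", "an") else "@def")
        info = xle.get(main_verb, {})
        verb_attrs = ["@entry"]                      # the main-clause verb gets @entry
        if info.get("PROG") == "+":
            verb_attrs.append("@progress")
        if info.get("TENSE"):
            verb_attrs.append("@" + info["TENSE"])
        attrs[main_verb] = verb_attrs
        return attrs

    print(attributes(dets, xle))
    # {'apple': ['@indef'], 'fork': ['@indef'], 'bank': ['@def'], 'river': ['@def'],
    #  'eat': ['@entry', '@progress', '@past']}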

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  - 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios
  - Scenario 1: previous system (tightly coupled with the XLE parser)
  - Scenario 2: current system (tightly coupled with the Stanford parser)
  - Scenario 3: Scenario 2 minus (Simplifier + Merger)
  - Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff
  - relation_gen = relation_gold
  - UW1_gen = UW1_gold
  - UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
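A sketch of how these matching criteria translate into relation precision/recall and an attribute-match score. The set representation and the F-score formula are standard; the exact scoring behind the reported tables may differ.

    def relation_prf(gen, gold):
        # gen/gold: sets of (relation, UW1, UW2) triples
        tp = len(gen & gold)
        p = tp / len(gen) if gen else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    def attribute_score(gen_attrs, gold_attrs):
        # amatch / count for one matched UW; attributes are sets like {'@entry', '@past'}
        amatch = len(gen_attrs & gold_attrs)
        return amatch / max(len(gold_attrs), 1)

    gen = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("mod", "bank", "river")}
    gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
    print(relation_prf(gen, gold))                                                  # about (0.67, 0.67, 0.67)
    print(attribute_score({"@entry", "@past"}, {"@entry", "@past", "@progress"}))   # about 0.67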

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy (0 to 0.5) for Scenario 1 through Scenario 4]

Results Time-taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus

[Chart: time taken (0 to 40) for Scenario 1 through Scenario 4]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: false positives per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Chart: false negatives (missed relations) per relation: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (13)

• Conjunctive relations
  - Wrong apposition detection
• James loves eating red water-melon, apples peeled nicely and bananas

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)
• Conjunctive relations
  - Non-uniform relation generation
• George the president of the football club, James the coach of the club, and the players went to watch their rivals' game

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)
• Low-precision, frequent relations
  - mod and qua
• More general rules
  - agt vs. aoj: non-uniformity in the corpus
  - gol vs. obj: multiple relation generation possibilities
• The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module | Output
UNL Expression (input) | obj(contact(...
Lexeme Selection | संपर्क, किसान, यह, आप, क्षेत्र, तालुका, मंचर, खटाव (contact, farmer, this, you, region, taluka, Manchar, Khatav)
Case Identification | संपर्क, किसान, यह, आप, क्षेत्र, तालुका, मंचर, खटाव (contact, farmer, this, you, region, taluka, Manchar, Khatav)
Morphology Generation | संपर्क कीजिए, किसानों, इस, आप, क्षेत्र, तालुका, मंचर, खटाव (contact + imperative, farmer + plural oblique, this, you, region, taluka, Manchar, Khatav)
Function Word Insertion | संपर्क कीजिए, किसानों को, इसके लिए, आप, क्षेत्र, या, तालुका के, मंचर, खटाव (contact, farmers + "ko", this + "for", you, region, or, taluka + "of", Manchar, Khatav)
Linearization | इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए (this-for you Manchar region or Khatav taluka of farmers contact)

[UNL graph for the example: node contact (entry) with agt → you, obj → farmer, pur → this, and plc pointing into scope :01, which contains region and taluka joined by "or", with nam arcs to Manchar and Khatav]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)
[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.

[Graph: contact → संपर्क, farmer → किसान, this → यह, you → आप, with relations agt, obj, pur]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Rule table (Relation, Parent+, Parent-, Child+, Child-), rows as extracted:
obj: V, VINT, N
agt: V, past, N

[Graph: contact संपर्क with agt → you आप, obj → farmer किसान, pur → this यह]

Morphology Generation: Nouns

Suffix | Attribute values
-uoM | N, NUM=pl, oblique
-U | N, NUM=sg, oblique
-I | N, NIF, sg, oblique
-iyoM | N, NIF, pl, oblique
-oM | N, NANOTCHF, pl, oblique

Examples: "The boy saw me", "Boys saw me", "The king saw me", "Kings saw me" (Hindi forms illustrating singular vs. plural oblique marking on लड़का "boy" and राजा "king")

Verb Morphology

Suffix | Tense | Aspect | Mood | Num | Gen | Per | VE
-e rahaa thaa | past | progress | - | sg | male | 3rd | e
-taa hai | present | custom | - | sg | male | 3rd | -
-iyaa thaa | past | complete | - | sg | male | 3rd | I
saktii hain | present | - | ability | pl | female | 3rd | A
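The table above is essentially a lookup keyed on tense, aspect, mood and agreement. A toy sketch with the rows copied from the table (transliterated); real generation also handles stem changes, which are ignored here.

    # (tense, aspect, mood, number, gender, person) -> suffix string, from the table above
    VERB_SUFFIX = {
        ("past",    "progress", "-",       "sg", "male",   "3rd"): "-e rahaa thaa",
        ("present", "custom",   "-",       "sg", "male",   "3rd"): "-taa hai",
        ("past",    "complete", "-",       "sg", "male",   "3rd"): "-iyaa thaa",
        ("present", "-",        "ability", "pl", "female", "3rd"): "saktii hain",
    }

    def inflect(stem, tense, aspect="-", mood="-", num="sg", gen="male", per="3rd"):
        suffix = VERB_SUFFIX.get((tense, aspect, mood, num, gen, per))
        if suffix is None:
            raise KeyError("no rule for this feature bundle")
        return stem + " " + suffix

    print(inflect("khaa", "past", aspect="progress"))   # khaa -e rahaa thaa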

After Morphology Generation

[Graphs: before morphology generation (contact संपर्क, farmer किसान, this यह, you आप) and after (contact संपर्क कीजिए, farmer किसानों, this इस, you आप), with relations agt, obj, pur]

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मंचर खटाव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

Rule (Relation, Parent+, Parent-, Child+, Child-, Child function word):
obj: V, VINT, N/ANIMT, topic, को

[Graph: farmer किसानों को, this इसके लिए, you आप, contact संपर्क कीजिए, with relations agt, obj, pur]

Linearization

[Diagram: the UNL graph nodes arranged in the final Hindi order: this (इस), you (आप), Manchar (मंचर), region (क्षेत्र), or, Khatav (खटाव), taluka (तालुका), farmers (किसानों), contact (संपर्क कीजिए)]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on
  - (Semantic Independence) the semantic properties of the relata
  - (Context Independence) the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  - (Local Ordering) the rest of the expression

[Example graph: contact with agt(you), obj(farmer), pur(this)]

Pairwise ordering table (L = row relation's relatum precedes the column relation's, R = follows):
      agt | aoj | obj | ins
agt |  -  |  L  |  L  |  L
aoj |     |  -  |  L  |  L
obj |     |     |  -  |  R
ins |     |     |     |  -

Syntax Planning: Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topologically sort each group: pur, agt, obj, giving this, you, farmer
• Final order: this you farmer contact

[The remaining graph nodes (region, Khatav, Manchar, taluka) are handled in the same way]

Syntax Planning: Algorithm

[Table: a trace of the planning algorithm on the example, with columns Stack, Before Current, After Current and Output, showing how contact, farmer, region, taluka, Manchar and Khatav are pushed and emitted to produce the final order]
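A compact sketch of the planning idea: partition a node's relations into before/after sets, order the "before" group by relative priority, and recurse. The precedence values below only cover the relations of the example and are assumptions for illustration, not the system's actual tables.

    # Child relations placed before the head word in Hindi (assumed, per the example)
    BEFORE = {"agt", "obj", "pur", "plc"}
    # Relative priority among 'before' relations: lower number comes earlier (assumed)
    PRIORITY = {"pur": 0, "agt": 1, "plc": 2, "obj": 3}

    def linearize(node, graph):
        """graph: dict head -> list of (relation, child). Returns the word order, depth first."""
        children = graph.get(node, [])
        before = sorted((c for c in children if c[0] in BEFORE),
                        key=lambda c: PRIORITY.get(c[0], 99))
        after = [c for c in children if c[0] not in BEFORE]
        order = []
        for _, child in before:
            order.extend(linearize(child, graph))
        order.append(node)
        for _, child in after:
            order.extend(linearize(child, graph))
        return order

    graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
    print(linearize("contact", graph))   # ['this', 'you', 'farmer', 'contact']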

All Together (UNL -> Hindi)

The complete pipeline for the example, repeating the step-through above (lexeme selection, case identification, morphology generation, function word insertion) and ending with syntax linearization: इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए (this-for you Manchar region or Khatav taluka of farmers contact).

How to Evaluate UNL Deconversion

• UNL -> Hindi
• The reference translation needs to be generated from the UNL
  - Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  - Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

UNL a United Nations project

bull Started in 1996bull 10 year programbull 15 research groups across continentsbull First goal generatorsbull Next goal analysers (needs solving various ambiguity

problems)bull Current active language groups

ndash UNL_French (GETA-CLIPS IMAG)ndash UNL_Hindi (IIT Bombay with additional work on UNL_English)ndash UNL_Italian (Univ of Pisa)ndash UNL_Portugese (Univ of Sao Paolo Brazil)ndash UNL_Russian (Institute of Linguistics Moscow)ndash UNL_Spanish (UPM Madrid)

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output | Fluency | Adequacy
कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए | 2 | 3
परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए | 3 | 4
माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत | 1 | 1
2णविTक सकरमाण स इसक0 2ड़O परभनित त amp | 4 | 4
इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp | 2 | 3

Results

                        | BLEU | Fluency | Adequacy
Geometric Average       | 0.34 | 2.54    | 2.84
Arithmetic Average      | 0.41 | 2.71    | 3.00
Standard Deviation      | 0.25 | 0.89    | 0.89
Pearson Corr. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Corr. (Fluency) | 0.59 | 1.00    | 0.68

(A small sketch of the Pearson correlation computation follows below.)
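The Pearson correlations in the table can be computed directly from per-sentence scores; a small self-contained sketch with made-up numbers (not the actual evaluation data):

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length score lists."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    n = len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sx * sy)

# Illustrative per-sentence scores only (not the paper's data).
bleu    = [0.21, 0.45, 0.10, 0.62, 0.33]
fluency = [2, 3, 1, 4, 3]
print(round(pearson(bleu, fluency), 2))
```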

Fluency vs Adequacy

[Chart: number of sentences at each Fluency level (1 to 4), broken down by Adequacy level (1 to 4); counts range roughly from 0 to 200 per bar]

• Good correlation between Fluency and BLEU.
• Strong correlation between Fluency and Adequacy.
• Large-scale evaluation can therefore be done using Fluency alone.
• Caution: domain diversity, speaker diversity.

Transfer Based MT: Marathi-Hindi

[Figure: kinds of MT systems by point of entry, from direct translation at the graphemic/tagged-text level, through surface and deep syntactic transfer (C-structures, F-structures), semantic and multilevel transfer (SPA-structures), up to conceptual transfer and the semantico-linguistic and ontological interlinguas]

Sampark: IL-IL MT Systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida.

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages (POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.) and bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks:
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language:
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (like listening to a language with lots of words borrowed from your language: you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (like someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (like someone speaking Hindi but getting all its genders wrong)
4 | Perfect (like someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences.

[Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator modules (values between 0 and 1)]

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, "Processing of Participle (Krudanta) in Marathi", International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun):
वाच (vaach, read) → वाचणे (vaachaNe, reading)
उतर (utara, climb down) → उतरण (utaraN, downward slope)

Adjectives (Verb → Adjective):
चाव (chav, bite) → चावणारा (chaavaNaara, one who bites)
खा (khaa, eat) → खाल्लेले (khallele, something that is eaten)

Kridantas derived from verbs (cont.): Adverbs (Verb → Adverb):
पळ (paL, run) → पळताना (paLataanaa, while running)
बस (bas, sit) → बसून (basun, after sitting)

Kridanta Types

Kridanta type | Example | Aspect
णे (Ne)-kridanta | vaachNyaasaaThee pustak de (for-reading book give: "Give me a book for reading") | Perfective
ला (laa)-kridanta | Lekh vaachalyaavar saaMgen (article after-reading will-tell: "I will tell you after reading the article") | Perfective
ताना (taanaa)-kridanta | Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came: "I noticed it while reading the book") | Durative
लेले (lela)-kridanta | kaal vaachlele pustak de (yesterday read book give: "Give me the book that was read yesterday") | Perfective
ऊन (un)-kridanta | pustak vaachun parat kar (book after-reading back do: "Return the book after reading it") | Completive
णारा (Nara)-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets: "The one who reads books gets knowledge") | Stative
वे (ve)-kridanta | he pustak pratyekaane vaachaave (this book everyone should-read: "Everyone should read this book") | Inceptive
ता (taa)-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went: "He fell asleep while reading a book") | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi -lelaa form.

Morphological Processing of Kridanta Forms (cont.)

[Figure: morphotactics FSM for kridanta processing]
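A minimal suffix-stripping sketch in the spirit of that morphotactics FSM; the transliterated suffix list is read off the kridanta table above, while the longest-match segmentation is a simplifying assumption rather than the actual FSM-based system:

```python
# Kridanta suffixes (transliterated) and the aspect they signal, following the
# types table above; longest-match stripping stands in for FSM transitions.
KRIDANTA_SUFFIXES = {
    "NyaasaaThee": ("Ne-kridanta", "perfective"),
    "lyaavar": ("laa-kridanta", "perfective"),
    "taanaa": ("taanaa-kridanta", "durative"),
    "lele": ("lela-kridanta", "perfective"),
    "un": ("Un-kridanta", "completive"),
    "NaaRaa": ("Nara-kridanta", "stative"),
    "aave": ("ve-kridanta", "inceptive"),
    "taa": ("taa-kridanta", "stative"),
}

def analyse_kridanta(word):
    """Return (stem, kridanta type, aspect), or (word, None, None) if no suffix matches."""
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            ktype, aspect = KRIDANTA_SUFFIXES[suffix]
            return word[:-len(suffix)], ktype, aspect
    return word, None, None

print(analyse_kridanta("vaachtaanaa"))   # ('vaach', 'taanaa-kridanta', 'durative')
print(analyse_kridanta("vaachlele"))     # ('vaach', 'lela-kridanta', 'perfective')
```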

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types, all between roughly 0.86 and 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins, so this is a relatively easier problem to solve.
• Will an interlingua be better?
• Web sentences are being used to test the performance.
• Rule governed; needs a high level of linguistic expertise.
• Will be an important contribution to IL-IL MT.


Summary interlingua based MT

• English to Hindi.
• Rule governed; a high level of linguistic expertise is needed.
• Takes a long time to build (under development since 1996).
• But produces great insight, resources and tools.

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

38

World-wide Universal Networking Language (UNL) Project

UNL

English Russian

Japanese

Hindi

Spanish

bull Language independent meaning representation

Marathi

Others

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

39

UNL represents knowledge John eats rice with a spoon

Semantic relations

attributes

Universal words

Repositoryof 42SemanticRelations and84 attributelabels

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

(In the graph: this → इसक लिलए, farmer → निकस4 क, you → आप, contact → सपक+ क0जि2ए)

Rel | Par+ | Par− | Chi+     | Chi−  | Child FW
Obj | V    | VINT | N, ANIMT | topic | क
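Both the case-marking table and this function-word table have the same shape: a rule fires when the relation matches, the parent and child carry the required (+) features, and neither carries a prohibited (−) feature. A minimal sketch of that matching logic, encoding only the Obj row shown above (the representation is an assumption):

```python
# Generic (+/-)-feature rule matcher, as used for case identification and
# function-word insertion; only the Obj row above is encoded, the rest is
# an illustrative assumption.
RULES = [
    {"rel": "obj",
     "parent_plus": {"V"}, "parent_minus": {"VINT"},
     "child_plus": {"N", "ANIMT"}, "child_minus": {"topic"},
     "child_fw": "क"},
]

def matches(rule, rel, parent_feats, child_feats):
    return (rule["rel"] == rel
            and rule["parent_plus"] <= parent_feats
            and not (rule["parent_minus"] & parent_feats)
            and rule["child_plus"] <= child_feats
            and not (rule["child_minus"] & child_feats))

def function_word(rel, parent_feats, child_feats):
    for rule in RULES:
        if matches(rule, rel, parent_feats, child_feats):
            return rule["child_fw"]
    return None

# farmer (N, animate, not a topic) as object of contact (V, not intransitive):
print(function_word("obj", {"V"}, {"N", "ANIMT"}))   # -> क
```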

Linearization

इस आप माचर कषतर · खाट तलक निकस4 · सपक+
(this you Manchar region · Khatav taluka farmers · contact)

[UNL graph repeated with the generated surface forms, e.g. contact → सपक+ क0जि2ए]

Syntax Planning: Assumptions

• Semantic Independence: the relative word order of a UNL relation's relata does not depend on the semantic properties of the relata
• Context Independence: the relative word order of a UNL relation's relata does not depend on the rest of the expression
• Local Ordering: the relative word order of various relations sharing a relatum does not depend on the rest of the expression

Pairwise local-ordering table for the example relations:

    | agt | aoj | obj | ins
agt |     |  L  |  L  |  L
aoj |     |     |  L  |  L
obj |     |     |     |  R
ins |     |     |     |

[Running-example UNL graphs repeated on the slide]

Syntax Planning: Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { } (for the contact node of the example)
• Topologically sort each group using the local ordering: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

Syntax Planning: Algorithm

[Worked trace of the stack-based algorithm on the example, with columns Stack | Before Current | After Current | Output; the stack successively expands contact, farmer, taluka, region and Manchar/Khatav, emitting the same order as the linearised Hindi above]
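A minimal sketch of the strategy just described: split a node's relations into before/after groups, order each group with the pairwise priority, and emit children recursively (a simplification of the stack-based trace above; the priority values and graph encoding are assumptions).

```python
# Illustrative syntax planning: before/after sets plus local ordering,
# applied depth-first over the UNL graph.

# Relations whose child is placed before its parent (from the example slide).
BEFORE = {"pur", "agt", "obj", "plc"}

# Local ordering among sibling relations (lower = further to the left),
# consistent with the pur < agt < obj order used in the example.
PRIORITY = {"pur": 0, "agt": 1, "obj": 2, "plc": 3}

def plan(node, children):
    """children: node -> list of (relation, child). Returns the surface order."""
    before, after = [], []
    for rel, child in sorted(children.get(node, []),
                             key=lambda rc: PRIORITY.get(rc[0], 99)):
        (before if rel in BEFORE else after).append(child)
    order = []
    for child in before:
        order.extend(plan(child, children))
    order.append(node)
    for child in after:
        order.extend(plan(child, children))
    return order

children = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(plan("contact", children))   # ['this', 'you', 'farmer', 'contact']
```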

All Together (UNL → Hindi)

[This slide repeats the complete module-by-module step-through shown above (UNL expression, Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization), ending in the Hindi sentence glossed "this for you Manchar region or Khatav taluka of farmers contact".]

How to Evaluate UNL Deconversion?

• UNL → Hindi: the reference translation needs to be generated from the UNL itself – needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was produced – works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken – understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations (Hindi output, with Fluency and Adequacy scores)

• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए – Fluency 2, Adequacy 3
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए – Fluency 3, Adequacy 4
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत – Fluency 1, Adequacy 1
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp – Fluency 4, Adequacy 4
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp – Fluency 2, Adequacy 3

Results

                        | BLEU | Fluency | Adequacy
Geometric Average       | 0.34 | 2.54    | 2.84
Arithmetic Average      | 0.41 | 2.71    | 3.00
Standard Deviation      | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)     | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency)  | 0.59 | 1.00    | 0.68
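The correlations in the table can be recomputed from per-sentence scores in a few lines; the score lists below are placeholders for illustration, not the actual evaluation data.

```python
# Pearson correlation between per-sentence BLEU and fluency scores (sketch).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu    = [0.34, 0.12, 0.55, 0.41, 0.28]   # placeholder per-sentence scores
fluency = [3,    2,    4,    3,    2]
print(round(pearson(bleu, fluency), 2))
```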

Fluency vs Adequacy

[Bar chart: number of sentences (y-axis 0 to 200) at each fluency level 1-4, broken down by adequacy levels 1-4]

• Good correlation between fluency and BLEU
• Strong correlation between fluency and adequacy
• Can do large-scale evaluation using fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Vauquois diagram repeated: analysis levels from text, tagged text, C-structures, F-structures and SPA-structures up to the semantico-linguistic and ontological interlinguas, with direct, syntactic-transfer, semantic-transfer, multilevel-transfer and interlingua-based translation paths]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark: IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS tagging & chunking engine; H2 morph analyser engine; H3 generator engine; H4 lexical disambiguation engine; H5 named entity engine; H6 dictionary standards; H7 corpora annotation standards; H8 evaluation of output (comprehensibility); H9 testing & integration

Vertical Tasks for Each Language

V1 POS tagger & chunker; V2 morph analyzer; V3 generator; V4 named entity recognizer; V5 bilingual dictionary (bidirectional); V6 transfer grammar; V7 annotated corpus; V8 evaluation; V9 co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all – like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language – you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (y-axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
च vaach (read) → चण vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
च chav (bite) → चणर chaavaNaara (one who bites)
खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

Kridanta Type | Example | Aspect
"ण" Ne-Kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] | Perfective
"ल" laa-Kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] | Perfective
"त" Taanaa-Kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] | Durative
"लल" Lela-Kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] | Perfective
"ऊ" Un-Kridanta | pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] | Completive
"णर" Nara-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] | Stative
ve-Kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] | Inceptive
"त" taa-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
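A toy, transliterated sketch of the idea behind the morphotactics FSM: a verb stem followed by a recognised kridanta suffix yields the kridanta type and aspect. The suffix inventory loosely follows the Kridanta Types table; the transliterations, stem list and longest-suffix matching are assumptions, not the published analyser.

```python
# Toy kridanta recogniser (transliterated): stem + suffix -> kridanta type.
# Suffix inventory loosely follows the Kridanta Types table; everything else
# is illustrative.
SUFFIXES = {          # suffix -> (kridanta type, aspect)
    "NyaasaaThee": ("Ne-kridanta", "perfective"),
    "lyaavar":     ("laa-kridanta", "perfective"),
    "taanaa":      ("Taanaa-kridanta", "durative"),
    "lele":        ("Lela-kridanta", "perfective"),
    "un":          ("Un-kridanta", "completive"),
    "NaaRyaalaa":  ("Nara-kridanta", "stative"),
    "aave":        ("ve-kridanta", "inceptive"),
}
STEMS = {"vaach", "paL", "bas", "khaa"}     # toy stem lexicon

def analyse(word):
    # longest suffix first, so e.g. "vaachtaanaa" is not wrongly cut elsewhere
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and word[: -len(suffix)] in STEMS:
            ktype, aspect = SUFFIXES[suffix]
            return word[: -len(suffix)], ktype, aspect
    return None

print(analyse("vaachtaanaa"))   # ('vaach', 'Taanaa-kridanta', 'durative')
print(analyse("vaachlele"))     # ('vaach', 'Lela-kridanta', 'perfective')
```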

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall of kridanta processing per type (Ne, La, Nara, Lela, Tana, T, Oon and Va kridantas); y-axis 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

40

Sentence embeddings

Deepa claimed that she had composed a poem

[UNL]agt(claimentrypast Deepa)

obj(claimentrypast 01)

agt01(composepastentrycomplete she)

obj01(composepastentrycomplete poemindef)

[UNL]

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

41

The Lexicon

Content words

[forward] ldquoforward(iclgtsend)rdquo (VVOA) ltE00gt

[mail] ldquomail(iclgtmessage)rdquo (NPHSCLINANI) ltE00gt

[minister] ldquominister(iclgtperson)rdquo (NANIMTPHSCLPRSN) ltE00gt

Headword Universal Word Attributes

He forwarded the mail to the minister

42

The Lexicon (cntd)

function words

[he] ldquoherdquo (PRONSUBSING3RD) ltE00gt

[the] ldquotherdquo (ARTTHE) ltE00gt

[to] ldquotordquo (PRETO) ltE00gt

Headword Universal Word

Attributes

He forwarded the mail to the minister

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module -> Output, for the UNL expression obj(contact(…

Lexeme Selection:
संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Case Identification:
संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Morphology Generation:
संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
(contact-imperative farmer-pl this you region taluka Manchar Khatav)

Function Word Insertion:
संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
(contact farmers-ko this-for you region or taluka-of Manchar Khatav)

Linearization:
इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए |
(this-for you Manchar region or Khatav taluka-of farmers contact)

[UNL graph of the example: nodes contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, nam, or; scope :01]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous

[Sub-graph with chosen lexemes: you -> आप, this -> यह, farmer -> किसान, contact -> संपर्क; relations agt, obj, pur]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation | Parent+ | Parent- | Child+ | Child-
Obj      | V       | VINT    | N      |
Agt      | V       | past    | N      |

[Sub-graph with current lexemes: you -> आप, this -> यह, farmer -> किसान, contact -> संपर्क]

Morphology Generation: Nouns

Suffix | Attribute values
-uoM   | N, NUM=pl, oblique
-U     | N, NUM=sg, oblique
-I     | N, NIF, sg, oblique
-iyoM  | N, NIF, pl, oblique
-oM    | N, NANOTCHF, pl, oblique

The boy saw me: लड़के ने मुझे देखा
Boys saw me: लड़कों ने मुझे देखा
The King saw me: राजा ने मुझे देखा
Kings saw me: राजाओं ने मुझे देखा
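The table can be read as a straight lookup from an attribute bundle to a suffix. The sketch below is illustrative only: the string keys are copied from the table rows above, while the sample stem and its transliteration are assumptions.

```python
# Illustrative oblique-suffix lookup keyed on the attribute bundles of the table.
NOUN_OBLIQUE_SUFFIX = {
    "N+NUM=pl+oblique": "uoM",
    "N+NUM=sg+oblique": "U",
    "N+NIF+sg+oblique": "I",
    "N+NIF+pl+oblique": "iyoM",
    "N+NANOTCHF+pl+oblique": "oM",
}

def oblique_form(stem, attribute_bundle):
    return stem + NOUN_OBLIQUE_SUFFIX.get(attribute_bundle, "")

print(oblique_form("raajaa", "N+NANOTCHF+pl+oblique"))   # raajaaoM ~ राजाओं ('Kings saw me')
```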

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A
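The verb table works the same way; the tuple key below mirrors the table's columns (tense, aspect, mood, number, gender, person) and is an assumed representation, not the HinD generator's.

```python
# Illustrative TAM-bundle -> verb suffix lookup; rows copied from the table above.
VERB_SUFFIX = {
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="male", person="3rd"):
    return VERB_SUFFIX.get((tense, aspect, mood, number, gender, person))

print(verb_suffix("past", aspect="progress"))                                 # -e rahaa thaa
print(verb_suffix("present", mood="ability", number="pl", gender="female"))  # saktii hain
```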

After Morphology Generation

Before: you -> आप, this -> यह, farmer -> किसान, contact -> संपर्क
After:  you -> आप, this -> इस, farmer -> किसानों, contact -> संपर्क कीजिए
(relations agt, obj, pur as in the running example)

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

you -> आप, this -> इसके लिए, farmer -> किसानों को, contact -> संपर्क कीजिए
(relations agt, obj, pur as in the running example)

Rel | Par+ | Par- | Chi+     | Chi-  | Child FW
Obj | V    | VINT | N, ANIMT | topic | को

(A sketch of this rule format follows below.)
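The rule row reads: for an obj relation whose parent is a verb (but not intransitive) and whose child is an animate noun that is not the topic, attach को to the child. A hedged sketch of such a matcher is given below; the set-based property encoding is an assumption, not the HinD rule engine, and the case-marking rules shown earlier have the same shape.

```python
# Illustrative function-word insertion: '+' property sets must be present on the
# node, '-' sets must be absent; the single rule below is the को row above.
FW_RULES = [
    # (relation, parent+, parent-, child+, child-, child function word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "को"),
]

def with_function_word(relation, parent_props, child_props, child_surface):
    for rel, p_plus, p_minus, c_plus, c_minus, fw in FW_RULES:
        if (rel == relation
                and p_plus <= parent_props and not (p_minus & parent_props)
                and c_plus <= child_props and not (c_minus & child_props)):
            return child_surface + " " + fw
    return child_surface

# obj(contact, farmer): parent is a (non-intransitive) verb, child an animate noun
print(with_function_word("obj", {"V"}, {"N", "ANIMT", "PRSN"}, "किसानों"))   # किसानों को
```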

Linearization

Output order: इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए |
(this-for you Manchar region or Khatav taluka-of farmers contact)

[UNL graph of the example, now fully lexicalised: contact -> संपर्क कीजिए; relations agt, obj, pur, plc, nam, or over you, this, farmer, region, taluka, Manchar, Khatav; scope :01]

Syntax Planning Assumptions

• Semantic Independence: the relative word order of a UNL relation's relata does not depend on the semantic properties of the relata
• Context Independence: the relative word order of a UNL relation's relata does not depend on the rest of the expression
• Local Ordering: the relative word order of various relations sharing a relatum does not depend on the rest of the expression

Pairwise ordering (row relation placed Left/Right of the column relation):
agt–aoj: L   agt–obj: L   agt–ins: L
aoj–obj: L   aoj–ins: L
obj–ins: R

[Sub-graphs of the running example (agt, obj, pur around contact) illustrating the assumptions]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group: pur, agt, obj, i.e. this, you, farmer
• Final order: this you farmer contact

(A sketch of this ordering step follows below.)
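A sketch of this ordering step under the slide's assumptions: dependents in the Before_set are topologically sorted with the pairwise precedence table from the previous slide and then emitted before the head. The data structures are illustrative; relations missing from the table (e.g. pur) simply keep their input order here, and one dependent per relation is assumed.

```python
# Illustrative syntax planning: order Before_set dependents by the pairwise
# L/R table, then emit them followed by the head word.
PRECEDENCE = {("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
              ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R"}

def order_before_set(relations):
    """relations: list of (relation, child word); one dependent per relation."""
    rels = [r for r, _ in relations]
    preds = {r: set() for r in rels}          # preds[x]: relations that must precede x
    for (a, b), side in PRECEDENCE.items():
        if a in preds and b in preds:
            (preds[b] if side == "L" else preds[a]).add(a if side == "L" else b)
    placed = []
    while len(placed) < len(rels):            # simple topological sort
        nxt = next(r for r in rels if r not in placed and preds[r] <= set(placed))
        placed.append(nxt)
    child = dict(relations)
    return [child[r] for r in placed]

# 'Will-Smith was eating an apple with a fork': agt, ins, obj all precede the verb
before = order_before_set([("obj", "apple"), ("ins", "fork"), ("agt", "Will-Smith")])
print(before + ["eat"])   # ['Will-Smith', 'fork', 'apple', 'eat']
```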

Syntax Planning Algo

[Worked trace of the stack-based planner on the running example; table columns: Stack, Before-Current, After-Current, Output]

[UNL graph of the example: contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, nam, or; scope :01]

All Together (UNL -> Hindi)

Module -> Output, for the UNL expression obj(contact(…

Lexeme Selection:
संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Case Identification:
संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Morphology Generation:
संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
(contact-imperative farmer-pl this you region taluka Manchar Khatav)

Function Word Insertion:
संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
(contact farmers-ko this-for you region or taluka-of Manchar Khatav)

Syntax Linearization:
इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए |
(this-for you Manchar region or Khatav taluka-of farmers contact)

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, but understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                         BLEU   Fluency   Adequacy
Geometric Average        0.34   2.54      2.84
Arithmetic Average       0.41   2.71      3.00
Standard Deviation       0.25   0.89      0.89
Pearson Corr. (BLEU)     1.00   0.59      0.50
Pearson Corr. (Fluency)  0.59   1.00      0.68
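For reference, a Pearson correlation like the BLEU-fluency figure above is computed from per-sentence scores as below; the score lists are invented for illustration and are not the study's data.

```python
# Pearson correlation between two per-sentence score lists (illustrative data).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu    = [0.21, 0.35, 0.48, 0.60, 0.18]   # hypothetical per-sentence BLEU
fluency = [2, 3, 3, 4, 1]                  # hypothetical 1-4 fluency judgements
print(round(pearson(bleu, fluency), 2))
```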

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each fluency level (1–4), broken down by adequacy levels 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
  – Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level Direct translation

Syntactic transfer (surface)

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semantic & predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel description

Semi-direct translation

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives

• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all – like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language that has lots of words borrowed from your language – you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (0–1) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (verb -> noun):
वाच vaach (read) -> वाचणे vaachaNe (reading)
उतर utar (climb down) -> उतरण utaraN (downward slope)

Adjectives (verb -> adjective):
चाव chaav (bite) -> चावणारा chaavaNaaraa (one who bites)
खा khaa (eat) -> खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (verb -> adverb):
पळ paL (run) -> पळताना paLataanaa (while running)
बस bas (sit) -> बसून basun (after sitting)

Kridanta Types

Type: "णे" Ne-kridanta (Aspect: Perfective)
  vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give]

Type: "ला" laa-kridanta (Aspect: Perfective)
  Lekh vaachalyaavar saaMgen (I will tell you after reading the article) [article after-reading will-tell]

Type: "ताना" taanaa-kridanta (Aspect: Durative)
  Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came]

Type: "लेले" lela-kridanta (Aspect: Perfective)
  kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give]

Type: "ऊन" Un-kridanta (Aspect: Completive)
  pustak vaachun parat kar (Return the book after reading it) [book after-reading back do]

Type: "णारा" Nara-kridanta (Aspect: Stative)
  pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-reader knowledge gets]

Type: "वे" ve-kridanta (Aspect: Inceptive)
  he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read]

Type: "ता" taa-kridanta (Aspect: Stative)
  to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went]

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
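Since the FSM figure itself is not reproduced in this transcript, the sketch below illustrates the morphotactic idea (verb stem + participial suffix, longest suffix first) with suffixes transliterated from the kridanta-type table; the stem lexicon and the exact suffix spellings are assumptions, not the paper's analyser.

```python
# Illustrative kridanta recognizer: word = known verb stem + participial suffix.
KRIDANTA_SUFFIXES = {            # Roman transliterations; approximate forms
    "NyaasaaThee": "Ne", "lyaavar": "laa", "taanaa": "taanaa", "lele": "lelaa",
    "un": "Un", "NaaRaa": "Nara", "aave": "ve", "taa": "taa",
}
VERB_STEMS = {"vaach", "utar", "paL", "bas", "khaa"}   # sample Marathi stems

def analyse_kridanta(word):
    """Return (stem, kridanta type) if the word decomposes, else None."""
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):  # longest match first
        if word.endswith(suffix) and word[: -len(suffix)] in VERB_STEMS:
            return word[: -len(suffix)], KRIDANTA_SUFFIXES[suffix]
    return None

print(analyse_kridanta("vaachtaanaa"))   # ('vaach', 'taanaa')
print(analyse_kridanta("vaachlele"))     # ('vaach', 'lelaa')
print(analyse_kridanta("vaachun"))       # ('vaach', 'Un')
```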

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall (approx. 0.86–0.98) of kridanta processing for the Ne-, laa-, Nara-, lela-, taanaa-, taa-, Un- and ve-kridanta types]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • “Linguistic” interlingua
  • Slide 24
  • “Interlingua” and “Esperanto”
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -> Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

43

How to obtain UNL expresssions

bull UNL nerve center is the verbbull English sentences

ndash A ltverbgt B orndash A ltverbgt

verb

BA

A

verb

rel relrel

Recurse on A and B

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (1/3)
• Conjunctive relations – wrong apposition detection
  Example: James loves eating red water-melon, apples peeled nicely and bananas.
  Stanford output: amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)
• Conjunctive relations – non-uniform relation generation
  Example: George the president of the football club, James the coach of the club and the players went to watch their rivals game.
  Stanford output: appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)
• Low precision on frequent relations
  – mod and qua: more general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibilities
  Example: The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms.
  Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
  Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation
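The deconverter runs the UNL graph through a fixed sequence of stages, as the step-through below illustrates: lexeme selection, case identification, morphology generation, function word insertion and linearization. A minimal sketch of that pipeline shape follows; every stage body is a trivial placeholder and the graph representation is an assumption, not the actual modules.

    def lexeme_selection(g):        return g   # pick a Hindi lexeme for each UW
    def case_identification(g):     return g   # mark direct/oblique case
    def morphology_generation(g):   return g   # attach noun/verb suffixes
    def function_word_insertion(g): return g   # insert postpositions etc.
    def linearization(g):                      # syntax planning: fix word order
        return dict(g, surface=list(g["nodes"]))

    def deconvert(unl_graph):
        for stage in (lexeme_selection, case_identification, morphology_generation,
                      function_word_insertion, linearization):
            unl_graph = stage(unl_graph)
        return " ".join(unl_graph["surface"])

    # e.g. deconvert({"nodes": ["this", "you", "farmer", "contact"], "relations": []})
    # -> "this you farmer contact"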

Step-through the Deconverter

Module → Output
• UNL Expression: obj(contact(…
• Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka Manchar Khatav)
• Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka Manchar Khatav)
• Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact-imperative farmer-pl this you region taluka Manchar Khatav)
• Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact-imperative farmers this-for you region or taluka-of Manchar Khatav)
• Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this-for you Manchar region or Khatav taluka-of farmers contact)

[UNL graph (ID 01) of the example: nodes you, this, farmer, contact, region, taluka, Manchar, Khatav connected by agt, obj, pur, plc, nam and or relations.]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCT, N-Vlink, Va)
[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.

Case Marking

[Sub-graph: contact (सपक+) with agt: you (आप), obj: farmer (निकस), pur: this (य).]

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Rule table:
Relation | Parent+ | Parent- | Child+ | Child-
obj      | V       | VINT    | N      |
agt      | V       | past    | N      |
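The case-marking rules are lookups keyed on the UNL relation plus required (+) and forbidden (-) properties of the parent and child nodes. A toy sketch of that rule matching; the two rule entries mirror the table above, everything else is an assumption.

    CASE_RULES = [
        {"rel": "obj", "par+": {"V"}, "par-": {"VINT"}, "chi+": {"N"}, "chi-": set()},
        {"rel": "agt", "par+": {"V"}, "par-": {"past"}, "chi+": {"N"}, "chi-": set()},
    ]

    def case_applies(rule, rel, parent_props, child_props):
        return (rel == rule["rel"]
                and rule["par+"] <= parent_props and not (rule["par-"] & parent_props)
                and rule["chi+"] <= child_props and not (rule["chi-"] & child_props))

    # e.g. case_applies(CASE_RULES[0], "obj", {"V"}, {"N", "ANIMT"}) -> True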

Morphology Generation: Nouns

Suffix | Attribute values
uoM    | N, NUM pl, oblique
U      | N, NUM sg, oblique
I      | N, NIF sg, oblique
iyoM   | N, NIF pl, oblique
oM     | N, NANOTCHF pl, oblique

Examples:
The boy saw me: लड़क माझ दखा
Boys saw me: लड़क माझ दखा
The King saw me: र2 माझ दखा
Kings saw me: र2ऒ माझ दखा

Verb Morphology

Suffix       | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A
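The verb suffix can be seen as a lookup keyed on (tense, aspect, mood, number, gender, person). A toy sketch whose table rows mirror the slide; the lookup mechanism itself is an assumption, not the generator's code.

    VERB_SUFFIXES = {
        ("past", "progress", None, "sg", "male", "3rd"): "-e rahaa thaa",
        ("present", "custom", None, "sg", "male", "3rd"): "-taa hai",
        ("past", "complete", None, "sg", "male", "3rd"): "-iyaa thaa",
        ("present", None, "ability", "pl", "female", "3rd"): "saktii hain",
    }

    def verb_suffix(tense, aspect=None, mood=None, num="sg", gen="male", person="3rd"):
        return VERB_SUFFIXES.get((tense, aspect, mood, num, gen, person))

    # e.g. verb_suffix("past", aspect="progress") -> "-e rahaa thaa"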

After Morphology Generation

[Sub-graph before: contact (सपक+) with agt: you (आप), obj: farmer (निकस), pur: this (य).
Sub-graph after: contact (सपक+ क0जि2ए) with agt: you (आप), obj: farmer (निकस4), pur: this (इस).]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

[Sub-graph after insertion: contact (सपक+ क0जि2ए) with agt: you (आप), obj: farmer (निकस4 क), pur: this (इसक लिलए).]

Rule table:
Rel | Par+ | Par- | Chi+     | Chi-  | Child FW
obj | V    | VINT | N, ANIMT | topic | क

Linearization

इस आप माचर कषतर (this you Manchar region), खाट तलक निकस4 (Khatav taluka farmers), सपक+ (contact)

[UNL graph (ID 01) repeated with the inflected nodes: you, this, farmer, contact (सपक+ क0जि2ए), region, taluka, Manchar, Khatav linked by agt, obj, pur, plc, nam and or.]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local Ordering: the rest of the expression

[Graph fragments of the running example (contact with agt, obj, pur over you, farmer, this) repeated.]

Pairwise precedence matrix:
      agt  aoj  obj  ins
agt        L    L    L
aoj             L    L
obj                  R
ins

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topo-sort each group: pur agt obj, giving this you farmer
• Final order: this you farmer contact
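Each group of co-dependents can be ordered by pairwise comparison against the precedence matrix above. A minimal sketch under the assumption that unspecified pairs keep their input order (the real system's tie-breaking is not described here); the matrix entries mirror the slide.

    import functools

    # 'L' means the row relation's child precedes the column relation's child.
    PRECEDENCE = {
        ("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
        ("aoj", "obj"): "L", ("aoj", "ins"): "L",
        ("obj", "ins"): "R",
    }

    def _cmp(a, b):
        # a, b: (relation_label, child_word)
        if (a[0], b[0]) in PRECEDENCE:
            return -1 if PRECEDENCE[(a[0], b[0])] == "L" else 1
        if (b[0], a[0]) in PRECEDENCE:
            return 1 if PRECEDENCE[(b[0], a[0])] == "L" else -1
        return 0  # unspecified pair: keep input order (stable sort)

    def order_children(relations):
        return [w for _, w in sorted(relations, key=functools.cmp_to_key(_cmp))]

    # e.g. order_children([("obj", "rice"), ("agt", "John"), ("ins", "spoon")])
    # -> ["John", "spoon", "rice"]  (agt before ins and obj; obj to the right of ins)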

Syntax Planning Algo

[Table: trace of the planning algorithm on the running example with columns Stack, Before-Current, After-Current and Output, expanding contact, farmer, taluka, region and manchar until the linearized order shown earlier is produced.]


All Together (UNL → Hindi)

[This slide repeats the step-through table (Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization) with the same example and English glosses as above; the Devanagari in this copy was not preserved by extraction.]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations (Hindi output, with Fluency and Adequacy scores)

• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (Fluency 2, Adequacy 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (Fluency 3, Adequacy 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (Fluency 1, Adequacy 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (Fluency 4, Adequacy 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (Fluency 2, Adequacy 3)

Results

                        BLEU   Fluency  Adequacy
Geometric average       0.34   2.54     2.84
Arithmetic average      0.41   2.71     3.00
Standard deviation      0.25   0.89     0.89
Pearson cor. (BLEU)     1.00   0.59     0.50
Pearson cor. (Fluency)  0.59   1.00     0.68
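The correlation rows can be reproduced from per-sentence scores; a small sketch follows, with the assumption that BLEU and the human scores are paired per sentence before correlating.

    from math import sqrt

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # e.g. pearson(bleu_scores, fluency_scores) would be expected to come out
    # around 0.59 on this evaluation set.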

Fluency vs Adequacy

[Chart: number of sentences at each fluency level (1 to 4), broken down by adequacy level (1 to 4); y-axis 0 to 200.]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT: Marathi-Hindi

[Vauquois diagram repeated: transfer levels from direct translation at the graphemic level, through surface and deep syntactic transfer and semantic transfer, up to the interlingual level, with the corresponding structures (tagged text, C-structures, F-structures, SPA-structures, semantico-linguistic and ontological interlingua).]

Sampark: IL-IL MT systems

• Consortium mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)
Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation
• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System
• Module-wise evaluation: evaluated on 500 web sentences

[Chart: module-wise precision and recall (axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator.]

Evaluation of Marathi to Hindi MT System (contd.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
• च (vaach, 'read') → चण (vaachaNe, 'reading')
• उतर (utara, 'climb down') → उतरण (utaraN, 'downward slope')

Adjectives (Verb → Adjective):
• च (chav, 'bite') → चणर (chaavaNaara, 'one who bites')
• खा (khaa, 'eat') → खाललल (khallele, 'something that is eaten')

Kridantas derived from verbs (contd.)

Adverbs (Verb → Adverb):
• पळ (paL, 'run') → पळताना (paLataanaa, 'while running')
• बस (bas, 'sit') → बसना (basun, 'after sitting')

Kridanta Types (Kridanta type | Example | Aspect)

• "ण" Ne-Kridanta: vaachNyaasaaThee pustak de (for-reading book give: "Give me a book for reading") – Perfective
• "ल" laa-Kridanta: Lekh vaachalyaavar saaMgen (article after-reading will-tell: "I will tell you that after reading the article") – Perfective
• "त" Taanaa-Kridanta: Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came: "I noticed it while reading the book") – Durative
• "लल" Lela-Kridanta: kaal vaachlele pustak de (yesterday read book give: "Give me the book that (I/you) read yesterday") – Perfective
• "ऊ" Un-Kridanta: pustak vaachun parat kar (book after-reading back do: "Return the book after reading it") – Completive
• "णर" Nara-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets: "The one who reads books gets knowledge") – Stative
• ve-Kridanta: he pustak pratyekaane vaachaave (this book everyone should-read: "Everyone should read this book") – Inceptive
• "त" taa-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went: "He fell asleep while reading a book") – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada:
  muridiruwaa kombe jennu esee
  broken to branch throw
  'Throw away the broken branch.'
  (similar to the "lela" form frequently used in Marathi)

Telugu:
  ame padutunnappudoo nenoo panichesanoo
  she singing I work
  'I worked while she was singing.'
  (similar to the "taanaa" form frequently used in Marathi)

Turkish:
  hazirlanmis plan
  prepare-past plan
  'The plan which has been prepared.'
  (equivalent to Marathi "lelaa")

Morphological Processing of Kridanta forms (contd.)

Fig.: Morphotactics FSM for Kridanta processing [figure not preserved in extraction]
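The morphotactics can be thought of as a small finite-state machine over stem and suffix sequences. A toy sketch follows; the states, transitions and suffix categories are illustrative assumptions, not the paper's full FSM.

    # Toy finite-state morphotactics sketch for Marathi kridanta forms:
    # verb stem, then a participle suffix, optionally followed by case marking.
    TRANSITIONS = {
        ("START", "stem"): "STEM",
        ("STEM", "kridanta_suffix"): "KRIDANTA",   # e.g. -Ne, -lela, -taanaa
        ("KRIDANTA", "case_suffix"): "INFLECTED",  # e.g. oblique/case marking
    }
    ACCEPTING = {"KRIDANTA", "INFLECTED"}

    def accepts(segments):
        # segments: list of morph categories, e.g. ["stem", "kridanta_suffix"]
        state = "START"
        for seg in segments:
            state = TRANSITIONS.get((state, seg))
            if state is None:
                return False
        return state in ACCEPTING

    # accepts(["stem", "kridanta_suffix"]) -> True
    # accepts(["kridanta_suffix"]) -> False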

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall (roughly 0.86 to 0.98) for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta forms.]

Summary of M-H transfer based MT
• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

Summary: interlingua based MT
• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Enconversion

Long sequence of masters theses Anupama Krishna Sandip Abhishek Gourab Subhajit Janardhan in collaboration with Rajat Jaya

Gajanan Rajita

Many publications

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

System Architecture

Stanford Dependency

Parser

XLE Parser

Feature Generation

Attribute Generation

Relation Generation

Simple SentenceEnconverter

NER

Stanford Dependency Parser

WSD

Clause Marker

Merger

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

SimpleEnco

Simplifier

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (3/3)

• Low-precision frequent relations
– mod and qua: the rules are more general
– agt–aoj: non-uniformity in the corpus
– gol–obj: multiple relation-generation possibilities

Example: The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module → Output (UNL expression: obj(contact(…)

Lexeme Selection → संपर्क किसान यह आप क्षेत्र तालुका मांचर खाटव
(contact farmer this you region taluka Manchar Khatav)

Case Identification → संपर्क किसान यह आप क्षेत्र तालुका मांचर खाटव
(contact farmer this you region taluka Manchar Khatav)

Morphology Generation → संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खाटव
(contact-imperative farmers this you region taluka Manchar Khatav)

Function Word Insertion → संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खाटव
(contact farmers-acc this-for you region or taluka-of Manchar Khatav)

Linearization → इसके लिए आप मांचर क्षेत्र या खाटव तालुका के किसानों को संपर्क कीजिए |
(this-for you Manchar region or Khatav taluka-of farmers contact)

[UNL graph of the example — nodes: contact, you, this, farmer, region, taluka, Manchar, Khatav; relations: agt, obj, pur, plc, or, nam; scope :01]
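The step-through suggests a simple staged pipeline. The toy sketch below wires the stages together in the order shown; every function body is a simplified stand-in invented for illustration and is not the real HinD implementation.

```python
# Toy skeleton of a UNL -> Hindi deconversion pipeline in the stage order
# shown above (lexeme selection, morphology, function-word insertion,
# linearization). All bodies are placeholders, not the real modules.

unl = {
    "entry": "contact",
    "relations": [("agt", "contact", "you"),
                  ("obj", "contact", "farmer"),
                  ("pur", "contact", "this")],
}
lexicon = {"contact": "संपर्क", "you": "आप", "farmer": "किसान", "this": "यह"}

def select_lexemes(unl, lexicon):
    # UW -> Hindi lexeme (lexical choice assumed unambiguous, as on the slide)
    return {uw: lexicon[uw] for rel in unl["relations"] for uw in rel[1:]}

def generate_morphology(unl, words):
    words = dict(words)
    words[unl["entry"]] += " कीजिए"   # imperative on the entry verb
    return words                       # (the real system also inflects nouns/pronouns)

def insert_function_words(unl, words):
    words = dict(words)
    for rel, _head, child in unl["relations"]:
        if rel == "obj":
            words[child] += " को"      # accusative/topic marker
        elif rel == "pur":
            words[child] += " के लिए"   # purposive postposition
    return words

def linearize(unl, words):
    order = ["this", "you", "farmer", "contact"]   # from syntax planning
    return [words[uw] for uw in order]

words = select_lexemes(unl, lexicon)
words = insert_function_words(unl, generate_morphology(unl, words))
print(" ".join(linearize(unl, words)))
# -> यह के लिए आप किसान को संपर्क कीजिए
# (the real deconverter also produces the oblique forms इस(के) and किसानों)
```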

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

• Lexical choice is unambiguous

[Graph after lexeme selection: you → आप, this → यह, farmer → किसान, contact → संपर्क; relations agt, obj, pur unchanged]

Case Marking

Relation | Parent+ | Parent– | Child+ | Child–
Obj | V | VINT | N |
Agt | V | past | N |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[Graph after case identification: same nodes as above, with case features marked]

Morphology Generation Nouns

Suffix | Attribute values
-uoM | N, NUM=pl, oblique
-U | N, NUM=sg, oblique
-I | N, NIF=sg, oblique
-iyoM | N, NIF=pl, oblique
-oM | N, NANOTCHF=pl, oblique

The boy saw me — लड़के ने मुझे देखा
Boys saw me — लड़कों ने मुझे देखा
The King saw me — राजा ने मुझे देखा
Kings saw me — राजाओं ने मुझे देखा

Verb Morphology

Suffix | Tense | Aspect | Mood | N | Gen | P | VE
-e rahaa thaa | past | progress | – | sg | male | 3rd | e
-taa hai | present | custom | – | sg | male | 3rd | –
-iyaa thaa | past | complete | – | sg | male | 3rd | I
saktii hain | present | – | ability | pl | female | 3rd | A
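Both morphology tables are essentially lookups keyed on attribute bundles. Below is a hedged sketch of the verb side; the keys and templates are simplified from the table above (number/gender/person fixed to sg/male/3rd, the VE column ignored) and are not the system's actual paradigm files.

```python
# Sketch: verb morphology generation as a template lookup on
# (tense, aspect, mood). Simplified and illustrative only.

VERB_MORPH = {
    ("past",    "progress", None):      "{stem} rahaa thaa",
    ("present", "custom",   None):      "{stem}taa hai",
    ("past",    "complete", None):      "{stem}yaa thaa",
    ("present", None,       "ability"): "{stem} saktii hain",   # pl, female
}

def inflect(stem, tense, aspect=None, mood=None):
    template = VERB_MORPH.get((tense, aspect, mood), "{stem}")
    return template.format(stem=stem)

print(inflect("khaa", "past", aspect="progress"))    # khaa rahaa thaa ("was eating")
print(inflect("khaa", "present", aspect="custom"))   # khaataa hai ("eats")
```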

After Morphology Generation

[Graphs before and after morphology generation: you → आप, this → यह ⇒ इस, farmer → किसान ⇒ किसानों, contact → संपर्क ⇒ संपर्क कीजिए]

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मांचर खाटव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खाटव

[Graph after function word insertion: you → आप, this → इसके लिए, farmer → किसानों को, contact → संपर्क कीजिए]

Rel | Par+ | Par– | Chi+ | Chi– | ChFW
Obj | V | VINT | N, ANIMT | topic | को
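The Obj row above can be read as a feature-conditioned rule. A minimal sketch of such rule application follows; the rule tuple format is invented for illustration.

```python
# Minimal sketch of function-word insertion driven by
# (relation, parent-feature, child-feature) rules. Only the Obj row from
# the table above is encoded.

FW_RULES = [
    # (relation, parent has, parent lacks, child has, child lacks, FW after child)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "को"),
]

def insert_function_words(relations, features):
    """relations: (rel, parent, child); features: node -> set of properties."""
    inserted = {}
    for rel, parent, child in relations:
        pf, cf = features.get(parent, set()), features.get(child, set())
        for r, p_plus, p_minus, c_plus, c_minus, fw in FW_RULES:
            if (rel == r and p_plus <= pf and not (p_minus & pf)
                    and c_plus <= cf and not (c_minus & cf)):
                inserted[child] = fw
    return inserted

rels = [("obj", "contact", "farmer")]
feats = {"contact": {"V"}, "farmer": {"N", "ANIMT"}}
print(insert_function_words(rels, feats))    # {'farmer': 'को'}  -> किसानों को
```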

Linearization

इस आप मांचर क्षेत्र (this you Manchar region) खाटव तालुका किसानों (Khatav taluka farmers) संपर्क (contact)

[UNL graph with the final linear order applied: this, you, Manchar, region, Khatav, taluka, farmer, contact (संपर्क कीजिए); relations agt, obj, pur, plc, or, nam; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
– Semantic Independence: the semantic properties of the relata
– Context Independence: the rest of the expression

• The relative word order of various relations sharing a relatum does not depend on:
– Local Ordering: the rest of the expression

[Figure: the pur(contact, this) pair shown in isolation and inside the full graph (agt: you, obj: farmer, pur: this) — the relative order of 'this' and 'contact' is the same in both]

Pairwise local ordering of relations sharing a relatum (row vs. column; L = left, R = right):
    | agt | aoj | obj | ins
Agt |     |  L  |  L  |  L
Aoj |     |     |  L  |  L
Obj |     |     |     |  R
Ins |     |     |     |

[Figure: the example graph (agt: you, obj: farmer, pur: this) around 'contact']

Syntax Planning Strategy

• Divide a node's relations into
Before_set = {obj, pur, agt}
After_set = { }

• Topo-sort each group: pur agt obj → this you farmer

• Final order: this you farmer contact

[Figure: remaining sub-graph — region, Khatav, Manchar, taluka]
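A compact sketch of this planning step follows. The per-relation Before/After side and the within-group priority are assumptions taken from the example above; the real planner derives both from the pairwise L/R table via a topological sort.

```python
# Sketch of syntax planning for one node: put each child in a Before or
# After group, then order the Before group with an assumed priority.

SIDE = {"agt": "Before", "obj": "Before", "pur": "Before", "ins": "After"}
PRIORITY = ["pur", "plc", "agt", "aoj", "ins", "obj"]   # assumed left-to-right priority

def plan(node, relations):
    """relations: (rel, head, child) triples with head == node."""
    children = [(rel, child) for rel, head, child in relations if head == node]
    before = [(r, c) for r, c in children if SIDE.get(r, "Before") == "Before"]
    after  = [(r, c) for r, c in children if SIDE.get(r, "Before") == "After"]
    key = lambda rc: PRIORITY.index(rc[0])
    return ([c for _, c in sorted(before, key=key)] + [node]
            + [c for _, c in sorted(after, key=key)])

rels = [("agt", "contact", "you"),
        ("obj", "contact", "farmer"),
        ("pur", "contact", "this")]
print(plan("contact", rels))    # ['this', 'you', 'farmer', 'contact']
```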

Syntax Planning Algo

[Worked trace of the stack-based planner on the example (columns: Stack, Before-Current, After-Current, Output), starting from 'contact' and expanding farmer, region, taluka, Manchar; the emitted order is: this you Manchar region or Khatav taluka farmers contact]

[UNL graph of the example: contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam; scope :01]

All Together (UNL -> Hindi)

[Consolidated step-through, same as above: Lexeme Selection → Case Identification → Morphology Generation → Function Word Insertion → Syntax Linearization, ending in: इसके लिए आप मांचर क्षेत्र या खाटव तालुका के किसानों को संपर्क कीजिए | (this-for you Manchar region or Khatav taluka-of farmers contact)]

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
– Needs expertise

• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
– Works if you assume that the UNL generation was perfect

• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken – understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation?
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output (Fluency, Adequacy):

1. कक क करण आमा क0 2क पततिय यदिद झलस र & 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए — Fluency 2, Adequacy 3

2. परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए — Fluency 3, Adequacy 4

3. माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत — Fluency 1, Adequacy 1

4. 2णविTक सकरमाण स इसक0 2ड़O परभनित त & — Fluency 4, Adequacy 4

5. इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात & — Fluency 2, Adequacy 3

Results

                       | BLEU | Fluency | Adequacy
Geometric average      | 0.34 | 2.54    | 2.84
Arithmetic average     | 0.41 | 2.71    | 3.00
Standard deviation     | 0.25 | 0.89    | 0.89
Pearson cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson cor. (Fluency) | 0.59 | 1.00    | 0.68
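The Pearson correlations in the table are the standard product-moment correlation over per-sentence scores. A small sketch, with made-up score lists rather than the evaluation data:

```python
# Pearson product-moment correlation, as used to compare BLEU with the
# manual Fluency/Adequacy scores. The score lists are placeholders.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

fluency  = [4, 3, 1, 4, 2]     # per-sentence manual scores (placeholders)
adequacy = [4, 4, 1, 4, 3]
print(round(pearson(fluency, adequacy), 2))   # 0.91 for these toy scores
```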

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each Fluency level (1–4), broken down by Adequacy (1–4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


[Figure: Vauquois triangle — analysis levels (text, tagged text, morpho-syntactic, syntagmatic, syntactico-functional, logico-semantic, interlingual, deep understanding) against kinds of transfer (direct and semi-direct translation, surface and deep syntactic transfer, semantic transfer, multilevel transfer, conceptual transfer) and interlinguas (semantico-linguistic, ontological), with the corresponding structures (tagged text, C-structures, F-structures, SPA-structures)]

Sampark: IL–IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems
– Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
– POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
– Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil–Hindi, Telugu–Hindi, Marathi–Hindi, Bengali–Hindi, Tamil–Telugu, Urdu–Hindi, Kannada–Hindi, Punjabi–Hindi, Malayalam–Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary – bidirectional
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation scale (Grade: Description)

0: Nonsense (the sentence doesn't make any sense at all – it is like someone speaking to you in a language you don't know)

1: Some parts make sense, but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language – you understand those words but nothing more)

2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)

3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)

4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
– Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (0–1) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from the scores given for translation quality
– Accuracy: 65.32%

• Result analysis
– The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
– Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
वाच vaach 'read' → वाचणे vaachaNe 'reading'
उतर utara 'climb down' → उतरण utaraN 'downward slope'

Adjectives (Verb → Adjective):
चाव chav 'bite' → चावणारा chaavaNaara 'one who bites'
खा khaa 'eat' → खाल्लेले khallele 'something that is eaten'

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
पळ paL 'run' → पळताना paLataanaa 'while running'
बस bas 'sit' → बसून basun 'after sitting'

Kridanta Types (Type | Example | Gloss | Aspect)

Ne-kridanta ("णे"): vaachNyaasaaThee pustak de (Give me a book for reading) — for-reading book give — Perfective

laa-kridanta ("ला"): Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) — article after-reading will-tell — Perfective

Taanaa-kridanta ("ताना"): Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) — book while-reading it in-mind came — Durative

Lela-kridanta ("लेले"): kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) — yesterday read book give — Perfective

Un-kridanta ("ऊन"): pustak vaachun parat kar (Return the book after reading it) — book after-reading back do — Completive

Nara-kridanta ("णारा"): pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) — books to-the-one-who-reads knowledge gets — Stative

ve-kridanta ("वे"): he pustak pratyekaane vaachaave (Everyone should read this book) — this book everyone should-read — Inceptive

taa-kridanta ("ता"): to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) — he book while-reading to-sleep went — Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi 'lelaa' form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
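To give the morphotactics idea a concrete flavour, here is a toy finite-state-style segmenter for transliterated kridanta forms. The stem list and suffix inventory are illustrative; this is not the automaton in the figure.

```python
# Toy segmentation of Marathi kridanta forms (transliterated), in the
# spirit of a morphotactics FSM: a verb-stem state followed by a
# kridanta-suffix state. Inventories are illustrative only.

KRIDANTA_SUFFIXES = {
    "nyaasaathee": "Ne-kridanta",
    "lyaavar":     "laa-kridanta",
    "taanaa":      "taanaa-kridanta",
    "lele":        "lelaa-kridanta",
    "un":          "un-kridanta",
    "naaryaalaa":  "Nara-kridanta",
}
VERB_STEMS = {"vaach", "utar", "paL", "bas", "khaa"}

def analyse(word):
    """Return (stem, suffix, kridanta type) if word = stem + known suffix."""
    w = word.lower()
    for i in range(1, len(w)):
        stem, suffix = word[:i], w[i:]
        if stem in VERB_STEMS and suffix in KRIDANTA_SUFFIXES:
            return stem, suffix, KRIDANTA_SUFFIXES[suffix]
    return None

print(analyse("vaachtaanaa"))   # ('vaach', 'taanaa', 'taanaa-kridanta')
print(analyse("basun"))         # ('bas', 'un', 'un-kridanta')
```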

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall of kridanta processing (0.86–0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Complexity of handling long sentences

Problem As the length (number of words)

increases in the sentence the XLE parser fails to capture the dependency relations precisely

The UNL enconversion system is tightly coupled with XLE parser

Resulting in fall of accuracy

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Solution

bull Break long sentences to smaller onesndash Sentence can only be broken at the clausal

levels Simplificationndash Once broken they needs to be joined Merging

bull Reduce the reliance of the Encoverter on XLE parserndash The rules for relation and attribute generation

are formed on Stanford Dependency relations

Text Simplifier

bull It simplifies a complex or compound sentence

bull Example ndash Sentence John went to play and Jane went to

schoolndash Simplified John went to play Jane went to school

bull The new simplifier is developed on the clause-markings and the inter-clause relations provided by the Stanford dependency parser

UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module → Output

UNL Expression: obj(contact(…
Lexeme Selection: संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव
  (contact farmer this you region taluka Manchar Khatav)
Case Identification: संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव
  (contact farmer this you region taluka Manchar Khatav)
Morphology Generation: संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खटाव
  (contact-imperative farmer-pl this you region taluka Manchar Khatav)
Function Word Insertion: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव
  (contact-imperative farmers-ko this-for you region or taluka-of Manchar Khatav)
Linearization: इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।
  (this-for you Manchar region or Khatav taluka-of farmers contact-imperative)

[UNL graph of the example: agt(contact, you), obj(contact, farmer), pur(contact, this), plc(contact, region); region and taluka carry nam links to Manchar and Khatav, joined by or; scope :01]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VOA-COMM,VLTN,TMP,CJNCT,N-Vlink,Va)

[पच कवयलि7] "contact(icl>representative)" (N,ANIMT,FAUNA,MML,PRSN,Na)

Lexical choice is unambiguous.

[UNL graph annotated with selected lexemes: you → आप, this → यह, farmer → किसान, contact → संपर्क]

Case Marking

Rule table (Relation | Parent+ | Parent– | Child+ | Child–):
obj | V | VINT | N |
agt | V | past | N |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[UNL graph annotated with selected lexemes: you → आप, this → यह, farmer → किसान, contact → संपर्क]

Morphology Generation: Nouns

Suffix | Attribute values
uoM  | N, NUM=pl, oblique
U    | N, NUM=sg, oblique
I    | N, NIF=sg, oblique
iyoM | N, NIF=pl, oblique
oM   | N, NANOTCHF, pl, oblique

The boy saw me: लड़के ने मुझे देखा
Boys saw me: लड़कों ने मुझे देखा
The King saw me: राजा ने मुझे देखा
Kings saw me: राजाओं ने मुझे देखा

Verb Morphology

Suffix | Tense | Aspect | Mood | Num | Gen | Person | VE
-e rahaa thaa | past | progress | – | sg | male | 3rd | e
-taa hai | present | custom | – | sg | male | 3rd | –
-iyaa thaa | past | complete | – | sg | male | 3rd | I
saktii hain | present | – | ability | pl | female | 3rd | A
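A small sketch of how suffix tables like the two above can drive morphology generation: each rule pairs a set of required attribute values with a suffix, and the first rule whose conditions are all satisfied fires. The rule encoding and attribute names are illustrative, not the HinD morphology component itself.

```python
# Table-driven suffix selection (hypothetical rules modelled on the noun/verb
# suffix tables above; not the actual HinD generator).

NOUN_SUFFIX_RULES = [
    ({"N", "NUM=pl", "oblique"}, "oM"),    # e.g. raajaa -> raajaaoM
    ({"N", "NUM=sg", "oblique"}, "e"),
]

VERB_SUFFIX_RULES = [
    ({"tense=past", "aspect=progress", "num=sg", "gen=male", "per=3"}, "-e rahaa thaa"),
    ({"tense=present", "aspect=custom", "num=sg", "gen=male", "per=3"}, "-taa hai"),
    ({"tense=present", "mood=ability", "num=pl", "gen=female", "per=3"}, "saktii hain"),
]

def pick_suffix(attributes, rules):
    """Return the suffix of the first rule whose required attributes are all present."""
    for required, suffix in rules:
        if required <= attributes:
            return suffix
    return ""  # default: no inflection

print(pick_suffix({"N", "NUM=pl", "oblique", "ANIMT"}, NOUN_SUFFIX_RULES))   # oM
print(pick_suffix({"tense=past", "aspect=progress", "num=sg", "gen=male", "per=3"},
                  VERB_SUFFIX_RULES))                                        # -e rahaa thaa
```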

After Morphology Generation

[UNL graph before: you → आप, this → यह, farmer → किसान, contact → संपर्क]
[UNL graph after: you → आप, this → इस, farmer → किसानों, contact → संपर्क कीजिए]

Function Word Insertion

Before: संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खटाव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव

[UNL graph after insertion: you → आप, this → इसके लिए, farmer → किसानों को, contact → संपर्क कीजिए]

Rule table (Rel | Par+ | Par– | Chi+ | Chi– | Child FW):
obj | V | VINT | N, ANIMT | topic | को
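A sketch of the rule-table lookup suggested by the Rel/Par+/Par–/Chi+/Chi–/ChFW columns. The rule encoding, the second rule and the transliterated function words are assumptions for illustration, not the deconverter's own format.

```python
# Illustrative function-word insertion: rules are keyed on the UNL relation and
# constrained by required/forbidden properties of the parent and child nodes.
FW_RULES = [
    # (relation, parent_required, parent_forbidden, child_required, child_forbidden, function_word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, set(), "ko"),       # topic marker, as in the table row
    ("pur", set(), set(), set(), set(), "ke lie"),               # assumed rule, mirrors this -> "iske lie"
]

def insert_function_word(relation, parent_props, child_props):
    """Return the function word of the first rule whose constraints hold, else None."""
    for rel, p_req, p_forb, c_req, c_forb, fw in FW_RULES:
        if (rel == relation
                and p_req <= parent_props and not (p_forb & parent_props)
                and c_req <= child_props and not (c_forb & child_props)):
            return fw
    return None

# obj(contact, farmer): parent is a volitional verb, child an animate noun -> "ko"
print(insert_function_word("obj", {"V", "VOA"}, {"N", "ANIMT", "PRSN"}))   # ko
```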

Linearization

Linear order of nodes: this · you · Manchar · region · (or) · Khatav · taluka · farmers · contact
इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए

[UNL graph of the example with the final surface forms at each node]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local Ordering: the rest of the expression

[Sub-graphs around 'contact' illustrating local ordering of pur, agt and obj]

Pairwise ordering table (L = the row relation's relatum is placed to the left of the column relation's relatum):
    | agt | aoj | obj | ins
agt |     |  L  |  L  |  L
aoj |     |     |  L  |  L
obj |     |     |     |  R
ins |     |     |     |

Syntax Planning: Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topo-sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

Syntax Planning: Algo

Trace columns: Stack | Before Current | After Current | Output
[Stepwise trace on the example: the stack holds contact, farmer, taluka, region, manchar as the graph is descended, and the output accumulates this, you, manchar, region, taluka, farmers and finally contact]

[UNL graph of the full example]
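A compact sketch of the strategy just described: partition each node's relations using a pairwise ordering table, order the children that precede the head, and emit them recursively around the head. The data layout and the count-based ordering (a simple stand-in for the topological sort on the slide) are assumptions, not the HinD planner.

```python
# Illustrative syntax planning over a UNL sub-graph (hypothetical structures).
# "L" means the row relation's child is placed before the column relation's child.
ORDER = {("pur", "agt"): "L", ("pur", "obj"): "L", ("agt", "obj"): "L"}

def before(rel_a, rel_b):
    """True if rel_a's child should precede rel_b's child under the ORDER table."""
    if (rel_a, rel_b) in ORDER:
        return ORDER[(rel_a, rel_b)] == "L"
    if (rel_b, rel_a) in ORDER:
        return ORDER[(rel_b, rel_a)] != "L"
    return True  # default: keep the given order

def linearize(node, graph):
    """graph maps a head word to a list of (relation, child) pairs.
    In this example every child precedes its head (the after-set is empty)."""
    rels = graph.get(node, [])
    # Order children by how many siblings must precede them (proxy for topo sort).
    ordered = sorted(rels, key=lambda rc: sum(not before(rc[0], other[0]) for other in rels))
    words = []
    for _, child in ordered:
        words.extend(linearize(child, graph))
    words.append(node)
    return words

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(" ".join(linearize("contact", graph)))   # this you farmer contact
```

On the example sub-graph this reproduces the slide's final order "this you farmer contact".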

All Together (UNL → Hindi)

[The complete run of the example through every module (UNL expression obj(contact(… → Lexeme Selection → Case Identification → Morphology Generation → Function Word Insertion → Syntax Linearization), ending in: इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए, "For this, contact the farmers of Manchar region or Khatav taluka"]

How to Evaluate UNL Deconversion?

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

[Table: five sample Hindi outputs with their (Fluency, Adequacy) scores: (2, 3), (3, 4), (1, 1), (4, 4), (2, 3)]

Results

                        | BLEU | Fluency | Adequacy
Geometric average       | 0.34 | 2.54    | 2.84
Arithmetic average      | 0.41 | 2.71    | 3.00
Standard deviation      | 0.25 | 0.89    | 0.89
Pearson corr. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson corr. (Fluency) | 0.59 | 1.00    | 0.68
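For reference, correlations like those in the table come from per-sentence scores; the short sketch below computes a Pearson correlation from made-up fluency and BLEU values (the numbers are illustrative only, not the study's data).

```python
# Pearson correlation between two per-sentence score lists (illustrative values).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

fluency = [2, 3, 1, 4, 2]            # hypothetical manual fluency scores
bleu    = [0.31, 0.45, 0.12, 0.58, 0.35]  # hypothetical per-sentence BLEU
print(round(pearson(fluency, bleu), 2))
```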

Fluency vs Adequacy

[Stacked bar chart: number of sentences (0–200) for each Fluency score 1–4, broken down by Adequacy 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT: Marathi-Hindi

[Vauquois pyramid: levels of analysis and transfer, from direct translation at the graphemic level up through syntactic and semantic transfer to interlingua-based translation]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine; H2: morph analyser engine; H3: generator engine; H4: lexical disambiguation engine; H5: named entity engine; H6: dictionary standards; H7: corpora annotation standards; H8: evaluation of output (comprehensibility); H9: testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker; V2: morph analyzer; V3: generator; V4: named entity recognizer; V5: bilingual dictionary (bidirectional); V6: transfer grammar; V7: annotated corpus; V8: evaluation; V9: co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (0–1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
  वाच vaach (read) → वाचणे vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
  चाव chav (bite) → चावणारा chaavaNaara (one who bites)
  खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta type | Example | Aspect
'Ne'-kridanta (-णे) | vaachNyaasaaThee pustak de (for-reading book give: Give me a book for reading) | Perfective
'laa'-kridanta (-ल्या) | Lekh vaachalyaavar saaMgen (article after-reading will-tell: I will tell you after reading the article) | Perfective
'taanaa'-kridanta (-ताना) | Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came: I noticed it while reading the book) | Durative
'lela'-kridanta (-लेले) | kaal vaachlele pustak de (yesterday read book give: Give me the book that (I/you) read yesterday) | Perfective
'Un'-kridanta (-ऊन) | pustak vaachun parat kar (book after-reading back do: Return the book after reading it) | Completive
'Nara'-kridanta (-णारा) | pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets: The one who reads books gets knowledge) | Stative
've'-kridanta (-वे) | he pustak pratyekaane vaachaave (this book everyone should-read: Everyone should read this book) | Inceptive
'taa'-kridanta (-ता) | to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went: He fell asleep while reading a book) | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada
  muridiruwaa kombe jennu esee
  (broken branch (acc.) throw)
  'Throw away the broken branch'
  – similar to the 'lela' form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu
  ame padutunnappudoo nenoo panichesanoo
  (she while-singing I worked)
  'I worked while she was singing'
  – similar to the 'taanaa' form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish
  hazirlanmis plan
  (prepare-past plan)
  'The plan which has been prepared'
  – equivalent to the Marathi 'lelaa'

Morphological Processing of Kridanta Forms (cont.)

[Fig.: morphotactics FSM for kridanta processing]
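The FSM figure itself did not survive as text; the sketch below shows, under assumed states and transitions, how a morphotactics check of the verb-stem → kridanta-suffix path might be encoded. The suffix and stem inventories are taken loosely from the kridanta-type table above; everything else is illustrative, not the paper's FSM.

```python
# Illustrative morphotactics check for Marathi kridanta forms (assumed inventories;
# a two-step stem -> kridanta-suffix check standing in for the FSM in the figure).

KRIDANTA_SUFFIXES = ["NyaasaaThee", "lyaavar", "taanaa", "lele", "un", "NaaRyaalaa", "aave", "taa"]
VERB_STEMS = ["vaach", "utar", "chav", "khaa", "paL", "bas"]

def analyse(word):
    """Return (stem, kridanta_suffix) if the word decomposes along the FSM path, else None."""
    for stem in VERB_STEMS:
        if word.startswith(stem):
            rest = word[len(stem):]
            for suffix in KRIDANTA_SUFFIXES:
                if rest == suffix:
                    return stem, suffix
    return None

print(analyse("vaachtaanaa"))   # ('vaach', 'taanaa')
print(analyse("vaachlele"))     # ('vaach', 'lele')
```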

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall (0.86–0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-kridanta]

Summary of M-H transfer-based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

Summary: interlingua-based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools



UNL Merging

bull Merger takes multiple UNLs and merges them to form one UNL

bull Example-ndash Input

bull agt(goentry John) pur(goentry play) bull agt(goentry Mary) gol(goentry school)

ndash Outputbull agt(goentry John) pur(goentry play)bull and (go 01)bull agt01(goentry Mary) gol01(goentry school)

bull There are several cases based on how the sentence was simplified Merger uses rules for each of the cases to merge them

NLP tools and resources for UNL generation

Tools

bull Stanford Named Entity Recognizer(NER)bull Stanford Dependency Parserbull Word Sense Disambiguation(WSD) systembull Xerox Linguistic Environment(XLE) Parser

Resource

bull Wordnetbull Universal Word Dictionary(UW++)

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

• Semantic Independence: the relative word order of a UNL relation's relata does not depend on the semantic properties of the relata
• Context Independence: the relative word order of a UNL relation's relata does not depend on the rest of the expression
• Local Ordering: the relative word order of various relations sharing a relatum does not depend on the rest of the expression

[Figure: the pur, agt and obj relations around contact in the running example]

Pairwise ordering matrix (L = left of, R = right of):

       agt   aoj   obj   ins
agt     -     L     L     L
aoj           -     L     L
obj                 -     R
ins                       -

[Figure: the running example sub-graph: you, this and farmer linked to contact by agt, obj and pur]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {} (here every relation of contact is ordered before the head)
• Topo-sort each group: pur agt obj → this you farmer
• Final order: this you farmer contact

[Figure: the remaining sub-graph of the running example: region, khatav, manchar, taluka]

Syntax Planning Algo

[Table: step-by-step trace of the algorithm with columns Stack, Before Current, After Current and Output, showing how the nodes this, you, manchar, region, taluka, farmer and contact are emitted; the column layout was lost in extraction]
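Putting the strategy and the pairwise matrix together, a simplified single-level version of the syntax planner can be written as: split a head's relations into before/after groups, order the "before" group with the matrix, and emit the children, then the head, then the rest. The matrix rows for pur are assumptions added so the toy run reproduces the slide's final order; the stack-based trace above is replaced here by one call, for brevity.

```python
from functools import cmp_to_key

# Pairwise ordering of relations: "L" = first argument goes left of the second,
# "R" = it goes right. The agt/aoj/obj/ins entries follow the matrix above;
# the pur entries are assumed so the running example matches the slide.
PRIORITY = {
    ("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
    ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R",
    ("pur", "agt"): "L", ("pur", "obj"): "L",
}
BEFORE_HEAD = {"agt", "obj", "pur"}   # relations whose child precedes the head

def order_relations(rels):
    def cmp(a, b):
        if PRIORITY.get((a, b)) == "L" or PRIORITY.get((b, a)) == "R":
            return -1
        if PRIORITY.get((a, b)) == "R" or PRIORITY.get((b, a)) == "L":
            return 1
        return 0
    return sorted(rels, key=cmp_to_key(cmp))

def plan(head, children):
    """children: list of (relation, word); returns the planned word order."""
    before = [(r, w) for r, w in children if r in BEFORE_HEAD]
    after  = [(r, w) for r, w in children if r not in BEFORE_HEAD]
    order  = order_relations([r for r, _ in before])
    left   = [w for rel in order for r, w in before if r == rel]
    return left + [head] + [w for _, w in after]

print(plan("contact", [("obj", "farmer"), ("pur", "this"), ("agt", "you")]))
# -> ['this', 'you', 'farmer', 'contact']
```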


All Together (UNL → Hindi)

[This slide repeats the full module-by-module table from "Step-through the Deconverter" above (UNL expression → Lexeme Selection → Case Identification → Morphology Generation → Function Word Insertion → Syntax Linearization); the Devanagari in this copy was mis-encoded in extraction, so only the English glosses survive, from "contact farmer this you region taluka manchar khatav" through "this for you manchar region or khatav taluka of farmers contact".]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output (Fluency, Adequacy)

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए – Fluency 2, Adequacy 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए – Fluency 3, Adequacy 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत – Fluency 1, Adequacy 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp – Fluency 4, Adequacy 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp – Fluency 2, Adequacy 3

Results

                        BLEU    Fluency   Adequacy
Geometric Average       0.34    2.54      2.84
Arithmetic Average      0.41    2.71      3.00
Standard Deviation      0.25    0.89      0.89
Pearson Cor. (BLEU)     1.00    0.59      0.50
Pearson Cor. (Fluency)  0.59    1.00      0.68
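For reference, the Pearson correlations reported in the table are computed as below; the per-sentence scores in this example are made up, since the slides only report aggregates.

```python
from math import sqrt

def pearson(xs, ys):
    # standard Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy per-sentence scores, NOT the actual evaluation data
fluency  = [2, 3, 1, 4, 2]
adequacy = [3, 4, 1, 4, 3]
print(round(pearson(fluency, adequacy), 2))
```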

Fluency vs Adequacy

[Chart: number of sentences at each Fluency level (1 to 4), broken down by Adequacy level 1 to 4; y-axis "Number of sentences", 0 to 200]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level    Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semantic & predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels: Multilevel description

Semi-direct translation

• Consortium mode project

• Funded by DeitY

• 11 participating institutes

• Nine language pairs

• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages (POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.), as well as bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)

1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)

2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)

3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)

4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (scale 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
च vaach (read) → चण vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
च chav (bite) → चणर chaavaNaara (one who bites)
खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → बसना basun (after sitting)

Kridanta Types (Type: Example [gloss]; Aspect)

ण Ne-Kridanta: vaachNyaasaaThee pustak de (Give me a book for reading) [For reading book give]; Perfective

ल laa-Kridanta: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [Article after reading will tell]; Perfective

त Taanaa-Kridanta: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [Book while reading it in mind came]; Durative

लल Lela-Kridanta: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [Yesterday read book give]; Perfective

ऊ Un-Kridanta: pustak vaachun parat kar (Return the book after reading it) [Book after reading back do]; Completive

णर Nara-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [Books to the one who reads knowledge gets]; Stative

ve-Kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) [This book everyone should read]; Inceptive

त taa-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [He book while reading to sleep went]; Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
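As a rough illustration of what the morphotactics FSM encodes, the sketch below recognises a (transliterated) verb stem followed by one participial suffix and names the kridanta type. The stem and suffix lists are a tiny assumed subset; the actual FSM in the paper covers the full suffix inventory and the stem alternations.

```python
# Toy kridanta recogniser: verb stem + one participial suffix (transliterated).
KRIDANTA_SUFFIXES = {
    "Ne": "Ne-Kridanta", "lyaavar": "laa-Kridanta", "taanaa": "Taanaa-Kridanta",
    "lele": "Lela-Kridanta", "un": "Un-Kridanta", "Naaraa": "Nara-Kridanta",
}
VERB_STEMS = {"vaach", "paL", "bas", "utar"}

def analyse(word):
    """Return (stem, kridanta type) if word = known stem + known suffix."""
    for suffix, ktype in KRIDANTA_SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in VERB_STEMS:
                return stem, ktype
    return None

print(analyse("vaachtaanaa"))   # -> ('vaach', 'Taanaa-Kridanta')
print(analyse("vaachlele"))     # -> ('vaach', 'Lela-Kridanta')
```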

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall of kridanta processing (range 0.86 to 0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

System Processing Units

Syntactic Processing

bullNERbullStanford Dependency ParserbullXLE Parser

Semantic Processing

bullStems of wordsbullWSDbullNoun originating from verbsbullFeature GenerationbullNoun featuresbullVerb features

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic level

Direct translation

Syntactic transfer (surface)

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semantic & predicate-argument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels: multilevel description, semi-direct translation

• Consortium mode project

• Funded by DeiTY

• 11 participating institutes

• Nine language pairs

• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional).
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages: POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination
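A Sampark-style system chains horizontal engines (such as H1-H3) with per-language vertical resources (such as V5 and V6) into one pipeline per language pair. The sketch below shows only that composition pattern; the function names, signatures and identity stubs are assumptions for illustration, not Sampark's actual modules or APIs.

```python
# Sketch of how horizontal engines and per-language modules might be chained
# for one language pair (Marathi -> Hindi). Names and stubs are placeholders,
# not the actual Sampark APIs.

from typing import Callable, List

Stage = Callable[[str], str]

def build_pipeline(stages: List[Stage]) -> Stage:
    """Compose the stages left-to-right into a single callable."""
    def run(sentence: str) -> str:
        for stage in stages:
            sentence = stage(sentence)
        return sentence
    return run

# Placeholder stages (identity pass-throughs standing in for real engines).
def pos_tag_and_chunk(s: str) -> str: return s   # H1 POS tagging & chunking
def morph_analyse(s: str) -> str:     return s   # H2 morph analyser
def transfer(s: str) -> str:          return s   # V6 Marathi->Hindi transfer grammar
def lexical_transfer(s: str) -> str:  return s   # V5 bilingual dictionary lookup
def generate(s: str) -> str:          return s   # H3 target-side generator

mr_to_hi = build_pipeline([pos_tag_and_chunk, morph_analyse,
                           transfer, lexical_transfer, generate])
print(mr_to_hi("<Marathi input sentence>"))
```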

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know).

1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more).

2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said).

3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong).

4: Perfect (e.g. someone who knows the language).

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences.

(Bar chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules.)

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality:
  – Evaluated on 500 web sentences.
  – Accuracy calculated based on the score given according to the translation quality.
  – Accuracy: 65.32%.
• Result analysis:
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
  – Also, morph disambiguation, parsing, transfer grammar and FW (function word) disambiguation modules are required to improve accuracy.
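One plausible way to obtain a single accuracy figure such as 65.32% from per-sentence quality scores on the 0-4 comprehensibility scale above is to normalise the average score by the maximum; the slides do not spell out the exact formula, so the computation below is an assumption for illustration, and the grades in the example are toy values.

```python
def subjective_accuracy(grades, max_grade=4):
    """Average grade as a percentage of the maximum grade (assumed formula;
    the slides only report the final figure of 65.32%)."""
    return 100.0 * sum(grades) / (max_grade * len(grades))

# Toy grades for five sentences on the 0-4 comprehensibility scale.
print(round(subjective_accuracy([3, 2, 4, 2, 3]), 2))   # 70.0
```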

Important challenge of M-H Translation: Morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun):
• वाच vaach (read) → वाचणे vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• चाव chav (bite) → चावणारा chaavaNaara (one who bites)
• खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → बसून basun (after sitting)

Kridanta Types (kridanta type, example, aspect):

• Ne-Kridanta ("-णे"): vaachNyaasaaThee pustak de (Give me a book for reading) [for reading book give]. Aspect: Perfective.
• laa-Kridanta ("-ला"): Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after reading will tell]. Aspect: Perfective.
• Taanaa-Kridanta ("-ताना"): Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while reading it in mind came]. Aspect: Durative.
• Lela-Kridanta ("-लेले"): kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give]. Aspect: Perfective.
• Un-Kridanta ("-ऊन"): pustak vaachun parat kar (Return the book after reading it) [book after reading back do]. Aspect: Completive.
• Nara-Kridanta ("-णारा"): pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to the one who reads knowledge gets]. Aspect: Stative.
• ve-Kridanta ("-वे"): he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should read]. Aspect: Inceptive.
• taa-Kridanta ("-ता"): to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while reading to sleep went]. Aspect: Stative.

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
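The morphotactics FSM referenced in the figure can be approximated by a two-state automaton: consume a verb stem, then accept a single kridanta suffix whose aspect comes from the table above. The stem list, the romanised suffix spellings and the analyse() function below are simplified assumptions, not the analyser of Bhosale et al.

```python
# Simplified sketch of kridanta morphotactics: verb stem + one kridanta
# suffix -> participle, with the aspect taken from the table above.
# The stem list, suffix spellings and two-state automaton are illustrative
# assumptions; the analyser of Bhosale et al. (ICON 2011) is far richer.

KRIDANTA_ASPECT = {     # romanized suffix -> aspect
    "ne": "perfective", "laa": "perfective", "taanaa": "durative",
    "lele": "perfective", "un": "completive", "naaraa": "stative",
    "ve": "inceptive", "taa": "stative",
}
VERB_STEMS = {"vaach", "paL", "bas", "khaa", "utar", "chaav"}

def analyse(word):
    """Two-state FSM: consume a verb stem, then accept one kridanta suffix.
    Returns (stem, suffix, aspect) on success, None otherwise."""
    for stem in VERB_STEMS:                 # state 1: match a stem prefix
        if word.startswith(stem):
            rest = word[len(stem):].lower()
            if rest in KRIDANTA_ASPECT:     # state 2: match a suffix, accept
                return stem, rest, KRIDANTA_ASPECT[rest]
    return None

print(analyse("vaachtaanaa"))   # ('vaach', 'taanaa', 'durative')
print(analyse("basun"))         # ('bas', 'un', 'completive')
```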

Accuracy of Kridanta Processing Direct Evaluation

(Bar chart: precision and recall of Kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va Kridanta types; values lie between 0.86 and 0.98.)

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins.
• Relatively easier problem to solve.
• Will interlingua be better?
• Web sentences being used to test the performance.
• Rule governed.
• Needs high level of linguistic expertise.
• Will be an important contribution to IL MT.

130

131

Summary interlingua based MT

• English to Hindi.
• Rule governed.
• High level of linguistic expertise needed.
• Takes a long time to build (since 1996).
• But produces great insight, resources and tools.

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

SYNTACTIC PROCESSING

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Syntactic Processing NER

bull NER tags the named entities in the sentence with their types

bull Types coveredndash Personndash Placendash Organization

bull Input Will Smith was eating an apple with a fork on the bank of the river

bull Output ltPERSONgtWill SmithltPERSONgt was eating an apple with a fork on the bank of the river

Syntactic Processing Stanford Dependency Parser (12)

bull Stanford Dependency Parser parses the sentence and produces

ndash POS tags of words in the sentence

ndash Dependency parse of the sentence

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta Types

• Ne-Kridanta: vaachNyaasaaThee pustak de (Give me a book for reading) [gloss: for-reading book give]. Aspect: Perfective
• laa-Kridanta: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [gloss: article after-reading will-tell]. Aspect: Perfective
• Taanaa-Kridanta: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [gloss: book while-reading it in-mind came]. Aspect: Durative
• Lela-Kridanta: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [gloss: yesterday read book give]. Aspect: Perfective
• Un-Kridanta: pustak vaachun parat kar (Return the book after reading it) [gloss: book after-reading back do]. Aspect: Completive
• Nara-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [gloss: books to-the-one-who-reads knowledge gets]. Aspect: Stative
• ve-Kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) [gloss: this book everyone should-read]. Aspect: Inceptive
• taa-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [gloss: he book while-reading to-sleep went]. Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

- similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent Marathi form: -lelaa

Morphological Processing of Kridanta forms (cont)

[Figure: Morphotactics FSM for Kridanta Processing]
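The morphotactics FSM recognizes a verb stem followed by a kridanta suffix. The following is a minimal transliteration-based sketch of that idea, driven by the suffix-aspect pairs from the table above; the stem lexicon, the exact suffix spellings and the longest-match strategy are simplifying assumptions, and the actual analyzer works on Devanagari with full morphophonemic rules.

    # Toy kridanta analyzer: stem + suffix recognition via longest-suffix match.
    # Suffix spellings and the stem list are illustrative assumptions.

    KRIDANTA_SUFFIXES = {       # suffix -> (kridanta type, aspect), following the table above
        "aNe":      ("Ne-kridanta",     "perfective"),
        "alyaavar": ("laa-kridanta",    "perfective"),
        "taanaa":   ("taanaa-kridanta", "durative"),
        "lele":     ("lelaa-kridanta",  "perfective"),
        "un":       ("un-kridanta",     "completive"),
        "aNaaraa":  ("Naaraa-kridanta", "stative"),
        "aave":     ("ve-kridanta",     "inceptive"),
        "taa":      ("taa-kridanta",    "stative"),
    }

    VERB_STEMS = {"vaach", "paL", "bas", "khaa", "utar", "chaav"}   # toy stem lexicon

    def analyse_kridanta(word):
        # try longer suffixes first so that e.g. 'taanaa' wins over 'taa'
        for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and word[:-len(suffix)] in VERB_STEMS:
                ktype, aspect = KRIDANTA_SUFFIXES[suffix]
                return {"stem": word[:-len(suffix)], "suffix": suffix,
                        "type": ktype, "aspect": aspect}
        return None              # not recognized as stem + kridanta suffix

    print(analyse_kridanta("vaachtaanaa"))   # vaach + taanaa : durative
    print(analyse_kridanta("basun"))         # bas + un : completive
    print(analyse_kridanta("chaavaNaaraa"))  # chaav + aNaaraa : stative

A full finite-state implementation would also encode which suffixes can follow which stem classes and feed the resulting analysis to the transfer module, as the figure indicates.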

Accuracy of Kridanta Processing: Direct Evaluation

[Figure: precision and recall for each kridanta type (Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta); y-axis range 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (1/2)
  • Knowledge base of KBMT (2/2)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (1/3)
  • Typology motion verbs (2/3)
  • Typology motion verbs (3/3)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (1/2)
  • Syntactic Processing Stanford Dependency Parser (2/2)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (1/2)
  • Relation Generation (2/2)
  • Attribute Generation (1/2)
  • Attribute Generation (2/2)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (1/3)
  • Case Study (2/3)
  • Case Study (3/3)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Syntactic Processing Stanford Dependency Parser (22)

Will-Smith was eating an apple with a fork on the bank of the river

Will-SmithNNP wasVBD eatingVBG anDT appleNN withIN aDT forkNN onIN theDT bankNN ofIN theDT riverNN

POS tags

nsubj(eating-3 Will-Smith-1)aux(eating-3 was-2)det(apple-5 an-4)dobj(eating-3 apple-5)prep(eating-3 with-6)det(fork-8 a-7)pobj(with-6 fork-8)prep(eating-3 on-9)det(bank-11 the-10)pobj(on-9 bank-11)prep(bank-11 of-12)det(river-14 the-13)pobj(of-12 river-14)

Dependency Relations

Input

Syntactic Processing XLE Parser

bull XLE parser generates dependency relations and some other important information

lsquoOBJ(eat8apple16)SUBJ(eat8will-smith0)ADJUNCT(apple16on20)lsquoOBJ(on20bank22)ADJUNCT(apple16with17)OBJ(with17fork19)ADJUNCT(bank22of23)lsquoOBJ(of23river25)

Dependency Relations

PASSIVE(eat8-)PROG(eat8+lsquo)TENSE(eat8past)VTYPE(eat8main)

ImportantInformation

Will-Smith was eating an apple with a fork on the bank of the river

Input

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Syntactic Processing XLE Parser

Input: Will-Smith was eating an apple with a fork on the bank of the river

• The XLE parser generates dependency relations and some other important information.

Dependency Relations:
OBJ(eat-8, apple-16), SUBJ(eat-8, will-smith-0), ADJUNCT(apple-16, on-20), OBJ(on-20, bank-22), ADJUNCT(apple-16, with-17), OBJ(with-17, fork-19), ADJUNCT(bank-22, of-23), OBJ(of-23, river-25)

Important Information:
PASSIVE(eat-8, -), PROG(eat-8, +), TENSE(eat-8, past), VTYPE(eat-8, main)

SEMANTIC PROCESSING

Semantic Processing Finding stems

• WordNet is used for finding the stems of the words.

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word    Stem
was     be
eating  eat
apple   apple
fork    fork
bank    bank
river   river
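To make the stemming step concrete, here is a minimal sketch assuming NLTK's WordNet interface (the slides do not name a specific toolkit); wn.morphy applies WordNet's morphological rules and exception lists, which is enough to reproduce the table above.

from nltk.corpus import wordnet as wn

def wordnet_stem(word, pos):
    # morphy returns the WordNet base form, e.g. 'eating'/VERB -> 'eat', 'was'/VERB -> 'be'
    base = wn.morphy(word.lower(), pos)
    return base if base is not None else word

for word, pos in [("was", wn.VERB), ("eating", wn.VERB), ("apple", wn.NOUN),
                  ("fork", wn.NOUN), ("bank", wn.NOUN), ("river", wn.NOUN)]:
    print(word, "->", wordnet_stem(word, pos))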

Semantic Processing WSD

Input to WSD (word, POS):

Word        POS
Will-Smith  Proper noun
was         Verb
eating      Verb
an          Determiner
apple       Noun
with        Preposition
a           Determiner
fork        Noun
on          Preposition
the         Determiner
bank        Noun
of          Preposition
the         Determiner
river       Noun

Output of WSD (word, synset-id, sense-id):

Word        Synset-id   Sense-id
Will-Smith  -           -
was         02610777    be24203
eating      01170802    eat23400
an          -           -
apple       07755101    apple11300
with        -           -
a           -           -
fork        03388794    fork10600
on          -           -
the         -           -
bank        09236472    bank11701
of          -           -
the         -           -
river       09434308    river11700
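The synset-ids above look like 8-digit WordNet offsets and the sense-ids like WordNet sense numbering (lemma, POS, lexicographer file, lexical id). A small sketch of how such identifiers can be read off with NLTK; the exact numbers depend on the WordNet version, so they may differ from the slide.

from nltk.corpus import wordnet as wn

syn = wn.synset('bank.n.01')        # the "sloping land beside a body of water" sense
print("%08d" % syn.offset())        # 8-digit synset offset, cf. the Synset-id column
print(syn.lexname())                # lexicographer file, e.g. noun.object
for lemma in syn.lemmas():
    print(lemma.key())              # sense keys such as bank%1:17:01::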

Semantic Processing Universal Word Generation

• The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++).

Input: Will-Smith was eating an apple with a fork on the bank of the river

Word        Universal Word
Will-Smith  Will-Smith(iof>PERSON)
eating      eat(icl>consume, equ>consumption)
apple       apple(icl>edible fruit>thing)
fork        fork(icl>cutlery>thing)
bank        bank(icl>slope>thing)
river       river(icl>stream>thing)
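The UW++ dictionary itself is not reproduced here; the sketch below only illustrates the kind of sense-id-keyed lookup involved, with a hypothetical in-memory dictionary and a hypothetical helper for named entities.

# UW_DICT is a stand-in for the UW++ dictionary, keyed by disambiguated sense-id.
UW_DICT = {
    "eat23400":   "eat(icl>consume, equ>consumption)",
    "apple11300": "apple(icl>edible fruit>thing)",
    "fork10600":  "fork(icl>cutlery>thing)",
    "bank11701":  "bank(icl>slope>thing)",
    "river11700": "river(icl>stream>thing)",
}

def to_universal_word(word, sense_id=None, ne_type=None):
    if ne_type == "PERSON":                 # named entities bypass the dictionary
        return "%s(iof>PERSON)" % word
    return UW_DICT.get(sense_id, word)      # fall back to the bare headword

print(to_universal_word("Will-Smith", ne_type="PERSON"))
print(to_universal_word("bank", "bank11701"))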

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs.
• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of any relation between noun1 and noun2.

Algorithm:
• Traverse the hypernym path of the noun to its root.
• If any word on the path is "action", then the noun originates from a verb.

Examples:
removal of garbage = removing garbage
collection of stamps = collecting stamps
wastage of food = wasting food
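A sketch of the hypernym-path test just described, assuming NLTK's WordNet; a noun sense counts as verb-derived if an act/action synset lies on some hypernym path to the root.

from nltk.corpus import wordnet as wn

def is_deverbal(noun_synset):
    # True if some hypernym on a path to the root is an act/action node.
    for path in noun_synset.hypernym_paths():
        for s in path:
            if s.name().split('.')[0] in ('act', 'action', 'activity'):
                return True
    return False

for lemma in ('removal', 'collection', 'wastage', 'river'):
    senses = wn.synsets(lemma, pos=wn.NOUN)
    print(lemma, any(is_deverbal(s) for s in senses))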

Features of UWs

• Each UNL relation requires certain properties of its UWs to be satisfied.
• We capture these properties as features.
• We classify the features into Noun and Verb features.

Relation        Word  Property
agt(uw1, uw2)   uw1   Volitional verb
met(uw1, uw2)   uw2   Abstract thing
dur(uw1, uw2)   uw2   Time period
ins(uw1, uw2)   uw2   Concrete thing

Semantic Processing Noun Features

Algorithm:
• Add the word's NE type to the word's features.
• Traverse the hypernym path of the noun to its root.
• If any word on the path matches a keyword, the corresponding feature is generated.

Noun        Feature
Will-Smith  PERSON, ANIMATE
bank        PLACE
fork        CONCRETE
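A sketch of this feature-assignment step, assuming NLTK's WordNet and a small hand-made keyword-to-feature map; the keyword list actually used in the system is not given on the slide, so the entries below are illustrative.

from nltk.corpus import wordnet as wn

# Hypothetical keyword -> feature map (illustrative, not the system's real list).
KEYWORD_FEATURES = {
    'person': 'ANIMATE',
    'animal': 'ANIMATE',
    'location': 'PLACE',
    'geological_formation': 'PLACE',
    'physical_entity': 'CONCRETE',
    'abstraction': 'ABSTRACT',
}

def noun_features(synset, ne_type=None):
    feats = set()
    if ne_type:                                  # e.g. NER labels Will-Smith as PERSON
        feats.update({ne_type, 'ANIMATE'} if ne_type == 'PERSON' else {ne_type})
    for path in synset.hypernym_paths():
        for s in path:
            key = s.name().split('.')[0]
            if key in KEYWORD_FEATURES:
                feats.add(KEYWORD_FEATURES[key])
    return feats

print(noun_features(wn.synset('fork.n.01')))    # physical_entity on the path -> CONCRETE
print(noun_features(wn.synset('bank.n.01')))    # geological_formation on the path -> PLACE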

Types of verbs

• Verb features are based on the types of verbs.
• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that happen involuntarily, or just happen without any performer
  – Be: verbs that tell about the existence of something or about some fact
• We classify these three into "do" and "not-do".

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is of "do" type or not.
• "do" verbs are volitional verbs, which are performed voluntarily.
• Heuristic: if a frame is of the form "Something/somebody ...s something/somebody", then the verb is of "do" type.
• This heuristic fails to mark some volitional verbs as "do", but it marks all non-volitional verbs as "not-do".

Word    Type     Feature
was     not-do   -
eating  do       do
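A sketch of the frame heuristic, assuming NLTK's WordNet; the frame strings attached to verb lemmas stand in for the "Something/somebody ...s something/somebody" pattern above, and a verb is tagged "do" if any of its frames has both a subject and an object slot.

from nltk.corpus import wordnet as wn

def is_do_verb(verb_synset):
    for lemma in verb_synset.lemmas():
        for frame in lemma.frame_strings():
            f = frame.lower()
            if f.startswith(('somebody', 'something')) and f.endswith(('somebody', 'something')):
                return True                  # transitive-looking frame -> volitional "do"
    return False

print(is_do_verb(wn.synset('eat.v.01')))     # should be True if a transitive frame exists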

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately.
• Similar relations are handled in clusters.
• Rule-based approach.
• Rules are applied on:
  – Stanford Dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Relation generated:
• nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active → agt(eat, Will-Smith)
• dobj(eating, apple); eat is active → obj(eat, apple)
• prep(eating, with), pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
• prep(eating, on), pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
• prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)
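A compressed sketch of how such rules can be applied, using hypothetical Python data structures for the dependency parse and the features (collapsed prep_* relations are assumed for brevity; the system's actual rule format is not shown on the slide).

def generate_relations(deps, feat, verb_type, voice):
    unl = []
    for rel, head, dep in deps:
        if (rel == 'nsubj' and 'ANIMATE' in feat.get(dep, ())
                and verb_type.get(head) == 'do' and voice.get(head) == 'active'):
            unl.append(('agt', head, dep))
        elif rel == 'dobj' and voice.get(head) == 'active':
            unl.append(('obj', head, dep))
        elif rel == 'prep_with' and 'CONCRETE' in feat.get(dep, ()):
            unl.append(('ins', head, dep))
        elif rel == 'prep_on' and 'PLACE' in feat.get(dep, ()):
            unl.append(('plc', head, dep))
        elif rel == 'prep_of' and 'PERSON' not in feat.get(dep, ()):
            unl.append(('mod', head, dep))
    return unl

deps = [('nsubj', 'eat', 'Will-Smith'), ('dobj', 'eat', 'apple'),
        ('prep_with', 'eat', 'fork'), ('prep_on', 'eat', 'bank'),
        ('prep_of', 'bank', 'river')]
feat = {'Will-Smith': {'ANIMATE'}, 'fork': {'CONCRETE'}, 'bank': {'PLACE'}, 'river': set()}
print(generate_relations(deps, feat, {'eat': 'do'}, {'eat': 'active'}))
# [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('ins', 'eat', 'fork'),
#  ('plc', 'eat', 'bank'), ('mod', 'bank', 'river')]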

Attribute Generation (1/2)

• Attributes are handled in clusters.
• Attributes are generated by rules.
• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river

Conditions → Attributes generated:
• eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past
• det(apple, an) → apple.@indef
• det(fork, a) → fork.@indef
• det(bank, the) → bank.@def
• det(river, the) → river.@def
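A matching sketch for attributes over the same hypothetical structures; determiners give @def/@indef and the XLE tense and aspect facts for the main verb give @entry, @progress and @past (the "@" spelling of UNL attributes is assumed).

def generate_attributes(determiners, xle_facts, main_verb):
    attrs = {}
    for noun, det in determiners:                 # e.g. det(apple, an)
        attrs.setdefault(noun, set()).add('@indef' if det in ('a', 'an') else '@def')
    v = attrs.setdefault(main_verb, set())
    v.add('@entry')                               # verb of the main clause
    if xle_facts.get('PROG') == '+':
        v.add('@progress')
    if xle_facts.get('TENSE') == 'past':
        v.add('@past')
    return attrs

dets = [('apple', 'an'), ('fork', 'a'), ('bank', 'the'), ('river', 'the')]
print(generate_attributes(dets, {'PROG': '+', 'TENSE': 'past'}, 'eat'))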

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger.
• Scenario 4 shows the impact of the XLE parser.

Results Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if:
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = attributes matched in UW_gen and UW_gold
• count(UW_x) = attributes in UW_x
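The matching rule above can be turned into standard set-overlap scores; the precision/recall/F definitions in this sketch are assumed (they are not spelled out on the slide), only the triple-equality test follows it.

def relation_scores(generated, gold):
    generated, gold = set(generated), set(gold)
    true_pos = len(generated & gold)        # triples whose relation, UW1 and UW2 all match
    precision = true_pos / len(generated) if generated else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score

gen = [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('mod', 'bank', 'river')]
gld = [('agt', 'eat', 'Will-Smith'), ('obj', 'eat', 'apple'), ('plc', 'eat', 'bank')]
print(relation_scores(gen, gld))            # (0.667, 0.667, 0.667) up to rounding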

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy for Scenario 1, 2, 3 and 4; y-axis: accuracy, 0 to 0.5]

Results Time-taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus.

[Chart: time taken for Scenario 1, 2, 3 and 4; y-axis 0 to 40]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: false positives by relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall)]

Missed Relations

[Chart: false negatives by relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall)]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• "James loves eating red water-melon, apples peeled nicely and bananas"

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• "George the president of the football club, James the coach of the club and the players went to watch their rivals' game"

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua
• More general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibilities

"The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms."

Corpus relations:     agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations:  aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module: Output

UNL Expression: obj(contact(…
Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)
Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)
Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact-imperative farmer-pl this you region taluka manchar khatav)
Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this-for you region or taluka-of Manchar Khatav)
Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this-for you Manchar region or Khatav taluka-of farmers contact)

[Figure: the UNL graph of the example; nodes: contact, you, this, farmer, region, taluka, manchar, khatav; relations: agt, obj, pur, plc, nam, or; scope :01]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCT, N-Vlink, Va)
[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.

[Figure: partial UNL graph; contact (सपक+) with agt → you (आप), obj → farmer (निकस), pur → this (य)]

Case Marking

• Depends on the UNL relation and the properties of the nodes.
• Case gets transferred from the head to the modifiers.

Relation  Parent+  Parent-  Child+  Child-
Obj       V        VINT     N
Agt       V        past     N

[Figure: the partial UNL graph (contact with agt → you, obj → farmer, pur → this) with case assignments]

Morphology Generation Nouns

Suffix  Attribute values
uoM     N, NUM, pl, oblique
U       N, NUM, sg, oblique
I       N, NIF, sg, oblique
iyoM    N, NIF, pl, oblique
oM      N, NANOTCHF, pl, oblique

Examples:
The boy saw me    – लड़क माझ दखा
Boys saw me       – लड़क माझ दखा
The King saw me   – र2 माझ दखा
Kings saw me      – र2ऒ माझ दखा

Verb Morphology

Suffix         Tense    Aspect    Mood     N   Gen     P    VE
-e rahaa thaa  past     progress  -        sg  male    3rd  e
-taa hai       present  custom    -        sg  male    3rd  -
-iyaa thaa     past     complete  -        sg  male    3rd  I
saktii hain    present  -         ability  pl  female  3rd  A
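A sketch of the verb-suffix table as a lookup keyed on tense, aspect, mood and agreement; the suffixes are given in transliteration and the key layout is an assumption based on the columns above.

# (tense, aspect, mood, number, gender, person) -> suffix (transliterated)
VERB_SUFFIX = {
    ('past',    'progress', None,      'sg', 'male',   '3rd'): '-e rahaa thaa',
    ('present', 'custom',   None,      'sg', 'male',   '3rd'): '-taa hai',
    ('past',    'complete', None,      'sg', 'male',   '3rd'): '-iyaa thaa',
    ('present', None,       'ability', 'pl', 'female', '3rd'): 'saktii hain',
}

def verb_suffix(tense, aspect, mood, number, gender, person):
    return VERB_SUFFIX.get((tense, aspect, mood, number, gender, person))

print(verb_suffix('past', 'progress', None, 'sg', 'male', '3rd'))   # -> '-e rahaa thaa'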

After Morphology Generation

[Figure: the UNL sub-graph before and after morphology generation. Before: contact (सपक+) with agt → you (आप), obj → farmer (निकस), pur → this (य). After: contact (सपक+ क0जि2ए) with agt → you (आप), obj → farmer (निकस4), pur → this (इस)]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

Rel  Par+  Par-  Chi+      Chi-   ChFW
Obj  V     VINT  N, ANIMT  topic  क

[Figure: the sub-graph after function word insertion; contact (सपक+ क0जि2ए) with agt → you (आप), obj → farmer (निकस4 क), pur → this (इसक लिलए)]

Linearization

इस आप माचर कषतर खाट तलक निकस4 सपक+
(this you Manchar region Khatav taluka farmers contact)

[Figure: the full UNL graph with the linearized order; nodes: contact (सपक+ क0जि2ए), you, this, farmer, region, taluka, manchar, khatav; relations: agt, obj, pur, plc, nam, or; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local Ordering: the rest of the expression

Pairwise precedence of relations sharing a relatum (L = placed left, R = placed right):

      agt  aoj  obj  ins
Agt        L    L    L
Aoj             L    L
Obj                  R
Ins

[Figure: example sub-graphs of the UNL expression (contact with pur → this; contact with agt → you, obj → farmer, pur → this)]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}.
• Topologically sort each group: pur, agt, obj → this, you, farmer.
• Final order: this you farmer contact.

[Figure: the example graph (contact with obj → farmer, pur → this, agt → you; region, khatav, manchar and taluka below farmer)]

Syntax Planning Algo

[Table: a worked trace of the algorithm over the example, with columns Stack, Before-Current, After-Current and Output; the stack evolves over contact, farmer, region, taluka and manchar while the output grows step by step to the final linearized order shown on the Linearization slide (not fully recoverable from this transcript)]

[Figure: the full UNL graph of the example (contact, you, this, farmer, region, taluka, manchar, khatav)]
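A sketch of the planning strategy in code: partition a node's relations into before/after groups, order each group with a precedence table, and recurse. The precedence values here are illustrative (the slide only gives a partial agt/aoj/obj/ins matrix), so the exact output order is an assumption.

# BEFORE holds relations whose child is placed to the left of its head in Hindi.
BEFORE = {'agt', 'obj', 'pur', 'plc', 'ins', 'aoj', 'mod'}
PRIORITY = {'pur': 0, 'plc': 1, 'agt': 2, 'aoj': 3, 'obj': 4, 'ins': 5, 'mod': 6}

def linearize(node, graph):
    """graph: head -> list of (relation, child); returns the word order as a list."""
    children = graph.get(node, [])
    before = sorted((c for c in children if c[0] in BEFORE),
                    key=lambda rc: PRIORITY.get(rc[0], 9))
    after = [c for c in children if c[0] not in BEFORE]
    order = []
    for _, child in before:
        order += linearize(child, graph)
    order.append(node)
    for _, child in after:
        order += linearize(child, graph)
    return order

g = {'contact': [('obj', 'farmer'), ('pur', 'this'), ('agt', 'you')],
     'farmer': [('plc', 'region')], 'region': [('mod', 'taluka')]}
print(linearize('contact', g))
# -> ['this', 'you', 'taluka', 'region', 'farmer', 'contact'] with this toy precedence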

All Together (UNL → Hindi)

[Table: the same module-by-module step-through as in "Step-through the Deconverter" above (UNL expression → Lexeme Selection → Case Identification → Morphology Generation → Function Word Insertion → Syntax Linearization); the Devanagari in this copy of the table is not recoverable from the transcript, and the English glosses are identical to those in the step-through table]

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output (Fluency, Adequacy):
• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (2, 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (3, 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (1, 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (4, 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (2, 3)

Results

                        BLEU   Fluency  Adequacy
Geometric Average       0.34   2.54     2.84
Arithmetic Average      0.41   2.71     3.00
Standard Deviation      0.25   0.89     0.89
Pearson Cor. (BLEU)     1.00   0.59     0.50
Pearson Cor. (Fluency)  0.59   1.00     0.68
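The kind of correlation reported above can be computed with numpy over per-sentence (BLEU, fluency, adequacy) scores; the numbers in this sketch are made up for illustration only.

import numpy as np

bleu     = np.array([0.12, 0.40, 0.55, 0.33, 0.70])
fluency  = np.array([1,    3,    4,    2,    4   ])
adequacy = np.array([2,    3,    4,    3,    4   ])

print(np.corrcoef(bleu, fluency)[0, 1])      # Pearson correlation, BLEU vs fluency
print(np.corrcoef(fluency, adequacy)[0, 1])  # Pearson correlation, fluency vs adequacy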

Fluency vs Adequacy

[Chart: number of sentences by Fluency score (1-4), broken down by Adequacy (1-4); y-axis 0 to 200]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


[Figure: the Vauquois diagram of MT approaches (direct translation, syntactic transfer (surface/deep), semantic transfer, multilevel transfer, interlingua), shown again to situate transfer-based MT]

Sampark IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional).
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

• H1: POS tagging & chunking engine
• H2: Morph analyser engine
• H3: Generator engine
• H4: Lexical disambiguation engine
• H5: Named entity engine
• H6: Dictionary standards
• H7: Corpora annotation standards
• H8: Evaluation of output (comprehensibility)
• H9: Testing & integration

Vertical Tasks for Each Language

• V1: POS tagger & chunker
• V2: Morph analyzer
• V3: Generator
• V4: Named entity recognizer
• V5: Bilingual dictionary (bidirectional)
• V6: Transfer grammar
• V7: Annotated corpus
• V8: Evaluation
• V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade  Description
0      Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1      Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2      Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3      Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4      Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator; y-axis 0 to 1.2]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated based on scores given according to the translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Also, morph disambiguation, parsing, transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
• च vaach (read) → चण vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• च chav (bite) → चणर chaavaNaara (one who bites)
• खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → ब सना basun (after sitting)

Kridanta Types (type / example / aspect)

• "Ne"-Kridanta (ण): vaachNyaasaaThee pustak de (gloss: for-reading book give) = "Give me a book for reading". Aspect: Perfective
• "laa"-Kridanta (ल): Lekh vaachalyaavar saaMgen (gloss: article after-reading will-tell) = "I will tell you that after reading the article". Aspect: Perfective
• "Taanaa"-Kridanta (त): Pustak vaachtaanaa te lakShaat aale (gloss: book while-reading it in-mind came) = "I noticed it while reading the book". Aspect: Durative
• "Lela"-Kridanta (लल): kaal vaachlele pustak de (gloss: yesterday read book give) = "Give me the book that (I/you) read yesterday". Aspect: Perfective
• "Un"-Kridanta (ऊ): pustak vaachun parat kar (gloss: book after-reading back do) = "Return the book after reading it". Aspect: Completive
• "Nara"-Kridanta (णर): pustake vaachNaaRyaalaa dnyaan miLte (gloss: books to-the-one-who-reads knowledge gets) = "The one who reads books gets knowledge". Aspect: Stative
• "ve"-Kridanta: he pustak pratyekaane vaachaave (gloss: this book everyone should-read) = "Everyone should read this book". Aspect: Inceptive
• "taa"-Kridanta (त): to pustak vaachtaa vaachtaa zopee gelaa (gloss: he book while-reading to-sleep went) = "He fell asleep while reading a book". Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

- similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent Marathi form: lelaa

Morphological Processing of Kridanta forms (cont.)

[Fig: Morphotactics FSM for Kridanta Processing]
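A toy sketch of the idea behind the morphotactics FSM: strip a kridanta suffix and check that what remains is a known verb stem. The suffix list is a rough transliteration of the kridanta types listed earlier and the stem list is only an example; neither is the FSM actually used in the paper.

# Rough transliterated suffixes derived from the kridanta-type examples above.
KRIDANTA_SUFFIXES = ['Nyaasaathee', 'lyaavar', 'taanaa', 'Naara', 'lele', 'ave', 'taa', 'Ne', 'un']
VERB_STEMS = {'vaach', 'paL', 'bas', 'utar', 'khaa', 'chaav'}

def split_kridanta(wordform):
    """Return (stem, suffix) if the form looks like verb-stem + kridanta suffix."""
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if wordform.endswith(suffix) and wordform[:-len(suffix)] in VERB_STEMS:
            return wordform[:-len(suffix)], suffix
    return None

print(split_kridanta('vaachtaanaa'))   # -> ('vaach', 'taanaa')
print(split_kridanta('paLun'))         # -> ('paL', 'un')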

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta; y-axis 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

SEMANTIC PROCESSING

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Semantic Processing Finding stems

bull Wordnet is used for finding the stems of the words

Word Stemwas be

eating eat

apple apple

fork fork

bank bank

river river

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing WSDWord Synset-id Sense-id

Will-Smith - -

was 02610777 be24203

eating 01170802 eat23400

an - -

apple 07755101 apple11300

with - -

a - -

fork 03388794 fork10600

on - -

the - -

bank 09236472 bank11701

of - -

the - -

river 09434308 river11700

WSD

Word POS

Will-Smith

Proper noun

was Verb

eating Verb

an Determiner

apple Noun

with Preposition

a Determiner

fork Noun

on Preposition

the Determiner

bank Noun

of Preposition

the Determiner

river Noun

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (1/3)

• Conjunctive relations – wrong apposition detection
• Example: "James loves eating red water-melon, apples peeled nicely, and bananas"

  amod(water-melon-5, red-4)
  appos(water-melon-5, apples-7)
  partmod(apples-7, peeled-8)
  advmod(peeled-8, nicely-9)
  cc(nicely-9, and-10)
  conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations – non-uniform relation generation
• Example: "George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game"

  appos(George-1, president-4)
  conj(club-8, James-10)
  conj(club-8, coach-13)
  cc(club-8, and-17)
  conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: the rules are more general
  – agt–aoj: non-uniformity in the corpus
  – gol–obj: multiple relation-generation possibilities

Input: The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations:    agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

UNL expression: obj(contact(…

Module → Output (gloss)
• Lexeme Selection → संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
  (contact, farmer, this, you, region, taluka, Manchar, Khatav)
• Case Identification → संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
  (contact, farmer, this, you, region, taluka, Manchar, Khatav)
• Morphology Generation → संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
  (contact-imperative, farmer-pl, this, you, region, taluka, Manchar, Khatav)
• Function Word Insertion → संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
  (contact-imperative, farmers-ko, this-for, you, region, or, taluka-of, Manchar, Khatav)
• Linearization → इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए ।
  (this-for, you, Manchar, region, or, Khatav, taluka-of, farmers-ko, contact-imperative)

[Diagram: UNL graph of the example — nodes contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)
[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous
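The two entries above follow the pattern [headword] "UW(restrictions)" (attribute list). Below is a small parsing sketch under that assumption; the actual HinD dictionary syntax may differ.

```python
# Sketch of parsing Universal Word dictionary entries of the assumed form
#   [headword] "UW(restrictions)" (ATTR1,ATTR2,...)

import re

ENTRY_RE = re.compile(r'\[(?P<head>[^\]]+)\]\s*"(?P<uw>[^"]+)"\s*\((?P<attrs>[^)]*)\)')

def parse_entry(line):
    m = ENTRY_RE.match(line)
    if not m:
        return None
    return {
        "headword": m.group("head"),
        "uw": m.group("uw"),
        "attributes": [a.strip() for a in m.group("attrs").split(",") if a.strip()],
    }

entry = '[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (V,VOA,VOA-ACT,VLTN)'
print(parse_entry(entry))
```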

[Diagram: UNL graph with selected lexemes — contact→संपर्क, you→आप, this→यह, farmer→किसान]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation | Parent+ | Parent– | Child+ | Child–
obj      | V       | VINT    | N      |
agt      | V       | past    | N      |

[Diagram: UNL graph with selected lexemes, as above]

Morphology Generation: Nouns

Suffix | Attribute values
-uoM   | N, NUM pl, oblique
-U     | N, NUM sg, oblique
-I     | N, NIF, sg, oblique
-iyoM  | N, NIF, pl, oblique
-oM    | N, NANOTCHF, pl, oblique

Examples:
• The boy saw me  – लड़के ने मुझे देखा
• Boys saw me     – लड़कों ने मुझे देखा
• The King saw me – राजा ने मुझे देखा
• Kings saw me    – राजाओं ने मुझे देखा

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | Num | Gen    | Per | VE
-e rahaa thaa | past    | progress | –       | sg  | male   | 3rd | e
-taa hai      | present | custom   | –       | sg  | male   | 3rd | –
-iyaa thaa    | past    | complete | –       | sg  | male   | 3rd | I
saktii hain   | present | –        | ability | pl  | female | 3rd | A
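A lookup over the fragment of the paradigm shown above can be written directly as a keyed table; the sketch below covers only the four rows on the slide and is an illustration, not the full generator.

```python
# Sketch of suffix selection keyed on (tense, aspect, mood, number, gender, person),
# covering only the rows shown in the table above.

VERB_SUFFIXES = {
    ("past", "progress", "-", "sg", "male", "3rd"):     "-e rahaa thaa",
    ("present", "custom", "-", "sg", "male", "3rd"):    "-taa hai",
    ("past", "complete", "-", "sg", "male", "3rd"):     "-iyaa thaa",
    ("present", "-", "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect, mood, number, gender, person):
    return VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person))

print(verb_suffix("past", "progress", "-", "sg", "male", "3rd"))  # '-e rahaa thaa'
```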

After Morphology Generation

[Diagram: UNL graph before morphology (you→आप, this→यह, farmer→किसान, contact→संपर्क) and after morphology (this→इस, farmer→किसानों, contact→संपर्क कीजिए)]

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

[Diagram: UNL graph after function word insertion — this→इसके लिए, farmer→किसानों को, contact→संपर्क कीजिए]

Rel | Par+ | Par– | Chi+     | Chi–  | ChFW
obj | V    | VINT | N, ANIMT | topic | को
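The rule row above (obj: V, not VINT, N ANIMT, not topic → को) suggests a simple table-driven lookup. The sketch below illustrates that reading; the pur → "के लिए" entry is invented purely as a second example and is not from the slide.

```python
# Sketch of a function-word insertion rule lookup; the rule tuple format
# follows the table above and the data structures are assumptions.

FW_RULES = [
    # (relation, parent_has, parent_not, child_has, child_not, function_word)
    ("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "को"),
    ("pur", set(), set(), set(), set(), "के लिए"),      # illustrative only
]

def insert_function_word(relation, parent_feats, child_feats):
    for rel, p_has, p_not, c_has, c_not, fw in FW_RULES:
        if (rel == relation
                and p_has <= parent_feats and not (p_not & parent_feats)
                and c_has <= child_feats and not (c_not & child_feats)):
            return fw
    return None

print(insert_function_word("obj", {"V"}, {"N", "ANIMT"}))   # -> 'को'
```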

Linearization

Linear order: इस आप मंचर क्षेत्र खटाव तालुका किसानों संपर्क
(this, you, Manchar, region, Khatav, taluka, farmers, contact)

[Diagram: UNL graph of the example, with contact realized as संपर्क कीजिए]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

Relative-order table (row relation is placed L = to the left of, R = to the right of the column relation):

     | agt | aoj | obj | ins
agt  |     | L   | L   | L
aoj  |     |     | L   | L
obj  |     |     |     | R
ins  |     |     |     |

[Diagram: partial UNL graphs of the contact example illustrating the assumptions]

Syntax Planning: Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact (see the code sketch below)

[Diagram: the same procedure applied to the region–taluka–Manchar–Khatav sub-graph]
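A compact sketch of the strategy: order each node's dependents by a fixed relation priority, then emit dependents before the head. The priority list below mirrors the L/R matrix of the earlier slide, with the positions of pur and plc assumed; the real deconverter also maintains an After_set, which is empty in this example.

```python
# Sketch of syntax planning: sort a node's dependents by relation priority
# and recurse, emitting dependents before the head (Before_set only).

RELATION_PRIORITY = ["pur", "plc", "agt", "aoj", "obj"]   # left-to-right (assumed)

def linearize(node, graph):
    """graph: dict head -> list of (relation, child); returns the word order."""
    children = sorted(graph.get(node, []),
                      key=lambda rc: RELATION_PRIORITY.index(rc[0]))
    order = []
    for _, child in children:          # all Before_set children precede the head
        order.extend(linearize(child, graph))
    order.append(node)
    return order

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph))     # ['this', 'you', 'farmer', 'contact']
```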

Syntax Planning: Algorithm

[Table: step-by-step trace of the algorithm with columns Stack, Before Current, After Current and Output, over the nodes contact, farmer, taluka, region and Manchar; the trace produces the order this, you, Manchar, region, Khatav, taluka, farmer, contact]

[Diagram: UNL graph of the example, as before]

All Together (UNL → Hindi)

[Table: the complete module-by-module run for the example (Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion, Syntax Linearization), with the same outputs as in the step-through above, ending with: इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए ।]

How to Evaluate UNL Deconversion?

• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken – understandable with effort
(1) Nonsense: incomprehensible

Adequacy – how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output | Fluency | Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                             | BLEU | Fluency | Adequacy
Geometric Average            | 0.34 | 2.54    | 2.84
Arithmetic Average           | 0.41 | 2.71    | 3.00
Standard Deviation           | 0.25 | 0.89    | 0.89
Pearson Corr. w.r.t. BLEU    | 1.00 | 0.59    | 0.50
Pearson Corr. w.r.t. Fluency | 0.59 | 1.00    | 0.68
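The correlations in the table are plain Pearson coefficients over per-sentence scores; a minimal sketch with made-up sample values is shown below.

```python
# Minimal Pearson correlation sketch; the per-sentence scores are illustrative.

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu    = [0.31, 0.52, 0.08, 0.67, 0.45]   # illustrative per-sentence BLEU
fluency = [2, 3, 1, 4, 3]                  # illustrative fluency scores
print(round(pearson(bleu, fluency), 2))
```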

Fluency vs Adequacy

[Stacked bar chart: number of sentences at each Fluency level (1–4), broken down by Adequacy level (1–4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Large-scale evaluation can therefore be done using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Diagram: Vauquois triangle — levels: deep understanding (ontological interlingua), interlingual (semantico-linguistic interlingua), logico-semantic (SPA structures: semantic & predicate-argument), syntactico-functional (F-structures), morpho-syntactic (C-structures), syntagmatic (tagged text), graphemic (text); transfer types: conceptual transfer, semantic transfer, multilevel transfer, deep and surface syntactic transfer, semi-direct translation, direct translation; mixing levels: multilevel description]

Sampark: IL–IL MT Systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (bidirectional): Tamil–Hindi, Telugu–Hindi, Marathi–Hindi, Bengali–Hindi, Tamil–Telugu, Urdu–Hindi, Kannada–Hindi, Punjabi–Hindi, Malayalam–Tamil

Sampark Architecture

Horizontal Tasks
• H1: POS tagging & chunking engine
• H2: Morph analyser engine
• H3: Generator engine
• H4: Lexical disambiguation engine
• H5: Named entity engine
• H6: Dictionary standards
• H7: Corpora annotation standards
• H8: Evaluation of output (comprehensibility)
• H9: Testing & integration

Vertical Tasks for Each Language
• V1: POS tagger & chunker
• V2: Morph analyzer
• V3: Generator
• V4: Named entity recognizer
• V5: Bilingual dictionary – bidirectional
• V6: Transfer grammar
• V7: Annotated corpus
• V8: Evaluation
• V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all – it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g., listening to a language that has lots of words borrowed from your language – you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g., someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g., someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g., someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (y-axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M–H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, "Processing of Participle (Krudanta) in Marathi", International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
• वाच vaach (read) → वाचणे vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• चाव chav (bite) → चावणारा chaavaNaara (one who bites)
• खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta Type | Example (gloss; translation) | Aspect
ण "Ne"-kridanta | vaachNyaasaaThee pustak de (for-reading book give; "Give me a book for reading") | Perfective
ल "laa"-kridanta | Lekh vaachalyaavar saaMgen (article after-reading will-tell; "I will tell you after reading the article") | Perfective
त "Taanaa"-kridanta | Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came; "I noticed it while reading the book") | Durative
लल "Lela"-kridanta | kaal vaachlele pustak de (yesterday read book give; "Give me the book that was read yesterday") | Perfective
ऊ "Un"-kridanta | pustak vaachun parat kar (book after-reading back do; "Return the book after reading it") | Completive
णर "Nara"-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets; "The one who reads books gets knowledge") | Stative
"ve"-kridanta | he pustak pratyekaane vaachaave (this book everyone should-read; "Everyone should read this book") | Inceptive
त "taa"-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went; "He fell asleep while reading a book") | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada:
  muridiruwaa kombe jennu esee
  (broken to-branch throw)
  "Throw away the broken branch"
  – similar to the "lela" form frequently used in Marathi

Telugu:
  ame padutunnappudoo nenoo panichesanoo
  (she singing I work)
  "I worked while she was singing"
  – similar to the "taanaa" form frequently used in Marathi

Turkish:
  hazirlanmis plan
  (prepare-past plan)
  "The plan which has been prepared"
  – equivalent to the Marathi "lelaa" form

Morphological Processing of Kridanta Forms

[Figure: morphotactics FSM for kridanta processing]
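In the same spirit as the morphotactics FSM above, here is a toy finite-state sketch for recognizing a few kridanta suffixes; the states, suffix strings and feature bundles are simplified assumptions, not the automaton from the paper.

```python
# Toy finite-state sketch of kridanta morphotactics (transitions and features
# are simplified assumptions for illustration).

TRANSITIONS = {
    ("STEM", "NyaasaaThee"): ("KRIDANTA", {"type": "Ne", "aspect": "perfective"}),
    ("STEM", "taanaa"):      ("KRIDANTA", {"type": "taanaa", "aspect": "durative"}),
    ("STEM", "lele"):        ("KRIDANTA", {"type": "lela", "aspect": "perfective"}),
    ("STEM", "un"):          ("KRIDANTA", {"type": "Un", "aspect": "completive"}),
}

def analyse(word, stems):
    """Return (stem, kridanta features) if word = known stem + known suffix."""
    for stem in stems:
        if word.startswith(stem):
            suffix = word[len(stem):]
            state, feats = TRANSITIONS.get(("STEM", suffix), (None, None))
            if state == "KRIDANTA":
                return stem, feats
    return None

print(analyse("vaachtaanaa", ["vaach", "utar", "paL"]))
# ('vaach', {'type': 'taanaa', 'aspect': 'durative'})
```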

Accuracy of Kridanta Processing: Direct Evaluation

[Bar chart: precision and recall (y-axis 0.86 to 0.98) for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-kridanta types]

Summary of M–H Transfer-Based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule-governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary: Interlingua-Based MT

• English to Hindi
• Rule-governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Semantic Processing Universal Word Generation

bull The sense-ids are used to get the Universal Words from the Universal Word dictionary (UW++)

Word Universal WordWill-Smith Will-Smith(iofgtPERSON)

eating eat(iclgtconsume equgtconsumption)

apple apple(iclgtedible fruitgtthing)

fork fork(iclgtcutlerygtthing)

bank bank(iclgtslopegtthing)

river bank(iclgtstreamgtthing)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Semantic Processing Nouns originating from verbs

bull Some nouns are originated from verbs

bull The frame noun1 of noun2 where noun1 has originated from verb1 should generate relation obj(verb1 noun2) instead of any relation between noun1 and noun2

Traverse the hypernym-path of the noun to its root

Any word on the path is ldquoactionrdquo then the noun has originated from a verb

Algorithm removal of garbage =

removing garbage Collection of stamps =

collecting stamps Wastage of food =

wasting food

Examples

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine; H2: morph analyser engine; H3: generator engine; H4: lexical disambiguation engine; H5: named entity engine; H6: dictionary standards; H7: corpora annotation standards; H8: evaluation of output (comprehensibility); H9: testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker; V2: morph analyzer; V3: generator; V4: named entity recognizer; V5: bilingual dictionary (bidirectional); V6: transfer grammar; V7: annotated corpus; V8: evaluation; V9: co-ordination
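The same horizontal engines are instantiated per language and chained into an analysis-transfer-generation pipeline for each language pair. A minimal sketch of that composition (module names and ordering are illustrative, not Sampark's actual API):

# Illustrative sketch: composing per-language analysis, transfer and generation modules.
from typing import Callable, List

Module = Callable[[str], str]   # each module rewrites an annotated sentence representation

def make_pipeline(modules: List[Module]) -> Module:
    # Chain modules left to right into a single translation function.
    def run(text: str) -> str:
        for module in modules:
            text = module(text)
        return text
    return run

# Dummy stand-ins for the real engines (morph analyser, POS tagger, chunker, transfer, generator).
morph    = lambda s: s + " |MORPH"
pos_tag  = lambda s: s + " |POS"
chunk    = lambda s: s + " |CHUNK"
transfer = lambda s: s + " |XFER(mar->hin)"
generate = lambda s: s + " |GEN"

marathi_to_hindi = make_pipeline([morph, pos_tag, chunk, transfer, generate])
print(marathi_to_hindi("to pustak vaachat hota"))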

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation scale (grade: description)

0: Nonsense (the sentence doesn't make any sense at all; like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (like listening to a language with many words borrowed from your own: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (like someone who can speak your language but makes lots of errors; you can still make sense of what is being said)
3: Comprehensible, occasional errors (like someone speaking Hindi getting all its genders wrong)
4: Perfect (like someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

[Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules; y-axis 0-1.2]

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories.

Nouns (Verb -> Noun):
वाच vaach (read) -> वाचणे vaachaNe (reading)
उतर utara (climb down) -> उतरण utaraN (downward slope)

Adjectives (Verb -> Adjective):
चाव chaav (bite) -> चावणारा chaavaNaaraa (one who bites)
खा khaa (eat) -> खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb -> Adverb):
पळ paL (run) -> पळताना paLataanaa (while running)
बस bas (sit) -> बसून basun (after sitting)

Kridanta Types (type | example | aspect)

"णे" Ne-Kridanta | vaachNyaasaaThee pustak de, "Give me a book for reading" (lit. for-reading book give) | Perfective
"ल" laa-Kridanta | Lekh vaachalyaavar saaMgen, "I will tell you that after reading the article" (lit. article after-reading will-tell) | Perfective
"ताना" Taanaa-Kridanta | Pustak vaachtaanaa te lakShaat aale, "I noticed it while reading the book" (lit. book while-reading it in-mind came) | Durative
"लेले" Lela-Kridanta | kaal vaachlele pustak de, "Give me the book that (I/you) read yesterday" (lit. yesterday read book give) | Perfective
"ऊन" Un-Kridanta | pustak vaachun parat kar, "Return the book after reading it" (lit. book after-reading back do) | Completive
"णारा" Nara-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte, "The one who reads books gets knowledge" (lit. books to-the-one-who-reads knowledge gets) | Stative
"वे" ve-Kridanta | he pustak pratyekaane vaachaave, "Everyone should read this book" (lit. this book everyone should-read) | Inceptive
"ता" taa-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa, "He fell asleep while reading a book" (lit. he book while-reading to-sleep went) | Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
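A minimal sketch in the spirit of such a morphotactics FSM: peel a participial suffix off a known verb stem and report the kridanta type. The romanized suffixes are taken from the kridanta-type table above; stemming, sandhi and the full FSM states are omitted, so this is only illustrative.

# Hedged sketch: recognize a kridanta form as stem + participial suffix (romanized forms).
KRIDANTA_SUFFIXES = {              # suffix -> (type, rough gloss); coverage is illustrative
    "Nyaasaathee": ("Ne-kridanta",     "for V-ing"),
    "lyaavar":     ("laa-kridanta",    "after V-ing"),
    "taanaa":      ("taanaa-kridanta", "while V-ing (durative)"),
    "lele":        ("lela-kridanta",   "that which was V-ed"),
    "un":          ("un-kridanta",     "having V-ed (completive)"),
    "Naaraa":      ("Nara-kridanta",   "one who V-s"),
    "aave":        ("ve-kridanta",     "should V"),
    "taa":         ("taa-kridanta",    "while V-ing"),
}

def analyze(form, stems=("vaach", "paL", "bas", "khaa")):
    # Try every known stem, then match the remainder against the suffix table.
    for stem in stems:
        if form.startswith(stem):
            rest = form[len(stem):]
            for suffix, (ktype, gloss) in KRIDANTA_SUFFIXES.items():
                if rest.lower() == suffix.lower():
                    return {"stem": stem, "suffix": suffix, "type": ktype, "gloss": gloss}
    return None

print(analyze("vaachtaanaa"))   # expected: taanaa-kridanta ("while reading")
print(analyze("vaachlele"))     # expected: lela-kridanta ("that which was read")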

Accuracy of Kridanta Processing Direct Evaluation

[Chart: precision and recall of kridanta processing by type (Ne, La, Nara, Lela, Tana, T, Oon, Va kridantas); values between 0.86 and 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Semantic Processing Nouns originating from verbs

• Some nouns originate from verbs
• The frame "noun1 of noun2", where noun1 originates from verb1, should generate the relation obj(verb1, noun2) instead of any relation between noun1 and noun2

Algorithm:
– Traverse the hypernym path of the noun to its root
– If any word on the path is "action", the noun has originated from a verb

Examples:
– removal of garbage = removing garbage
– collection of stamps = collecting stamps
– wastage of food = wasting food
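A minimal sketch of that hypernym test using WordNet. NLTK and its WordNet data are assumed to be available, and the marker set is an assumption rather than the system's actual keyword list.

# Hedged sketch: decide whether a noun originates from a verb by walking its hypernym paths.
from nltk.corpus import wordnet as wn

ACTION_MARKERS = {"act", "action", "activity", "event"}   # assumed marker lemmas

def is_deverbal_noun(noun):
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():          # every path up to the root
            for ancestor in path:
                if ACTION_MARKERS & set(ancestor.lemma_names()):
                    return True
    return False

for word in ["removal", "collection", "wastage", "spoon"]:
    print(word, "->", is_deverbal_noun(word))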

Features of UWs

• Each UNL relation requires some properties of its UWs to be satisfied
• We capture these properties as features
• We classify the features into noun and verb features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

• Add word.NE_type to word.features
• Algorithm:
  – Traverse the hypernym path of the noun to its root
  – If any word on the path matches a keyword, the corresponding feature is generated

Noun | Feature
Will-Smith | PERSON, ANIMATE
bank | PLACE
fork | CONCRETE
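A sketch of the same traversal used for feature generation. The keyword-to-feature map below is illustrative; the system's actual keyword table is not shown on the slides.

# Hedged sketch: derive noun features (PERSON, PLACE, CONCRETE, ...) from WordNet hypernyms.
from nltk.corpus import wordnet as wn

KEYWORD_TO_FEATURE = {                  # illustrative mapping, not the system's actual table
    "person": "PERSON", "living_thing": "ANIMATE", "location": "PLACE",
    "physical_entity": "CONCRETE", "abstraction": "ABSTRACT", "time_period": "TIME",
}

def noun_features(noun, ne_type=None):
    features = {ne_type} if ne_type else set()      # add word.NE_type to word.features
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for path in synset.hypernym_paths():
            for ancestor in path:
                for lemma in ancestor.lemma_names():
                    if lemma in KEYWORD_TO_FEATURE:
                        features.add(KEYWORD_TO_FEATURE[lemma])
    return features

# Which features fire depends on the keyword map and the word's WordNet senses.
print(noun_features("fork"))                          # physical senses give CONCRETE
print(noun_features("Will-Smith", ne_type="PERSON"))  # out-of-vocabulary NE keeps its NE type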

Types of verbs

• Verb features are based on the types of verbs
• There are three types:
  – Do: verbs that are performed volitionally
  – Occur: verbs that are performed involuntarily, or just happen without any performer
  – Be: verbs that state the existence of something or some fact
• We classify these three into "do" and "not-do"

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to find whether a verb is of the "do" type or not
• "do" verbs are volitional verbs which are performed voluntarily

• Heuristic: if a frame has the shape "Something/somebody ...s something/somebody", the verb is of the "do" type
• This heuristic fails to mark some volitional verbs as "do", but marks all non-volitional verbs as "not-do"

Word | Type | Feature
was | not do | -
eating | do | do
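A sketch of that heuristic over WordNet sentence frames. NLTK's frame_strings() is assumed to expose the frames; the string test mirrors the "Something/somebody ...s something/somebody" pattern and is only an approximation.

# Hedged sketch: mark a verb as "do" (volitional) if some WordNet sentence frame has an
# agent-like subject and an object.
from nltk.corpus import wordnet as wn

def is_do_verb(verb):
    for synset in wn.synsets(verb, pos=wn.VERB):
        for lemma in synset.lemmas():
            for frame in lemma.frame_strings():       # e.g. "Somebody <verb> something"
                if frame.startswith(("Somebody", "Something")) and \
                   frame.endswith(("something", "somebody")):
                    return True
    return False

print("eat  ->", "do" if is_do_verb("eat") else "not-do")
print("rain ->", "do" if is_do_verb("rain") else "not-do")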

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

• Relations handled separately
• Similar relations handled in clusters
• Rule-based approach
• Rules are applied on:
  – Stanford dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (22)

Input: "Will-Smith was eating an apple with a fork on the bank of the river"

Conditions | Relation generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active | agt(eat, Will-Smith)
dobj(eating, apple); eat is active | obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE | ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE | plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON | mod(bank, river)
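A condensed sketch of rules of this shape. The relation labels follow the table above; the dependency labels use Stanford-style collapsed prepositions, and the rule set is only the fragment needed for the worked example.

# Hedged sketch: a few dependency+feature -> UNL relation rules for the worked example.
def generate_relations(deps, noun_feats, verb_type, voice="active"):
    """deps: iterable of (label, head, dependent); noun_feats: word -> set of features."""
    relations = []
    for label, head, dep in deps:
        feats = noun_feats.get(dep, set())
        if label == "nsubj" and "ANIMATE" in feats and verb_type.get(head) == "do" \
                and voice == "active":
            relations.append(("agt", head, dep))
        elif label == "dobj" and voice == "active":
            relations.append(("obj", head, dep))
        elif label == "prep_with" and "CONCRETE" in feats:
            relations.append(("ins", head, dep))
        elif label == "prep_on" and "PLACE" in feats:
            relations.append(("plc", head, dep))
        elif label == "prep_of" and "PERSON" not in feats:
            relations.append(("mod", head, dep))
    return relations

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep_with", "eat", "fork"), ("prep_on", "eat", "bank"),
        ("prep_of", "bank", "river")]
feats = {"Will-Smith": {"PERSON", "ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}}
print(generate_relations(deps, feats, {"eat": "do"}))
# -> agt(eat, Will-Smith), obj(eat, apple), ins(eat, fork), plc(eat, bank), mod(bank, river)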

Attribute Generation (12)

• Attributes handled in clusters
• Attributes are generated by rules
• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (22)

Input: "Will-Smith was eating an apple with a fork on the bank of the river"

Conditions | Attributes generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) | eat.@entry.@progress.@past
det(apple, an) | apple.@indef
det(fork, a) | fork.@indef
det(bank, the) | bank.@def
det(river, the) | river.@def
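A matching sketch for the attribute rules. The attribute names follow the table above; the XLE-style tense facts are passed in as a simple dictionary, which is an assumption about the interface.

# Hedged sketch: attribute rules driven by determiners and tense/aspect facts.
def generate_attributes(dets, tense_facts):
    """dets: (noun, determiner) pairs; tense_facts: verb -> dict of XLE-style features."""
    attrs = {}
    for noun, det in dets:
        attrs.setdefault(noun, set()).add("@indef" if det in ("a", "an") else "@def")
    for verb, facts in tense_facts.items():
        out = attrs.setdefault(verb, set())
        if facts.get("VTYPE") == "main":
            out.add("@entry")
        if facts.get("PROG") == "+":
            out.add("@progress")
        if facts.get("TENSE") == "past":
            out.add("@past")
    return attrs

dets = [("apple", "an"), ("fork", "a"), ("bank", "the"), ("river", "the")]
tense_facts = {"eat": {"VTYPE": "main", "PROG": "+", "TENSE": "past"}}
print(generate_attributes(dets, tense_facts))
# eat -> {@entry, @progress, @past}; apple, fork -> {@indef}; bank, river -> {@def}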

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus (7000 sentences)
• Evaluated on 4 scenarios:
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff relation_gen = relation_gold, UW1_gen = UW1_gold, and UW2_gen = UW2_gold
• If UW_gen = UW_gold, let a_match(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
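These definitions translate directly into a scoring routine. A sketch, under the matching criteria listed above (relation precision/recall over triples plus an attribute-match ratio; the exact aggregation used in the reported numbers is not spelled out on the slides):

# Hedged sketch: score generated UNL triples and attributes against a gold standard.
def relation_prf(gen, gold):
    """gen, gold: sets of (relation, UW1, UW2) triples; a triple matches only if all three agree."""
    tp = len(gen & gold)
    precision = tp / len(gen) if gen else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def attribute_accuracy(gen_attrs, gold_attrs):
    """gen_attrs, gold_attrs: UW -> set of attributes, for UWs where UW_gen = UW_gold."""
    matched = sum(len(gen_attrs[uw] & gold_attrs[uw])
                  for uw in gen_attrs.keys() & gold_attrs.keys())
    total = sum(len(v) for v in gold_attrs.values())
    return matched / total if total else 0.0

gen  = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("ins", "eat", "fork")}
print(relation_prf(gen, gold))                                          # (0.67, 0.67, 0.67)
print(attribute_accuracy({"eat": {"@entry", "@past"}},
                         {"eat": {"@entry", "@past", "@progress"}}))    # 0.67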

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Chart: attribute accuracy for Scenario 1-4; y-axis 0-0.5]

Results Time-taken

• Time taken by the systems to process 7000 sentences from the EOLSS corpus

[Chart: time taken for Scenario 1-4; y-axis 0-40]

ERROR ANALYSIS

Wrong Relations Generated

[Chart: wrongly generated (false positive) relations by type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Chart: missed (false negative) relations by type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (1/3)

• Conjunctive relations – wrong apposition detection
  "James loves eating red water-melon, apples peeled nicely, and bananas"
  amod(water-melon-5, red-4)
  appos(water-melon-5, apples-7)
  partmod(apples-7, peeled-8)
  advmod(peeled-8, nicely-9)
  cc(nicely-9, and-10)
  conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations – non-uniform relation generation
  "George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game"
  appos(George-1, president-4)
  conj(club-8, James-10)
  conj(club-8, coach-13)
  cc(club-8, and-17)
  conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation-generation possibilities
  "The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms"
  Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
  Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module | Output
UNL Expression | obj(contact(…
Lexeme Selection | संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव (contact farmer this you region taluka Manchar Khatav)
Case Identification | संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव (contact farmer this you region taluka Manchar Khatav)
Morphology Generation | संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव (contact-IMPERATIVE farmer.PL this you region taluka Manchar Khatav)
Function Word Insertion | संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव (contact-IMPERATIVE farmers-ko this-for you region or taluka-of Manchar Khatav)
Linearization | इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए। (this-for you Manchar region or Khatav taluka-of farmers contact-IMPERATIVE)

[UNL graph of the example: "contact" with agt -> you, obj -> farmer, pur -> this, plc -> region; within the :01 scope, "manchar" and "khatav" (joined by or) are linked to region/taluka by nam]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical Choice is unambiguous

[Graph with lexemes chosen: contact -> संपर्क, agt you -> आप, obj farmer -> किसान, pur this -> यह]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation | Parent+ | Parent- | Child+ | Child-
obj | V | VINT | N |
agt | V | past | N |

[Graph with cases identified: contact -> संपर्क, agt you -> आप, obj farmer -> किसान, pur this -> यह]

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा
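A sketch of table-driven noun suffixing of this kind. The tiny paradigm below covers only masculine -aa nouns such as "ladakaa" (boy) from the example; HinD's actual morphology tables are much larger, so the entries are illustrative.

# Hedged sketch: pick a Hindi noun suffix from (number, case) for one masculine -aa paradigm.
AA_MASC_PARADIGM = {
    ("sg", "direct"):  "aa",
    ("sg", "oblique"): "e",    # ladake (before a postposition such as ne)
    ("pl", "direct"):  "e",
    ("pl", "oblique"): "oM",   # ladakoM
}

def inflect(stem, number, case):
    return stem + AA_MASC_PARADIGM[(number, case)]

# "Boys saw me": the noun is plural and oblique because it is followed by the ergative ne.
print(inflect("ladak", "pl", "oblique") + " ne mujhe dekhaa")
print(inflect("ladak", "sg", "oblique") + " ne mujhe dekhaa")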

Verb Morphology

Suffix | Tense | Aspect | Mood | Number | Gender | Person | VE
-e rahaa thaa | past | progressive | - | sg | male | 3rd | e
-taa hai | present | custom | - | sg | male | 3rd | -
-iyaa thaa | past | complete | - | sg | male | 3rd | I
saktii hain | present | - | ability | pl | female | 3rd | A

After Morphology Generation

[Graph before morphology generation: contact -> संपर्क, agt you -> आप, obj farmer -> किसान, pur this -> यह]
[Graph after morphology generation: contact -> संपर्क कीजिए, agt you -> आप, obj farmer -> किसानों, pur this -> इस]

Function Word Insertion

Before: संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

[Graph after function word insertion: contact -> संपर्क कीजिए, agt you -> आप, obj farmer -> किसानों को, pur this -> इसके लिए]

Rel | Par+ | Par- | Chi+ | Chi- | Child FW
obj | V | VINT | N, ANIMT | topic | को

Linearization

इसके लिए आप मंचर क्षेत्र (this-for you Manchar region)
खटाव तालुका के किसानों को (Khatav taluka-of farmers)
संपर्क कीजिए (contact)

[UNL graph of the example, with "contact" realized as संपर्क कीजिए]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  – Semantic independence: the semantic properties of the relata
  – Context independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local ordering: the rest of the expression

[Illustrations: fragments of the example graph (pur(contact, this); agt, obj, pur around "contact")]

Relative-order matrix (row relation placed L(eft) or R(ight) of the column relation):
    | agt | aoj | obj | ins
agt |  -  |  L  |  L  |  L
aoj |     |  -  |  L  |  L
obj |     |     |  -  |  R
ins |     |     |     |  -

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {} (for "contact" in the example)
• Topologically sort each group: pur, agt, obj, i.e. this, you, farmer
• Final order: this you farmer contact

[The region / khatav / manchar / taluka subtree is linearized the same way]

Syntax Planning Algo

[Table: step-by-step trace of the planning on the example, showing the stack, the Before/After sets of the current node, and the accumulated output, ending in the full order: this, you, manchar, region, (or) khatav, taluka, farmer, contact]

[UNL graph of the example, with "contact" realized as संपर्क कीजिए]
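A sketch of the planning step under the assumptions above: a pairwise left/right matrix splits a node's relations into Before and After sets, each set is ordered, and children are linearized recursively. The priority values and the tiny graph mirror the worked example; tie-breaking and the After-set handling are assumptions.

# Hedged sketch: linearize a UNL graph with a relative-order matrix ("before" = left of the head).
ORDER = {  # relation -> (side w.r.t. head, priority among same-side siblings); assumed priorities
    "pur": ("before", 0), "agt": ("before", 1), "aoj": ("before", 1),
    "obj": ("before", 2), "ins": ("before", 2), "plc": ("before", 2),
}

def linearize(node, graph):
    """graph: head -> list of (relation, child). Returns the word order as a list."""
    before, after = [], []
    children = sorted(graph.get(node, []), key=lambda rc: ORDER.get(rc[0], ("after", 0))[1])
    for rel, child in children:
        side = ORDER.get(rel, ("after", 0))[0]
        (before if side == "before" else after).extend(linearize(child, graph))
    return before + [node] + after

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph))    # expected: ['this', 'you', 'farmer', 'contact']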

All Together (UNL -> Hindi)

[The same module-by-module step-through as above, run end to end: UNL expression -> Lexeme Selection -> Case Identification -> Morphology Generation -> Function Word Insertion -> Syntax Linearization, producing इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।]

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Features of UWs

bull Each UNL relation requires some properties of their UWs to be satisfied

bull We capture these properties as features

bull We classify the features into Noun and Verb Features

Relation Word Property

agt(uw1 uw2) uw1 Volitional verb

met(uw1 uw2) uw2 Abstract thing

dur(uw1 uw2) uw2 Time period

ins(uw1 uw2) uw2 Concrete thing

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Semantic Processing Noun Features

bull Add wordNE_type to wordfeatures

bull Traverse the hypernym-path of the noun to its root

bull Any word on the path matches a keyword corresponding feature generated

Noun Feature

Will-Smith PERSON ANIMATE

bank PLACE

fork CONCRETE

Algorithm

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

• The sentence frames of verbs are inspected to decide whether a verb is of the "do" type
• "do" verbs are volitional verbs, performed voluntarily
• Heuristic: if a frame reads "Something/somebody …s something/somebody", mark the verb as "do"
• This heuristic misses some volitional verbs, but it marks all non-volitional verbs as "not-do"

Word | Type | Feature
was | not-do | -
eating | do | do
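A minimal sketch of the frame heuristic, assuming the verb's sentence frames are available as WordNet-style strings (the exact frame inventory the system inspects is not shown on the slide):

import re

FRAME = re.compile(r"^(Something|Somebody)\b.*\b(something|somebody)$", re.IGNORECASE)

def is_do_verb(frame_strings):
    """Mark a verb as "do" if any frame reads 'Something/Somebody ...s something/somebody'."""
    return any(FRAME.match(frame.strip()) for frame in frame_strings)

# is_do_verb(["Somebody eats something"])  -> True   ("do", volitional)
# is_do_verb(["It is raining"])            -> False  ("not-do")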

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations are handled separately
• Similar relations are handled as a cluster
• Rule-based approach
• Rules are applied on:
  – Stanford Dependency relations
  – Noun features
  – Verb features
  – POS tags
  – XLE-generated information

Relation Generation (2/2)

Input: "Will-Smith was eating an apple with a fork on the bank of the river"

Conditions → Relation Generated
nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is "do" type and active → agt(eat, Will-Smith)
dobj(eating, apple); eat is active → obj(eat, apple)
prep(eating, with), pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
prep(eating, on), pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)
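A minimal sketch of such relation rules, assuming the dependency triples, noun features and verb information arrive as plain Python data; the rules below cover only this example and are not the system's full rule base.

def generate_relations(deps, feats, verb_info):
    """deps: (label, head, dependent) triples; feats: word -> set of noun features;
    verb_info: verb -> {"type": "do"/"not-do", "voice": "active"/"passive"}."""
    by_head = {(label, head): dep for (label, head, dep) in deps}
    relations = []
    for (label, head, child) in deps:
        vinfo = verb_info.get(head, {})
        if label == "nsubj" and "ANIMATE" in feats.get(child, set()) \
                and vinfo.get("type") == "do" and vinfo.get("voice") == "active":
            relations.append(("agt", head, child))
        elif label == "dobj" and vinfo.get("voice") == "active":
            relations.append(("obj", head, child))
        elif label == "prep":
            pobj = by_head.get(("pobj", child))  # the object of the preposition
            if pobj is None:
                continue
            if child == "with" and "CONCRETE" in feats.get(pobj, set()):
                relations.append(("ins", head, pobj))
            elif child == "on" and "PLACE" in feats.get(pobj, set()):
                relations.append(("plc", head, pobj))
            elif child == "of" and "PERSON" not in feats.get(pobj, set()):
                relations.append(("mod", head, pobj))  # the slide's rule also checks that the head is a noun
    return relations

deps = [("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork"),
        ("prep", "eat", "on"), ("pobj", "on", "bank"),
        ("prep", "bank", "of"), ("pobj", "of", "river")]
feats = {"Will-Smith": {"PERSON", "ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"}, "river": set()}
verbs = {"eat": {"type": "do", "voice": "active"}}
# generate_relations(deps, feats, verbs) -> agt(eat, Will-Smith), obj(eat, apple),
#                                           ins(eat, fork), plc(eat, bank), mod(bank, river)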

Attribute Generation (1/2)

• Attributes are handled in clusters
• Attributes are generated by rules
• Rules are applied on:
  – Stanford dependency relations
  – XLE information
  – POS tags

Attribute Generation (2/2)

Input: "Will-Smith was eating an apple with a fork on the bank of the river"

Conditions → Attributes Generated
eat is the verb of the main clause; VTYPE(eat, main); PROG(eat, +); TENSE(eat, past) → eat.@entry.@progress.@past
det(apple, an) → apple.@indef
det(fork, a) → fork.@indef
det(bank, the) → bank.@def
det(river, the) → river.@def
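A minimal sketch of the attribute rules, assuming determiner dependencies and XLE tense/aspect features as plain Python data, with UNL-style "@" attributes:

def generate_attributes(deps, xle_info):
    attrs = {}  # word -> list of UNL attributes

    def add(word, attribute):
        attrs.setdefault(word, []).append(attribute)

    # Determiners give definiteness attributes
    for (label, head, child) in deps:
        if label == "det" and child in ("a", "an"):
            add(head, "@indef")
        elif label == "det" and child == "the":
            add(head, "@def")

    # The main-clause verb is the entry node; tense and aspect come from the XLE features
    for verb, info in xle_info.items():
        if info.get("VTYPE") == "main":
            add(verb, "@entry")
            if info.get("PROG") == "+":
                add(verb, "@progress")
            if info.get("TENSE") == "past":
                add(verb, "@past")
    return attrs

deps = [("det", "apple", "an"), ("det", "fork", "a"), ("det", "bank", "the"), ("det", "river", "the")]
xle = {"eat": {"VTYPE": "main", "PROG": "+", "TENSE": "past"}}
# generate_attributes(deps, xle) -> apple:@indef, fork:@indef, bank:@def, river:@def,
#                                   eat:@entry @progress @past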

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated in 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 minus (Simplifier + Merger)
  – Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
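A minimal sketch of how these definitions turn into scores, assuming the standard precision/recall/F formulas over exactly-matched triples and one plausible reading of the attribute score (the slides define amatch and count but not the final ratio):

def relation_prf(generated, gold):
    """generated, gold: sets of (relation, UW1, UW2) triples; exact match as defined above."""
    matched = generated & gold
    precision = len(matched) / len(generated) if generated else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_score

def attribute_accuracy(gen_attrs, gold_attrs):
    """gen_attrs, gold_attrs: UW -> set of attributes; score = matched attributes / gold attributes."""
    matched = sum(len(gen_attrs.get(uw, set()) & gold) for uw, gold in gold_attrs.items())
    total = sum(len(gold) for gold in gold_attrs.values())
    return matched / total if total else 0.0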

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

[Bar chart "Attribute Accuracy": accuracy (0 to 0.5 scale) for Scenario 1 to Scenario 4]

Results Time-taken

• Time taken by the systems to process the 7000 evaluation sentences from the EOLSS corpus

[Bar chart "Time taken": scale 0 to 40, for Scenario 1 to Scenario 4]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart "False positive": wrong relations per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Missed Relations

[Bar chart "False negative": missed relations per relation type (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• "James loves eating red water-melon, apples peeled nicely and bananas"

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• "George, the president of the football club, James, the coach of the club and the players went to watch their rivals' game"

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low precision for frequent relations
  – mod and qua: more general rules
  – agt vs aoj: non-uniformity in the corpus
  – gol vs obj: multiple relation-generation possibilities

"The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms"

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module → Output

UNL Expression: obj(contact(…
Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)
Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)
Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact imperative farmer-pl this you region taluka manchar Khatav)
Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this for you region or taluka of Manchar Khatav)
Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (This for you manchar region or khatav taluka of farmers contact)

[UNL graph of the example: nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous: the UW restrictions select the verb entry.

[UNL graph with chosen lexemes: you → आप, this → य, farmer → निकस, contact → सपक+; relations agt, obj, pur]
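A minimal sketch of reading such a dictionary entry, assuming the [lexeme] "UW" (features) layout seen above; the real HinD dictionary format may differ in detail.

import re

ENTRY = re.compile(r'\[(?P<lexeme>[^\]]+)\]\s*"(?P<uw>[^"]+)"\s*\((?P<feats>[^)]*)\)')

def parse_entry(line):
    """Split one dictionary line into the Hindi lexeme, the universal word and its feature list."""
    match = ENTRY.match(line.strip())
    if not match:
        return None
    return {
        "lexeme": match.group("lexeme").strip(),
        "uw": match.group("uw").strip(),
        "features": [f.strip() for f in match.group("feats").split(",") if f.strip()],
    }

sample = '[sampark] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VLTN)'
# parse_entry(sample) -> {"lexeme": "sampark", "uw": "contact(icl>communicate(...))",
#                         "features": ["V", "VOA", "VLTN"]}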

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from the head to its modifiers

Relation | Parent+ | Parent- | Child+ | Child-
obj      | V       | VINT    | N      |
agt      | V       | past    | N      |

[UNL graph: you → आप, this → य, farmer → निकस, contact → सपक+; relations agt, obj, pur]

Morphology Generation Nouns

Suffix | Attribute values
uoM    | N, NUM, pl, oblique
U      | N, NUM, sg, oblique
I      | N, NIF, sg, oblique
iyoM   | N, NIF, pl, oblique
oM     | N, NANOTCHF, pl, oblique

The boy saw me: लड़क माझ दखा
Boys saw me: लड़क माझ दखा
The King saw me: र2 माझ दखा
Kings saw me: र2ऒ माझ दखा
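A minimal sketch of the attribute-driven suffix choice, with the table rows above encoded directly; the attribute tokens follow the slide's abbreviations, and how the real generator decomposes them is an assumption.

NOUN_SUFFIXES = {
    frozenset({"N", "NUM", "pl", "oblique"}): "uoM",
    frozenset({"N", "NUM", "sg", "oblique"}): "U",
    frozenset({"N", "NIF", "sg", "oblique"}): "I",
    frozenset({"N", "NIF", "pl", "oblique"}): "iyoM",
}

def noun_suffix(attribute_values):
    """Return the suffix for an exact attribute-set match, or '' for no overt suffix."""
    return NOUN_SUFFIXES.get(frozenset(attribute_values), "")

# noun_suffix({"N", "NUM", "pl", "oblique"}) -> "uoM"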

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A

After Morphology Generation

[UNL graph before morphology: you → आप, this → य, farmer → निकस, contact → सपक+; relations agt, obj, pur]

[UNL graph after morphology: you → आप, this → इस, farmer → निकस4, contact → सपक+ क0जि2ए; relations agt, obj, pur]

Function Word Insertion

Without function words: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
With function words: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

[UNL graph with function words attached: you → आप, this → इसक लिलए, farmer → निकस4 क, contact → सपक+ क0जि2ए; relations agt, obj, pur]

Rel | Par+ | Par- | Chi+     | Chi-  | ChFW
obj | V    | VINT | N, ANIMT | topic | क

Linearization

इस आप माचर कषतर खाट तलक निकस4 सपक+
(This you manchar region khatav taluka farmers contact)

[UNL graph with the linear order marked: nodes contact (सपक+ क0जि2ए), you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

[UNL graph fragments illustrating the assumptions on the contact example]

Pairwise ordering of relations sharing a relatum (L = to the left of, R = to the right of):
      agt  aoj  obj  ins
agt        L    L    L
aoj             L    L
obj                  R
ins

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group: pur, agt, obj gives this, you, farmer
• Final order: this you farmer contact

[Remaining sub-graph: region, khatav, manchar, taluka]

Syntax Planning Algo

[Table: step-by-step trace of the planning algorithm on the example, with columns Stack, Before Current, After Current and Output]

[UNL graph of the full example: nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, or, nam; scope :01]
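A minimal sketch of the planning strategy on the contact example, assuming a simplified pairwise table in which all of these relations place the child to the left of its head; the stack bookkeeping of the original algorithm is reduced to plain recursion here.

PRIORITY = ["pur", "plc", "agt", "aoj", "ins", "obj"]  # assumed relative order inside Before_set
LEFT_OF_HEAD = {"pur", "plc", "agt", "aoj", "ins", "obj"}  # relations whose child precedes the head

def linearize(node, graph):
    """graph: head -> list of (relation, child); returns the planned word order."""
    before, after = [], []
    for relation, child in graph.get(node, []):
        (before if relation in LEFT_OF_HEAD else after).append((relation, child))
    key = lambda rc: PRIORITY.index(rc[0]) if rc[0] in PRIORITY else len(PRIORITY)
    order = []
    for _, child in sorted(before, key=key):
        order.extend(linearize(child, graph))
    order.append(node)
    for _, child in sorted(after, key=key):
        order.extend(linearize(child, graph))
    return order

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
# linearize("contact", graph) -> ["this", "you", "farmer", "contact"]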

All Together (UNL -> Hindi)

[Recap: the complete module-by-module trace for the same example, from the UNL expression through Lexeme Selection, Case Identification, Morphology Generation, Function Word Insertion and Syntax Linearization, ending in the Hindi sentence glossed "This for you manchar region or khatav taluka of farmers contact"]

How to Evaluate UNL Deconversion

• UNL -> Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was produced
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output | Fluency | Adequacy
कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए | 2 | 3
परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए | 3 | 4
माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत | 1 | 1
2णविTक सकरमाण स इसक0 2ड़O परभनित त amp | 4 | 4
इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp | 2 | 3

Results

                       | BLEU | Fluency | Adequacy
Geometric Average      | 0.34 | 2.54    | 2.84
Arithmetic Average     | 0.41 | 2.71    | 3.00
Standard Deviation     | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency) | 0.59 | 1.00    | 0.68
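A minimal sketch of the aggregation behind this table, assuming per-sentence BLEU, fluency and adequacy lists (the per-sentence scores themselves are not reproduced in the slides):

from math import sqrt

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    product = 1.0
    for x in xs:
        product *= x
    return product ** (1.0 / len(xs))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# pearson(bleu_scores, fluency_scores) would reproduce the 0.59 BLEU-Fluency cell,
# given the original per-sentence scores.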

Fluency vs Adequacy

[Stacked bar chart: number of sentences (0 to 200) per fluency score 1 to 4, broken down by adequacy 1 to 4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


[Vauquois diagram repeated from the earlier "Kinds of MT Systems" slide: analysis levels from text and tagged text up to interlingua, with translation strategies ranging from direct and semi-direct translation through surface/deep syntactic, semantic and multilevel transfer to ontological and semantico-linguistic interlingua]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives

• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; the domains are Tourism and Pilgrimage plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules.
Regular internal subjective evaluation is carried out at IIIT Hyderabad.
Third-party evaluation by C-DAC Pune.

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language with lots of words borrowed from your language; you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart "Module-wise precision and recall": precision and recall (0 to 1.2 scale) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy

Important challenge of M-H Translation: Morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
च vaach (read) → चण vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
च chav (bite) → चणर chaavaNaara (one who bites)
खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

Kridanta Type | Example (with gloss) | Aspect
"ण" Ne-Kridanta | vaachNyaasaaThee pustak de (Give me a book for reading): for-reading book give | Perfective
"ल" laa-Kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article): article after-reading will-tell | Perfective
"त" Taanaa-Kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book): book while-reading it in-mind came | Durative
"लल" Lela-Kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday): yesterday read book give | Perfective
"ऊ" Un-Kridanta | pustak vaachun parat kar (Return the book after reading it): book after-reading back do | Completive
"णर" Nara-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge): books to-the-one-who-reads knowledge gets | Stative
ve-Kridanta | he pustak pratyekaane vaachaave (Everyone should read this book): this book everyone should-read | Inceptive
"त" taa-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book): he book while-reading to-sleep went | Stative
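A minimal lookup from kridanta type to the aspect column of the table above; the transliterated keys are assumptions, since the paper operates on Devanagari surface forms.

KRIDANTA_ASPECT = {
    "Ne": "Perfective",        # vaachNe
    "laa": "Perfective",       # vaachalyaavar
    "taanaa": "Durative",      # vaachtaanaa
    "lela": "Perfective",      # vaachlele
    "un": "Completive",        # vaachun
    "Nara": "Stative",         # vaachNaaRaa
    "ve": "Inceptive",         # vaachaave
    "taa": "Stative",          # vaachtaa vaachtaa
}

def kridanta_aspect(kridanta_type):
    return KRIDANTA_ASPECT.get(kridanta_type, "unknown")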

Participial Suffixes in Other Agglutinative Languages

Kannada
muridiruwaa kombe jennu esee
(broken to-branch throw)
"Throw away the broken branch"
– similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Telugu
ame padutunnappudoo nenoo panichesanoo
(she singing I work)
"I worked while she was singing"
– similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)

Turkish
hazirlanmis plan
(prepare-past plan)
"The plan which has been prepared"
– equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont.)

Fig.: Morphotactics FSM for Kridanta Processing
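The FSM itself is only shown as a figure; below is a minimal sketch of the morphotactics it encodes, with an illustrative transliterated stem and suffix inventory (both are assumptions, not the paper's actual lexicon).

VERB_STEMS = ["vaach", "paL", "bas", "utar"]
KRIDANTA_SUFFIXES = ["Nyaasaathee", "lyaavar", "taanaa", "lele", "un", "Naaraa", "ve", "taa", "Ne"]

def analyse_kridanta(word):
    """Two-state morphotactics: a verb stem followed by exactly one kridanta suffix."""
    for stem in VERB_STEMS:
        if word.startswith(stem):
            suffix = word[len(stem):]
            if suffix in KRIDANTA_SUFFIXES:
                return {"stem": stem, "suffix": suffix}
    return None

# analyse_kridanta("vaachtaanaa") -> {"stem": "vaach", "suffix": "taanaa"}
# analyse_kridanta("paLun")       -> {"stem": "paL",   "suffix": "un"}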

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall (0.86 to 0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Types of verbs

bull Verb features are based on the types of verbs

bull There are three typesndash Do Verbs that are performed volitionallyndash Occur Verbs that are performed involuntarily or just

happen without any performerndash Be Verbs which tells about existence of something

or about some fact

bull We classify these three into do and not-do

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Semantic Processing Verb Features

bull The sentence frames of verbs are inspected to find whether a verb is ldquodordquo type of verb or not

bull ldquodordquo verbs are volitional verbs which are performed voluntarily

bull Heuristic ndash Somethingsomebody hellips somethingsomebody then ldquodordquo type

bull This heuristic fails to mark some volitional verbs as ldquodordquo but marks all non-volitional as ldquonot-dordquo

Word Type Featurewas not do -

eating do do

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)
0: Nonsense (the sentence doesn't make any sense at all; like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language with many words borrowed from your language; you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi but getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

(Bar chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules; y-axis from 0 to 1.2.)

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  - Evaluated on 500 web sentences
  - Accuracy calculated from scores assigned according to translation quality
  - Accuracy: 65.32%

• Result analysis
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy
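
The 65.32% accuracy is an aggregate of the per-sentence quality scores; the exact aggregation formula is not spelled out in the slides. One plausible scheme, shown here purely as an assumption, is to average the 0-4 grades and report the result as a percentage of the maximum grade:

    def subjective_accuracy(grades, max_grade=4):
        # Average grade expressed as a percentage of the maximum possible grade.
        return 100.0 * sum(grades) / (max_grade * len(grades))

    # Toy grades for five sentences, not the actual 500-sentence evaluation.
    print(round(subjective_accuracy([3, 2, 4, 1, 3]), 2))  # 65.0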

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (verb → noun):
  च vaach (read) → चण vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (verb → adjective):
  च chav (bite) → चणर chaavaNaara (one who bites)
  खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (verb → adverb):
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

Ne-Kridanta ("ण")
  Example: vaachNyaasaaThee pustak de (Give me a book for reading)
  Gloss: for reading book give
  Aspect: Perfective

laa-Kridanta ("ल")
  Example: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)
  Gloss: article after reading will tell
  Aspect: Perfective

Taanaa-Kridanta ("त")
  Example: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)
  Gloss: book while reading it in mind came
  Aspect: Durative

Lela-Kridanta ("लल")
  Example: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday)
  Gloss: yesterday read book give
  Aspect: Perfective

Un-Kridanta ("ऊ")
  Example: pustak vaachun parat kar (Return the book after reading it)
  Gloss: book after reading back do
  Aspect: Completive

Nara-Kridanta ("णर")
  Example: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)
  Gloss: books to the one who reads knowledge gets
  Aspect: Stative

ve-Kridanta
  Example: he pustak pratyekaane vaachaave (Everyone should read this book)
  Gloss: this book everyone should read
  Aspect: Inceptive

taa-Kridanta ("त")
  Example: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)
  Gloss: he book while reading to sleep went
  Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi 'lelaa' form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
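
The FSM itself appears only as a figure, so here is a toy Python sketch of the idea: a recognizer that splits a transliterated Marathi word form into a verb stem plus a kridanta suffix and reports the aspect from the table above. The stem lexicon, the exact suffix strings and the transliteration scheme are illustrative assumptions, not the analyser's actual data.

    # Illustrative kridanta suffix table (transliterated); aspects follow the table above.
    KRIDANTA_SUFFIXES = {
        "NyaasaaThee": ("Ne", "perfective"),
        "lyaavar": ("laa", "perfective"),
        "taanaa": ("Taanaa", "durative"),
        "lele": ("Lela", "perfective"),
        "un": ("Un", "completive"),
        "Naaraa": ("Nara", "stative"),
        "aave": ("ve", "inceptive"),
        "taa": ("taa", "stative"),
    }
    VERB_STEMS = {"vaach", "paL", "bas", "khaa"}  # toy stem lexicon

    def analyse_kridanta(word):
        # Conceptually a two-state morphotactics FSM: verb stem, then kridanta suffix.
        for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and word[:-len(suffix)] in VERB_STEMS:
                name, aspect = KRIDANTA_SUFFIXES[suffix]
                return {"stem": word[:-len(suffix)], "kridanta": name, "aspect": aspect}
        return None

    print(analyse_kridanta("vaachtaanaa"))  # Taanaa-kridanta of 'vaach', durative
    print(analyse_kridanta("vaachlele"))    # Lela-kridanta of 'vaach', perfective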

Accuracy of Kridanta Processing Direct Evaluation

(Bar chart: precision and recall of kridanta processing for the Ne, La, Nara, Lela, Tana, T, Oon and Va kridanta types; values range from about 0.86 to 0.98.)

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• A relatively easier problem to solve
• Will interlingua be better?
• Web sentences are being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

GENERATION OF RELATIONS AND ATTRIBUTES

Relation Generation (1/2)

• Relations handled separately
• Similar relations handled in clusters
• Rule-based approach
• Rules are applied on:
  - Stanford dependency relations
  - Noun features
  - Verb features
  - POS tags
  - XLE-generated information

Relation Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Conditions → Relations generated
- nsubj(eating, Will-Smith); Will-Smith.feature = ANIMATE; eat is do-type and active → agt(eat, Will-Smith)
- dobj(eating, apple); eat is active → obj(eat, apple)
- prep(eating, with), pobj(with, fork); fork.feature = CONCRETE → ins(eat, fork)
- prep(eating, on), pobj(on, bank); bank.feature = PLACE → plc(eat, bank)
- prep(bank, of), pobj(of, river); bank.POS = noun; river.feature is not PERSON → mod(bank, river)
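
A minimal sketch of how rules like these can be coded: each rule checks a Stanford dependency pattern plus lexical features and emits a UNL relation. The feature names and the rule subset below mirror the table and are illustrative only, not the full system.

    # Dependency triples (relation, head, dependent) and per-word feature sets.
    deps = {
        ("nsubj", "eat", "Will-Smith"), ("dobj", "eat", "apple"),
        ("prep", "eat", "with"), ("pobj", "with", "fork"),
        ("prep", "eat", "on"), ("pobj", "on", "bank"),
        ("prep", "bank", "of"), ("pobj", "of", "river"),
    }
    feats = {"Will-Smith": {"ANIMATE"}, "fork": {"CONCRETE"}, "bank": {"PLACE"},
             "river": set(), "eat": {"DO-TYPE", "ACTIVE"}}

    def generate_relations(deps, feats):
        rels = []
        for rel, head, dep in deps:
            if rel == "nsubj" and "ANIMATE" in feats.get(dep, set()) \
                    and {"DO-TYPE", "ACTIVE"} <= feats.get(head, set()):
                rels.append(("agt", head, dep))
            elif rel == "dobj" and "ACTIVE" in feats.get(head, set()):
                rels.append(("obj", head, dep))
            elif rel == "pobj":
                # Attach the preposition's object to whatever the preposition modifies.
                for r2, h2, d2 in deps:
                    if r2 == "prep" and d2 == head:
                        if "CONCRETE" in feats.get(dep, set()):
                            rels.append(("ins", h2, dep))
                        elif "PLACE" in feats.get(dep, set()):
                            rels.append(("plc", h2, dep))
                        elif "PERSON" not in feats.get(dep, set()):
                            rels.append(("mod", h2, dep))
        return rels

    print(sorted(generate_relations(deps, feats)))
    # [('agt', 'eat', 'Will-Smith'), ('ins', 'eat', 'fork'), ('mod', 'bank', 'river'),
    #  ('obj', 'eat', 'apple'), ('plc', 'eat', 'bank')]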

Attribute Generation (1/2)

• Attributes handled in clusters
• Attributes are generated by rules
• Rules are applied on:
  - Stanford dependency relations
  - XLE information
  - POS tags

Attribute Generation (2/2)

Input: Will-Smith was eating an apple with a fork on the bank of the river.

Conditions → Attributes generated
- eat is the verb of the main clause; VTYPE(eat, main), PROG(eat, +), TENSE(eat, past) → eat.@entry.@progress.@past
- det(apple, an) → apple.@indef
- det(fork, a) → fork.@indef
- det(bank, the) → bank.@def
- det(river, the) → river.@def
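
The attribute rules can be sketched the same way: determiners map to @def/@indef and XLE-style tense/aspect facts map to verb attributes. The names follow the table; this is an illustrative fragment, not the full rule set.

    def generate_attributes(dets, verb_facts):
        # dets: list of (noun, determiner); verb_facts: XLE-style facts for the main verb.
        attrs = {}
        for noun, det in dets:
            attrs.setdefault(noun, set()).add("@indef" if det in ("a", "an") else "@def")
        verb = verb_facts["verb"]
        attrs.setdefault(verb, set())
        if verb_facts.get("VTYPE") == "main":
            attrs[verb].add("@entry")
        if verb_facts.get("PROG"):
            attrs[verb].add("@progress")
        if verb_facts.get("TENSE") == "past":
            attrs[verb].add("@past")
        return attrs

    dets = [("apple", "an"), ("fork", "a"), ("bank", "the"), ("river", "the")]
    verb_facts = {"verb": "eat", "VTYPE": "main", "PROG": True, "TENSE": "past"}
    print(generate_attributes(dets, verb_facts))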

EVALUATION OF OUR SYSTEM

Results

• Evaluated on the EOLSS corpus: 7000 sentences
• Evaluated in 4 scenarios:
  - Scenario 1: previous system (tightly coupled with the XLE parser)
  - Scenario 2: current system (tightly coupled with the Stanford parser)
  - Scenario 3: Scenario 2 minus (Simplifier + Merger)
  - Scenario 4: Scenario 2 minus the XLE parser
• Scenario 3 shows the impact of the Simplifier and Merger
• Scenario 4 shows the impact of the XLE parser

Results: Metrics

• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold iff
  - relation_gen = relation_gold
  - UW1_gen = UW1_gold
  - UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x
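
With t_gen and t_gold defined this way, relation precision, recall and F-score reduce to set operations over (relation, UW1, UW2) triples, and attribute accuracy to counting matched attributes. A sketch; the final attribute-accuracy formula (matched attributes over gold attributes) is an assumption, since the slide only defines amatch and count.

    def relation_prf(gold, generated):
        # gold, generated: sets of (relation, uw1, uw2) triples.
        correct = len(gold & generated)
        precision = correct / len(generated) if generated else 0.0
        recall = correct / len(gold) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    def attribute_accuracy(uw_pairs):
        # uw_pairs: list of (gen_attrs, gold_attrs) for UWs where UW_gen == UW_gold.
        matched = sum(len(gen & gold) for gen, gold in uw_pairs)
        total = sum(len(gold) for _, gold in uw_pairs)
        return matched / total if total else 0.0

    gold = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("ins", "eat", "fork")}
    gen = {("agt", "eat", "Will-Smith"), ("obj", "eat", "apple"), ("plc", "eat", "bank")}
    print(relation_prf(gold, gen))  # approximately (0.667, 0.667, 0.667)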

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

(Bar chart: attribute accuracy for Scenario 1 to Scenario 4; y-axis from 0 to 0.5.)

Results: Time taken

• Time taken by the systems to evaluate 7000 sentences from the EOLSS corpus

(Bar chart: time taken for Scenario 1 to Scenario 4; y-axis from 0 to 40.)

ERROR ANALYSIS

Wrong Relations Generated

(Bar chart: false positives (wrong relations generated) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall.)

Missed Relations

(Bar chart: false negatives (missed relations) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, and overall.)

Case Study (1/3)

• Conjunctive relations: wrong apposition detection
  Example: James loves eating red water-melon, apples peeled nicely, and bananas.
  Parser output:
    amod(water-melon-5, red-4)
    appos(water-melon-5, apples-7)
    partmod(apples-7, peeled-8)
    advmod(peeled-8, nicely-9)
    cc(nicely-9, and-10)
    conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations: non-uniform relation generation
  Example: George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game.
  Parser output:
    appos(George-1, president-4)
    conj(club-8, James-10)
    conj(club-8, coach-13)
    cc(club-8, and-17)
    conj(club-8, players-19)

Case Study (3/3)

• Low precision for frequent relations
  - mod and qua: rules are more general
  - agt vs aoj: non-uniformity in the corpus
  - gol vs obj: multiple relation-generation possibilities

  Example: The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

  Corpus relations:
    agt(represent, latter)
    aoj(represent, former)
    obj(dosage, body)
  Generated relations:
    aoj(represent, latter)
    aoj(represent, former)
    gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani, Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module-by-module output (Module: Output / gloss)

UNL Expression: obj(contact(…

Lexeme Selection:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Case Identification:
  सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka manchar khatav)

Morphology Generation:
  सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact (imperative) farmer (pl) this you region taluka manchar khatav)

Function Word Insertion:
  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this-for you region or taluka-of manchar khatav)

Linearization:
  इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this-for you manchar region or khatav taluka-of farmers contact)
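
Viewed as code, the deconverter is a fixed pipeline over the UNL graph. A skeletal sketch with stub stages; the stage names follow the table above, while the tiny dictionary and the romanized Hindi forms are placeholders, not the real lexicon.

    # Toy UW -> Hindi lexeme dictionary (romanized); the real stage consults the UW lexicon.
    UW_DICT = {"contact": "sampark", "farmer": "kisaan", "you": "aap", "this": "yah"}

    def lexeme_selection(nodes):
        return [{"uw": n, "lex": UW_DICT.get(n, n)} for n in nodes]

    def function_words(nodes):
        # Placeholder: the real stage inserts postpositions such as 'ko' from rule tables.
        return nodes

    def linearization(nodes):
        return " ".join(n["lex"] for n in nodes)

    def deconvert(nodes):
        for stage in (lexeme_selection, function_words, linearization):
            nodes = stage(nodes)
        return nodes

    print(deconvert(["this", "you", "farmer", "contact"]))  # yah aap kisaan sampark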

(UNL graph for the example sentence: nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, nam, or.)

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous.
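
Lexeme selection is essentially a lookup from the universal word (headword plus semantic restrictions) into the Hindi dictionary entry. A sketch using the two entries above, with romanized, simplified values; the noun gloss is a placeholder.

    # UW string -> (Hindi lexeme, grammatical/semantic features); simplified entries.
    UW_LEXICON = {
        "contact(icl>communicate(agt>person,obj>person))": ("sampark", {"V", "VOA", "VOA-COMM"}),
        "contact(icl>representative)": ("sampark-vyakti", {"N", "ANIMT", "PRSN"}),  # placeholder gloss
    }

    def select_lexeme(uw):
        # An exact match on the restricted UW makes the lexical choice unambiguous.
        return UW_LEXICON.get(uw)

    print(select_lexeme("contact(icl>communicate(agt>person,obj>person))"))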

(Graph with selected lexemes: contact सपक+, you आप, this य, farmer निकस; relations agt, obj, pur.)

Case Marking

Rule table (Relation | Parent+ | Parent- | Child+ | Child-):
  obj | V | VINT | N |
  agt | V | past | N |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

(Graph with case marks on the example nodes: contact सपक+, you आप, this य, farmer निकस.)

Morphology Generation Nouns

Suffix | Attribute values
  -uoM  : N, NUM pl, oblique
  -U    : N, NUM sg, oblique
  -I    : N, NIF sg, oblique
  -iyoM : N, NIF pl, oblique
  -oM   : N, ANOTCHF pl, oblique

Examples:
  The boy saw me: लड़क माझ दखा
  Boys saw me: लड़क माझ दखा
  The King saw me: र2 माझ दखा
  Kings saw me: र2ऒ माझ दखा
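
A sketch of suffix selection for nouns. The attribute names are simplified from the (partly garbled) table above, the paradigm covers only the 'boy/king' style oblique examples, and the stem handling is deliberately naive.

    # (gender, number) -> oblique suffix; an illustrative fragment, not the full rules.
    OBLIQUE_SUFFIXES = {
        ("m", "sg"): "e",     # laDakaa -> laDake
        ("m", "pl"): "oM",    # laDakaa -> laDakoM
        ("f", "sg"): "I",
        ("f", "pl"): "iyoM",
    }

    def inflect_noun(stem, gender, number, oblique):
        if not oblique:
            return stem
        return stem.rstrip("aA") + OBLIQUE_SUFFIXES[(gender, number)]

    print(inflect_noun("laDakaa", "m", "pl", True))  # laDakoM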

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | Num | Gen    | Person | VE
-e rahaa thaa | past    | progress | -       | sg  | male   | 3rd    | e
-taa hai      | present | custom   | -       | sg  | male   | 3rd    | -
-iyaa thaa    | past    | complete | -       | sg  | male   | 3rd    | I
saktii hain   | present | -        | ability | pl  | female | 3rd    | A
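
The verb paradigm is likewise a lookup keyed on tense, aspect, mood and agreement, exactly as laid out in the table; the values are the romanized suffix/auxiliary strings.

    # (tense, aspect, mood, number, gender, person) -> suffix string.
    VERB_SUFFIXES = {
        ("past", "progress", None, "sg", "m", 3): "-e rahaa thaa",
        ("present", "custom", None, "sg", "m", 3): "-taa hai",
        ("past", "complete", None, "sg", "m", 3): "-iyaa thaa",
        ("present", None, "ability", "pl", "f", 3): "saktii hain",
    }

    def verb_suffix(tense, aspect=None, mood=None, number="sg", gender="m", person=3):
        return VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person))

    print(verb_suffix("past", aspect="progress"))  # -e rahaa thaa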

After Morphology Generation

(Graph before: you आप, this य, farmer निकस, contact सपक+.
 Graph after: you आप, this इस, farmer निकस4, contact सपक+ क0जि2ए.)

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

(Graph after function word insertion: you आप, this इसक लिलए, farmer निकस4 क, contact सपक+ क0जि2ए.)

Rule (Rel | Par+ | Par- | Chi+ | Chi- | Child FW):
  obj | V | VINT | N ANIMT | topic | क
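
Function-word insertion is again a rule-table lookup: for the obj relation with a verb (non-intransitive) parent and an animate noun child, the postposition क (ko) is attached after the child. A sketch with simplified feature handling; the pur rule is added only for illustration.

    # (relation, required child features) -> postposition attached to the child (romanized).
    FUNCTION_WORD_RULES = [
        ("obj", {"N", "ANIMT"}, "ko"),      # e.g. kisaanoM -> kisaanoM ko
        ("pur", set(), "ke lie"),           # purpose -> 'ke lie' (illustrative)
    ]

    def insert_function_word(relation, child_lexeme, child_features):
        for rel, required, fw in FUNCTION_WORD_RULES:
            if rel == relation and required <= child_features:
                return child_lexeme + " " + fw
        return child_lexeme

    print(insert_function_word("obj", "kisaanoM", {"N", "ANIMT", "@topic"}))  # kisaanoM ko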

Linearization

इस आप माचर कषतर खाट तलक निकस4 सपक+ क0जि2ए
(this you manchar region khatav taluka farmers contact)

(UNL graph as before: nodes contact, you, this, farmer, region, taluka, manchar, khatav; relations agt, obj, pur, plc, nam, or.)

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  - (Semantic Independence) the semantic properties of the relata
  - (Context Independence) the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  - (Local Ordering) the rest of the expression

Relative ordering matrix between relations sharing a parent (L = left, R = right):

        agt  aoj  obj  ins
  agt        L    L    L
  aoj             L    L
  obj                  R
  ins

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

Syntax Planning Algo

(Table: step-by-step trace of the algorithm on the example, showing the stack of pending nodes (contact, farmer, region, taluka, manchar), the Before/After sets of the current node, and the growing output: "this you", "this you manchar", "this you manchar region", and so on to the full sentence.)
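
A sketch of the before/after-set strategy: for each child relation a precedence entry says whether the child's subtree is emitted to the left or the right of the parent word, and subtrees are emitted recursively. The slide's matrix is pairwise between relations; simplifying it to a per-relation side, as below, is an assumption.

    # 'L' = child subtree precedes the parent word, 'R' = it follows the parent word.
    PRECEDENCE = {"agt": "L", "aoj": "L", "obj": "L", "pur": "L", "plc": "L", "ins": "L"}

    def linearize(node, children):
        # children: dict mapping a node to its list of (relation, child) pairs.
        before, after = [], []
        for rel, child in children.get(node, []):
            (before if PRECEDENCE.get(rel, "L") == "L" else after).append(child)
        out = []
        for ch in before:
            out.extend(linearize(ch, children))
        out.append(node)
        for ch in after:
            out.extend(linearize(ch, children))
        return out

    children = {
        "contact": [("pur", "this"), ("agt", "you"), ("plc", "region"), ("obj", "farmer")],
        "region": [("nam", "manchar"), ("mod", "taluka"), ("or", "khatav")],
    }
    print(" ".join(linearize("contact", children)))
    # this you manchar taluka khatav region farmer contact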

All Together (UNL -> Hindi)

Module-by-module output for the same example (English glosses only; the Devanagari in this copy of the slide is not recoverable):
  UNL Expression: obj(contact(…
  Lexeme Selection: contact farmer this you region taluka manchar khatav
  Case Identification: contact farmer this you region taluka manchar khatav
  Morphology Generation: contact (imperative) farmer (pl) this you region taluka manchar khatav
  Function Word Insertion: contact farmers this-for you region or taluka-of manchar khatav
  Syntax Linearization: this-for you manchar region or khatav taluka-of farmers contact

How to Evaluate UNL Deconversion

• UNL -> Hindi: the reference translation needs to be generated from the UNL, which needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was produced
  - Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Relation Generation (12)

bull Relations handled separately

bull Similar relations handled in cluster

bull Rule based approach

bull Rules are applied on ndash Stanford Dependency relationsndash Noun featuresndash Verb featuresndash POS tagsndash XLE generated information

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Relation Generation (22)

Conditions Relations Generated

nsubj(eating Will-Smith)Will-Smithfeature=ANIMATEeat is do type and active

agt(eat Will Smith)

dobj(eating apple)eat is active

obj(eat apple)

prep(eating with) pobj(with fork)forkfeature=CONCRETE

ins(eat fork)

prep(eating on) pobj(on bank)bankfeature=PLACE

plc(eat bank)

prep(bank of) pobj(of river)bankPOS = nounriverfeature is not PERSON

mod(bank river)

Will-Smith was eating an apple with a fork on the bank of the riverInput

Attribute Generation (12)

bull Attributes handled in clusters

bull Attributes are generated by rules

bull Rules are applied onndash Stanford dependency relationsndash XLE informationndash POS tags

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection
[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)
[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)
Lexical choice is unambiguous

[Figure: the example graph after lexeme selection: you → आप, this → यह, farmer → किसान, contact → संपर्क; relations agt, obj, pur]

Case Marking
Relation | Parent+ | Parent– | Child+ | Child–
obj | V | VINT | N | –
agt | V | past | N | –
• Depends on the UNL relation and the properties of the nodes (see the rule-lookup sketch below)
• Case gets transferred from head to modifiers
[Figure: the example graph: you → आप, this → यह, farmer → किसान, contact → संपर्क]
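Below is a sketch of how a rule table of this shape might be consulted: each rule names required (+) and forbidden (–) properties of the parent and child nodes of a UNL relation. The rule encodings and the returned case tags are invented placeholders, not the actual HinD rule base.

    # Rules mirror the two table rows above: (relation, parent+, parent-, child+) -> case tag.
    # The case tags ('oblique', 'direct') are invented for illustration.
    CASE_RULES = [
        ("obj", {"V"}, {"VINT"}, {"N"}, "oblique"),
        ("agt", {"V"}, {"past"}, {"N"}, "direct"),
    ]

    def assign_case(relation, parent_props, child_props):
        for rel, p_plus, p_minus, c_plus, case in CASE_RULES:
            if (relation == rel
                    and p_plus <= parent_props and not (p_minus & parent_props)
                    and c_plus <= child_props):
                return case
        return None

    print(assign_case("obj", {"V"}, {"N", "ANIMT"}))   # 'oblique'
    print(assign_case("obj", {"V", "VINT"}, {"N"}))    # None (VINT is forbidden on the parent)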

Morphology Generation: Nouns
Suffix | Attribute values
-uoM | N, NUM, pl, oblique
-U | N, NUM, sg, oblique
-I | N, NIF, sg, oblique
-iyoM | N, NIF, pl, oblique
-oM | N, NANOTCHF, pl, oblique
The boy saw me: लड़के ने मुझे देखा | Boys saw me: लड़कों ने मुझे देखा | The King saw me: राजा ने मुझे देखा | Kings saw me: राजाओं ने मुझे देखा
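A toy illustration of suffix selection driven by noun attributes, loosely following the table above and reproducing the boy/king oblique forms in transliteration; the stem classes and mini-lexicon are hypothetical.

    # Toy noun inflection keyed on (stem class, number, oblique); hypothetical classes.
    NOUN_SUFFIX = {
        ("aa_stem", "sg", True): "e",    # laDakaa -> laDake
        ("aa_stem", "pl", True): "oM",   # laDakaa -> laDakoM
        ("a_stem",  "sg", True): "",     # raajaa  -> raajaa
        ("a_stem",  "pl", True): "oM",   # raajaa  -> raajaaoM
    }

    def inflect_noun(stem, stem_class, number, oblique):
        suffix = NOUN_SUFFIX.get((stem_class, number, oblique), "")
        base = stem[:-2] if stem_class == "aa_stem" and suffix else stem  # drop the final 'aa'
        return base + suffix

    print(inflect_noun("laDakaa", "aa_stem", "sg", True))  # laDake
    print(inflect_noun("laDakaa", "aa_stem", "pl", True))  # laDakoM
    print(inflect_noun("raajaa", "a_stem", "pl", True))    # raajaaoM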

Verb Morphology
Suffix | Tense | Aspect | Mood | N | Gen | P | VE
-e rahaa thaa | past | progress | – | sg | male | 3rd | e
-taa hai | present | custom | – | sg | male | 3rd | –
-iyaa thaa | past | complete | – | sg | male | 3rd | I
saktii hain | present | – | ability | pl | female | 3rd | A
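The verb table can likewise be read as a lookup from a tense-aspect-mood and agreement bundle to a verb-group string. Below is a minimal sketch over the four rows shown; the keying scheme is an assumption and the plain string concatenation ignores sandhi.

    # The four table rows above, keyed on (tense, aspect, mood, number, gender, person).
    VERB_SUFFIX = {
        ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
        ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
        ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
        ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
    }

    def verb_group(root, tense, aspect=None, mood=None, num="sg", gen="male", per="3rd"):
        suffix = VERB_SUFFIX[(tense, aspect, mood, num, gen, per)]
        # A leading '-' marks a bound suffix; otherwise the form is a free auxiliary group.
        return root + suffix[1:] if suffix.startswith("-") else root + " " + suffix

    print(verb_group("khaa", "past", aspect="complete"))                         # khaaiyaa thaa
    print(verb_group("jaa", "present", mood="ability", num="pl", gen="female"))  # jaa saktii hain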

After Morphology Generation
[Figure: the example graph before and after morphology; before: you → आप, this → यह, farmer → किसान, contact → संपर्क; after: you → आप, this → इस, farmer → किसानों, contact → संपर्क कीजिए]

Function Word Insertion
Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मांचर खटाव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव
[Figure: the example graph: this → इसके लिए, farmer → किसानों को, contact → संपर्क कीजिए]
Rel | Par+ | Par– | Chi+ | Chi– | ChFW
obj | V | VINT | N, ANIMT | topic | को
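Below is a sketch of applying one function-word insertion rule of the kind shown in the row above (obj relation, verbal non-VINT parent, animate noun child receives को/ko); the column reading and the rule encoding are assumptions.

    # One FW-insertion rule mirroring the row above; the dictionary layout is illustrative.
    FW_RULES = [
        {"rel": "obj", "par+": {"V"}, "par-": {"VINT"},
         "chi+": {"N", "ANIMT"}, "chi-": {"topic"}, "fw": "ko"},
    ]

    def insert_function_word(rel, parent_props, child_props, child_surface):
        for r in FW_RULES:
            if (rel == r["rel"]
                    and r["par+"] <= parent_props and not (r["par-"] & parent_props)
                    and r["chi+"] <= child_props and not (r["chi-"] & child_props)):
                return child_surface + " " + r["fw"]
        return child_surface

    print(insert_function_word("obj", {"V"}, {"N", "ANIMT"}, "kisaanoM"))   # kisaanoM ko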

Linearization
इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए
(this-for you Manchar region or Khatav taluka-of farmers contact)
[Figure: UNL graph of the example with Hindi annotations on the nodes: contact → संपर्क कीजिए; nodes you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam]

Syntax Planning Assumptions
• The relative word order of a UNL relation's relata does not depend on:
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on:
  – Local Ordering: the rest of the expression

[Figure: ordering the relations around 'contact' (agt: you, obj: farmer, pur: this) in the example graph]
Pairwise order matrix (L = the row relation is placed to the left of the column relation, R = to the right):
     | agt | aoj | obj | ins
 agt |     |  L  |  L  |  L
 aoj |     |     |  L  |  L
 obj |     |     |     |  R
 ins |     |     |     |

Syntax Planning Strategy
• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact (a code sketch follows the algorithm trace below)
[Figure fragment: the plc subtree of the example graph: region, Khatav, Manchar, taluka]

Syntax Planning Algo
[Table: step-by-step trace of the planner on the example, with columns Stack, BeforeCurrent, AfterCurrent and Output; the output grows from "this you" to "this you manchar" to "this you manchar region" and onward to the fully ordered sentence]
[Figure: UNL graph of the example]
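A compact sketch of the planning strategy: split a node's children into before/after groups, order each group with the pairwise matrix, and recurse over the tree. The matrix entries below are the ones visible on the slides; the rows for pur, the head-side table and the fallback ordering are assumptions added to make the example run.

    import functools

    # ORDER[(r1, r2)] == "L": a child attached by r1 precedes a child attached by r2.
    ORDER = {("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
             ("aoj", "obj"): "L", ("aoj", "ins"): "L", ("obj", "ins"): "R",
             ("pur", "agt"): "L", ("pur", "obj"): "L"}     # 'pur' rows assumed

    HEAD_SIDE = {"agt": "L", "obj": "L", "pur": "L", "plc": "L"}  # before (L) / after (R) the head

    def precedes(r1, r2):
        if (r1, r2) in ORDER:
            return ORDER[(r1, r2)] == "L"
        if (r2, r1) in ORDER:
            return ORDER[(r2, r1)] == "R"
        return r1 < r2                      # arbitrary but stable fallback

    def linearize(node, children):
        # children maps a node to a list of (relation, child) pairs.
        kids = children.get(node, [])
        before = [k for k in kids if HEAD_SIDE.get(k[0], "L") == "L"]
        after = [k for k in kids if HEAD_SIDE.get(k[0], "L") == "R"]
        key = functools.cmp_to_key(lambda a, b: -1 if precedes(a[0], b[0]) else 1)
        out = []
        for _, child in sorted(before, key=key):
            out += linearize(child, children)
        out.append(node)
        for _, child in sorted(after, key=key):
            out += linearize(child, children)
        return out

    children = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
    print(linearize("contact", children))   # ['this', 'you', 'farmer', 'contact']

The Semantic/Context Independence and Local Ordering assumptions above are what make such a purely pairwise, local ordering sufficient.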

All Together (UNL → Hindi)
Module | Output
UNL Expression | obj(contact(…
Lexeme Selection | संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव (contact farmer this you region taluka Manchar Khatav)
Case Identification | संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव (contact farmer this you region taluka Manchar Khatav)
Morphology Generation | संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खटाव (contact-imperative farmer-pl this you region taluka Manchar Khatav)
Function Word Insertion | संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव (contact farmers this-for you region or taluka-of Manchar Khatav)
Syntax Linearization | इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए। (this-for you Manchar region or Khatav taluka-of farmers contact)

How to Evaluate UNL Deconversion
• UNL → Hindi
• A reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines
Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible
Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations
Hindi Output | Fluency | Adequacy
[The five sample Hindi outputs are not legible in this transcript; the scores were:]
Sample 1 | 2 | 3
Sample 2 | 3 | 4
Sample 3 | 1 | 1
Sample 4 | 4 | 4
Sample 5 | 2 | 3

Results
                       | BLEU | Fluency | Adequacy
Geometric Average      | 0.34 | 2.54    | 2.84
Arithmetic Average     | 0.41 | 2.71    | 3.00
Standard Deviation     | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency) | 0.59 | 1.00    | 0.68
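The Pearson correlations in the table are straightforward to reproduce from per-sentence scores. Below is a minimal sketch; the BLEU and fluency lists are made-up illustrations, not the reported data.

    from math import sqrt

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    bleu    = [0.12, 0.34, 0.55, 0.41, 0.28]   # illustrative per-sentence BLEU scores
    fluency = [1, 3, 4, 3, 2]                  # illustrative per-sentence fluency ratings
    print(round(pearson(bleu, fluency), 2))    # ~0.98 for these made-up numbers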

Fluency vs Adequacy
[Chart: number of sentences (0 to 200) by fluency score 1-4, broken down by adequacy 1-4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT: Marathi-Hindi

[Figure: the MT transfer-levels diagram: direct translation, surface and deep syntactic transfer, semantic transfer, multilevel transfer and conceptual transfer/interlingua, over representation levels from text and tagged text up to semantico-linguistic and ontological interlingua]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions:
IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional):
Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture
Horizontal Tasks:
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration
Vertical Tasks for Each Language:
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale
Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language: you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System
• Module-wise evaluation
  – Evaluated on 500 web sentences
[Chart: module-wise precision and recall (y-axis 0 to 1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont.)
• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)
Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:
Nouns (Verb → Noun)
  वाच vaach (read) → वाचणे vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)
Adjectives (Verb → Adjective)
  चाव chav (bite) → चावणारा chaavaNaara (one who bites)
  खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)
Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types
Kridanta Type | Example | Word-for-word gloss | Aspect
Ne-kridanta (-णे) | vaachNyaasaaThee pustak de (Give me a book for reading) | for-reading book give | Perfective
laa-kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) | article after-reading will-tell | Perfective
Taanaa-kridanta (-ताना) | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) | book while-reading it in-mind came | Durative
Lela-kridanta (-लेले) | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) | yesterday read book give | Perfective
Un-kridanta (-ऊन) | pustak vaachun parat kar (Return the book after reading it) | book after-reading back do | Completive
Nara-kridanta (-णारा) | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) | books to-the-one-who-reads knowledge gets | Stative
ve-kridanta (-वे) | he pustak pratyekaane vaachaave (Everyone should read this book) | this book everyone should-read | Inceptive
taa-kridanta (-ता) | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) | he book while-reading to-sleep went | Stative

Participial Suffixes in Other Agglutinative Languages
Kannada:
  muridiruwaa kombe jennu esee
  broken to branch throw
  'Throw away the broken branch'
  – similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)
Telugu:
  ame padutunnappudoo nenoo panichesanoo
  she singing I work
  'I worked while she was singing'
  – similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont.)
Turkish:
  hazirlanmis plan
  prepare-past plan
  'The plan which has been prepared'
  – equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont.)
[Figure: morphotactics FSM for kridanta processing]
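The morphotactics FSM can be approximated, for illustration, by ordered suffix matching over transliterated forms. The suffix inventory below follows the kridanta types table; the real analyser handles stem alternation and sandhi, so this is only a sketch.

    # Toy finite-state-style suffix analysis for Marathi kridanta forms (transliterated).
    # A form is matched by the first suffix that fits; inventory from the types table.
    KRIDANTA_SUFFIXES = [
        ("NyaasaaThee", "Ne-kridanta"),
        ("lyaavar",     "laa-kridanta"),
        ("taanaa",      "taanaa-kridanta"),
        ("lele",        "lela-kridanta"),
        ("NaaRyaalaa",  "Nara-kridanta"),
        ("aave",        "ve-kridanta"),
        ("un",          "un-kridanta"),
        ("taa",         "taa-kridanta"),
    ]

    def analyse(form):
        for suffix, ktype in KRIDANTA_SUFFIXES:
            if form.endswith(suffix):
                return form[:-len(suffix)], ktype
        return form, None

    for form in ["vaachNyaasaaThee", "vaachtaanaa", "vaachlele", "vaachun"]:
        print(form, "->", analyse(form))
    # vaachNyaasaaThee -> ('vaach', 'Ne-kridanta')
    # vaachtaanaa -> ('vaach', 'taanaa-kridanta')
    # vaachlele -> ('vaach', 'lela-kridanta')
    # vaachun -> ('vaach', 'un-kridanta')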

Accuracy of Kridanta Processing: Direct Evaluation
[Chart: precision and recall (0.86 to 0.98) for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-kridanta forms]

Summary of M-H transfer based MT
• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

Summary: interlingua based MT
• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (1/2)
  • Knowledge base of KBMT (2/2)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT: EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (1/2)
  • Syntactic Processing Stanford Dependency Parser (2/2)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (1/2)
  • Relation Generation (2/2)
  • Attribute Generation (1/2)
  • Attribute Generation (2/2)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (1/3)
  • Case Study (2/3)
  • Case Study (3/3)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Attribute Generation (22)

Conditions Attributes Generated

eat is verb of main clauseVTYPE(eat main)PROG(eat+)TENSE(eatpast)

eatentryprogresspast

det(apple an) appleindef

det(fork a) forkindef

det(bank the) bankdef

det(river the) riverdef

Will-Smith was eating an apple with a fork on the bank of the riverInput

EVALUATION OF OUR SYSTEM

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

EVALUATION OF OUR SYSTEM

Results
• Evaluated on the EOLSS corpus
  – 7000 sentences from the EOLSS corpus
• Evaluated in 4 scenarios
  – Scenario 1: previous system (tightly coupled with the XLE parser)
  – Scenario 2: current system (tightly coupled with the Stanford parser)
  – Scenario 3: Scenario 2 – (Simplifier + Merger)
  – Scenario 4: Scenario 2 – XLE parser
• Scenario 3 lets us know the impact of the Simplifier and Merger
• Scenario 4 lets us know the impact of the XLE parser

Results: Metrics
• t_gen = relation_gen(UW1_gen, UW2_gen)
• t_gold = relation_gold(UW1_gold, UW2_gold)
• t_gen = t_gold if
  – relation_gen = relation_gold
  – UW1_gen = UW1_gold
  – UW2_gen = UW2_gold
• If UW_gen = UW_gold, let amatch(UW_gen, UW_gold) = number of attributes matched between UW_gen and UW_gold
• count(UW_x) = number of attributes in UW_x (see the sketch below)
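The following is a minimal sketch (not the authors' implementation) of this matching scheme: relation triples are compared as whole (relation, UW1, UW2) tuples, and amatch/count are combined into a single attribute-accuracy figure, which is one plausible reading of the slide. The example triples are taken from the Case Study (3/3) slide further below.

```python
# Minimal sketch (not the authors' implementation) of the matching criterion:
# a generated triple t_gen counts as correct only if relation, UW1 and UW2 all
# equal the gold triple; attribute accuracy combines amatch and count.

def relation_scores(gen_triples, gold_triples):
    """gen/gold_triples: sets of (relation, uw1, uw2)."""
    gen, gold = set(gen_triples), set(gold_triples)
    correct = len(gen & gold)                                  # t_gen = t_gold
    precision = correct / len(gen) if gen else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def attribute_accuracy(gen_attrs, gold_attrs):
    """gen/gold_attrs: dict UW -> set of attributes. One plausible way of
    turning amatch(UW_gen, UW_gold) and count(UW_gold) into a single score."""
    matched = total = 0
    for uw, gold_set in gold_attrs.items():
        if uw in gen_attrs:                                    # UW_gen = UW_gold
            matched += len(gen_attrs[uw] & gold_set)           # amatch
            total += len(gold_set)                             # count(UW_gold)
    return matched / total if total else 0.0

# Relations from the Case Study (3/3) slide below:
gold = {("agt", "represent", "latter"), ("aoj", "represent", "former"), ("obj", "dosage", "body")}
gen  = {("aoj", "represent", "latter"), ("aoj", "represent", "former"), ("gol", "dosage", "body")}
print(relation_scores(gen, gold))   # precision = recall = F1 = 1/3
```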

Results: Relation-wise accuracy

Results: Overall accuracy

Results: Relation

Results: Relation-wise Precision

Results: Relation-wise Recall

Results: Relation-wise F-score

Results: Attribute

[Bar chart: attribute accuracy for Scenario 1–4; y-axis 0–0.5]

Results: Time taken

• Time taken by the systems to process the 7000 sentences from the EOLSS corpus

[Bar chart: time taken for Scenario 1–4; y-axis 0–40]

ERROR ANALYSIS

Wrong Relations Generated

[Bar chart: false positives (wrong relations generated) per relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Missed Relations

[Bar chart: false negatives (missed relations) per relation (obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or) and overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• "James loves eating red water-melon, apples peeled nicely and bananas"

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• "George the president of the football club, James the coach of the club and the players went to watch their rivals' game"

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua: more general rules
  – agt–aoj: non-uniformity in the corpus
  – gol–obj: multiple relation-generation possibilities
• "The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms"

Corpus relations:
agt(represent, latter)
aoj(represent, former)
obj(dosage, body)

Generated relations:
aoj(represent, latter)
aoj(represent, former)
gol(dosage, body)

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module: Output

UNL Expression: obj(contact(…
Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka Manchar Khatav)
Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट
  (contact farmer this you region taluka Manchar Khatav)
Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट
  (contact-imperative farmer-pl this you region taluka Manchar Khatav)
Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट
  (contact farmers this-for you region or taluka-of Manchar Khatav)
Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए |
  (this-for you Manchar region or Khatav taluka-of farmers contact)

[UNL graph of the example: nodes you, this, farmer, contact, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam; scope :01]

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical Choice is unambiguous

[Graph fragment with chosen lexemes: you→आप, this→य, farmer→निकस, contact→सपक+ (relations agt, obj, pur)]

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Relation | Parent+ | Parent- | Child+ | Child-
obj      | V       | VINT    | N      |
agt      | V       | past    | N      |

[Graph fragment after case identification: you→आप, this→य, farmer→निकस, contact→सपक+ (relations agt, obj, pur)]

Morphology Generation: Nouns

Suffix | Attribute values
uoM    | N, NUM=pl, oblique
U      | N, NUM=sg, oblique
I      | N, NIF, sg, oblique
iyoM   | N, NIF, pl, oblique
oM     | N, NA, NOTCHF, pl, oblique

• The boy saw me – लड़क माझ दखा
• Boys saw me – लड़क माझ दखा
• The King saw me – र2 माझ दखा
• Kings saw me – र2ऒ माझ दखा

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A
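Both tables above are lookups from a feature bundle to a suffix. A toy Python sketch of such table-driven suffix selection follows; the dictionaries simply restate rows of the tables on these slides, the helper names (inflect_noun, inflect_verb) are invented for illustration, and real generation in HinD also has to handle stem alternations.

```python
# Toy sketch of table-driven morphology generation (illustrative; not the HinD
# system's lexicon). A suffix is chosen from the feature bundle, exactly in the
# spirit of the noun and verb tables above.

NOUN_SUFFIXES = {
    # (number, oblique) -> suffix, simplified from the noun table
    ("pl", True): "oM",
    ("sg", True): "U",
}

VERB_SUFFIXES = {
    # (tense, aspect, mood, number, gender, person) -> suffix, rows of the verb table
    ("past",    "progress", None,      "sg", "male",   3): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   3): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   3): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", 3): "saktii hain",
}

def inflect_noun(stem, number="sg", oblique=False):
    """Attach the noun suffix selected by (number, oblique); no stem changes."""
    return stem + NOUN_SUFFIXES.get((number, oblique), "")

def inflect_verb(stem, tense, aspect=None, mood=None, number="sg", gender="male", person=3):
    """Attach the verb suffix selected by the full TAM + agreement bundle."""
    suffix = VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person))
    if suffix is None:
        raise KeyError("feature bundle not covered by the toy table")
    return stem + suffix if suffix.startswith("-") else stem + " " + suffix

print(inflect_noun("raajaa", number="pl", oblique=True))                      # raajaaoM
print(inflect_verb("kar", "present", mood="ability", number="pl", gender="female"))  # kar saktii hain
```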

After Morphology Generation

[Graphs before and after morphology generation: this य→इस, farmer निकस→निकस4, contact सपक+→सपक+ क0जि2ए; you आप unchanged]

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

[Graph after function word insertion: this→इसक लिलए, farmer→निकस4 क, contact→सपक+ क0जि2ए (relations agt, obj, pur); you→आप]

Rel | Par+ | Par- | Chi+     | Chi-  | ChFW
obj | V    | VINT | N, ANIMT | topic | क
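The obj row above can be read as: for relation obj, if the parent is a verb (V) but not intransitive (VINT), and the child is an animate noun (N, ANIMT) that is not the topic, attach the function word क to the child. A small sketch of this rule shape follows; the Node and FWRule classes are illustrative, not HinD's actual data structures, and the lexeme strings are copied from the (damaged) transcript.

```python
# Sketch of the kind of rule used for case marking / function word insertion:
# a rule fires for a UNL relation when the parent and child satisfy the +/-
# feature conditions, and then attaches a function word to the child.
from dataclasses import dataclass, field

@dataclass
class Node:
    lexeme: str
    features: set = field(default_factory=set)

@dataclass
class FWRule:
    relation: str
    par_plus: set    # features the parent must have
    par_minus: set   # features the parent must not have
    chi_plus: set    # features the child must have
    chi_minus: set   # features the child must not have
    child_fw: str    # function word appended to the child

    def applies(self, rel, parent, child):
        return (rel == self.relation
                and self.par_plus <= parent.features
                and not (self.par_minus & parent.features)
                and self.chi_plus <= child.features
                and not (self.chi_minus & child.features))

# The obj rule from the table above (ChFW as it appears in the transcript).
rules = [FWRule("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}, "क")]

def insert_function_word(rel, parent, child, rules=rules):
    for r in rules:
        if r.applies(rel, parent, child):
            child.lexeme += " " + r.child_fw
            break
    return child.lexeme

contact = Node("सपक+ क0जि2ए", {"V"})
farmer = Node("निकस4", {"N", "ANIMT"})
print(insert_function_word("obj", contact, farmer))   # निकस4 क
```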

Linearization

इस आप माचर कषतर (this you Manchar region)
खाट तलक निकस4 (Khatav taluka farmers)
सपक+ (contact)

[UNL graph with Hindi lexemes: you, this, farmer, contact (सपक+ क0जि2ए), region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam; scope :01]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

[Illustration: relation sub-graphs around contact – pur→this; agt→you, obj→farmer]

Pairwise ordering table (L = left, R = right):

    | agt | aoj | obj | ins
agt |     | L   | L   | L
aoj |     |     | L   | L
obj |     |     |     | R
ins |     |     |     |

[Resulting order of pur (this), agt (you) and obj (farmer) around contact]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact (see the sketch below)

[The figure also shows the region–taluka sub-tree with Manchar and Khatav]
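A minimal sketch of this strategy follows, reusing the pairwise L/R table from the assumptions slide. The entries for pur and the handling of pairs the table does not mention are assumptions added so that the toy example reproduces the order on the slide.

```python
# Minimal sketch of the syntax-planning strategy above: children of a node are
# split into Before/After sets, ordered among themselves with the pairwise L/R
# table, and the tree is linearized depth-first.

# From the assumptions slide: 'L' means the row relation's child precedes the
# column relation's child. The pur entries are an added assumption.
PAIRWISE = {
    ("agt", "aoj"): "L", ("agt", "obj"): "L", ("agt", "ins"): "L",
    ("aoj", "obj"): "L", ("aoj", "ins"): "L",
    ("obj", "ins"): "R",
    ("pur", "agt"): "L", ("pur", "obj"): "L",   # assumption, not in the slide's table
}

# Assumption: in this Hindi (SOV) example every child precedes its head.
BEFORE_RELATIONS = {"agt", "aoj", "obj", "pur", "plc", "ins"}

def precedes(r1, r2):
    if (r1, r2) in PAIRWISE:
        return PAIRWISE[(r1, r2)] == "L"
    if (r2, r1) in PAIRWISE:
        return PAIRWISE[(r2, r1)] == "R"
    return True  # assumption: unspecified pairs keep their given order

def order_children(children):
    """children: list of (relation, node); insertion sort by 'precedes'."""
    ordered = []
    for rel, node in children:
        i = 0
        while i < len(ordered) and precedes(ordered[i][0], rel):
            i += 1
        ordered.insert(i, (rel, node))
    return ordered

def linearize(node, tree):
    """tree: node -> list of (relation, child). Returns the word order."""
    kids = tree.get(node, [])
    before = [(r, c) for r, c in kids if r in BEFORE_RELATIONS]
    after = [(r, c) for r, c in kids if r not in BEFORE_RELATIONS]
    words = []
    for _, child in order_children(before):
        words += linearize(child, tree)
    words.append(node)
    for _, child in order_children(after):
        words += linearize(child, tree)
    return words

tree = {"contact": [("obj", "farmer"), ("pur", "this"), ("agt", "you")]}
print(linearize("contact", tree))   # ['this', 'you', 'farmer', 'contact']
```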

Syntax Planning Algo

[Worked stack trace of the algorithm on the example (columns: Stack, Before Current, After Current, Output): the stack evolves through states such as region|farmer|contact and manchar|region|taluka|farmer|contact while the output grows from "this you" to "this you manchar" to "this you manchar region"]

[UNL graph of the example repeated: you, this, farmer, contact (सपक+ क0जि2ए), region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam; scope :01]

All Together (UNL -> Hindi)

Module: Output (English gloss of the Hindi)

UNL Expression: obj(contact(…
Lexeme Selection: contact farmer this you region taluka Manchar Khatav
Case Identification: contact farmer this you region taluka Manchar Khatav
Morphology Generation: contact-imperative farmer-pl this you region taluka Manchar Khatav
Function Word Insertion: contact farmers this-for you region or taluka-of Manchar Khatav
Syntax Linearization: this-for you Manchar region or Khatav taluka-of farmers contact

How to Evaluate UNL Deconversion

• UNL -> Hindi
• Reference translation needs to be generated from UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output (Fluency, Adequacy):
• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (2, 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (3, 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (1, 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (4, 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (2, 3)

Results

                       | BLEU | Fluency | Adequacy
Geometric Average      | 0.34 | 2.54    | 2.84
Arithmetic Average     | 0.41 | 2.71    | 3.00
Standard Deviation     | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency) | 0.59 | 1.00    | 0.68
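For reference, a minimal sketch of how such summary statistics can be computed from per-sentence scores; the five (BLEU, fluency, adequacy) triples below are invented for illustration and are not the evaluation data.

```python
# Sketch: summary statistics of the kind reported above, computed from
# per-sentence BLEU / fluency / adequacy scores (the sample data is made up).
import math
import statistics as st

def pearson(xs, ys):
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

bleu     = [0.21, 0.55, 0.34, 0.12, 0.63]
fluency  = [2, 4, 3, 1, 4]
adequacy = [3, 4, 3, 1, 4]

print(geometric_mean(bleu), st.mean(bleu), st.pstdev(bleu))
print(pearson(bleu, fluency))      # correlation of BLEU with fluency
print(pearson(fluency, adequacy))  # correlation of fluency with adequacy
```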

Fluency vs Adequacy

[Stacked bar chart: number of sentences (0–200) by Fluency score 1–4, broken down by Adequacy 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level
Interlingual level
Logico-semantic level
Syntactico-functional level
Morpho-syntactic level
Syntagmatic level
Graphemic level

Direct translation
Syntactic transfer (surface)
Syntactic transfer (deep)
Conceptual transfer
Semantic transfer
Multilevel transfer

Ontological interlingua
Semantico-linguistic interlingua
SPA-structures (semantic & predicate-argument)
F-structures (functional)
C-structures (constituent)
Tagged text
Text

Mixing levels: multilevel description
Semi-direct translation

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems. Domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary – bidirectional
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade | Description
0 | Nonsense (the sentence doesn't make any sense at all – like someone speaking to you in a language you don't know)
1 | Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language – you understand those words but nothing more)
2 | Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 | Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4 | Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Bar chart: module-wise precision and recall (0–1.2) for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer, Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores given according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – The morph analyser, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy

Important challenge of M-H Translation: Morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
• च vaach (read) → चण vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• च chav (bite) → चणर chaavaNaara (one who bites)
• खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

• "ण" Ne-Kridanta (Perfective)
  vaachNyaasaaThee pustak de (Give me a book for reading) [for reading book give]
• "ल" laa-Kridanta (Perfective)
  Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after reading will tell]
• "त" Taanaa-Kridanta (Durative)
  Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while reading it in mind came]
• "लल" Lela-Kridanta (Perfective)
  kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give]
• "ऊ" Un-Kridanta (Completive)
  pustak vaachun parat kar (Return the book after reading it) [book after reading back do]
• "णर" Nara-Kridanta (Stative)
  pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to the one who reads knowledge gets]
• ve-Kridanta (Inceptive)
  he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should read]
• "त" taa-Kridanta (Stative)
  to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while reading to sleep went]

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi 'lelaa' form

Morphological Processing of Kridanta forms (cont)

[Figure: Morphotactics FSM for Kridanta Processing]
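The FSM figure itself did not survive the transcript, so the following is only a rough stand-in, not the paper's actual morphotactics: a suffix-stripping sketch over the transliterated kridanta forms from the table above.

```python
# Illustrative sketch (an assumption, not the paper's FSM): recognise a
# participial suffix on a transliterated Marathi verb form and report its
# kridanta type and aspect, using the examples from the Kridanta Types table.

KRIDANTA_SUFFIXES = [
    # (suffix, kridanta type, aspect) -- longer suffixes first
    ("NyaasaaThee", "Ne-Kridanta",     "Perfective"),
    ("lyaavar",     "laa-Kridanta",    "Perfective"),
    ("taanaa",      "Taanaa-Kridanta", "Durative"),
    ("lele",        "Lela-Kridanta",   "Perfective"),
    ("un",          "Un-Kridanta",     "Completive"),
    ("Naaraa",      "Nara-Kridanta",   "Stative"),
    ("aave",        "ve-Kridanta",     "Inceptive"),
    ("taa",         "taa-Kridanta",    "Stative"),
]

def analyse_kridanta(word):
    """Return (stem, kridanta type, aspect) if a known suffix matches, else None."""
    for suffix, ktype, aspect in KRIDANTA_SUFFIXES:
        if word.lower().endswith(suffix.lower()) and len(word) > len(suffix):
            return word[: -len(suffix)], ktype, aspect
    return None

for w in ["vaachtaanaa", "vaachlele", "vaachun", "vaachNaaRyaalaa"]:
    print(w, "->", analyse_kridanta(w))
# vaachNaaRyaalaa carries a further case suffix, so this toy analyser misses it;
# a real morphotactics FSM chains kridanta and case-suffix states.
```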

Accuracy of Kridanta Processing Direct Evaluation

[Bar chart: precision and recall (0.86–0.98) for Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Resultsbull Evaluated on EOLSS corpus

ndash 7000 sentences from EOLSS corpus

bull Evaluated on 4 scenariosndash Scenario1 Previous system (tightly coupled with XLE parser)ndash Scenario2 Current system (tightly coupled with Stanford

parser)ndash Scenario3 Scenario2 ndash (Simplifier + Merger)ndash Scenario4 Scenario2 ndash XLE parser

bull Scenario3 lets us know the impact of Simplifier and Merger

bull Scenario4 lets us know the impact of XLE parser

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Metricsbull tgen = relationgen (UW1gen UW2gen)

bull tgold = relationgold (UW1gold UW2gold)

bull tgen = tgold if

ndash relationgen = relationgold

ndash UW1gen = UW1gold

ndash UW2gen = UW2gold

bull If UWgen = UWgold let amatch(UWgen UWgold) = attributes matched in UWgen and UWgold

bull count(UWx)= attributes in UWx

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

Results Relation-wise accuracy

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Overall accuracy

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Relation

Results Relation-wise Precision

Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT


Results Relation-wise Recall

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Relation-wise F-score

Results Attribute

Scenario1 Scenario2 Scenario3 Scenario40

005

01

015

02

025

03

035

04

045

05

Attribute Accuracy

Accuracy

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent Marathi form: lelaa

Morphological Processing of Kridanta forms (cont)

[Figure: Morphotactics FSM for Kridanta processing.]
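The FSM itself appears only as a figure; the sketch below illustrates the morphotactics idea (verb stem, then a kridanta suffix, then an optional case suffix) with an invented state/transition table and a toy segment lexicon, not the analyser's actual ones:

```python
# Minimal morphotactics FSM sketch (illustrative states and transitions).
TRANSITIONS = {
    ("START", "STEM"): "STEM",
    ("STEM", "KRIDANTA"): "KRIDANTA",
    ("KRIDANTA", "CASE"): "CASE",
}
FINAL_STATES = {"KRIDANTA", "CASE"}

# Toy lexicon of segment categories (illustrative only)
CATEGORY = {
    "vaach": "STEM", "paL": "STEM", "bas": "STEM",
    "lele": "KRIDANTA", "taanaa": "KRIDANTA", "un": "KRIDANTA", "Nyaa": "KRIDANTA",
    "saaThee": "CASE", "var": "CASE", "laa": "CASE",
}

def accepts(segments):
    """Return True if the segment sequence is a valid kridanta form per the FSM."""
    state = "START"
    for seg in segments:
        cat = CATEGORY.get(seg)
        if cat is None or (state, cat) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, cat)]
    return state in FINAL_STATES

print(accepts(["vaach", "lele"]))            # True  (vaachlele, 'read' participle)
print(accepts(["vaach", "Nyaa", "saaThee"])) # True  (vaachNyaasaaThee, 'for reading')
print(accepts(["saaThee", "vaach"]))         # False (suffix before stem)
```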

Accuracy of Kridanta Processing Direct Evaluation

[Figure: Precision and recall of kridanta processing for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta classes; all values lie between 0.86 and 0.98.]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL → Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Attribute

[Figure: Attribute accuracy for Scenario 1 to Scenario 4 (values between 0 and 0.5).]

Results Time-taken

• Time taken for the systems to evaluate 7000 sentences from the EOLSS corpus

[Figure: Time taken by Scenario 1 to Scenario 4 (values between 0 and 40).]

ERROR ANALYSIS

Wrong Relations Generated

[Figure: False positives (wrong relations generated), per relation – obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or – and overall.]

Missed Relations

[Figure: False negatives (missed relations), per relation – obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or – and overall.]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• "James loves eating red water-melon, apples peeled nicely, and bananas."

amod(water-melon-5, red-4)
appos(water-melon-5, apples-7)
partmod(apples-7, peeled-8)
advmod(peeled-8, nicely-9)
cc(nicely-9, and-10)
conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• "George, the president of the football club, James, the coach of the club, and the players went to watch their rival's game."

appos(George-1, president-4)
conj(club-8, James-10)
conj(club-8, coach-13)
cc(club-8, and-17)
conj(club-8, players-19)

Case Study (3/3)

• Low-precision, frequent relations
  – mod and qua
• More general rules
  – agt-aoj: non-uniformity in the corpus
  – gol-obj: multiple relation generation possibility

"The former, however, represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms."

Corpus relations:
  agt(represent, latter)
  aoj(represent, former)
  obj(dosage, body)

Generated relations:
  aoj(represent, latter)
  aoj(represent, former)
  gol(dosage, body)
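The false-positive and false-negative counts in the error-analysis charts come from comparing the generated relation set against the corpus (gold) relations; a minimal sketch using the example just above:

```python
from collections import Counter

def relation_errors(corpus, generated):
    """Per-relation false positives and false negatives.

    corpus, generated: iterables of (relation, head, dependent) triples.
    """
    gold, pred = Counter(corpus), Counter(generated)
    false_pos = pred - gold      # generated but not in the corpus
    false_neg = gold - pred      # in the corpus but not generated
    return false_pos, false_neg

corpus = [("agt", "represent", "latter"),
          ("aoj", "represent", "former"),
          ("obj", "dosage", "body")]
generated = [("aoj", "represent", "latter"),
             ("aoj", "represent", "former"),
             ("gol", "dosage", "body")]

fp, fn = relation_errors(corpus, generated)
print("False positives:", list(fp))   # aoj(represent, latter), gol(dosage, body)
print("False negatives:", list(fn))   # agt(represent, latter), obj(dosage, body)
```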

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module | Output

UNL Expression | obj(contact(…

Lexeme Selection | संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Case Identification | संपर्क किसान यह आप क्षेत्र तालुका मंचर खटाव
(contact farmer this you region taluka Manchar Khatav)

Morphology Generation | संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
(contact-imperative farmer-pl this you region taluka Manchar Khatav)

Function Word Insertion | संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव
(contact farmers-to this-for you region or taluka-of Manchar Khatav)

Linearization | इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए ।
(this-for you Manchar region or Khatav taluka of farmers contact)

[Diagram: UNL graph of the example – contact with agt→you, obj→farmer, pur→this, plc→region; region linked through or/nam to Khatav and to Manchar taluka.]
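Deconversion is a fixed pipeline over the UNL graph, as the table shows. The toy sketch below mirrors that control flow for the example; the lexicon, morphology and function-word tables are tiny illustrative stand-ins, not the HinD system's actual resources:

```python
# Toy deconversion pipeline mirroring the step-through above (illustrative only).
LEXICON = {"contact": "संपर्क", "farmer": "किसान", "this": "यह", "you": "आप"}
MORPH = {("किसान", "pl", "oblique"): "किसानों",
         ("यह", "oblique"): "इस",
         ("संपर्क", "imperative"): "संपर्क कीजिए"}
FUNC_WORDS = {("obj", "किसानों"): "को", ("pur", "इस"): "के लिए"}

def deconvert(relations):
    """relations: (rel, head_uw, child_uw) triples of one UNL scope."""
    # 1. Lexeme selection
    words = {uw: LEXICON.get(uw, uw) for _, h, c in relations for uw in (h, c)}
    # 2./3. Case marking + morphology generation (collapsed into direct lookups)
    words["farmer"] = MORPH[("किसान", "pl", "oblique")]
    words["this"] = MORPH[("यह", "oblique")]
    words["contact"] = MORPH[("संपर्क", "imperative")]
    # 4. Function word insertion
    surface = dict(words)
    for rel, head, child in relations:
        fw = FUNC_WORDS.get((rel, words[child]))
        if fw:
            surface[child] = words[child] + " " + fw
    # 5. Linearization (order taken from the syntax planning slides)
    order = ["this", "you", "farmer", "contact"]
    return " ".join(surface[w] for w in order)

relations = [("agt", "contact", "you"),
             ("obj", "contact", "farmer"),
             ("pur", "contact", "this")]
print(deconvert(relations))
# इस के लिए आप किसानों को संपर्क कीजिए  (cf. इसके लिए ... in the table above)
```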

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCTN-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical Choice is unambiguous

[Diagram: UNL sub-graph with selected lexemes – you→आप, this→यह, farmer→किसान, contact→संपर्क.]

Case Marking

Relation | Parent+ | Parent- | Child+ | Child-
obj      | V       | VINT    | N      |
agt      | V, past |         | N      |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

[Diagram: the example sub-graph (you→आप, this→यह, farmer→किसान, contact→संपर्क) after case marking.]

Morphology Generation Nouns

Suffix | Attribute values
-uoM   | N, NUM, pl, oblique
-U     | N, NUM, sg, oblique
-I     | N, NIF, sg, oblique
-iyoM  | N, NIF, pl, oblique
-oM    | N, ANOT, CHF, pl, oblique

The boy saw me   – लड़के ने मुझे देखा
Boys saw me      – लड़कों ने मुझे देखा
The King saw me  – राजा ने मुझे देखा
Kings saw me     – राजाओं ने मुझे देखा

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A
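Verb morphology generation can be read as a table lookup keyed on (tense, aspect, mood, number, gender, person); a minimal sketch populated from the rows above (the lookup structure itself is an assumption for illustration):

```python
# Verb suffix lookup keyed on (tense, aspect, mood, number, gender, person).
VERB_SUFFIXES = {
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect=None, mood=None, num="sg", gen="male", person="3rd"):
    return VERB_SUFFIXES.get((tense, aspect, mood, num, gen, person))

print(verb_suffix("past", aspect="progress"))                          # -e rahaa thaa
print(verb_suffix("present", mood="ability", num="pl", gen="female"))  # saktii hain
```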

After Morphology Generation

[Diagram: the example sub-graph before morphology generation (you→आप, this→यह, farmer→किसान, contact→संपर्क) and after (this→इस, farmer→किसानों, contact→संपर्क कीजिए).]

Function Word Insertion

Before: संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मंचर खटाव
After:  संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मंचर खटाव

[Diagram: the example sub-graph with function words attached – you→आप, this→इसके लिए, farmer→किसानों को, contact→संपर्क कीजिए.]

Rel | Par+ | Par- | Chi+     | Chi-  | Child FW
obj | V    | VINT | N, ANIMT | topic | को
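Function word insertion is again rule-table driven: a rule fires when the relation and the parent/child feature constraints match, and attaches the child's function word. A minimal sketch (the rule encoding is an illustrative assumption, the single rule comes from the row above):

```python
# Each rule: (relation, parent_required, parent_forbidden,
#             child_required, child_forbidden) -> function word for the child.
FW_RULES = [
    (("obj", {"V"}, {"VINT"}, {"N", "ANIMT"}, {"topic"}), "को"),
]

def function_word(rel, parent_feats, child_feats):
    for (r, p_req, p_forb, c_req, c_forb), fw in FW_RULES:
        if (rel == r
                and p_req <= parent_feats and not (p_forb & parent_feats)
                and c_req <= child_feats and not (c_forb & child_feats)):
            return fw
    return None

# 'contact' is a (non-intransitive) verb, 'farmer' an animate noun:
print(function_word("obj", {"V", "VOA"}, {"N", "ANIMT", "PRSN"}))  # को
print(function_word("obj", {"V", "VINT"}, {"N", "ANIMT"}))         # None (VINT blocks the rule)
```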

Linearization

इस आप मंचर क्षेत्र खटाव तालुका किसानों संपर्क
(this you Manchar region Khatav taluka farmers contact)

[Diagram: UNL graph of the example with the linearized word order; contact carries कीजिए.]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

[Diagram: example sub-graphs – contact with pur→this, and contact with agt→you, obj→farmer, pur→this.]

Relative order matrix:

    | agt | aoj | obj | ins
Agt |     |  L  |  L  |  L
Aoj |     |     |  L  |  L
Obj |     |     |     |  R
Ins |     |     |     |

[Diagram: the example sub-graph (this, you, farmer; pur, agt, obj; contact).]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact

[Diagram: the example graph (region, Khatav, Manchar, taluka).]

Syntax Planning Algo

[Table: step-by-step trace of the algorithm on the example, with columns Stack, Before(Current), After(Current) and Output; the output order grows to this, you, manchar, region, taluka, farmers, contact.]

[Diagram: UNL graph of the example sentence.]
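A minimal sketch of this planning step: the relative-order matrix determines, for each head, which dependents go before and after it, and a fixed priority (standing in here for the topological sort) orders each group. The SIDE and PRIORITY values below are illustrative assumptions extrapolated from the slides:

```python
# Relation -> 'L' (child placed before the head) or 'R' (after the head).
SIDE = {"agt": "L", "aoj": "L", "obj": "L", "pur": "L", "plc": "L", "ins": "L"}
PRIORITY = ["pur", "plc", "agt", "aoj", "ins", "obj"]   # assumed ordering within a side

def plan(head, relations):
    """relations: list of (rel, child) pairs attached to `head`.
    Returns the linear order of children and head."""
    before = [(rel, child) for rel, child in relations if SIDE.get(rel, "L") == "L"]
    after  = [(rel, child) for rel, child in relations if SIDE.get(rel, "L") == "R"]
    key = lambda rc: PRIORITY.index(rc[0]) if rc[0] in PRIORITY else len(PRIORITY)
    return [c for _, c in sorted(before, key=key)] + [head] + \
           [c for _, c in sorted(after, key=key)]

print(plan("contact", [("obj", "farmer"), ("pur", "this"), ("agt", "you")]))
# ['this', 'you', 'farmer', 'contact']  (matches the final order on the slide)
```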

All Together (UNL → Hindi)

[The complete module-by-module trace for the example (UNL expression → lexeme selection → case identification → morphology generation → function word insertion → syntax linearization) is the same as in the step-through table above, ending with: इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए (this-for you Manchar region or Khatav taluka of farmers contact).]

How to Evaluate UNL Deconversion

• UNL → Hindi
• The reference translation needs to be generated from UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

[Table: five sample Hindi outputs from the agriculture domain with (Fluency, Adequacy) scores of (2, 3), (3, 4), (1, 1), (4, 4) and (2, 3).]

Results

                       | BLEU | Fluency | Adequacy
Geometric Average      | 0.34 | 2.54    | 2.84
Arithmetic Average     | 0.41 | 2.71    | 3.00
Standard Deviation     | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency) | 0.59 | 1.00    | 0.68
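The Pearson correlations in this table are the usual sample correlation between per-sentence scores; a minimal self-contained sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-sentence scores, not the actual evaluation data
bleu    = [0.34, 0.12, 0.55, 0.40, 0.71]
fluency = [3,    1,    4,    3,    4]
print(round(pearson(bleu, fluency), 2))   # about 0.94 for these made-up scores
```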

Fluency vs Adequacy

[Figure: Fluency vs Adequacy – number of sentences per fluency score (1 to 4), broken down by adequacy score (1 to 4).]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
  Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi


[Figure: Vauquois diagram of MT strategies – from direct translation over tagged text, through surface and deep syntactic transfer (C-structures, F-structures), semantic and multilevel transfer (SPA structures), up to semantico-linguistic and ontological interlingua.]

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language
V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results Time-taken

bull Time taken for the systems to evaluate 7000 sentences from EOLSS corpus

Scenario1 Scenario2 Scenario3 Scenario40

5

10

15

20

25

30

35

40

Time taken

Time taken

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

ERROR ANALYSIS

Wrong Relations Generated

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

False positive

False positive

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Wrong Relations Generated

[Chart: false positives (wrong relations generated) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Missed Relations

[Chart: false negatives (missed relations) per relation type: obj, mod, aoj, man, and, agt, plc, qua, pur, gol, tim, or, overall]

Case Study (1/3)

• Conjunctive relations
  – Wrong apposition detection
• Example: James loves eating red water-melon, apples peeled nicely, and bananas.

amod(water-melon-5, red-4), appos(water-melon-5, apples-7), partmod(apples-7, peeled-8), advmod(peeled-8, nicely-9), cc(nicely-9, and-10), conj(nicely-9, bananas-11)

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• Example: George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game.

appos(George-1, president-4), conj(club-8, James-10), conj(club-8, coach-13), cc(club-8, and-17), conj(club-8, players-19)

Case Study (3/3)

• Low-precision, frequent relations
  – mod and qua
• More general rules
  – agt-aoj: non-uniformity in corpus
  – gol-obj: multiple relation generation possibility
• Example: The former, however, represents the toxic concentration in the water, while the latter represents the dosage to the body of the test organisms.

Corpus relations: agt(represent, latter), aoj(represent, former), obj(dosage, body)
Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)
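The false-positive and false-negative charts above come from exactly this kind of comparison between generated and corpus (gold) relation sets. A minimal sketch of how per-relation precision and recall can be computed from (label, head, child) triples; the function and data layout are illustrative, not the evaluation code behind the reported numbers:

```python
from collections import Counter

def relation_prf(gold, predicted):
    """Per-label precision/recall from sets of (label, head, child) triples."""
    gold, predicted = set(gold), set(predicted)
    tp = Counter(label for (label, h, c) in gold & predicted)
    fp = Counter(label for (label, h, c) in predicted - gold)   # wrong relations generated
    fn = Counter(label for (label, h, c) in gold - predicted)   # missed relations
    scores = {}
    for label in {l for (l, _, _) in gold | predicted}:
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = (p, r)
    return scores

# The running example from this case study:
corpus = [("agt", "represent", "latter"), ("aoj", "represent", "former"), ("obj", "dosage", "body")]
generated = [("aoj", "represent", "latter"), ("aoj", "represent", "former"), ("gol", "dosage", "body")]
print(relation_prf(corpus, generated))   # only aoj(represent, former) matches exactly
```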

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation
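Read as a pipeline, the deconverter applies the stages listed in the step-through below one after another to the UNL graph. A minimal sketch of that composition; stage names follow the slides, and the bodies are empty placeholders rather than the HinD implementation:

```python
from typing import Callable, Dict, List

# Each stage maps one intermediate representation to the next.
Stage = Callable[[Dict], Dict]

def select_lexemes(ir: Dict) -> Dict:         # UW -> Hindi lexeme + attributes
    return ir

def identify_case(ir: Dict) -> Dict:          # case from UNL relation + node properties
    return ir

def generate_morphology(ir: Dict) -> Dict:    # noun/verb suffixes from attributes
    return ir

def insert_function_words(ir: Dict) -> Dict:  # ko, ke liye, yaa, ke, ...
    return ir

def linearize(ir: Dict) -> Dict:              # syntax planning + flattening
    return ir

PIPELINE: List[Stage] = [select_lexemes, identify_case, generate_morphology,
                         insert_function_words, linearize]

def deconvert(unl_graph: Dict) -> Dict:
    """Transfer + generation as a composition of the stages above."""
    ir = unl_graph
    for stage in PIPELINE:
        ir = stage(ir)
    return ir
```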

Step-through the Deconverter

UNL expression: obj(contact(…

Module | Output
Lexeme Selection | संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव (contact, farmer, this, you, region, taluka, Manchar, Khatav)
Case Identification | संपर्क किसान यह आप क्षेत्र तालुका मांचर खटाव (same lexemes, with case identified)
Morphology Generation | संपर्क कीजिए किसानों इस आप क्षेत्र तालुका मांचर खटाव (contact-imperative, farmer-pl, this, you, region, taluka, Manchar, Khatav)
Function Word Insertion | संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव (contact-imperative, farmers, this-for, you, region, or, taluka-of, Manchar, Khatav)
Linearization | इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए। (this-for you Manchar region or Khatav taluka of farmers contact, i.e. 'For this, contact the farmers of Manchar region or Khatav taluka.')

[Figure: UNL graph of the running example (@1); nodes contact, you, this, farmer, region, taluka, Manchar, Khatav; relations agt, obj, pur, plc, or, nam]

Lexeme Selection

[संपर्क] "contact(icl>communicate(agt>person,obj>person))" (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
[पच कवयलि7] "contact(icl>representative)" (NANIMTFAUNAMMLPRSNNa)

Lexical choice is unambiguous.

[Running example after lexeme selection: contact → संपर्क, you → आप, this → यह, farmer → किसान; relations agt, obj, pur]
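A sketch of the kind of UW-to-lexeme lookup this step performs, with a toy dictionary keyed on the universal word and its restrictions; the entries, attribute strings and the noun sense's Hindi rendering are illustrative, not the actual HinD lexicon:

```python
# Toy UW -> (Hindi lexeme, attribute string) dictionary.
LEXICON = {
    "contact(icl>communicate(agt>person,obj>person))": ("संपर्क", "V,VOA-COMM"),
    "contact(icl>representative)": ("संपर्क व्यक्ति", "N,ANIMT,PRSN"),  # hypothetical entry
    "farmer(icl>person)": ("किसान", "N,ANIMT,M"),
}

def select_lexeme(universal_word: str):
    """The headword plus its restrictions pick out one sense, so the choice is unambiguous."""
    return LEXICON[universal_word]

print(select_lexeme("contact(icl>communicate(agt>person,obj>person))"))  # ('संपर्क', 'V,VOA-COMM')
```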

Case Marking

• Depends on the UNL relation and on the properties of the parent and child nodes
• Case gets transferred from head to modifiers

Rule table (Relation | Parent+ | Parent- | Child+ | Child-):
Obj | V | VINT | N | (none)
Agt | V, past | (none) | N | (none)

[Running example graph, with case identified on each node]
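A sketch of how such a rule table can drive case assignment: each rule tests attributes required on the parent (Parent+), forbidden on the parent (Parent-), and similarly on the child. The two rules mirror the rows above; the resulting case labels ("oblique", "ergative") are assumptions added for illustration, since the slide does not name them:

```python
# (relation, parent must have, parent must not have, child must have, child must not have, case)
CASE_RULES = [
    ("obj", {"V"}, {"VINT"}, {"N"}, set(), "oblique"),
    ("agt", {"V", "past"}, set(), {"N"}, set(), "ergative"),
]

def assign_case(relation, parent_attrs, child_attrs):
    for rel, p_plus, p_minus, c_plus, c_minus, case in CASE_RULES:
        if (rel == relation
                and p_plus <= parent_attrs and not (p_minus & parent_attrs)
                and c_plus <= child_attrs and not (c_minus & child_attrs)):
            return case
    return None

print(assign_case("obj", {"V"}, {"N", "ANIMT"}))   # 'oblique'
print(assign_case("obj", {"V", "VINT"}, {"N"}))    # None: VINT blocks the rule
```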

Morphology Generation: Nouns

Suffix | Attribute values
uoM | N, NUM: pl, oblique
U | N, NUM: sg, oblique
I | N, NIF, sg, oblique
iyoM | N, NIF, pl, oblique
oM | N, NANOTCHF, pl, oblique

Examples:
The boy saw me: लड़के ने मुझे देखा
Boys saw me: लड़कों ने मुझे देखा
The king saw me: राजा ने मुझे देखा
Kings saw me: राजाओं ने मुझे देखा
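A small sketch of attribute-driven suffix selection for nouns, using two rows of the table above; the attribute-set encoding is an assumption made for the example:

```python
# Attribute combination -> noun suffix, following the table above.
NOUN_SUFFIXES = [
    ({"N", "NUM:pl", "oblique"}, "uoM"),
    ({"N", "NUM:sg", "oblique"}, "U"),
]

def noun_suffix(attrs):
    for required, suffix in NOUN_SUFFIXES:
        if required <= attrs:       # all required attributes present
            return suffix
    return ""

print(noun_suffix({"N", "NUM:pl", "oblique"}))  # 'uoM'
```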

Verb Morphology

Suffix | Tense | Aspect | Mood | Number | Gender | Person | VE
-e rahaa thaa | past | progressive | - | sg | male | 3rd | e
-taa hai | present | custom | - | sg | male | 3rd | -
-iyaa thaa | past | completive | - | sg | male | 3rd | I
saktii hain | present | - | ability | pl | female | 3rd | A
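The same table-lookup style works for verbs: a (tense, aspect, mood, number, gender, person) tuple selects the suffix or auxiliary sequence. A minimal sketch over the rows above (the tuple encoding is illustrative):

```python
# (tense, aspect, mood, number, gender, person) -> verb suffix / auxiliary sequence
VERB_SUFFIXES = {
    ("past", "progressive", None, "sg", "male", "3rd"): "-e rahaa thaa",
    ("present", "custom", None, "sg", "male", "3rd"): "-taa hai",
    ("past", "completive", None, "sg", "male", "3rd"): "-iyaa thaa",
    ("present", None, "ability", "pl", "female", "3rd"): "saktii hain",
}

def verb_suffix(tense, aspect, mood, number, gender, person):
    return VERB_SUFFIXES.get((tense, aspect, mood, number, gender, person))

print(verb_suffix("present", None, "ability", "pl", "female", "3rd"))  # 'saktii hain'
```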

After Morphology Generation

[Running example, before and after: contact संपर्क → संपर्क कीजिए; farmer किसान → किसानों; this यह → इस; you आप unchanged; relations agt, obj, pur]

Function Word Insertion

Before: संपर्क कीजिए किसानों यह आप क्षेत्र तालुका मांचर खटाव
After: संपर्क कीजिए किसानों को इसके लिए आप क्षेत्र या तालुका के मांचर खटाव

[Running example: farmer → किसानों को, this → इसके लिए, contact → संपर्क कीजिए; relations agt, obj, pur]

Rule table (Rel | Par+ | Par- | Chi+ | Chi- | Child function word):
Obj | V | VINT | N, ANIMT | topic | को
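A sketch of how the inserted function words get attached to the node sequence once a rule like the one above has fired; the fw_for lookup here is a toy stand-in (obj → को, pur → के लिए) and not the full rule base:

```python
def insert_function_words(nodes, fw_for):
    """nodes: list of (surface, relation-to-head, attrs); fw_for returns a function word or None."""
    out = []
    for surface, relation, attrs in nodes:
        out.append(surface)
        fw = fw_for(relation, attrs)
        if fw:
            out.append(fw)          # function word follows the child it marks
    return out

# Toy lookup mirroring the running example (illustrative only):
fw_for = lambda rel, attrs: {"obj": "को", "pur": "के लिए"}.get(rel)
nodes = [("किसानों", "obj", {"N", "ANIMT"}), ("इस", "pur", {"PRON"}), ("आप", "agt", {"PRON"})]
print(insert_function_words(nodes, fw_for))  # ['किसानों', 'को', 'इस', 'के लिए', 'आप']
```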

Linearization

Linearized order: इस आप मांचर क्षेत्र खटाव तालुका किसानों संपर्क (this, you, Manchar, region, Khatav, taluka, farmers, contact), which with the inserted function words reads इसके लिए आप मांचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।

[Running example graph, all nodes now carrying their final surface forms]

Syntax Planning Assumptions

• Semantic Independence: the relative word order of a UNL relation's relata does not depend on the semantic properties of the relata
• Context Independence: the relative word order of a UNL relation's relata does not depend on the rest of the expression
• Local Ordering: the relative word order of various relations sharing a relatum does not depend on the rest of the expression

Pairwise ordering matrix over relations (L = left, R = right):
      agt  aoj  obj  ins
agt    -    L    L    L
aoj    -    -    L    L
obj    -    -    -    R
ins    -    -    -    -

[Running example fragments: contact with pur(this), agt(you), obj(farmer)]

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {}
• Topologically sort each group: pur, agt, obj, giving this, you, farmer
• Final order: this you farmer contact

[Remaining subtree of the running example: region, Manchar, taluka, Khatav]
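A sketch of the strategy: partition each node's relations into a before-set and an after-set, order the before-set by relation priority, and emit children recursively around the head. The before-set membership and the pur/agt/obj priority follow the example above; everything else is illustrative:

```python
# Relations whose child is linearized before its head (per the running example).
BEFORE = {"agt", "obj", "pur", "plc"}
# Ordering priority inside the before-set (pur, then agt, then obj, as on the slide).
PRIORITY = {"pur": 0, "agt": 1, "obj": 2}

def plan(node, children):
    """children: dict mapping a node to a list of (relation, child) pairs."""
    rels = children.get(node, [])
    before = sorted([rc for rc in rels if rc[0] in BEFORE],
                    key=lambda rc: PRIORITY.get(rc[0], 99))
    after = [rc for rc in rels if rc[0] not in BEFORE]
    out = []
    for _, child in before:
        out.extend(plan(child, children))
    out.append(node)
    for _, child in after:
        out.extend(plan(child, children))
    return out

children = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(plan("contact", children))   # ['this', 'you', 'farmer', 'contact']
```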

Syntax Planning Algo

Trace of the stack-based planner on the running example (columns: Stack | Before current | After current | Output). Successive stacks: [region, farmer, contact]; [manchar, region, taluka, farmer, contact]; [region, taluka, farmer, contact]; [taluka, farmer, contact]. Before current: this, you; after current: manchar, taluka. The output grows: this you; this you manchar; this you manchar region.

[Running example graph repeated: contact with agt(you), obj(farmer), pur(this), plc(region); region, taluka, Manchar, Khatav subtree]

All Together (UNL → Hindi)

[The complete module-by-module trace for the UNL expression obj(contact(…, as in the step-through above: Lexeme Selection and Case Identification (contact, farmer, this, you, region, taluka, Manchar, Khatav), Morphology Generation (contact-imperative, farmer-pl, ...), Function Word Insertion (farmers, this-for, or, taluka-of, ...), and Syntax Linearization, ending with the sentence glossed 'this for you Manchar region or Khatav taluka of farmers contact']

How to Evaluate UNL Deconversion

• UNL → Hindi
• A reference translation needs to be generated from the UNL, which needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated; this works if you assume that UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations (Hindi output | Fluency | Adequacy)

1. कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए | 2 | 3
2. परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए | 3 | 4
3. माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत | 1 | 1
4. 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp | 4 | 4
5. इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp | 2 | 3

Results

Metric | BLEU | Fluency | Adequacy
Geometric average | 0.34 | 2.54 | 2.84
Arithmetic average | 0.41 | 2.71 | 3.00
Standard deviation | 0.25 | 0.89 | 0.89
Pearson correlation with BLEU | 1.00 | 0.59 | 0.50
Pearson correlation with Fluency | 0.59 | 1.00 | 0.68
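The correlations in the table are plain Pearson correlations over per-sentence scores. A minimal sketch of the computation; the score lists are made-up placeholders, not the evaluation data:

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-sentence scores only:
bleu =    [0.34, 0.10, 0.55, 0.41, 0.25]
fluency = [3,    1,    4,    3,    2]
print(round(pearson(bleu, fluency), 2))
```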

Fluency vs Adequacy

[Chart: number of sentences at each fluency level (1-4), broken down by adequacy level (1-4)]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

[Figure: Vauquois diagram of translation strategies, from direct and semi-direct translation through syntactic transfer (surface and deep), semantic, multilevel and conceptual transfer, up to ontological and semantico-linguistic interlinguas, mapped against levels from text and tagged text (C-structures, F-structures, SPA-structures) to the interlingual and deep-understanding levels]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional
• Deliver domain-specific versions of the MT systems; the domains are tourism and pilgrimage plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages, i.e. POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc., plus bidirectional bilingual dictionaries, annotated corpora, etc.

Language pairs (bidirectional): Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal tasks: H1 POS tagging & chunking engine; H2 morph analyser engine; H3 generator engine; H4 lexical disambiguation engine; H5 named entity engine; H6 dictionary standards; H7 corpora annotation standards; H8 evaluation of output (comprehensibility); H9 testing & integration

Vertical tasks for each language: V1 POS tagger & chunker; V2 morph analyzer; V3 generator; V4 named entity recognizer; V5 bilingual dictionary (bidirectional); V6 transfer grammar; V7 annotated corpus; V8 evaluation; V9 coordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale (grade: description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; you can still make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

[Chart: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator modules (scale 0 to 1.2)]

Evaluation of Marathi to Hindi MT System (cont.)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
वाच vaach (read) → वाचणे vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
चाव chav (bite) → चावणारा chaavaNaara (one who bites)
खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta type | Example | Gloss | Aspect
"णे" Ne-kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) | for-reading book give | Perfective
"ल" laa-kridanta | Lekh vaachalyaavar saaMgen (I will tell you after reading the article) | article after-reading will-tell | Perfective
"ताना" Taanaa-kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) | book while-reading it in-mind came | Durative
"लेले" Lela-kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) | yesterday read book give | Perfective
"ऊन" Un-kridanta | pustak vaachun parat kar (Return the book after reading it) | book after-reading back do | Completive
"णारा" Nara-kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) | books to-the-one-who-reads knowledge gets | Stative
"वे" ve-kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) | this book everyone should-read | Inceptive
"ता" taa-kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) | he book while-reading to-sleep went | Stative
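A small sketch of how the table can be used programmatically, mapping a detected kridanta suffix to its type and aspect label; the suffix transliterations and the naive longest-suffix matching are illustrative only, not the Marathi analyser itself:

```python
# Aspect by kridanta type, following the table above.
ASPECT = {"Ne": "Perfective", "laa": "Perfective", "Taanaa": "Durative", "Lela": "Perfective",
          "Un": "Completive", "Nara": "Stative", "ve": "Inceptive", "taa": "Stative"}

# Naive suffix detector over transliterated forms (illustrative only).
SUFFIX_TO_TYPE = [("taanaa", "Taanaa"), ("lele", "Lela"), ("un", "Un"), ("Naa", "Nara"), ("taa", "taa")]

def kridanta_type(word):
    for suffix, ktype in SUFFIX_TO_TYPE:
        if word.endswith(suffix):
            return ktype, ASPECT[ktype]
    return None

print(kridanta_type("vaachtaanaa"))  # ('Taanaa', 'Durative')
print(kridanta_type("vaachun"))      # ('Un', 'Completive')
```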

Participial Suffixes in Other Agglutinative Languages

Kannada: muridiruwaa kombe jennu esee (broken, to branch, throw), 'Throw away the broken branch'; similar to the lela form frequently used in Marathi.

Telugu: ame padutunnappudoo nenoo panichesanoo (she singing, I work), 'I worked while she was singing'; similar to the taanaa form frequently used in Marathi.

Turkish: hazirlanmis plan (prepare-past plan), 'The plan which has been prepared'; equivalent to the Marathi lelaa form.

Morphological Processing of Kridanta forms (cont.)

[Figure: morphotactics FSM for kridanta processing]
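The figure (not reproduced here) encodes kridanta morphotactics as a finite-state machine over stem and suffix classes. A minimal FSM sketch in that spirit, with made-up state names and a toy transition table rather than the actual analyser:

```python
# Toy morphotactics FSM: verb stem -> optional kridanta suffix -> optional case/postposition.
TRANSITIONS = {
    ("START", "verb_stem"): "STEM",
    ("STEM", "kridanta_suffix"): "KRIDANTA",
    ("KRIDANTA", "case_suffix"): "INFLECTED",
}
ACCEPTING = {"STEM", "KRIDANTA", "INFLECTED"}

def accepts(morph_classes):
    state = "START"
    for cls in morph_classes:
        state = TRANSITIONS.get((state, cls))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts(["verb_stem", "kridanta_suffix"]))                 # True  (e.g. vaach + lele)
print(accepts(["verb_stem", "kridanta_suffix", "case_suffix"]))  # True  (e.g. vaach + Naaryaa + laa)
print(accepts(["kridanta_suffix"]))                              # False (a suffix cannot start a word)
```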

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall per kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, Va), ranging roughly from 0.86 to 0.98]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Missed Relations

obj mod aoj man and agt plc qua pur gol tim or overall0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

False negative

False negative

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Case Study (13)

bull Conjunctive relationsndash Wrong apposition detection

bull James loves eating red water-melon apples peeled nicely and bananas

amod(water-melon-5 red-4)appos(water-melon-5 apples-7)partmod(apples-7 peeled-8)advmod(peeled-8 nicely-9)cc(nicely-9 and-10)conj(nicely-9 bananas-11)

Case Study (23)bull Conjunctive relations

ndash Non-uniform relation generationbull George the president of the football club

James the coach of the club and the players went to watch their rivals game

appos(George-1 president-4)conj(club-8 James-10)conj(club-8 coach-13)cc(club-8 and-17)conj(club-8 players-19)

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT


Summary: interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

Case Study (2/3)

• Conjunctive relations
  – Non-uniform relation generation
• Example: "George, the president of the football club, James, the coach of the club, and the players went to watch their rivals' game."

  appos(George-1, president-4)
  conj(club-8, James-10)
  conj(club-8, coach-13)
  cc(club-8, and-17)
  conj(club-8, players-19)

Case Study (3/3)

• Low-precision frequent relations
  – mod and qua
• More general rules
  – agt vs. aoj: non-uniformity in the corpus
  – gol vs. obj: multiple relation generation possibility

Example: "The former, however, represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms."

  Corpus relations:    agt(represent, latter), aoj(represent, former), obj(dosage, body)
  Generated relations: aoj(represent, latter), aoj(represent, former), gol(dosage, body)
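
This kind of error analysis can be mechanised by comparing generated relations against the corpus (gold) relations and reporting relation-wise precision and recall; a small sketch, with the triples above as input:

```python
# Sketch: compare generated UNL relations with gold (corpus) relations
# and report relation-wise precision/recall.
from collections import defaultdict

gold = {("agt", "represent", "latter"), ("aoj", "represent", "former"),
        ("obj", "dosage", "body")}
generated = {("aoj", "represent", "latter"), ("aoj", "represent", "former"),
             ("gol", "dosage", "body")}

def relation_wise_scores(gold, generated):
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for rel in generated:
        stats[rel[0]]["tp" if rel in gold else "fp"] += 1
    for rel in gold - generated:
        stats[rel[0]]["fn"] += 1
    scores = {}
    for label, s in stats.items():
        p = s["tp"] / (s["tp"] + s["fp"]) if s["tp"] + s["fp"] else 0.0
        r = s["tp"] / (s["tp"] + s["fn"]) if s["tp"] + s["fn"] else 0.0
        scores[label] = (p, r)
    return scores

print(relation_wise_scores(gold, generated))
# e.g. aoj gets precision 0.5 here (one spurious aoj); agt recall is 0.0
```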

Hindi Generation from Interlingua (UNL)

(Joint work with S. Singh, M. Dalal, V. Vachhani and Om Damani; MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation
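
As a frame of reference for the step-through that follows, here is a minimal sketch of the stage sequence as a pipeline of functions. The stage names follow the slides; the bodies are placeholder stubs, not the HinD implementation.

```python
# Skeleton of a HinD-style deconversion pipeline (illustrative only:
# each stage below is a stub standing in for the real module).

def lexeme_selection(unl_graph):      return unl_graph   # UW -> Hindi lexeme
def case_identification(nodes):       return nodes       # mark case on nodes
def morphology_generation(nodes):     return nodes       # inflect lexemes
def function_word_insertion(nodes):   return nodes       # add ko / ke liye / ...
def linearization(nodes):             return nodes       # syntax planning

def deconvert(unl_graph):
    """Run a UNL expression through the generation stages in order."""
    stages = [lexeme_selection, case_identification, morphology_generation,
              function_word_insertion, linearization]
    result = unl_graph
    for stage in stages:
        result = stage(result)
    return result
```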

Step-through the Deconverter (Module → Output)

  UNL Expression: obj(contact(…
  Lexeme Selection: संपर्क, किसान, यह, आप, क्षेत्र, तालुका, मंचर, खटाव
    (contact, farmer, this, you, region, taluka, Manchar, Khatav)
  Case Identification: the same lexemes, now marked for case
    (contact, farmer, this, you, region, taluka, Manchar, Khatav)
  Morphology Generation: संपर्क कीजिए, किसानों, इस, आप, क्षेत्र, तालुका, मंचर, खटाव
    (contact [imperative], farmers [plural, oblique], this, you, region, taluka, Manchar, Khatav)
  Function Word Insertion: संपर्क कीजिए, किसानों को, इसके लिए, आप, क्षेत्र या, तालुका के, मंचर, खटाव
    (contact, farmers [+ को], this [+ के लिए "for"], you, region [+ या "or"], taluka [+ के "of"], Manchar, Khatav)
  Linearization: इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए।
    (For this, you contact the farmers of Manchar region or Khatav taluka.)

[UNL graph of the example: contact with agt → you, obj → farmer, pur → this and plc → region; region linked by "or" to Khatav and to Manchar taluka (nam); scope :01]

Lexeme Selection

  contact(icl>communicate(agt>person,obj>person)) → [संपर्क] (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)
  contact(icl>representative) → [पच कवयलि7] (NANIMTFAUNAMMLPRSNNa)

  Lexical choice is unambiguous.

[Graph: contact/संपर्क with agt → you/आप, obj → farmer/किसान, pur → this/यह]
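
A minimal sketch of the lexeme-selection lookup: the Universal Word together with its restrictions keys directly into the Hindi lexicon entry, which is why the choice here is unambiguous. The dictionary layout is an assumption of the sketch; it holds just the two entries from the slide (feature strings copied verbatim).

```python
# Illustrative UW -> Hindi lexicon for lexeme selection (two entries only).

HINDI_LEXICON = {
    "contact(icl>communicate(agt>person,obj>person))":
        {"lexeme": "संपर्क",
         "features": "VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa"},
    "contact(icl>representative)":
        {"lexeme": None,   # noun reading; headword not legible in the source
         "features": "NANIMTFAUNAMMLPRSNNa"},
}

def select_lexeme(universal_word):
    """Pick the Hindi lexeme for a UW; distinct restrictions keep it unambiguous."""
    entry = HINDI_LEXICON.get(universal_word)
    return entry["lexeme"] if entry else None

print(select_lexeme("contact(icl>communicate(agt>person,obj>person))"))  # संपर्क
```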

Case Marking

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

  Rule table (Relation: Parent+ / Parent– / Child+ / Child– conditions):
    obj: V / VINT / N
    agt: V / past / N

[Graph: contact/संपर्क with agt → you/आप, obj → farmer/किसान, pur → this/यह]

Morphology Generation: Nouns

  Suffix | Attribute values
  uoM    | N, NUM, pl, oblique
  U      | N, NUM, sg, oblique
  I      | N, NIF, sg, oblique
  iyoM   | N, NIF, pl, oblique
  oM     | N, NANOTCHF, pl, oblique

  The boy saw me:   लड़के ने मुझे देखा
  Boys saw me:      लड़कों ने मुझे देखा
  The king saw me:  राजा ने मुझे देखा
  Kings saw me:     राजाओं ने मुझे देखा

Verb Morphology

  Suffix        | Tense   | Aspect   | Mood    | Num | Gen    | Per | VE
  -e rahaa thaa | past    | progress | -       | sg  | male   | 3rd | e
  -taa hai      | present | custom   | -       | sg  | male   | 3rd | -
  -iyaa thaa    | past    | complete | -       | sg  | male   | 3rd | I
  saktii hain   | present | -        | ability | pl  | female | 3rd | A
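
A sketch of how such attribute-to-suffix tables can drive generation: the table is keyed on (tense, aspect, mood, number, gender, person) and returns the suffix/auxiliary string, which is then attached to the stem. The key format and the stem-joining convention below are assumptions of the sketch.

```python
# Illustrative morphology generation from an attribute -> suffix table.
# Rows mirror the slide; keys and attachment are assumptions of the sketch.

VERB_SUFFIX_TABLE = {
    # (tense, aspect, mood, number, gender, person) -> suffix string
    ("past",    "progress", None,      "sg", "male",   "3rd"): "-e rahaa thaa",
    ("present", "custom",   None,      "sg", "male",   "3rd"): "-taa hai",
    ("past",    "complete", None,      "sg", "male",   "3rd"): "-iyaa thaa",
    ("present", None,       "ability", "pl", "female", "3rd"): " saktii hain",
}

def generate_verb(stem, tense, aspect=None, mood=None,
                  number="sg", gender="male", person="3rd"):
    suffix = VERB_SUFFIX_TABLE.get((tense, aspect, mood, number, gender, person))
    if suffix is None:
        return stem                      # fall back to the bare stem
    return stem + suffix.replace("-", "")

print(generate_verb("likh", "present", aspect="custom"))  # likhtaa hai
```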

After Morphology Generation

  Before: contact/संपर्क, farmer/किसान, this/यह, you/आप (linked by agt, obj, pur)
  After:  contact/संपर्क कीजिए, farmer/किसानों, this/इस, you/आप

Function Word Insertion

  Before: संपर्क कीजिए, किसानों, इस, आप, क्षेत्र, तालुका, मंचर, खटाव
  After:  संपर्क कीजिए, किसानों को, इसके लिए, आप, क्षेत्र या, तालुका के, मंचर, खटाव

[Graph: contact/संपर्क कीजिए with agt → you/आप, obj → farmer/किसानों को, pur → this/इसके लिए]

  Rule (Rel | Par+ | Par– | Chi+ | Chi– | Child FW):
    obj | V | VINT | N, ANIMT | topic | को
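
A sketch of the rule lookup for function-word insertion: a rule fires when the UNL relation matches, the parent has the (+) properties and lacks the (–) ones, the child likewise, and the matching child then receives the function word. The rule encoding is an assumption of the sketch; the single rule shown is the obj → को rule from the slide.

```python
# Illustrative function-word insertion driven by a small rule table.

FW_RULES = [
    {"rel": "obj",
     "parent_has": {"V"}, "parent_lacks": {"VINT"},
     "child_has": {"N", "ANIMT"}, "child_lacks": {"topic"},
     "child_fw": "को"},
]

def insert_function_word(rel, parent_feats, child_feats, child_lexeme):
    """Append the function word of the first matching rule to the child."""
    for rule in FW_RULES:
        if (rule["rel"] == rel
                and rule["parent_has"] <= parent_feats
                and not (rule["parent_lacks"] & parent_feats)
                and rule["child_has"] <= child_feats
                and not (rule["child_lacks"] & child_feats)):
            return child_lexeme + " " + rule["child_fw"]
    return child_lexeme

print(insert_function_word("obj", {"V"}, {"N", "ANIMT"}, "किसानों"))  # किसानों को
```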

Linearization

  Linearized order: इसके लिए, आप, मंचर, क्षेत्र, (या) खटाव, तालुका (के), किसानों (को), संपर्क (कीजिए)
  (this, you, Manchar, region, Khatav, taluka, farmers, contact)

[Graph: contact/संपर्क कीजिए with agt → you, obj → farmer, pur → this, plc → region; region linked by "or" to Khatav and to Manchar taluka]

Syntax Planning: Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

  Pairwise ordering matrix over relations (L/R entries as on the slide):
        agt  aoj  obj  ins
  agt        L    L    L
  aoj             L    L
  obj                  R
  ins

[Graphs: the example UNL graph (contact with agt → you, obj → farmer, pur → this) shown before and after ordering]

Syntax Planning: Strategy

• Divide a node's relations into
    Before_set = {obj, pur, agt}
    After_set  = { }
• Topologically sort each group: pur, agt, obj → this, you, farmer
• Final order: this, you, farmer, contact
  (similarly for the region / Khatav / Manchar-taluka subtree)

Syntax Planning: Algorithm

[Worked trace of the stack-based planner on the example, tracking four columns (Stack, Before set of the current node, After set of the current node, Output) as the nodes contact, farmer, taluka, region and Manchar are expanded, yielding the output order: this, you, Manchar, region, (or Khatav), taluka, farmers, contact]
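
A compact sketch of the syntax-planning step: each relation label is classified as placing its child before or after the head (in the spirit of the pairwise L/R matrix above), children within a group are ordered by an assumed priority, and the tree is linearized by a depth-first walk. The priority numbers and the simple dict-based tree are assumptions of the sketch.

```python
# Illustrative syntax planning for UNL deconversion: children whose
# relation is in BEFORE go left of the head, others go right.

BEFORE = {"pur": 0, "agt": 1, "aoj": 1, "obj": 2, "plc": 2}   # child precedes head
AFTER = {"ins": 0}                                            # child follows head

def linearize(graph, head):
    """graph: head -> list of (relation, child); returns the word-order list."""
    children = graph.get(head, [])
    before = sorted((c for c in children if c[0] in BEFORE),
                    key=lambda rc: BEFORE[rc[0]])
    after = sorted((c for c in children if c[0] not in BEFORE),
                   key=lambda rc: AFTER.get(rc[0], 99))
    order = []
    for _, child in before:
        order += linearize(graph, child)
    order.append(head)
    for _, child in after:
        order += linearize(graph, child)
    return order

graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize(graph, "contact"))   # ['this', 'you', 'farmer', 'contact']
```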

All Together (UNL → Hindi)

The same module-by-module table as above, applied end to end: UNL expression → lexeme selection → case identification → morphology generation → function word insertion → syntax planning and linearization, producing the final Hindi sentence इसके लिए आप मंचर क्षेत्र या खटाव तालुका के किसानों को संपर्क कीजिए ("For this, contact the farmers of Manchar region or Khatav taluka").

How to Evaluate UNL Deconversion?

• UNL → Hindi: the reference translation needs to be generated from the UNL
  – Needs expertise
• Compromise: generate the reference translation from the original English sentence from which the UNL was generated
  – Works if you assume that the UNL generation was perfect
• Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is:
  (4) Perfect: good grammar
  (3) Fair: easy to understand but flawed grammar
  (2) Acceptable: broken, understandable with effort
  (1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
  (4) All: no loss of meaning
  (3) Most: most of the meaning is conveyed
  (2) Some: some of the meaning is conveyed
  (1) None: hardly any meaning is conveyed

Sample Translations

[Table: five sample Hindi outputs (agriculture domain) with (Fluency, Adequacy) scores of (2, 3), (3, 4), (1, 1), (4, 4) and (2, 3); the Hindi text is not reproducible from this transcript]

Results

                          BLEU   Fluency   Adequacy
  Geometric average       0.34   2.54      2.84
  Arithmetic average      0.41   2.71      3.00
  Standard deviation      0.25   0.89      0.89
  Pearson cor. (BLEU)     1.00   0.59      0.50
  Pearson cor. (Fluency)  0.59   1.00      0.68
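
The correlations in the table can be reproduced from per-sentence scores with a standard Pearson computation; a sketch with made-up numbers (the score lists are illustrative, not the evaluation data):

```python
# Sketch: Pearson correlation between per-sentence BLEU, fluency and
# adequacy scores. The score lists below are made up for illustration.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu     = [0.10, 0.35, 0.50, 0.42, 0.28]   # hypothetical per-sentence scores
fluency  = [1,    3,    4,    3,    2]
adequacy = [2,    3,    4,    3,    3]

print(round(pearson(bleu, fluency), 2))
print(round(pearson(fluency, adequacy), 2))
```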

Fluency vs Adequacy

[Chart: number of sentences (0–200) at each fluency level 1–4, broken down by adequacy level 1–4]

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone

Caution: domain diversity, speaker diversity

Transfer Based MT: Marathi-Hindi

[Figure: the levels-of-transfer (Vauquois) diagram repeated here to situate transfer-based MT between direct translation and interlingua-based translation]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AU-KBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems
  – Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages
  – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.
  – Bidirectional bilingual dictionaries, annotated corpora, etc.

Language pairs (bidirectional): Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal tasks:
  H1 POS tagging and chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing and integration

Vertical tasks for each language:
  V1 POS tagger and chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example 1 [figure]

SSF Example 2 [figure]

Evaluation

• All 18 ILMT systems are integrated with updated modules
• Regular internal subjective evaluation is carried out at IIIT Hyderabad
• Third-party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

  0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
  1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
  2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
  3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
  4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results [figure]

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation
  – Evaluated on 500 web sentences

[Chart: module-wise precision and recall (0–1 scale) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules]

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Case Study (33)bull Low Precision frequent relations

ndash mod and quabull More general rules

ndash agt-aoj non-uniformity in corpusndash Gol-obj Multiple relation generation possibility

The former however represents the toxic concentration in the water while the latter represents the dosage to the body of the test organisms

agt(represent latter)aoj(represent former)obj(dosage body)

Corpus Relations

aoj(represent latter)aoj(represent former)gol(dosage body)

Generated Relations

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Hindi Generation from Interlingua (UNL)

(Joint work with S Singh M Dalal V Vachhani Om Damani

MT Summit 2007)

HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the


HinD Architecture

Deconversion = Transfer + Generation

Step-through the Deconverter

Module-by-module output for the example UNL expression obj(contact(…:

• Lexeme Selection: सपक+ निकस य आप कषतर तलक मा चर खाट (contact, farmer, this, you, region, taluka, Manchar, Khatav)
• Case Identification: सपक+ निकस य आप कषतर तलक मा चर खाट (same lexemes, with case features identified)
• Morphology Generation: सपक+ क0जि2ए निकस4 इस आप कषतर तलक मा चर खाट (contact-imperative, farmer-pl, this, you, region, taluka, Manchar, Khatav)
• Function Word Insertion: सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क मा चर खाट (contact, farmers, this-for, you, region, or, taluka-of, Manchar, Khatav)
• Linearization: इसक लिलए आप मा चर कषतर य खाट तलक क निकस4 क सपक+ क0जि2ए | (this-for, you, Manchar, region, or, Khatav, taluka-of, farmers, contact)
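Viewed end to end, the deconverter is a pipeline over the nodes of the UNL graph. The Python sketch below is purely illustrative (it is not the HinD code); each module is reduced to a toy dictionary lookup or string annotation so that the data flow of the table above is visible, with romanized Hindi standing in for Devanagari.

# Toy sketch of the deconversion pipeline (illustrative only, not the actual
# HinD modules). Each stage adds annotations to the graph nodes.
def select_lexemes(nodes, lexicon):
    # UW -> target-language lexeme
    return [dict(n, lexeme=lexicon.get(n["uw"], n["uw"])) for n in nodes]

def identify_case(nodes, relations):
    # Case features follow from the UNL relation a node participates in
    case_of = {"agt": "nom", "obj": "acc", "pur": "dat", "plc": "loc"}
    return [dict(n, case=case_of.get(relations.get(n["uw"]), "-")) for n in nodes]

def generate_morphology(nodes):
    # Attach a (fake) plural suffix driven by the node's attributes
    return [dict(n, form=n["lexeme"] + ("-pl" if n.get("pl") else "")) for n in nodes]

def insert_function_words(nodes):
    # Postposition chosen from the case feature (e.g. acc -> 'ko')
    post = {"acc": "ko", "dat": "ke liye", "loc": "mem"}
    return [dict(n, form=n["form"] + (" " + post[n["case"]] if n["case"] in post else ""))
            for n in nodes]

def linearize(nodes):
    # Crude SOV-style ordering: purpose < agent < place < object < verb
    order = {"dat": 0, "nom": 1, "loc": 2, "acc": 3, "-": 4}
    return " ".join(n["form"] for n in sorted(nodes, key=lambda n: order[n["case"]]))

nodes = [{"uw": "contact"}, {"uw": "farmer", "pl": True}, {"uw": "this"}, {"uw": "you"}]
relations = {"you": "agt", "farmer": "obj", "this": "pur"}
lexicon = {"contact": "sampark kijiye", "farmer": "kisaan", "this": "is", "you": "aap"}
staged = select_lexemes(nodes, lexicon)
staged = identify_case(staged, relations)
staged = generate_morphology(staged)
staged = insert_function_words(staged)
print(linearize(staged))  # -> is ke liye aap kisaan-pl ko sampark kijiye

On this toy example the printed order mirrors the gloss of the Linearization row above (this-for, you, farmers, contact), with the region/taluka subgraph omitted for brevity.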

Fig.: UNL graph of the example, with contact linked to you (agt), farmer (obj), this (pur) and region (plc); region is connected to Manchar and the Khatav taluka through nam/or relations (scope :01).

Lexeme Selection

[सपक+] "contact(icl>communicate(agt>person, obj>person))" (V, VOA, VOA-ACT, VOA-COMM, VLTN, TMP, CJNCT, N-Vlink, Va)

[पच कवयलि7] "contact(icl>representative)" (N, ANIMT, FAUNA, MML, PRSN, Na)

Lexical choice is unambiguous: the restrictions in the universal word already select the right sense.

Fig.: graph after lexeme selection: contact/सपक+ with agt → you/आप, obj → farmer/निकस, pur → this/य.
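Lexeme selection itself is a dictionary lookup keyed on the full universal word, so the semantic restrictions (icl>, agt>, obj>) already separate the verbal and nominal senses of contact. A hedged sketch of such a lookup follows; the entries and feature lists are abridged stand-ins, not the actual UNL-Hindi dictionary content.

# Illustrative UW -> lexeme lookup; entries and features are invented stand-ins.
UW_DICT = {
    "contact(icl>communicate(agt>person,obj>person))":
        {"lexeme": "sampark", "features": ["V", "VOA", "VOA-COMM"]},
    "contact(icl>representative)":
        {"lexeme": "sampark vyakti", "features": ["N", "ANIMT", "PRSN"]},
}

def select_lexeme(universal_word):
    # The full UW string (with its restrictions) is the key, so no extra
    # word sense disambiguation is needed at this stage.
    entry = UW_DICT[universal_word]
    return entry["lexeme"], entry["features"]

print(select_lexeme("contact(icl>communicate(agt>person,obj>person))"))
# ('sampark', ['V', 'VOA', 'VOA-COMM'])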

Case Marking

Relation | Parent+  | Parent- | Child+ | Child-
obj      | V        | VINT    | N      |
agt      | V, past  |         | N      |

• Depends on the UNL relation and the properties of the nodes
• Case gets transferred from head to modifiers

Fig.: the example graph after case marking (contact/सपक+ with agt → you/आप, obj → farmer/निकस, pur → this/य).
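Each case-marking rule matches the UNL relation together with properties the parent and child nodes must (+) or must not (-) have. A minimal sketch of that matcher is given below; the rule contents are illustrative, not the full rule base.

# Sketch of relation-driven case assignment; rules are illustrative.
RULES = [
    {"rel": "obj", "parent+": {"V"}, "parent-": {"VINT"}, "child+": {"N"},
     "case": "accusative"},
    {"rel": "agt", "parent+": {"V", "past"}, "child+": {"N"},
     "case": "ergative"},
]

def assign_case(rel, parent_props, child_props):
    # A rule fires when the relation matches, all '+' properties are present
    # and no '-' property is present on the respective node.
    for r in RULES:
        if r["rel"] != rel:
            continue
        if not r.get("parent+", set()) <= parent_props:
            continue
        if r.get("parent-", set()) & parent_props:
            continue
        if not r.get("child+", set()) <= child_props:
            continue
        if r.get("child-", set()) & child_props:
            continue
        return r["case"]
    return None

print(assign_case("obj", {"V", "TMP"}, {"N", "ANIMT"}))  # accusative
print(assign_case("obj", {"V", "VINT"}, {"N"}))          # None (intransitive parent)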

Morphology Generation: Nouns

Suffix | Attribute values
uoM    | N, NUM, pl, oblique
U      | N, NUM, sg, oblique
I      | N, NIF, sg, oblique
iyoM   | N, NIF, pl, oblique
oM     | N, N, ANOT, CHF, pl, oblique

• The boy saw me: लड़क माझ दखा
• Boys saw me: लड़क माझ दखा
• The King saw me: र2 माझ दखा
• Kings saw me: र2ऒ माझ दखा

Verb Morphology

Suffix        | Tense   | Aspect   | Mood    | N  | Gen    | P   | VE
-e rahaa thaa | past    | progress | -       | sg | male   | 3rd | e
-taa hai      | present | custom   | -       | sg | male   | 3rd | -
-iyaa thaa    | past    | complete | -       | sg | male   | 3rd | I
saktii hain   | present | -        | ability | pl | female | 3rd | A
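Both noun and verb morphology amount to picking a suffix from a feature bundle. The small sketch below encodes only a couple of rows inspired by the tables above, with simplified feature names and transliterated forms; it is a sketch, not the full paradigm.

# Feature-bundle -> suffix lookup (tiny, illustrative coverage).
NOUN_SUFFIX = {
    frozenset({"N", "pl", "oblique"}): "oM",           # e.g. raajaa -> raajaaoM
    frozenset({"N", "fem", "pl", "oblique"}): "iyoM",
}

VERB_SUFFIX = {
    ("past", "progress", None, "sg", "m", "3"): "rahaa thaa",
    ("present", "custom",  None, "sg", "m", "3"): "taa hai",
    ("present", None, "ability", "pl", "f", "3"): "saktii hain",
}

def noun_form(stem, features):
    return stem + NOUN_SUFFIX.get(frozenset(features), "")

def verb_form(root, tense, aspect, mood, num, gen, person):
    return root + " " + VERB_SUFFIX[(tense, aspect, mood, num, gen, person)]

print(noun_form("raajaa", {"N", "pl", "oblique"}))                    # raajaaoM
print(verb_form("jaa", "present", None, "ability", "pl", "f", "3"))   # jaa saktii hain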

After Morphology Generation

Fig.: the example graph before and after morphology generation: this/य → इस, farmer/निकस → निकस4, contact/सपक+ → सपक+ क0जि2ए; you/आप is unchanged.

Function Word Insertion

Before: सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट
After:  सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट
(gloss: contact, farmers, this-for, you, region, or, taluka-of, Manchar, Khatav)

Fig.: graph after function word insertion: this → इसक लिलए, farmer → निकस4 क; you/आप and contact/सपक+ क0जि2ए are unchanged.

Rel | Par+ | Par- | Chi+     | Chi-  | ChFW
obj | V    | VINT | N, ANIMT | topic | क
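Function word (postposition) insertion uses the same rule shape as case marking: match the relation plus parent/child properties and attach the child's function word (ChFW). A hedged sketch with illustrative rules:

# Sketch of function-word insertion rules; rule contents are illustrative.
FW_RULES = [
    {"rel": "obj", "parent+": {"V"}, "parent-": {"VINT"},
     "child+": {"N", "ANIMT"}, "fw": "ko"},            # animate object -> 'ko'
    {"rel": "pur", "child+": {"PRON"}, "fw": "ke liye"},  # purpose -> 'ke liye'
]

def insert_function_word(rel, parent_props, child_props, child_surface):
    for r in FW_RULES:
        if r["rel"] != rel:
            continue
        if not r.get("parent+", set()) <= parent_props:
            continue
        if r.get("parent-", set()) & parent_props:
            continue
        if not r.get("child+", set()) <= child_props:
            continue
        return child_surface + " " + r["fw"]
    return child_surface

print(insert_function_word("obj", {"V"}, {"N", "ANIMT"}, "kisaanoM"))  # kisaanoM ko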

Linearization

Linearized output: इस आप माचर कषतर खाट तलक निकस4 सपक+ (this you Manchar region Khatav taluka farmers contact)

Fig.: the fully annotated UNL graph (contact/सपक+ क0जि2ए with agt, obj, pur, plc links to you, farmer, this and the region/Manchar/Khatav taluka subgraph) that the linearizer traverses.

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on
  – Semantic Independence: the semantic properties of the relata
  – Context Independence: the rest of the expression
• The relative word order of various relations sharing a relatum does not depend on
  – Local Ordering: the rest of the expression

Pairwise ordering table over agt, aoj, obj, ins (L: the row relation's relatum is placed to the left of the column relation's relatum; R: to the right):

    | agt | aoj | obj | ins
agt |     | L   | L   | L
aoj |     |     | L   | L
obj |     |     |     | R
ins |     |     |     |

Fig.: the example subgraph (contact with agt → you, obj → farmer, pur → this) used to illustrate the ordering.

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {} (for the example head contact)
• Topologically sort each group using the ordering table: pur, agt, obj → this, you, farmer
• Final order: this you farmer contact
• (The region / taluka / Manchar / Khatav part of the graph is ordered by the same procedure.)

Syntax Planning Algo

Fig.: worked trace of the stack-based planner on the example. Columns: Stack, Before Current, After Current, Output. The stack successively expands contact, farmer, taluka, region, Manchar and Khatav, and the output grows to "this you Manchar region Khatav taluka farmers contact".
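A compact way to realize the planning strategy is a recursive walk: for each head, emit the children whose relation is marked L before it and those marked R after it, each side sorted by a fixed priority. The sketch below is illustrative; the side/priority tables and the tree encoding are simplifications of the ordering table and graph above.

# Recursive linearization sketch (simplified side/priority tables).
SIDE = {"pur": "L", "agt": "L", "plc": "L", "obj": "L"}   # anything else goes right
PRIORITY = ["pur", "agt", "plc", "obj"]                    # leftmost first

def linearize(node):
    # node = (word, [(relation, child_node), ...])
    word, children = node
    before = [c for c in children if SIDE.get(c[0], "R") == "L"]
    after = [c for c in children if SIDE.get(c[0], "R") == "R"]
    key = lambda c: PRIORITY.index(c[0]) if c[0] in PRIORITY else len(PRIORITY)
    out = []
    for _, child in sorted(before, key=key):
        out.extend(linearize(child))
    out.append(word)
    for _, child in sorted(after, key=key):
        out.extend(linearize(child))
    return out

tree = ("contact", [("agt", ("you", [])),
                    ("obj", ("farmer", [])),
                    ("pur", ("this", []))])
print(" ".join(linearize(tree)))   # this you farmer contact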

All Together (UNL → Hindi)

The full pipeline on the same example, module by module (glosses only; the Hindi output is as in the step-through above):

• UNL expression: obj(contact(…
• Lexeme Selection: contact, farmer, this, you, region, taluka, Manchar, Khatav
• Case Identification: contact, farmer, this, you, region, taluka, Manchar, Khatav
• Morphology Generation: contact-imperative, farmer-pl, this, you, region, taluka, Manchar, Khatav
• Function Word Insertion: contact, farmers, this-for, you, region, or, taluka-of, Manchar, Khatav
• Syntax Linearization: this-for you Manchar region or Khatav taluka-of farmers contact

How to Evaluate UNL Deconversion

• UNL → Hindi: the reference translation would have to be generated from the UNL itself, which needs expertise.
• Compromise: generate the reference translation from the original English sentence from which the UNL was produced. This works if one assumes that UNL generation was perfect.
• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency of the given translation is:
(4) Perfect: good grammar
(3) Fair: easy to understand but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi Output (Fluency, Adequacy):

• कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (Fluency 2, Adequacy 3)
• परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (Fluency 3, Adequacy 4)
• माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (Fluency 1, Adequacy 1)
• 2णविTक सकरमाण स इसक0 2ड़O परभनित त amp (Fluency 4, Adequacy 4)
• इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp (Fluency 2, Adequacy 3)

Results

                       | BLEU | Fluency | Adequacy
Geometric Average      | 0.34 | 2.54    | 2.84
Arithmetic Average     | 0.41 | 2.71    | 3.00
Standard Deviation     | 0.25 | 0.89    | 0.89
Pearson Cor. (BLEU)    | 1.00 | 0.59    | 0.50
Pearson Cor. (Fluency) | 0.59 | 1.00    | 0.68
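The correlation figures in the table are ordinary Pearson coefficients computed over per-sentence scores. For reference, a short self-contained computation on made-up scores:

# Pearson correlation between two score lists (e.g. per-sentence BLEU vs.
# manual fluency). The sample scores below are made up for illustration.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

bleu = [0.12, 0.40, 0.55, 0.30, 0.70]
fluency = [1, 3, 4, 2, 4]
print(round(pearson(bleu, fluency), 2))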

Fluency vs Adequacy

Fig.: number of sentences at each Fluency score (1–4), broken down by Adequacy score (1–4).

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Large-scale evaluation can therefore be done using Fluency alone
• Caution: domain diversity, speaker diversity

Transfer Based MT

Marathi-Hindi

106

Fig.: the Vauquois triangle of MT approaches, from direct translation at the graphemic level, through surface and deep syntactic transfer, semantic and multilevel transfer, up to semantico-linguistic and ontological interlinguas (with tagged text, C-structures, F-structures and SPA-structures as the corresponding representations).

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions: IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida.

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages (POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.), bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks:
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language:
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation scale (grade: description):
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences

Fig.: module-wise precision and recall for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules.

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from scores assigned according to translation quality
  – Accuracy: 65.32%
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun):
• च vaach (read) → चण vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
• च chav (bite) → चणर chaavaNaara (one who bites)
• खा khaa (eat) → खाललल khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → ब सना basun (after sitting)

Kridanta Types

• "ण" Ne-kridanta (aspect: perfective): vaachNyaasaaThee pustak de [for reading, book, give] "Give me a book for reading."
• "ल" laa-kridanta (aspect: perfective): Lekh vaachalyaavar saaMgen [article, after reading, will tell] "I will tell you that after reading the article."
• "त" taanaa-kridanta (aspect: durative): Pustak vaachtaanaa te lakShaat aale [book, while reading, it, in mind, came] "I noticed it while reading the book."
• "लल" lela-kridanta (aspect: perfective): kaal vaachlele pustak de [yesterday, read, book, give] "Give me the book that (I/you) read yesterday."
• "ऊ" un-kridanta (aspect: completive): pustak vaachun parat kar [book, after reading, back, do] "Return the book after reading it."
• "णर" Nara-kridanta (aspect: stative): pustake vaachNaaRyaalaa dnyaan miLte [books, to the one who reads, knowledge, gets] "The one who reads books gets knowledge."
• ve-kridanta (aspect: inceptive): he pustak pratyekaane vaachaave [this, book, everyone, should read] "Everyone should read this book."
• "त" taa-kridanta (aspect: stative): to pustak vaachtaa vaachtaa zopee gelaa [he, book, while reading, to sleep, went] "He fell asleep while reading a book."

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

- similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi -lelaa form

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for kridanta processing.
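The morphotactics FSM essentially recognizes verb stem + kridanta suffix sequences. Below is a toy recognizer over transliterated Marathi, assuming a tiny hand-listed stem set and suffix table; the real analyser works on Devanagari with a full lexicon and handles further case endings.

# Toy kridanta recognizer: stem + participial suffix (illustrative samples).
STEMS = {"vaach", "utar", "paL", "bas", "khaa"}
KRIDANTA_SUFFIXES = {
    "Nyaasaathee": "Ne-kridanta", "lyaavar": "laa-kridanta",
    "taanaa": "taanaa-kridanta", "lele": "lela-kridanta",
    "un": "un-kridanta", "Naaraa": "Nara-kridanta", "aave": "ve-kridanta",
}

def analyse(word):
    # Try the longest suffix first, then check that the remainder is a known stem.
    for suf in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and word[:-len(suf)] in STEMS:
            return word[:-len(suf)], KRIDANTA_SUFFIXES[suf]
    return word, None

print(analyse("vaachtaanaa"))   # ('vaach', 'taanaa-kridanta')
print(analyse("paLun"))         # ('paL', 'un-kridanta')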

Accuracy of Kridanta Processing Direct Evaluation

Fig.: precision and recall of kridanta processing for each kridanta type (Ne, laa, Nara, lela, taanaa, taa, un, ve); y-axis from 0.86 to 0.98.

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT

130

131

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Step-through the Deconverter

Module Output UNL Expression

obj(contact(hellip

Lexeme Selection

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar khatav Case Identification

सपक+ निकस य आप कषतर तलक मा चर खाट

contact farmer this you region taluka manchar

khatav Morphology Generation

सपक+ क0जि2ए निकस4 इस आप कषतर

contact imperative farmerpl this you region

तलक मा चर खाट

taluka manchar Khatav Function Word Insertion

सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर

contact farmers this for you region

य तलक क मा चर खाट

or taluka of Manchar Khatav Linearization इसक लिलए आप मा चर कषतर य खाट

This for you manchar region or khatav

तलक क निकस4 क सपक+ क0जि2ए |

taluka of farmers contact

you

this

farmer

agtobj

pur

plc

contact

nam

orregion

khatav

manchar

taluka

nam

01

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Lexeme Selection

[सपक+ ]contact(iclgtcommunicate(agtgtpersonobjgtperson))ldquo (VVOAVOA-ACTVOA-COMMVLTNTMPCJNCTN-VlinkVa)

[पच कवयलि7]contact(iclgtrepresentative)ldquo

(NANIMTFAUNAMMLPRSNNa)

Lexical Choice is unambiguous

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Case Marking

Relation Parent+ Parent- Child+ Child- Obj V VINT N Agt V past N

bull Depends on UNL Relation and the properties of the nodesbull Case get transferred from head to modifiers

you आप

this यfarmer निकस

agt obj contact सपक+

pur

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

[Figure: Vauquois diagram, as in the earlier "Kinds of MT Systems" slide, relating analysis depth (graphemic, syntagmatic, morpho-syntactic, syntactico-functional, logico-semantic, interlingual, deep understanding) to translation strategies (direct, surface and deep syntactic transfer, semantic transfer, multilevel transfer, conceptual transfer, ontological and semantico-linguistic interlingua)]

Sampark: IL-IL MT systems
• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems

Participating institutions:
IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems. Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages, namely POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks
H1: POS tagging & chunking engine
H2: Morph analyser engine
H3: Generator engine
H4: Lexical disambiguation engine
H5: Named entity engine
H6: Dictionary standards
H7: Corpora annotation standards
H8: Evaluation of output (comprehensibility)
H9: Testing & integration

Vertical Tasks for Each Language
V1: POS tagger & chunker
V2: Morph analyzer
V3: Generator
V4: Named entity recognizer
V5: Bilingual dictionary (bidirectional)
V6: Transfer grammar
V7: Annotated corpus
V8: Evaluation
V9: Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)
0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language: you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation: evaluated on 500 web sentences
[Figure: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Computation, WSD, Lexical Transfer and Word Generator]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  - Evaluated on 500 web sentences
  - Accuracy calculated from the scores given according to translation quality
  - Accuracy: 65.32%
• Result analysis
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011

Kridantas can be in multiple POS categories:

Nouns (Verb → Noun)
  vaach (read) → vaachaNe (reading)
  utara (climb down) → utaraN (downward slope)

Adjectives (Verb → Adjective)
  chav (bite) → chaavaNaara (one who bites)
  khaa (eat) → khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types (Type | Example | Aspect)

Ne-kridanta ("-Ne")
  Example: vaachNyaasaaThee pustak de (Give me a book for reading)
  Gloss: for reading book give
  Aspect: Perfective

laa-kridanta ("-laa")
  Example: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)
  Gloss: article after reading will tell
  Aspect: Perfective

taanaa-kridanta ("-taanaa")
  Example: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)
  Gloss: book while reading it in mind came
  Aspect: Durative

lela-kridanta ("-lela")
  Example: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday)
  Gloss: yesterday read book give
  Aspect: Perfective

Un-kridanta ("-Un")
  Example: pustak vaachun parat kar (Return the book after reading it)
  Gloss: book after reading back do
  Aspect: Completive

Nara-kridanta ("-Nara")
  Example: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)
  Gloss: books to the one who reads knowledge gets
  Aspect: Stative

ve-kridanta ("-ve")
  Example: he pustak pratyekaane vaachaave (Everyone should read this book)
  Gloss: this book everyone should read
  Aspect: Inceptive

taa-kridanta ("-taa")
  Example: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)
  Gloss: he book while reading to sleep went
  Aspect: Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
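A minimal sketch in the spirit of that morphotactics FSM is given below: a word is accepted only if it splits into a verb stem followed by one of the participial suffixes from the kridanta table above. The transliterated suffix list, the toy stem lexicon and the longest-suffix-first strategy are simplifying assumptions, not the ICON 2011 system.

```python
# Two-state morphotactics sketch: STEM state followed by a KRIDANTA-SUFFIX state.
KRIDANTA_SUFFIXES = {              # surface suffix -> (kridanta type, aspect)
    "aNe":    ("Ne-kridanta",     "perfective"),
    "taanaa": ("taanaa-kridanta", "durative"),
    "lele":   ("lela-kridanta",   "perfective"),
    "un":     ("Un-kridanta",     "completive"),
    "Naaraa": ("Nara-kridanta",   "stative"),
    "aave":   ("ve-kridanta",     "inceptive"),
    "taa":    ("taa-kridanta",    "stative"),
}
VERB_STEMS = {"vaach", "paL", "bas", "khaa", "utara"}   # toy stem lexicon

def analyse_kridanta(word):
    """Accept `word` only if it decomposes as STEM + KRIDANTA_SUFFIX."""
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in VERB_STEMS:
            ktype, aspect = KRIDANTA_SUFFIXES[suffix]
            return {"stem": stem, "type": ktype, "aspect": aspect}
    return None   # not a kridanta form of a known stem

print(analyse_kridanta("vaachtaanaa"))   # vaach + taanaa: durative participle
print(analyse_kridanta("vaachlele"))     # vaach + lele: perfective participle
```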

Accuracy of Kridanta Processing Direct Evaluation

[Figure: precision and recall, roughly in the 0.86-0.98 range, for each kridanta type: Ne, La, Nara, Lela, Tana, T, Oon, Va]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT

Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target text)
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hindi)
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formalisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • "Linguistic" interlingua
  • Slide 24
  • "Interlingua" and "Esperanto"
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConversion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expressions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -> Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Morphology Generation Nouns

Suffix Attribute values

uoM NNUMploblique

U NNUMsgoblique I NNIFsgoblique iyoM NNIFploblique

oM NNANOTCHFploblique

The boy saw meलड़क माझ दखा Boys saw meलड़क माझ दखा The King saw meर2 माझ दखा Kings saw meर2ऒ माझ दखा

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Verb Morphology

Suffix Tense Aspect Mood N Gen P VE

-e rahaa thaa

past progress - sg male 3rd e

-taa hai present custom - sg male 3rd -

-iyaa thaa past complete - sg male 3rd I

saktii hain present - ability pl female 3rd A

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

After Morphology Generation

you आप

this यfarmer निकस

agt obj contact सपक+

pur

you आप

this इस

farmer निकस4

agt obj contact सपक+ क0जि2ए

pur

Function Word Insertion

सपक+ क0जि2ए निकस4 य आप कषतर तलक माचर खाट सपक+ क0जि2ए निकस4 क इसक लिलए आप कषतर य तलक क माचर खाट

you आप

this इसक लिलए

farmer निकस4 क

agt obj contact सपक+ क0जि2ए

pur

Rel Par+ Par- Chi+ Chi- ChFW

Obj V VINT NANIMT topic क

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131


Function Word Insertion

[Slide figure: the running example's UNL graph after function word insertion, contrasting the word sequence before and after. Nodes: contact ("please contact"), farmer, this ("for this"), you, linked by agt, obj and pur relations. Insertion is driven by a rule table with columns Rel, Par+, Par-, Chi+, Chi- and ChFW (the function word to insert on the child); the row shown fires for the obj child of a verb and inserts the appropriate case marker.]
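A minimal sketch of how such a rule table could be applied in code; the relation names follow UNL, but the feature names, rules and romanized markers (ko, ke lie) are illustrative assumptions, not the deconverter's actual table:

# Hypothetical function-word insertion rules, mirroring the slide's
# Rel / Par+ / Chi+ / ChFW columns; feature names and markers are illustrative.
RULES = [
    {"rel": "obj", "parent_has": "V", "child_has": "N", "marker": "ko"},         # object of a verb
    {"rel": "pur", "parent_has": "V", "child_has": "PRON", "marker": "ke lie"},  # purpose ("for")
]

def insert_function_word(relation, parent_feats, child_feats, child_lexeme):
    """Append the case marker of the first matching rule to the child lexeme."""
    for rule in RULES:
        if (rule["rel"] == relation
                and rule["parent_has"] in parent_feats
                and rule["child_has"] in child_feats):
            return f"{child_lexeme} {rule['marker']}"
    return child_lexeme   # no rule fired: the lexeme stays unmarked

# The obj child 'kisaan' (farmer) under the verb 'sampark' (contact) gets 'ko'.
print(insert_function_word("obj", {"V", "IMPERATIVE"}, {"N", "ANIMATE"}, "kisaan"))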

Linearization

[Slide figure: the same UNL graph (contact, you, this, farmer, region, taluka, Manchar, Khatav connected by agt, obj, pur, plc and nam relations), linearized to the gloss order: this, you, Manchar region, Khatav taluka, farmers, contact.]

Syntax Planning Assumptions

• The relative word order of a UNL relation's relata does not depend on:
  - Semantic Independence: the semantic properties of the relata
  - Context Independence: the rest of the expression
• Local Ordering: the relative word order of the various relations sharing a relatum does not depend on the rest of the expression.

[Slide figure: the example sub-graph, contact with agt -> you, obj -> farmer, pur -> this.]

Pairwise ordering table over the relations (L = place to the left, R = place to the right):

        agt   aoj   obj   ins
  agt    -     L     L     L
  aoj    -     -     L     L
  obj    -     -     -     R
  ins    -     -     -     -

Syntax Planning Strategy

• Divide a node's relations into Before_set = {obj, pur, agt} and After_set = {} (empty for this node).
• Topologically sort each group: pur, agt, obj, i.e. this, you, farmer.
• Final order for the node: this you farmer contact.

[The slide also shows the remaining part of the example graph: region, Khatav, Manchar, taluka.]

Syntax Planning Algo

[Slide figure: a step-by-step trace of the planner over the example graph, with columns for the stack, the before/after sets of the current node, and the output so far; nodes (region, farmer, contact, this, you, Manchar, taluka, Khatav) are pushed and emitted until the output grows to the order: this, you, Manchar, region, taluka, farmers, contact.]
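A minimal sketch of the strategy and trace above: split a node's children into Before/After sets using a placement table, order each set by a fixed priority, and walk the graph depth-first. The placement and priority values here are assumptions chosen to reproduce the example's order, not the deconverter's actual tables:

# A toy precedence-driven linearizer. PLACEMENT says whether a relation's child
# is emitted before (L) or after (R) its parent; PRIORITY orders children that
# fall on the same side. Both tables are illustrative.
PLACEMENT = {"agt": "L", "obj": "L", "pur": "L", "aoj": "L", "plc": "L", "ins": "R"}
PRIORITY = {"pur": 0, "agt": 1, "obj": 2, "aoj": 3, "plc": 4, "ins": 5}

def linearize(node, graph, out):
    """graph maps a node to a list of (relation, child) pairs; out collects words."""
    children = graph.get(node, [])
    before = [rc for rc in children if PLACEMENT.get(rc[0], "L") == "L"]
    after = [rc for rc in children if PLACEMENT.get(rc[0], "L") == "R"]
    for _, child in sorted(before, key=lambda rc: PRIORITY.get(rc[0], 99)):
        linearize(child, graph, out)      # Before-set children precede the node
    out.append(node)                      # emit the node itself
    for _, child in sorted(after, key=lambda rc: PRIORITY.get(rc[0], 99)):
        linearize(child, graph, out)      # After-set children follow the node
    return out

# Running example: contact(agt=you, obj=farmer, pur=this)
graph = {"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}
print(linearize("contact", graph, []))    # ['this', 'you', 'farmer', 'contact']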

All Together (UNL -> Hindi)

Module-by-module output for the running example (English glosses of the Hindi shown on the slide):
• UNL Expression: obj(contact( ...
• Lexeme Selection: contact, farmer, this, you, region, taluka, Manchar, Khatav
• Case Identification: the same lexemes, now with their cases decided
• Morphology Generation: contact (imperative), farmers (plural), this, you, region, taluka, Manchar, Khatav
• Function Word Insertion: contact, farmers, this + "for", you, region, "or", taluka + "of", Manchar, Khatav
• Syntax Linearization: "This for you Manchar region or Khatav taluka of farmers contact" (roughly: please contact the farmers of Manchar region or Khatav taluka for this)
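To make the flow concrete, here is a hypothetical driver whose stage functions are mere placeholders; only the ordering of the stages mirrors the slide, nothing else is taken from the actual deconverter:

# Placeholder stages; each comment says what the real module would do.
def select_lexemes(unl_graph):           # UW -> Hindi lexeme (dictionary lookup + WSD)
    return ["contact", "farmer", "this", "you", "region", "taluka", "Manchar", "Khatav"]

def identify_case(words, unl_graph):     # decide case from the relations (agt, obj, pur, ...)
    return words

def generate_morphology(words):          # inflect: imperative verb, plural noun, ...
    return words

def insert_function_words(words):        # add postpositions / particles (ko, ke lie, kaa, yaa, ...)
    return words

def plan_and_linearize(words, unl_graph):   # syntax planning, then final word order
    return " ".join(words)

def deconvert(unl_graph):
    """Hypothetical driver that mirrors the module order on the slide."""
    words = select_lexemes(unl_graph)
    words = identify_case(words, unl_graph)
    words = generate_morphology(words)
    words = insert_function_words(words)
    return plan_and_linearize(words, unl_graph)

print(deconvert({"contact": [("agt", "you"), ("obj", "farmer"), ("pur", "this")]}))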

How to Evaluate UNL Deconversion

• UNL -> Hindi.
• A reference translation has to be generated from the UNL itself, which needs expertise.
• Compromise: generate the reference translation from the original English sentence from which the UNL was produced. This works if one assumes the UNL generation was perfect.
• Note that fidelity is not an issue.

Manual Evaluation Guidelines

Fluency of the given translation:
(4) Perfect: good grammar
(3) Fair: easy to understand, but flawed grammar
(2) Acceptable: broken, understandable with effort
(1) Nonsense: incomprehensible

Adequacy: how much meaning of the reference sentence is conveyed in the translation:
(4) All: no loss of meaning
(3) Most: most of the meaning is conveyed
(2) Some: some of the meaning is conveyed
(1) None: hardly any meaning is conveyed

Sample Translations

Hindi output (Fluency, Adequacy); the Devanagari is reproduced as extracted:
1. कक क करण आमा क0 2क पततिय यदिद झलस र 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए (2, 3)
2. परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए (3, 4)
3. माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत (1, 1)
4. 2णविTक सकरमाण स इसक0 2ड़O परभनित त (4, 4)
5. इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात (2, 3)

Results

                           BLEU    Fluency   Adequacy
Geometric Average          0.34      2.54       2.84
Arithmetic Average         0.41      2.71       3.00
Standard Deviation         0.25      0.89       0.89
Pearson Corr. w/ BLEU      1.00      0.59       0.50
Pearson Corr. w/ Fluency   0.59      1.00       0.68
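The averages and correlations in this table come from per-sentence scores; a small sketch of the Pearson computation (the score lists below are made-up placeholders, not the experiment's data):

from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Made-up per-sentence scores: BLEU in [0, 1], fluency and adequacy on the 1-4 scale.
bleu = [0.21, 0.55, 0.10, 0.62, 0.35]
fluency = [2, 3, 1, 4, 2]
adequacy = [3, 4, 1, 4, 3]

print(round(pearson(bleu, fluency), 2))    # correlation of BLEU with fluency
print(round(pearson(bleu, adequacy), 2))   # correlation of BLEU with adequacy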

Fluency vs Adequacy

[Chart: number of sentences at each fluency level (1-4), broken down by adequacy level (1-4).]

• Good correlation between Fluency and BLEU.
• Strong correlation between Fluency and Adequacy.
• Large-scale evaluation can therefore be done using Fluency alone.

Caution: domain diversity, speaker diversity.

Transfer Based MT

Marathi-Hindi

[Figure: the MT strategies pyramid, repeated from earlier: from direct translation at the text/graphemic level, through surface and deep syntactic transfer, semantic and multilevel transfer, up to conceptual transfer at the interlingual level (ontological and semantico-linguistic interlinguas), with tagged text, C-structures (constituent), F-structures (functional) and SPA structures (semantic & predicate-argument) as the corresponding representations.]

Sampark: IL-IL MT systems

• Consortium-mode project
• Funded by DeitY
• 11 participating institutes
• Nine language pairs
• 18 systems (both directions of each pair)

Participating institutions

IIIT Hyderabad (lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general-purpose MT systems from one IL to another for 9 language pairs, bidirectional.
• Deliver domain-specific versions of the MT systems. Domains: tourism and pilgrimage, plus one additional domain (health, agriculture, box-office reviews, electronic gadgets, instruction manuals, recipes, cricket reports).
• By-products: basic tools and lexical resources for Indian languages: POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (bidirectional): Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks:
H1 POS tagging & chunking engine; H2 Morph analyser engine; H3 Generator engine; H4 Lexical disambiguation engine; H5 Named entity engine; H6 Dictionary standards; H7 Corpora annotation standards; H8 Evaluation of output (comprehensibility); H9 Testing & integration

Vertical Tasks for Each Language:
V1 POS tagger & chunker; V2 Morph analyzer; V3 Generator; V4 Named entity recognizer; V5 Bilingual dictionary (bidirectional); V6 Transfer grammar; V7 Annotated corpus; V8 Evaluation; V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

• All 18 ILMT systems are integrated, with updated modules.
• Regular internal subjective evaluation is carried out at IIIT Hyderabad.
• Third-party evaluation by C-DAC Pune.

Evaluation scale (grade: description):
0 - Nonsense: the sentence doesn't make any sense at all; like someone speaking to you in a language you don't know.
1 - Some parts make sense, but it is not comprehensible overall; like listening to a language with many words borrowed from your language: you understand those words but nothing more.
2 - Comprehensible but has quite a few errors; like someone who can speak your language but makes lots of errors, yet you can make sense of what is being said.
3 - Comprehensible, occasional errors; like someone speaking Hindi but getting all its genders wrong.
4 - Perfect; like someone who knows the language.

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation, on 500 web sentences.

[Chart: module-wise precision and recall for Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator.]

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality:
  - Evaluated on 500 web sentences.
  - Accuracy calculated from the scores assigned to translation quality: 65.32%.
• Result analysis:
  - Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality.
  - Morph disambiguation, parsing, transfer grammar and function-word disambiguation modules are also required to improve accuracy.
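The slide reports an overall accuracy of 65.32% computed from per-sentence quality scores, but does not spell out the formula; one plausible reading, shown purely as an assumption, is the normalized average grade:

def accuracy_from_grades(grades, max_grade=4):
    """Normalized average grade, as a percentage."""
    return 100.0 * sum(grades) / (max_grade * len(grades))

# Hypothetical 0-4 quality grades for a handful of output sentences.
sample_grades = [3, 2, 4, 1, 3, 2, 4, 3]
print(round(accuracy_from_grades(sample_grades), 2))   # 68.75 for this made-up sample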

Important challenge of M-H translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, "Processing of Participle (Krudanta) in Marathi", International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories:

Nouns (verb -> noun):
• vaach (read) -> vaachaNe (reading)
• utara (climb down) -> utaraN (downward slope)

Adjectives (verb -> adjective):
• chav (bite) -> chaavaNaara (one who bites)
• khaa (eat) -> khallele (something that is eaten)

Kridantas derived from verbs (cont.): Adverbs (verb -> adverb):
• paL (run) -> paLataanaa (while running)
• bas (sit) -> basun (after sitting)

Kridanta Types

Type (aspect): example, with word-by-word gloss and translation.
• "-Ne" kridanta (Perfective): vaachNyaasaaThee pustak de (for-reading book give: "Give me a book for reading")
• "-laa" kridanta (Perfective): Lekh vaachalyaavar saaMgen (article after-reading will-tell: "I will tell you after reading the article")
• "-taanaa" kridanta (Durative): Pustak vaachtaanaa te lakShaat aale (book while-reading it in-mind came: "I noticed it while reading the book")
• "-lela" kridanta (Perfective): kaal vaachlele pustak de (yesterday read book give: "Give me the book that (I/you) read yesterday")
• "-Un" kridanta (Completive): pustak vaachun parat kar (book after-reading back do: "Return the book after reading it")
• "-Nara" kridanta (Stative): pustake vaachNaaRyaalaa dnyaan miLte (books to-the-one-who-reads knowledge gets: "The one who reads books gets knowledge")
• "-ve" kridanta (Inceptive): he pustak pratyekaane vaachaave (this book everyone should-read: "Everyone should read this book")
• "-taa" kridanta (Stative): to pustak vaachtaa vaachtaa zopee gelaa (he book while-reading to-sleep went: "He fell asleep while reading a book")

Participial Suffixes in Other Agglutinative Languages

Kannada: muridiruwaa kombe jennu esee
(gloss: broken to-branch throw; "Throw away the broken branch")
Similar to the -lela form frequently used in Marathi.

Telugu: ame padutunnappudoo nenoo panichesanoo
(gloss: she singing I work; "I worked while she was singing")
Similar to the -taanaa form frequently used in Marathi.

Turkish: hazirlanmis plan
(gloss: prepare-past plan; "The plan which has been prepared")
Equivalent to the Marathi -lelaa form.

Morphological Processing of Kridanta forms (cont)

Fig.: Morphotactics FSM for kridanta processing.
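A minimal two-state sketch of such morphotactics (a verb stem followed by a participial suffix), with the suffix-to-aspect mapping taken from the table above; the transliterated spellings and the toy stem list are simplifying assumptions, not the actual analyser's lexicon:

# Participial suffix -> aspect, simplified from the Kridanta Types table
# (transliterated spellings; the real analyser works on Devanagari).
KRIDANTA_SUFFIXES = {
    "taanaa": "durative",     # vaach + taanaa  "while reading"
    "lele":   "perfective",   # vaach + lele    "that which was read"
    "un":     "completive",   # vaach + un      "after reading"
    "Naaraa": "stative",      # vaach + Naaraa  "one who reads"
    "aave":   "inceptive",    # vaach + aave    "should read"
    "Ne":     "perfective",   # vaach + Ne      "reading"
}
VERB_STEMS = {"vaach", "paL", "bas", "utar", "chaav", "khaa"}   # toy stem lexicon

def analyse_kridanta(word):
    """Two-state morphotactics: accept STEM followed by one participial SUFFIX."""
    for suffix, aspect in KRIDANTA_SUFFIXES.items():
        if word.endswith(suffix) and word[: -len(suffix)] in VERB_STEMS:
            return word[: -len(suffix)], suffix, aspect
    return None   # not a kridanta form recognized by this toy FSM

print(analyse_kridanta("vaachtaanaa"))   # ('vaach', 'taanaa', 'durative')
print(analyse_kridanta("vaachun"))       # ('vaach', 'un', 'completive')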

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall per kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, Va), all in the 0.86-0.98 range.]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins.
• A relatively easier problem to solve.
• Will interlingua be better?
• Web sentences are being used to test the performance.
• Rule governed.
• Needs a high level of linguistic expertise.
• Will be an important contribution to IL MT.


Summary interlingua based MT

• English to Hindi.
• Rule governed.
• High level of linguistic expertise needed.
• Takes a long time to build (under development since 1996).
• But produces great insight, resources and tools.

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Linearization

इस आप माचर कषतरThis you manchar region

खाट तलक निकस4 khatav taluka farmers

सपक+ contact

you

this

farmer

agtobj

pur

plc

Contact सपक+

क0जि2ए

nam

orregion

khatav

manchar

taluka

nam

01

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Syntax Planning Assumptions

bull The relative word order of a UNL relationrsquos relata does not depend onndash Semantic Independence the

semantic properties of the relata

ndash Context Independence the rest of the expression

bull The relative word order of various relations sharing a relatum does not depend onndash Local Ordering the rest of the

expression

this

contact

pur

you

this

farmer

agt obj

contact

pur

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

agt aoj obj ins

Agt L L L

Aoj L L

Obj R

Ins

you

this

farmer

agt obj contact

pur

Syntax Planning Strategy

bull Divide a nodes relations in Before_set = objpuragt

After_set =

obj pur

agt

Topo Sort each group pur agt obj

this you farmer

Final order this you farmer contact

region

khatav

manchtaluka

Syntax Planning AlgoStack Before

CurrentAfterCurrent

Output

regionfarmercontact

this you

manchar taluka

mancharregiontalukafarmercontact

regiontalukafarmerContact

this youmanchar

taluka farmercontact

this youmancharregion

you

this

farmer

agtobj

pur

plc

Contact

nam

orregion

khatav

manchar

taluka

nam

01

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

All Together (UNL -gt Hindi)

Module Output UNL Expression

obj(contact(

Lexeme Selection

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar khatav Case Identification

Ȳ [ Ǒ ȡ Q iexcl ] centȯğQ ȡ Ǖ Q Ȳ ȡ

contact farmer this you region taluka manchar

khatav Morphology Generation

Ȳ [ ȧǔ f Ǒ ȡ ɉ iexcl ] centȯğ

contact imperative farmerpl this you region

ȡ Ǖ ȯ Ȳ ȡ

taluka manchar Khatav Function Word Insertion

Ȳ [ ȧǔ f Ǒ ȡɉ Ȫ ȯ Ǔ f ] centȯğ

contact farmers this for you region

ȡ ȡ Ǖȯ ȯ Ȳ ȡ

or taluka of Manchar Khatav Syntax Linearization

ȯ Ǔ f ] Ȳ centȯğ ȡ ȡ

This for you manchar region or khatav

ȡ Ǖȯ ȯ Ǒ ȡɉ Ȫ Ȳ [ ȧǔ f |

taluka of farmers contact

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

How to Evaluate UNL Deconversion

bull UNL -gt Hindibull Reference Translation needs to be generated from

UNLndash Needs expertise

bull Compromise Generate reference translation from original English sentence from which UNL was generatedndash Works if you assume that UNL generation was perfect

bull Note that fidelity is not an issue

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Manual Evaluation Guidelines

Fluency of the given translation is(4) Perfect Good grammar(3) Fair Easy-to-understand but flawed grammar(2) Acceptable Broken - understandable with effort(1) Nonsense Incomprehensible Adequacy How much meaning of the reference sentence is conveyed in the translation(4) All No loss of meaning(3) Most Most of the meaning is conveyed(2) Some Some of the meaning is conveyed(1) None Hardly any meaning is conveyed

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Sample Translations

Hindi Output Fluency Adequacy

कक क करण आमा क0 2क पततिय यदिद झलस र amp 05 परनितशत क बरडोD मिमाशरण 10 लटर प क सथ त लिGड़क 2 लिचए

2 3

परकषण क अरप नियमिमात रप स फल4 क0 अचछीL MजिN क लिलए खाद4 क0 खारकO दL 2 चनिए

3 4

माO माथ क फसल क बद बत P माथ और धीनिय फसल क अचछी बढ य अचछी बढ बत

1 1

2णविTक सकरमाण स इसक0 2ड़O परभनित त amp 4 4

इमा क पकष रटइट क परिरर क सबमिधीत त P और थड़ य शतरमागा+ क सथ समा दिदखात amp

2 3

Results

                          BLEU    Fluency   Adequacy
Geometric Average         0.34    2.54      2.84
Arithmetic Average        0.41    2.71      3.00
Standard Deviation        0.25    0.89      0.89
Pearson Corr. (BLEU)      1.00    0.59      0.50
Pearson Corr. (Fluency)   0.59    1.00      0.68

Fluency vs Adequacy

Figure: number of sentences (0 to 200) at each fluency level (1 to 4), broken down by adequacy score (1 to 4).

• Good correlation between Fluency and BLEU
• Strong correlation between Fluency and Adequacy
• Can do large-scale evaluation using Fluency alone (see the sketch below)

Caution: domain diversity, speaker diversity
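
To make the correlation claim concrete, the sketch below computes the Pearson coefficient between per-sentence scores. The scores in the example are invented for illustration; they are not the evaluation data behind the table above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence scores (illustration only, not the real evaluation data).
fluency  = [4, 3, 2, 4, 1, 3, 2, 4]
adequacy = [4, 3, 3, 4, 1, 2, 2, 4]
bleu     = [0.61, 0.42, 0.30, 0.58, 0.10, 0.37, 0.25, 0.66]

print(round(pearson(fluency, adequacy), 2))  # fluency vs adequacy
print(round(pearson(bleu, fluency), 2))      # BLEU vs fluency
```

A high Fluency-Adequacy coefficient on the real data is what licenses evaluating on fluency alone.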

Transfer Based MT

Marathi-Hindi


Figure: kinds of MT systems, arranged by depth of analysis and kind of transfer.
Levels of analysis: text, tagged text, morpho-syntactic (syntagmatic) level, syntactico-functional level, logico-semantic level, interlingual level, deep understanding level.
Kinds of transfer: direct translation (graphemic level), semi-direct translation, syntactic transfer (surface), syntactic transfer (deep), semantic transfer, multilevel transfer (mixing levels / multilevel description), conceptual transfer.
Intermediate representations: C-structures (constituent), F-structures (functional), SPA-structures (semantic & predicate-argument), semantico-linguistic interlingua, ontological interlingua.

• Consortium mode project
• Funded by DeiTY
• 11 participating institutes
• Nine language pairs
• 18 systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute), University of Hyderabad, IIT Bombay, IIT Kharagpur, AUKBC Chennai, Jadavpur University Kolkata, Tamil University Thanjavur, IIIT Trivandrum, IIIT Allahabad, IISc Bangalore, CDAC Noida

Objectives:
• Develop general purpose MT systems from one IL to another for 9 language pairs (bidirectional)
• Deliver domain-specific versions of the MT systems; domains are tourism and pilgrimage, plus one additional domain (health, agriculture, box office reviews, electronic gadgets, instruction manuals, recipes, cricket reports)
• By-products: basic tools and lexical resources for Indian languages – POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers, etc.; bidirectional bilingual dictionaries, annotated corpora, etc.

Language Pairs (Bidirectional)

Tamil-Hindi, Telugu-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1: POS tagging & chunking engine; H2: Morph analyser engine; H3: Generator engine; H4: Lexical disambiguation engine; H5: Named entity engine; H6: Dictionary standards; H7: Corpora annotation standards; H8: Evaluation of output (comprehensibility); H9: Testing & integration

Vertical Tasks for Each Language

V1: POS tagger & chunker; V2: Morph analyzer; V3: Generator; V4: Named entity recognizer; V5: Bilingual dictionary (bidirectional); V6: Transfer grammar; V7: Annotated corpus; V8: Evaluation; V9: Co-ordination

These engines and resources compose into a single source-to-target pipeline, sketched schematically below.
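
The sketch that follows is only a schematic of such a composition; the stage names, data shapes and the toy lexical substitution are invented for illustration and are not the actual Sampark code.

```python
# Schematic ILMT-style pipeline: each stage is a stub standing in for a real
# module (analyser, tagger, chunker, transfer, generator, ...).

def analyse_morph(tokens):       # morph analysis: wrap tokens with root/features
    return [{"tok": t, "root": t, "feats": {}} for t in tokens]

def tag_pos(units):              return units   # POS tagging (stub)
def chunk(units):                return units   # chunking (stub)
def compute_vibhakti(units):     return units   # vibhakti / case-marker computation (stub)
def disambiguate_senses(units):  return units   # WSD (stub)

def transfer_lexicon(units):     # toy source-to-target lexical substitution
    bilingual = {"pustak": "kitaab"}
    for u in units:
        u["tok"] = bilingual.get(u["tok"], u["tok"])
    return units

def generate_words(units):       # word generation: linearize back to text
    return " ".join(u["tok"] for u in units)

PIPELINE = [analyse_morph, tag_pos, chunk, compute_vibhakti,
            disambiguate_senses, transfer_lexicon, generate_words]

def translate(sentence):
    data = sentence.split()
    for stage in PIPELINE:
        data = stage(data)
    return data

print(translate("he pustak de"))   # -> "he kitaab de"
```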

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade: Description)

0: Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1: Some parts make sense but it is not comprehensible overall (e.g. listening to a language which has lots of words borrowed from your language; you understand those words but nothing more)
2: Comprehensible but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense out of what is being said)
3: Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)
4: Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

Figure: module-wise precision and recall (scale 0 to 1.2) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules. A generic precision/recall computation is sketched below.
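
Module-wise precision and recall are computed in the usual way from a module's output decisions and a gold standard. The sketch below shows the computation on made-up counts; it is not the reported data.

```python
def precision_recall(system_out, gold):
    """system_out and gold are sets of (item_id, decision) pairs."""
    correct = len(system_out & gold)
    precision = correct / len(system_out) if system_out else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical POS-tagging decisions for a few tokens.
gold = {(1, "N"), (2, "V"), (3, "N"), (4, "ADJ")}
system_out = {(1, "N"), (2, "V"), (3, "ADJ"), (5, "N")}

p, r = precision_recall(system_out, gold)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.50 recall=0.50
```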

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores given for translation quality
  – Accuracy: 65.32% (one plausible scoring scheme is sketched below)
• Result analysis
  – Morph analyser, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and FW disambiguation modules are also required to improve accuracy
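
The slides do not spell out how the 0-4 grades were folded into the 65.32% figure; one plausible scheme (total score over the maximum possible score) is sketched below on a hypothetical grade distribution, so the number it prints will not match the reported accuracy.

```python
def subjective_accuracy(grades, max_grade=4):
    """Fold subjective 0..max_grade grades into a single percentage."""
    return 100.0 * sum(grades) / (max_grade * len(grades))

# Hypothetical distribution of grades over 500 evaluated sentences.
grades = [4] * 120 + [3] * 180 + [2] * 120 + [1] * 50 + [0] * 30
print(f"{subjective_accuracy(grades):.2f}%")   # 65.50% on this made-up data
```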

Important challenge of M-H Translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, "Processing of Participle (Krudanta) in Marathi", International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories.

Nouns (Verb → Noun):
  वाच vaach (read) → वाचणे vaachaNe (reading)
  उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
  चाव chav (bite) → चावणारा chaavaNaara (one who bites)
  खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb):
  पळ paL (run) → पळताना paLataanaa (while running)
  बस bas (sit) → बसून basun (after sitting)

Kridanta Types (Kridanta Type | Example | Aspect)

णे Ne-Kridanta – vaachNyaasaaThee pustak de (Give me a book for reading) [For reading book give] – Perfective
ला laa-Kridanta – Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [Article after reading will tell] – Perfective
ताना Taanaa-Kridanta – Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [Book while reading it in mind came] – Durative
लेले Lela-Kridanta – kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [Yesterday read book give] – Perfective
ऊन Un-Kridanta – pustak vaachun parat kar (Return the book after reading it) [Book after reading back do] – Completive
णारा Nara-Kridanta – pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [Books to the one who reads knowledge gets] – Stative
वे ve-Kridanta – he pustak pratyekaane vaachaave (Everyone should read this book) [This book everyone should read] – Inceptive
ता taa-Kridanta – to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [He book while reading to sleep went] – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

– similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

– equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing
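
In the spirit of the morphotactics FSM in the figure, the following is a toy, longest-suffix-first recognizer for the kridanta types tabulated earlier. The transliterated suffix strings and the tiny verb lexicon are assumptions made for illustration; the system described in the paper presumably operates on Devanagari with a full morphotactic FSM.

```python
# Toy kridanta recognizer: strip a participial suffix, check the remaining
# stem against a verb lexicon, and report the kridanta type and aspect.

KRIDANTA_SUFFIXES = {
    "NyaasaaThee": ("Ne-Kridanta", "Perfective"),
    "lyaavar":     ("laa-Kridanta", "Perfective"),
    "taanaa":      ("Taanaa-Kridanta", "Durative"),
    "lele":        ("Lela-Kridanta", "Perfective"),
    "un":          ("Un-Kridanta", "Completive"),
    "NaaRyaalaa":  ("Nara-Kridanta", "Stative"),
    "aave":        ("ve-Kridanta", "Inceptive"),
    "taa":         ("taa-Kridanta", "Stative"),
}

VERB_STEMS = {"vaach", "paL", "bas", "khaa"}   # tiny illustrative lexicon

def analyse_kridanta(word):
    # Try longer suffixes first so "taanaa" is not swallowed by "taa".
    for suffix in sorted(KRIDANTA_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in VERB_STEMS:
                ktype, aspect = KRIDANTA_SUFFIXES[suffix]
                return {"stem": stem, "type": ktype, "aspect": aspect}
    return None

print(analyse_kridanta("vaachtaanaa"))   # Taanaa-Kridanta, Durative
print(analyse_kridanta("vaachlele"))     # Lela-Kridanta, Perfective
print(analyse_kridanta("vaachun"))       # Un-Kridanta, Completive
```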

Accuracy of Kridanta Processing: Direct Evaluation

Figure: precision and recall (roughly 0.86 to 0.98) of kridanta processing for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta forms.

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Results

BLEU Fluency Adequacy Geometric Average 034 254 284 Arithmetic Average 041 271 300 Standard Deviation 025 089 089 Pearson Cor BLEU 100 059 050

Pearson Cor Fluency 059 100 068

Fluency vs Adequacy

315 3 4

34

165

2662

155138

210 12

131

168

0

50

100

150

200

1 2 3 4

Fluency

Nu

mb

er o

f s

ente

nce

s

Adequacy 1 Adequacy 2 Adequacy 3 Adequacy 4

bull Good Correlation between Fluency and BLUEbull Strong Correlation betweenFluency and Adequacybull Can do large scale evaluation

using Fluency alone

Caution Domain diversity Speaker diversity

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Transfer Based MT

Marathi-Hindi

106

Deep understanding level

Interlingual level

Logico-semantic level

Syntactico-functional level

Morpho-syntactic level

Syntagmatic level

Graphemic leve l Direct translation

Syntactic transfer (surface )

Syntactic transfer (deep)

Conceptual transfer

Semantic transfer

Multilevel transfer

Ontological interlingua

Semantico-linguistic interlingua

SPA-structures (semanticamp predicate-arg ument)

F-structures (functional)

C-structures (constituent)

Tagged text

Text

Mixing levels Multilevel descriptio n

Semi-direct translatio n

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

bull Consortium mode project

bull Funded by DeiTY

bull 11 Participating Institutes

bull Nine language pairs

bull 18 Systems

Sampark IL-IL MT systems

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Participating institutions

IIIT Hyderabad (Lead institute) University of Hyderabad IIT Bombay IIT Kharagpur AUKBC Chennai Jadavpur University Kolkata Tamil University Thanjavur IIIT Trivandrum IIIT Allahabad IISc Bangalore CDAC Noida

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Objectives Develop general purpose MT systems from one IL to another

for 9 language pairs

Bidirectional

Deliver domain specific versions of the MT systems Domains are Tourism and pilgrimage One additional domain (healthagriculture box office reviews electronic

gadgets instruction manuals recipes cricket reports)

By-products basic tools and lexical resources for Indian languages

POS taggers chunkers morph analysers shallow parsers NERs parsers etc

Bidirectional bilingual dictionaries annotated corpora etc

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2
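SSF (Shakti Standard Format) typically lays out one token per line in tab-separated columns — address, token, POS/chunk tag, feature structure — with chunks bracketed by "((" and "))". A rough parsing sketch over a made-up fragment (an assumption for illustration, not the example shown on the slides):

    # Minimal sketch of reading a simplified SSF fragment.
    # The fragment is illustrative only -- it is NOT the slides' example.
    ssf_fragment = (
        "1\t((\tNP\t<fs af='raama,n,m,sg,3,d,0,0'>\n"
        "1.1\traama\tNNP\t\n"
        "\t))\t\t\n"
        "2\t((\tVGF\t\n"
        "2.1\tkhaataa\tVM\t\n"
        "2.2\thai\tVAUX\t\n"
        "\t))\t\t"
    )

    def parse_ssf(block):
        """Return (address, token, tag, feature-structure) tuples, skipping chunk closers."""
        rows = []
        for line in block.splitlines():
            cols = (line.split("\t") + [""] * 4)[:4]
            address, token, tag, fs = cols
            if token == "))":  # chunk-closing line carries no token of its own
                continue
            rows.append((address, token, tag, fs))
        return rows

    for row in parse_ssf(ssf_fragment):
        print(row)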

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale

Grade – Description
0 – Nonsense (the sentence doesn't make any sense at all; it is like someone speaking to you in a language you don't know)
1 – Some parts make sense, but it is not comprehensible overall (e.g., listening to a language which has lots of words borrowed from your language: you understand those words but nothing more)
2 – Comprehensible but has quite a few errors (e.g., someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)
3 – Comprehensible, occasional errors (e.g., someone speaking Hindi getting all its genders wrong)
4 – Perfect (e.g., someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

[Figure: module-wise precision and recall (0–1.2 scale) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules.]
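For reference, the per-module scores are ordinary counts-based precision and recall; a small helper with made-up counts (the numbers below are illustrative, not the actual evaluation figures):

    # Precision / recall as used for module-wise evaluation (illustrative counts).
    def precision_recall(true_positives, false_positives, false_negatives):
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return precision, recall

    # e.g. a hypothetical morph analyser's output over the 500-sentence test set
    p, r = precision_recall(true_positives=4500, false_positives=300, false_negatives=400)
    print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.94, recall=0.92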

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
  – Evaluated on 500 web sentences
  – Accuracy calculated from the scores assigned for translation quality
  – Accuracy: 65.32%

• Result analysis
  – Morph analyzer, POS tagger and chunker give more than 90% precision, but the Transfer, WSD and generator modules are below 80%, which degrades MT quality
  – Morph disambiguation, parsing, transfer grammar and function-word (FW) disambiguation modules are also required to improve accuracy
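The slides report the subjective quality as a single percentage (65.32%) derived from the 0–4 grades; one plausible convention, assumed here rather than stated on the slides, is to normalise the mean grade by the maximum grade of 4:

    # Hedged sketch: turning per-sentence 0-4 comprehensibility grades into one
    # percentage. The normalisation is an assumption, not the slides' stated formula;
    # the grades below are made-up examples, not the actual 500-sentence scores.
    grades = [4, 3, 2, 3, 4, 1, 3, 2]
    accuracy = 100.0 * sum(grades) / (4 * len(grades))
    print(f"Subjective accuracy: {accuracy:.2f}%")   # 68.75% for this toy list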

Important challenge of M-H Translation: Morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, Processing of Participle (Krudanta) in Marathi, International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun)
• वाच vaach (read) → वाचणे vaachaNe (reading)
• उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective)
• चाव chaav (bite) → चावणारा chaavaNaara (one who bites)
• खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont.)

Adverbs (Verb → Adverb)
• पळ paL (run) → पळताना paLataanaa (while running)
• बस bas (sit) → बसून basun (after sitting)

Kridanta Types

Kridanta Type – Example – Aspect
• Ne-Kridanta ("-Ne"): vaachNyaasaaThee pustak de (Give me a book for reading) [gloss: for-reading book give] – Perfective
• laa-Kridanta ("-laa"): Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [gloss: article after-reading will-tell] – Perfective
• Taanaa-Kridanta ("-taanaa"): Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [gloss: book while-reading it in-mind came] – Durative
• Lela-Kridanta ("-lele"): kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [gloss: yesterday read book give] – Perfective
• Un-Kridanta ("-un"): pustak vaachun parat kar (Return the book after reading it) [gloss: book after-reading back do] – Completive
• Nara-Kridanta ("-Naaraa"): pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [gloss: books to-the-one-who-reads knowledge gets] – Stative
• ve-Kridanta ("-ve"): he pustak pratyekaane vaachaave (Everyone should read this book) [gloss: this book everyone should-read] – Inceptive
• taa-Kridanta ("-taa"): to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [gloss: he book while-reading to-sleep went] – Stative

Participial Suffixes in Other Agglutinative Languages

Kannada
  muridiruwaa kombe jennu esee
  [gloss: broken to branch throw]
  "Throw away the broken branch"
  – similar to the lela form frequently used in Marathi

Telugu
  ame padutunnappudoo nenoo panichesanoo
  [gloss: she singing I work]
  "I worked while she was singing"
  – similar to the taanaa form frequently used in Marathi

Turkish
  hazirlanmis plan
  [gloss: prepare-past plan]
  "The plan which has been prepared"
  – equivalent to the Marathi lelaa form

Morphological Processing of Kridanta forms (cont.)

[Figure: Morphotactics FSM for Kridanta Processing]
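The morphotactics amount to "verb stem followed by one kridanta suffix, with the suffix deciding the derived POS". A hypothetical sketch of that idea (the suffix inventory and POS mapping are illustrative assumptions, not the paper's actual automaton; linking vowels are ignored):

    # Hypothetical sketch of kridanta morphotactics as a tiny finite-state lexicon:
    # a verb stem is followed by exactly one kridanta suffix, and the suffix decides
    # the POS of the derived form. Suffixes and POS mapping are illustrative only.
    KRIDANTA_SUFFIXES = {
        "Ne": "noun",          # e.g. vaachaNe (reading)
        "taanaa": "adverb",    # e.g. vaachtaanaa (while reading)
        "un": "adverb",        # e.g. vaachun (after reading)
        "Naaraa": "adjective", # e.g. vaachNaaraa (one who reads)
        "lele": "adjective",   # e.g. vaachlele (that which was read)
    }

    def analyse(word, stems):
        """Return (stem, suffix, derived POS) if word = known stem + kridanta suffix."""
        for suffix, pos in KRIDANTA_SUFFIXES.items():
            if word.endswith(suffix):
                stem = word[: -len(suffix)]
                if stem in stems:
                    return stem, suffix, pos
        return None  # not analysable as stem + single kridanta suffix

    print(analyse("vaachtaanaa", {"vaach", "paL", "bas"}))  # ('vaach', 'taanaa', 'adverb')
    print(analyse("basun", {"vaach", "paL", "bas"}))        # ('bas', 'un', 'adverb')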

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: precision and recall for each kridanta type (Ne, La, Nara, Lela, Tana, T, Oon, Va), all values in the 0.86–0.98 range.]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs a high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Language Pairs (Bidirectional)

Tamil-Hindi Telugu-Hindi Marathi-Hindi Bengali-Hindi Tamil-Telugu Urdu-Hindi Kannada-Hindi Punjabi-Hindi Malayalam-Tamil

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Sampark Architecture

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Horizontal Tasks

H1 POS Tagging amp Chunking engine H2 Morph analyser engine H3 Generator engine H4 Lexical disambiguation engine H5 Named entity engine H6 Dictionary standards H7 Corpora annotation standards H8 Evaluation of output (comprehensibility) H9 Testing amp integration

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Vertical Tasks for Each Language

V1 POS tagger amp chunker V2 Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual dictionary ndash bidirectional V6 Transfer grammar V7 Annotated corpus V8 Evaluation V9 Co-ordination

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

SSF Example-1

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

SSF Example-2

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation Scale (Grade – Description)

0 – Nonsense (the sentence doesn't make any sense at all – it is like someone speaking to you in a language you don't know)

1 – Some parts make sense, but it is not comprehensible overall (e.g. listening to a language which has lots of borrowed words from your language – you understand those words but nothing more)

2 – Comprehensible, but has quite a few errors (e.g. someone who can speak your language but makes lots of errors; however, you can make sense of what is being said)

3 – Comprehensible, occasional errors (e.g. someone speaking Hindi getting all its genders wrong)

4 – Perfect (e.g. someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

• Module-wise evaluation – evaluated on 500 web sentences

[Figure: module-wise precision and recall (bar chart, y-axis 0–1.2) for the Morph Analyzer, POS Tagger, Chunker, Vibhakti Compute, WSD, Lexical Transfer and Word Generator modules]
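
The bar chart itself cannot be recovered from the transcript, but the per-module scoring behind it is standard precision/recall bookkeeping against gold annotations of the same 500 sentences. The sketch below is only an illustration of that bookkeeping: the `precision_recall` helper, the (sentence_id, token_id, label) tuple format and the toy `modules` data are assumptions, not the actual ILMT evaluation harness.

```python
# Illustrative sketch of per-module precision/recall computation; the helper,
# the tuple format and the toy data are assumptions, not the ILMT code.

def precision_recall(gold, predicted):
    """gold, predicted: sets of (sentence_id, token_id, label) tuples."""
    tp = len(gold & predicted)                      # items the module got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical outputs for two modules on one sentence; in the real setting
# these come from running each pipeline stage on the 500 test sentences.
modules = {
    "POS tagger": ({("s1", 1, "NN"), ("s1", 2, "VM")},   # gold
                   {("s1", 1, "NN"), ("s1", 2, "VM")}),  # predicted
    "WSD":        ({("s1", 1, "sense_3")},               # gold
                   {("s1", 1, "sense_7")}),              # predicted
}

for name, (gold, pred) in modules.items():
    p, r = precision_recall(gold, pred)
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
```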

Evaluation of Marathi to Hindi MT System (cont)

• Subjective evaluation of translation quality
– Evaluated on 500 web sentences
– Accuracy calculated from the score given according to the translation quality
– Accuracy: 65.32%

• Result analysis
– Morph analyzer, POS tagger and chunker give more than 90% precision, but the transfer, WSD and generator modules are below 80%, which degrades MT quality
– Morph disambiguation, parsing, transfer grammar and function-word (FW) disambiguation modules are also required to improve accuracy
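
The slides report the overall figure (65.32%) but not the exact formula for turning the 0–4 grades into a percentage. A common convention, assumed here purely for illustration, is the mean grade normalised by the maximum grade of 4; the `subjective_accuracy` function and the sample grades below are hypothetical.

```python
# Hedged sketch: one plausible way to turn per-sentence 0-4 adequacy grades
# into a single accuracy percentage (mean grade / max grade). The formula is
# an assumption; the slides only report the final 65.32% figure.

def subjective_accuracy(grades, max_grade=4):
    return 100.0 * sum(grades) / (max_grade * len(grades))

grades = [4, 3, 2, 3, 1, 4, 2]                            # hypothetical grades for 7 sentences
print(f"accuracy = {subjective_accuracy(grades):.2f}%")   # -> accuracy = 67.86%
```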

Important challenge of M-H Translation: morphology processing (kridanta)

Ganesh Bhosale, Subodh Kembhavi, Archana Amberkar, Supriya Mhatre, Lata Popale and Pushpak Bhattacharyya, "Processing of Participle (Krudanta) in Marathi", International Conference on Natural Language Processing (ICON 2011), Chennai, December 2011.

Kridantas can be in multiple POS categories

Nouns (Verb → Noun):
वाच vaach (read) → वाचणे vaachaNe (reading)
उतर utara (climb down) → उतरण utaraN (downward slope)

Adjectives (Verb → Adjective):
चाव chav (bite) → चावणारा chaavaNaara (one who bites)
खा khaa (eat) → खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb → Adverb):
पळ paL (run) → पळताना paLataanaa (while running)
बस bas (sit) → बसून basun (after sitting)
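
The derivations above can be summarised as a small lookup table. The sketch below uses the romanised forms only and a hypothetical `DERIVATIONS` dict, purely to illustrate that one derivational process feeds several POS categories; it is not part of the actual Marathi analyser.

```python
# Toy summary of the kridanta derivations listed above (romanised forms,
# glosses as on the slide). Purely illustrative; not the real analyser's data.

DERIVATIONS = {
    "vaachaNe":    ("vaach", "Noun",      "reading"),
    "utaraN":      ("utara", "Noun",      "downward slope"),
    "chaavaNaara": ("chav",  "Adjective", "one who bites"),
    "khallele":    ("khaa",  "Adjective", "something that is eaten"),
    "paLataanaa":  ("paL",   "Adverb",    "while running"),
    "basun":       ("bas",   "Adverb",    "after sitting"),
}

for form, (stem, pos, gloss) in DERIVATIONS.items():
    print(f"{stem:5s} -> {form:12s} [{pos:9s}] {gloss}")
```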

Kridanta Types (type – example – aspect)

Ne-Kridanta: vaachNyaasaaThee pustak de (Give me a book for reading) [for reading – book – give] – Perfective

laa-Kridanta: Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article – after reading – will tell] – Perfective

Taanaa-Kridanta: Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book – while reading – it – in mind – came] – Durative

Lela-Kridanta: kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday – read – book – give] – Perfective

Un-Kridanta: pustak vaachun parat kar (Return the book after reading it) [book – after reading – back – do] – Completive

Nara-Kridanta: pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books – to the one who reads – knowledge – gets] – Stative

ve-Kridanta: he pustak pratyekaane vaachaave (Everyone should read this book) [this book – everyone – should read] – Inceptive

taa-Kridanta: to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he – book – while reading – to sleep – went] – Stative
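
The pairing of suffix and aspect in the list above lends itself to a simple longest-suffix lookup. The sketch below is an assumption-laden illustration: the romanised suffix strings (`Nyaa`, `lyaa`, `taanaa`, ...) are read off the examples, and a real analyser would operate on Devanagari and handle sandhi and allomorphy.

```python
# Illustrative longest-suffix lookup over romanised kridanta suffixes.
# Suffix spellings are guessed from the slide's examples; a real analyser
# works on Devanagari and handles sandhi, which this toy table ignores.

KRIDANTA_ASPECT = {
    "Nyaa":   ("Ne-Kridanta",     "Perfective"),
    "lyaa":   ("laa-Kridanta",    "Perfective"),
    "taanaa": ("Taanaa-Kridanta", "Durative"),
    "lele":   ("Lela-Kridanta",   "Perfective"),
    "un":     ("Un-Kridanta",     "Completive"),
    "Naara":  ("Nara-Kridanta",   "Stative"),
    "aave":   ("ve-Kridanta",     "Inceptive"),
    "taa":    ("taa-Kridanta",    "Stative"),
}

def classify(word):
    """Return (kridanta type, aspect) for the longest matching suffix, else None."""
    for suffix, info in sorted(KRIDANTA_ASPECT.items(), key=lambda kv: -len(kv[0])):
        if word.endswith(suffix):
            return info
    return None

print(classify("vaachtaanaa"))   # ('Taanaa-Kridanta', 'Durative')
print(classify("vaachlele"))     # ('Lela-Kridanta', 'Perfective')
```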

Participial Suffixes in Other Agglutinative Languages

Kannada:
muridiruwaa kombe jennu esee
(gloss: broken – to branch – throw)
"Throw away the broken branch"
– similar to the 'lela' form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu:
ame padutunnappudoo nenoo panichesanoo
(gloss: she singing – I work)
"I worked while she was singing"
– similar to the 'taanaa' form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish:
hazirlanmis plan
(gloss: prepare-PAST plan)
"The plan which has been prepared"
– equivalent to the Marathi 'lelaa' form

Morphological Processing of Kridanta forms (cont)

[Figure: morphotactics FSM for kridanta processing]
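
The FSM itself is only referenced as a figure, but the morphotactic idea – a verb stem followed by a kridanta suffix, optionally followed by a case/postposition layer – can be sketched as a tiny state machine. Everything in the sketch (the stem list, the suffix sets, the `analyse` function) is an assumption for illustration, not the FSM from the Bhosale et al. paper.

```python
# Minimal morphotactics sketch: STEM -> KRIDANTA -> (CASE) -> accept.
# Stems, suffixes and segmentation are toy assumptions, not the paper's FSM.

STEMS = {"vaach", "paL", "bas"}                       # toy verb stems (romanised)
KRIDANTA_SUFFIXES = {"Nyaa", "taanaa", "lele", "un"}  # toy participial suffixes
CASE_SUFFIXES = {"saaThee", "var", ""}                # toy case/postposition layer

def analyse(word):
    """Return (stem, kridanta, case) if the word traverses STEM -> KRIDANTA -> CASE."""
    for stem in STEMS:
        if not word.startswith(stem):
            continue
        rest = word[len(stem):]
        for krid in KRIDANTA_SUFFIXES:
            if not rest.startswith(krid):
                continue
            case = rest[len(krid):]
            if case in CASE_SUFFIXES:
                return stem, krid, case
    return None

print(analyse("vaachNyaasaaThee"))  # ('vaach', 'Nyaa', 'saaThee')
print(analyse("vaachtaanaa"))       # ('vaach', 'taanaa', '')
```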

Accuracy of Kridanta Processing Direct Evaluation

[Figure: precision and recall of kridanta processing (approx. 0.86–0.98) for the Ne-, La-, Nara-, Lela-, Tana-, T-, Oon- and Va-Kridanta forms]

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT


Summary interlingua based MT

• English to Hindi
• Rule governed
• High level of linguistic expertise needed
• Takes a long time to build (since 1996)
• But produces great insight, resources and tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Evaluation

All 18 ILMT systems are integrated with updated modules

Regular internal subjective evaluation is carried out at IIIT Hyderabad

Third party evaluation by C-DAC Pune

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Evaluation ScaleGrade Description

0Nonsense (If the sentence doesnt make any sense at all ndash it is like someone speaking to you in a language you dont know)

1

Some parts make sense but is not comprehensible over all (eg listening to a language which has lots of borrowed words from your language ndash you understand those words but nothing more)

2Comprehensible but has quite a few errors (eg someone who can speak your language but would make lots of errors However you can make sense out of what is being said)

3 Comprehensible occasional errors (eg someone speaking Hindi getting all its genders wrong)

4 Perfect (eg someone who knows the language)

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

ILMT Evaluation Results

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Evaluation of Marathi to Hindi MT System

bull Module-wise evaluation ndash Evaluated on 500 web sentences

Morph Ana-lyzer

POS Tagger Chunker Vibhakti Compute

WSD Lexical Transfer

Word Generator

0

02

04

06

08

1

12

Precision

Recall

Module-wise precision and recall

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Evaluation of Marathi to Hindi MT System (cont)

bull Subjective evaluation on translation qualityndash Evaluated on 500 web sentencesndash Accuracy calculated based on score given according to the

translation qualityndash Accuracy 6532

bull Result analysisndash Morph POS tagger chunker gives more than 90 precision

but Transfer WSD generator modules are below 80 hence degrades MT quality

ndash Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Important challenge of M-H Translation-

Morphology processingkridanta

Ganesh Bhosale Subodh Kembhavi Archana Amberkar Supriya Mhatre Lata Popale and Pushpak Bhattacharyya Processing of Participle (Krudanta) in Marathi International Conference on Natural Language Processing (ICON 2011)

Chennai December 2011

Kridantas can be in multiple POS categories Nouns Verb Noun

च vaachread चण vaachaNereading

उतर utaraclimb down उतरण utaraNdownward slope

Adjectives Verb Adjective

च chavbite चणर chaavaNaaraone who bites

खा khaa eat खाललल khallele something that is eaten

Kridantas derived from verbs (cont)

Adverbs

Verb Adverb

पळ paLrun पळताना paLataanaawhile running

बस bassit ब सना basunafter sitting

Kridanta TypesKridanta Type

Example Aspect

ldquo rdquo ण Ne-Kridanta

vaachNyaasaaThee pustak de (Give me a book for reading)

For reading book give

Perfective

ldquo rdquo ल laa-Kridanta

Lekh vaachalyaavar saaMgen (I will tell you that after reading the article)

Article after reading will tell

Perfective

ldquo rdquoतTaanaa-Kridanta

Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book)

Book while reading it in mind came

Durative

ldquo rdquoललLela-Kridanta

kaal vaachlele pustak de (Give me the book that (Iyou) read yesterday )

Yesterday read book give

Perfective

ldquo rdquoऊ Un-Kridanta

pustak vaachun parat kar (Return the book after reading it)

Book after reading back do

Completive

ldquo rdquoणर Nara-Kridanta

pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge)

Books to the one who reads knowledge gets

Stative

ldquo rdquo ve-Kridanta

he pustak pratyekaane vaachaave (Everyone should read this book)

This book everyone should read

Inceptive

ldquo rdquo त taa-Kridanta

to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book)

He book while reading to sleep went

Stative

Participial Suffixes in Other Agglutinative Languages

Kannada

muridiruwaa kombe jennu esee

Broken to branch throw

Throw away the broken branch

- similar to the lela form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu

ame padutunnappudoo nenoo panichesanoo

she singing I work

I worked while she was singing

-similar to the taanaa form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish

hazirlanmis plan

prepare-past plan

The plan which has been prepared

Eqv Marathi lelaa

Morphological Processing of Kridanta forms (cont)

Fig Morphotactics FSM for Kridanta Processing

Accuracy of Kridanta Processing Direct Evaluation

086

088

09

092

094

096

098

NeKridanta

LaKridanta

NaraKridanta

LelaKridanta

TanaKridanta

TKridanta

OonKridanta

VaKridanta

Precision

Recall

Summary of M-H transfer based MT

bull Marathi and Hindi are close cousinsbull Relatively easier problem to solvebull Will interlingua be betterbull Web sentences being used to test the

performancebull Rule governedbull Needs high level of linguistic expertisebull Will be an important contribution to IL MT

130

131

Summary interlingua based MT

bull English to Hindibull Rule governedbull High level of linguistic expertise neededbull Takes a long time to build (since 1996)bull But produces great insight resources and

tools

  • CS712 Knowledge Based MT
  • Empiricism vs Rationalism
  • Introduction
  • Kinds of MT Systems (point of entry from source to the target t
  • Why is MT difficult Language Divergence
  • Language Divergence Theory Lexico-Semantic Divergences
  • Language Divergence Theory Lexico-Semantic Divergences (cntd)
  • Language Divergence Theory Syntactic Divergences
  • Language Divergence Theory Syntactic Divergences (2)
  • Language Divergence exists even for close cousins (Marathi-Hind
  • Interlingua Based MT
  • KBMT
  • IL based MT schematic
  • Specification of KBMT
  • Knowledge base of KBMT (12)
  • Knowledge base of KBMT (22)
  • Underlying formlisms
  • Procedural components
  • Architecture of KBMT
  • Digression language typology
  • Slide 21
  • Language and dialects
  • ldquoLinguisticrdquo interlingua
  • Slide 24
  • ldquoInterlinguardquo and ldquoEsperantordquo
  • Slide 26
  • Language Universals vs Language Typology
  • Typology basic word order
  • Typology Morphotactics of singular and plural
  • Typology Implication of word order
  • Typology motion verbs (13)
  • Typology motion verbs (23)
  • Typology motion verbs (33)
  • IITB effort at KBMT
  • MT EnConverion + Deconversion
  • Slide 36
  • UNL a United Nations project
  • World-wide Universal Networking Language (UNL) Project
  • UNL represents knowledge John eats rice with a spoon
  • Sentence embeddings
  • The Lexicon
  • The Lexicon (cntd)
  • How to obtain UNL expresssions
  • Enconversion
  • System Architecture
  • Complexity of handling long sentences
  • Solution
  • Text Simplifier
  • UNL Merging
  • NLP tools and resources for UNL generation
  • System Processing Units
  • Syntactic processing
  • Syntactic Processing NER
  • Syntactic Processing Stanford Dependency Parser (12)
  • Syntactic Processing Stanford Dependency Parser (22)
  • Syntactic Processing XLE Parser
  • Semantic processing
  • Semantic Processing Finding stems
  • Semantic Processing WSD
  • Semantic Processing Universal Word Generation
  • Semantic Processing Nouns originating from verbs
  • Features of UWs
  • Semantic Processing Noun Features
  • Types of verbs
  • Semantic Processing Verb Features
  • Generation of relations and attributes
  • Relation Generation (12)
  • Relation Generation (22)
  • Attribute Generation (12)
  • Attribute Generation (22)
  • Evaluation of our system
  • Results
  • Results Metrics
  • Results Relation-wise accuracy
  • Results Overall accuracy
  • Results Relation
  • Results Relation-wise Precision
  • Results Relation-wise Recall
  • Results Relation-wise F-score
  • Results Attribute
  • Results Time-taken
  • Error analysis
  • Wrong Relations Generated
  • Missed Relations
  • Case Study (13)
  • Case Study (23)
  • Case Study (33)
  • Hindi Generation from Interlingua (UNL)
  • HinD Architecture
  • Step-through the Deconverter
  • Lexeme Selection
  • Case Marking
  • Morphology Generation Nouns
  • Verb Morphology
  • After Morphology Generation
  • Function Word Insertion
  • Linearization
  • Syntax Planning Assumptions
  • Syntax Planning Strategy
  • Syntax Planning Algo
  • All Together (UNL -gt Hindi)
  • How to Evaluate UNL Deconversion
  • Manual Evaluation Guidelines
  • Sample Translations
  • Results (2)
  • Transfer Based MT
  • Sampark IL-IL MT systems
  • Slide 108
  • Slide 109
  • Slide 110
  • Slide 111
  • Slide 112
  • Slide 113
  • Slide 114
  • Slide 115
  • Slide 116
  • Slide 117
  • Slide 118
  • Evaluation of Marathi to Hindi MT System
  • Evaluation of Marathi to Hindi MT System (cont)
  • Important challenge of M-H Translation- Morphology processing
  • Slide 122
  • Slide 123
  • Kridanta Types
  • Participial Suffixes in Other Agglutinative Languages
  • Participial Suffixes in Other Agglutinative Languages (cont)
  • Participial Suffixes in Other Agglutinative Languages (cont) (2)
  • Slide 128
  • Accuracy of Kridanta Processing Direct Evaluation
  • Summary of M-H transfer based MT
  • Summary interlingua based MT

Kridantas can be in multiple POS categories

Nouns (Verb -> Noun)
वाच vaach (read) -> वाचणे vaachaNe (reading)
उतर utara (climb down) -> उतरण utaraN (downward slope)

Adjectives (Verb -> Adjective)
चाव chaav (bite) -> चावणारा chaavaNaara (one who bites)
खा khaa (eat) -> खाल्लेले khallele (something that is eaten)

Kridantas derived from verbs (cont)

Adverbs (Verb -> Adverb)
पळ paL (run) -> पळताना paLataanaa (while running)
बस bas (sit) -> बसून basun (after sitting)
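To make the suffixation pattern concrete, here is a minimal Python sketch (an illustration, not the IITB morphology module): a toy table of transliterated kridanta suffixes and a helper derive_kridanta that attaches them to a verb stem. The suffix spellings are assumptions read off the transliterations above, and real Marathi morphophonemics (vowel insertion, sandhi) are ignored.

```python
# Toy sketch (assumption, not the actual system): deriving kridanta forms of
# different POS categories from a transliterated Marathi verb stem.
# Morphophonemic adjustments (epenthesis, sandhi) are deliberately ignored.

KRIDANTA_SUFFIXES = {
    "Ne":     {"suffix": "aNe",     "pos": "noun",      "gloss": "act of V-ing"},
    "Naara":  {"suffix": "aNaara",  "pos": "adjective", "gloss": "one who V-s"},
    "taanaa": {"suffix": "ataanaa", "pos": "adverb",    "gloss": "while V-ing"},
    "un":     {"suffix": "un",      "pos": "adverb",    "gloss": "after V-ing"},
}

def derive_kridanta(stem, kind):
    """Return (surface form, POS category) for a stem and a kridanta type."""
    entry = KRIDANTA_SUFFIXES[kind]
    return stem + entry["suffix"], entry["pos"]

if __name__ == "__main__":
    for stem, kind in [("vaach", "Ne"), ("chaav", "Naara"),
                       ("paL", "taanaa"), ("bas", "un")]:
        form, pos = derive_kridanta(stem, kind)
        print(f"{stem} + {kind}-kridanta -> {form} ({pos})")
```

Running the sketch reproduces the transliterated forms listed above (vaachaNe, chaavaNaara, paLataanaa, basun), which is all it is meant to show.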

Kridanta Types

Kridanta Type | Example (with word-for-word gloss) | Aspect
"णे" Ne-Kridanta | vaachNyaasaaThee pustak de (Give me a book for reading) [for-reading book give] | Perfective
"ला" laa-Kridanta | Lekh vaachalyaavar saaMgen (I will tell you that after reading the article) [article after-reading will-tell] | Perfective
"ताना" Taanaa-Kridanta | Pustak vaachtaanaa te lakShaat aale (I noticed it while reading the book) [book while-reading it in-mind came] | Durative
"लेले" Lela-Kridanta | kaal vaachlele pustak de (Give me the book that (I/you) read yesterday) [yesterday read book give] | Perfective
"ऊन" Un-Kridanta | pustak vaachun parat kar (Return the book after reading it) [book after-reading back do] | Completive
"णारा" Nara-Kridanta | pustake vaachNaaRyaalaa dnyaan miLte (The one who reads books gets knowledge) [books to-the-one-who-reads knowledge gets] | Stative
"वे" ve-Kridanta | he pustak pratyekaane vaachaave (Everyone should read this book) [this book everyone should-read] | Inceptive
"ता" taa-Kridanta | to pustak vaachtaa vaachtaa zopee gelaa (He fell asleep while reading a book) [he book while-reading to-sleep went] | Stative
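As a rough illustration of how such a table could drive processing, the sketch below (hypothetical; the transliterated suffixes are assumptions, not the system's actual rules) identifies the kridanta type of a transliterated word by longest-suffix match and returns the aspect label from the table.

```python
# Illustrative sketch only: recognise a kridanta by longest-suffix match over
# transliterated endings and return (type, aspect) as listed in the table
# above. A real analyser would work on Devanagari stems and handle sandhi.

KRIDANTA_TABLE = [
    # (transliterated suffix, kridanta type, aspect)
    ("Ne",      "Ne",     "perfective"),
    ("lyaa",    "laa",    "perfective"),
    ("taanaa",  "taanaa", "durative"),
    ("lele",    "lela",   "perfective"),
    ("un",      "un",     "completive"),
    ("Naaryaa", "Nara",   "stative"),
    ("ve",      "ve",     "inceptive"),
    ("taa",     "taa",    "stative"),
]

def classify_kridanta(word):
    """Return (kridanta type, aspect) for the longest matching suffix, else None."""
    best = None
    for suffix, ktype, aspect in KRIDANTA_TABLE:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, ktype, aspect)
    return (best[1], best[2]) if best else None

print(classify_kridanta("vaachtaanaa"))  # ('taanaa', 'durative')
print(classify_kridanta("vaachlele"))    # ('lela', 'perfective')
print(classify_kridanta("vaachun"))      # ('un', 'completive')
```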

Participial Suffixes in Other Agglutinative Languages

Kannada:
muridiruwaa kombe jennu esee
(gloss: broken to-branch throw)
"Throw away the broken branch"
- similar to the "lela" form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Telugu:
ame padutunnappudoo nenoo panichesanoo
(gloss: she singing I work)
"I worked while she was singing"
- similar to the "taanaa" form frequently used in Marathi

Participial Suffixes in Other Agglutinative Languages (cont)

Turkish:
hazirlanmis plan
(gloss: prepare-past plan)
"The plan which has been prepared"
- equivalent to the Marathi "lelaa" form

Morphological Processing of Kridanta forms (cont)

[Figure: Morphotactics FSM for Kridanta Processing]
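The figure itself is not reproduced here. To convey the idea, the following is a rough sketch of what a morphotactics FSM for kridanta processing might look like (an assumed design for illustration, not the automaton in the figure): it checks that a pre-segmented word follows the legal order stem + kridanta suffix (+ case marker).

```python
# Assumed illustration of a morphotactics FSM (not the figure's actual states):
# a segmentation is well formed if its morphemes follow stem -> kridanta (-> case).

TRANSITIONS = {
    ("START",    "stem"):     "STEM",
    ("STEM",     "kridanta"): "KRIDANTA",
    ("KRIDANTA", "case"):     "CASE",
}
ACCEPTING = {"KRIDANTA", "CASE"}

def accepts(morphemes):
    """morphemes: list of (form, category) pairs, e.g. [('vaach', 'stem'), ...]."""
    state = "START"
    for _form, category in morphemes:
        state = TRANSITIONS.get((state, category))
        if state is None:        # no legal transition: reject this segmentation
            return False
    return state in ACCEPTING

# vaach + Nyaa (kridanta) + saaThee (case) ~ vaachNyaasaaThee ("for reading")
print(accepts([("vaach", "stem"), ("Nyaa", "kridanta"), ("saaThee", "case")]))  # True
print(accepts([("vaach", "stem"), ("saaThee", "case")]))                        # False
```

The point of the FSM view is simply that illegal orderings (e.g. a case marker directly on a bare stem) are rejected before any transfer rule sees them.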

Accuracy of Kridanta Processing: Direct Evaluation

[Chart: Precision and Recall per kridanta type (Ne-, La-, Nara-, Lela-, Tana-, T-, Oon-, and Va-Kridanta); the reported values lie roughly between 0.86 and 0.98.]
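For reference, precision and recall in such a per-type evaluation are computed from true/false positives and false negatives. The sketch below uses made-up counts purely to show the arithmetic; they are not the data behind the chart.

```python
# How per-kridanta precision/recall are computed. The counts are placeholders,
# not the evaluation data summarised in the chart above.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)   # correct detections / all detections made
    recall = tp / (tp + fn)      # correct detections / all gold instances
    return precision, recall

p, r = precision_recall(tp=95, fp=4, fn=7)   # hypothetical counts for one type
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.96, recall=0.93
```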

Summary of M-H transfer based MT

• Marathi and Hindi are close cousins
• Relatively easier problem to solve
• Will interlingua be better?
• Web sentences being used to test the performance
• Rule governed
• Needs high level of linguistic expertise
• Will be an important contribution to IL MT
