

Machine Translation and Lexical Resources Activity at IIT Bombay

Pushpak Bhattacharyya

Computer Science and Engineering Department

Indian Institute of Technology Bombay

[email protected]

http://www.cse.iitb.ac.in/pb


Interlingua Methodology

Directly obtain the meaning of the source sentence.

Do target sentence generation from the meaning representation.

John gave the book to Mary.

Meaning representation:

give-action:

agent: john

object: the book

receiver: mary
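The two steps above can be sketched in Python as follows. The frame mirrors the meaning representation on the slide. The `Frame` class, the two generator functions and the romanised Hindi word forms are illustrative assumptions, not part of any system described in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Language-independent meaning of one sentence (toy version)."""
    action: str
    agent: str
    obj: str
    receiver: str


def generate_english(frame):
    # Subject-verb-object order; the receiver is marked by the preposition "to".
    past = {"give": "gave"}[frame.action]
    return f"{frame.agent} {past} {frame.obj} to {frame.receiver}."


def generate_hindi(frame):
    # Verb-final order; roles are marked by postpositions (hypothetical toy lexicon).
    noun = {"the book": "kitaab"}[frame.obj]
    verb = {"give": "dii"}[frame.action]
    return f"{frame.agent} ne {frame.receiver} ko {noun} {verb}."


meaning = Frame(action="give", agent="John", obj="the book", receiver="Mary")
print(generate_english(meaning))
print(generate_hindi(meaning))
```

Both generators read the same frame and never look at the other language, which is the point of the interlingua approach: adding a language means adding one analyser and one generator, not one system per language pair.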


Competing approaches

• Direct

• Transfer based


MT Architectures: Vauquois' triangle

[Figure: Vauquois' triangle]

Levels: deep understanding level; interlingual level; logico-semantic level; syntactico-functional level; morpho-syntactic level; syntagmatic level; graphemic level.

Representations: ontological interlingua; semantico-linguistic interlingua; SPA-structures (semantic and predicate-argument); F-structures (functional); C-structures (constituent); tagged text; text. Mixing levels gives a multilevel description.

Translation paths: direct translation; semi-direct translation; syntactic transfer (surface); syntactic transfer (deep); semantic transfer; conceptual transfer; multilevel transfer.


State of Affairs

• Systran reports 19 different language pairs.

• Only 8 are adequate for their intended use.

• Even fewer are capable of quality translation of written or spoken text.


ENGLISH-SPANISH-ENGLISH

• ...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province

• ... en ese imperio, el arte de la cartografía logró tal perfección que el mapa de una sola provincia ocupó la totalidad de una ciudad, y el mapa del imperio, la totalidad de una provincia

• ... in that empire, the art of the cartography obtained such perfection that the map of a single province occupied the totality of a city, and the map of the empire, the totality of a province

Provided by Systran on 19/11/02


ENGLISH-KOREAN-ENGLISH

• ...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province

• 저 제국안에 , 단순한 지방의 지도가 도시의 완전을 점유했다 고 Cartography 의 예술은 같은 얀벽 , 및 제국 , 지방의 완전의 지도 를 달성했다

• Inside that empire, the map of the region where it is simple occupied the perfection of the city the art of the Cartography is same, yan it attained the map of of perfection of the wall and empire and region

Provided by Systran on 19/11/02


UNL Based MT: the scenario

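[Figure: UNL at the centre, linked to English, Hindi, French and Russian. Enconversion goes from a natural language into UNL; deconversion goes from UNL into a natural language.]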


Universal Networking Language

Common language for computers to express information written in natural language (Uchida et al., 2000)

Application:

• Electronic language to overcome the language barrier

• Information Distribution System


UNL Example

[Figure: UNL graph with the node arrange linked to three nodes]

agt(arrange, John)

obj(arrange, meeting)

plc(arrange, residence)
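A UNL expression of this kind is a labelled directed graph, so it can be held as a list of triples. The sketch below assumes that arrange is the head node of all three arcs, as the figure suggests; the function names `to_unl` and `arguments_of` are made up for illustration.

```python
from collections import defaultdict

# Each edge is (relation label, source node, target node).
edges = [
    ("agt", "arrange", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def to_unl(edges):
    """Render every edge in the textual form label(source, target)."""
    return [f"{label}({source}, {target})" for label, source, target in edges]


def arguments_of(edges, head):
    """Group the targets reachable from one head node by relation label."""
    roles = defaultdict(list)
    for label, source, target in edges:
        if source == head:
            roles[label].append(target)
    return dict(roles)


for line in to_unl(edges):
    print(line)
print(arguments_of(edges, "arrange"))
```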


Components of the UNL System

• Universal Word

• Relation Labels

• Attributes


Universal Word

[saayaa] "shadow(icl>darkness)"; the place was now in shadow

[laoSamaa~] "shadow(icl>iota)"; not a shadow of doubt about his guilt

[saMkot] "shadow(icl>hint)"; the shadow of the things to come

[Cayaa] "shadow(icl>deterrant)"; a shadow over his happiness


Universal Word (foreign concepts)

[aput] "snow(icl>thing)";

[pukak] "snow(aoj<salt like)";

[mauja] "snow(aoj<soft, aoj<deep)";

[massak] "snow(aoj<soft)";

[mangokpok] "snow(aoj<watery)";


Relation

agt (agent)

Agt defines a thing which initiates an action.

agt (do, thing)

Syntax

agt[":"<Compound UW-ID>] "(" {<UW1>|":"<Compound UW-ID>} "," {<UW2>|":"<Compound UW-ID>} ")"

Detailed Definition

Agent is defined as the relation between:

UW1 - do, and

UW2 - a thing

where:

UW2 initiates UW1, or

UW2 is thought of as having a direct role in making UW1 happen.

Examples and readings

agt(break(icl>do), John(icl>person))    John breaks

agt(translate(icl>do), computer(icl>machine))    computer translates
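The binary form of the syntax above can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: it handles only the plain label(UW1, UW2) case, ignores the optional compound UW-IDs (the ":" alternatives), and the helper names `split_top_level` and `parse_relation` are invented here.

```python
def split_top_level(text, separator=","):
    """Split text on separators that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == separator and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def parse_relation(expression):
    """Return (label, UW1, UW2) for an expression such as agt(UW1, UW2)."""
    label, _, rest = expression.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"not a relation: {expression!r}")
    # Unpacking fails with ValueError unless exactly two UWs are present.
    uw1, uw2 = split_top_level(rest[:-1])
    return label.strip(), uw1, uw2


print(parse_relation("agt(break(icl>do), John(icl>person))"))
print(parse_relation("agt(translate(icl>do), computer(icl>machine))"))
```

Tracking the parenthesis depth matters because each Universal Word carries its own parenthesised restriction, so a plain split on "," or "(" would cut in the wrong place for restrictions that contain commas.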


Attributes

• Used to describe what is said from the speaker's point of view.

• In particular, they capture number, tense, aspect and modality information.


Example Attributes

• I see a flower

UNL: obj(see(icl>do), flower(icl>thing))

• I saw flowers

UNL: obj(see(icl>do).@past, flower(icl>thing).@pl)

• Did I see flowers?

UNL: obj(see(icl>do).@past.@interrogative, flower(icl>thing).@pl)

• Please see the flowers

UNL: obj(see(icl>do).@request, flower(icl>thing).@pl.@definite)
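Attributes are written as ".@name" suffixes on a Universal Word, so they can be separated from the headword and its restriction with one regular expression. The pattern and the function name `parse_uw` below are assumptions made for illustration; they cover only the shapes that appear on these slides.

```python
import re

# headword, optional (restriction), then zero or more .@attribute suffixes
UW_PATTERN = re.compile(
    r"^(?P<headword>[^(.]+)"
    r"(?:\((?P<restriction>[^)]*)\))?"
    r"(?P<attributes>(?:\.@\w+)*)$"
)


def parse_uw(text):
    """Split a Universal Word into (headword, restriction, attribute list)."""
    match = UW_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a Universal Word: {text!r}")
    # ".@past.@pl".split(".@") gives ["", "past", "pl"]; drop the empty head.
    names = match.group("attributes").split(".@")[1:]
    attributes = [f"@{name}" for name in names]
    return match.group("headword"), match.group("restriction"), attributes


print(parse_uw("see(icl>do).@past.@interrogative"))
print(parse_uw("flower(icl>thing).@pl"))
print(parse_uw("flower(icl>thing)"))
```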


The Analyser Machine

[Figure: the EnConverter. Analysis Rules and a Dictionary drive windows (labelled C and A) over a Node List (ni-1, ni, ni+1, ni+2, ni+3), and a Node-net (nodes A to E) is built.]


Strategy for Analysis

• Morphological Analysis

• Syntactico-Semantic Analysis


Analysis of a simple sentence

<<A Report of John’s genius reached King’s ears>>

The article and the noun are combined, and the attribute @indef is added to the noun.

<<[Report][of] John’s genius reached King’s ears>>

Shift right to put the preposition with the succeeding noun.

<</Report/[of][John’s] genius reached King’s ears>>

John’s being a possessive noun, shift right.

<</Report//of/[John’s][genius] reached King’s ears>>

These two nouns are resolved into the relation pos, and the first noun is deleted:


Simple sentence (continued)

<</Report/[of][genius] reached King’s ears>>

The preposition of is then combined with the noun, and a dynamic attribute OFRES is added to the entry of genius.

<<[Report][of genius] reached King’s ears>>

Using the attribute OFRES, these two nouns are resolved into the relation mod, and the second noun is deleted.

<<[Report][reached] King’s ears>>

Shift right again and resolve King’s ears; the relation pof is generated.

<</Report/[reached][ears]>>

The relation obj is generated here, and then the relation agt is generated between Report and reached.

<</reached/>>
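The walk-through can be imitated with a toy two-node window that either applies a rule and deletes a node, or shifts right. This is only a sketch under several assumptions: the tag set (N, POSS, P, V), the PART flag that selects pof over pos, the argument order of each relation, and the reading that the final agt relation links reached to Report are all choices made here, and the real enconverter rule format is not shown on the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    tag: str                                   # toy tags: N, POSS, P, V
    attrs: set = field(default_factory=set)


def enconvert(nodes):
    """Reduce the node list to its predicate, collecting relations on the way."""
    nodes = list(nodes)
    relations = []
    i = 0                                      # index of the left window
    while len(nodes) > 1:
        if i + 1 >= len(nodes):
            raise ValueError("no rule applies; the toy rule set is stuck")
        left, right = nodes[i], nodes[i + 1]
        if left.tag == "POSS" and right.tag == "N":
            label = "pof" if "PART" in right.attrs else "pos"
            relations.append((label, right.word, left.word))
            del nodes[i]                       # delete the first noun
        elif left.tag == "P" and right.tag == "N":
            right.attrs.add("OFRES")           # combine preposition with noun
            del nodes[i]
        elif left.tag == "N" and right.tag == "N" and "OFRES" in right.attrs:
            relations.append(("mod", left.word, right.word))
            del nodes[i + 1]                   # delete the second noun
        elif left.tag == "V" and right.tag == "N":
            relations.append(("obj", left.word, right.word))
            del nodes[i + 1]
        elif left.tag == "N" and right.tag == "V" and i + 2 == len(nodes):
            # Only once nothing is left to the right of the verb.
            relations.append(("agt", right.word, left.word))
            del nodes[i]
        else:
            i += 1                             # shift right
            continue
        i = max(i - 1, 0)                      # step back after a reduction
    return nodes[0].word, relations


sentence = [
    Node("Report", "N", {"@indef"}), Node("of", "P"), Node("John", "POSS"),
    Node("genius", "N"), Node("reached", "V"), Node("King", "POSS"),
    Node("ears", "N", {"PART"}),
]
predicate, relations = enconvert(sentence)
for label, uw1, uw2 in relations:
    print(f"{label}({uw1}, {uw2})")
print("predicate:", predicate)
```

The agt rule fires only when the verb is the last node, which is one simple way to keep the predicate alive until all of its arguments have been attached.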


UNL as Interlingua and Language Divergence

(Dave, Parikh, Bhattacharyya, JMT, 2003)

• Language divergence stands for the discrepancy in representation due to the inherent characteristics of the languages.

• Syntactic Divergence

• Lexical Semantic Divergence


Issue of free word order

jaIma nao caaorI krnaovaalao laD,ko kao laazI sao maara.

jaIma nao laazI sao caaorI krnaovaalao laD,ko kao maara.

caaorI krnaovaalao laD,ko kao jaIma nao laazI sao maara.

caaorI krnaovaalao laD,ko kao laazI sao jaIma nao maara.

laazI sao jaIma nao caaorI krnaovaalao laD,ko kao maara.

(Five orderings of one Hindi sentence, roughly: Jim hit the boy who steals with a stick.)

• Use is made of the fact that in Hindi postpositions stay adjacent to nouns (as opposed to the preposition stranding divergence).

• Flexibility in parsing: hit and preserve the predicate till the end.
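Because each postposition stays next to its noun phrase, the relations can be read off chunk by chunk, whatever order the chunks come in. The sketch below assumes a romanised reading of the five sentences and a toy mapping from postpositions to relation labels (ne to agt, ko to obj, se to ins); both are illustrative assumptions, not the rules of the actual analyser.

```python
# Hypothetical mapping from Hindi postpositions to UNL relation labels.
POSTPOSITION_TO_RELATION = {"ne": "agt", "ko": "obj", "se": "ins"}


def relations_of(sentence):
    """Return the set of (label, predicate, argument) triples of a verb-final sentence."""
    *arguments, predicate = sentence.split()
    found, chunk = set(), []
    for token in arguments:
        if token in POSTPOSITION_TO_RELATION:
            # The postposition closes the noun phrase collected so far.
            label = POSTPOSITION_TO_RELATION[token]
            found.add((label, predicate, " ".join(chunk)))
            chunk = []
        else:
            chunk.append(token)
    return found


orderings = [
    "Jim ne chori karnevale ladke ko lathi se mara",
    "Jim ne lathi se chori karnevale ladke ko mara",
    "chori karnevale ladke ko Jim ne lathi se mara",
    "chori karnevale ladke ko lathi se Jim ne mara",
    "lathi se Jim ne chori karnevale ladke ko mara",
]
results = [relations_of(sentence) for sentence in orderings]
print(results[0])
print(all(result == results[0] for result in results))
```

All five orderings give the same set of triples, and the predicate is kept until the end of the sentence before anything is attached to it.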


Conjunct and compound verbs

Typical Indian language phenomenon. Compound for verb-verb, conjunct for other POS+verb.

H: vah gaanao lagaI    E: She started singing.

H: calao jaaAao    E: Go away.

H: $k jaaAao    E: Stop there.

H: Jauk jaaAao    E: Bend down.

Possibility of combinatorial explosion in the lexicon. Possible solution: wordnet?


Use of Lexical Resources

• Automatic generation of the UW to language dictionary (Verma and Bhattacharyya, Global Wordnet Conference, Czech Republic, 2004)

• Universal Word generation

• Semantic attribute generation

• Heavy use of wordnets and ontologies
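One plausible way to generate a Universal Word from a wordnet is to take the restriction from a hypernym of the headword. The sketch below shows only that idea: the hypernym table is hypothetical stand-in data, `make_universal_word` is an invented name, and the cited work may proceed differently.

```python
# Toy hypernym table standing in for a wordnet lookup (hypothetical data).
HYPERNYM = {"dog": "animal", "rose": "flower"}


def make_universal_word(headword, hypernyms=HYPERNYM):
    """Attach an icl> restriction when a hypernym is known, else return the bare headword."""
    parent = hypernyms.get(headword)
    return f"{headword}(icl>{parent})" if parent else headword


print(make_universal_word("rose"))
print(make_universal_word("dog"))
print(make_universal_word("aput"))
```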


Wordnet and Lexical Resources

• Approximately 12000 Hindi synsets corresponding to about 35000 root words of Hindi.

• Approximately 7000 Hindi synsets corresponding to about 16000 root words of Hindi.

• Verb Hierarchy of approximately 4000 unique words corresponding to 6000 senses.


WordNet Sub-Graph

[Figure: sub-graph around the synset {Gar, gaRh}, with the gloss "manauYyaaoM ka Cayaa huAa vah sqaana jaao dIvaaraoM sao Gaor kr banaayaa jaata hO". Linked synsets: AQyana kxa; Aavaasa, inavaasa; Sayana kxa; rsaao[-Gar; Aitiqa gaRh; baramada; Aa^Mgana; AaEama; JaaopD,I; saMrcanaa. Edge labels: Hyponymy, Hypernymy, Meronymy, Gloss.]


Languages under Study

Language    Analysis Status         Generation Status
English     D- 60000, R- 5000       D- 60000, R- 400
Hindi       D- 75000, R- 5700       D- 75000, R- 6500
Marathi     D- 4000, R- 2200        D- 4000, R- 6000
Bengali     D- 500, R- 1800         D- 500, R- 2100


Conclusions

• Predicate preservation strategy used for English, Hindi, Marathi, Bengali (Spanish being added).

• Focus on morphology for Marathi.

• Focus on kaarak (case) system for Bengali.

• The approach is extremely hungry for lexical knowledge.


Conclusions

• Work is going on in the creation of Indian language wordnets (Hindi and Marathi at IIT Bombay; Dravidian languages at Anna University).

• Interlingua has the attractive possibility of being used as a knowledge representation and applied to interesting applications like summarization, text clustering and meaning-based multilingual search engines.