Machine Translation and Lexical Resources Activity at IIT Bombay
Pushpak Bhattacharyya
Computer Science and Engineering Department
Indian Institute of Technology Bombay
http://www.cse.iitb.ac.in/pb
Interlingua Methodology
Directly obtain the meaning of the source sentence.
Do target sentence generation from the meaning representation.
John gave the book to Mary.
Meaning representation:
give-action:
agent: john
object: the book
receiver: mary
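The two steps above can be sketched in Python. This is a minimal illustration, not an actual system: the `Frame` class, the `PAST_TENSE` table and the `generate_english` function are hypothetical names, and the single fixed template stands in for a real generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """Language-independent meaning of an event (fields follow the slide)."""
    action: str
    agent: str
    obj: str
    receiver: str


# Hypothetical lexicon: past-tense surface form for each action.
PAST_TENSE = {"give": "gave"}


def generate_english(frame: Frame) -> str:
    """Generate an English sentence from the frame using one fixed template."""
    verb = PAST_TENSE[frame.action]
    agent = frame.agent.capitalize()
    receiver = frame.receiver.capitalize()
    return f"{agent} {verb} {frame.obj} to {receiver}."


meaning = Frame(action="give", agent="john", obj="the book", receiver="mary")
print(generate_english(meaning))  # John gave the book to Mary.
```

A generator for another target language would read the same `Frame` and apply its own lexicon and word order, which is the appeal of the interlingua approach.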
Competing approaches
• Direct
• Transfer based
MT Architectures: Vauquois' triangle
[Figure: Vauquois' triangle. Levels and their representations, from top to bottom:]
• Deep understanding level: ontological interlingua
• Interlingual level: semantico-linguistic interlingua
• Logico-semantic level: SPA-structures (semantic and predicate-argument)
• Syntactico-functional level: F-structures (functional)
• Morpho-syntactic level: C-structures (constituent)
• Syntagmatic level: tagged text
• Graphemic level: text
[Translation paths shown: direct translation, semi-direct translation, syntactic transfer (surface), syntactic transfer (deep), semantic transfer, conceptual transfer, multilevel transfer. Mixing levels gives a multilevel description.]
State of Affairs
• Systran reports 19 different language pairs.
• Only 8 are adequate for their intended use.
• Even fewer are capable of quality written or spoken text translation.
ENGLISH-SPANISH-ENGLISH
• ...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province
• ... en ese imperio, el arte de la cartografía logró tal perfección que el mapa de una sola provincia ocupó la totalidad de una ciudad, y el mapa del imperio, la totalidad de una provincia
• ... in that empire, the art of the cartography obtained such perfection that the map of a single province occupied the totality of a city, and the map of the empire, the totality of a province
Provided by Systran on 19/11/02
ENGLISH-KOREAN-ENGLISH
• ...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province
• 저 제국안에 , 단순한 지방의 지도가 도시의 완전을 점유했다 고 Cartography 의 예술은 같은 얀벽 , 및 제국 , 지방의 완전의 지도 를 달성했다
• Inside that empire, the map of the region where it is simple occupied the perfection of the city the art of the Cartography is same, yan it attained the map of of perfection of the wall and empire and region
Provided by Systran on 19/11/02
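UNL Based MT: the scenario
[Figure: UNL at the center, connected to English, Hindi, French and Russian. Enconversion goes from a natural language into UNL; deconversion goes from UNL into a natural language.]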
Universal Networking Language
Common language for computers to express information written in natural language (Uchida et al. 2000)
Application:
• Electronic language to overcome language barrier
• Information Distribution System
UNL Example
[Figure: UNL graph with "arrange" as the central node, linked by agt to "John", by obj to "meeting" and by plc to "residence".]
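The arrange/John/meeting/residence example can be held as a list of labeled edges. The sketch below is illustrative only: the triple layout and the `outgoing` helper are assumptions made for the example, and the edge directions follow a reading of the figure in which "arrange" is the source of all three relations.

```python
from collections import defaultdict

# Each edge is (relation label, source node, target node).
edges = [
    ("agt", "arrange", "John"),
    ("obj", "arrange", "meeting"),
    ("plc", "arrange", "residence"),
]


def outgoing(edge_list):
    """Index the graph by source node: node -> {relation: target}."""
    index = defaultdict(dict)
    for relation, source, target in edge_list:
        index[source][relation] = target
    return dict(index)


graph = outgoing(edges)
print(graph["arrange"]["agt"])  # John
```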
Components of the UNL System
• Universal Word
• Relation Labels
• Attributes
Universal Word
[saayaa] "shadow(icl>darkness)"; the place was now in shadow
[laoSamaa~] "shadow(icl>iota)"; not a shadow of doubt about his guilt
[saMkot] "shadow(icl>hint)"; the shadow of the things to come
[Cayaa] "shadow(icl>deterrant)"; a shadow over his happiness
Universal Word (foreign concepts)
[aput] "snow(icl>thing)";
[pukak] "snow(aoj<salt like)";
[mauja] "snow(aoj<soft, aoj<deep)";
[massak] "snow(aoj<soft)";
[mangokpok] "snow(aoj<watery)";
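A universal word pairs a headword with a restriction list in parentheses. The sketch below splits the strings shown above into those parts. It is an illustration under assumptions: `parse_uw` and its return shape are hypothetical, and only the flat `relation>value` and `relation<value` restrictions seen on these slides are handled.

```python
import re

# Headword, optionally followed by a parenthesised restriction list.
UW_PATTERN = re.compile(r"^(?P<head>[^()]+?)(?:\((?P<restrictions>.*)\))?$")
# One restriction: relation label, direction marker, value.
RESTRICTION_PATTERN = re.compile(r"^\s*(\w+)\s*([<>])\s*(.+?)\s*$")


def parse_uw(uw: str):
    """Return (headword, [(relation, direction, value), ...]) for a universal word."""
    match = UW_PATTERN.match(uw.strip())
    if match is None:
        raise ValueError(f"not a universal word: {uw!r}")
    restrictions = []
    raw = match.group("restrictions")
    if raw:
        for part in raw.split(","):
            parsed = RESTRICTION_PATTERN.match(part)
            if parsed is None:
                raise ValueError(f"bad restriction: {part!r}")
            restrictions.append(parsed.groups())
    return match.group("head"), restrictions


print(parse_uw("shadow(icl>darkness)"))
print(parse_uw("snow(aoj<soft, aoj<deep)"))
```

Keeping the restrictions separate from the headword is what lets one English word such as "shadow" map to several distinct dictionary entries.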
Relation
agt (agent): Agt defines a thing which initiates an action.
agt (do, thing)
Syntax
agt[":"<Compound UW-ID>] "(" {<UW1>|":"<Compound UW-ID>} "," {<UW2>|":"<Compound UW-ID>} ")"
Detailed Definition
Agent is defined as the relation between:
UW1 - do, and
UW2 - a thing
where:
UW2 initiates UW1, or
UW2 is thought of as having a direct role in making UW1 happen.
Examples and readings
agt(break(icl>do), John(icl>person)) John breaks
agt(translate(icl>do), computer(icl>machine)) computer translates
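A relation expression of this form can be split into its label and two universal words by looking for the comma that is not nested inside parentheses. This is a hedged sketch: `parse_relation` is a hypothetical helper, and it covers only the plain `label(UW1, UW2)` case, not the compound UW-ID part of the syntax.

```python
def parse_relation(expr: str):
    """Parse 'label(uw1, uw2)' into (label, uw1, uw2), splitting on the top-level comma."""
    expr = expr.strip()
    if "(" not in expr or not expr.endswith(")"):
        raise ValueError(f"not a relation expression: {expr!r}")
    open_idx = expr.index("(")
    label = expr[:open_idx]
    body = expr[open_idx + 1:-1]
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            # First comma outside any nested parentheses separates UW1 and UW2.
            return label, body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"expected two arguments: {expr!r}")


print(parse_relation("agt(break(icl>do), John(icl>person))"))
```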
Attributes
• Used to describe what is said from the speaker's point of view.
• In particular captures number, tense, aspect and modality information.
Example Attributes
• I see a flower
UNL: obj(see(icl>do), flower(icl>thing))
• I saw flowers
UNL: obj(see(icl>do).@past, flower(icl>thing).@pl)
• Did I see flowers?
UNL: obj(see(icl>do).@past.@interrogative, flower(icl>thing).@pl)
• Please see the flowers.
UNL: obj(see(icl>do).@request, flower(icl>thing).@pl.@definite)
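Attributes are written as `.@name` suffixes on a universal word, so they can be peeled off with a simple split. The sketch below assumes that `.@` never occurs inside the universal word itself; `split_attributes` is a hypothetical helper name.

```python
def split_attributes(node: str):
    """Split 'uw.@a.@b' into the bare universal word and its attribute list."""
    uw, *names = node.split(".@")
    return uw, ["@" + name for name in names]


# Arguments of the "I saw flowers" example.
verb, verb_attrs = split_attributes("see(icl>do).@past")
noun, noun_attrs = split_attributes("flower(icl>thing).@pl")
print(verb, verb_attrs)  # see(icl>do) ['@past']
print(noun, noun_attrs)  # flower(icl>thing) ['@pl']
```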
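The Analyser Machine
[Figure: the Enconverter, which uses analysis rules and a dictionary. It works on a node list (ni-1, ni, ni+1, ni+2, ni+3) through windows labeled C and A, and builds a node-net (nodes A to E in the figure).]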
Strategy for Analysis
• Morphological Analysis
• Syntactico-Semantic Analysis
Analysis of a simple sentence
<<A Report of John’s genius reached King’s ears>>
Article and noun are combined and the attribute @indef is added to the noun.
<<[Report][of] John’s genius reached King’s ears>>
Right shift to put the preposition with the succeeding noun.
<</Report/[of][John’s] genius reached King’s ears>>
John’s being a possessing noun, shift right.
<</Report//of/[John’s][genius] reached King’s ears>>
These two nouns are resolved into relation pos and the first noun is deleted.
Simple sentence (continued)
<</Report/[of][genius] reached King’s ears>>
The preposition of is then combined with the noun and a dynamic attribute OFRES is added to the entry of genius.
<<[Report][of genius] reached King’s ears>>
Using the attribute OFRES these two nouns are resolved to relation mod and the second noun is deleted.
<<[Report][reached] King’s ears>>
Shift right again and solve King’s ears; relation pof is generated.
<</Report/[reached][ears]>>
Relation obj is generated here, and then relation agt is generated between reached and Report.
<</reached/>>
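The walk-through can be replayed as a sequence of resolve-and-delete steps on a node list. This is a deliberately simplified sketch, not the Enconverter's rule format: the `analyse` function is hypothetical, window movement and shifts are omitted, the argument order inside each relation is an assumption, and agt is taken as linking the verb to Report, following the agt(do, thing) definition.

```python
def analyse(nodes, steps):
    """Apply (relation, head, dependent) steps in order.

    Each step records one relation and deletes the dependent node,
    so the node list shrinks until only the predicate is left.
    """
    nodes = list(nodes)
    relations = []
    for relation, head, dependent in steps:
        if head not in nodes or dependent not in nodes:
            raise ValueError(f"cannot resolve {relation}({head}, {dependent})")
        relations.append((relation, head, dependent))
        nodes.remove(dependent)
    return nodes, relations


# Node list after the article and the preposition 'of' have been absorbed.
nodes = ["Report", "John's", "genius", "reached", "King's", "ears"]
steps = [
    ("pos", "genius", "John's"),   # possessor resolved, "John's" deleted
    ("mod", "Report", "genius"),   # licensed by the OFRES attribute
    ("pof", "ears", "King's"),     # part-of
    ("obj", "reached", "ears"),
    ("agt", "reached", "Report"),  # assumption: agt links the verb and Report
]

remaining, relations = analyse(nodes, steps)
print(remaining)  # ['reached']
```

The predicate is the only node left at the end, which matches the final state <</reached/>> of the walk-through.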
UNL as Interlingua and Language Divergence
(Dave, Parikh, Bhattacharyya, JMT, 2003)
• Language divergence stands for the discrepancy in representation due to the inherent characteristics of the languages.
• Syntactic Divergence
• Lexical Semantic Divergence
Issue of free word order
jaIma nao caaorI krnaovaalao laD,ko kao laazI sao maara.
jaIma nao laazI sao caaorI krnaovaalao laD,ko kao maara.
caaorI krnaovaalao laD,ko kao jaIma nao laazI sao maara.
caaorI krnaovaalao laD,ko kao laazI sao jaIma nao maara.
laazI sao jaIma nao caaorI krnaovaalao laD,ko kao maara.
• Use made of the fact that in Hindi postpositions stay adjacent to nouns (opposed to the preposition stranding divergence).
• Flexibility in parsing: hit and preserve the predicate till the end.
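Because each postposition stays next to its noun phrase, the five orderings above can be reduced to the same set of relations. The sketch below illustrates the idea only. The `POSTPOSITIONS` table is an assumption: it reads the tokens `nao`, `kao` and `sao` (as they are spelled in this transcript) as agent, object and instrument markers and assigns them the labels agt, obj and ins. `analyse_clause` is a hypothetical name.

```python
# Assumed mapping from postposition token to relation label (illustrative only).
POSTPOSITIONS = {"nao": "agt", "kao": "obj", "sao": "ins"}


def analyse_clause(sentence: str):
    """Return (predicate, relations) for a verb-final clause."""
    phrase, relations = [], set()
    for token in sentence.rstrip(".").split():
        if token in POSTPOSITIONS:
            # The postposition closes the noun phrase collected so far.
            relations.add((POSTPOSITIONS[token], " ".join(phrase)))
            phrase = []
        else:
            phrase.append(token)
    # Whatever follows the last postposition is preserved as the predicate.
    return " ".join(phrase), frozenset(relations)


sentences = [
    "jaIma nao caaorI krnaovaalao laD,ko kao laazI sao maara.",
    "jaIma nao laazI sao caaorI krnaovaalao laD,ko kao maara.",
    "caaorI krnaovaalao laD,ko kao jaIma nao laazI sao maara.",
    "caaorI krnaovaalao laD,ko kao laazI sao jaIma nao maara.",
    "laazI sao jaIma nao caaorI krnaovaalao laD,ko kao maara.",
]

analyses = {analyse_clause(s) for s in sentences}
print(len(analyses))  # 1
```

All five word orders collapse to one analysis, with the predicate kept until the end of the clause.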
Conjunct and compound verbs
Typical Indian language phenomenon. Compound for verb-verb, conjunct for other POS+verb.
H: vah gaanao lagaI  E: She started singing.
H: calao jaaAao  E: Go away.
H: $k jaaAao  E: Stop there.
H: Jauk jaaAao  E: Bend down.
Possibility of combinatorial explosion in the lexicon. Possible solution: wordnet?
Use of Lexical Resources
• Automatic generation of the UW to language dictionary (Verma and Bhattacharyya, Global Wordnet Conference, Czech Republic, 2004)
• Universal Word generation
• Semantic attribute generation
• Heavy use of wordnets and ontologies
Wordnet and Lexical Resources
• Approximately 12000 Hindi synsets corresponding to about 35000 root words of Hindi.
• Approximately 7000 Hindi synsets corresponding to about 16000 root words of Hindi.
• Verb hierarchy of approximately 4000 unique words corresponding to 6000 senses.
WordNet Sub-Graph
[Figure: wordnet sub-graph around the synset {Gar, gaRh}, whose gloss is "manauYyaaoM ka Cayaa huAa vah sqaana jaao dIvaaraoM sao Gaor kr banaayaa jaata hO". Edges are labeled Hypernymy, Hyponymy and Meronymy. Other nodes shown: {Aavaasa, inavaasa}, AQyana kxa, Sayana kxa, rsaao[-Gar, Aitiqa gaRh, baramada, Aa^Mgana, AaEama, JaaopD,I, saMrcanaa.]
Languages under Study
Language    Analysis Status       Generation Status
English     D- 60000, R- 5000     D- 60000, R- 400
Hindi       D- 75000, R- 5700     D- 75000, R- 6500
Marathi     D- 4000, R- 2200      D- 4000, R- 6000
Bengali     D- 500, R- 1800       D- 500, R- 2100
Conclusions
• Predicate preservation strategy used for English, Hindi, Marathi, Bengali (Spanish being added).
• Focus on morphology for Marathi.
• Focus on kaarak (case) system for Bengali.
• Extremely lexical knowledge hungry.
Conclusions
• Work going on in the creation of Indian language wordnets (Hindi, Marathi at IIT Bombay; Dravidian at Anna University).
• Interlingua has the attractive possibility of being used as a knowledge representation and applied to interesting applications such as summarization, text clustering and meaning-based multilingual search engines.