Anusaaraka: An Approach to Machine Translation. Akshar Bharati, Vineet Chaitanya (1), Amba Kulkarni (2). (1) Chinmaya International Foundation, stationed at Rashtriya Sanskrit Vidyapeetha, Tirupati. (2) Department of Sanskrit Studies, University of Hyderabad, Hyderabad.

TRANSCRIPT

  • Slide 1
  • Anusaaraka: An Approach to Machine Translation. Akshar Bharati, Vineet Chaitanya (1), Amba Kulkarni (2). (1) Chinmaya International Foundation, stationed at Rashtriya Sanskrit Vidyapeetha, Tirupati. (2) Department of Sanskrit Studies, University of Hyderabad, Hyderabad.
  • Slide 2
  • Anusaaraka is: an incremental machine translation system; layered output; successive layers progressively closer to MT.
  • Slide 3
  • Slide 4
  • Machine Translation: Current Trends. Techniques being used: statistical. Statistical methods have an inherent limitation: they can never give a 100% reliable system, so the end user can never be sure about correctness. Current MT systems CANNOT serve users who want to ACCESS a text in another language.
  • Slide 5
  • Problems in Machine Translation
  • Slide 6
  • Language codes information only partially. Tension between BREVITY and PRECISION. Brevity wins leading to inherent ambiguity at different levels.
  • Slide 7
  • For example: sigareta pInA ('cigarette' 'to drink') -> sigareta ke dhueM kA sevana karanA ('cigarette' 'gen' 'smoke' 'gen' 'to consume'). Shelve the books -> To put the books on the shelf. Table the resolution -> To postpone / To present before the audience. rAma phala khAtA hE ('Ram' 'fruits' 'eat+habitual' 'Pres').
  • Slide 8
  • Human beings use World Knowledge, Context, Cultural knowledge and Language conventions to decipher it.
  • Slide 9
  • Anusaaraka generalizes the problem from TRANSLATION to ACCESS. Anusaaraka is a Language Accessor.
  • Slide 10
  • What is an Accessor? The Gist Terminal is a concrete example of a SCRIPT ACCESSOR (developed by IIT Kanpur and marketed by C-DAC). One can access any text in any Indian script through an enhanced Devanagari script.
  • Slide 11
  • 03/01/07 For example, the following two Telugu words can be displayed in enhanced Devanagari script.
  • Slide 12
  • Salient Features: faithful representation; reversibility; no loss of information; text in another script is accessible with a little extra training.
  • Slide 13
  • The Language Accessor, or anusaaraka, tries to generalise this philosophy and apply it to the problem of language conversion, which is several orders of magnitude more complex than script conversion.
  • Slide 14
  • Special feature of Anusaaraka: distribute the load between man and machine. Machine: rote memory + logic. Man: world knowledge, common sense, cultural knowledge, domain knowledge, ... However, there is a coupling between the two loads.
  • Slide 15
  • Urdu-Hindi example: the user needs to learn some features of the Urdu script. A typical Urdu text does not contain short vowels. Example: a word 'asii' may be read as usii/isii depending on the context.
  • Slide 16
  • Anusaaraka is: a tool for overcoming language barriers; an application of concepts from Panini's Ashtadhyayi to contemporary problems; an exploration of the information dynamics in language; a better approach for building Machine Translation systems; a workbench for NLP students; an opportunity for the masses to be IT contributors rather than mere IT consumers.
  • Slide 17
  • A tool to overcome language barriers. Salient features: use of a "word formula" (an "enriched" gloss); faithful representation; no loss of information; reversibility; source text accessible with little extra effort.
  • Slide 18
  • Learning a language involves learning: script, spellings, vocabulary, morphology, syntax, ----, cultural background. How far does Anusaaraka help?
  • Slide 19
  • Reduction in language barrier:
      Script: learn a few extended symbols
      Spellings: -----
      Vocabulary: learn a few highly ambiguous words
      Morphology: -----
      Syntax: learn a few parameters
  • Slide 20
  • Claim: language learning effort is reduced from a few months to a few days in the case of Indian languages, and from a few years to a few months in the case of English.
  • Slide 21
  • If the languages are close enough, the load on the user is very small. However, as the incompatibility increases, the load on the user also increases. Is there a limit beyond which the load cannot be reduced?
  • Slide 22
  • Technology helps in reducing barriers. Example: the railway network reduces the distance barrier ==> the time to cover the distance is reduced. Machine Translation has been attempted since the inception of computers. Can computers help us in reducing the language barrier?
  • Slide 23
  • Measuring the language barrier: D(L1, L2) = time taken to learn L2, given that the person knows L1.
  • Slide 24
  • MT systems aim at reducing this time to ZERO However, 'Feed a text in one language, and out comes its translation in your mother tongue' is still a distant dream!
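Read together, Slides 23 and 24 suggest a compact way to state the two goals. The following is only a sketch: the subscripted symbols D_MT and D_anu are notation introduced here for illustration and do not appear in the slides; the second line restates the "ZERO" aim of MT, and the third restates the deck's claim that an accessor shrinks, but does not eliminate, the learning time.

```latex
\begin{align*}
D(L_1, L_2) &= \text{time taken to learn } L_2 \text{, given that the person knows } L_1 \\
\text{MT aim:}\quad D_{\mathrm{MT}}(L_1, L_2) &= 0 \\
\text{Anusaaraka aim:}\quad 0 < D_{\mathrm{anu}}(L_1, L_2) &\ll D(L_1, L_2)
\end{align*}
```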
  • Slide 25
  • Anusaaraka: A better Approach for Developing Machine Translation Systems
  • Slide 26
  • Anusaaraka: anybody with an aptitude for 'language analysis' can contribute to the development of a Machine Translation system, even without any formal linguistic training.
  • Slide 27
  • Anusaaraka is an application of concepts from Panini's Ashtadhyayi to contemporary problems: pravritti nimitta, sannidhi, yogyatA, AkAMkshA, kArakas, etc.
  • Slide 28
  • English-Hindi Anusaaraka: An example
  • Slide 29
  • Some facts: English is the lingua franca of the business as well as the scientific community. Hardly 5-10% of the Indian population understands English well. For Indians, English is generally TOUGH and is EVER CHANGING.
  • Slide 30
  • Range of Needs (of literate Indians): A) Comfortable with simple English, but face difficulty with rare words and their meanings. B) Comfortable with simple sentences, but face problems with complex verb formations (verbs with particles, complex TAM structures, etc.). C) Poor knowledge of even common syntactic phenomena of English, but good analytical skills. D) Very poor in English and also weak in analytical skills, but motivated enough to put in hard work, with good stamina to struggle with difficulties. E) Poor in English as well as in analytical skills, and not well motivated to take pains.
  • Slide 31
  • Problems in translation, both human and machine, are well known! Anusaaraka: an attempt to cater to these needs, starting from the top level.
  • Slide 32
  • How does Anusaaraka cater to the diverse needs? Case 1: accord. Before holding a person responsible for a crime and according punishment, the motive behind the action must be determined.
  • Slide 33
  • Case 2: blow out. Firemen tried their level best to blow out the fire in the oil-well.
  • Slide 34
  • Case 3: On-line help [example for the word 'As'; diagram not preserved]
  • Slide 35
  • Case 4: WSD and Preposition Movement. Ideal output would be
  • Slide 36
  • Anusaaraka is an application of Information Dynamics. Where is the information coded? Position / case marking / ...? How much information is coded? Kaaraka / thematic role? How is the information coded? Implicit / explicit?
  • Slide 37
  • Language conventions vary for encoding information. Word level: labelling and packaging of concepts. Sentence level: expressing relations between constituent words. Where is the information coded?
  • Slide 38
  • Word Level (English / Hindi):
      Lexical gap: technical words, determiners / ===
      One-One: I / mEM
      Many-One: he, she, it / vaha
      One-Many: uncle / cAcA, mAmA, phUphA, mOsA
  • Slide 39
  • Overlapping Regions (English / Hindi):
      Play: bajAnA / khelanA / abhinaya karanA
      Light: halkA / prakAsha / prakAshita karanA
      Smoke: dhuAM / dhuAM nikAlanA / dhuAM nikalanA / dhueM se kAlA karanA / sigareta pInA
      As: jEsA / ke rUpa meM / kyoMki
  • Slide 40
  • Sentence Level: Where is the information coded? She is_going home (vaha ghara jA rahI hE). (rAma phala khAtA hE). Rats kill cats.
  • Slide 41
  • How much information is coded? Eng: Rama asked Shyam to go home. Hnd: rAma ne shyAma ko/se ghara jAne ke liye kahA. Tlg: rAmudu tinAni paNdu tiyyadi. Hnd: rAma ne jo phala khAyA vaha mIThA thA.
  • Slide 42
  • How is information coded (implicitly or explicitly)? rAma dUdha pIkara skUla gayA ('Ram' 'milk' 'having drunk' 'school' 'went'): who drank the milk? I want him to go: what do I want? Mohan dropped the melon and burst: who/what burst?
  • Slide 43
  • Information Dynamics: An example from English. Rats killed cats. Missing accusative marker ==> subject position sacrosanct. Subject position: position of ukta / abhihita (after the transitive verbs).
  • Slide 44
  • No accusative marker
  • Slide 45
  • No accusative marker; Subject Position - V -
  • Slide 46
  • No accusative marker; Subject Position - V -; Mirror; ECM; Gapping
  • Slide 47
  • No yes/no question marker
  • Slide 48
  • No yes/no question marker; Subject-Auxiliary Inversion
  • Slide 49
  • No accusative marker; No yes/no question marker; Subject Position - V -; Subject-Auxiliary Inversion
  • Slide 50
  • No accusative marker; No yes/no question marker; Subject Position - V -; Subject-Auxiliary Inversion; Subject position cannot be empty
  • Slide 51
  • No accusative marker; No yes/no question marker; Subject Position - V -; Subject-Auxiliary Inversion; Subject position cannot be empty; Dummy It; Dummy There
  • Slide 52
  • No accusative marker; No yes/no question marker; Subject Position - V -; Subject-Auxiliary Inversion; Subject position cannot be empty; Dummy It; Dummy There; Subject raising; Tough Movement
  • Slide 53
  • No accusative marker; No yes/no question marker; Subject Position - V -; Subject-Auxiliary Inversion; Subject position cannot be empty; Dummy It; Dummy There; Subject raising; Tough Movement; Mirror; ECM; Gapping; Dummy Do; Sannidhi violation
  • Slide 54
  • Information Flow. For example: He scratched a figure on the rock (made). She scratched the figure on the rock (erased). He went to school (simple past). He went to school everyday (habitual). I want him to go. I want a pen to write.
  • Slide 55
  • Information Dynamics: Applications. Rule Preparation; Pseudo Compounds; Mirror Principle.
  • Slide 56
  • Pseudo Compound. For example: simple noun phrase in English (The black box). English has post-nominal modification (The man in blue shirt). Adjectives occur prenominally. Adjectives do not inflect. Adjectives cannot occur without a noun, unlike Hindi (lAla ne kAloM ko mArA). Adjectives form a separate grammatical category in English.
  • Slide 57
  • Mirror Principle: English-Hindi word order. The word order of the predicate in Hindi is exactly the opposite of English.
      English: I(1) met(2) (the(3) man(4)) in(5) (blue(6) shirt(7)) near(8) (my(9) house(10))
      Hindi: mEM(1) (apane ghara)(9 10) ke_pAsa(8) (nIlI kamIza)(6 7) vAle(5) (0 AdamI)(3 4) se milI(2)
      This does not work for the adjectives.
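The index sequence on this slide (1 9 10 8 6 7 5 3 4 2) can be reproduced by keeping the subject in place, reversing the order of the predicate chunks, and leaving the order inside each chunk untouched. The sketch below illustrates only that reordering; the chunk boundaries are entered by hand from the slide's brackets, and the function name `mirror_predicate` is a hypothetical helper, not part of the Anusaaraka system.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (position in the English sentence, English word)


def mirror_predicate(subject: List[Token],
                     predicate_chunks: List[List[Token]]) -> List[Token]:
    """Keep the subject first, reverse the order of the predicate chunks,
    and preserve the word order inside each chunk."""
    reordered = list(subject)
    for chunk in reversed(predicate_chunks):
        reordered.extend(chunk)
    return reordered


# Chunking assumed from the slide: bracketed noun groups stay together,
# which is why the adjective-noun order inside them is not mirrored.
subject = [(1, "I")]
predicate_chunks = [
    [(2, "met")],
    [(3, "the"), (4, "man")],
    [(5, "in")],
    [(6, "blue"), (7, "shirt")],
    [(8, "near")],
    [(9, "my"), (10, "house")],
]

mirrored = mirror_predicate(subject, predicate_chunks)
positions = [pos for pos, _ in mirrored]
print(positions)
```

Treating each bracketed group as an indivisible chunk is what keeps "blue shirt" and "my house" in their original internal order, consistent with the slide's remark that the mirror does not apply to adjectives.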
  • Slide 58
  • Rule Preparation (contd.): capturing topic, emphasis, focus, etc. For example: From where are you coming? Where are you coming from? Is the preposition stranded to place emphasis on 'where'?
  • Slide 59
  • Rule formation becomes easy with "proper understanding" of "Information dynamics"
  • Slide 60
  • Rule Preparation (contd.): are 'subject to subject raising', 'tough movement', etc. special devices for 'topicalization'?
  • Slide 61
  • Information Dynamics: Applications. For automatic word alignment: match the anusaaraka output at the LWG level with the Hindi translation, ignoring certain idiosyncratic postpositions such as Hindi 'ne'.
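A minimal sketch of the matching idea on this slide, under loud assumptions: the slides do not specify the matching procedure, so plain greedy exact-string matching is swapped in here; the sample anusaaraka output and Hindi sentence are hypothetical illustrations, and `IGNORED_POSTPOSITIONS` contains only 'ne' because that is the one postposition the slide names.

```python
from typing import List, Optional, Tuple

# Only 'ne' is named on the slide; any further entries would be assumptions.
IGNORED_POSTPOSITIONS = {"ne"}


def align(anusaaraka_groups: List[str],
          hindi_tokens: List[str]) -> List[Tuple[str, Optional[int]]]:
    """Greedy alignment: pair each anusaaraka word group with the index of
    the first unused, identical Hindi token, skipping ignored postpositions.
    Groups without a match are paired with None."""
    used = set()
    pairs: List[Tuple[str, Optional[int]]] = []
    for group in anusaaraka_groups:
        match = None
        for i, token in enumerate(hindi_tokens):
            if i in used or token in IGNORED_POSTPOSITIONS:
                continue
            if token == group:
                match = i
                break
        if match is not None:
            used.add(match)
        pairs.append((group, match))
    return pairs


# Hypothetical data: the anusaaraka layer carries no 'ne', the translation does.
anusaaraka_output = ["rAma", "phala", "khAyA"]
hindi_translation = ["rAma", "ne", "phala", "khAyA"]

pairs = align(anusaaraka_output, hindi_translation)
print(pairs)
```

Skipping 'ne' on the Hindi side means its presence or absence never blocks a match, which is the point the slide makes about idiosyncratic postpositions.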
  • Slide 62
  • Information Dynamics: Applications. Use Anusaaraka for gradual progression towards MT, maintaining reversibility.
  • Slide 63
  • Anusaaraka Philosophy: no loss of information; no effort should go to waste; users contribute towards the development.
  • Slide 64
  • Anusaaraka guidelines for developing MT systems: Make complete information available to the user, but do not clutter the scene. Separate resources that can, in principle, be made reliable from those that are inherently unreliable. Encourage users to participate as developers. (contd.)
  • Slide 65
  • Anusaaraka guidelines for developing MT systems (contd.): Provide alternative means to get the information. Develop algorithms for human beings first, without worrying about the machine. Do not reinvent the wheel: use existing resources and tools.
  • Slide 66
  • Anusaaraka is robust. Clear-cut separation of the resources that are in principle reliable from those that involve a probabilistic component. Graceful degradation: in case of failures it produces a 'rough' translation. It is 'rough' not in the sense that it is inaccurate or imprecise, but in the sense that it requires some human effort to understand the text. (Compare with a 'rough journey', where you are taken to the destination but the journey is not comfortable.)
  • Slide 67
  • Anusaaraka is - Completely Transparent The whole process of Machine Translation is transparent even to a layman
  • Slide 68
  • Human-understandable outputs. For example: chunking (colour scheme); parsed output (modifier-modified tree).
  • Slide 69
  • Anusaaraka vs. MT: Differences (Feature: Typical MT system / Anusaaraka)
      Goal: natural translation / provides access
      Unit of input: single sentence / XML document
      System components: Morph, POS, Parser, WSD, Generator / same as in MT + User Interface
      (contd.)
  • Slide 70
  • Differences (contd.) (Feature: Typical MT system / Anusaaraka)
      Approaches: ECLECTIC (choose the best); EBMT, Rule Based, Statistical, Hybrid
      Principle: ad hoc / Information Dynamics
      Guidelines for linguists: no specific guidelines / first write an algorithm for 'humans'
      (contd.)
  • Slide 71
  • Consequences. MT: later modules are affected by the errors of the previous modules. Anu: parallel processing ensures that different modules do not interfere. MT: 'rough' is not well defined, hence users may be misled. Anu: well-defined 'roughness'; theoretically no chance of the user being misled. (contd.)
  • Slide 72
  • Consequences (contd.). MT: the user cannot participate in the development process. Anu: the user can participate in the development activity. MT: linguists end up repeating some avoidable work. Anu: the linguist prepares data only once.
  • Slide 73
  • Suitable environment for contributors: if control of the technology is in the hands of a few, there is a danger of 'mass exploitation'. (Our understanding of Mahatma Gandhi, Hind Svaraj.)
  • Slide 74
  • Suitable environment for contributors: a proper environment and tools play a crucial role. Examples of what 'people in general can do' when provided with a proper environment and tools: Wikipedia, ConceptNet.
  • Slide 75
  • Anusaaraka provides the right kind of environment for people to contribute to the best of their ability.
  • Slide 76
  • What can I contribute? A Sanskrit scholar: application of Panini's grammar to modern languages; developing 'word formulae' for polysemous words (word watching).
  • Slide 77
  • What can I contribute? An English-Hindi bilingual: an English-Hindi dictionary of words, phrases and idioms; translations of example sentences; rules for Word Sense Disambiguation; enhancing the 'word formulae' and developing new ones.
  • Slide 78
  • What can I contribute? Good English background: providing example sentences covering various shades of meaning; providing idiomatic English constructions. Good Hindi background: pointing out errors in generation.
  • Slide 79
  • What can I contribute? A computer scientist: add 'intelligence' to the user interface to provide relevant 'on-line help'; provide an environment for 'collaborative' activity (MediaWiki, ...).
  • Slide 80
  • In summary: discovering the sources of information, and the additional resources that are needed, may take some time. At this stage, it may not be feasible to provide such a 'knowledge base'.
  • Slide 81
  • Analogy. Incommensurability: a 2D drawing of a 3D structure. Solution: provide different views of the 3-D structure, for example plan (top view), elevation (front view), side view.
  • Slide 82
  • Analogy (contd.). Incommensurability among languages. Solution: provide glosses from different views: Word Sense Disambiguation, movement of function words, descrambling.