Natural Language Processing at IIT Bombay
Raj Dabre
Computer Science and Engineering Department
IIT Bombay
Guide: Prof Pushpak Bhattacharyya
17th July 2013
Multilingual Computation
Multilingual and Multidocument Summarization
Cross Lingual Search
Multilingual Sentiment Analysis
Machine Translation
Linguistics is the eye and computation the body
NLP is ubiquitous
Multilinguality: Indian situation
Major streams
Indo European
Dravidian
Sino Tibetan
Austro-Asiatic (Meghalaya)
Some languages are ranked within 20 in the world in terms of the populations speaking them
Hindi and Urdu: 5th (~500 million)
Bangla: 7th (~300 million)
Marathi: 14th (~70 million)
(Image Taken from the web)
Motivation: cross the language barrier
English still the most dominant language on the web
Contributes 72% of the content
Number of non-English users steadily rising all over the world
English penetration in India
Estimated to be around 3-4%
Mostly the urban educated class
Need to enable access to above information through local languages
Facts and figures about NLP@IITB
Highly visible nationally, internationally
5 associated faculty: 3 CSE + 2 HSS
PhD students: graduated-13; ongoing-9
M.Tech students: graduated-92; ongoing-11
Publications in highly visible fora: ACL, COLING, WWW, ECML, EMNLP
Sponsorship: Ministry of IT, Yahoo, IBM, Microsoft, HP Labs, Xerox, AOL (funding of approx Rs 10 crores) (16,120,090.00 Japanese Yen)
Technology developed used by search engine companies.
Organizing major international conferences: COLING 2012 (8-16 DEC, 2012)
Associations with international universities (Copenhagen, Grenoble, Kyoto etc.)
Indowordnet
Cross Lingual Information Retrieval/Access
Indian language Machine Translation
Sentiment Analysis
Text Entailment
Word Sense Disambiguation
Universal Networking Language
Thwarting Detection
Some initiatives with corporate companies
Yahoo India
Xerox
Crimson and EzDi
Eye tracking
Hindi Wordnet
Dravidian Language Wordnet
North East Language Wordnet
Marathi Wordnet
Sanskrit Wordnet
English Wordnet
Bengali Wordnet
Punjabi Wordnet
Konkani Wordnet
Urdu Wordnet
Gujarati Wordnet
Oriya Wordnet
Kashmiri Wordnet
INDOWORDNET
8+ Linguists at IITB.
Work on populating Hindi and Marathi and Sanskrit wordnets.
Adding relations between synsets (Hypernymy, Hyponymy, Ontology information etc)
Developing and providing API for Hindi Wordnet
Synset linking and cross linking
Corpus annotation
INDOWORDNET
Salil Joshi, Arindam Chatterjee, Arun Karthikeyan Karra and Pushpak Bhattacharyya, Eating your own cooking: An on-line heuristic-based wordnet linking system using previously linked synsets, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (demo paper)
Arindam Chatterjee, Salil Joshi, Pushpak Bhattacharyya, Diptesh Kanojia and Akhlesh Meena, A Study of the Sense Annotation Process: Man v/s Machine, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.
Anuja Ajotikar, Malhar Kulkarni and Pushpak Bhattacharyya, Verbal Roots in Sanskrit Wordnet, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.
Cross Language Information Retrieval (CLIR)
(Figure: CLIR pipeline. A Hindi query is sent to the CLIR engine, which uses language resources and a target-language index in English built from crawled and indexed web pages holding the target information in English. The engine returns a ranked list of results with result snippets in Hindi.)
Example result snippet (translated from Hindi): "Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express."
Cross Language Information Access (CLIA) Consortia Project
Indian Language CLIR Engine under development
Input – 9 Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Assamese, Oriya and Punjabi)
Output – Hindi, English and the input language of the query
Domains – Tourism (current release)
Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Gauhati University, DA-IICT, IIIT Bhubaneshwar, Jadavpur University
IIT Bombay – Overall co-ordinator
http://www.tdil-dc.in/sandhan/locale.jsp?hi
Cross Language Information Access (CLIA) Consortia Project
Sanjeet Khaitan, Kamaljeet Verma and Pushpak Bhattacharyya, Exploiting Semantic Proximity for Information Retrieval, IJCAI 2007, Workshop on Cross Lingual Information Access, Hyderabad, India, Jan, 2007.
Satish Kagathara, Manish Deodalkar and Pushpak Bhattacharyya, A Multistage Search Strategy for Cross Lingual Information Retrieval, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February, 2005.
Arjun Atreya, Swapnil Chaudhury, Pushpak Bhattacharyya and Ganesh Ramakrishnan, Building Multilingual Search Index Using Open Source Framework, 3rd Workshop on South East and South Asian NLP, part of COLING 2012, Mumbai, India, 8 Dec, 2012
Swapnil Chaudhury, Arjun Atreya, Pushpak Bhattacharyya and Ganesh Ramakrishnan, Error Tracking in Search Engine Development, 3rd Workshop on South East and South Asian NLP, part of COLING 2012, Mumbai, India, 8 Dec, 2012
India wide activity on Machine Translation
ILILMT Systems
Indian Language to Indian Language MT
ILMT_Tam-Hin for Tamil Hindi
ILMT_Tel-Hin for Telugu Hindi
ILMT_Mar-Hin for Marathi Hindi
ILMT_Ben-Hin for Bangla Hindi
ILMT_Tam-Tel for Tamil Telugu
ILMT_Urd-Hin for Urdu Hindi
ILMT_Pun-Hin for Punjabi Hindi
ILMT_Mal-Tam for Malayalam Tamil
ILMT_Kan-Hin for Kannada Hindi
http://www.cfilt.iitb.ac.in/~ilmt/ilmtinterface/admin/
Objectives
To produce and deliver high quality hybrid machine translation systems for 9 Indian Languages
In the process develop high quality tools for linguistic analyses like:
Morphological Analyzers
Part of Speech Taggers etc
Architecture of Sampark
(Figure: Sampark architecture diagram, including the Generation stage)
(Image Taken from MT report by Raj Dabre)
Example
I am Raj
I <pron, first person, singular, Noun Phrase> am <vcop, present tense, Verb Phrase> Raj <np, singular, Noun Phrase>
I <pron> + am <vcop> = 私は
Raj <np> = ラジ
am <present tense> = です
Rules given by linguists are used.
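The example above can be sketched as a toy analysis, transfer and generation pipeline. This is an illustration only: the tag table, the lexicon entries and the function name `translate` are hypothetical and are not Sampark's actual rules or data structures.

```python
# Toy rule-based transfer for "I am Raj" -> Japanese (illustrative only).
# The tags, lexicon and rules are hypothetical, not taken from Sampark.

# Analysis: each English token gets a coarse tag.
ANALYSIS = {"i": "pron", "am": "vcop", "raj": "np"}

# Transfer rules written by hand, in the spirit of linguist-given rules.
PRON_COPULA = {("i", "am"): "私は"}   # pronoun + copula -> topic phrase
NAMES = {"raj": "ラジ"}                # proper noun, transliterated
COPULA_TENSE = {"am": "です"}          # present-tense copula


def translate(sentence: str) -> str:
    tokens = sentence.lower().split()
    tags = [ANALYSIS[t] for t in tokens]
    if tags != ["pron", "vcop", "np"]:
        raise ValueError("this sketch only handles 'pronoun copula name'")
    pron, cop, name = tokens
    # Generation: Japanese places the copula at the end of the sentence.
    return PRON_COPULA[(pron, cop)] + NAMES[name] + COPULA_TENSE[cop]


print(translate("I am Raj"))
```

A real system would replace the lookup tables with a morphological analyzer, a bilingual lexicon and a much larger rule set.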
Challenges
• Adequacy = Meaning, Fluency = Well-Formedness
• Currently 52% (adequacy) and 20% (fluency)
• Score = (S5+0.8*S4+0.6*S3)/Total (IIITH)
• Adequacy depends on high quality bilingual lexicon (currently have only 35000 roots; need 90000)
• Adequacy also depends on good WSD.
• Fluency depends on FWD.
• Observed that dependency relations need to be identified properly.
• Language divergence needs to be studied.
o Marathi - Hindi have more divergence than it seems (linguists' claim)
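The scoring formula above can be worked through in a few lines. This sketch assumes that S5, S4 and S3 are the numbers of sentences that evaluators graded 5, 4 and 3, and that Total is the number of sentences evaluated; the slide does not spell this out, and the counts below are made up for illustration.

```python
def weighted_score(s5: int, s4: int, s3: int, total: int) -> float:
    """Score = (S5 + 0.8*S4 + 0.6*S3) / Total, as given on the slide."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (s5 + 0.8 * s4 + 0.6 * s3) / total


# Hypothetical grade counts for 100 evaluated sentences.
score = weighted_score(s5=20, s4=25, s3=20, total=100)
print(f"{score:.2%}")
```

Sentences graded below 3 contribute nothing to the numerator but still count in Total, so they pull the score down.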
Publications
Raj Dabre, Archana Amberkar and Pushpak Bhattacharyya, Morphology Analyser for Affix Stacking Languages: a case study in Marathi, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (poster paper)
Harshada Gune, Mugdha Bapat, Mitesh Khapra and Pushpak Bhattacharyya, Verbs are where all the Action Lies: Experiences of Shallow Parsing of a Morphologically Rich Language, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.
Kuhoo Gupta, Manish Shrivastava, Smriti Singh and Pushpak Bhattacharyya, Morphological Richness Offsets Resource Poverty- an Experience in Building a POS Tagger for Hindi, COLING/ACL-2006, Sydney, Australia, July, 2006
English to Indian Language MT
– I want water
मला पाणी पाहिजे. {Mala Paani Pahije}
Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
Approaches: Statistical MT, Example Based MT
Members: CDAC Pune (c), IIT Bombay, IIITH, IIITA
Two directions
Tree Adjoining Grammar based
Statistical MT based
English to Indian Language MT
Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh and Pushpak Bhattacharyya, Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT, ACL-IJCNLP 2009, Singapore, August, 2009.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan, 2007.
Future MT
Intention for MT in foreign languages other than English.
Hindi-Japanese: May be interesting.
Same sentential ordering
Comparable morphology
Almost no parallel corpus.
Sentiment Analysis
• General Idea:
o Given a document identify whether it has a positive or negative sentiment. (Basic aim)
o Many web pages have tons of user reviews and comments.
o Need to classify pages based on the sentiment of the content.
• YouCat, TwitSent, Document polarity detection.
Sentiment Analysis
• Subhabrata Mukherjee and Pushpak Bhattacharyya, Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Computational Linguistics Conference (COLING 2012), Mumbai, 10-14 Dec, 2012
• Kashyap Popat, Balamurali A.R, Pushpak Bhattacharyya and Gholamreza Haffari, The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (long paper)
• Subhabrata Mukherjee and Pushpak Bhattacharyya, YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data & User Comments using WordNet & Wikipedia, COLING 2012, Mumbai, 10-14 Dec, 2012
Text Entailment
• General Idea:
o Given 2 sentences identify whether the first implies the other.
o Eg: "One of my teeth fell" and "I have one less tooth"
o "Obama was re-elected as the president" and "Obama is the current President of USA"
• Mostly use UNL and Dependency Parsing along with rules to determine causality.
WSD
• General idea:
o Given a sentence with polysemous words identify the appropriate sense of each word.
o Eg: I bank<depend> on the bank<money> by the bank<river>. {I depend on the money bank by the river bank}
o Primary aim is to estimate P(sense | word) from an annotated or unannotated corpus.
o Secondary aim is to estimate the "sense" given a "word" occurrence.
• IWSD, Unsupervised WSD etc.
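As a minimal sketch of the primary aim above, P(sense | word) can be estimated from a sense-annotated corpus by relative frequency, and the most probable sense then serves as a context-free baseline guess. The tiny corpus, the sense labels and the function names below are invented for illustration; this is a most-frequent-sense baseline, not the IWSD or unsupervised algorithms named above.

```python
from collections import Counter, defaultdict

# Hypothetical sense-annotated corpus: (word, sense) pairs.
corpus = [
    ("bank", "money"), ("bank", "money"), ("bank", "river"),
    ("bank", "depend"), ("bat", "animal"), ("bat", "cricket"),
    ("bat", "cricket"),
]


def estimate_sense_probs(pairs):
    """Relative-frequency estimate of P(sense | word)."""
    counts = defaultdict(Counter)
    for word, sense in pairs:
        counts[word][sense] += 1
    return {
        word: {s: c / sum(senses.values()) for s, c in senses.items()}
        for word, senses in counts.items()
    }


def most_frequent_sense(probs, word):
    """Pick the sense with the highest estimated probability."""
    return max(probs[word], key=probs[word].get)


probs = estimate_sense_probs(corpus)
print(probs["bank"])
print(most_frequent_sense(probs, "bank"))
```

This baseline ignores context entirely, so it would assign the same sense to all three occurrences of "bank" in the example sentence; context-sensitive methods exist to fix exactly that.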
WSD
Sudha Bhingardive, Samiulla Shaikh and Pushpak Bhattacharyya, Neighbor Help: Bilingual Unsupervised WSD Using Context, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)
Salil Joshi, Mitesh M. Khapra and Pushpak Bhattacharyya, I Can Sense It: a comprehensive online system for WSD, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (demo paper)
Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya, Together We Can: Bilingual Bootstrapping for WSD, Annual Meeting of the Association of Computational Linguistics (ACL 2011), Oregon, USA, June 2011.
Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.
The Vauquois triangle.
Gap between languages (Image Taken from the web)
UNL
• Universal Networking Language
• General Idea:
o Generate UNL representation of input English sentence using Enconverter
o UNL is a language independent representation of a sentence
o Top of the Vauquois Triangle
o Needs high amount of analysis
UNL
John worked specially for the social fund.
Image from: English to UNL (Interlingua) Enconversion, Manoj Jain and Om P. Damani, LTC 09
UNL
Avishek Dan and Pushpak Bhattacharyya, CFILT-CORE: Finding Semantic Textual Similarity using UNL, *SEM workshop shared task on Text Similarity, part of NAACL HLT 2013, Atlanta, USA, 14-15 June, 2013
Janardhan Singh, Arindam Bhattacharyya and Pushpak Bhattacharyya, Janardhan: Semantic Textual Similarity Using Universal Networking Language Graph Matching, Starsem workshop, Part of NAACL 2012, Montreal, Canada, 7-8 June, 2012
Rajat Mohanty, Ashish Almeida and Pushpak Bhattacharyya, Prepositional Phrase Attachment and Interlingua, International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2005) Workshop on UNL and other Interlingua and their Applications, Mexico City, Mexico, February, 2005
Thwarting
• General Idea:
o Given a document identify the content which is misleading
o Need such processing to detect and avoid spammed and trolled web pages
o Eg:
– Camera has good specs, excellent battery life, but I did not like it.
– Camera has good specs and the purchase was worth it
o Document polarity analysis at discourse level is needed.
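The camera example above can be sketched with a toy check: score each clause with a small polarity lexicon and flag the document as thwarted when the final clause disagrees with the overall polarity of the earlier clauses. The lexicon, the clause splitting, the crude negation handling and the function names are all simplifying assumptions made for this illustration; they are not the method of the thwarting work itself.

```python
import re

# Hypothetical polarity lexicon; real systems use far larger resources.
POSITIVE = {"good", "excellent", "worth", "like"}
NEGATIVE = {"bad", "poor"}


def clause_polarity(clause: str) -> int:
    """Return +1, -1 or 0 for a clause."""
    words = re.findall(r"[a-z]+", clause.lower())
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    if "not" in words:  # crude negation: flip the clause polarity
        score = -score
    return (score > 0) - (score < 0)


def is_thwarted(document: str) -> bool:
    """True if the last clause contradicts the earlier clauses."""
    parts = re.split(r",|\bbut\b|\band\b", document)
    clauses = [p for p in parts if p.strip()]
    polarities = [clause_polarity(c) for c in clauses]
    if len(polarities) < 2:
        return False
    earlier, final = sum(polarities[:-1]), polarities[-1]
    return earlier * final < 0


doc1 = "Camera has good specs, excellent battery life, but I did not like it."
doc2 = "Camera has good specs and the purchase was worth it"
print(is_thwarted(doc1), is_thwarted(doc2))
```

A plain word-count classifier would call the first review positive because positive words outnumber negative ones, which is why clause order matters here.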
Thwarting
Ankit Ramteke, Akshat Malu, Pushpak Bhattacharyya and Saketha Nath, Detecting Turnarounds in Sentiment Analysis: Thwarting, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)
Development of Multilingual Resources and Technologies for Indian Languages
(A Xerox-IITB initiative)
• Crowdsourcing for generating parallel translations for Statistical Machine Translation
• Cycle: Annotation, Machine Learning, Correction
• Automatic means of translation error detection and correction
Crowdsourcing Publications
Anoop Kunchukuttan, Rajen Chatterjee, Shourya Roy, Abhijit Mishra and Pushpak Bhattacharyya, TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (demo paper)
Yahoo-IITB initiative on MT
Domain specific high quality MT
Statistical
Languages: Hindi, Marathi, English
Headlines
Cricket commentary
Recent internship: Rahul Sharnagat: URL translation
Translate from English to Hindi
Eg:
● www.xyz.com/magic
● www.xyz.com/魔法
Translation and Transliteration
Recent industry collaboration
Collaboration with Crimson
Automatically remove mistakes in text
Used in technical writing to assist in easy writing of technical texts
Collaboration with EzDI
Patient condition monitoring from transcripts
● Patient diagnosed with fever
● Medicine given
● Fever 38 degrees C
● Fever 37 degrees C
– Conclusion: Medicine is effective and patient is recovering.
Cognitive studies through Eye-Tracking at CFILT
Eyes are the window to the soul
(Image Taken from the web)
Motivation
● Approaches for various current NLP tasks can be classified as weak AI systems.
● According to the classical definition, a strong AI based system should perform any language processing task in the same manner and with similar accuracy as a human being.
● Detailed understanding of human way of language processing is necessary.
● Human language processing governed by cognitive processes inside the brain
– Correlates with the movement of eyes during analysis and synthesis of text.
● Aim to study gaze behaviour of subjects for different tasks through different experimental setups.
Ongoing work
● Study of word sense annotation through Eye-Tracking.
– Which type of words do humans spend more time on during sense disambiguation? How much does discourse help?
● Translation Process Research.
– What makes a text difficult to translate?
– Measurement of reordering effort.
● Reader sentiment analysis.
– What separates objective texts from sentiment texts?
– No-goal reading VS sentiment-goal reading.
Remote Eye-Trackers used at CFILT
● Tobii TX300 (Left) and SMI Red 500 (Right)
● These devices allow free head movement to some extent.
● Allow for addition of modules to process eye-tracking data.
● Very Expensive.
(Images Taken from the web)
Eye Tracking Publications
Abhijit Mishra and Pushpak Bhattacharyya, Automatically Predicting Sentence Translation Difficulty, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)
Abhijit Mishra, Michael Carl and Pushpak Bhattacharyya, A Heuristic Based Approach for Systematic Error Correction of Gaze Data for Reading, First Workshop on Eye Tracking and NLP, part of COLING 2012, Mumbai, India, 15 Dec, 2012
Thank you
Resources: http://www.cfilt.iitb.ac.in
Publications: http://www.cse.iitb.ac.in/~pb