Natural Language Processing at IIT Bombay

Raj Dabre

Computer Science and Engineering Department

IIT Bombay

[email protected]

Guide: Prof Pushpak Bhattacharyya

17th July 2013

Multilingual Computation

Multilingual and Multidocument Summarization

Cross Lingual Search

Multilingual Sentiment Analysis

Machine Translation

Linguistics is the eye and computation the body

NLP is ubiquitous

Multilinguality: Indian situation

Major streams

Indo European

Dravidian

Sino Tibetan

Austro-Asiatic (Meghalaya)

Some of these languages rank among the top 20 in the world by number of speakers

Hindi and Urdu: 5th (~500 million)

Bangla: 7th (~300 million)

Marathi: 14th (~70 million)

(Image Taken from the web)

Motivation: cross the language barrier

English is still the most dominant language on the web

Contributes 72% of the content

Number of non-English users steadily rising all over the world

English penetration in India

Estimated to be around 3-4%

Mostly the urban educated class

Need to enable access to this information through local languages

Facts and figures about NLP@IITB

Highly visible nationally, internationally

5 associated faculty: 3 CSE + 2 HSS

PhD students: graduated-13; ongoing-9

M.Tech students: graduated-92; ongoing-11

Publications in highly visible fora: ACL, COLING, WWW, ECML, EMNLP

Sponsorship: Ministry of IT, Yahoo, IBM, Microsoft, HP Labs, Xerox, AOL (funding of approx Rs 10 crores) (16,120,090.00 Japanese Yen)

Technology developed here is used by search engine companies.

Organizing major international conferences: COLING 2012 (8-16 DEC, 2012)

Associations with international universities (Copenhagen, Grenoble, Kyoto etc.)

Some specific projects

http://www.cfilt.iitb.ac.in/


Indowordnet

Cross Lingual Information Retrieval/Access

Indian language Machine Translation

Sentiment Analysis

Text Entailment

Word Sense Disambiguation

Universal Networking Language

Thwarting Detection

Some initiatives with corporate companies

Yahoo India

Xerox

Crimson and EzDi

Eye tracking

Hindi Wordnet

Dravidian Language Wordnet

North East Language Wordnet

Marathi Wordnet

Sanskrit Wordnet

English Wordnet

Bengali Wordnet

Punjabi Wordnet

Konkani Wordnet

Urdu Wordnet

Gujarati Wordnet

Oriya Wordnet

Kashmiri Wordnet

INDOWORDNET

8+ Linguists at IITB.

Work on populating the Hindi, Marathi and Sanskrit wordnets.

Adding relations between synsets (hypernymy, hyponymy, ontology information, etc.)

Developing and providing an API for the Hindi Wordnet

Synset linking and cross linking

Corpus annotation

INDOWORDNET

Salil Joshi, Arindam Chatterjee, Arun Karthikeyan Karra and Pushpak Bhattacharyya, Eating your own cooking: An on-line heuristic-based wordnet linking system using previously linked synsets, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (demo paper)

Arindam Chatterjee, Salil Joshi, Pushpak Bhattacharyya, Diptesh Kanojia and Akhlesh Meena, A Study of the Sense Annotation Process: Man v/s Machine, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.

Anuja Ajotikar, Malhar Kulkarni and Pushpak Bhattacharyya, Verbal Roots in Sanskrit Wordnet, International Conference on Global Wordnets (GWC 2011), Matsue, Japan, Jan, 2012.

http://www.cse.iitb.ac.in/~pb/papers/coling12-wn-linking.pdf




http://www.cse.iitb.ac.in/~pb/papers/gwc12-man-machine.pdf

http://www.cse.iitb.ac.in/~pb/papers/gwc12-swn-verb.pdf

Cross Language Information Retrieval (CLIR)

[Figure: CLIR pipeline. Components shown: Hindi query, CLIR engine, language resources, crawled and indexed web pages, target language index in English, ranked list of results, target information in English, result snippets in Hindi.]

Example result snippet (translated from Hindi): "Many trains are available to reach the holy city of Tirupati. If you are travelling from Mumbai, you can take the Mumbai-Chennai Express."

Cross Language Information Access (CLIA) Consortia Project

Indian Language CLIR Engine under development

Input – 9 Indian Languages (Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Assamese, Oriya and Punjabi)

Output – Hindi, English and the input language of the query

Domains – Tourism (current release)

Involves 10 academic institutes all over the country: IITs, Indian Statistical Institute, CDAC, Anna University, Gauhati University, DA-IICT, IIIT Bhubaneshwar, Jadavpur University

IIT Bombay – Overall co-ordinator

http://www.tdil-dc.in/sandhan/locale.jsp?hi



Cross Language Information Access (CLIA) Consortia Project

Sanjeet Khaitan, Kamaljeet Verma and Pushpak Bhattacharyya, Exploiting Semantic Proximity for Information Retrieval, IJCAI 2007, Workshop on Cross Lingual Information Access, Hyderabad, India, Jan, 2007.

Satish Kagathara, Manish Deodalkar and Pushpak Bhattacharyya, A Multistage Search Strategy for Cross Lingual Information Retrieval, Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February, 2005.

Arjun Atreya, Swapnil Chaudhury, Pushpak Bhattacharyya and Ganesh Ramakrishnan, Building Multilingual Search Index Using Open Source Framework, 3rd Workshop on South East and South Asian NLP, part of COLING 2012, Mumbai, India, 8 Dec, 2012

Swapnil Chaudhury, Arjun Atreya, Pushpak Bhattacharyya and Ganesh Ramakrishnan, Error Tracking in Search Engine Development, 3rd Workshop on South East and South Asian NLP, part of COLING 2012, Mumbai, India, 8 Dec, 2012

http://www.cse.iitb.ac.in/~pb/papers/IJCAI-CLIA-Exploiting-Semantics.pdf

http://www.cse.iitb.ac.in/~pb/papers/multistage-search-agro.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling12-wssanlp-multilingual-index.pdf





http://www.cse.iitb.ac.in/~pb/papers/coling12-wssanlp-clia-tracker.pdf

India wide activity on Machine Translation

IL-ILMT Systems

Indian Language to Indian Language MT

ILMT_Tam-Hin for Tamil Hindi

ILMT_Tel-Hin for Telugu Hindi

ILMT_Mar-Hin for Marathi Hindi

ILMT_Ben-Hin for Bangla Hindi

ILMT_Tam-Tel for Tamil Telugu

ILMT_Urd-Hin for Urdu Hindi

ILMT_Pun-Hin for Punjabi Hindi

ILMT_Mal-Tam for Malayalam Tamil

ILMT_Kan-Hin for Kannada Hindi

http://www.cfilt.iitb.ac.in/~ilmt/ilmtinterface/admin/

Objectives

To produce and deliver high quality hybrid machine translation systems for 9 Indian Languages

In the process develop high quality tools for linguistic analyses like:

Morphological Analyzers

Part of Speech Taggers etc

Architecture of Sampark

[Figure: Sampark architecture diagram; the only stage label recoverable here is "Generation".]

(Image Taken from MT report by Raj Dabre)

Example

I am Raj

I <pron, first person, singular, Noun Phrase> am <vcop, present tense, Verb Phrase> Raj <np, singular, Noun Phrase>

I <pron> + am <vcop> = 私は

Raj <np> = ラジ

am <present tense> = です

Rules given by linguists are used.
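
The rule application in this example can be sketched in a few lines of Python. This is only a toy illustration, not the Sampark implementation: the tags come from the slide, while `ANALYSIS`, `NAME_LEXICON` and `transfer` are hypothetical names introduced here, and the sketch covers only sentences of the form "I am <name>".

```python
# Analysis of "I am Raj" as (word, tag) pairs, using the tags from the example.
ANALYSIS = [("I", "pron"), ("am", "vcop"), ("Raj", "np")]

# Hypothetical transliteration lexicon for proper nouns.
NAME_LEXICON = {"Raj": "ラジ"}


def transfer(analysis):
    """Apply the three hand-written rules from the example."""
    tags = [tag for _, tag in analysis]
    words = [word for word, _ in analysis]
    if tags != ["pron", "vcop", "np"] or words[:2] != ["I", "am"]:
        raise ValueError("this sketch only covers 'I am <name>'")
    topic = "私は"                  # I <pron> + am <vcop>
    name = NAME_LEXICON[words[2]]   # Raj <np>
    copula = "です"                 # am <present tense>
    # Japanese is verb-final, so the copula goes last.
    return topic + name + copula


print(transfer(ANALYSIS))  # 私はラジです
```

The three rule outputs are concatenated with the copula last, which gives the full sentence 私はラジです.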

Challenges

• Adequacy = meaning, fluency = well-formedness

• Currently 52% and 20%, respectively

• Score = (S5+0.8*S4+0.6*S3)/Total (IIITH)

• Adequacy depends on high quality bilingual lexicon (currently have only 35000 roots; need 90000)

• Adequacy also depends on good WSD.

• Fluency depends on FWD.

• Observed that dependency relations need to be identified properly.

• Language divergence needs to be studied.

o Marathi-Hindi have more divergence than it seems (linguists' claim)
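
The scoring formula above can be worked through with a small Python sketch. It assumes, since the slide does not define the symbols, that S5, S4 and S3 are the numbers of sentences rated 5, 4 and 3 by human judges on a five-point scale and that Total is the number of rated sentences. The function name and the sample ratings are made up for illustration.

```python
from collections import Counter

# Weights from the formula: a rating of 5 counts fully, 4 counts 0.8, 3 counts 0.6.
# Ratings below 3 contribute nothing (assumed reading of the formula).
WEIGHTS = {5: 1.0, 4: 0.8, 3: 0.6}


def weighted_score(ratings):
    """Score = (S5 + 0.8*S4 + 0.6*S3) / Total, with Sk = number of sentences rated k."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    weighted = sum(weight * counts[k] for k, weight in WEIGHTS.items())
    return weighted / len(ratings)


# Made-up ratings for ten sentences: S5 = 3, S4 = 2, S3 = 2, Total = 10.
ratings = [5, 5, 4, 3, 2, 1, 4, 3, 5, 2]
print(f"{weighted_score(ratings):.2f}")  # 0.58
```

For these ratings the score is (3 + 0.8*2 + 0.6*2) / 10 = 0.58.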

Publications

Raj Dabre, Archana Amberkar and Pushpak Bhattacharyya, Morphology Analyser for Affix Stacking Languages: a case study in Marathi, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (poster paper)

Harshada Gune, Mugdha Bapat, Mitesh Khapra and Pushpak Bhattacharyya, Verbs are where all the Action Lies: Experiences of Shallow Parsing of a Morphologically Rich Language, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.

Kuhoo Gupta, Manish Shrivastava, Smriti Singh and Pushpak Bhattacharyya, Morphological Richness Offsets Resource Poverty- an Experience in Building a POS Tagger for Hindi, COLING/ACL-2006, Sydney, Australia, July, 2006

http://www.cse.iitb.ac.in/~pb/papers/coling12-marathi-morph-analyser.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling10-marathi-shallow-parsing.pdf

http://www.cse.iitb.ac.in/~pb/papers/ACL-2006-Hindi-POS-Tagging.pdf

English to Indian Language MT

– I want water

मला पाणी पाहिजे. {Mala Paani Pahije}

Indian Languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil

Approaches: Statistical MT, Example Based MT

Members: CDAC Pune (c), IIT Bombay, IIITH, IIITA

Two directions

Tree Adjoining Grammar based

Statistical MT based

English to Indian Language MT

Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh and Pushpak Bhattacharyya, Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT, ACL-IJCNLP 2009, Singapore, August, 2009.

R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar and Ritesh M. Shah, Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU, ICON 2007, Hyderabad, India, Jan, 2007.

http://www.cse.iitb.ac.in/~pb/papers/acl09-smt.pdf





http://www.cse.iitb.ac.in/~pb/papers/icon07-bleu.pdf

Future MT

Plan to extend MT to foreign languages other than English.

Hindi-Japanese: May be interesting.

Same sentential ordering

Comparable morphology

Almost no parallel corpus.

Sentiment Analysis

• General Idea:

o Given a document, identify whether it has a positive or negative sentiment. (Basic aim)

o Many web pages have tons of user reviews and comments.

o Need to classify pages based on the sentiment of the content.

• YouCat, TwitSent, Document polarity detection.

Sentiment Analysis

• Subhabrata Mukherjee and Pushpak Bhattacharyya, Sentiment Analysis in Twitter with Lightweight Discourse Analysis, Computational Linguistics Conference (COLING 2012), Mumbai, 10-14 Dec, 2012

• Kashyap Popat, Balamurali A.R, Pushpak Bhattacharyya and Gholamreza Haffari, The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (long paper)

• Subhabrata Mukherjee and Pushpak Bhattacharyya, YouCat : Weakly Supervised Youtube Video Categorization System from Meta Data & User Comments using WordNet & Wikipedia, COLING 2012, Mumbai, 10-14 Dec, 2012

http://www.cse.iitb.ac.in/~pb/papers/coling12-discourse-sa.pdf

http://www.cse.iitb.ac.in/~pb/papers/acl13-sa.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling12-YouCat.pdf






Text Entailment

• General Idea:

o Given 2 sentences, identify whether the first implies the other.

o Eg: "One of my teeth fell" and "I have one less tooth"

o "Obama was re-elected as the president" and "Obama is the current President of USA"

• Mostly use UNL and Dependency Parsing along with rules to determine causality.

WSD

• General idea:

o Given a sentence with polysemous words, identify the appropriate sense of each word.

o Eg: I bank<depend> on the bank<money> by the bank<river>. {I depend on the money bank by the river bank}

o Primary aim is to estimate P(sense | word) from an annotated or unannotated corpus.

o Secondary aim is to estimate the "sense" given a "word" occurrence.

• IWSD, Unsupervised WSD etc.
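
As a rough illustration of the primary aim, the sketch below estimates P(sense | word) by relative frequency from a sense-annotated corpus and then picks the most probable sense. It is a minimal baseline under stated assumptions, not the IWSD or unsupervised methods named above: the tiny corpus and the function names are invented for the example, and context is ignored entirely.

```python
from collections import Counter, defaultdict

# Toy sense-annotated corpus of (word, sense) pairs; made up for illustration.
corpus = [
    ("bank", "money"), ("bank", "money"), ("bank", "river"), ("bank", "depend"),
    ("plant", "factory"), ("plant", "flora"), ("plant", "flora"),
]


def estimate_sense_probs(tagged):
    """Relative-frequency estimate of P(sense | word)."""
    counts = defaultdict(Counter)
    for word, sense in tagged:
        counts[word][sense] += 1
    return {
        word: {sense: n / sum(senses.values()) for sense, n in senses.items()}
        for word, senses in counts.items()
    }


def most_likely_sense(probs, word):
    """Context-free guess: the sense with the highest estimated probability."""
    return max(probs[word], key=probs[word].get)


probs = estimate_sense_probs(corpus)
print(probs["bank"]["money"])            # 0.5
print(most_likely_sense(probs, "plant"))  # flora
```

A practical system would condition on the surrounding context as well, which is what lets the three occurrences of "bank" in the example sentence receive three different senses.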

WSD

Sudha Bhingardive, Samiulla Shaikh and Pushpak Bhattacharyya, Neighbor Help: Bilingual Unsupervised WSD Using Context, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)

Salil Joshi, Mitesh M. Khapra and Pushpak Bhattacharyya, I Can Sense It: a comprehensive online system for WSD, COLING 2012, Mumbai, India, 10-14 Dec, 2012 (demo paper)

Mitesh Khapra, Salil Joshi, Arindam Chatterjee and Pushpak Bhattacharyya, Together We Can: Bilingual Bootstrapping for WSD, Annual Meeting of the Association of Computational Linguistics (ACL 2011), Oregon, USA, June 2011.

Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.

http://www.cse.iitb.ac.in/~pb/papers/acl13-wsd.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling12-wsd-demo.pdf

http://www.cse.iitb.ac.in/~pb/papers/acl11-bilingual-bootstrapping.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling10-language-adptation.pdf

The Vauquois triangle.

Gap between languages (Image Taken from the web)

UNL

• Universal Networking Language

• General Idea:

o Generate UNL representation of input English sentence using Enconverter

o UNL is a language independent representation of a sentence

o Top of the Vauquois Triangle

o Needs high amount of analysis

UNL

John worked specially for the social fund.

Image from: English to UNL (Interlingua) Enconversion, Manoj Jain and Om P. Damani, LTC 09

UNL

Avishek Dan and Pushpak Bhattacharyya, CFILT-CORE: Finding Semantic Textual Similarity using UNL, *SEM workshop shared task on Text Similarity, part of NAACL HLT 2013, Atlanta, USA, 14-15 June, 2013

Janardhan Singh, Arindam Bhattacharyya and Pushpak Bhattacharyya, Janardhan: Semantic Textual Similarity Using Universal Networking Language Graph Matching, Starsem workshop, Part of NAACL 2012, Montreal, Canada, 7-8 June, 2012

Rajat Mohanty, Ashish Almeida and Pushpak Bhattacharyya, Prepositional Phrase Attachment and Interlingua, International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2005) Workshop on UNL and other Interlingua and their Applications, Mexico City, Mexico, February, 2005

http://www.cse.iitb.ac.in/~pb/papers/naacl13-starsem-text-similarity.pdf



http://www.cse.iitb.ac.in/~pb/papers/CCLING-UNL-Workshop-Feb05.pdf

Thwarting

• General Idea:

o Given a document, identify the content which is misleading.

o Such processing is needed to detect and avoid spammed and trolled web pages.

o Eg:

o Camera has good specs, excellent battery life, but I did not like it.

o Camera has good specs and the purchase was worth it.

o Document polarity analysis at discourse level is needed.
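
To see why discourse-level analysis matters for the two camera sentences, the sketch below compares a plain word-counting polarity score with a variant that lets the clause after "but" decide. This is an illustrative heuristic only, not the method of the thwarting paper: the lexicon, the negation handling and the "but" rule are all assumptions made for the example.

```python
import re

# Toy polarity lexicon and negators; hypothetical values for illustration.
LEXICON = {"good": 1, "excellent": 1, "worth": 1, "like": 1, "bad": -1, "poor": -1}
NEGATORS = {"not", "never", "no"}


def clause_polarity(clause):
    """Sum lexicon scores, flipping the next sentiment word after a negator."""
    score, negate = 0, False
    for token in re.findall(r"[a-z']+", clause.lower()):
        if token in NEGATORS:
            negate = True
        elif token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
            negate = False
    return score


def discourse_polarity(text):
    """Assumed rule: if the clause after the last 'but' carries sentiment, it wins."""
    parts = re.split(r"\bbut\b", text, flags=re.IGNORECASE)
    final = clause_polarity(parts[-1])
    if len(parts) > 1 and final != 0:
        return final
    return clause_polarity(text)


thwarted = "Camera has good specs, excellent battery life, but I did not like it."
plain = "Camera has good specs and the purchase was worth it."

print(clause_polarity(thwarted))     # 1  (word counting says positive)
print(discourse_polarity(thwarted))  # -1 (the final clause overturns it)
print(discourse_polarity(plain))     # 2
```

Word counting labels the first review positive because two positive words outweigh one negated word, while the discourse-aware variant follows the reviewer's final verdict.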

Thwarting

Ankit Ramteke, Akshat Malu, Pushpak Bhattacharyya and Saketha Nath, Detecting Turnarounds in Sentiment Analysis: Thwarting, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)

http://www.cse.iitb.ac.in/~pb/papers/acl13-thwarting.pdf

Development of Multilingual Resources and Technologies for Indian Languages

(A Xerox-IITB initiative)

• Crowdsourcing for generating parallel translation for Statistical Machine Translation

• Annotation - Machine Learning - Correction cycle

• Automatic means of translation error detection and correction

Crowdsourcing Publications

Anoop Kunchukuttan, Rajen Chatterjee, Shourya Roy, Abhijit Mishra and Pushpak Bhattacharyya, TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (demo paper)

http://www.cse.iitb.ac.in/~pb/papers/acl13-demo-crowdsourcing-system.pdf




Yahoo-IITB initiative on MT

Domain specific high quality MT

Statistical

Languages: Hindi, Marathi, English

Headlines

Cricket commentary

Recent internship: Rahul Sharnagat: URL translation

Translate from English to Hindi

Eg:

● www.xyz.com/magic

● www.xyz.com/魔法

Translation and Transliteration

Recent industry collaboration

Collaboration with Crimson

Automatically remove mistakes in text

Used in technical writing to assist in easy writing of technical texts

Collaboration with EzDI

Patient condition monitoring from transcripts

● Patient diagnosed with fever

● Medicine given

● Fever 38 degrees C

● Fever 37 degrees C

– Conclusion: Medicine is effective and patient is recovering.

Cognitive studies through Eye-Tracking at CFILT

Eyes are the window to the soul

(Image Taken from the web)

Motivation

● Approaches for various current NLP tasks can be classified as weak AI systems.

● According to the classical definition, a strong AI based system should perform any language processing task in the same manner and with similar accuracy as a human being.

● Detailed understanding of the human way of language processing is necessary.

● Human language processing is governed by cognitive processes inside the brain

– Correlates with the movement of eyes during analysis and synthesis of text.

● Aim to study gaze behaviour of subjects for different tasks through different experimental setups.

Ongoing work

● Study of word sense annotation through Eye-Tracking.

– Which type of words do humans spend more time on during sense disambiguation? How much does discourse help?

● Translation Process Research.

– What makes a text difficult to translate?

– Measurement of reordering effort.

● Reader sentiment analysis.

– What separates objective texts from sentiment texts?

– No-goal reading VS sentiment-goal reading.

Remote Eye-Trackers used at CFILT

● Tobii TX300 (Left) and SMI Red 500 (Right)

● These devices allow free head movement to some extent.

● Allow for addition of modules to process eye-tracking data.

● Very Expensive.

(Images Taken from the web)

Eye Tracking Publications

Abhijit Mishra and Pushpak Bhattacharyya, Automatically Predicting Sentence Translation Difficulty, ACL 2013, Sofia, Bulgaria, 4-9 August, 2013 (short paper)

Abhijit Mishra, Michael Carl and Pushpak Bhattacharyya, A Heuristic Based Approach for Systematic Error Correction of Gaze Data for Reading, First Workshop on Eye Tracking and NLP, part of COLING 2012, Mumbai, India, 15 Dec, 2012

http://www.cse.iitb.ac.in/~pb/papers/acl13-eye-tracking.pdf

http://www.cse.iitb.ac.in/~pb/papers/coling12-eyetracking-gaze-correction.pdf

Thank you

Resources: http://www.cfilt.iitb.ac.in

Publications: http://www.cse.iitb.ac.in/~pb
