Next Generation Information Extraction (NGIE) in Multilingual Environment
(collaborative project with TCS)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 19 May 2014



NGIE Project Overview

The goal of the project is to develop Next Generation Information Extraction Technology

The IE environment will be multilingual

Involves Machine Translation and Cross Lingual Search

The IE focus is on relation extraction, named entity extraction, multi word extraction, semantic role labeling, corpus management

Relation and name extraction are to be jointly done since they are synergistic. (CEO_of is a relation between Person and Organization)

The fruits of this research are to be carried into the TCS IE environment called INX

High-quality publications in IE, in all the above tasks and combinations thereof

Principal Investigators and other members

1. Mr. Girish Palshikar, Principal Scientist, Systems Research Lab, Tata Consultancy Services Limited

2. Dr. Pushpak Bhattacharyya, Professor, Department of Computer Science & Engineering, IIT Bombay

3. Other members: Rohit Bangera, Sachin Pawar, Rudra Murthy, Girish Ponkia, Ravi Soni, Manish Shrivastava, Diptesh Kanojia, Gajanan Rane

NGIE Project accomplishments (1/3)

1. Relation Extraction: A relation extraction system has been built which can extract entities from natural language sentences and identify relationships among them. The following papers have been written:

Sachin Pawar, Pushpak Bhattacharyya and Girish Keshav Palshikar, Semi-supervised Relation Extraction using EM Algorithm, International Conference on NLP (ICON 2013), Noida, India, 18-20 December, 2013

Sachin Pawar, Pushpak Bhattacharyya and Girish Keshav Palshikar, Improving Relation Extraction Using A Joint Model of Entities and Relations, under revision.

Relation Extraction: Joint Model of Entities and Relations

E1, E2: Types of the first and second entity mentions
R: Type of the relation between the two mentions
F: Feature vector capturing characteristics of the entity mentions and how they occur in the sentence

Can be used in:
• Semi-supervised mode: F, E1 and E2 are known, R is unknown; the EM algorithm is used for learning the model parameters.
• Supervised mode: F, E1, E2 and R are known while learning
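The semi-supervised mode can be illustrated with a minimal sketch. It assumes a naive-Bayes-style factorisation in which R generates E1, E2 and each feature independently; the slides do not state the actual model structure, so this factorisation, the function name `em_relation_model`, the toy feature values ("for", "in", "at") and the use of a few seed-labelled instances are all assumptions made for illustration.

```python
from collections import defaultdict

def em_relation_model(instances, relations, n_iter=20, smoothing=0.1):
    """Hypothetical EM sketch: R is latent for instances whose label is None.

    instances: list of (features, label); features is a tuple such as
    (E1 type, E2 type, context feature); label is a relation type or None.
    Assumes P(R, features) = P(R) * prod_j P(feature_j | R).
    """
    n_feats = len(instances[0][0])
    # Distinct values per feature position, needed for smoothing.
    values = [sorted({f[j] for f, _ in instances}) for j in range(n_feats)]

    # Initial posteriors: one-hot if labelled, uniform if R is unknown.
    # With no labelled seeds at all, a uniform start would never break symmetry.
    post = []
    for _, label in instances:
        if label is None:
            post.append({r: 1.0 / len(relations) for r in relations})
        else:
            post.append({r: 1.0 if r == label else 0.0 for r in relations})

    for _ in range(n_iter):
        # M-step: re-estimate P(R) and P(feature_j = v | R) from soft counts.
        mass = {r: sum(p[r] for p in post) for r in relations}
        total = len(instances) + smoothing * len(relations)
        prior = {r: (mass[r] + smoothing) / total for r in relations}
        cond = [defaultdict(float) for _ in range(n_feats)]
        for (feats, _), p in zip(instances, post):
            for j, v in enumerate(feats):
                for r in relations:
                    cond[j][(r, v)] += p[r]
        for j in range(n_feats):
            for r in relations:
                denom = mass[r] + smoothing * len(values[j])
                for v in values[j]:
                    cond[j][(r, v)] = (cond[j][(r, v)] + smoothing) / denom

        # E-step: recompute P(R | features) for the unlabelled instances only.
        for i, (feats, label) in enumerate(instances):
            if label is not None:
                continue
            scores = {}
            for r in relations:
                s = prior[r]
                for j, v in enumerate(feats):
                    s *= cond[j][(r, v)]
                scores[r] = s
            z = sum(scores.values())
            post[i] = {r: scores[r] / z for r in relations}
    return prior, cond, post

# Toy data: (E1 type, E2 type, hypothetical context feature), relation or None.
data = [
    (("PERSON", "PERSON", "for"), "PER-SOC"),
    (("ORGANIZATION", "GPE", "in"), "GPE-ORG"),
    (("PERSON", "PERSON", "for"), None),
    (("ORGANIZATION", "GPE", "in"), None),
    (("ORGANIZATION", "GPE", "at"), None),
]
prior, cond, post = em_relation_model(data, ["PER-SOC", "GPE-ORG"])
best = [max(p, key=p.get) for p in post]
print(best)
```

Supervised mode corresponds to the case where every instance carries a label, so the E-step has nothing to update and the M-step reduces to smoothed counting.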

Relation Extraction: Example

Input sentence: Patricia Newell, an organizer for Nader at the University of Florida in Gainesville, said that Nader had won far fewer votes in Florida than his supporters had expected.

Entity Mentions Extracted:
PERSON: Patricia Newell, organizer, Nader, Nader, his, supporters
ORGANIZATION: University of Florida
GPE (Geo-Political Entity): Gainesville, Florida

Relations Extracted:

Relation    Entity Mention 1         Entity Mention 2
PER-SOC     organizer                Nader
GPE-ORG     University of Florida    Gainesville
PER-SOC     his                      supporters

NGIE Project accomplishments (2/3)

2. Multiword Extraction: Identifying and extracting multiword expressions (MWEs) using deep learning (multilayered neural networks)

Paper submitted to COLING 2014 (Ireland):

Rahul Sharnagat, Rudra Murthy V, Dhirendra Singh, Pushpak Bhattacharyya, Identification of Multiword Named Entities using Co-occurrence Statistics and Distributed Word Representation.

MWE situations

• Machine Translation
o यूक्रेन की सेना ने क्रीमिया के सीमावर्ती इलाके में अपना डेरा डाल दिया है।
o Translation: The Ukrainian army has set up camp in the area bordering Crimea.

• Natural Language Generation
o Good said or Well said?
o Baby changing room (what is changed?)

Challenges of MWE

• ಈ ಕೆಲಸವು ಕಬ್ಬಿಣದ ಕಡಲೆಯೇ ಸರಿ
o Transliteration: I kelasavu kabbiNada kaDaleyE sari
o Gloss: this job iron nut correct
o Translation: This job is a hard nut to crack
o Google: This work is strong meat

• ಯಾರ ಹತ್ತಿರವೂ ಕೈ ಚಾಚಬೇಡ
o Transliteration: yAra hattiravU kai chAchabEDa
o Gloss: which near hand no extend
o Translation: Do not ask help from anybody
o Google: Whose ever hand cacabeda

MWE Extraction Taxonomy

MWE Extraction Techniques
• Rule Based
• Empirical
o Statistical Measures Based
o Similarity Based: Thesaurus Based, Distributional Word Representation
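As an illustration of the statistical-measures branch, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). PMI is one commonly used association measure and is chosen here as an assumption; the slides do not say which measures the project used. The function name `pmi_bigrams`, the `min_count` threshold and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI = log2(P(x, y) / (P(x) * P(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        # Rare pairs give unreliable PMI estimates, so skip them.
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring pairs are the strongest MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpus (hypothetical).
corpus = "new york is big and new york is old and the city is big".split()
ranked = pmi_bigrams(corpus)
for pair, score in ranked:
    print(pair, round(score, 2))
```

Pairs whose words occur together more often than their individual frequencies predict rise to the top; a real system would typically combine such a score with frequency cut-offs and part-of-speech filters.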

MultiWord Extraction Process

• Artificial Neural Networks (ANNs) have been successfully applied to various Natural Language Processing tasks
• ANNs are able to capture the semantics of words
• Use ANNs to extract MWEs from text: Deep Learning
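One way learned word vectors can help, sketched here under stated assumptions, is a compositionality check: if the vector of a whole phrase lies far from the sum of its component word vectors, the phrase is likely to be a non-compositional MWE. This is a generic technique rather than the method of the project's paper, and the three-dimensional vectors, the `embeddings` table and the function names are invented for illustration; real vectors would come from a trained network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def compositionality(phrase_vec, word_vecs):
    """Similarity between a phrase vector and the sum of its word vectors."""
    summed = [sum(dims) for dims in zip(*word_vecs)]
    return cosine(phrase_vec, summed)

# Hypothetical toy embeddings; a phrase is treated as a single token.
embeddings = {
    "baby": [1.0, 0.0, 0.0],
    "room": [0.0, 1.0, 0.0],
    "baby_room": [0.7, 0.7, 0.1],   # close to baby + room: compositional
    "hard": [1.0, 0.0, 0.2],
    "nut": [0.0, 1.0, 0.2],
    "hard_nut": [0.1, 0.1, 1.0],    # far from hard + nut: idiomatic
}

score_literal = compositionality(
    embeddings["baby_room"], [embeddings["baby"], embeddings["room"]])
score_idiom = compositionality(
    embeddings["hard_nut"], [embeddings["hard"], embeddings["nut"]])
print(round(score_literal, 3), round(score_idiom, 3))
```

A low score flags a candidate such as "hard nut" for treatment as a unit, which is the situation the translation examples above run into.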

Sample Word Representation (for Hindi)

MWE Extraction Engine Screenshots


NGIE: additional outcomes

Multi Lingual POS Projection: HMM Results with Hindi as Helper

• HMM trained on Hindi
• Tested on Hindi words aligned with source language words

Language    Hindi as Helper
Marathi     55.18
Bengali     41.11
Gujarati    42.23
Punjabi     45.54
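The projection step can be sketched as follows, assuming that tags predicted for Hindi words are copied to the words they are aligned with in the other language and then scored against gold tags. The slides do not spell out the exact procedure or the metric behind the figures above, so the function names, the alignment format, the `UNK` default and the toy tags are assumptions.

```python
def project_tags(helper_tags, alignment, target_len, default="UNK"):
    """Copy tags from helper-language words to aligned target-language words.

    alignment: list of (helper_index, target_index) pairs.
    Target words with no aligned helper word keep the default tag.
    """
    projected = [default] * target_len
    for helper_idx, target_idx in alignment:
        projected[target_idx] = helper_tags[helper_idx]
    return projected

def accuracy(predicted, gold):
    """Percentage of positions where the projected tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

# Hypothetical example: three tagged helper words, four target words.
helper_tags = ["NOUN", "ADJ", "VERB"]
alignment = [(0, 0), (2, 1), (1, 3)]   # target word 2 is unaligned
projected = project_tags(helper_tags, alignment, target_len=4)
gold = ["NOUN", "VERB", "ADP", "ADJ"]
score = accuracy(projected, gold)
print(projected, score)
```

Unaligned target words are one reason a projected tagger can score well below a tagger trained directly on the target language.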

NGIE outcome: Parallel Corpora Management Workbench tool: Screenshot

Summary

• Project goal: Advanced IE in a multilingual setting
• Involves Machine Translation and Search too
• Sophisticated machine learning techniques like Markov Logic Networks, Deep Learning, etc. to be used for NLP
• The incumbent will get into the depths of ML and NLP, with active support for existing project work
• Expectation: day-to-day project work, attending research evaluation meetings around the country, publishing, and creating downloadable resources and tools

Thank you

Lab URL: http://www.cfilt.iitb.ac.in
My URL: http://www.cse.iitb.ac.in/~pb