IBM Research © Copyright IBM Corporation 2005

A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture

Youssef Drissi, Branimir Boguraev, Mary Neff, David Ferrucci, Paul Keyser and Anthony Levas

IBM T.J. Watson Research Center

{youssefd,bran,ferrucci,pkeyser,levas}@us.ibm.com

Post on 05-Jan-2016


Page 2:

Outline

Background:

- Text Analytics

- Unstructured Information Management Architecture (UIMA)

The Challenges

- The Consumability Challenges

Our Approach to meet these challenges

- The Concept-Centric Approach

- Our Text Analytics Development Cycle

A Scenario (Demo)

- Detecting sentiments about cars from a corpus of car reviews

Page 3:

Text Analytics

[Figure: the sentence "Fred Center is the CEO of Center Micros" annotated in layers]

- Parser: NP, VP, PP

- Named Entity: Person, Organization

- Relationship: CeoOf (Arg1: Person, Arg2: Org)

UIMA: Unstructured Information Management Architecture
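The layered analysis in the figure can be pictured as stand-off annotations: typed character spans over the unchanged text, plus a relation that points at two of those spans. The following is a minimal sketch only. It assumes the example sentence reads "Fred Center is the CEO of Center Micros", and the class names `SpanAnnotation`, `CeoOfRelation` and `TextAnalyticsSketch` are hypothetical; they are not UIMA types.

```java
// Hypothetical stand-off annotation: a typed span of character offsets.
class SpanAnnotation {
    final String type;
    final int begin; // inclusive
    final int end;   // exclusive

    SpanAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    // The annotation stores offsets only; the text is read from the document.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

// Hypothetical relation annotation linking two entity annotations.
class CeoOfRelation {
    final SpanAnnotation person;       // Arg1: Person
    final SpanAnnotation organization; // Arg2: Org

    CeoOfRelation(SpanAnnotation person, SpanAnnotation organization) {
        this.person = person;
        this.organization = organization;
    }

    String describe(String document) {
        return "CeoOf(" + person.coveredText(document) + ", "
                + organization.coveredText(document) + ")";
    }
}

class TextAnalyticsSketch {
    // Assumed reading of the sentence shown on the slide.
    static final String TEXT = "Fred Center is the CEO of Center Micros";

    static CeoOfRelation analyze() {
        // Offsets are written by hand here; a real named-entity annotator
        // would compute them.
        SpanAnnotation person = new SpanAnnotation("Person", 0, 11);
        SpanAnnotation organization = new SpanAnnotation("Organization", 26, 39);
        return new CeoOfRelation(person, organization);
    }

    public static void main(String[] args) {
        // prints "CeoOf(Fred Center, Center Micros)"
        System.out.println(analyze().describe(TEXT));
    }
}
```

Keeping offsets rather than copied strings is what lets several layers (parse, entities, relations) refer to the same stretch of text.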

Page 4:

UIMA: A runtime framework for Text Analytics

UIMA: Unstructured Information Management Architecture

[Figure: a UIMA pipeline]

- Data passes through a sequence of annotators: Tokenizer, POS Tagger, PERSON Finder, COMPANY Finder, CEO Relationship

- The analysis results are concepts: PERSON, COMPANY, CEO Relationship

- Concepts are represented by models: lists of terms, dictionaries, regular expressions, pattern files, statistical models, etc.
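A pipeline of this kind can be sketched in a few lines: each annotator reads the shared document, may consult annotations left by earlier stages, and adds its own. The sketch below is an illustration under stated assumptions, not the UIMA API. `Ann`, `AnalysisDoc`, `PipelineAnnotator` and the three annotator classes are hypothetical names, the POS Tagger stage is omitted for brevity, and the term lists stand in for real models.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical annotation and document types (not the UIMA CAS).
class Ann {
    final String type;
    final int begin;
    final int end;
    Ann(String type, int begin, int end) { this.type = type; this.begin = begin; this.end = end; }
}

class AnalysisDoc {
    final String text;
    final List<Ann> annotations = new ArrayList<>();
    AnalysisDoc(String text) { this.text = text; }

    // Returns a fresh list, so callers may add annotations while iterating it.
    List<Ann> ofType(String type) {
        List<Ann> out = new ArrayList<>();
        for (Ann a : annotations) if (a.type.equals(type)) out.add(a);
        return out;
    }
}

interface PipelineAnnotator { void process(AnalysisDoc doc); }

// Model: a regular expression.
class RegexAnnotator implements PipelineAnnotator {
    private final String type;
    private final Pattern pattern;
    RegexAnnotator(String type, String regex) { this.type = type; this.pattern = Pattern.compile(regex); }
    public void process(AnalysisDoc doc) {
        Matcher m = pattern.matcher(doc.text);
        while (m.find()) doc.annotations.add(new Ann(type, m.start(), m.end()));
    }
}

// Model: a list of (non-empty) terms.
class TermListAnnotator implements PipelineAnnotator {
    private final String type;
    private final List<String> terms;
    TermListAnnotator(String type, List<String> terms) { this.type = type; this.terms = terms; }
    public void process(AnalysisDoc doc) {
        for (String term : terms) {
            int from = doc.text.indexOf(term);
            while (from >= 0) {
                doc.annotations.add(new Ann(type, from, from + term.length()));
                from = doc.text.indexOf(term, from + term.length());
            }
        }
    }
}

// Relation annotator: depends on PERSON and COMPANY found by earlier stages.
class CeoRelationAnnotator implements PipelineAnnotator {
    public void process(AnalysisDoc doc) {
        for (Ann p : doc.ofType("PERSON")) {
            for (Ann c : doc.ofType("COMPANY")) {
                if (p.end <= c.begin && doc.text.substring(p.end, c.begin).contains("CEO of")) {
                    doc.annotations.add(new Ann("CEO Relationship", p.begin, c.end));
                }
            }
        }
    }
}

class PipelineSketch {
    static AnalysisDoc run(String text) {
        List<PipelineAnnotator> pipeline = Arrays.asList(
                new RegexAnnotator("Token", "\\S+"),
                new TermListAnnotator("PERSON", Arrays.asList("Fred Center")),
                new TermListAnnotator("COMPANY", Arrays.asList("Center Micros")),
                new CeoRelationAnnotator());
        AnalysisDoc doc = new AnalysisDoc(text);
        for (PipelineAnnotator annotator : pipeline) annotator.process(doc);
        return doc;
    }
}
```

The order matters: moving `CeoRelationAnnotator` ahead of the two finders would leave it with nothing to combine.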

Page 5:

Sample Annotator: Java Code

/**
 * This annotator searches for person titles using simple string matching.
 *
 * @param aTCAS TCAS containing document text and previously discovered
 *              annotations, and to which new annotations are to be written.
 * @param aResultSpec A list of output types and features that this annotator
 *                    should produce.
 *
 * @see com.ibm.uima.analysis_engine.annotator.TextAnnotator#process(TCAS, ResultSpecification)
 */
public void process(TCAS aTCAS, ResultSpecification aResultSpec)
    throws AnnotatorProcessException
{
    try
    {
        // If the ResultSpec doesn't include the PersonTitle type, we have
        // nothing to do.
        if (!aResultSpec.containsType("example.PersonTitle"))
        {
            return;
        }

        if (mContainingType == null)
        {
            // Search the whole document for PersonTitle annotations
            String text = aTCAS.getDocumentText();
            annotateRange(aTCAS, text, 0, aResultSpec);
        }
        else
        {
            // Search only within annotations of type mContainingType

            // Get an iterator over the annotations of type mContainingType.
            FSIterator it = aTCAS.getAnnotationIndex(mContainingType).iterator();
            // Loop over the iterator.
            while (it.isValid())
            {
                // Get the next annotation from the iterator
                AnnotationFS annot = (AnnotationFS) it.get();
                // Get text covered by this annotation
                String coveredText = annot.getCoveredText();
                // Get begin position of this annotation
                int annotBegin = annot.getBegin();
                // Search for matches within this annotation
                annotateRange(aTCAS, coveredText, annotBegin, aResultSpec);
                // Advance the iterator.
                it.moveToNext();
            }
        }
    }
    catch (Exception e)
    {
        throw new AnnotatorProcessException(e);
    }
}
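The slide does not show `annotateRange`, so the following is only a guess at the idea behind its offset argument, written in plain Java without any UIMA types. When the search runs over the text covered by one annotation, match positions are relative to that text, and the annotation's begin offset has to be added back to obtain document positions. The class `TitleMatcherSketch`, the method `findTitles` and the title list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TitleMatcherSketch {
    // Hypothetical title list; the original annotator's list is not shown.
    static final List<String> TITLES = Arrays.asList("Mr.", "Ms.", "Dr.", "Prof.");

    // Finds every title in 'text' and returns {begin, end} pairs expressed as
    // document offsets. 'offset' is the document position where 'text' starts
    // (0 when 'text' is the whole document).
    static List<int[]> findTitles(String text, int offset) {
        List<int[]> spans = new ArrayList<>();
        for (String title : TITLES) {
            int pos = text.indexOf(title);
            while (pos >= 0) {
                // Shift the local match position into document coordinates.
                spans.add(new int[] {offset + pos, offset + pos + title.length()});
                pos = text.indexOf(title, pos + title.length());
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String document = "Intro. Dr. Smith met Prof. Jones.";
        // Pretend a containing annotation covers the second sentence, from offset 7.
        String covered = document.substring(7);
        for (int[] span : findTitles(covered, 7)) {
            System.out.println(span[0] + "-" + span[1] + ": " + document.substring(span[0], span[1]));
        }
    }
}
```

In the real annotator, each span would be written back to the TCAS as a `PersonTitle` annotation rather than returned in a list.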

Page 6:

Sample Annotator: AFST Grammar Syntax

# Shallow parser cascade: level 8

honour % SUB[] , PSUB[] , Phrase[] ;
boundary % Sentence[] ;

#_____
#
auxtensed = Token[_unilex=~"VB+AUX:P"] |
            Token[_unilex=~"VB+AUX:Z"] |
            Token[_unilex=~"VB+AUX:D"] ;

vrbtensed = Token[_unilex=~"VB-AUX:P"] |
            Token[_unilex=~"VB-AUX:Z"] |
            Token[_unilex=~"VB-AUX:D"] ;

vrbuntensed = Token[_unilex=~"VB-AUX:I"] ;

vrbgrpmodal = ( VG[@descend] .
                Token[_unilex=~"MD"] .
                Token[_unilex=~"RB"]* .
                ( ( Token[_unilex=~"VB-AUX:I"] ) |
                  ( Token[_unilex=~"VB+AUX:I"] . Token[_unilex=~"VB-AUX:G"] ) ) .
                Token[_unilex=~"RB"]* .
                <U> ) |
              ( PVG[@descend] .
                Token[_unilex=~"MD"] .
                Token[_unilex=~"RB"]* .
                Token[_unilex=~"VB+AUX:I"] .
                Token[_unilex=~"RB"]* .
                Token[_unilex=~"VB-AUX:N"] .
                Token[_unilex=~"RB"]* .
                <U> ) ;

vrbgrpinfform = VG[@descend] .
                Token[_orth=~*SWORD]* .
                Token[_unilex=~"VB:I"] .
                <U> ;

#_____
simplenp  = NP[] ;     # simple noun phrase
possnp    = PNP[] ;    # possessive noun phrase
npp       = NPP[] ;    # noun phrase with a trailing PP
nplist    = NPList[] ; # a list of NP's
complexnp = CNP[] ;    # complex (appositive) NP

npphrase = :simplenp |
           :possnp |
           :npp |
           :nplist |
           :complexnp ; # an entity behaving like an NP

#______
export
scannerEight = ( :vrbgrptensed | :vrbgrpinfform ) .
               Token[_unilex=~"RP"]|<E> .
               <E>/[OBJ . :npphrase . <E>/]OBJ ;

Page 7:

Sample Annotator: Semantic Dictionary Authority File

<?xml version="1.0" encoding="UTF-8"?>
<authority name="BlueJAuthority">

  <FirstName>
    <class name="First" superclass="NameComponent">
      <instance base="Ronald" variant="Ronney" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronni" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronnie" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronny" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Rony" confidence="1.0" syncat="np" />
    </class>
  </FirstName>

  <MovieTitle>
    <class name="MovieTitle" superclass="Art">
      <instance base="12 Angry Men" variant="" confidence="1.0" syncat="np" />
      <instance base="2001 : A Space Odyssey" variant="" confidence="1.0" syncat="np" />
      <instance base="25th Hour" variant="" confidence="1.0" syncat="np" />
      <instance base="42nd Street" variant="" confidence="1.0" syncat="np" />
      <instance base="A Beautiful Mind" variant="" confidence="1.0" syncat="np" />
      <instance base="A Clockwork Orange" variant="" confidence="1.0" syncat="np" />
      <instance base="A Farewell to Arms" variant="" confidence="1.0" syncat="np" />
      <instance base="A Few Good Men" variant="" confidence="1.0" syncat="np" />
      <instance base="A League of Their Own" variant="" confidence="1.0" syncat="np" />
      <instance base="A Letter to Three Wives" variant="" confidence="1.0" syncat="np" />
      <instance base="A Life Less Ordinary" variant="" confidence="1.0" syncat="np" />
      <instance base="A Man for All Seasons" variant="" confidence="1.0" syncat="np" />
      <instance base="A Midsummer Night 's Dream" variant="" confidence="1.0" syncat="np" />
      <instance base="A New Hope" variant="" confidence="1.0" syncat="np" />
      <instance base="A Night At The Opera" variant="" confidence="1.0" syncat="np" />
    </class>
  </MovieTitle>
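To illustrate how a dictionary annotator might consume such a file, the sketch below loads an abbreviated copy with the standard Java DOM parser and builds a lookup from each surface form to its base form and class name. Several things here are assumptions rather than facts from the slide: that `variant` is the surface form to match and `base` its canonical form, that an empty `variant` means the base form is itself the surface form, and that the file ends with a closing `</authority>` tag (the excerpt on the slide is cut off before it). The class name `AuthorityFileSketch` is hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AuthorityFileSketch {
    // Abbreviated copy of the slide's file. The closing </authority> tag is
    // an assumption added so that the excerpt parses.
    static final String SAMPLE =
            "<authority name=\"BlueJAuthority\">"
          + "<FirstName><class name=\"First\" superclass=\"NameComponent\">"
          + "<instance base=\"Ronald\" variant=\"Ronnie\" confidence=\"1.0\" syncat=\"np\" />"
          + "<instance base=\"Ronald\" variant=\"Ronny\" confidence=\"1.0\" syncat=\"np\" />"
          + "</class></FirstName>"
          + "<MovieTitle><class name=\"MovieTitle\" superclass=\"Art\">"
          + "<instance base=\"A New Hope\" variant=\"\" confidence=\"1.0\" syncat=\"np\" />"
          + "</class></MovieTitle>"
          + "</authority>";

    // Maps each surface form to {base form, class name}.
    static Map<String, String[]> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList instances = doc.getElementsByTagName("instance");
            Map<String, String[]> lookup = new LinkedHashMap<>();
            for (int i = 0; i < instances.getLength(); i++) {
                Element instance = (Element) instances.item(i);
                // Each <instance> sits directly inside its <class> element.
                Element cls = (Element) instance.getParentNode();
                String base = instance.getAttribute("base");
                String variant = instance.getAttribute("variant");
                // Assumed convention: an empty variant means the base form
                // is the surface form.
                String surface = variant.isEmpty() ? base : variant;
                lookup.put(surface, new String[] {base, cls.getAttribute("name")});
            }
            return lookup;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse authority file", e);
        }
    }

    public static void main(String[] args) {
        String[] hit = load(SAMPLE).get("Ronnie");
        System.out.println(hit[0] + " (" + hit[1] + ")"); // prints "Ronald (First)"
    }
}
```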

Page 8:

The Consumability Challenge

Building Analytics is a complex process

- Requires highly trained individuals:
  • NLP Experts
  • UIMA Experts
  • Advanced Java programmers with XML skills

- Is very time consuming:
  • Need time for learning the UIMA framework
  • Need time for building the annotators

Page 9:

Key Features

End to End Text Analytics Development Tool

- Supports the full Cycle of Text Analytics Development Activities

Ease Of Use

- Insulates the user from the complexity of the underlying frameworks

Concept-Centric

- Lets the user think in terms of concepts as opposed to annotators and software components

Extensibility

- Supports plugging in new model types, model editors, results viewers, and exploration tools

Page 10:

Text Analytics Development Cycle

[Figure: development cycle diagram, beginning at "Start"]

- Activities: Corpus & Domain Exploration; Identify Domain-Relevant Concepts; Type System Development; Develop Concept Models; Configure & Assemble Application Analysis Engine; Run Analytics; Evaluate Discovery Results

- Artifacts: Ontology (Type System); Concept Models; Concept Finder; Structured Information; Evaluation Results

Page 11:

Scenario: Detecting Sentiments about Cars and Car Features

Page 12:

Demo

Page 13:

Conclusion

This work addresses the text analytics consumability challenges with a platform that provides:

- Support for the full Cycle of Text Analytics Development Activities

- Ease Of Use

- Support for a Concept-Centric development process

- Extensibility

Page 14:

Thank You

Merci

Shoukran

Page 15:

Overview

Concepts

- Concepts to find in Text

Documents

- Corpora that can be used in analysis

Concept Finders

- Analysis Engines built from concept models

Results

- Results from running Concept Finder on Corpora

Page 17:

Domain Exploration

GlossEx: Domain Exploration Tool

Page 18:

Ontologies, Concepts and Models

Ontology

- A group of concepts in a domain

Concept

- A Concept in the domain

Model

- Analytic for finding a specific Concept

Page 19:

Building Models For Concepts

Build CarAspectModel using Semantic Dictionary CAT

1. Enter a representative Term

2. Select synonyms (e.g. from WordNet)

3. Store Terms in a dictionary

Page 20:

Building Models For Concepts

Build CarAspectModel using Semantic Dictionary CAT

1. Add representative Terms

2. Select synonyms (e.g. from WordNet)

3. Store Terms in a dictionary
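The three steps above can be sketched as a small dictionary builder. This is an illustration only: the `SYNONYMS` map is a hard-coded stand-in for a synonym source such as WordNet, its entries are invented for the example, and the sketch accepts every offered synonym whereas the tool lets the user pick. The class name `CarAspectDictionarySketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CarAspectDictionarySketch {
    // Stand-in for a synonym source such as WordNet; entries are illustrative.
    static final Map<String, List<String>> SYNONYMS = new LinkedHashMap<>();
    static {
        SYNONYMS.put("engine", Arrays.asList("motor", "powertrain"));
        SYNONYMS.put("brakes", Arrays.asList("braking system"));
    }

    // Builds a dictionary that maps every term (representative or synonym)
    // to its representative term.
    static Map<String, String> build(String... representativeTerms) {
        Map<String, String> dictionary = new LinkedHashMap<>();
        for (String term : representativeTerms) {
            // Step 1: the representative term entered by the user.
            dictionary.put(term, term);
            // Step 2: synonyms; a user would normally select a subset here.
            List<String> synonyms = SYNONYMS.getOrDefault(term, Collections.<String>emptyList());
            for (String synonym : synonyms) {
                // Step 3: store each term in the dictionary.
                dictionary.put(synonym, term);
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // prints "{engine=engine, motor=engine, powertrain=engine, brakes=brakes, braking system=brakes}"
        System.out.println(build("engine", "brakes"));
    }
}
```

A dictionary in this shape is what a dictionary-based annotator needs at run time: any stored term found in the text is reported as the CarAspect concept, normalized to its representative term.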

Page 21:

Building Models

Build CarSentimentModel using AFST CAT

1. Drag and Drop ConceptModels onto WorkArea

2. Interconnect to define pattern sequence
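Interconnecting concept models amounts to declaring a sequence of concept types that must occur one after another. The sketch below shows that idea in its simplest form, contiguous matching over a list of per-token concept labels. It is a deliberate simplification: an AFST grammar is far more expressive (optional elements, repetition, alternatives), and the labels `CarAspect`, `Copula`, `Sentiment` and `Other`, as well as the class `ConceptSequenceSketch`, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConceptSequenceSketch {
    // Returns the start indices at which 'pattern' occurs contiguously
    // in 'conceptTypes'.
    static List<Integer> findSequence(List<String> conceptTypes, List<String> pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // an empty pattern matches nothing in this sketch
        }
        for (int i = 0; i + pattern.size() <= conceptTypes.size(); i++) {
            if (conceptTypes.subList(i, i + pattern.size()).equals(pattern)) {
                starts.add(i);
            }
        }
        return starts;
    }

    static List<Integer> demo() {
        // Hand-written labels for the tokens of:
        // "The engine is powerful but the seats are uncomfortable"
        List<String> types = Arrays.asList(
                "Other", "CarAspect", "Copula", "Sentiment", "Other",
                "Other", "CarAspect", "Copula", "Sentiment");
        // Pattern sequence a user might wire together for a car sentiment.
        List<String> pattern = Arrays.asList("CarAspect", "Copula", "Sentiment");
        return findSequence(types, pattern);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "[1, 6]"
    }
}
```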

Page 22:

Building ConceptFinders

Build a ConceptFinder for CarSentiments

1. Select All Relevant Concepts

2. The System generates a ConceptFinder for the selected concepts

Page 23:

Running Analytics to get Results

Run ConceptFinder on a Corpus

1. Select ConceptFinder

2. Select Corpus

3. Run the analysis
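As a rough picture of what running a ConceptFinder over a corpus produces, the sketch below applies a set of term-list models to every document and tallies how often each concept was found. Everything in it is an assumption made for illustration: a real ConceptFinder is a generated analysis engine, whereas here it is reduced to a map from concept name to lower-case terms, and the two documents are invented. The class name `ConceptFinderRunSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class ConceptFinderRunSketch {
    // Runs every concept model over every document and counts the matches
    // per concept across the whole corpus.
    static Map<String, Integer> run(Map<String, List<String>> conceptFinder, List<String> corpus) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String concept : conceptFinder.keySet()) {
            counts.put(concept, 0);
        }
        for (String document : corpus) {
            String lower = document.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, List<String>> model : conceptFinder.entrySet()) {
                for (String term : model.getValue()) {
                    if (term.isEmpty()) {
                        continue; // an empty term would match everywhere
                    }
                    int pos = lower.indexOf(term);
                    while (pos >= 0) {
                        counts.merge(model.getKey(), 1, Integer::sum);
                        pos = lower.indexOf(term, pos + term.length());
                    }
                }
            }
        }
        return counts;
    }

    static Map<String, Integer> demo() {
        // Step 1: the selected ConceptFinder (illustrative term lists).
        Map<String, List<String>> finder = new LinkedHashMap<>();
        finder.put("CarAspect", Arrays.asList("engine", "seats"));
        finder.put("Sentiment", Arrays.asList("powerful", "uncomfortable"));
        // Step 2: the selected corpus (invented review snippets).
        List<String> corpus = Arrays.asList(
                "The engine is powerful.",
                "Seats are uncomfortable, engine is loud.");
        // Step 3: run the analysis.
        return run(finder, corpus);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "{CarAspect=3, Sentiment=2}"
    }
}
```

Per-concept totals like these are the kind of figure that can be compared between two runs after a model has been refined.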

Page 24:

Results Evaluation

Annotations Viewer

Page 25:

Iterative Refinement Tools

Concordance Viewer

Page 26:

Results Evaluation

Collection Level Statistics: Comparing Results

Page 27:

Plugin Components: CATs & KoGs

[Figure: plugin architecture. Labels in the diagram:]

- CATs Plugin Framework: CAT, Semantic Dictionary UI, Dictionary Configurable Annotator, Configurable Annotator

- KoGs Plugin Framework: KoG, Concordance Explorer UI, Concordance Indexer