INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Truc-Vien T. Nguyen

Lab: Named Entity Recognition

Download

• Slides: http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf

• Software: http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar

Natural Language Processing (NLP)

• Main purpose of NLP
  – Build systems able to analyze, understand and generate the languages that humans use naturally
• Involved tasks
  – Automatic Summarization
  – Information Extraction
  – Speech Recognition
  – Machine Translation
  – …

Information Extraction (1)

Mapping of texts into a fixed structure representing the key information

[Figure: each news article (News 1, News 2, News 3) is mapped to a filled form (Form 1, Form 2, Form 3) with the fields WHO, WHAT and WHEN.]

Information Extraction (2)

Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.

EVENT: leave job

Person: Sam Brown

Position: executive vice president

Company: Hupplewhite Inc.

EVENT: start job

Person: Harry Jones

Position: executive vice president

Company: Hupplewhite Inc.
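
The two filled templates above can be read as records with a fixed set of slots. A minimal Python sketch of that idea follows; the class name `JobEvent` and its field names are illustrative assumptions, not part of any IE toolkit, and the slots are filled by hand rather than extracted automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobEvent:
    """One filled template: a fixed set of slots taken from free text."""
    event: str      # "leave job" or "start job"
    person: str
    position: str
    company: str


# The two events from the example sentence, filled in by hand.
events = [
    JobEvent("leave job", "Sam Brown", "executive vice president", "Hupplewhite Inc."),
    JobEvent("start job", "Harry Jones", "executive vice president", "Hupplewhite Inc."),
]

for e in events:
    print(f"{e.event}: {e.person}, {e.position}, {e.company}")
```

An IE system's job is to produce such records from the sentence itself; the sketch only shows the target structure.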

Entity and Relation

• Entity
  – An object in the world
  – Ex. President Bush was in Washington today
  – Entity types: Person, Organization, Location, GPE (geo-political entity)

• Relation
  – A relationship between two entities
  – Ex. LocatedIn(“Bush”, “Washington”)
  – Relation types: LocatedIn, Family, Employment

Named Entity Recognition

• Named Entity Recognition
  – Subtask of information extraction
  – Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.

• Example
  – James Clarke (Person), director of ABC company (Organization)

CoNLL2003 shared task (1)

• English and German languages
• 4 types of NEs:
  – LOC: Location
  – MISC: Names of miscellaneous entities
  – ORG: Organization
  – PER: Person

• Training set for developing the system
• Test data for the final evaluation


• Data
  – Columns separated by a single space
  – One word per line
  – An empty line after each sentence
  – Tags in IOB format

• An example
    Milan NNP B-NP I-ORG
    's POS B-NP O
    player NN I-NP O
    George NNP I-NP I-PER
    Weah NNP I-NP I-PER
    meet VBP B-VP O
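
The format above can be read with a few lines of Python. The sketch below is only an illustration: the function names `read_sentences` and `extract_entities` are made up for this example, and it assumes the tagging convention visible in the sample, where an entity starts with I-TYPE and B-TYPE is needed only to separate two adjacent entities of the same type.

```python
def read_sentences(lines):
    """Group CoNLL-style lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split(" "))  # word, POS tag, chunk tag, NE tag
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Return (entity_type, text) pairs read from the last column."""
    entities, words, current_type = [], [], None
    for row in sentence:
        word, tag = row[0], row[-1]
        prefix, _, etype = tag.partition("-")
        # A new entity starts on B-TYPE, or when the type changes.
        starts_new = tag != "O" and (prefix == "B" or etype != current_type)
        if (tag == "O" or starts_new) and words:
            entities.append((current_type, " ".join(words)))
            words, current_type = [], None
        if tag != "O":
            words.append(word)
            current_type = etype
    if words:
        entities.append((current_type, " ".join(words)))
    return entities


sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
""".splitlines()

for sent in read_sentences(sample):
    print(extract_entities(sent))
```

On the sample sentence this yields one ORG entity ("Milan") and one PER entity ("George Weah").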


English      precision   recall    F
[FIJZ03]     88.99%      88.54%    88.76%
[CN03]       88.12%      88.51%    88.31%
[KSNM03]     85.93%      86.21%    86.07%
[ZJ03]       86.13%      84.88%    85.50%
---------------------------------------------------
[Ham03]      69.09%      53.26%    60.15%
baseline     71.91%      50.90%    59.61%
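
The F column can be reproduced from the other two. The sketch below assumes that F is the balanced F-measure, that is, the harmonic mean of precision and recall; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, in percent.
rows = {
    "[FIJZ03]": (88.99, 88.54),
    "[Ham03]": (69.09, 53.26),
    "baseline": (71.91, 50.90),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.2f}%")
```

Under that assumption the computed values match the F column for these rows (88.76%, 60.15% and 59.61%).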

Dataset

• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens

• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens

• Mention Detection: ACE 2005
  – 599 documents

CRF++ (1)

• Can redefine feature sets
• Written in C++ with STL
• Fast training based on L-BFGS for large-scale data
• Less memory usage in both training and testing
• Encoding/decoding in practical time
• Available as open source software

http://crfpp.googlecode.com/svn/trunk/doc/index.html

CRF++ (2)

• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features and train them discriminatively
• Simple, customizable, and open source implementation
• For segmenting/labeling sequential data
• Can define
  – unigram/bigram features
  – relative positions (window size)

Template basic

• An example:
    He      PRP B-NP
    reckons VBZ B-VP
    the     DT  B-NP   << CURRENT TOKEN
    current JJ  I-NP
    account NN  I-NP

    Template          Expanded feature
    %x[0,0]           the
    %x[0,1]           DT
    %x[-1,0]          reckons
    %x[-2,1]          PRP
    %x[0,0]/%x[0,1]   the/DT
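
The expansion in the table can be imitated in a few lines of Python. This is a sketch of the idea, not CRF++ code: in `%x[row,col]`, `row` is an offset from the current token and `col` is a column index. The function name `expand` and the placeholder returned for offsets outside the sentence are assumptions of this example.

```python
import re

# Matches macros such as %x[0,0] or %x[-2,1].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position):
    """Expand every %x[row,col] macro relative to the token at `position`."""
    def substitute(match):
        row = position + int(match.group(1))  # relative row offset
        col = int(match.group(2))             # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_OUT_OF_RANGE_"  # placeholder chosen for this sketch
    return MACRO.sub(substitute, template)


rows = [line.split() for line in [
    "He PRP B-NP",
    "reckons VBZ B-VP",
    "the DT B-NP",
    "current JJ I-NP",
    "account NN I-NP",
]]

# The current token is "the", at index 2.
for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, rows, position=2))
```

With the current token at index 2, the five templates expand to the same strings as in the table.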

A Case Study

• Installing CRF++
• Data for training and test
• Making the baseline
• Training CRF++ on the
  – NER dataset: English CoNLL2003, Italian EVALITA
  – Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise

Installing CRF++

• First, ssh compute-0-x where x=1..10
• Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
• Enter the lab--NER directory
  – Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
  – Enter the CRF++-0.54 directory
  – Run ./configure
  – Run make

Training/Classification (1)

• Notations
  – xxx (training data): train_it.dat / train_en.dat / train_mention.dat
  – nnn (model): it.model / en.model / mention.model
  – yyy (test data): test_it.dat / test_en.dat / test_mention.dat
  – zzz (tagged output): test_it.tagged / test_en.tagged / test_mention.tagged
  – ttt (evaluation output): test_it.eval / test_en.eval / test_mention.eval

• Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data

Training/Classification (2)

• Enter the CRF++-0.54 directory
• Training
    ./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn

• Classification
    ./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz

• Evaluation
    perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt

• See the results
    cat ../corpus/ttt
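
To avoid substituting xxx, nnn, yyy, zzz and ttt by hand for each dataset, the four commands can be generated from the file names listed in the notations. The following Python sketch only builds and prints the command strings and does not run anything. The `FILES` table and the function name `pipeline_commands` are assumptions of this example; the paths and the template name are the ones used in the commands above.

```python
# File names per task, in the order xxx, nnn, yyy, zzz, ttt.
FILES = {
    "it": ("train_it.dat", "it.model", "test_it.dat",
           "test_it.tagged", "test_it.eval"),
    "en": ("train_en.dat", "en.model", "test_en.dat",
           "test_en.tagged", "test_en.eval"),
    "mention": ("train_mention.dat", "mention.model", "test_mention.dat",
                "test_mention.tagged", "test_mention.eval"),
}


def pipeline_commands(task, template="../templates/template_4"):
    """Return the train, classify, evaluate and display commands for a task."""
    xxx, nnn, yyy, zzz, ttt = FILES[task]
    return [
        f"./crf_learn {template} ../corpus/{xxx} ../models/{nnn}",
        f"./crf_test -m ../models/{nnn} ../corpus/{yyy} > ../corpus/{zzz}",
        f"perl ../eval/conlleval.pl ../corpus/{zzz} > ../corpus/{ttt}",
        f"cat ../corpus/{ttt}",
    ]


for command in pipeline_commands("en"):
    print(command)
```

The printed lines are meant to be run from inside the CRF++-0.54 directory, as in the steps above.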

THANKS

• I used material from
  – Text Processing II: Bernardo Magnini
  – Lab Text Processing II: Roberto Zanoli
