INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Truc-Vien T. Nguyen
Lab: Named Entity Recognition
Download
• Slides: http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
• Software: http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar
Natural Language Processing (NLP)
• Main purpose of NLP
– Build systems able to analyze, understand and generate the languages that humans use naturally
• Involved Tasks
– Automatic Summarization
– Information Extraction
– Speech Recognition
– Machine Translation
– …
Information Extraction (1)
Mapping of texts into a fixed structure representing the key information
[Figure: each news article (News 1, News 2, News 3) is mapped to a filled form (Form 1, Form 2, Form 3) with the fields WHO, WHAT and WHEN.]
Information Extraction (2)
Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.
EVENT: leave job
Person: Sam Brown
Position: executive vice president
Company: Hupplewhite Inc.
EVENT: start job
Person: Harry Jones
Position: executive vice president
Company: Hupplewhite Inc.
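The two filled templates above can be held as plain records. This is only a minimal sketch of the "fixed structure" idea: the class name JobEvent and the helper people_by_event are hypothetical names chosen for illustration, not part of any tool used in this lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobEvent:
    """One filled template (hypothetical structure, for illustration only)."""
    event: str      # "leave job" or "start job"
    person: str
    position: str
    company: str


# The two events extracted from the Sam Brown sentence above.
events = [
    JobEvent("leave job", "Sam Brown", "executive vice president", "Hupplewhite Inc."),
    JobEvent("start job", "Harry Jones", "executive vice president", "Hupplewhite Inc."),
]


def people_by_event(records):
    """Map each event type to the person it concerns."""
    return {record.event: record.person for record in records}
```

Once the text is in this form, questions such as "who left which job" become simple lookups over the records.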
Entity and Relation
• Entity
– An object in the world
– Ex. President Bush was in Washington today
– Example: Person, Organization, Location, GPE
• Relation
– A relationship between two entities
– Ex. LocatedIn(“Bush”, “Washington”)
– Example: LocatedIn, Family, Employment
Named Entity Recognition
• Named Entity Recognition
– Subtask of information extraction
– Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
• Example
– James Clarke (Person), director of ABC company (Organization)
CoNLL2003 shared task (1)
• English and German language
• 4 types of NEs:
– LOC Location
– MISC Names of miscellaneous entities
– ORG Organization
– PER Person
• Training Set for developing the system
• Test Data for the final evaluation
CoNLL2003 shared task (2)
• Data
– Columns separated by a single space
– A word for each line
– An empty line after each sentence
– Tags in IOB format
• An example
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
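The format above can be read with a few lines of code. The following is a minimal sketch, assuming space-separated columns with the word first and the NE tag last; the function names read_sentences and extract_entities are hypothetical. An entity is taken to start at a B- tag, or at an I- tag whose type differs from the running entity, which matches the example where I-ORG and I-PER open entities directly.

```python
def read_sentences(text):
    """Split CoNLL-style text into sentences; each token is a list of columns."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                     # an empty line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split(" "))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Collect (type, text) pairs from the last column (IOB tags)."""
    entities, words, etype = [], [], None
    for columns in sentence:
        word, tag = columns[0], columns[-1]
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if not continues:
            if words:                    # close the running entity
                entities.append((etype, " ".join(words)))
            words, etype = [], None
            if prefix in ("B", "I"):     # open a new entity
                etype = label
        if etype is not None:
            words.append(word)
    if words:
        entities.append((etype, " ".join(words)))
    return entities


sample = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""

print(extract_entities(read_sentences(sample)[0]))
# [('ORG', 'Milan'), ('PER', 'George Weah')]
```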
CoNLL2003 shared task (3)
English     precision   recall    F
[FIJZ03]    88.99%      88.54%    88.76%
[CN03]      88.12%      88.51%    88.31%
[KSNM03]    85.93%      86.21%    86.07%
[ZJ03]      86.13%      84.88%    85.50%
---------------------------------------------------
[Ham03]     69.09%      53.26%    60.15%
baseline    71.91%      50.90%    59.61%
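The F column can be reproduced from the other two. The sketch below assumes F is the balanced F-measure, i.e. the harmonic mean of precision and recall; the function name f_score is a hypothetical choice.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table above.
print(round(f_score(88.99, 88.54), 2))   # 88.76  ([FIJZ03])
print(round(f_score(71.91, 50.90), 2))   # 59.61  (baseline)
```

Because the harmonic mean is pulled toward the smaller value, the baseline's low recall keeps its F well below its precision.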
Dataset
• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
– Development set: 223,706 tokens
– Test set: 90,556 tokens
• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
• Mention Detection: ACE 2005
– 599 documents
CRF++ (1)
• Can redefine feature sets
• Written in C++ with STL
• Fast training based on LBFGS for large scale
• Less memory usage both in training and testing
• Encoding/decoding in practical time
• Available as open source software
http://crfpp.googlecode.com/svn/trunk/doc/index.html
CRF++ (2)
• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features and train them discriminatively
• Simple, customizable, and open source implementation
• For segmenting/labeling sequential data
• Can define
– unigram/bigram features
– relative positions (window size)
Template basic
• An example:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP << CURRENT TOKEN
current JJ I-NP
account NN I-NP

Template            Expanded feature
%x[0,0]             the
%x[0,1]             DT
%x[-1,0]            reckons
%x[-2,1]            PRP
%x[0,0]/%x[0,1]     the/DT
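The expansion in the table can be mimicked in a few lines: %x[row,col] selects the token at a row offset relative to the current token and an absolute column. This is a sketch of the idea only, not CRF++'s own code; the function name expand and the "<pad>" value returned for offsets outside the sentence are assumptions made for this illustration.

```python
import re

# Matches macros such as %x[0,0] or %x[-2,1].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position):
    """Replace every %x[row,col] macro with the value it points to."""
    def lookup(match):
        row = position + int(match.group(1))   # relative to the current token
        col = int(match.group(2))              # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<pad>"                         # hypothetical out-of-range marker
    return MACRO.sub(lookup, template)


rows = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],        # current token, position 2
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "->", expand(template, rows, 2))
```

Running the loop reproduces the right-hand column of the table: the, DT, reckons, PRP, the/DT.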
A Case Study
• Installing CRF++
• Data for Training and Test
• Making the baseline
• Training CRF++ on the
– NER dataset: English CoNLL2003, Italian EVALITA
– Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise
Installing CRF++
• First, ssh compute-0-x where x=1..10
• Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
• Enter the lab--NER directory
– Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
– Enter the CRF++-0.54 directory
– Run ./configure
– Run make
Training/Classification (1)
• Notations
– xxx: train_it.dat/train_en.dat/train_mention.dat
– nnn: it.model/en.model/mention.model
– yyy: test_it.dat/test_en.dat/test_mention.dat
– zzz: test_it.tagged/test_en.tagged/test_mention.tagged
– ttt: test_it.eval/test_en.eval/test_mention.eval
• Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data
Training/Classification (2)
• Enter the CRF++-0.54 directory
• Training
./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
• Classification
./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
• Evaluation
perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
• See the results
cat ../corpus/ttt
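The three steps above follow the same pattern for each dataset, so the file names can be filled in programmatically. The sketch below assumes it is run from inside the CRF++-0.54 directory and uses the file names from the notation slide; the function names lab_commands and run_all are hypothetical. Each command is paired with the file its output is redirected to, or None when the slide shows no redirect.

```python
import subprocess


def lab_commands(language):
    """Build the train/tag/evaluate commands; language is "it", "en" or "mention"."""
    train = f"../corpus/train_{language}.dat"
    test = f"../corpus/test_{language}.dat"
    tagged = f"../corpus/test_{language}.tagged"
    report = f"../corpus/test_{language}.eval"
    model = f"../models/{language}.model"
    return [
        (["./crf_learn", "../templates/template_4", train, model], None),
        (["./crf_test", "-m", model, test], tagged),
        (["perl", "../eval/conlleval.pl", tagged], report),
    ]


def run_all(language):
    """Run the three steps in order, stopping at the first failure."""
    for argv, out_path in lab_commands(language):
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, "w") as out:   # same effect as "> file" in the shell
                subprocess.run(argv, check=True, stdout=out)
```

Calling run_all("en") would then train on train_en.dat, tag test_en.dat and write the evaluation to test_en.eval.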
THANKS
• I used material from
– Text Processing II: Bernardo Magnini
– Lab Text Processing II: Roberto Zanoli