INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Truc-Vien T. Nguyen
Lab: Named Entity Recognition
Download
• Slides: http://sites.google.com/site/trucviennguyen/Lab NER -- Vien.pdf
• Software: http://sites.google.com/site/trucviennguyen/Teaching/AI/SSHSecureShellClient-3.2.9.rar
Natural Language Processing (NLP)
• Main purpose of NLP
– Build systems able to analyze, understand, and generate the languages that humans use naturally
• Tasks involved
– Automatic Summarization
– Information Extraction
– Speech Recognition
– Machine Translation
– …
Information Extraction (1)
Mapping of texts into a fixed structure representing the key information
[Figure: news articles (News 1, News 2, News 3) are each mapped to a form (Form 1, Form 2, Form 3) with the fields WHO, WHAT, WHEN]
Information Extraction (2)
Sam Brown retired as executive vice president of the famous hot dog manufacturer, Hupplewhite Inc. He will be succeeded by Harry Jones.
EVENT: leave job
Person: Sam Brown
Position: executive vice president
Company: Hupplewhite Inc.
EVENT: start job
Person: Harry Jones
Position: executive vice president
Company: Hupplewhite Inc.
Entity and Relation
• Entity
– An object in the world
– Ex. President Bush was in Washington today
– Example types: Person, Organization, Location, GPE (geo-political entity)
• Relation
– A relationship between two entities
– Ex. LocatedIn(“Bush”, “Washington”)
– Example types: LocatedIn, Family, Employment
Named Entity Recognition
• Named Entity Recognition
– Subtask of information extraction
– Locate and classify elements in text into predefined categories: names of persons, organizations, locations, expressions of time, etc.
• Example
– James Clarke (Person), director of ABC company (Organization)
CoNLL2003 shared task (1)
• English and German
• 4 types of NEs:
– LOC: Location
– MISC: Names of miscellaneous entities
– ORG: Organization
– PER: Person
• Training set for developing the system
• Test data for the final evaluation
CoNLL2003 shared task (2)
• Data
– Columns separated by a single space
– One word per line
– An empty line after each sentence
– Tags in IOB format
• An example
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
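To make the format concrete, here is a minimal Python sketch that reads such a file and groups IOB-tagged tokens into entities. It assumes the last column holds the NE tag and that, as in the example on this slide, an entity may start with an I- tag, with B- reserved for an entity that directly follows another of the same type. The function names `read_sentences` and `extract_entities` are illustrative and not part of the lab material.

```python
SAMPLE = """\
Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""

def read_sentences(text):
    """Split CoNLL-style text into sentences (lists of column tuples)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split()))
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence):
    """Return (text, type) pairs from the NE tags in the last column."""
    entities = []
    words, etype = [], None
    for token in sentence:
        word, tag = token[0], token[-1]
        label = None if tag == "O" else tag.split("-", 1)[1]
        prefix = tag.split("-", 1)[0]
        # A new entity starts on a B- tag or when the type changes.
        starts_new = label is not None and (prefix == "B" or label != etype)
        if (label is None or starts_new) and words:
            entities.append((" ".join(words), etype))
            words, etype = [], None
        if label is not None:
            words.append(word)
            etype = label
    if words:
        entities.append((" ".join(words), etype))
    return entities

if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(extract_entities(sentence))
```

On the six-token sample, the sketch groups "George" and "Weah" into a single PER entity and reports "Milan" as an ORG entity.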
CoNLL2003 shared task (3)
English      precision   recall    F
[FIJZ03]     88.99%      88.54%    88.76%
[CN03]       88.12%      88.51%    88.31%
[KSNM03]     85.93%      86.21%    86.07%
[ZJ03]       86.13%      84.88%    85.50%
---------------------------------------------------
[Ham03]      69.09%      53.26%    60.15%
baseline     71.91%      50.90%    59.61%
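The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall. A small Python sketch (the function name `f_measure` is illustrative) that recomputes it from the first two columns:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Precision and recall, in percent, taken from the table above.
    rows = {"[FIJZ03]": (88.99, 88.54), "baseline": (71.91, 50.90)}
    for name, (p, r) in rows.items():
        print(name, round(f_measure(p, r), 2))
```

Because the harmonic mean is pulled toward the lower of the two values, the baseline's low recall keeps its F below 60% even though its precision is above 70%.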
Dataset
• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
– Development set: 223,706 tokens
– Test set: 90,556 tokens
• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
– Training set: 203,621 tokens
– Development set: 51,362 tokens
– Test set: 46,435 tokens
• Mention Detection: ACE 2005
– 599 documents
CRF++ (1)
• Can redefine feature sets
• Written in C++ with STL
• Fast training based on L-BFGS (for large-scale optimization)
• Less memory usage in both training and testing
• Encoding/decoding in practical time
• Available as open source software
http://crfpp.googlecode.com/svn/trunk/doc/index.html
CRF++ (2)
• Uses Conditional Random Fields (CRFs)
• CRF methodology: use statistically correlated features and train them discriminatively
• Simple, customizable, open source implementation
• For segmenting/labeling sequential data
• Can define
– unigram/bigram features
– relative positions (window size)
Template basic
• An example:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP << CURRENT TOKEN
current JJ I-NP
account NN I-NP

Template            Expanded feature
%x[0,0]             the
%x[0,1]             DT
%x[-1,0]            reckons
%x[-2,1]            PRP
%x[0,0]/%x[0,1]     the/DT
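The %x[row,col] macro selects a column of a token at a position relative to the current one. The following Python sketch only mimics that expansion for illustration and is not CRF++ code. The function `expand` and the `_OUT_OF_RANGE_` placeholder are hypothetical, and CRF++ uses its own padding strings at sentence boundaries.

```python
import re

# The five tokens from the slide: (word, POS tag, chunk tag).
ROWS = [
    ("He", "PRP", "B-NP"),
    ("reckons", "VBZ", "B-VP"),
    ("the", "DT", "B-NP"),
    ("current", "JJ", "I-NP"),
    ("account", "NN", "I-NP"),
]

# Matches %x[row,col], where row is relative to the current token.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, position):
    """Replace every %x[row,col] macro with the referenced cell."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_OUT_OF_RANGE_"  # hypothetical placeholder, not CRF++'s
    return MACRO.sub(lookup, template)

if __name__ == "__main__":
    current = 2  # index of "the", the CURRENT TOKEN on the slide
    for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                     "%x[0,0]/%x[0,1]"]:
        print(template, "->", expand(template, ROWS, current))
```

With the current token set to "the", the five templates reproduce the right-hand column of the table.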
A Case Study
• Installing CRF++
• Data for training and test
• Making the baseline
• Training CRF++ on the
– NER datasets: English CoNLL2003, Italian EVALITA
– Mention classification: ACE 2005 dataset
• Annotating the test corpus with CRF++
• Evaluating results
• Exercise
Installing CRF++
• First, ssh compute-0-x where x=1..10
• Unzip the lab--NER.tar.gz file (tar -xvzf lab--NER.tar.gz)
• Enter the lab--NER directory
– Unzip the CRF++-0.54.tar.gz file (tar -xvzf CRF++-0.54.tar.gz)
– Enter the CRF++-0.54 directory
– Run ./configure
– Run make
Training/Classification (1)
• Notations
– xxx: train_it.dat/train_en.dat/train_mention.dat
– nnn: it.model/en.model/mention.model
– yyy: test_it.dat/test_en.dat/test_mention.dat
– zzz: test_it.tagged/test_en.tagged/test_mention.tagged
– ttt: test_it.eval/test_en.eval/test_mention.eval
• Note that test_it.dat already contains the correct NE tags, but the system does not use this information when tagging the data
Training/Classification (2)
• Enter the CRF++-0.54 directory
• Training
./crf_learn ../templates/template_4 ../corpus/xxx ../models/nnn
• Classification
./crf_test -m ../models/nnn ../corpus/yyy > ../corpus/zzz
• Evaluation
perl ../eval/conlleval.pl ../corpus/zzz > ../corpus/ttt
• See the results
cat ../corpus/ttt
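As a convenience, the xxx/nnn/yyy/zzz/ttt placeholders can be filled in mechanically for each dataset. The Python sketch below only builds the command strings and does not run them. The helper name `lab_commands` is hypothetical, and the paths assume the directory layout shown in the commands above.

```python
def lab_commands(dataset):
    """Build the four lab commands for dataset 'it', 'en' or 'mention'."""
    train = f"../corpus/train_{dataset}.dat"      # xxx
    model = f"../models/{dataset}.model"          # nnn
    test = f"../corpus/test_{dataset}.dat"        # yyy
    tagged = f"../corpus/test_{dataset}.tagged"   # zzz
    report = f"../corpus/test_{dataset}.eval"     # ttt
    return [
        f"./crf_learn ../templates/template_4 {train} {model}",
        f"./crf_test -m {model} {test} > {tagged}",
        f"perl ../eval/conlleval.pl {tagged} > {report}",
        f"cat {report}",
    ]

if __name__ == "__main__":
    # Print the commands for the Italian dataset, one per line.
    for command in lab_commands("it"):
        print(command)
```

The printed lines are meant to be pasted into the shell from inside the CRF++-0.54 directory, in the order given.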
THANKS
• I used material from
– Text Processing II: Bernardo Magnini
– Lab Text Processing II: Roberto Zanoli