Applications of Natural Language Processing

Course 6 – 29 March 2012

Diana Trandabăț

Content

What is Named Entity Recognition
Corpora, annotation
Evaluation and testing
Preprocessing
Approaches to NE
◦ Baseline
◦ Rule-based approaches
◦ Learning-based approaches
Multilinguality
Applications

Remember

Information Extraction (IE) proposes techniques to extract relevant information from unstructured or semi-structured texts.

The extracted information is transformed so that it can be represented in a fixed (computer-readable) format.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is an IE task that seeks to locate and classify text segments into predefined classes (for example Person, Location, Time expression).

We are proud to announce that Friday, February 17, we will have two sessions in the Education Seminar. At 12:30pm, at the Student Center Room 207, Joe Mertz will present "Using a Cognitive Architecture to Design Instructions". His session ends at 1pm. After a small lunch break, at 14:00, we meet again at Student Center Room 208, where Brian McKenzie will start his presentation. He will present "Information Extraction: how to automatically learn new models". This session ends around 15h.

[Figure: the seminar announcement with entities highlighted; legend: Person, Location, Time]


What are Named Entities?

NER involves two sub-tasks:
◦ Identification of proper names in texts (Named Entity Identification – NEI)
◦ Classification into a set of predefined categories of interest (Named Entity Classification – NEC)

What are Named Entities

Usual categories:
◦ Person names, Organizations (companies, government organisations, committees, etc), Locations (cities, countries, rivers, etc), Date and time expressions
Other common types:
◦ measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
Some domain-specific entities:
◦ names of drugs, medical conditions, names of ships, bibliographic references etc.

Basic Problems in NE

Variation of NEs – e.g. John Smith, Mr Smith, John.
Ambiguity of NE types:
◦ John Smith (company vs. person)
◦ May (person vs. month)
◦ Washington (person vs. location)
◦ 1945 (date vs. time)
Ambiguity with common words, e.g. "may"

More complex problems in NE

Issues of style, structure, domain, genre etc.
Punctuation, spelling, spacing, formatting, ... all have an impact:

Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom

Tell me more about Leonardo
Da Vinci

Some NE Annotated Corpora

MUC-6 and MUC-7 (Message Understanding Conference) corpora - English
CONLL shared task corpora
◦ http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
◦ http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
TIDES surprise language exercise (NEs in Cebuano and Hindi)
ACE (Automatic Content Extraction) – English http://www.ldc.upenn.edu/Projects/ACE/

The MUC-7 corpus

100 documents in SGML
News domain
1880 Organizations (46%)
1324 Locations (32%)
887 Persons (22%)
Inter-annotator agreement very high (~97%)
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf



The MUC-7 Corpus (2)

<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.

<p>Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.


NE Annotation Tools - GATE

Pre-processing for NER

Format detection
Word segmentation (for languages like Chinese)
Tokenisation
Sentence splitting
POS tagging

NER Systems

NER systems have been created that use linguistic grammar-based techniques as well as statistical methods.
◦ Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists.
◦ Statistical NER systems typically require a large amount of manually annotated training data.

From Corpora to System Development

Corpora are typically divided into a training and a testing portion
Rules/learning algorithms are trained on the training part
They are tuned on the testing portion in order to optimise
◦ Rule priorities, rule effectiveness, etc.
◦ Parameters of the learning algorithm and the features used
Evaluation set – the best system configuration is run on this data and the system performance is obtained
No further tuning once the evaluation set is used!
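
The three-way division described on this slide can be sketched as follows. This is only an illustration: the function name `split_corpus`, the 80/10/10 proportions and the fixed random seed are assumptions, not values from the course.

```python
import random

def split_corpus(documents, train=0.8, tune=0.1, seed=0):
    """Shuffle documents and split them into training, tuning and evaluation sets.

    The proportions are illustrative assumptions; whatever is left after the
    training and tuning portions becomes the evaluation set.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train)
    n_tune = int(len(docs) * tune)
    training = docs[:n_train]
    tuning = docs[n_train:n_train + n_tune]
    evaluation = docs[n_train + n_tune:]   # touched once, at the very end
    return training, tuning, evaluation

# Document ids stand in for real annotated documents.
train_set, tuning_set, evaluation_set = split_corpus(range(100))
```

Keeping the evaluation set in a separate variable that is used only once makes the "no further tuning" rule easier to respect.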

Two kinds of NE approaches

Knowledge Engineering
◦ rule based
◦ developed by experienced language engineers
◦ make use of human intuition
◦ requires only a small amount of training data
◦ development could be very time consuming
◦ some changes may be hard to accommodate

Learning Systems
◦ use statistics or other machine learning
◦ developers do not need advanced language engineering expertise
◦ requires large amounts of annotated training data
◦ some changes may require re-annotation of the entire training corpus

Baseline: list lookup approach

System that recognises only entities stored in its lists (gazetteers).
Advantages – simple, fast, language independent, easy to retarget (just create lists)
Disadvantages – impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
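
A minimal sketch of the list lookup baseline, assuming tiny hand-made gazetteers. The dictionary `GAZETTEERS` and the function `list_lookup` are hypothetical names introduced here for illustration.

```python
# Toy gazetteers (assumed contents, for illustration only).
GAZETTEERS = {
    "PERSON": {"John Smith", "Joe Mertz"},
    "LOCATION": {"Washington", "Sherwood Forest"},
}

def list_lookup(text, gazetteers=GAZETTEERS):
    """Return (start, end, type, string) for every gazetteer entry found in text."""
    matches = []
    for ne_type, names in gazetteers.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                matches.append((start, start + len(name), ne_type, name))
                start = text.find(name, start + 1)
    return sorted(matches)

found = list_lookup("Joe Mertz flew to Washington to meet Mr Smith.")
```

The variant "Mr Smith" is not found because only "John Smith" is listed, and "Washington" is always tagged as a location, which shows both disadvantages named above.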

Creating Gazetteer Lists

Online phone directories and yellow pages for person and organisation names
Locations lists
◦ http://ro.wikipedia.org/wiki/Format:Listele_localit%C4%83%C8%9Bilor_din_Rom%C3%A2nia_pe_jude%C8%9Be
Names lists
◦ http://ro.wikipedia.org/wiki/List%C4%83_de_nume_rom%C3%A2ne%C8%99ti
Automatic collection from annotated training data


Shallow Parsing Approach (internal structure)

Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
◦ Cap. Word + {City, Forest, Center, River}, e.g. Sherwood Forest
◦ Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
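
The two internal-structure rules can be written as one regular expression. This is a sketch: treating "Cap. Word" as a single capitalised alphabetic token is a simplifying assumption, and the names `LOCATION_PATTERN` and `find_locations` are hypothetical.

```python
import re

# Keywords taken from the two rules on the slide.
LOCATION_KEYWORDS = ["City", "Forest", "Center", "River",
                     "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# One capitalised word followed by one of the location keywords.
LOCATION_PATTERN = re.compile(
    r"\b[A-Z][a-z]+ (?:" + "|".join(LOCATION_KEYWORDS) + r")\b"
)

def find_locations(text):
    """Return every substring that matches the Cap. Word + keyword rule."""
    return LOCATION_PATTERN.findall(text)

locations = find_locations("We walked from Sherwood Forest to Portobello Street.")
# locations == ['Sherwood Forest', 'Portobello Street']
```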

Problems with the shallow parsing approach

Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
Semantic ambiguity: "John F. Kennedy" = airport (location), "Philip Morris" = organisation
Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]

Shallow Parsing Approach with Context

Use of context-based patterns is helpful in ambiguous cases
◦ "David Walton" and "Goldman Sachs" are indistinguishable
◦ But in "David Walton of Goldman Sachs", if we have "David Walton" recognised as Person, we can use the pattern "[Person] of [Organization]" and identify "Goldman Sachs" correctly.
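
A small sketch of applying the "[Person] of [Organization]" pattern once a Person is already known. The function `apply_person_of_org` and the approximation of a candidate name as a run of capitalised words are assumptions made for this example.

```python
import re

# A candidate name is approximated as one or more capitalised words.
CAPITALISED_SEQUENCE = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def apply_person_of_org(text, known_persons):
    """Label the capitalised sequence after '<Person> of' as an Organization."""
    organizations = []
    for person in known_persons:
        pattern = re.compile(
            re.escape(person) + r" of (" + CAPITALISED_SEQUENCE + r")"
        )
        organizations.extend(m.group(1) for m in pattern.finditer(text))
    return organizations

orgs = apply_person_of_org(
    "David Walton of Goldman Sachs spoke first.", {"David Walton"}
)
# orgs == ['Goldman Sachs']
```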

Examples of context patterns

[PERSON] earns [MONEY]
[PERSON] joined [ORGANIZATION]
[PERSON] left [ORGANIZATION]
[PERSON] joined [ORGANIZATION] as [JOBTITLE]
[ORGANIZATION]'s [JOBTITLE] [PERSON]
[ORGANIZATION] [JOBTITLE] [PERSON]
the [ORGANIZATION] [JOBTITLE]
part of the [ORGANIZATION]
[ORGANIZATION] headquarters in [LOCATION]
price of [ORGANIZATION]
sale of [ORGANIZATION]
investors in [ORGANIZATION]
[ORGANIZATION] is worth [MONEY]
[JOBTITLE] [PERSON]
[PERSON], [JOBTITLE]

Context patterns

Patterns are only indicators based on likelihood
Can set priorities based on frequency thresholds
Need training data for each domain
More semantic information would be useful (e.g. to cluster groups of verbs)

Example Rule-based System - ANNIE

Created as part of GATE
GATE – Sheffield’s open-source infrastructure for language processing
GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
GATE has a finite-state pattern-action rule language, used by ANNIE
ANNIE modified for MUC guidelines – 89.5% f-measure on MUC-7 corpus

NE Components

The ANNIE system – a reusable and easily extendable set of components

Gazetteer lists for rule-based NE

Needed to store the indicator strings for the internal structure and context rules
Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
Internal organisation indicators – e.g., company designators {GmbH, Ltd, Inc, …}
Produces Lookup results of the given kind

Using co-reference to classify ambiguous NEs

Orthographic co-reference module that matches proper names in a document
Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
May not reclassify already classified entities
Classification of unknown entities is very useful for surnames which match a full name, or abbreviations, e.g. [Napoleon] will match [Napoleon Bonaparte]; [International Business Machines Ltd.] will match [IBM]
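
A hedged sketch of the two matching rules mentioned on this slide (part of a full name, and abbreviation). It is not ANNIE's actual module: the functions `acronym` and `propagate_types` and the small designator list are assumptions made for illustration.

```python
def acronym(name):
    """Initials of the capitalised words, skipping designators such as 'Ltd.'."""
    designators = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}
    return "".join(
        w[0] for w in name.split() if w not in designators and w[0].isupper()
    )

def propagate_types(classified, unclassified):
    """Give an unclassified name the type of a classified name it matches.

    Already classified names are never touched, in line with the slide.
    """
    resolved = {}
    for name in unclassified:
        for full_name, ne_type in classified.items():
            is_part = name in full_name.split()       # Napoleon ~ Napoleon Bonaparte
            is_acronym = name == acronym(full_name)   # IBM ~ International Business Machines Ltd.
            if is_part or is_acronym:
                resolved[name] = ne_type
                break
    return resolved

classified = {
    "Napoleon Bonaparte": "PERSON",
    "International Business Machines Ltd.": "ORGANIZATION",
}
resolved = propagate_types(classified, ["Napoleon", "IBM", "Paris"])
```

"Paris" stays unresolved because it matches no classified name in the document.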

Machine Learning Approaches

ML approaches frequently break down the NER task in two parts:
◦ Recognising the entity boundaries
◦ Classifying the entities in the NE categories
Work is usually only on one task or the other
Tokens in text are often coded with the IOB scheme
◦ O – outside, B-NE – first word in NE, I-NE – all other words in NE
◦ Example:
  Argentina  B-LOC
  played     O
  with       O
  Del        B-PER
  Bosque     I-PER
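
The IOB coding can be produced from token-level entity spans as in the sketch below. The span representation (start token, exclusive end token, type) and the function name `to_iob` are assumptions chosen for this example.

```python
def to_iob(tokens, entities):
    """Convert entity spans to IOB tags.

    entities: list of (start_token, end_token_exclusive, type).
    """
    tags = ["O"] * len(tokens)
    for start, end, ne_type in entities:
        tags[start] = "B-" + ne_type          # first word in the NE
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type          # all other words in the NE
    return tags

tokens = ["Argentina", "played", "with", "Del", "Bosque"]
tags = to_iob(tokens, [(0, 1, "LOC"), (3, 5, "PER")])
# tags == ['B-LOC', 'O', 'O', 'B-PER', 'I-PER']
```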

IdentiFinder [Bikel et al 99]

Based on Hidden Markov Models
Features
◦ Capitalisation
◦ Numeric symbols
◦ Punctuation marks
◦ Position in the sentence
◦ 14 features in total, combining above info, e.g., containsDigitAndDash (09-96), containsDigitAndComma (23,000.00)
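
Word-level features of this kind are easy to compute per token. The sketch below is not IdentiFinder's actual feature set: only `containsDigitAndDash` and `containsDigitAndComma` are named on the slide, while the other feature names and the function `word_features` are assumptions.

```python
def word_features(token, position):
    """Compute a few orthographic features for one token.

    position is the index of the token in its sentence.
    """
    has_digit = any(ch.isdigit() for ch in token)
    return {
        "initCap": token[:1].isupper(),                       # capitalisation
        "containsDigitAndDash": has_digit and "-" in token,   # e.g. 09-96
        "containsDigitAndComma": has_digit and "," in token,  # e.g. 23,000.00
        "isPunctuation": token != "" and all(not ch.isalnum() for ch in token),
        "firstWord": position == 0,                           # position in sentence
    }

features = word_features("09-96", 3)
```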

IdentiFinder (2)

MUC-6 (English) and MET-1 (Spanish) corpora used for evaluation
Mixed case English
◦ IdentiFinder - 94.9% f-measure
Spanish mixed case
◦ IdentiFinder – 90%
◦ Lower case names, noisy training data, less training data
Training data: 650,000 words, but similar performance with half of the data. Less than 100,000 words reduces the performance to below 90% on English

Fine-grained Classification of NEs [Fleischman 02]

Finer-grained categorisation needed for applications like question answering
Person classification into 8 sub-categories: athlete, politician/government, clergy, businessperson, entertainer/artist, lawyer, doctor/scientist, police.
Approach using local context and global semantic information such as WordNet
Used a decision list classifier and IdentiFinder to automatically construct a training set from untagged data
Held-out set of 1300 instances, hand annotated

Fine-grained Classification of NEs (2)

Word frequency features – how often the words surrounding the target instance occur with a specific category in training
◦ For each of the 8 categories, 10 distinct word positions = 80 features per instance
◦ 3 words before & after the instance
◦ The two-word bigrams immediately before and after the instance
◦ The three-word trigrams before/after the instance

#  Position          N-gram     Category     Freq.
1  Previous unigram  introduce  politician   3
2  Previous unigram  introduce  entertainer  43
3  Following bigram  into that  politician   2
4  Following bigram  into that  business     0
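
A rough sketch of how the 10 context positions and their per-category counts could be collected. It is an illustration under assumptions, not the procedure of [Fleischman 02]: the position names, the `<s>`/`</s>` padding tokens, and the functions `context_positions` and `count_table` are all introduced here.

```python
from collections import Counter

def context_positions(tokens, start, end):
    """Return the 10 context n-grams around the instance tokens[start:end]."""
    prev3 = (["<s>"] * 3 + tokens[:start])[-3:]   # padded left context
    next3 = (tokens[end:] + ["</s>"] * 3)[:3]     # padded right context
    positions = {}
    for k in range(3):
        positions[f"prev_unigram_{k + 1}"] = prev3[-(k + 1)]
        positions[f"next_unigram_{k + 1}"] = next3[k]
    positions["prev_bigram"] = " ".join(prev3[-2:])
    positions["next_bigram"] = " ".join(next3[:2])
    positions["prev_trigram"] = " ".join(prev3)
    positions["next_trigram"] = " ".join(next3)
    return positions

def count_table(training):
    """Count how often each (position, n-gram) occurs with each category."""
    counts = Counter()
    for tokens, start, end, category in training:
        for position, ngram in context_positions(tokens, start, end).items():
            counts[(position, ngram, category)] += 1
    return counts

sentence = "let me introduce Bill Clinton into that debate".split()
counts = count_table([(sentence, 3, 5, "politician")])
```

At classification time, looking up the 10 n-grams of a new instance for each of the 8 categories yields the 80 frequency features.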

Fine-grained Classification of NEs (3)

Topic signatures and WordNet information
◦ Compute lists of terms that signal relevance to a topic/category [Lin&Hovy 00] & expand with WordNet synonyms to counter unseen examples
◦ Politician – campaign, republican, budget
The topic signature features convey information about the overall context in which each instance exists
Due to differing contexts, instances of the same name in a single text were classified differently

Performance Evaluation

Evaluation metric – mathematically defines how to measure the system’s performance against a human-annotated gold standard
Scoring program – implements the metric and provides performance measures
◦ For each document and over the entire corpus
◦ For each type of NE

The Evaluation Metric

Precision = correct answers / answers produced
Recall = correct answers / total possible correct answers
Trade-off between precision and recall
F-Measure = (β² + 1)PR / (β²P + R) [van Rijsbergen 75]
β reflects the weighting between precision and recall, typically β=1
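
A worked sketch of the three measures. It uses the form (β² + 1)PR / (β²P + R) for the F-measure; the function name `precision_recall_f` and the example counts are assumptions made up for illustration.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Compute precision, recall and F-measure from raw counts.

    correct:  answers the system got right
    produced: all answers the system produced
    possible: all correct answers in the gold standard
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    denominator = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f

# Made-up counts: 10 answers produced, 8 of them correct, 16 in the gold standard.
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
```

With these counts precision is 0.8 and recall is 0.5, so for β=1 the F-measure is 2·0.8·0.5 / (0.8 + 0.5), roughly 0.615.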

The Evaluation Metric (2)

We may also want to take account of partially correct answers:
Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
Why: NE boundaries are often misplaced, so some results are only partially correct
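
The partial-credit variant can be computed the same way. The function name `partial_credit_scores` and the example counts below are assumptions for illustration.

```python
def partial_credit_scores(correct, partial, incorrect, missing):
    """Precision and recall where a partially correct answer counts as half."""
    credit = correct + 0.5 * partial
    produced = correct + incorrect + partial   # everything the system output
    possible = correct + missing + partial     # everything in the gold standard
    precision = credit / produced if produced else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall

# Made-up counts: 6 correct, 2 partially correct, 2 incorrect, 4 missing.
precision, recall = partial_credit_scores(correct=6, partial=2, incorrect=2, missing=4)
```

Here the credit is 6 + 1 = 7, giving a precision of 7/10 and a recall of 7/12.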

The GATE Evaluation Tool

Multilingual Named Entity Recognition

Recent experiments are aimed at NE recognition in multiple languages

TIDES surprise language evaluation exercise measures how quickly researchers can develop NLP components in a new language

CONLL’02, CONLL’03 focus on language-independent NE recognition

Analysis of the NE Task in Multiple Languages [Palmer&Day 97]

Language    NE    Time/Date  Numeric exprs.  Org/Per/Loc
Chinese     4454  17.2%      1.8%            80.9%
English     2242  10.7%      9.5%            79.8%
French      2321  18.6%      3%              78.4%
Japanese    2146  26.4%      4%              69.6%
Portuguese  3839  17.7%      12.1%           70.3%
Spanish     3579  24.6%      3%              72.5%

Analysis of Multilingual NE (2)

Numerical and time expressions are very easy to capture using rules
Together they constitute about 20-30% of all NEs
All numerical expressions in the 6 languages required only 5 patterns
Time expressions similarly require only a few rules (fewer than 30 per language)
Many of these rules are reusable across the languages

What is needed for multilingual NE

Extensive support for non-Latin scripts and text encodings, including conversion utilities
◦ Automatic recognition of encoding [Ignat et al 03]
◦ Occupied up to 2/3 of the TIDES Hindi effort
Bi-lingual dictionaries
Annotated corpus for evaluation
Internet resources for gazetteer list collection (e.g., phone books, yellow pages, bi-lingual pages)

Multilingual Data - GATE

All processing, visualisation and editing tools use GUK

Gazetteer-based Approach to Multilingual NE [Ignat et al 03]

Deals with locations only
Even more ambiguity than in one language:
◦ Multiple places that share the same name, such as the fourteen cities and villages in the world called ‘Paris’
◦ Place names that are also words in one or more languages, such as ‘And’ (Iran), ‘Split’ (Croatia)
◦ Places have varying names in different languages (Italian ‘Venezia’ vs. English ‘Venice’, German ‘Venedig’, French ‘Venise’)

Gazetteer-based multilingual NE (2)

Disambiguation module applies heuristics based on location size and country mentions (prefer the locations from the country mentioned most)
Performance evaluation:
◦ 853 locations from 80 English texts
◦ 96.8% precision
◦ 96.5% recall
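
The two heuristics (country mentions first, then location size) could be combined as in the sketch below. This is not the module of [Ignat et al 03]: the gazetteer layout, the made-up size classes and the function `disambiguate` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical gazetteer: name -> list of (country, size class).
# The size classes are made-up ranks (larger number = larger place).
GAZETTEER = {
    "Paris": [("France", 3), ("United States", 1)],
    "Split": [("Croatia", 2)],
}

def disambiguate(place_names, country_mentions, gazetteer=GAZETTEER):
    """Choose one country per ambiguous place name.

    Candidates from the most-mentioned country win; size breaks ties.
    """
    mention_counts = Counter(country_mentions)
    resolved = {}
    for name in place_names:
        candidates = gazetteer.get(name, [])
        if candidates:
            best = max(candidates, key=lambda c: (mention_counts[c[0]], c[1]))
            resolved[name] = best[0]
    return resolved

by_mentions = disambiguate(["Paris"], ["United States", "United States", "France"])
by_size = disambiguate(["Paris"], [])
```

In the first call the text mentions the United States most often, so that reading of "Paris" is preferred; with no country mentions at all, the larger place wins.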

Machine Learning for Multilingual NE

CONLL’2002 and 2003 shared tasks were NE in Spanish, Dutch, English, and German
The most popular ML techniques used:
◦ Maximum Entropy (5 systems)
◦ Hidden Markov Models (4 systems)
◦ Connectionist methods (4 systems)
Combining ML methods has been shown to boost results

ML for NE at CONLL (2)

The choice of features is at least as important as the choice of ML algorithm
◦ Lexical features (words)
◦ Part-of-speech
◦ Orthographic information
◦ Affixes
◦ Gazetteers
External, unmarked data is useful to derive gazetteers and for extracting training instances

Applications of NER

Named Entity Recognition in Web Search
Medical NER (Medline abstracts)

Named Entity Recognition in Web Search

71% of the queries in search engines contain named entities
These named entities may be useful to process the query

Named Entity Recognition in Web Search

Motivating Examples
◦ Consider the query “harry potter walkthrough”
  The context of the query strongly indicates that the named entity “harry potter” is a “Game”
◦ Consider the query “harry potter cast”
  The context of the query strongly indicates that the named entity “harry potter” is a “Movie”

Named Entity Recognition in Web Search

Identifying named entities can be very useful. Consider the following examples related to the query “harry potter walkthrough”:
◦ Ranking: Documents about videogames should be pushed up in the rankings
◦ Suggestion: Relevant suggestions can be generated like “harry potter cheats” or “lord of the rings walkthrough”

Medical NER

Identification of Protein and Gene Terms in medical texts
The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of a widely accepted standard notation for biomedical entities.
OSIRIS http://ibi.imim.es/OSIRISv1.2.html

Future challenges

Towards semantic tagging of entities
New evaluation metrics for semantic entity recognition
Expanding the set of entities recognised – e.g., vehicles, weapons, substances (food, drug)
Finer-grained hierarchies, e.g., types of Organizations (government, commercial, educational, etc.), Locations (regions, countries, cities, water, etc)

Requirements (Team: max 1 person, Deadline: 4 April)

1) Build NER gazetteers for Romanian
◦ Extract from Wikipedia lists of Romanian male/female names;
◦ Extract from Wikipedia lists of Romanian cities;
◦ Extract from the Internet lists of Romanian companies
2) Extract from texts email addresses and phone numbers of any format:
◦ TEL +40-722-222-222, Phone: (722) 222-222, Tel (+40): 232222222,
◦ emails including “john(at)smith.inc.edu” or “john(la)info punct uaic punct ro”
3) Extract as many dates as possible from texts, including “23 iunie 2009”, “ieri”, “anul trecut”, “toamna 2001”, “la ora 2 pm” etc.
4) Use the gazetteers and the programs from above to extract named entities from Romanian Wikipedia pages. Write the output in XML format.

Further reading

Borthwick, A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999.
Chinchor, N. MUC-7 Named Entity Task Definition Version 3.5. Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997.
Ignat, C., Pouliquen, B., Ribeiro, A. and Steinberger, R. Extending an Information Extraction Tool Set to Eastern-European Languages. Proceedings of the Workshop on Information Extraction for Slavonic and other Central and Eastern European Languages (IESL'03). 2003.
McDonald, D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996.
Maynard, D., Bontcheva, K. and Cunningham, H. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
Cunningham, H. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
Pastra, K., Maynard, D., Cunningham, H., Hamza, O. and Wilks, Y. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.

Links

CCG Group http://cogcomp.cs.illinois.edu/demo/ner/results.php
LingPipe http://alias-i.com/lingpipe/web/demo-ne.html
Stanford NER http://nlp.stanford.edu/software/CRF-NER.shtml

Thanks!
