Fefor, March 2002
Named-Entity Recognition for Swedish
Past, Present and Way Ahead...
Dimitrios Kokkinakis
Outline
Looking Back: AVENTINUS, flexers,...
Current Status & Workplan:
– Resources: Lexical, Textual and Algorithmic
– NER on Part-of-Speech Annotated Material
– Way Ahead, Approach and Evaluation Samples
Resource Localization (if required...)
NE Tagset and Guidelines
Survey of the Market for NER: Tools, Projects,...
Problems: Ambiguity, Metonymy, Text Format (Orthography, Source Modality...)...
Looking Back...
NER in the AVENTINUS project (LE4) without lists
No proper evaluation on a large scale
Collection of a few types of resources, e.g. appositives
Method: finite-state grammars ("semantic grammars"), one for each category
Delivered rules (for Swedish NER) that were compiled in a user-required product
See Kokkinakis (2001), svenska.gu.se/~svedk/publics/swe_ner.ps, for a grammar used for identifying "Transportation Means"
Swe-NER without Lists
...see the flexers example
How long can we go without lists?
In the framework of...
my PhD, a collection of 35 documents was manually tagged: newspaper articles (30) and reports from a popular science periodical (5)
DOCUMENTS       35
TOKENS          20,927
PROPER NOUNS    1,422
ENTITY          #AMOUNT
Persons         419 (84f)
Locations       569 (89f)
Organizations   272 (83f)
Temporal        504 (89f)
Monetary        80 (97f)
Status & Workplan
Resources: Lexical, Textual and Algorithmic
NER on Part-of-Speech Annotated Material
Way Ahead, Approach and Evaluation Samples
Evidence
McDonald (1996):
Internal evidence: taken from within the sequence of words that comprise the name, such as the content of lists of proper names (gazetteers), abbreviations and acronyms (Ltd, Inc., GmbH)
External evidence: provided by the context in which a name appears. The characteristic properties or events in a syntactic relation (verbs, adjectives) with a proper noun can be used to provide confirming or criterial evidence for a name's category. This is an important type of complementary information, since internal evidence can never be complete...
Lexical Resources (1) (Internal Evidence)
Name Lists (Gazetteers)
Multiword names
– Organizations (profit): 1,200
– Organizations (non-profit): 60
– Locations: 40
Single names
– Org/commerc.: 1,500
– Person First: 70,000
– Person Last: 5,000
– Cities non-Swe.: 2,200
– Org/no-comm: 200
– Provinces: 70
– Airports: 10
– Cities Swe.: 1,600
– Countries: 230
– Events: 10
– ...
Lexical Resources (2) (Internal Evidence)
Designators, affixes, and trigger words
Titles, premodifiers, appositions...
e.g. persons
– PostMods: Jr, Junior,…
– PreTitles: VD, Dr, sir,…
– Nationality: belgaren, brasilianaren, dansken,…
– Occupation: amiral, kriminolog, psykolog,...
e.g. organizations
– Design. & Triggers: bolaget X, föreningen X, institutet X, organisationen X, stiftelsen X, förbundet X,… X Agency, X Biotech, X Chemical, X Consultancy,…
– Affixes: +kollegium, +verket,...
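The designators, triggers and affixes listed above can be turned into simple patterns. The following is a minimal sketch, assuming plain untagged text; the function name org_candidates and the test names "Medi Biotech" and "Vägverket" are illustrative assumptions and not part of the original resources.

```python
import re

# Trigger words, designators and affixes taken from the slide (lists truncated there too).
PRE_TRIGGERS = ["bolaget", "föreningen", "institutet", "organisationen", "stiftelsen", "förbundet"]
POST_DESIGNATORS = ["Agency", "Biotech", "Chemical", "Consultancy"]
SUFFIXES = ["kollegium", "verket"]

CAP = r"[A-ZÅÄÖ][\w-]*"  # assumption: a name starts with a capital letter
PRE = "|".join(PRE_TRIGGERS)
POST = "|".join(POST_DESIGNATORS)
SUF = "|".join(SUFFIXES)

ORG_PATTERNS = [
    re.compile(r"\b(?:" + PRE + r")\s+(" + CAP + r")"),     # bolaget X
    re.compile(r"\b(" + CAP + r"\s+(?:" + POST + r"))\b"),  # X Biotech
    re.compile(r"\b(" + CAP + r"(?:" + SUF + r"))\b"),      # X+verket
]


def org_candidates(text):
    """Return organization candidates found by any of the three pattern types."""
    found = []
    for pattern in ORG_PATTERNS:
        found.extend(pattern.findall(text))
    return found
```

Such patterns only propose candidates; a match is internal evidence, not a final classification.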
Lexical Resources (External Evidence)
The Volvo/Saab case (can be generalized): a typical, frequent and fairly difficult example
For instance:
object: car
– ...Saab 9000...
– ...mellanklassbilar som Volvo,...
– ...att köra Volvo i en Volvostad som...
– ...i en stor svart Volvo och blinkade...
– ...tjuven försvinner i en stulen Saab
– ...tappat kontrollen över sin Volvo
object: share
– Volvo steg med 12 kronor
– Saab backade med 1 procent
– ...gick Volvo ned med 10 kronor...
organization
– ...
...ignore infrequent cases and details
Flexers Example
Sense1: object, the product (vehicle)
Morphology: number (singular/plural), case (nominative/genitive), definiteness
Samples: Volvon är billigare, singular, e.g. en svart Volvo ...
Corpus Analysis/Usage:
1. Saab/Volvo NUM
2. Saab/Volvo NUM? (coupé|turbo|dieselcabriolet|corvette|transporter|cc|...)
3. (GENITIVE/POSS-PRN/ARTCL) ADJ/PRTCPL* Saab/Volvo NUM?
4. (GENITIVE/POSS-PRN/ARTCL)? ADJ/PRTCPL+ Saab/Volvo NUM?
5. bilar som Saab/Volvo
6. typen/kör/*köra Saab/Volvo
>9 out of 10 cases
no rule without exception: [Saab/Volvo TimeExpression; När Volvo 1994...]
Flexers Example
Sense2: object, the share
Morphology: number (singular/plural), case (nominative/genitive), definiteness
Samples: Volvon har gått upp med...
Corpus Analysis/Usage:
1. Saab/Volvo AUX? VERB(steg/stig*/backa*)
2. Saab/Volvo AUX? VERB(öka*/minska*)? med NUM procent
3. Saab/Volvo gick (tillbaka kraftigt|mot strömmen|upp|ned)
4. Saab/Volvo NUM procent
Rest of cases? Sense3 the building <not found>
Rest of cases? Sense4 the organization
Flexers Example
CAR_TYPE    (Saab|Volvo|Ford|...)/NP...
VERB        (stiga|stiger|stigit|steg|backa[^/ ]+|...)/(VMISA|VMU0A|...)
AUX_VERB    [^/ ]+/(VTISA|VTU0A|...)
MC          [0-9][0-9]?[0-9]?/MC|[0-9][0-9]?[.,][0-9][0-9]?/MC
SPACE       [ \t]+
{CAR_TYPE}{SPACE}({AUX_VERB}{SPACE})?{VERB}("med/S "{MC}{SPACE}procent)?    {tag-as-sense2;}
{CAR_TYPE}{SPACE}{MC}{SPACE}procent    {tag-as-sense2;}
{CAR_TYPE}{SPACE}gick{SPACE}("tillbaka/ kraftigt"|"mot/S strömm"|"upp/"|"ned/")    {tag-as-sense2;}
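The flex-style definitions above can be approximated with Python's re module. This is a minimal sketch, not the original flexer: it assumes input written as word/TAG tokens separated by spaces, the tags on the sample sentences are illustrative, and the name is_share_sense is hypothetical.

```python
import re

# Named sub-patterns mirroring the definitions on the slide (alternatives truncated there too).
CAR_TYPE = r"(?:Saab|Volvo|Ford)/NP[^/ ]*"
VERB = r"(?:stiga|stiger|stigit|steg|backa[^/ ]+)/(?:VMISA|VMU0A)"
AUX_VERB = r"[^/ ]+/(?:VTISA|VTU0A)"
MC = r"(?:[0-9][0-9]?[0-9]?|[0-9][0-9]?[.,][0-9][0-9]?)/MC"
SPACE = r"[ \t]+"

SENSE2_RULES = [
    # Volvo (AUX)? steg (med 12 procent)?
    re.compile(f"{CAR_TYPE}{SPACE}(?:{AUX_VERB}{SPACE})?{VERB}"
               f"(?:{SPACE}med/S{SPACE}{MC}{SPACE}procent)?"),
    # Volvo 12 procent
    re.compile(f"{CAR_TYPE}{SPACE}{MC}{SPACE}procent"),
    # Volvo gick (tillbaka|mot strömmen|upp|ned); assumption: gick carries a tag too
    re.compile(f"{CAR_TYPE}{SPACE}gick/[^/ ]+{SPACE}(?:tillbaka|mot|upp|ned)/"),
]


def is_share_sense(tagged):
    """True if any sense-2 (share) rule matches the tagged sentence."""
    return any(rule.search(tagged) for rule in SENSE2_RULES)
```

A sentence such as "Volvo/NP00N steg/VMISA med/S 12/MC procent/NN" is accepted, while "en/DT svart/AQ Volvo/NP00N" is left for the sense-1 (vehicle) rules.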
SUC-2
The second version of SUC has been semi-automatically?? annotated with "NAMES"
15131 PERSON
8771 PLACE
6309 INST
1887 WORK
638 PRODUCT
540 OTHER
364 ANIMAL
280 MYTH
245 EVENT
242 FORMULA
Här har <NAME TYPE=ANIMAL>Nalle </NAME> frukosterat...
...ber <NAME TYPE=MYTH>Herren </NAME> välsigna vår...
...årsmöte i <NAME TYPE=OTHER> Kristiansborgskyrkan</NAME>…
...till nitrat ( <DISTINCT TYPE=FORMULA> NO3-</DISTINCT> ) och därefter...
POS Taggers & Tagset
Three off-the-shelf POS taggers have been downloaded and are currently under development with our new tagset
TreeTagger: HMM + Decision Trees
TnT: Viterbi (HMM)
Brill's: Transformation-based
NER is a complex of different tasks; POS tagging is a basic task which can aid the detection of entities
POS Taggers & Tagset
The NER will be/is applied on part-of-speech annotated material. The relevant tags for marking proper nouns (as found in the training corpus, SUC2):
NPNSND    ...i Europa/NPNSND har inte...
NPNSGD    ...för Litauens/NPNSGD parlament där...
NPUSND    ...berättar Torgny/NPUSND Lindgren/...
NPUSGD    ...är Mona Eliassons/NPUSGD recept...
NP*SND    Ulf Norrman vann H-43/NP*SND...
XF        …vunnit en Grand/XF Slam/XF...
Y         ...ÖB/Y under kriget i Libanon...
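The tags listed above give a cheap first pass for finding name candidates in tagged text. A minimal sketch, assuming word/TAG tokens separated by spaces; XF and Y are read here as foreign-word and abbreviation tags, judging from the examples, and the function names are hypothetical.

```python
from typing import List


def is_name_tag(tag: str) -> bool:
    """True for proper-noun tags (NP...) and for the XF and Y tags from the slide."""
    return tag.startswith("NP") or tag in ("XF", "Y")


def candidate_names(tagged: str) -> List[str]:
    """Group adjacent name-tagged tokens of a word/TAG string into candidates."""
    candidates, current = [], []
    for token in tagged.split():
        word, _, tag = token.rpartition("/")
        if word and is_name_tag(tag):
            current.append(word)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates
```

The candidates still need classification; the tags only say that a token sequence is name-like, not which category it belongs to.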
Explore JAPE & GATE2
Java Annotation Pattern Engine (JAPE) Grammar
– Set of rules
  » LHS: regular expression over annotations
  » RHS: annotations to be added
  » Priority
  » Left and Right context around the pattern
– Rules are compiled into an FST over annotations
JAPE Rules
Rule: Location1
Priority: 25
(
 ({Lookup.majorType==loc_key,Lookup.minorType==pre}{SpaceToken})?
 {Lookup.majorType==location}
 ({SpaceToken} {Lookup.majorType==loc_key,Lookup.minorType==post})?
):locName --> :locName.Location={kind="location",rule="Location1"}
China sea: location
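What the Location1 rule matches can be illustrated outside GATE. The sketch below is not JAPE and does not model priorities or FST compilation; it only mimics the pattern "optional pre key, location, optional post key" over a list of tokens. The LOOKUP table and its entries are hypothetical stand-ins for gazetteer Lookup annotations.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-gazetteer: token -> (majorType, minorType).
LOOKUP: Dict[str, Tuple[str, str]] = {
    "China": ("location", ""),
    "Everest": ("location", ""),
    "Mount": ("loc_key", "pre"),
    "sea": ("loc_key", "post"),
}


def find_locations(tokens: List[str]) -> List[str]:
    """Return spans matching: (loc_key/pre)? location (loc_key/post)?"""
    found, i = [], 0
    while i < len(tokens):
        start, j = i, i
        # Optional preceding key word, only if something follows it.
        if LOOKUP.get(tokens[j]) == ("loc_key", "pre") and j + 1 < len(tokens):
            j += 1
        if LOOKUP.get(tokens[j], ("", ""))[0] == "location":
            end = j + 1
            # Optional following key word.
            if end < len(tokens) and LOOKUP.get(tokens[end]) == ("loc_key", "post"):
                end += 1
            found.append(" ".join(tokens[start:end]))
            i = end
        else:
            i += 1
    return found
```

For the tokens "China sea" the location entry and the post key are joined into one Location span, as in the example on the slide.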
Plan for (the rest of) 2002
January-April: inventory of existing L&A resources; re-training of POS taggers with Språkdata's tagset; localization, "completion" & structuring of L-resources; provision of (draft) guidelines for the NER task; working with "WORK&ART" and "EVENTS"
May-September: implementations; porting of old scripts to the current state of affairs; SUC2 with ML?; developing a Swedish JAPE module in GATE2
October: evaluation
November: new web interface and GATE2 integration
December: wrapping up
Annotation Guidelines
First draft specifications for the creation of simple guidelines for the NER work as applied to Swedish data have been written
Ideas from MUC, ACE and own experience
The guidelines are expected to evolve during the course of the project, being refined and extended
The purpose of the guidelines is to impose some consistency measures for annotation and evaluation, and to give the potential future users of the system a clearer picture of what the recognition components can offer
Pragmatic rather than theoretic...
Guidelines cont'd
Named Entity Recognition (NER) consists of a number of subtasks, corresponding to a number of XML tag elements
The only insertions allowed during tagging are tags enclosed in angled brackets. No extra white space or carriage returns are to be inserted
The markup will have the form of the entity type and attribute information:
<ELEMENT-NAME ATTR-NAME="ATTR-VALUE">a text-string</ELEMENT-NAME>
Six (+1) categories will be recognized
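The markup constraint above (only tags are inserted, no extra white space) is easy to check mechanically. A minimal sketch, assuming entities are given as non-overlapping character spans; the ENAMEX element and the type value P-PLC are taken from the guidelines, while the function names are hypothetical.

```python
import re
from typing import List, Tuple


def tag_entities(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each non-overlapping (start, end, type) span in an ENAMEX element."""
    out, pos = [], 0
    for start, end, etype in sorted(spans):
        out.append(text[pos:start])  # untouched text before the entity
        out.append(f'<ENAMEX TYPE="{etype}">{text[start:end]}</ENAMEX>')
        pos = end
    out.append(text[pos:])
    return "".join(out)


def strip_tags(tagged: str) -> str:
    """Remove ENAMEX tags; the result should equal the original text exactly."""
    return re.sub(r"</?ENAMEX[^>]*>", "", tagged)
```

Comparing strip_tags(tag_entities(text, spans)) with the original text verifies that nothing but tags was inserted.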
"PLACE" NAMES
<ENAMEX TYPE="G-PLC">; Description: a (natural) geographically/geologically or astronomically defined location with physical extent, such as bodies of water, rivers, mountains, geological formations, islands, continents, stars, galaxies, …
<ENAMEX TYPE="P-PLC">; Description: geo-political entities, i.e. politically defined geographical regions (nations, states, cities, villages, provinces, regions, other populated urban areas, …); e.g., the capital city is used to refer to the nation's government, e.g. USA attackerade X ("the USA attacked X")
<ENAMEX TYPE="F-PLC">; Description: facility entities, which are (permanent) man-made artefacts falling under the domains of architecture, transportation infrastructure and civil engineering, such as streets, parks, stadiums, airports, ports, museums, tunnels, bridges, …
"PERSON" NAMES
<ENAMEX TYPE="H-PRS">; Description: person entities are limited to humans and fictional human characters appearing in TV, movies etc.; Christian names, family names, nicknames, group names, tribes, …
<ENAMEX TYPE="O-PRS">; Description: saints, gods, names of animals and pets, …
e.g. Herren, Gud, Athena, Ior,...
"ORGANIZATION" NAMES
<ENAMEX TYPE="C-ORG">; Description: organization entities are divided into two categories; the first is limited to commercial corporations, multinational organizations, tv-channels, … (both multiword and single-word entities)
<ENAMEX TYPE="G-ORG">; Description: organization entities of the second group are limited to governmental and non-profit organizations such as political parties, governmental bodies at any level of importance, political groups, non-profit organizations, unions, universities, embassies, army… (sport teams, music groups, stock exchanges, orchestras, churches,...)?
"EVENT" NAMES
<ENAMEX TYPE="EVN">; Description: historical, sports, festivals, races, war and peace events (battles), conferences, Christmas, holidays
e.g. formel-1, andra världskriget, Julitrav, VM, OS, Mittmässan, elitserien, ...
Open category; orthography might not be enough...
"WORK/ART" NAMES
<ENAMEX TYPE="WRK">; Description: this is one of the most difficult categories, since a work or art name usually consists of tokens that are seldom proper nouns. Titles of books, films, songs, artwork, paintings, tv-programs, magazines, newspapers, …
e.g. X sjöng "Barnens visa"
e.g. Ett fotografi med titeln Galna turister visar en gatumarknad i Brasilien
Open category; long chains; orthography is not enough...
"OBJECT" NAMES
<ENAMEX TYPE="OBJ">; Description: ships, machines, artefacts, products, diseases/prizes named after people, boats, …
e.g. fartyget Miriam, Alzheimers sjukdom
Tool Comparison-1 (IE)
INFORMATION EXTRACTION SYSTEMS
Columns: Named Entities | Nominal Entities | Normalized Time | Relations | Events | Multi-Lingual | Extensible via Machine Learning | Extensible via Programming
Each row: marks for the first five columns; languages; marks for the two extensibility columns (the column position of individual marks is not preserved in this transcript)
COMMERCIAL COMPANIES
AeroText, Lockheed Martin: x x x x x; EN, ES, ZH, JP; x x
IdentiFinder, BBN/Verizon: POLTM x x x; EN, ZH, (AR); x
Intelligent Miner for Text, IBM: x x x; EN; x
Net Owl, SRA: POLTM+ x x; EN (ES, AR, Ti, JP, DE, FR, (Russian); x
Thing Finder, Inxight: POLTM+ 0; EN, ES, ZH, JP, FR; x
Context, Oracle: x; EN; x
Semio Taxonomy: -; EN, FR, ES, IT, JP, DU, (ZH, DE); x
LexiQuest Mine: x x; EN, FR, ES, DE, DU; x
LingSoft: x; EN; x
CoGenTex/Cornell: x x x; EN; x
TextWise/Syracuse Univ.: x x x; EN; x
NON-PROFIT ORGANIZATIONS
Alembic, MITRE: x x x x; EN, ZH, ES; x x
GATE, U. Sheffield: x x x; EN; x
Univ. of Arizona: x x; EN; x
New Mexico State University: x x; EN; x
Fastus/TextPro, SRI International: x x x x; EN; -
Proteus, New York University: x x x x x; EN, JP, ES; x x
TIMEX, MITRE: x; EN, ES; x
Univ. of Massachusetts/Amherst: x x x; EN; x x
EN=English ZH=Chinese ES=Spanish JP=Japanese IT=Italian FR=French DE=German DU=Dutch AR=Arabic
P=People, O=Organization, L=Location, T=Time, M=Money
Screenshot taken from Mark Maybury
Entity Extraction Tools – Commercial Vendors 020204
AeroText - Lockheed Martin's AeroText™
– www.lockheedmartin.com/factsheets/product589.html
BBN's Identifinder: www.bbn.com/speech/identifinder.html
IBM's Intelligent Miner for Text
– www-4.ibm.com/software/data/iminer/fortext/index.html
SRA NetOwl: www.netowl.com
Inxight's ThingFinder
– www.inxight.com/products/thing_finder/
Semio taxonomies: www.semio.com
Context: technet.oracle.com/products/oracle7/context/tutorial/
LexiQuest Mine: www.lexiquest.com
Lingsoft: www.lingsoft.fi
CoGenTex: www.cogentex.com
TextWise: www.textwise.com & www.infonortics.com/searchengines/boston1999/arnold/sld001.htm
Entity Extraction Tools – Non-Profit Organizations
MITRE's Alembic extraction system and Alembic Workbench annotation tool: www.mitre.org/technology/nlp
Univ. of Sheffield's GATE: gate.ac.uk
Univ. of Arizona: ai.bpa.arizona.edu
New Mexico State University (Tabula Rasa system): http://crl.nmsu.edu/Research/Projects/tr/index.html
SRI International's Fastus/TextPro:
– www.ai.sri.com/~appelt/fastus.html
– www.ai.sri.com/~appelt/TextPro (not free since Jan 2002!)
New York University's Proteus
– www.cs.nyu.edu/cs/projects/proteus/
University of Massachusetts (Badger and Crystal):
– www-nlp.cs.umass.edu/
Name Analysis Software
Language Analysis Systems Inc.'s (Herndon, VA) "Name Reference Library": www.las-inc.com & www.onomastix.com/
Supports analysis of Arabic, Hispanic, Chinese, Thai, Russian, Korean, and Indonesian names; others in future versions...
Product Features:
– Identifying the cultural classification of a person name
– Given a name, provides common variants on that name, e.g., "Abd Al Rahman" or "Abdurrahman" or ...
– Implied gender
– Identifies title, affixes, qualifiers, e.g., "Bin," means "son of" as in Osama Bin Laden
– List top countries where name occurs
Cost: $3,535 a copy and a $990 annual fee!
Example 1: IBM's Intelligent Miner
See: www-4.ibm.com/software/data/iminer/fortext/index.html
Some Relevant Projects
ACE: Automated Content Extraction (www.nist.gov/speech/tests/ace)
NIST: National Institute of Standards and Technology (http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html); +evaluation tools
TIDES: Translingual Information Detection Extraction and Summarization; DARPA; multilingual name extraction (www.darpa.mil/ito/research/tides)
MUSE: A MUlti-Source Entity finder (http://www.dcs.shef.ac.uk/~hamish/muse.html)
Identifying Named Entities in Speech (HUB)
Other...
Tool Comparison-2 (DC,TM...)
Document Clustering, Mining, Topic Detection, and Visualization Systems
Columns: All words equally | Noun Phrases | Named Entities | Accepts Predefined Terms | Predefined Taxonomies | Generates Taxonomies | MultiLingual? | Story Segmentation | New Topic Detection? | Topic Tracking (Predefined Topics)
Each row: marks for the first six columns; languages; marks for the last three columns (the column position of individual marks is not preserved in this transcript)
Inxight Categorizer, Tree Studio, Inxight: x x; EN, FR, ES, DE, DU, … (12); x
Semio Taxonomy: x x x x; EN, FR, ES, IT, JP, DU (ZH, DE); -
LexiQuest Mine: x x x x; EN, FR, ES, DE, DU; x
InterMedia Text, Oracle: x; EN; x
NorthernLight: x; EN; x x
Autonomy: x; EN; x
Lotus Discovery Server (LDS), Lotus: x x; EN; x
QKS Classifier, Quiver: x x x; EN; x
Fulcrum Knowledge Server, Hummingbird: x x x; EN; -
SPIRE/Themeview, PNNL: x x x; EN; -
VantagePoint, Search Technology Inc.: x x x; EN; -
Mohomine, Inc.: x x; EN; x
Intelligent Miner for Text, IBM: x x; EN; x x x
Oasis, OnTopic, BBN/Verizon: x x; EN, ZH, AR; x x x
EN=English ZH=Chinese ES=Spanish JP=Japanese DE=German DU=Dutch FR=French AR=Arabic IT=Italian
Screenshot taken from Mark Maybury
Evaluation
Evaluation consists of (at least) three parts:
– Entity Detection (of the string that names an entity): <ENAMEX>Fjärran Östern</ENAMEX>
– Attribute Recognition/Classification (of the entity): <ENAMEX TYPE="LOCATION">Fjärran Östern</ENAMEX>
– Extent Recognition (measures the ability of a system to correctly determine an entity's extent; partial correctness): Fjärran <ENAMEX TYPE="LOCATION">Östern</ENAMEX>
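The three parts can be checked separately for a single key/response pair. A minimal sketch, assuming entities are given as token spans with a type label; the Entity tuple and the compare function are hypothetical helpers, not part of any official scorer.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int   # index of the first token
    end: int     # index one past the last token
    etype: str


def compare(key: Entity, response: Entity) -> dict:
    """Score one response against one key on detection, type and extent."""
    overlap = response.start < key.end and key.start < response.end
    return {
        "detected": overlap,
        "type_correct": overlap and response.etype == key.etype,
        "extent_correct": overlap
        and (response.start, response.end) == (key.start, key.end),
    }
```

For the key "Fjärran Östern" (tokens 0-2) and a response covering only "Östern" (tokens 1-2), detection and type are correct but the extent is not.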
Evaluation cont'd
Systems exist that identify names ~90-95% accurately in newswire texts (in several languages)
Metrics: vary from test case to test case; the "simplest" definitions are:
Precision = #CorrectReturned/#TotalReturned
Recall = #CorrectReturned/#CorrectPossible
Quite high figures in P&R can be found in the literature based exclusively on these simpler metrics...
Almost non-existent discussion on metonymy or other difficult cases makes the results suspect?!
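The two simple definitions can be computed directly from sets of (string, type) pairs. A minimal sketch, assuming exact match of both string and type counts as correct; the entity sets in the usage note are made up for illustration.

```python
def precision_recall(key: set, response: set) -> tuple:
    """Simple precision and recall over exact (string, type) matches."""
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall
```

With four entities in the key and three returned, two of them correct, precision is 2/3 and recall is 0.5. Partial matches get no credit under these definitions.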
Evaluation cont'd
Guidelines for more rigid evaluation criteria have been imposed by the MUC; e.g.
Precision = (Correct + (0.5 * Partially Correct)) / Actual
Recall = (Correct + (0.5 * Partially Correct)) / Possible
Correct: two single fills are considered identical
Partially Correct: two single fills are not identical, but partial credit should still be given
Actual = Correct + Incorrect + Partially Correct + Spurious
Spurious: a response object has no key object aligned with it
See: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/muc_sw/muc_sw_manual.html
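The MUC-style formulas can be sketched as a small function. Actual follows the definition on the slide; Possible is not defined there, and the sketch assumes Possible = Correct + Incorrect + Partially Correct + Missing, where Missing counts key objects with no aligned response. The function name is hypothetical.

```python
def muc_scores(correct, partial, incorrect, spurious, missing):
    """Precision and recall with half credit for partially correct fills."""
    actual = correct + incorrect + partial + spurious
    # Assumption: "Possible" counts everything in the key, including missed objects.
    possible = correct + incorrect + partial + missing
    credit = correct + 0.5 * partial
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall
```

For example, 8 correct, 2 partially correct, 1 incorrect, 1 spurious and 3 missing give a credit of 9 over 12 actual (precision 0.75) and over 14 possible (recall 9/14).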
Resource Localization (Organizations: Governmental)
See: http://www.gksoft.com/govt/
181 governmental orgs for Norway
Resource Localization (Organizations: Governmental)
See: http://www.odci.gov/cia/publications/factbook/index.html
Resource Localization (Organizations: Publishers)
See: http://www.netlibrary.com
500 publishers
Resource Localization (Locations: Countries)
See: http://www.reseguide.se
184 countries
Resource Localization (Locations: Cities)
www.calle.com
Problems: Metonymy
A speaker uses a reference to one entity to refer to another entity, or entities, related to it; ALL words are metonyms?!
(In ACE) Classic metonymies and composites
Reference to two entities, one explicit and one indirect reference; commonly this is the case of capital city names standing in for national governments
Apply to GPEs, typically having a government, a populace, a geographic location and an abstract notion of statehood
Problems: DCA?
The DCA approach might not work for some of the NE categories that are long and mentioned only once, particularly EVENTS, ARTWORK, …
In these cases context-sensitive grammars might be the alternative; they work fairly well for novel entities, and rules can be created by hand or learned via machine learning or statistical algorithms
example....
Rules that capture local patterns that characterize entities, from instances of annotated training data or semi-automatic analysis of corpora:
– XXX köpte YYY: XXX and YYY are with very high probability organizations
EMI köpte Virgin_Music_Group
Grundin köpte Hornline
Moyne köpte Trustor
Optiroc köpte Stråbruken
Pandox köpte Park_Avenue_Hotel
SF köpte Europafilm
Stagecoach köpte Swebus
Trelleborg köpte Intertrade
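The "XXX köpte YYY" ("XXX bought YYY") rule can be written as a single regular expression. A minimal sketch, assuming that names start with a capital letter and that multiword names are joined with underscores as in the examples above; the function name is hypothetical.

```python
import re

# Assumption: a name is one capitalized token; \w also covers å, ä, ö and "_".
NAME = r"[A-ZÅÄÖ][\w&-]*"
BUY_RULE = re.compile(rf"\b({NAME})\s+köpte\s+({NAME})")


def organizations_from_buy_rule(text):
    """Return buyer and bought party for every 'X köpte Y' match, in text order."""
    orgs = []
    for buyer, bought in BUY_RULE.findall(text):
        orgs.extend([buyer, bought])
    return orgs
```

The rule only proposes candidates with high probability; a sentence such as "han köpte mjölk" yields nothing because neither argument is capitalized.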
DCA more problems...
<Dagens Industri 020306 s.18>
Fords VD och delägare Bill Ford stal showen från Volvo PV när bilsalongen i Genève... Ford köpte Volvo Personvagnar 1999.... På Fords egen presskonferens betonade Bill Ford att Volvo...
<Dagens Industri 020306 s.22>
Industri- och finansmannen Carl Bennet, via sitt bolag Carl Bennet AB, börsnoterade... Carl Bennet framhåller att...
Some Final Remarks
A challenge with NER is creating a stable definition of what an entity is and creating a taxonomy of entities to map to...
Having done that it becomes simpler to solve metonymy and other ambiguity problems...
Problems remain; where shall we draw the entity boundaries?
Text format...
Shall we just go for it or try and rationalize the entity types?
time will show...time will show...