Metadata Extraction: Human Language Technology and the
Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard
SEKT meeting, London, 21 January 2004
The Knowledge Economy and Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
The challenge: to reconcile these two opposing tendencies
HLT and Knowledge: Closing the Language Loop
[Figure: diagram linking Human Language and Formal Knowledge (ontologies and instance bases) through (A)IE, CLIE, OIE, (M)NLG and Controlled Language; the formal knowledge side is labelled Semantic Web; Semantic Grid; Semantic Web Services.]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE
Structure of the Tutorial
• Information Extraction - definition
• Evaluation – corpora & metrics
• IE approaches – some examples
  – Rule-based approaches
  – Learning-based approaches
• Semantic Tagging
  – Using “traditional” IE
  – Ontology-based IE
  – Platforms for large-scale processing
• Language Generation
Information Extraction
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction
MUC-7 tasks
Held in 1997, around 15 participants inc. 2 UK. Broke IE down into component tasks:
• NE: Named Entity recognition and typing
• CO: co-reference resolution
• TE: Template Elements
• TR: Template Relations
• ST: Scenario Templates
An Example
The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
• TE: the rocket is "shiny red" and Head's "brainchild".
• TR: Dr. Head works for We Build Rockets Inc.
• ST: a rocket launching event occurred with the various participants.
Performance levels
• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)
What are Named Entities?
• NE involves identification of proper names in texts, and classification into a set of predefined categories of interest
• Person names
• Organizations (companies, government organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
What are Named Entities (2)
• Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
• Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
• MUC-7 entity definition guidelines [Chinchor’97]
http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
What are NOT NEs (MUC-7)
• Artefacts – Wall Street Journal
• Common nouns, referring to named entities – the company, the committee
• Names of groups of people and things named after people – the Tories, the Nobel prize
• Adjectives derived from names – Bulgarian, Chinese
• Numbers which are not times, dates, percentages, and money amounts
Basic Problems in NE
• Variation of NEs – e.g. John Smith, Mr Smith, John.
• Ambiguity of NE types:
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
More complex problems in NE
• Issues of style, structure, domain, genre etc.
• Punctuation, spelling, spacing, formatting, ... all have an impact:
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom
Tell me more about Leonardo
Da Vinci
Corpora and System Development
• Corpora are divided typically into a training and testing portion
• Rules/Learning algorithms are trained on the training part
• Tuned on the testing portion in order to optimise
  – Rule priorities, rules effectiveness, etc.
  – Parameters of the learning algorithm and the features used
• Evaluation set – the best system configuration is run on this data and the system performance is obtained
• No further tuning once evaluation set is used!
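The three-way division described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `split_corpus`, the 60/20/20 proportions and the fixed seed are all assumptions, and the "tune" portion corresponds to what the slide calls the testing portion.

```python
import random

def split_corpus(doc_ids, train=0.6, tune=0.2, seed=0):
    """Shuffle document ids and cut them into training, tuning and evaluation sets.

    The proportions and the fixed seed are illustrative assumptions.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_tune = int(len(ids) * tune)
    return (ids[:n_train],                    # train rules / learners here
            ids[n_train:n_train + n_tune],    # tune priorities and parameters here
            ids[n_train + n_tune:])           # run once, with the final configuration

train_set, tune_set, eval_set = split_corpus(range(100))
```

Keeping the evaluation ids in a separate list makes it harder to tune on them by accident.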
Some NE Annotated Corpora
• MUC-6 and MUC-7 corpora - English
• CONLL shared task corpora
http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
• TIDES surprise language exercise (NEs in Cebuano and Hindi)
• ACE – English - http://www.ldc.upenn.edu/Projects/ACE/
The MUC-7 corpus
• 100 documents in SGML
• News domain
Named Entities:
• 1880 Organizations (46%)
• 1324 Locations (32%)
• 887 Persons (22%)
• Inter-annotator agreement very high (~97%)
• http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
The MUC-7 Corpus (2)
<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
<p>Endeavour, with an international crew of six, was set to blast off from
the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
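Inline markup of this kind can be read back with a small regular expression. The sketch below is only an illustration: `extract_entities` is a hypothetical helper, it handles just the ENAMEX and TIMEX tags seen in the excerpt, and it assumes tags are not nested.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> and <TIMEX TYPE="...">text</TIMEX>.
TAG = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>', re.DOTALL)

def extract_entities(sgml):
    """Return (tag, type, text) triples for each inline annotation."""
    return [(m.group(1), m.group(2), m.group(3)) for m in TAG.finditer(sgml)]

sample = ('<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, '
          '<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures '
          '<TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, '
          '<ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle')

entities = extract_entities(sample)
```

An ambiguous type such as "ORGANIZATION|LOCATION" comes back as a single string, so a caller would still need to split it on "|".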
ACE – Towards Semantic Tagging of Entities
• MUC NE tags segments of text whenever that text represents the name of an entity
• In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
• Rolls together the NE and CO tasks
• Domain- and genre-independent approaches
• ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
ACE Entities
• Dealing with
  – Proper names – e.g., England, Mr. Smith, IBM
  – Pronouns – e.g., he, she, it
  – Nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to which entities, e.g.,
  – Tony Blair, Mr. Blair, he, the prime minister, he
  – Gordon Brown, he, Mr. Brown, the chancellor
ACE Example
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
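Because an ACE entity groups all of its mentions, the record above can be read as one entity with several surface forms. A minimal sketch using Python's standard XML parser; `mentions_by_type` is a hypothetical helper and the sample is the record from the slide.

```python
import xml.etree.ElementTree as ET

ACE_SAMPLE = '''
<entity ID="ft-airlines-27-jul-2001-2" GENERIC="FALSE" entity_type = "ORGANIZATION">
  <entity_mention ID="M003" TYPE = "NAME" string = "National Air Traffic Services"> </entity_mention>
  <entity_mention ID="M004" TYPE = "NAME" string = "NATS"> </entity_mention>
  <entity_mention ID="M005" TYPE = "PRO" string = "its"> </entity_mention>
  <entity_mention ID="M006" TYPE = "NAME" string = "Nats"> </entity_mention>
</entity>
'''

def mentions_by_type(xml_text):
    """Return the entity type and its mention strings grouped by mention TYPE."""
    entity = ET.fromstring(xml_text.strip())
    grouped = {}
    for mention in entity.iter("entity_mention"):
        grouped.setdefault(mention.get("TYPE"), []).append(mention.get("string"))
    return entity.get("entity_type"), grouped

entity_type, grouped = mentions_by_type(ACE_SAMPLE)
```

Here the name mentions and the pronoun "its" all end up under the same ORGANIZATION entity, which is how ACE rolls the NE and CO tasks together.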
Annotation Tools: Alembic, GATE, ...
Performance Evaluation
• Evaluation metric – mathematically defines how to measure the system’s performance against a human-annotated, gold standard
• Scoring program – implements the metric and provides performance measures
  – For each document and over the entire corpus
  – For each type of NE
The Evaluation Metric
• Precision = correct answers/answers produced
• Recall = correct answers/total possible correct answers
• Trade-off between precision and recall
• F-Measure = (β² + 1)PR / (β²R + P) [van Rijsbergen 75]
• β reflects the weighting between precision and recall, typically β=1
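A short sketch of these measures in code. The function name and the example counts are assumptions. The F-measure follows the formula as written on the slide, with β² applied to R in the denominator; many texts place β² on P instead, and with β=1 the two forms coincide.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Precision, recall and F-measure from raw answer counts."""
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    # Denominator as on the slide: beta^2 * R + P.
    denom = beta ** 2 * recall + precision
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f

# Hypothetical counts: 100 answers produced, 80 of them correct, 120 possible.
p, r, f = precision_recall_f(correct=80, produced=100, possible=120)
```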
The Evaluation Metric (2)
• We may also want to take account of partially correct answers:
• Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
• Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
• Why: NE boundaries are often misplaced, so some results are only partially correct
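The partial-credit variant can be sketched in the same way. The function name and the counts are hypothetical; the half weight for partial matches is taken from the formulas above.

```python
def partial_credit_scores(correct, partial, incorrect, missing):
    """Precision and recall where each partially correct answer counts as one half."""
    matched = correct + 0.5 * partial
    p_den = correct + incorrect + partial   # everything the system produced
    r_den = correct + missing + partial     # everything in the gold standard
    precision = matched / p_den if p_den else 0.0
    recall = matched / r_den if r_den else 0.0
    return precision, recall

# Hypothetical counts for one corpus run.
precision, recall = partial_credit_scores(correct=70, partial=20, incorrect=10, missing=30)
```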
The GATE Evaluation Tool
Corpus-level Regression Testing
• Need to track system’s performance over time
• When a change is made to the system we want to know what the implications are over the entire corpus
• Why: because an improvement in one case can lead to problems in others
• GATE offers an automated tool to help with the NE development task over time
Regression Testing (2)
At corpus level – GATE’s corpus benchmark tool – tracking system’s performance over time
Challenge: Evaluating Richer NE Tagging
• Need for new metrics when evaluating hierarchy/ontology-based NE tagging
• Need to take into account distance in the hierarchy
• Tagging a company as a charity is less wrong than tagging it as a person
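One way to make "less wrong" measurable is to count the edges between the predicted and the correct class in the hierarchy. The sketch below is an assumption-laden illustration, not a metric proposed in the tutorial: the toy hierarchy, the class names and `hierarchy_distance` are all hypothetical.

```python
PARENT = {  # hypothetical toy hierarchy, child -> parent
    "Company": "Organization",
    "Charity": "Organization",
    "Organization": "Entity",
    "Person": "Entity",
}

def ancestors(cls):
    """Return the path from cls up to the root, cls first."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def hierarchy_distance(a, b):
    """Number of edges between two classes via their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, cls in enumerate(up_a):
        if cls in up_b:
            return steps_a + up_b.index(cls)
    raise ValueError("classes share no ancestor")

near_miss = hierarchy_distance("Company", "Charity")
far_miss = hierarchy_distance("Company", "Person")
```

A scorer could then scale the penalty for a wrong tag by this distance, so that confusing a company with a charity costs less than confusing it with a person.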
SW IE Evaluation tasks
• Detection of entities and events, given a target ontology of the domain.
• Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated “Cambridge” in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
• Decision when a new instance needs to be added to the ontology, because the text contains a new instance, that does not already exist in the ontology.
Two kinds of IE approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)
NE Baseline: list lookup approach
• System that recognises only entities stored in its lists (gazetteers).
• Advantages - Simple, fast, language independent, easy to retarget (just create lists)
• Disadvantages – impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
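A minimal sketch of such a list-lookup baseline. The gazetteer contents and the function name are hypothetical, and matching is plain substring search, which is an assumption made to keep the example short.

```python
def gazetteer_lookup(text, gazetteer):
    """Return (start, end, type) for every list entry found verbatim in text.

    Longer entries are tried first so that they win over shorter overlapping ones.
    """
    found = []
    taken = [False] * len(text)
    for name, ne_type in sorted(gazetteer.items(), key=lambda kv: -len(kv[0])):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(taken[start:end]):
                found.append((start, end, ne_type))
                taken[start:end] = [True] * (end - start)
            start = text.find(name, end)
    return sorted(found)

gazetteer = {"London": "Location", "IBM": "Organization", "May": "Person"}
text = "IBM opened an office in London in May."
matches = gazetteer_lookup(text, gazetteer)
```

The month "May" is tagged as a Person here simply because it is on the list, which illustrates the ambiguity problem named above.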
Shallow parsing approach using internal structure
• Internal evidence – names often have internal structure. These components can be either stored or guessed, e.g. location:
• Cap. Word + {City, Forest, Center, River}
• e.g. Sherwood Forest
• Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
• e.g. Portobello Street
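The two patterns above translate almost directly into regular expressions. This is a sketch only: "Cap. Word" is approximated as one capital letter followed by lower-case letters, and the helper name is hypothetical.

```python
import re

LOCATION_KEYS = ["City", "Forest", "Center", "River"]
ADDRESS_KEYS = ["Street", "Boulevard", "Avenue", "Crescent", "Road"]

def internal_pattern(keywords):
    """Capitalised word followed by one of the indicator keywords."""
    return re.compile(r"\b[A-Z][a-z]+ (?:%s)\b" % "|".join(keywords))

LOCATION_RE = internal_pattern(LOCATION_KEYS)
ADDRESS_RE = internal_pattern(ADDRESS_KEYS)

text = "They walked from Sherwood Forest to Portobello Street."
locations = LOCATION_RE.findall(text)
addresses = ADDRESS_RE.findall(text)
```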
Problems ...
• Ambiguously capitalised words (first word in sentence)
  [All American Bank] vs. All [State Police]
• Semantic ambiguity
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organisation
• Structural ambiguity
  [Cable and Wireless] vs. [Microsoft] and [Dell];
  [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
Shallow parsing with context
• Use of context-based patterns is helpful in ambiguous cases
• "David Walton" and "Goldman Sachs" are indistinguishable
• But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.
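The "[Person] of [Organization]" pattern can be sketched as below. It assumes that person names have already been recognised and that the unknown name is a run of capitalised words; the function name is hypothetical.

```python
import re

# One or more capitalised words separated by single spaces.
CAP_SEQ = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

def classify_by_context(text, known_persons):
    """Label the capitalised sequence after "<known person> of" as an Organization."""
    labels = {}
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        for m in pattern.finditer(text):
            labels[m.group(1)] = "Organization"
    return labels

labels = classify_by_context(
    "David Walton of Goldman Sachs said rates would rise.", ["David Walton"])
```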
Examples of context patterns
• [PERSON] earns [MONEY]
• [PERSON] joined [ORGANIZATION]
• [PERSON] left [ORGANIZATION]
• [PERSON] joined [ORGANIZATION] as [JOBTITLE]
• [ORGANIZATION]'s [JOBTITLE] [PERSON]
• [ORGANIZATION] [JOBTITLE] [PERSON]
• the [ORGANIZATION] [JOBTITLE]
• part of the [ORGANIZATION]
• [ORGANIZATION] headquarters in [LOCATION]
• price of [ORGANIZATION]
• sale of [ORGANIZATION]
• investors in [ORGANIZATION]
• [ORGANIZATION] is worth [MONEY]
• [JOBTITLE] [PERSON]
• [PERSON], [JOBTITLE]
Example Rule-based System - ANNIE
• Created as part of GATE
• GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
• GATE has a finite-state pattern-action rule language, used by ANNIE
• ANNIE modified for MUC guidelines – 89.5% f-measure on MUC-7 NE corpus
NE Components
The ANNIE system – a reusable and easily extendable set of components
Gazetteer lists for rule-based NE
• Needed to store the indicator strings for the internal structure and context rules:
• Internal location indicators – e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, …} for address locations
• Internal organisation indicators – e.g., company designators {GmbH, Ltd, Inc, …}
• Produces Lookup results of the given kind
The Named Entity Grammars
• Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
• Hand-coded rules applied to annotations to identify NEs
• Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
• Use of contextual information
• Finds person names, locations, organisations, dates, addresses.
NE Rule in JAPE
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+  //from tokeniser
  {Lookup.kind == companyDesignator}        //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
Named Entities in GATE
Using co-reference to classify ambiguous NEs
• Orthographic co-reference module that matches proper names in a document
• Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
• May not reclassify already classified entities
• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
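The two matching heuristics mentioned here, surname against full name and abbreviation against full name, can be sketched as follows. This is not ANNIE's actual matcher: the function names are hypothetical and the list of ignored designators is an assumption.

```python
DESIGNATORS = {"Ltd", "Inc", "GmbH"}  # assumed company designators to ignore

def acronym(name):
    """Initial letters of the capitalised words, ignoring company designators."""
    words = [w for w in name.replace(".", "").split() if w not in DESIGNATORS]
    return "".join(w[0] for w in words if w[0].isupper())

def propagate_types(classified, unclassified):
    """Give an unclassified name the type of a classified name it matches."""
    result = {}
    for name in unclassified:
        for full, ne_type in classified.items():
            if name == full.split()[-1] or name == acronym(full):
                result[name] = ne_type
                break
    return result

classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
resolved = propagate_types(classified, ["Bonfield", "IBM", "Endeavour"])
```

Names that match nothing, such as "Endeavour" in this example, stay unclassified, and names that already have a type are never passed in, which mirrors the rule that classified entities are not reclassified.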
Named Entity Coreference
Machine Learning Approaches
• Approaches:
  – Train ML models on manually annotated text
  – Mixed initiative learning
    • Used for producing training data
    • Used for producing working systems
• ML Methods
  – Symbolic learning: rules/decision trees induction
  – Statistical models: HMMs, Bayesian methods, Maximum Entropy
ML Terminology
• Instances (tokens, entities)
  Occurrences of a phenomenon
• Attributes (features)
  Characteristics of the instances
• Classes
  Sets of similar instances
Methodology
• The task can be broken into several subtasks (that can use different methods):
  – Boundary detection
  – Entity classification into NE types
  – Different models for different entity types
• Several models can be used in competition.
  – Some algorithms perform better on little data while others are better when more training is available
Methodology (2)
Boundaries (and entity types) notations
– S(-XXX), E(-XXX)
  <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>.
– IOB notation (Inside, Outside, Beginning_of)
  U.N. I-ORG
  official O
  Ekeus I-PER
  heads O
  for O
  Baghdad I-LOC
  . O
– Translations between the two conventions are straight-forward
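To show how direct the translation is, here is a sketch that turns the S/E boundary markers into per-token labels. It assumes whitespace tokenisation and non-nested markers, and `to_iob` is a hypothetical name. As in the example above, every token inside an entity receives I-XXX.

```python
import re

MARKER = re.compile(r"<S-([A-Z]+)/>(.*?)<E-\1/>")

def to_iob(marked_text):
    """Convert <S-XXX/>...<E-XXX/> boundary markers into (token, label) pairs."""
    pairs = []
    pos = 0
    for m in MARKER.finditer(marked_text):
        # Text before the entity is outside any entity.
        pairs += [(tok, "O") for tok in marked_text[pos:m.start()].split()]
        pairs += [(tok, "I-" + m.group(1)) for tok in m.group(2).split()]
        pos = m.end()
    pairs += [(tok, "O") for tok in marked_text[pos:].split()]
    return pairs

marked = ("<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> "
          "heads for <S-LOC/>Baghdad<E-LOC/>.")
iob = to_iob(marked)
```

With only I- and O labels, two adjacent entities of the same type would merge; a B-XXX label on the first token of the second entity is the usual way to keep them apart.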
Features
• Feature selection – the most difficult part
• Some automatic scoring methods can be used
• Document structure
  – Original markup
  – Paragraph/sentence structure
• Surface features
  – Token length
  – Capitalisation
  – Token type (word, punctuation, symbol)
• Linguistic features
  – POS
  – Morphology
  – Syntax
  – Lexicon data
• Semantic features
  – Ontological class
• ETC
  – …
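As an illustration of the surface features listed above, the sketch below builds a small attribute dictionary for one token. The feature names, the "other" fallback type and the use of neighbouring tokens as context are assumptions; linguistic and semantic features would come from other components.

```python
def token_features(tokens, i):
    """Surface features for tokens[i], plus the neighbouring tokens as context."""
    tok = tokens[i]
    if tok.isalpha():
        kind = "word"
    elif all(not c.isalnum() for c in tok):
        kind = "punctuation"
    else:
        kind = "other"  # mixed tokens such as "U.N." or "4:18"
    return {
        "length": len(tok),
        "capitalised": tok[:1].isupper(),
        "all_caps": tok.isupper(),
        "type": kind,
        "prev": tokens[i - 1] if i > 0 else "<START>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
    }

tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
features = token_features(tokens, 2)
```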
Mixed Initiative Learning
• Human – computer interaction
• Speeds up the creation of training data
• Can be used for corpus/system creation
• Example implementations:
  – Alembic [Day et al’97]
  – Amilcare [Ciravegna’03]
Mixed Initiative Learning (2)
[Figure: three stages separated by the thresholds P>t1 and P>t2: (1) User annotates, System learns; (2) System annotates, User corrects, System learns; (3) System annotates.]
GATE Machine Learning support
• Uses classification.
  [Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations. (Documents can be classified as well using a 1-to-1 relation with annotations.)
• Annotations of a particular type are selected as instances.
• Attributes refer to features of the instance annotations or their context.
• Generic implementation for attribute collection – can be linked to any ML engine.
• ML engines currently integrated: WEKA and Ontotext’s HMM.
Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE>…</ENGINE>
</ML-CONFIG>
Attributes Collection
Instances type: Token
GATE ML Library Dataflow
[Figure: dataflow diagram with the components Plain text documents, NLP Pipeline (Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc…), Feature Collection, Engine Interface, Machine Learning Engine, Results Converter and Annotated documents.]
Amilcare & Melita
• Amilcare: rule-learning algorithm
  – Tagging rules – learn to insert tags in the text, given training examples
  – Correction rules – learn to move already inserted tags to their correct place in the text
• Novel aspect: learns begin and end tags independently
• Melita supports adaptive IE
• Applied in SemWeb context (see below)
• Being extended as part of the EU-funded DOT.KOM project towards KM and SemWeb applications
[Ciravegna’03] www.dcs.shef.ac.uk/~fabio
Towards Semantic Tagging of Entities
• The MUC NE task tags selected segments of text whenever that text represents the name of an entity.
• Semantic tagging - view as mentions of the underlying instances from the ontology
• Identify which mentions in the text refer to which instances in the ontology, e.g.,
  – Tony Blair, Mr. Blair, he, the prime minister, he
  – Gordon Brown, he, Mr. Brown, the chancellor
Tasks
• Identify entity mentions in the text
• Reference disambiguation
  – Add new instances if needed
  – Disambiguate wrt instances in the ontology
• Identify instances of attributes and relations
  – take into account what are allowed given the ontology, using domain & range as constraints
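The domain and range check in the last point can be sketched with a toy ontology. Everything in this block is an illustrative assumption: the class hierarchy, the property signatures (for instance that HQ links a Company to a City) and the function names.

```python
# Hypothetical toy ontology: subclass links, property signatures, instance types.
SUBCLASS_OF = {"City": "Location", "Country": "Location"}
PROPERTIES = {  # property -> (domain class, range class)
    "HQ": ("Company", "City"),
    "partOf": ("Location", "Location"),
}
INSTANCE_TYPE = {"XYZ": "Company", "London": "City", "UK": "Country", "Bulgaria": "Country"}

def is_a(cls, target):
    """True if cls equals target or is one of its subclasses."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

def relation_allowed(subject, prop, obj):
    """Accept a candidate relation only if it respects the property's domain and range."""
    domain, range_ = PROPERTIES[prop]
    return is_a(INSTANCE_TYPE[subject], domain) and is_a(INSTANCE_TYPE[obj], range_)

ok = relation_allowed("XYZ", "HQ", "London")
rejected = relation_allowed("XYZ", "HQ", "Bulgaria")
```

A candidate relation extracted from text that fails this check can be discarded before it reaches the knowledge base.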
Example
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
[Figure: Ontology & KB. Classes Company, City, Country and Location; instances XYZ, London, UK and Bulgaria; links labelled type, HQ, establOn (“03/11/1978”) and partOf.]
Classes, instances & metadata
“Gordon Brown met George Bush during his two day visit.
[Figure: class hierarchy (Entity, Person, …, Job-title with president, chancellor, minister, …) shown as "Classes+instances before" and "Classes+instances after", with instances G.Brown and Bush.]
<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
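Stand-off metadata of this shape could be produced along the following lines. This is a hedged sketch using Python's standard `xml.etree.ElementTree`; the element names follow the slide, while the `Annotation` dataclass, the function name and the `example.org` URIs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Annotation:
    start: int   # start offset of the mention in the document
    end: int     # end offset of the mention
    text: str    # the annotated string
    cls: str     # URI of the ontology class
    inst: str    # URI of the ontology instance


def to_metadata_xml(doc_id, annotations):
    """Serialise annotations as a <metadata> element that refers to the document by ID."""
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    for a in annotations:
        node = ET.SubElement(root, "Annotation")
        ET.SubElement(node, "s_offset").text = str(a.start)
        ET.SubElement(node, "e_offset").text = str(a.end)
        ET.SubElement(node, "string").text = a.text
        ET.SubElement(node, "class").text = a.cls
        ET.SubElement(node, "inst").text = a.inst
    return ET.tostring(root, encoding="unicode")
```

The metadata only points at the ontology instance through its URI; the instance itself lives in the KB.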
![Page 64: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/64.jpg)
64(109)
Classes, instances & metadata (2)
“Gordon Brown met Tony Blair to discuss the university tuition fees.
[Figure: class hierarchy (Entity, Person, …, Job-title with president, chancellor, minister, …) shown as "Classes+instances before" and "Classes+instances after", with instances G. Brown, G. Bush and T. Blair.]
<metadata>
  <DOC-ID>http://… 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 30 </e_offset>
    <string>Tony Blair</string>
    <class>…#Person</class>
    <inst>…#Person26389</inst>
  </Annotation>
</metadata>
![Page 65: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/65.jpg)
65(109)
Why not put metadata in ontologies?
• Can be encoded in RDF/OWL, etc. but does it need to be put as instances in the ontology?
• Typically we do not need to reason with it
– Reasoning happens in the ontology when the new instances of classes and properties are added, but the metadata statements are different from them, they only refer to them
• A lot more metadata than instances
– Millions of metadata statements, thousands of instances, hundreds of concepts
• Different access required:
– By offset (give me all metadata of the first paragraph)
– Efficient metadata-wide statistics based on strings – not an operation that people would do on other concepts
– Mixing with keyword-based search using IR-style indexing
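The two access patterns named above (lookup by offset and string-based statistics) can be illustrated with a small in-memory store. This is only a sketch under assumed names (`StandoffStore`, tuple layout); a real platform would use a persistent index.

```python
import bisect
from collections import Counter


class StandoffStore:
    """Toy stand-off metadata store kept separate from the ontology."""

    def __init__(self, annotations):
        # Each annotation is (start_offset, end_offset, string, instance_id).
        self._anns = sorted(annotations)
        self._starts = [a[0] for a in self._anns]

    def in_range(self, start, end):
        """Annotations lying wholly inside [start, end), e.g., the first paragraph."""
        lo = bisect.bisect_left(self._starts, start)
        hi = bisect.bisect_left(self._starts, end)
        return [a for a in self._anns[lo:hi] if a[1] <= end]

    def string_counts(self):
        """Metadata-wide statistics over the annotated strings."""
        return Counter(a[2] for a in self._anns)
```

Neither operation needs ontology reasoning, which is one reason to keep such metadata outside the ontology.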
![Page 66: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/66.jpg)
66(109)
Metadata Creation with IE
• Semantic tagging creates metadata
• Stand-off or part of document
• Semi-automatic
– One view (given by the user, one ontology)
– More reliable
• Automatic metadata creation
– Many views – change with ontology, re-train IE engine for each ontology
– Always up to date, if ontology changes
– Less reliable
![Page 67: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/67.jpg)
67(109)
Problems with “traditional” IE for metadata creation
• S-CREAM – Semi-automatic CREAtion of Metadata [Handschuh et al’02]
• Semantic tags from IE need to be mapped to instances of concepts, attributes or relations
• Most ML-based IE systems do not deal well with relations, mainly entities
• Amilcare does not handle anaphora resolution; GATE has such a component, but it is not used here
• Implemented a discourse model with logical rules
– LASIE used discourse model with domain ontology – problem is robustness and domain portability
![Page 68: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/68.jpg)
68(109)
Example
[Handschuh et al’02] S-CREAM, EKAW’02
![Page 69: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/69.jpg)
69(109)
S-CREAM: Discourse Rules
• Rules to attach instances only when the ontology allows that (e.g., prices)
• Attach tag values to the nearest preceding compatible entity (e.g., prices and rooms)
• Create a complex object between two concept instances if they are adjacent (e.g., rate – number followed by currency)
• Experienced users can write new rules
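The second rule (attach a tag value to the nearest preceding compatible entity) can be sketched as below. This is an illustrative reading of the rule, not S-CREAM's actual code; the tag types, the tuple layout and the `compatible` mapping are assumptions.

```python
def attach_to_nearest_preceding(tags, compatible):
    """Attach value tags to the closest earlier entity of a compatible type.

    tags: list of (offset, tag_type, value), in any order.
    compatible: dict mapping a value tag type (e.g. "price") to the set of
        entity tag types it may attach to (e.g. {"room"}).
    Returns a list of (entity_value, tag_type, value) attachments.
    """
    attachments = []
    seen_entities = []  # (tag_type, value) of entities seen so far, in text order
    for _offset, tag_type, value in sorted(tags):
        if tag_type in compatible:
            # Walk backwards to find the nearest preceding compatible entity.
            for ent_type, ent_value in reversed(seen_entities):
                if ent_type in compatible[tag_type]:
                    attachments.append((ent_value, tag_type, value))
                    break
        else:
            seen_entities.append((tag_type, value))
    return attachments
```

A value with no compatible entity before it is simply left unattached in this sketch.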
![Page 70: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/70.jpg)
70(109)
Challenges for IE for SemWeb
• Portability – different and changing ontologies
• Different text types – structured, free, etc.
• Utilise ontology information where available
• Train from small amount of annotated text
• Output results wrt the given ontology
– bridge the gap demonstrated in S-CREAM
• Learn/Model at the right level
– ontologies are hierarchical and data will get sparser the lower we go
DOT.KOM http://nlp.shef.ac.uk/dot.kom/
![Page 71: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/71.jpg)
71(109)
Structure of the Tutorial
• Information Extraction - definition
• Evaluation – corpora & metrics
• IE approaches – some examples
– Rule-based approaches
– Learning-based approaches
• Semantic Tagging
– Using “traditional” IE
– Ontology-based IE
– Platforms for large-scale processing
• Language Generation
![Page 72: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/72.jpg)
72(109)
GATE – Infrastructure for metadata extraction for the SemWeb
• Combines learning and rule-based methods
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE (OBIE)
![Page 73: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/73.jpg)
73(109)
Ontology Management in GATE
![Page 74: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/74.jpg)
74(109)
Information Retrieval
Currently based on the Lucene IR engine – useful for combining semantic and keyword-based search
![Page 75: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/75.jpg)
75(109)
WordNet support
![Page 76: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/76.jpg)
76(109)
Populating Ontologies with IE
![Page 77: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/77.jpg)
77(109)
Example OBIE Application
• hTechSight project – using Ontology-Based IE for semantic tagging of job adverts, news and reports in the chemical engineering domain
• Aim is to track technological change over time through terminological analysis
• Fundamental to the application is a domain-specific ontology
• Terminological gazetteer lists are linked to classes in the ontology
• Rules classify the mentions in the text wrt the domain ontology
• Annotations output into a database or as an ontology
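The link between gazetteer lists and ontology classes can be sketched as a simple lookup. The terms and class names below are hypothetical and the function is a toy stand-in, not the hTechSight or GATE gazetteer.

```python
import re

# Hypothetical gazetteer: term -> ontology class it is linked to.
GAZETTEER = {
    "chemical engineer": "JobTitle",
    "process engineer": "JobTitle",
    "polymerisation": "Technology",
}


def tag_terms(text, gazetteer=GAZETTEER):
    """Return (start, end, matched_string, ontology_class) for each gazetteer hit."""
    annotations = []
    for term, cls in gazetteer.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((m.start(), m.end(), m.group(0), cls))
    return sorted(annotations)
```

In a fuller pipeline, rules would then refine or reclassify these lookups with respect to the domain ontology before the annotations are exported.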
![Page 78: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/78.jpg)
78(109)
![Page 79: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/79.jpg)
79(109)
![Page 80: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/80.jpg)
80(109)
Exported Database
![Page 81: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/81.jpg)
81(109)
Structure of the Tutorial
• Information Extraction - definition
• Evaluation – corpora & metrics
• IE approaches – some examples
– Rule-based approaches
– Learning-based approaches
• Semantic Tagging
– Using “traditional” IE
– Ontology-based IE
– Platforms for large-scale processing
• Language Generation
![Page 82: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/82.jpg)
82(109)
Platforms for Large-Scale Metadata Creation
• Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
• Automated alias discovery
• Generate SemWeb output (RDF, OWL)
• Stand-off storage and indexing of metadata
• Use large instance bases to disambiguate against
• Ontology servers for reasoning and access
• Architecture elements:
– Crawler, onto storage, doc indexing, query, annotators
– Apps: sem browsers, authoring tools, etc.
![Page 83: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/83.jpg)
83(109)
SemTag
• Lookup of all instances from the ontology (TAP) – 65K instances
• Disambiguate the occurrences as:
– One of those in the taxonomy
– Not present in the taxonomy
• Not very high ambiguity of instances with the same label in TAP – concentrate on the second problem
• Use bag-of-words approach for disambiguation
• 3 people evaluated 200 labels in context – agreed on only 68.5% - metonymy
• Placing labels in the taxonomy is hard
Dill et al, SemTag and Seeker. WWW’03
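A generic bag-of-words disambiguation step can be sketched with cosine similarity between the context of a label and a textual profile of each candidate instance. This is not SemTag's published algorithm, only a plain illustration of the idea; the instance IDs, profiles and threshold are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts of a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(context, candidates, threshold=0.1):
    """Pick the candidate whose profile is most similar to the context.

    candidates: dict instance_id -> text typical of that instance's contexts.
    Returns None when no candidate exceeds the threshold, i.e. the label is
    treated as "not present in the taxonomy".
    """
    ctx = bag_of_words(context)
    best_id, best_score = None, threshold
    for inst, profile in candidates.items():
        score = cosine(ctx, bag_of_words(profile))
        if score > best_score:
            best_id, best_score = inst, score
    return best_id
```

Returning `None` corresponds to the second outcome on the slide, where the occurrence does not denote any instance in the taxonomy.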
![Page 84: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/84.jpg)
84(109)
Seeker
• High-performance distributed infrastructure
• 128 dual-processor machines with separate ½ terabyte of storage
• Each node runs approx. 200 documents per sec.
• Service-oriented architecture – Vinci (SOAP)
Dill et al, SemTag and Seeker. WWW’03
![Page 85: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/85.jpg)
85(109)
OBIE in KIM
Popov et al. KIM. ISWC’03
• The ontology (KIMO) and 86K/200K instances KB
• High ambiguity of instances with the same label – need for disambiguation step
• Lookup phase marks mentions from the ontology
• Combined with rule-based IE system to recognise new instances of concepts and relations
• Special KB enrichment stage where some of these new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
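The idea of priority ordering by corpus statistics can be illustrated as follows. This is a guess at the general principle only, not KIM's Entity Ranking algorithm; the instance IDs and counts are invented.

```python
def rank_entities(label, candidates, mention_counts):
    """Order the candidate instances for a label by descending corpus frequency.

    candidates: dict label -> list of instance ids sharing that label.
    mention_counts: dict instance id -> how often it was seen in a corpus
        (hypothetical statistics).
    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(
        candidates.get(label, []),
        key=lambda inst: (-mention_counts.get(inst, 0), inst),
    )
```

With such an ordering, an ambiguous mention of "Paris" would default to the most frequently mentioned instance unless other evidence overrides it.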
![Page 86: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/86.jpg)
86(109)
OBIE in KIM (2)
Popov et al. KIM. ISWC’03
![Page 87: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/87.jpg)
87(109)
Comparison between SemTag and KIM
• SemTag only aims for accuracy (precision) of classification of the annotated entities
• KIM also aims for coverage (recall) – whether all possible mentions of entities were found
• Trade-off – sometimes finding some is enough
• SemTag does not attempt to discover and expand the KB with new instances (e.g., new company) – the reason why KIM uses IE, not simple KB lookup
• i.e. OBIE is often needed for ontology population, not just metadata creation
![Page 88: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/88.jpg)
88(109)
Two Annotation Scenarios (1)
• Getting the instances and the relations between them is enough. Not all mentions in the text may be covered, but this is compensated for by giving access to this information from the annotated text
![Page 89: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/89.jpg)
89(109)
Example
“Gordon Brown met president Bush during his two day visit. Afterwards George Bush said…
[Figure: two copies of the class hierarchy (Entity, Person, …, Job-title with president, chancellor, minister, …), labelled "Benchmark" and "The system", each with instances G.Brown and Bush.]
Score: 100%
![Page 90: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/90.jpg)
90(109)
Two Annotation Scenarios (2)
• Exhaustive annotation is required, so all occurrences of all instances and relations are needed
• Allows sentence and paragraph-level exploration, rather than document-level as in the previous scenario
• Harder to achieve
• Distinction between these scenarios needs to be made in the metadata annotation tools/KM tools using IE
![Page 91: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/91.jpg)
91(109)
Example
“Gordon Brown met president Bush during his two day visit. Afterwards George Bush said…
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <class>…#Person</class>
    <inst>…#Person1267</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>…#Person</class>
    <inst>…#Person1267</inst>
  </Annotation>
</metadata>
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>…#Person</class>
    <inst>…#Person1267</inst>
  </Annotation>
</metadata>
Score: 66%
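The slides do not state how the two scores are computed. One plausible reading, assumed here, is that the three-annotation block is the benchmark, the two-annotation block is the system output, and the score is the fraction of benchmark items the system found: counted per instance in the first scenario and per mention in the second. The instance IDs below are shortened forms of those on the slide.

```python
def instance_recall(benchmark, system):
    """Scenario 1: fraction of benchmark instances found, ignoring where they occur."""
    gold = {inst for _start, _end, inst in benchmark}
    found = {inst for _start, _end, inst in system}
    return len(gold & found) / len(gold)


def mention_recall(benchmark, system):
    """Scenario 2: fraction of benchmark mentions (offset + instance) found."""
    system_set = set(system)
    return sum(1 for ann in benchmark if ann in system_set) / len(benchmark)


# (start offset, end offset, instance) taken from the example above.
BENCHMARK = [(0, 12, "Person12345"), (18, 32, "Person1267"), (61, 72, "Person1267")]
SYSTEM = [(0, 12, "Person12345"), (61, 72, "Person1267")]
```

Under this assumption the same system output scores 2 of 2 instances in the first scenario but only 2 of 3 mentions in the second, consistent with the 100% and 66% figures shown.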
![Page 92: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/92.jpg)
92(109)
Semantic Reference Disambiguation
• Possible approaches:
– Vector-space models – compare context similarity – runs over a corpus
  • SemTag
  • Bagga’s cross-document coreference work
– Communities of practice approach from KM
– Identity criteria from the ontology based on properties, e.g., date_of_birth, name
![Page 93: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/93.jpg)
93(109)
Why disambiguation is hard – not all knowledge is explicit in text
Paris fashion week underway as cancellations continue
By Jo Johnson and Holly Finn - Oct 07 2001 18:48:17 (FT)
Even as Paris fashion week opened at the weekend, the cancellations and reschedulings were still trickling in over the fax machines: Loewe, the leather specialists owned by LVMH empire, is not showing, Cerruti, the Italian tailor, is downscaling to private viewings, Helmut Lang, master of the sharp suit, is cancelling his catwalk.
The Oscar de la Renta show, for example, which had been planned for September 11th in New York, and which might easily enough have moved over to Paris instead, is not on the schedule. When the Dominican Republic-born designer consulted America Vogue's influential editor, Anna Wintour, she reportedly told him it would be unpatriotic to decamp.
![Page 94: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/94.jpg)
94(109)
Structure of the Tutorial
• Information Extraction - definition
• Evaluation – corpora & metrics
• IE approaches – some examples
– Rule-based approaches
– Learning-based approaches
• Semantic Tagging
– Using “traditional” IE
– Ontology-based IE
– Platforms for large-scale processing
• Language Generation
![Page 95: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/95.jpg)
95(109)
Natural Language Generation
• NLG is:
– “subfield of AI and CL that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information” [Reiter&Dale’97]
– NLG techniques are applied also for producing speech, e.g., in speech dialogue systems
![Page 96: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/96.jpg)
96(109)
Natural Language Generation
[Figure: Natural Language Generation, with the boxes Ontology/KB/Database, Lexicons + Grammars, and Text.]
![Page 97: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/97.jpg)
97(109)
Requirements Analysis
• Create a corpus of target texts and (if possible) their input representations
• Analyse the information content
– Unchanging texts: thank you, hello, etc.
– Directly available data: timetable of buses
– Computable data: number of buses
– Unavailable data: not in the system’s KB/DB
![Page 98: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/98.jpg)
98(109)
NLG Tasks
1. Content determination
2. Discourse planning
3. Sentence aggregation
4. Lexicalisation
5. Referring expression generation
6. Linguistic realisation
![Page 99: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/99.jpg)
99(109)
Content determination
• What information to include in the text – filtering and summarising input data into a formal knowledge representation
• Application dependent
• Example:
[project: AKT
start_date: October-2000
end_date: October-2006
participants: {A,E,OU,So,Sh}]
![Page 100: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/100.jpg)
100(109)
Discourse Planning
• Determine ordering and structure over the knowledge to be generated
• Theories of discourse – how texts are structured
• Influences text readability
• Result: tree structure imposing ordering over the predicates and possibly providing discourse relations
![Page 101: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/101.jpg)
101(109)
Example
[Figure: discourse plan tree. Nodes: Root; [project: AKT duration: 6 yrs]; Project participants; [project: AKT participant: Shef]; [project: AKT participant: OU]; …; Participant descr. ([univ: Shef Web-page: URL]); Participant descr. (…). Relation labels: SEQUENCE, LIST, ELABORATION, ELABORATION.]
![Page 102: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/102.jpg)
102(109)
Planning-Based Approaches
• Use AI-style planners (e.g., [Moore & Paris 93])
– Discourse relations (e.g., ELABORATION) are encoded as planning operators
– Preconditions specify when the relation can apply
– Planning starts from a top-level goal, e.g., define-project(X)
• Computationally expensive and require a lot of knowledge – problem for real-world systems
![Page 103: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/103.jpg)
103(109)
Schema-Based Approaches
• Capture typical text structuring patterns in templates (derived from corpus), e.g., [McKeown 85]
• Typically implemented as RTN
• Variety comes from different available knowledge for each entity
• Reusable ones available: Exemplars
• Example:
Describe-Project-Schema -> Sequence([duration], ProjParticipants-Schema)
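The example schema can be mimicked with plain functions that expand into a nested plan. This is a simplified sketch, not an RTN and not the Exemplars framework; the dictionary keys and function names are assumptions.

```python
def proj_participants_schema(project):
    """ProjParticipants-Schema: a LIST of one predicate per participant."""
    return ("LIST", [
        {"project": project["name"], "participant": p}
        for p in project["participants"]
    ])


def describe_project_schema(project):
    """Describe-Project-Schema -> Sequence([duration], ProjParticipants-Schema)."""
    duration = {
        "project": project["name"],
        "duration": f'{project["duration_years"]} yrs',
    }
    return ("SEQUENCE", [duration, proj_participants_schema(project)])
```

Calling `describe_project_schema` on different projects yields differently sized plans, which is where the variety mentioned above comes from: the schema is fixed, the available knowledge per entity is not.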
![Page 104: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/104.jpg)
104(109)
Sentence Aggregation
• Determine which predicates should be grouped together in sentences
• Less understood process
• Default: each predicate can be expressed as a sentence, so optional step
• SPOT: trainable planner
• Example:
AKT is a 6-year project with 5 participants:
• Sheffield (URL)
• OU …
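A very simple aggregation strategy is to merge predicates that share a subject into one sentence. The sketch below assumes predicates are already lexicalised as (subject, verb phrase) pairs; it is a toy rule, unrelated to the trainable SPOT planner.

```python
def aggregate(predicates):
    """Group (subject, phrase) pairs by subject and emit one sentence per subject."""
    grouped = {}
    for subject, phrase in predicates:
        grouped.setdefault(subject, []).append(phrase)

    sentences = []
    for subject, phrases in grouped.items():
        if len(phrases) == 1:
            body = phrases[0]
        else:
            # Join all but the last phrase with commas, then add "and".
            body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
        sentences.append(f"{subject} {body}.")
    return sentences
```

Skipping this step would give the default behaviour from the slide, one sentence per predicate.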
![Page 105: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/105.jpg)
105(109)
Lexicalisation
• Choosing words and phrases to express the concepts and relations in predicates
• Trivial solution: 1-1 mapping between concepts/relations and lexical entries
• Variation is useful to avoid repetitiveness and also convey pragmatic distinctions (e.g. formality)
![Page 106: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/106.jpg)
106(109)
Referring Expression Generation
• Choose pronouns/phrases to refer to the entities in the text
• Example: he vs Mr Smith vs John Smith, the president of XXX Corp.
• Depends on what is previously said
– He is only appropriate if the person is already introduced in the text
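The dependence on what was previously said can be sketched with a set of already-introduced entities. The entity fields are hypothetical, and a realistic generator would also check for ambiguity and distance from the last mention.

```python
def refer(entity, introduced):
    """Return a full name on first mention and a pronoun afterwards.

    entity: dict with "full_name" and "pronoun" keys (assumed structure).
    introduced: set of full names already mentioned; updated in place.
    """
    if entity["full_name"] in introduced:
        return entity["pronoun"]
    introduced.add(entity["full_name"])
    return entity["full_name"]
```

For example, two successive calls for the same person yield "John Smith" and then "he".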
![Page 107: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/107.jpg)
107(109)
Linguistic Realisation
• Use grammar to generate text which is grammatical, i.e., syntactically and morphologically correct
• Domain-independent
• Reusable components are available – e.g., RealPro, FUF/SURGE
• Example:
– Morphology: participant -> participants
– Syntactic agreement: AKT starts on …
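The two examples can be reproduced with deliberately naive rules that only cover regular English nouns and verbs. This is a toy sketch and bears no relation to how RealPro or FUF/SURGE work; the function names are made up.

```python
def pluralise(noun, count):
    """Regular English plural: add "s" unless there is exactly one."""
    return noun if count == 1 else noun + "s"


def agree(verb_stem, subject_is_plural):
    """Present-tense agreement for regular verbs with a third-person subject."""
    return verb_stem if subject_is_plural else verb_stem + "s"


def realise_start(project, date):
    """Realise a 'project starts on date' predicate as a sentence."""
    return f"{project} {agree('start', subject_is_plural=False)} on {date}."
```

Because such rules depend only on the language and not on the domain, realisation components can be reused across applications.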
![Page 108: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/108.jpg)
108(109)
A GATE-based generator
• Input
– The MIAKT ontology
– The RDF file for the given case
– The MIAKT lexicon
• Output
– GATE document with the generated text
![Page 109: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/109.jpg)
109(109)
Lexicalising Concepts and Instances
![Page 110: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/110.jpg)
110(109)
Example RDF Input
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  <NS2:has_age>68</NS2:has_age>
  <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  <NS2:has_date>22 9 1995</NS2:has_date>
  <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Abnormality'/>
  <NS2:is_finding rdf:resource='c:\breast_cancer_ontology.daml#mass_01401_right_cc_abnor_1'/>
  <NS2:has_morph_feature rdf:resource='c:\breast_cancer_ontology.daml#shape_mammo_round'/>
  <NS2:has_morph_feature rdf:resource='c:\breast_cancer_ontology.daml#margin_mammo_microlobulated'/>
  <NS2:has_overall_impression rdf:resource='c:\breast_cancer_ontology.daml#assessment_probably_malignant'/>
</rdf:Description>
![Page 111: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/111.jpg)
111(109)
CASE0140.RDF
The 68 years old patient is involved in a triple assessment procedure. The triple assessment procedure contains a mammography exam. The mammography exam is carried out on the patient on 22 9 1995. The mammography exam produced a right CC image. The right CC image contains an abnormality and it has a right lateral side and a craniocaudal view. The abnormality has a mass, a microlobulated margin , a round shape, and a probably malignant assessment.
![Page 112: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/112.jpg)
112(109)
Further Reading on IE for SemWeb
• Requirements for Information Extraction for Knowledge Management. http://nlp.shef.ac.uk/dot.kom/publications.html
• Information Extraction as a Semantic Web Technology: Requirements and Promises. Adaptive Text Extraction and Mining workshop, 2003.
• A. Kiryakov, B. Popov, et al. Semantic Annotation, Indexing, and Retrieval. 2nd International Semantic Web Conference (ISWC2003). http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
• S. Handschuh, S. Staab, R. Volz. On Deep Annotation. WWW’03. http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
• S. Dill, N. Eiron, et al. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. WWW’03. http://www.tomkinshome.com/papers/2Web/semtag.pdf
• E. Motta, M. Vargas-Vera, et al. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02). http://www.aktors.org/publications/selected-papers/06.pdf
• K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, M. Dimitrov. Semantic Web Enabled, Open Source Language Technology. Language Technology and the Semantic Web, Workshop on NLP and XML (NLPXML-2003). http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
• Handschuh, Staab, Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata (2002). http://citeseer.nj.nec.com/529793.html
![Page 113: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/113.jpg)
113(109)
Further Reading on “traditional” IE
• [Day et al’97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP’97). 1997.
• [Ciravegna’02] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks: User-System Cooperation in Document Annotation based on Information Extraction. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), (EKAW02), 2002.
• N. Kushmerick, B. Thomas. Adaptive information extraction: Core technologies for information agents (2002). http://citeseer.nj.nec.com/kushmerick02adaptive.html
• H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
• D.Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
• Califf and Mooney: Relational Learning of Pattern Matching Rules for Information Extraction http://citeseer.nj.nec.com/6804.html
• Borthwick, A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999
• Bikel D., Schwartz R., Weischedel R. An algorithm that learns what’s in a name. Machine Learning 34, pp.211-231, 1999
• Riloff, E. (1996) "Automatically Generating Extraction Patterns from Untagged Text" Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96) , 1996, pp. 1044-1049. http://www.cs.utah.edu/%7Eriloff/psfiles/aaai96.pdf
• Daelemans W. and Hoste V. Evaluation of Machine Learning Methods for Natural Language Processing Tasks. In LREC 2002 Third International Conference on Language Resources and Evaluation, pages 755–760
![Page 114: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/114.jpg)
114(109)
Further Reading on “traditional” IE
• Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
• Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999
• Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002
• Gotoh Y., Renals S. Information extraction from broadcast news, Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
• Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
• Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwlTM Extractor System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
• McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B.Boguraev and J. Pustejovsky editors: Corpus Processing for Lexical Acquisition. Pages21-39. MIT Press. Cambridge, MA. 1996
• Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998
• Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998
![Page 115: Metadata Extraction: Human Language Technology and the Semantic Web](https://reader030.vdocuments.us/reader030/viewer/2022013101/56815147550346895dbf689f/html5/thumbnails/115.jpg)
115(109)
Further Reading on multilingual IE
• Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31- April 3, 1997.
• Sekine S., Grishman R. and Shinou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998
• Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In proceeding of the 19th International Conference on Computational Linguistics (COLING2002), pp.967-973, 2002.
• Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002
• D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
• M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
• D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.
Further Reading on multilingual IE
• H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction; the MUMIS project. Data and Knowledge Engineering, 2003.
• D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
• H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
• H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
• D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
• K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.
THANK YOU!
The slides: http://gate.ac.uk/sale/talks/sekt-tutorial.ppt